Okay, this second exercise will introduce you to the main functions that yTextMiner provides, including topic modeling, sentiment analysis, text preprocessing, and many others. First, as we did before, let's open up Eclipse by double-clicking on the Eclipse icon. If the workspace is fine, click on the OK button. So, assuming that you have yTextMiner in your workspace and you see yTextMiner on the left-hand side of the IDE, let's expand yTextMiner. Let me now explain each one of its components. The first is lib, which is the third-party libraries folder. In our case I call it lib, and it has all kinds of third-party libraries required to run yTextMiner properly.
The first, and a very important, third-party library is Stanford CoreNLP, version 3.6. We have two jar files related to Stanford CoreNLP: the first is the API jar, the second is the models jar. The models jar is a huge file; it includes sentiment analysis, document classification, dependency parsing, all kinds of supervised-learning-related models. In order to run Stanford CoreNLP you need both of them. Version 3.6 is compiled with Java 1.8. Another one is LingPipe 4.1. LingPipe, from Alias-i, is required to do sentiment analysis. LIBSVM gives you linear SVMs, and you need it for document classification. The mallet-deps.jar and mallet.jar files are required to run topic modeling, as I demonstrated in the last exercise. Next, Twitter: if you want to use Twitter's APIs and avoid raw JSON output, you should use the Twitter4J jar files. With those two Twitter4J jar files you simply get the expected elements, like user information, their friends' information, their tweets, and so on and so forth, without any cumbersome JSON parsing. And Jsoup is there for parsing HTML. I'm going to explain the other dependency jar files and libraries throughout this text mining course, whenever I believe you need to know a very particular type of library. For instance, in the text preprocessing stage, if you want to know about the Porter algorithm, stemmers, or other text preprocessing techniques, I'm going to refer to those jar files in lib.

Let me move on to the data folder. The data folder, let me expand this, has two subfolders: one is Corpus, the other is Util. The Util one is needed for running supervised document classification or sentiment analysis; if you need dictionary-based sentiment analysis, you need SentiWordNet there. Under Corpus, you have seven files. The first three are from GitHub: if you simply use the GitHub APIs, this GitHub JSON .txt file has the accumulated returned results in it. Since it's a big file, it takes time. As you see, this is a JSON-format result, so later on we're going to parse this JSON file and extract the related important information. The same goes for the New York Times news articles, which are in JSON format. And twitter_stream is not in JSON format, because I used the Twitter4J jar files. As I said before, if you use the Twitter4J jar files, you avoid a cumbersome amount of parsing Twitter's JSON format.

Okay, now let me move on to the source code. The source code has several packages. The first package is classification. The second package, related to classification, is the feature set package. The third one is the preprocessing package. The fourth one is the sentiment analysis-related package. The fifth one is the data collection-related package, for calling APIs and collecting data. The sixth one is the main package; you need these main objects or classes to execute the programs. The seventh one is the topic modeling-related package. The last one holds the utility classes.

In the utility package, there are several utility classes. Most of them are related to the document units, or analysis units, that yTextMiner has: Collection, Document, Sentence, and Token. If you remember the previous lecture note, those are the structures of the analysis units.

The preprocessing package is based, first, on the CoreNLP preprocessing module and, second, on a Porter stemming module. So let me briefly talk about the CoreNLP preprocess object. It has, first, an initialize stage. The initialize stage, particularly in this case, handles stop words: we want to remove stop words, so a stop-words file must be provided, and it will later be applied to tokens if you want to exclude particular tokens. A short sketch of this kind of preprocessing follows below.
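To make that concrete, here is a minimal sketch, assuming Stanford CoreNLP 3.6 on the classpath, of tokenizing a raw sentence and filtering stop words in the spirit of the preprocess object just described. The class name, the inline stop-word list, and the sample sentence are illustrative assumptions, not yTextMiner's actual preprocessing code.

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Properties;
    import java.util.Set;

    public class PreprocessSketch {
        public static void main(String[] args) {
            // Initialize stage: configure the CoreNLP pipeline (CoreNLP 3.6, Java 1.8).
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            // Stop words would normally be loaded from the provided stop-words file;
            // a tiny inline set is used here for illustration.
            Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "an", "of", "is"));

            // Preprocess stage: CoreNLP parses the raw sentence into tokens.
            Annotation annotation = new Annotation("Text mining is the analysis of large text collections.");
            pipeline.annotate(annotation);

            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    String word = token.word().toLowerCase();
                    if (stopWords.contains(word)) {
                        continue; // exclude stop-word tokens, as described above
                    }
                    System.out.println(word + "\t" + token.lemma() + "\t" + token.tag());
                }
            }
        }
    }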
For preprocess itself, there are two different kinds of functions. The basic preprocess function works like this: you pass a raw sentence, the Stanford CoreNLP package kicks in, parses the raw sentence, and creates so-called Token objects. There are other overloaded versions of preprocess as well, and this is a critical method: in one case, if you have already constructed a Token, you pass the Token to preprocess. If you have a Sentence, which is a sentence unit, not raw data, you can pass that. If you have a Document, you can pass the Document to preprocess. And if you have a collection-level analysis unit, you can pass the Collection object to preprocess, and it will process the collection all the way down to the token level.

Okay. Then, the sentiment package. For the sentiment package, I explained the three approaches in the lecture note. The first one is the recursive-neural-network-based sentiment analysis from Stanford CoreNLP. The second is LingPipe sentiment analysis. The third one is dictionary-based. For the dictionary-based approach, as you can assume, you need to pass in a dictionary. So first, you initialize the dictionary: this is where you pass the sentiment dictionary file as an argument to the function, and it then goes through this logic to create a dictionary object. The dictionary object basically consists of a trie data structure; a minimal trie sketch appears at the end of this exercise.

The last two packages are the topic model and classification packages. The topic model is based on Mallet, a topic modeling package. The basic one is LDA; the advanced one is DMR. So let me talk about Mallet LDA. Mallet LDA is Latent Dirichlet Allocation, developed by the UMass Amherst text mining group; Mallet is an excellent text mining package. This Mallet LDA is called by the topic modeling main class, in Java, in your main package; a minimal Mallet LDA sketch also appears at the end of this exercise. I'm going to explain this more in depth in week four or five, when I cover topic modeling.

Classification. There are three different alternatives: one is linear SVMs, the second one is logistic regression, and the third one is a naive Bayes classifier approach. So we are going to use those three approaches for document classification, three approaches for sentiment analysis, and two topic modeling approaches; a short LIBSVM sketch for the linear SVM case appears at the end of this exercise as well. So basically, I provide simple data and simple main functions, but what you need to do is modify those main functions and utilize those utility files and classes to complete your project. We're going to talk about the project next week, but please make sure that you review the classes provided in yTextMiner, try to understand them, and extend or modify them to suit your own needs for conducting your project. I'm going to see you next week. Thank you.
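As referenced above, here is a minimal sketch of the trie data structure behind a dictionary-based sentiment lookup. The class names and the sample polarity scores are illustrative assumptions, not yTextMiner's actual dictionary code; in practice the entries would be parsed from a file such as SentiWordNet.

    import java.util.HashMap;
    import java.util.Map;

    public class SentimentTrie {
        // One node per character; a non-null score marks the end of a dictionary word.
        private static class Node {
            Map<Character, Node> children = new HashMap<>();
            Double score; // sentiment polarity, null if no word ends here
        }

        private final Node root = new Node();

        // Insert a sentiment word with its polarity score.
        public void insert(String word, double score) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.score = score;
        }

        // Look up a token; returns null if the token is not in the dictionary.
        public Double lookup(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.get(c);
                if (node == null) return null;
            }
            return node.score;
        }

        public static void main(String[] args) {
            SentimentTrie dict = new SentimentTrie();
            dict.insert("good", 0.7);   // illustrative scores
            dict.insert("bad", -0.6);
            System.out.println(dict.lookup("good")); // 0.7
            System.out.println(dict.lookup("okay")); // null (not in dictionary)
        }
    }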
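Next, the promised minimal Mallet LDA sketch, assuming mallet.jar and mallet-deps.jar on the classpath: build an InstanceList through a token-to-feature pipe, then estimate a ParallelTopicModel. The tiny inline documents, topic count, and hyperparameter values are illustrative assumptions; yTextMiner's own topic modeling main class will differ in its details.

    import cc.mallet.pipe.CharSequence2TokenSequence;
    import cc.mallet.pipe.CharSequenceLowercase;
    import cc.mallet.pipe.Pipe;
    import cc.mallet.pipe.SerialPipes;
    import cc.mallet.pipe.TokenSequence2FeatureSequence;
    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.types.Instance;
    import cc.mallet.types.InstanceList;
    import java.util.ArrayList;
    import java.util.regex.Pattern;

    public class LdaSketch {
        public static void main(String[] args) throws Exception {
            // Pipe raw text into Mallet's feature-sequence representation.
            ArrayList<Pipe> pipes = new ArrayList<>();
            pipes.add(new CharSequenceLowercase());
            pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
            pipes.add(new TokenSequence2FeatureSequence());
            InstanceList instances = new InstanceList(new SerialPipes(pipes));

            // Tiny illustrative corpus; in practice you would load the Corpus files.
            String[] docs = {
                "topic models discover themes in document collections",
                "sentiment analysis classifies opinion in text",
                "text mining combines preprocessing classification and topic modeling"
            };
            for (int i = 0; i < docs.length; i++) {
                instances.addThruPipe(new Instance(docs[i], null, "doc" + i, null));
            }

            // LDA with 5 topics; 1.0 and 0.01 are common alphaSum/beta defaults.
            ParallelTopicModel lda = new ParallelTopicModel(5, 1.0, 0.01);
            lda.addInstances(instances);
            lda.setNumThreads(2);
            lda.setNumIterations(200);
            lda.estimate();

            // Print the top words per topic.
            Object[][] topWords = lda.getTopWords(5);
            for (int t = 0; t < topWords.length; t++) {
                StringBuilder sb = new StringBuilder("Topic " + t + ":");
                for (Object w : topWords[t]) sb.append(' ').append(w);
                System.out.println(sb);
            }
        }
    }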
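Finally, a minimal LIBSVM sketch for the linear SVM document classification mentioned above. The hand-built sparse feature vectors and parameter values are illustrative assumptions; in yTextMiner, the feature set package would produce the vectors from real documents.

    import libsvm.svm;
    import libsvm.svm_model;
    import libsvm.svm_node;
    import libsvm.svm_parameter;
    import libsvm.svm_problem;

    public class SvmSketch {
        // Build a sparse LIBSVM feature vector from parallel index/value arrays.
        private static svm_node[] vector(int[] idx, double[] val) {
            svm_node[] nodes = new svm_node[idx.length];
            for (int i = 0; i < idx.length; i++) {
                nodes[i] = new svm_node();
                nodes[i].index = idx[i];
                nodes[i].value = val[i];
            }
            return nodes;
        }

        public static void main(String[] args) {
            // Two tiny training documents with labels +1 and -1 (illustrative features).
            svm_problem prob = new svm_problem();
            prob.l = 2;
            prob.y = new double[] {1.0, -1.0};
            prob.x = new svm_node[][] {
                vector(new int[] {1, 3}, new double[] {0.8, 0.2}),
                vector(new int[] {2, 4}, new double[] {0.9, 0.4})
            };

            // Linear-kernel C-SVC, as used for document classification.
            svm_parameter param = new svm_parameter();
            param.svm_type = svm_parameter.C_SVC;
            param.kernel_type = svm_parameter.LINEAR;
            param.C = 1.0;
            param.eps = 1e-3;
            param.cache_size = 100;

            svm_model model = svm.svm_train(prob, param);
            double predicted = svm.svm_predict(model, vector(new int[] {1}, new double[] {0.5}));
            System.out.println("Predicted label: " + predicted);
        }
    }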