This section introduces N-grams: what they look like and how you can use yTextMiner to extract N-grams from sentences. Let's look at the one Java class I created that relates to N-grams and many other preprocessing techniques. The class we need is Sentence.java, which is under edu.yonsei.util. Expand the package and you will find Sentence.java; double-click it to open the class. I explained this class briefly in the previous lab session, so now let's take a closer look.
There are several member variables. The first one, serialVersionUID, you don't have to pay attention to. What you do need to know is sentence, which is a String holding the raw text of the sentence. The parse tree is the parse tree form of the sentence in string representation; we'll discuss this later. The dependencies are the dependencies that were found in the sentence; we'll discuss these later too. SentimentCore is a double, the same as SentimentLing and SentimentWord. Those three variables hold the results of three sentiment analysis techniques: the first is Stanford CoreNLP, the second is LingPipe, and the third is the dictionary-based SentiWordNet. All of these are quite useful for later text mining applications, and we'll talk about them in later lectures and lab sessions.
Aside from the many get and set methods and the preprocess methods, we have one function called getNgrams. Let me explain it a little more. getNgrams returns a Java List whose element type is String. It takes one argument, the number of consecutive words: if it is 2 you get bigrams, if it is 3 you get trigrams. Depending on the N-gram size, we loop through the sentence; the size here means the sentence size.
Then, for each position in the sentence, we go up to the N-gram size and build the N-gram; the += sign means append. So for each N-gram, we append the consecutive words corresponding to the N-gram size. At this point we use getLemma instead of getToken, so the lemma form is appended to the N-gram. Then, in case there is any trailing white space, we use the String API's trim function, add the N-gram to an ArrayList, and finally return the ArrayList. Each sentence has a different size, so the size of the returned list is determined by the size of the sentence and the size of the N-grams.
So let's create NGramMain.java. In the previous session we created NormalizationMain.java; we create NGramMain.java in the same manner. It goes under the Main package, so select the Main package > New > Class, and enter NGramMain as the name. As before, I have already created NGramMain, so I cannot create it again for you, but you should not have any problem: just click the Finish button to make NGramMain.java. Once you click Finish, you will see NGramMain.java under the Main package.
The code is very simple. The only difference between this and the previous NormalizationMain is the call to getNgrams. Other than that, you instantiate a Scanner object to open and read the file, call nextLine for each line, create a Sentence object, and call the preprocess function to do the preprocessing stage, the same as in NormalizationMain.java. This time we use the NY Times news articles, a TXT file under data/corpus. The first line of the file is a URL, and you don't want to get N-grams out of it, so you will want to skip the first line.
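To make the getNgrams walk-through above concrete, here is a minimal sketch of the same idea. It is not the actual yTextMiner code: the class name NGramSketch is hypothetical, and the lemmas are passed in as a plain list, whereas the real method reads them from the sentence's tokens with getLemma.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramSketch {

    // Hypothetical stand-in for Sentence.getNgrams. The lemmas are assumed
    // to be already extracted; n is the number of consecutive words.
    public static List<String> getNgrams(List<String> lemmas, int n) {
        List<String> ngrams = new ArrayList<>();
        if (n <= 0) {
            return ngrams; // nothing sensible to build
        }
        // Slide a window of n words across the sentence.
        for (int i = 0; i + n <= lemmas.size(); i++) {
            String ngram = "";
            for (int j = i; j < i + n; j++) {
                ngram += lemmas.get(j) + " "; // += appends the next word
            }
            ngrams.add(ngram.trim()); // trim removes the trailing space
        }
        return ngrams;
    }

    public static void main(String[] args) {
        List<String> lemmas =
            Arrays.asList("Washington", "stellar", "pitching", "keep", "Mets");
        // prints [Washington stellar, stellar pitching, pitching keep, keep Mets]
        System.out.println(getNgrams(lemmas, 2));
    }
}
```

A sentence with fewer words than n simply yields an empty list, which is why the size of the result depends on both the sentence size and the N-gram size.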
So you do nothing with the first line; from the next line on, you call nextLine on the Scanner object. If the corresponding text is empty, you break. Otherwise, you create a Sentence object by passing the raw text of each line and call preprocess. This is the same as in NormalizationMain.java, as I said. From this point on, you set the N-gram size to 2, which means you'll get bigrams for each line of text. When you print, you print out each sentence and then the bigrams for that sentence.
This is simple: you copy the code from NormalizationMain.java and skip the first line. The while loop runs forever until the break condition is met; the break condition here is that the text is empty. If it is empty the loop breaks, otherwise it keeps going. Assuming you have already coded those lines of Java, press Ctrl + S to save the file. If this is the first time you run it, right-click on NGramMain.java, select Run As (it's a very sensitive mouse), and then select Java Application.
As you see, the program prints out the logging information, and then, for each sentence, the bigrams. In some bigrams a token shows up as a question mark; that is because the lemma form is not there. Otherwise you see Washington stellar, stellar pitching, pitching keep, keep Mets, and so on and so forth. So basically we're getting N-grams, in this case bigrams, out of each sentence.
If you want to print out each N-gram separately, store the result in a List of String. The error you see here is because we didn't import List, which is under java.util, so import it. Then you simply loop through the list of strings. Let's call the loop variable ngram and the list results. (There's a spelling error; let me fix that.) Then, inside the loop, all you need to do is print ngram.
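The main loop just described can be sketched as follows. This is only an illustration under stated assumptions, not the actual NGramMain.java: the class name NGramMainSketch and its helper methods are hypothetical, the Scanner reads an in-memory sample instead of the corpus file, and plain whitespace splitting stands in for the Sentence object's preprocess and getNgrams calls, so no lemmatization happens here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class NGramMainSketch {

    // Stand-in for Sentence.preprocess + getNgrams: whitespace tokens only.
    public static List<String> ngramsOf(String text, int n) {
        String[] tokens = text.trim().split("\\s+");
        List<String> results = new ArrayList<>();
        for (int i = 0; n > 0 && i + n <= tokens.length; i++) {
            String ngram = "";
            for (int j = i; j < i + n; j++) {
                ngram += tokens[j] + " ";
            }
            results.add(ngram.trim());
        }
        return results;
    }

    // Skips the first line (the URL), then reads until an empty line.
    public static List<List<String>> run(Scanner scan, int n) {
        List<List<String>> all = new ArrayList<>();
        if (scan.hasNextLine()) {
            scan.nextLine(); // first line is the URL: do nothing with it
        }
        while (true) {
            if (!scan.hasNextLine()) {
                break; // end of input
            }
            String text = scan.nextLine();
            if (text.isEmpty()) {
                break; // the break condition from the walkthrough
            }
            List<String> results = ngramsOf(text, n);
            System.out.println(text);
            for (String ngram : results) {
                System.out.println("  " + ngram);
            }
            all.add(results);
        }
        return all;
    }

    public static void main(String[] args) {
        // Made-up sample input; the lab reads the corpus file instead.
        String sample = "http://example.com/article\n"
            + "Washington stellar pitching\n"
            + "\n"
            + "this line is never reached\n";
        run(new Scanner(sample), 2); // 2 means bigrams
    }
}
```

Changing the second argument of run from 2 to 3 gives trigrams, the same way the N-gram size is changed in the lab code.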
Now simply re-execute NGramMain. As you may recall, if you run into a heap size problem, that is, an out-of-memory error, you have to increase the memory available to the Java virtual machine by selecting Run As > Run Configurations. There is a VM arguments field, and there you set -Xmx to 1,300 or 1,500 megabytes, for example -Xmx1500m. By doing so, you will avoid the memory problem. As you see, each line now has its bigrams. If you want to increase the N-gram size, simply change 2 to 3, or 2 to 4, and so on. By doing this, you will get N-grams out of sentences.
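On the memory setting mentioned above: if you want to confirm that the VM argument took effect, a small check like the following can help. It is only a sketch; the class name HeapCheck is hypothetical and not part of yTextMiner. It uses the standard Runtime.maxMemory call, and the value reported may be somewhat lower than the number given to -Xmx.

```java
public class HeapCheck {

    // Maximum heap the JVM will attempt to use, in megabytes.
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with a VM argument such as -Xmx1500m to see the value change.
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```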