In this hands-on lecture, we'll discuss one technique for sentiment classification, logistic regression, using the LingPipe library. Unlike the CoreNLP and SentiWordNet sentiment classifiers, LingPipe does not provide a ready-made model. So in this lecture we'll use a sentiment dataset to train a logistic regression model and then use the trained model to classify sentiment.

So, let's do this. As I briefly mentioned, we first need to train the classifier, and in order to train the classifier we need answer sets: some data pre-identified as negative and, since this is a binary classifier, other data pre-identified as positive. For movie review data, that means each review has been identified as negative or positive; positive reviews go under one folder and negative reviews under another. This information is available under the Data folder. There are a number of review files in there, and if you open one of them, the review text is right there. These were manually classified by experts: if they assigned a positive score to a review, that means the tokens, the features in there, lean toward positive expressions about the movie.

What we're going to do is take that collected information on movie reviews, negative or positive, read it all in, and use it to train the model. To train the model, you basically create the SentimentLingPipeModel class. If you go there, what it does is take those positive and negative directories and pass their contents to a classifier, a DynamicLMClassifier, and then train the classifier on the correct, pre-identified answer sets. After that you simply create sentiment.model: you serialize your classifier into the sentiment.model file and close the file.

Once this is finished, you simply load that model into memory, and then for the sample data, five or so sentences, you predict the polarity of each sentence with the trained logistic regression classifier.

So, let's execute this. It may take a while because of the training phase. Please remember that this is just a demonstration of how a sentiment classifier works. In a real-world scenario, or in your final project, I expect you will face a much bigger dataset. That means it takes more time; sometimes you run into heap-size memory errors or several other problems. So what you'll probably need to do is either increase the JVM heap size or reduce the feature space. What I mean by reducing the feature space is, in terms of the document-term matrix, you reduce the number of columns. How do you reduce the number of columns? You apply stemming, you remove words that occur infrequently (say, a minimum term occurrence of two or five), and you eliminate very commonly occurring terms. By doing that, you keep only the meaningful, important terms, where importance means the term has high discriminative value. That reduces the feature space, which means training takes much less time.

As you see here, document zero's actual label is Sports. I'm sorry, I was running the wrong program.
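Before we switch to the right program, here is a minimal sketch of the training and serialization step described above. It is not the lecture's exact code: the folder layout (Data/neg, Data/pos), the n-gram size, and the class name are my assumptions; the lecture names DynamicLMClassifier as the trained classifier, so that is what the sketch uses.

```java
import java.io.File;
import java.io.IOException;

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Files;

public class TrainSentimentModelSketch {

    public static void main(String[] args) throws IOException {
        // Assumed layout: Data/neg/*.txt and Data/pos/*.txt hold the pre-labeled reviews.
        String[] categories = { "neg", "pos" };
        int nGram = 8; // character n-gram length; an assumption, not the lecture's value

        DynamicLMClassifier<NGramProcessLM> classifier =
                DynamicLMClassifier.createNGramProcess(categories, nGram);

        // Read every review in each labeled folder and feed it to the classifier
        // as a training instance for that category.
        for (String category : categories) {
            File dir = new File("Data", category);      // hypothetical path
            for (File reviewFile : dir.listFiles()) {
                if (reviewFile.isDirectory())           // skip nested directories
                    continue;
                String text = Files.readFromFile(reviewFile, "UTF-8");
                classifier.handle(new Classified<CharSequence>(
                        text, new Classification(category)));
            }
        }

        // Serialize (compile) the trained classifier to disk as sentiment.model.
        AbstractExternalizable.compileTo(classifier, new File("sentiment.model"));
    }
}
```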
We must run SentimentLingPipeModel; I ran the previous one, which is document classification. So let's run this again. It's very simple and somewhat similar to document classification based on a LingPipe classifier. As I said before, it takes time because of the training phase, so while it's executing, let me explain more about what's happening inside the code. This directory, neg, is the sub-directory holding the negative movie reviews. What you do is simply loop through all the files there; if an entry is in fact a directory, you just skip it, and if it's not, you open the file and add its contents to a collection object. The positive reviews go through the same procedure as the negative ones.

Okay, what prints out in this console is each sentence and its score. So let's close this and look at the result. Take this sentence, "plays him just fine": for this sentence, the sentiment is 1. Since this is a binary classifier, 0.0 means negative and 1.0 means positive, so for this sentence LingPipe's logistic-regression-based sentence classifier predicts a positive, and so on and so forth.

Compared to this, and depending on the size of the data, Stanford CoreNLP performs better, simply because the training data for this logistic regression model is a very small dataset. If you have more data, and higher-quality data, then this classifier performs better too; the performance improves significantly.
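For completeness, here is a small sketch of the prediction side: loading sentiment.model back into memory and scoring a few sample sentences, as described above. The sample sentences and the output format are my own assumptions, and I print LingPipe's category name where the lecture's console output shows 0.0/1.0.

```java
import java.io.File;
import java.io.IOException;

import com.aliasi.classify.Classification;
import com.aliasi.classify.LMClassifier;
import com.aliasi.util.AbstractExternalizable;

public class ClassifySentimentSketch {

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Load the compiled classifier back into memory.
        LMClassifier<?, ?> classifier = (LMClassifier<?, ?>)
                AbstractExternalizable.readObject(new File("sentiment.model"));

        // Hypothetical sample sentences; the lecture uses its own sample data.
        String[] sentences = {
                "plays him just fine",
                "the plot was dull and the acting was worse"
        };

        for (String sentence : sentences) {
            Classification c = classifier.classify(sentence);
            // bestCategory() returns "pos" or "neg"; the lecture code apparently
            // maps these to 1.0 and 0.0 before printing.
            System.out.println(sentence + " -> " + c.bestCategory());
        }
    }
}
```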