This lecture is about sentiment classification. If we assume that most of the elements in the opinion representation are already known, then our only task may be sentiment classification, as shown in this case. So suppose we know who the opinion holder is and what the opinion target is, and we also know the content and the context of the opinion; then we mainly need to decide the sentiment of the review. So this is a case of using just sentiment classification to understand an opinion.

Sentiment classification can be defined more specifically as follows. The input is an opinionated text object, and the output is typically a sentiment label, or sentiment tag, which can be designed in two ways. One is polarity analysis, where we have categories such as positive, negative, or neutral. The other is emotion analysis, which can go beyond polarity to characterize the feeling of the opinion holder. In the case of polarity analysis, we sometimes also have numerical ratings, as you often see in reviews on the web; five might denote the most positive and one the most negative, for example. In general, you have discrete categories to characterize the sentiment. In emotion analysis, there are of course also different ways to design the categories. The six most frequently used categories are happy, sad, fearful, angry, surprised, and disgusted.

So as you can see, the task is essentially a classification task, or categorization task, as we've seen before, so it's a special case of text categorization. This also means any text categorization method can be used to do sentiment classification. Now of course, if you just do that, the accuracy may not be good, because sentiment classification does require some improvements over a regular, or simple, text categorization technique. In particular, it needs two kinds of improvements. One is to use more sophisticated features that may be more appropriate for sentiment tagging, as I will discuss in a moment. The other is to consider the order of the categories: in polarity analysis especially, there is clearly an order, so these categories are not really independent, and it's useful to take that order into account. For example, we could use ordinal regression to do that, and that's something we'll talk more about later.

So now, let's talk about some features that are often very useful for text categorization and text mining in general, some of which are especially needed for sentiment analysis. Let's start from the simplest one, which is character n-grams. You can just take a sequence of characters as a unit, and you can mix different n's, that is, different lengths. This is a very general and very robust way to represent text data, and you can do it for pretty much any language. It is also robust to spelling errors or recognition errors: if you misspell a word by one character, this representation would still allow you to match that word when it occurs correctly in the text, because the misspelled word and the correct form share some common character n-grams. But of course such a representation is not as discriminative as words.
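To make this concrete, here is a minimal sketch of treating polarity classification as ordinary text categorization with character n-gram features. This uses scikit-learn, which is an assumption of mine rather than something from the lecture, and the tiny training texts and labels are made up purely for illustration.

```python
# A minimal sketch: sentiment classification as plain text categorization
# over character n-gram features (scikit-learn assumed; toy data made up).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "great phone, excellent battery life",
    "terrible service, very disappointing",
    "absolutely loved it",
    "worst purchase I have ever made",
]
train_labels = ["positive", "negative", "positive", "negative"]

# Character n-grams of length 2-4: language-independent and robust to small
# spelling errors, since a misspelled word still shares most of its n-grams
# with the correct form.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# "excellnt" is misspelled, but its character n-grams still overlap "excellent".
print(model.predict(["excellnt battery, realy great"]))
```

Any other classifier could be dropped in place of the logistic regression here; the point is only that the character n-gram representation survives the misspelling.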
Next, we have word n-grams: a sequence of words, and again we can mix different n's. Unigrams are actually often very effective for a lot of text processing tasks, mostly because words are well-designed features, devised by humans for communication, so they are often good enough for many tasks. But they are clearly not sufficient for sentiment analysis. For example, we might see a sentence like "it's not good" or "it's not as good as something else." In such a case, if you just take the unigram "good," that would suggest positive sentiment, which is not accurate. But if you take the bigram "not good" together, then it's more accurate. So longer n-grams are generally more discriminative and more specific; if you match one, it says a lot, and it's unlikely to be ambiguous. But they may cause overfitting, because with such very unique features a machine learning program can easily pick them up from the training set and rely on them to distinguish the categories. Obviously, that kind of classifier won't generalize well to future data, where such discriminative features will not necessarily occur. That's the problem of overfitting, which is not desirable.

We can also consider part-of-speech tag n-grams, if we can do part-of-speech tagging; for example, an adjective followed by a noun could form a pair. We can also mix n-grams of words and n-grams of part-of-speech tags. For example, the word "great" followed by a noun could become a hybrid feature that is useful for sentiment analysis (a small sketch of such features appears at the end of this passage).

Next, we can also have word classes. These classes can be syntactic, like part-of-speech tags, or semantic, representing concepts in a thesaurus or ontology, like WordNet. Or they can be recognized named entities, like people or places, and these categories can be used to enrich the representation as additional features. We can also learn word clusters empirically; for example, we've talked about mining word associations, so we can have clusters of paradigmatically or syntagmatically related words, and these clusters can serve as features to supplement the word-based representation.

Furthermore, we can also use frequent patterns as features. These could be frequent word sets, where the words that form the pattern do not necessarily occur together or next to each other, or collocations, where the words occur more closely together. Such patterns provide more discriminative features than words, obviously, and they may also generalize better than arbitrary long n-grams because they are frequent, so you expect them to occur in the test data as well. So they have a lot of advantages, but they might still face the problem of overfitting as the features become more complex. This is a problem in general, and the same is true for parse tree-based features, where you can use a parse tree to derive features such as frequent subtrees or paths; those are even more discriminative, but they are also more likely to cause overfitting. In general, pattern discovery algorithms are very useful for feature construction, because they allow us to search a large space of possible features that are more complex than words and sometimes useful.
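As a rough illustration of the mixed word and part-of-speech features mentioned above, here is a small sketch in plain Python. The part-of-speech tags are assumed to come from some tagger and are written by hand here, and the W:/P:/M: feature-naming prefixes are just an invented convention for the example.

```python
# A small sketch of generating word n-grams, POS n-grams, and mixed word/POS
# n-grams from an already-tagged sentence (tags hand-written for illustration).
from typing import List, Tuple

def hybrid_ngram_features(tagged: List[Tuple[str, str]], n: int = 2) -> List[str]:
    """Return word n-grams, POS-tag n-grams, and mixed word/POS n-grams."""
    features = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        words = [w for w, _ in window]
        tags = [t for _, t in window]
        features.append("W:" + "_".join(words))   # e.g. "W:not_good"
        features.append("P:" + "_".join(tags))    # e.g. "P:RB_JJ"
        # Mixed grams: replace each position in turn with its POS tag, so
        # "great camera" would also yield a feature like "great_NN".
        for j in range(n):
            mixed = [tags[k] if k == j else words[k] for k in range(n)]
            features.append("M:" + "_".join(mixed))
    return features

tagged_sentence = [("this", "DT"), ("is", "VBZ"), ("not", "RB"), ("good", "JJ")]
print(hybrid_ngram_features(tagged_sentence, n=2))
```

Features like "M:not_JJ" generalize "not good" to "not bad", "not great", and so on, which is exactly the kind of generalization the hybrid features are meant to buy.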
So in general, natural language processing is very important for deriving complex features that enrich the text representation. For example, consider the simple sentence I showed you a long time ago in another lecture. From the words alone we can only derive simple word n-gram or character n-gram representations. But with NLP, we can enrich the representation with a lot of other information, such as part-of-speech tags, parse trees, entities, or even speech acts. With such enriching information, we can then generate a lot of other, more complex features, like mixed grams of words and part-of-speech tags, or even parts of a parse tree.

In general, feature design affects categorization accuracy significantly, and it's a very important part of any machine learning application. I think it is most effective to combine machine learning, error analysis, and domain knowledge when designing features. First, you want to use domain knowledge, your understanding of the problem, to design seed features, and you can also define a basic feature space with a lot of possible features for the machine learning program to work on. Machine learning can then be applied to select the most effective features or to construct new features; that's feature learning. These features can then be further analyzed by humans through error analysis: you look at the categorization errors and analyze which features could help you recover from those errors, or which features caused overfitting and led to those errors. This leads to feature revision, where you revise the feature set, and then you can iterate; you might also consider using a different feature space.

So NLP enriches the text representation, as I just said, and because it enriches the feature space, it allows a much larger space of possible features, many of which can be very useful for a lot of tasks. But be careful not to use too many complex features, because they can cause overfitting, or otherwise you would have to train carefully to keep overfitting from happening.

So a main challenge in feature design, a common challenge, is to optimize the trade-off between exhaustivity and specificity, and this trade-off turns out to be very difficult. Exhaustivity means we want the features to have high coverage of a lot of documents, so in that sense you want the features to be frequent. Specificity requires the features to be discriminative, and naturally, infrequent features tend to be more discriminative. So this really causes a trade-off between frequent and infrequent features, and that's why feature design is usually an art. It's probably the most important part of applying machine learning to any problem, and in particular, in our case, to text categorization, or more specifically, sentiment classification.
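Since the exhaustivity/specificity trade-off is framed here as the central difficulty of feature design, here is a minimal sketch of one common way to manage it. This is a scikit-learn based illustration with made-up data, not a method prescribed in the lecture: a minimum document-frequency threshold keeps only features with reasonable coverage, and a chi-square score then keeps the most discriminative of the survivors.

```python
# A minimal sketch of balancing exhaustivity (coverage) and specificity
# (discriminativeness): a document-frequency floor plus chi-square selection.
# scikit-learn assumed; the documents and labels below are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "not good at all, very disappointing",
    "really good value, works great",
    "great screen and great battery",
    "not good quality, not worth the price",
]
labels = ["negative", "positive", "positive", "negative"]

# min_df=2 keeps features exhaustive enough (occur in at least 2 documents);
# ngram_range=(1, 2) adds bigrams such as "not good" for extra specificity.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(docs)

# Chi-square scoring then favors the most discriminative of those features.
selector = SelectKBest(chi2, k=min(3, X.shape[1]))
selector.fit(X, labels)
selected = [
    name
    for name, keep in zip(vectorizer.get_feature_names_out(), selector.get_support())
    if keep
]
print(selected)
```

Raising min_df pushes toward exhaustivity and away from overfitting; lowering it (or increasing n) pushes toward specificity, which is exactly the tension described above.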