Hello there, you're still here. So you want more. All right. Well, welcome to the second course in our sequence here in the Master of Data Science at the University of Colorado Boulder. Again, I'm Chris Fargo, I'm an associate professor of advertising. I research text data from around the web using computational social science. In this course I'm going to highlight some of the computational methods that I love to use on text data. We'll apply these methods to marketing tasks and we'll see how we can reveal insights from data that we know nothing about.

Unsupervised machine learning, topic modeling, clustering of unstructured data: they all share this central idea of taking a large collection of data and trying to make sense of it. Let's say that you had 10 million tweets. If you wanted to organize these tweets into clusters or topics, topic modeling is the way to go. If you know nothing about the dataset that you're about to embark upon, topic modeling can help you make sense of it. It's also known as unsupervised machine learning. Remember from our previous lecture on machine learning, there are two types of machine learning: unsupervised and supervised. You now know a fair amount about supervised machine learning, but topic modeling by default is an unsupervised method. What we mean by that is we aren't telling the computer what the right answers are. There's nothing for the computer to train from. When we build a machine learning algorithm that is supervised, we show it evidence and we say, hey, try to recreate this evidence as best you can with what you've got. Over time the computer learns about the evidence and is able to recreate that prediction, right? But in unsupervised machine learning, we don't know what the answers are. We often don't know anything about our dataset because the data is just too big. And so we need to take a look at it.

So what is a topic model? Topic models are created using mathematical operations on the words as they appear (1) inside of documents and (2) across the collection of documents that you're studying. Commonly, topic models calculate the co-occurrence of words in the documents. We've been using this term co-occurrence quite a bit. If you recall from our previous lecture, co-occurrences are really common with pre-trained neural networks. Remember that these pre-trained models such as BERT kind of know the co-occurrences of words and when words tend to appear together. Topic models decipher that relationship for a specific corpus, or collection of documents.
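To make co-occurrence concrete, here's a minimal sketch in Python (my own toy example, not the internals of any particular topic model) that counts how often pairs of words show up in the same document:

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; in practice these would be tweets or survey responses.
documents = [
    "the wifi was slow and kept dropping",
    "seats were cramped and the wifi was slow",
    "great crew and a smooth flight",
]

pair_counts = Counter()
for doc in documents:
    # De-duplicate words so each pair counts at most once per document.
    words = sorted(set(doc.lower().split()))
    pair_counts.update(combinations(words, 2))

# Pairs that co-occur across documents hint at a shared theme.
for pair, count in pair_counts.most_common(5):
    print(pair, count)
```

A topic model does something far more sophisticated than raw pair counts, but this is the basic signal it starts from.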
So imagine you're Delta. You want to use topic modeling/unsupervised machine learning/clustering to make sense of survey responses. Well, I got this prompt on my phone when I was flying, which actually was a couple of years ago now. At the end of my Wi-Fi experience, this box invited me to share my thoughts about Delta. I immediately thought about the poor souls who would have to read this, right? Gosh, there's probably so many complaints in here, you can't even imagine. Then I thought, wait, no, there's probably no one reading this data. Right? Delta is a pretty lean organization and I bet there's not really anyone looking at it. What a depressing job that would be if you had to read these comments all day. Can you imagine the complaints? If not, you probably haven't used airplane Wi-Fi or been stuck on a tarmac for 2 hours. It's terrible. Also, it costs money to employ someone just to read these responses.

So corporations like Delta probably aren't going to spend that kind of money to really dive into the insights qualitatively. So maybe there isn't a person reading this stuff at all. We can use topic modeling to solve this problem. Let's imagine that approximately 20,000 people take the time to fill out the survey per month. We have 20,000 little documents. We don't want to read all 20,000 because, honestly, we might get post-traumatic stress disorder. Instead we want to know what people are commonly saying so that we can address the most common concerns. When I see something like this, the first thing I think about is what people might actually put in the box. Maybe people will commonly use this as a venue to complain about flight-related issues. Maybe people talk about the quality of the food, the sizes of the seats, so on and so forth. Maybe people will even complain about the airport, right? There's a lot of different things that people might put in this box because this prompt is vague. It's just saying, hey, tell us about your experience. Topic modeling helps us go quickly from 10,000 or 20,000 responses down to 5 or 6 key themes that we can use for improvement. So Delta can use topic modeling here to generate insights quickly. The idea is data in, key concerns out.

Whether topic modeling is right for us hinges on the size of the data. No topic model is perfect; topics can be overbroad or have words spuriously associated with them. What you should think about in deciding whether a topic model is appropriate is whether you're willing to accept that error. If we get 200 responses a month, the best research method might just be qualitative: spend an hour or so reading the responses and taking down some notes. Don't do topic modeling unless the data is big enough to warrant it. To summarize the dataset qualitatively, you only need to read a representative sample of it; in a very big collection of documents, that could be as little as a percent or two of the data. But the general rule of thumb is that sampling around 10% of the data will reveal the major themes therein. The risk with this approach, however, is that we might randomly, just by chance, happen to get a collection of people or documents complaining about seat size, for instance. In reality that sample was biased, and that was just the luck of the draw. Maybe the majority of people didn't actually complain about seats; seats were just overrepresented in our random sample.

If it's not big, there's really no point in doing a topic model, but "big" isn't straightforward. I've been talking a lot about 20,000 documents. That may sound like a lot, but what if it took 10 seconds to read each document, like a tweet? Then 20,000 might not be too big. You can read 2,000 tweets in a few hours because each one only takes a few seconds to read. You can't read 2,000 books in a few hours, so topic modeling is going to be really applicable there. Know going in that building a good topic model takes a reasonable amount of time. If you just fit a model in a few seconds, it's not going to be any good. Do topic modeling when manual annotation is just not feasible. If we can't manually annotate the data ourselves in a reasonable amount of time, it makes sense to do a topic model. If we can, we don't need topic modeling.
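If you do go the qualitative route, drawing that representative sample is a one-liner in Python. A minimal sketch, assuming responses is just a list of strings you've already loaded (qualitative_sample is an illustrative name, not a library function):

```python
import random

def qualitative_sample(responses, fraction=0.10, seed=42):
    """Draw a reproducible random sample (the ~10% rule of thumb) to read by hand."""
    random.seed(seed)
    k = max(1, int(len(responses) * fraction))
    return random.sample(responses, k)
```

And remember the caveat from above: even a random sample can overrepresent a theme by the luck of the draw.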
The most important thing you need to know about topic models is that even if the computer comes up with a cluster of words that you think cleanly represents a topic, the machine is really headless. It doesn't really understand what it is about one topic or another. So when I see topic modeling or unsupervised machine learning used in real-world business applications, such as contextual advertising for instance, I get really worried, because there's really no way to guarantee that a cluster of words genuinely and truly maps to what a human would interpret as a topic. It's just a collection of words that tend to go together. So remember: if we don't have to do topic modeling, we shouldn't. But there are lots of use cases where we should use topic modeling: anytime you have a large collection of documents you want to make sense of, or extract a specific subset of documents from, and there are too many to read or classify, then we probably want to do topic modeling. There's no better example than social media data, and I think this is where topic modeling really started to take off in the late 2000s. During major public events, such as the Oscars, or a hurricane, or the fires that just happened in Louisville here in Colorado, there are millions of tweets. We can't even begin to manually label 1% of the tweets that are out there right now about the college football playoff, for instance. Instead we have to use computational methods to extract the most common topics.

Again, the concept of unsupervised machine learning means that we aren't telling the computer what's right and what's wrong. Supervised machine learning is based on labeled examples, where we show the computer the evidence. Supervision is the process where the computer evaluates its learning against the gold-standard data, the human-labeled data. Unsupervised machine learning doesn't start with known labels. We don't have any preconceived notion of anything. The fork in the road here between supervised and unsupervised approaches is important. If we want the computer to mathematically separate a collection of documents on its own, unsupervised is the way to go. If we want to empirically classify something that we as humans have a preconceived notion of, supervised machine learning is the way to go. That's why, with contextual advertising, it just doesn't make sense to use unsupervised machine learning. We can manually label examples of each category that we want to see the computer recreate. Topic models will never have that human granularity or match.

Go back to the Delta example. If we know that people complain about seat size, and we only want to label data on whether they complain about seat size or not, it's probably easier to use supervised machine learning. If you can find and label a dozen comments or so that talk about seat size, and a few hundred that don't, you're probably going to have a decent enough training set to build something that works better than a topic model. This is helpful if I want to know the total number of comments that were about seat size, but the rest of the data remains unknown. Unsupervised machine learning is the exact opposite. It can really help us observe patterns and tell us what those patterns are for a collection of documents. Another way to think of it is that unsupervised machine learning identifies clusters of documents in your larger collection of documents.
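To make that seat-size contrast concrete, here's a minimal supervised sketch in Python using scikit-learn (my library choice, with made-up comments and labels; a real training set would have a dozen positives and a few hundred negatives, as described above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled comments: 1 = complains about seat size, 0 = does not.
comments = [
    "the seats are way too small",
    "no legroom at all, my knees hurt",
    "wifi was slow the whole flight",
    "crew was friendly and boarding was fast",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(comments, labels)

# Flag new comments; everything outside this one category stays unknown.
print(clf.predict(["seat was cramped and tiny"]))
```

Everything this model can do is bounded by the labels we gave it; a topic model, by contrast, will surface patterns we never asked about.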
It's up to you how you use these clusters: you can use clustering to classify documents, or you can use clustering to just generally get a sense of what's inside your big dataset. The real goal with unstructured data is to learn the hidden structure behind the dataset. This is the most commonly used LDA topic modeling visualization. It's called LDAvis (pyLDAvis in Python), and we're going to load it up here in a few lectures. It's a popular tool because it visualizes the distance between topics. The closer two topics are to each other, the more similar they are semantically. So, using unsupervised methods to extract unknown commonalities from unstructured text data: that's topic modeling. That's the best way I can describe this to you as a research method. It's what we're going to do when we say topic modeling in this class. The only thing that we really haven't talked too much about is this definition of unstructured text. Any document, whether it's a book or a tweet or a survey response or a transcription from recorded audio, is unstructured text. Unstructured text isn't just written; it could be transcribed from audio. It could even be metadata derived from vision AI. Topic modeling in a nutshell is taking something that is unstructured and deriving the structure that lies underneath.

So what does it look like? The workflow of a topic model starts with a collection of documents. Let's say that the collection in this case is 1 million tweets, and they're stored in a JSON file on a server. Every entry represents one tweet. We have 1 million entries that we're putting into a black box, or an algorithm. Topic modeling algorithms are hard to observe or inspect, but they usually allow for parameters to be adjusted or tuned to optimize the quality of the model. How do you optimize quality? Well, honestly, I think the best way to do it is human intuition, but we'll look at some statistics that can help us. Most topic models are specified with a known number of topics from the start. This is referred to as capital K.

Documents in topic models can also be clustered in different ways. Soft clustering allows documents to belong to more than one topic. For instance, I can tweet about immigration and the economy together in one tweet. These are two separate topics, but they're present in the same document/tweet. I might want to build a topic model that treats that tweet as belonging to one topic or another. Alternatively, I might want to build a topic model that says that both topics are present in the tweet. That's really the difference between hard, soft, and hierarchical clustering. Hierarchical clustering is somewhat in the middle. As we go further down our topics, they get more granular; as we cluster up the hierarchy, the topics get more generalized. So imagine that we're looking at political tweets. There might be one grand topic about each political candidate present in the tweets, but underneath each candidate there might be things like specific issues. Hierarchical clustering allows us to fetch documents at varying granularity. In theory, I can extract all of the tweets that mentioned Joe Biden, or I could zoom in further to get all of the posts that mentioned the economy plus Joe Biden. If we suspect that there's some type of real structure in our data, then this hierarchical approach works really well. Not all topics in reality have this kind of structure. Terser, shorter text documents tend to have little hierarchy because there's little granularity to extract in the topics. Hierarchical clustering works well with big documents and big collections of documents where there are lots of topics and subtopics. We'll quickly review hierarchical models in our last lecture, but keep them in mind when you suspect this kind of structure is present in your data.
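Here's what that black box can look like in code: a minimal sketch with scikit-learn's LatentDirichletAllocation (one library choice among many; the toy tweets are made up). Notice that K is fixed up front, and the soft clustering shows up as a row of topic weights for each document:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tiny corpus standing in for the 1 million tweets in the JSON file.
docs = [
    "the economy and jobs dominated the debate tonight",
    "new immigration bill heads to the senate floor",
    "jobs report shows the economy slowly growing",
    "immigration and the economy in one heated tweet",
]

K = 2  # capital K: the number of topics, specified before fitting
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)  # the document-term matrix

lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topics = lda.fit_transform(dtm)  # each row sums to 1: soft memberships

# Top words per topic -- the word clusters a human still has to interpret.
words = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {top}")

# The last tweet mentions immigration and the economy together, so its
# row in doc_topics can carry real weight on both topics at once.
print(doc_topics[-1])
```

From here, pyLDAvis can render the inter-topic distance map I mentioned above; we'll do that in a later lecture.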
Topic modeling is done in all different ways across the web, and I think it's good to talk through some examples. So Google News is essentially a big topic model. How do you think Google News uses topic modeling to make sense of the way it presents thousands of news articles to consumers? What's it doing here? How is it using topic modeling on this page, and why is that helpful? It's taking articles and clustering them based on similarity. Imagine that for a given day, Google News takes all of the news articles published that day and extracts the text features from those articles. It looks at words that are commonly used together in those articles, puts the articles into topics or clusters, and then presents them to you organized by cluster. So if you're interested in one of these topics, you can quickly see seven or so articles based on this semantic representation of the topic. It's not perfect, and there will be times where it suggests articles as related that are in fact not. So this is topic modeling in the wild. We could do this with any large collection of documents that we have. If we have a company intranet, for instance, and we want to go through and look at all the marketing assets that lie therein and cluster them by topic, product, or service, then we could ingest all of the content from our intranet into a topic model and present something like this. We can cluster our marketing documents in the very same way that Google is clustering news documents.

So this is probably the most scandalous slide that I'll put up in the sequence, but this is Yik Yak, which is now back from the dead, which is fascinating to me. A blast from the past. I had a bright idea when I was a young professor: I wanted to listen to Yik Yak because I wanted to know what it was and what people were talking about. So I listened to all of the major universities' Yik Yaks. I collected data for about a month, until the tool that I was using just stopped working. So this is a topic model for CU for that month. You can actually start to see what the conversations are, right? This was about a million messages. I had no idea what was in it beforehand. Well, I mean, I had some idea, right? Because if you open the app, you kind of get it. But it is very much sex and drugs, and there's not much rock and roll in here. There are topics that clearly represent professors. There are topics that represent university life. But by and large, you can see that it's really talking about the vices that we all have, especially in our college years. [LAUGH] So the topic model really helped us learn a little more. The order that you see the topics in here is based on prevalence, so the most commonly occurring topics are at the top of the list.

So here's one for Wendy's tweets, a marketing example built from a few million tweets. How informative is this? I don't know, but there might be an insight or two in here. Topic #10: people really want their spicy nuggets back. Let's say that you're a huge brand and you're doing social listening. Topic modeling can help us get a better feel for what's trending with regard to Wendy's on Twitter.
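Going back to the Google News example for a second: that kind of grouping by similarity can be sketched in a few lines. Here's a minimal version with scikit-learn's AgglomerativeClustering on TF-IDF vectors (my own toy headlines, and not Google's actual method; it also happens to be a hierarchical technique, the kind we'll revisit in the last lecture):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical headlines standing in for a day's worth of news articles.
headlines = [
    "wildfire spreads near boulder county overnight",
    "crews battle fast-moving colorado wildfire",
    "stocks rally as markets close at record high",
    "tech stocks lead broad market gains",
]

# Dense TF-IDF vectors; AgglomerativeClustering needs a dense array.
X = TfidfVectorizer().fit_transform(headlines).toarray()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Articles with the same label would sit under the same story grouping.
for label, headline in sorted(zip(labels, headlines)):
    print(label, headline)
```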
So this topic model comes from tweets collected in 2012 during the presidential election. In the left column you'll see the words in each topic that the computer found in the clusters, and on the right are human descriptions that we wrote as researchers. We read about 100 tweets for each topic to make sure that the label was generally on the right track. But let me tell you, there's a good deal of error that comes with this approach. That's why, when I see these types of approaches being used in the wild, like in contextual advertising, I get nervous. Essentially what we're doing is taking a cluster of words that really doesn't have a firm definition the computer is trying to emulate. It's just telling you what tends to cluster together, and there can be error in there. And that's a problem. But we can try to minimize that, and if we accept as a limitation that some documents will be misclassified, it can still be a valid tool.
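If you want to do that kind of spot check yourself, here's a minimal sketch that pulls up to 100 documents per topic for human review, assuming docs and the doc_topics matrix from the LDA sketch earlier (sample_for_review is just an illustrative name):

```python
import numpy as np

def sample_for_review(docs, doc_topics, per_topic=100, seed=0):
    """For each topic, sample documents it dominates so a human can read them."""
    rng = np.random.default_rng(seed)
    assignments = doc_topics.argmax(axis=1)  # hard-assign each doc to its top topic
    review = {}
    for k in range(doc_topics.shape[1]):
        idx = np.where(assignments == k)[0]
        n = min(per_topic, len(idx))
        chosen = rng.choice(idx, size=n, replace=False) if n else []
        review[k] = [docs[i] for i in chosen]
    return review
```

Reading those samples is what turns a cluster of words into a label you can defend, and it's also how you catch the misclassifications we just talked about.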