We're here in the last lecture of our second course. You're almost two-thirds of the way through the sequence, and I'm hoping all of these concepts in Natural Language Processing and information retrieval are starting to coalesce in your mind, and you're beginning to understand how all of this is related but, oddly, somehow a little bit different. There's a really wonderful new package out called BERTopic, however you want to pronounce it. It really is genius, because a researcher realized that the semantic relationships words tend to have with each other can be incorporated into a topic model from the beginning, so that our topic model has a general semantic understanding of word co-occurrences built in right when it gets started. Also, whenever we use these types of pre-trained models, it makes the pre-processing that we spent so much time on in tmtoolkit really easy. Let's take a look.

Remember, any time you use deep learning you're going to need your GPU. Make sure you go to your runtime, change your runtime type, and confirm it's set to GPU; mine is. Now we import BERTopic and, again, we need to restart our runtime. After we do that, we load in our review JSON, just the raw JSON, not the processed data we were building with tmtoolkit. We go through and put the review text into a list, one document after another, and we get a list of 21,570 reviews. It's just a simple Python list, not a special data format at all, and we use BERTopic. The first thing you have to do is tell it what language you want; we definitely want to include the topic probabilities, and we want it to print progress as it goes so we know it's still chugging. Then we just give it the text documents and that's it. Can you believe it? That's three lines of code. It's amazing.

You can see here that we can actually extract our topics, and they seem to be good. We can also print out any individual topic; this is just Topic 0. The documentation explains that when you see a Topic -1, it's really saying, "Hey, this is just stuff we couldn't classify." It's important to understand that out of roughly 21,000 reviews, about 7,500 couldn't be assigned a good topic fit, about a third of the data. That's probably problematic, but at least the majority of the data was classified. It doesn't even want you to really play around with that topic, because it's just the "other" bucket; it's not really one thing. That's why it's labeled Topic -1.

Topic 0 is our first real topic, and we can ask whether it makes sense qualitatively: watch, band, wrist, battery, watches. Yeah, I definitely think this topic is talking about the Nike watch, or perhaps the Nike Apple Watch, but it does have some features in it that it probably shouldn't have. What could we do to address this? Well, we could do some pre-processing using the Natural Language Toolkit, like we did in the first class, to filter parts of speech. We could remove stop words from our corpus before we give it to BERTopic. Any pre-processing we want to do before we hand the text to BERT, we can do. We can remove these features so that BERT never even sees words like "is," "this," "that," or "it's," because we're definitely seeing that some very common words that don't carry much meaning are included in this topic.
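For reference, here's a minimal sketch of the workflow described above. The file name "reviews.json" and the "reviewText" field are hypothetical stand-ins for whatever your review JSON actually contains, and the BERTopic arguments shown may vary slightly between versions.

```python
# Minimal sketch of the workflow described above. "reviews.json" and the
# "reviewText" key are hypothetical stand-ins for your actual data.
import json
from bertopic import BERTopic

# Load the raw reviews into a plain Python list, one document per review.
with open("reviews.json") as f:
    reviews = json.load(f)
docs = [r["reviewText"] for r in reviews]

# The "three lines": set the language, ask for topic probabilities, and
# turn on verbose output so you can see that it's still chugging along.
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)

# Inspect the results: Topic -1 is the "couldn't classify" bucket,
# Topic 0 is the first real topic.
print(topic_model.get_topic_info().head())
print(topic_model.get_topic(0))
```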
Whenever we see this, we either want to filter by part of speech and keep things like nouns, adjectives, and verbs, or we want to remove stop words from our corpus. That being said, I think it's definitely picking up on the topic here.

You're also given an intertopic distance map that looks a lot like LDAvis, and you can manipulate it a little more easily: you can zoom in, you can take screenshots of what you want to show people, and you can get a feel for where the top topics sit. What are you seeing here? What should you see? There's a ton of overlap in the topic distances. Did I specify the number of topics when I built this model? No, I didn't, actually. So how do I lower the number of topics I see in my visualization?

Well, you can use what they call hierarchical clustering in BERTopic to figure out where the main topics are. If you remember from our earlier lecture on topic modeling concepts, hierarchical clustering is the ability to take data and slowly get more granular with the clustering as you go. When we look at the individual words here, you can see that for Topics 31 and 37 the top word is "comfortable." That tells me these two topics are probably related, and so this most granular level of topic modeling is probably not appropriate for this dataset. I would think we should be talking about this level of granularity, or maybe one more level down. The finest level of granularity that BERTopic is trying to pull out of the topics, I just don't think it exists here.

What can you do? Quite simply, you can treat Topics 46, 20, 36, 6, 41, 21, and 8 all as the same topic. You can literally fold and merge those into one big topic after you get the classifications. You can just say: look, the model thinks these are pretty semantically related, and looking at them I can't find the difference between them, so I'm going to put them all in one bin. Certainly with the comfortable gym workout shoes, I'm seeing that with the red cluster here as well. With hierarchical clustering, you get to see how the different topics are related, you get to decide the level of granularity you want to accept into your model, and you can simply merge topics that are really similar. Even though the map up here doesn't look great and we see lots of overlap, we can merge these topics into bigger topics by simply saying, in pandas or wherever: when we see this combination of topics, it really just means this one cluster, and when we see this other combination, it really just means this other cluster.

You can also visualize the top terms, so you can see what the top terms for each topic are. If you hate word clouds like I do, maybe this is a better way to represent a topic model. It gives you the ability to save screenshots as you go and play around with it, and you can specify the number of top topics you want here as well. This is really nice, and this next visualization shows you the similarity of topics. You can almost think of it as a correlation matrix: it's just showing you how correlated two topics are. Where there's a dark blue cell, there's a correlation, and you can definitely see some dark blue clusters. It seems like all of these topics in here are related; there's one cluster of topics here that's very related, and another cluster over here.
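Here's a rough sketch of those visualizations and the manual topic-merging idea, assuming the fitted `topic_model` and `topics` from the earlier snippet. The topic IDs in the merge map are just the ones called out in this lecture; yours will differ run to run.

```python
# Each visualize_* call returns a Plotly figure you can display or save.
topic_model.visualize_topics().show()                   # intertopic distance map (LDAvis-style)
topic_model.visualize_hierarchy().show()                # hierarchical clustering of topics
topic_model.visualize_barchart(top_n_topics=10).show()  # top terms per topic, no word clouds needed
topic_model.visualize_heatmap().show()                  # topic similarity "correlation" matrix

# Manually fold near-duplicate topics into one bin after classification.
# The target ID is arbitrary; do this in pandas or plain Python.
merge_map = {20: 46, 36: 46, 6: 46, 41: 46, 21: 46, 8: 46}
merged_topics = [merge_map.get(t, t) for t in topics]

# Newer BERTopic versions can also do the merge for you, roughly:
# topic_model.merge_topics(docs, topics_to_merge=[46, 20, 36, 6, 41, 21, 8])
```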
When you're looking at a correlation matrix, you always want to look at the dark shaded regions, in this case dark blue, and you can definitely see that there are some clusters in this data. Again, that suggests we probably want to fold some of these topics into larger clusters of topics so we can remove that similarity, that overlap, from our data.

What's also really nice is that you can see the decline in term score per topic across term rank. One of our major assumptions of topic models is that the words in one topic should be distinct from the words in other topics, so what you want to see is the term score dropping dramatically as rank increases. This really helps you understand the ideal number of terms to use when describing a topic. You can see that most topics show a strong term score decline after two words; most topics really are just two to three words. There are a few topics, like Topic 122, that seem to have maybe eight or more words to them, but the majority really just have two or three. We should keep in mind that we probably shouldn't print out more than two or three words for a topic, because beyond that point the remaining words have such small term scores that they're only very seldomly or very vaguely involved with that topic.

BERTopic also allows you to update topics and to reduce topics without rebuilding the model. You can say: well, I've looked at this, and it's very clear to me from the graph up here that there are really only 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 topics. So if I want to reduce my number of topics, I can fold them together manually, or I can have it do it for me by specifying this parameter and telling it how many topics I want. That's another way the package can help you with topic overlap.

You can also search for topics that contain particular words. If you want to know which topics contain the word "vehicle," you can simply search for it and get back the topics that match. And it's very easy to save your BERTopic models so you can load them again later; it's just a few lines of code.

I take no credit for this wonderful BERTopic package. All the credit goes to the author, and I encourage you to take the time to look through the documentation, because there's so much more you can do with BERTopic, and it's definitely the most elegant approach to topic modeling. I'll be honest: if I'm doing an academic research paper where I want to be really sure the topics are absolutely right, I'm going to spend the time and try to make it work with tmtoolkit. However, if I just want a quick-and-dirty topic model so I can quickly understand what's going on inside the data, I have been very impressed with the quality of these BERTopic models. I think that all goes back to the additional knowledge you gain from these pre-trained semantic models: you're building a topic model with a general understanding of human language from the outset, and that really translates into high-quality models. So that's it. You've really mastered all there is to know about topic modeling. There's always more to know, but seriously, you have mastered the basics now.
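And a sketch of the remaining utilities mentioned above: the term-rank plot, topic reduction, keyword search, and saving/loading. Exact signatures, reduce_topics in particular, vary a little between BERTopic versions, so treat this as illustrative rather than definitive.

```python
# Term score decline across term rank, one line per topic.
topic_model.visualize_term_rank().show()

# Reduce the number of topics instead of merging them by hand.
# (Older BERTopic versions also expect the list of topics as an argument.)
topic_model.reduce_topics(docs, nr_topics=10)

# Search for topics that contain a given word.
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5)

# Save the model and load it back later; the path is up to you.
topic_model.save("bertopic_reviews_model")
# loaded_model = BERTopic.load("bertopic_reviews_model")
```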
You've got all of the tools you need to build topic models that perform as well as they possibly can, and now I'm excited to see how you fit this data with your own topic modeling approaches in your final project. So good luck, and remember to communicate with your fellow students if you're struggling, or if you want to talk about what you did to get something to work. I don't mind if you share within this class; I just want you to learn as much as you possibly can about using these text methods to create useful products. Thanks so much for listening to this class. We're going to come back with one more course, where we'll do network analysis, which again is about the semantic relationships that words have with each other and about visualizing them. And then you will be certified in marketing text analytics. Thanks so much.