We're here with our last set of notebooks on network analysis for this sequence in marketing text analytics. If you look outside, as Bob Dylan once said, it's not dark yet, but it's getting there. So let's close the evening by taking a look at how to prepare networks and visualize networks in Python. If we look at what we have here, it's really a simple set of packages. We're going to use two main ones that pertain to network analysis: NetworkX and pyvis. Again, you may have trouble getting these installed, and you may need to restart your terminal, but as of right now it looks like all you've got to do is run these, and if they're not installed they'll install themselves using this logic. We've got to set our working directory, we've talked about the merits of that, and we've got to download our actual data. So we're going to download some Twitter data that I just fetched from Twitter, of course. We've got tweets that mention Nike, tweets that mention Lululemon, and tweets that mention Adidas. These are commented out here in my notebook, so just uncomment them and give them a run and you will be able to download these. Remember to change these references, and remember that if you've got spaces in your Google Drive folder names you've got to escape them. Once you get that open, this is in JSONL format, which stands for JSON Lines. Each line of this text file corresponds to one JSON entry, so we've got to iterate through them to see what they look like. We go ahead and make a list of unique users here that we'll use in a minute, and then we iterate through the JSON file. I'm just going to say that if the iteration is less than 50, I want to look at the tweet, and you can see that if Nike's in there, I'm going to print out the text of that tweet.
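As a rough illustration of that preview step, here is a minimal sketch. The sample records, the function name preview_mentions, and the case-insensitive match are assumptions made for the example; the notebook's own file path and matching rule may differ.

```python
import io
import json

# Hypothetical stand-in for the downloaded JSONL file. In the notebook this
# would be a real file handle opened on your own path.
sample_jsonl = io.StringIO(
    '{"text": "New Nike drop today", "user": {"screen_name": "runner_a"}}\n'
    '{"text": "Love my Adidas", "user": {"screen_name": "runner_b"}}\n'
    '{"text": "nike vs lululemon?", "user": {"screen_name": "runner_c"}}\n'
)


def preview_mentions(lines, brand, limit=50):
    """Return the text of tweets among the first `limit` lines that mention `brand`."""
    matches = []
    for i, line in enumerate(lines):
        if i >= limit:
            break  # only look at the first `limit` tweets
        tweet = json.loads(line)  # each line is one JSON entry
        text = tweet.get("text", "")
        # Case-insensitive match is an assumption, not necessarily the notebook's rule.
        if brand.lower() in text.lower():
            matches.append(text)
    return matches


nike_preview = preview_mentions(sample_jsonl, "Nike")
for text in nike_preview:
    print(text)
```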
So I went through 50 tweets, and it looks like of those there were probably a dozen or so that mentioned Nike, and you can see what the text of those tweets looks like. We've got a collection of a ton of tweets, and now we've got to prepare this into a mention network. Today we're going to look at who mentions whom on Twitter, and we're going to try to create not a semantic network but a social network analysis of people. The first thing we need to do when we're iterating through this data is create a database of all unique users. Remember, in this network analysis there is no way we're going to be able to visualize all the people as nodes in this network. We're going to have to make a heuristic choice to filter the data down to something that can be easily visualized. So we should make some intentional moves here to get people that are influential, and I think we'll do that. We're going to take the JSON file and iterate through it line by line, because I told you every line of the JSON file is a tweet. And we're going to use our good old counter here to say, once we've gotten through 10,000 tweets, tell us that we have reached that far, because we'll get a little impatient otherwise. Let's see how many tweets there are: about 170,000. By the way, this is about three months of Twitter data from the end of 2021. JSON is pretty easy to load when it's a string. It's a string of text that we're having the json package interpret. We use loads, for load string, because we're loading in one string of JSON at a time instead of a whole document at once. That's why we use loads versus load. Now, the real magic of Twitter data is that they've got this wonderful taxonomy of different fields that are returned when we ask for the data for one tweet.
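To make the loads versus load distinction concrete, here is a small sketch with a progress counter. The function name count_tweets and the report_every parameter are hypothetical, and the interval is shrunk to 2 so the tiny sample triggers a message.

```python
import io
import json


def count_tweets(lines, report_every=10_000):
    """Parse a JSONL stream one line at a time and report progress."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        json.loads(line)  # loads: parse ONE JSON string (one tweet)
        count += 1
        if count % report_every == 0:
            print(f"Processed {count} tweets")
    return count


# Hypothetical three-tweet file; report every 2 tweets just for the demo.
fake_file = io.StringIO('{"id": 1}\n{"id": 2}\n{"id": 3}\n')
total = count_tweets(fake_file, report_every=2)

# load (no "s") reads a whole file-like object holding a single JSON document.
whole_document = json.load(io.StringIO('{"id": 1}'))
print(total, whole_document)
```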
So we've got all these different fields, and you can see here that I'm taking this tweet JSON, which is now a dictionary, looking at the user key, and then pulling out the screen name. Here's another line where I'm looking at the user and then pulling out just the ID. What we have here is the ability to parse out different fields of this tweet to do different things with. And if you haven't checked out the homework yet on Twitter JSON, you're going to get really familiar with how to parse stuff out of tweets just by completing that. First we're just going to make a dictionary of unique people and count up the number of times that they tweeted. Using that same kind of basic dictionary counter logic that we used before, we're going to ask Python whether the ID of the person who tweeted is not yet in the unique users. If it isn't, we're going to make a new dictionary entry for them. That's what this line is doing: it says, go ahead and add the user as a key in the dictionary, make its value a dictionary in and of itself, and inside of that dictionary put the number of times that person has tweeted. We also want to put the ID of the user and the number of followers that user has, so that we can do some filtering later on. We're only going to look at people that have a bunch of followers. So we're counting the number of tweets with tweet count. We start with one the first time we encounter someone, and then on every subsequent encounter we take it up by one so that we can get that total count. Remember, the incrementing happens here with the plus equals one. So we go through our tweets, and we have 175,000 tweets. That's a lot of tweets. And if we look at the number of unique users by printing the length of the dictionary, because remember, each entry in the dictionary is one user, we get 104,000 unique users. We can't visualize 104,000 nodes. That just ain't going to go.
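The counting logic described above can be sketched roughly as follows. The sample tweets are invented, and the inner key names (tweet_count, id, followers) are assumptions; the field names user, id, screen_name, and followers_count follow the classic Twitter tweet object.

```python
import json

# Hypothetical sample lines standing in for the JSONL file.
raw_lines = [
    '{"user": {"id": 1, "screen_name": "brand_fan", "followers_count": 250000}}',
    '{"user": {"id": 2, "screen_name": "casual_user", "followers_count": 120}}',
    '{"user": {"id": 1, "screen_name": "brand_fan", "followers_count": 250000}}',
]

unique_users = {}
for line in raw_lines:
    tweet = json.loads(line)
    user = tweet["user"]
    user_id = user["id"]
    if user_id not in unique_users:
        # First encounter: create a nested dictionary for this user.
        unique_users[user_id] = {
            "tweet_count": 1,
            "id": user_id,
            "followers": user["followers_count"],
        }
    else:
        # Every later encounter bumps the count by one.
        unique_users[user_id]["tweet_count"] += 1

print(len(unique_users))  # number of unique users
```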
So we've got to figure out a way to filter it. How are we going to do that? Well, we've got to go through the data again and make some heuristic choices as to who to include and who not to include, a lot like what we did with the Amazon reviews and the Amazon ASINs. So we go back through the data, and this time, for every user in our unique user dictionary, we say: if the user has tweeted more than once, so two or more tweets, and if they have over 100,000 followers, then we're going to go ahead and include them in our database. I think we should just take a second to think about what that means. It means these are people that tweeted at least twice about these three brands over three months. You've got to be passionate about shoes to do that, right? I don't think any single person in this room falls under that category. And I don't think anyone in this room has over 100,000 Twitter followers; my apologies if you do. But what we want to do is interpret this, and we want to make sure that whenever we're presenting these results, we can contextualize what we've done here. We have looked at people that tweet somewhat frequently about these brands, and we are looking at people that are very influential, right? Over 100,000 followers on Twitter means that you are basically a brand or a celebrity or a product or a service or a news organization. It does not mean that you're an everyday person. I don't know too many everyday folks with 100,000 followers on Twitter. So if the user satisfies both of these requirements, then we're going to go ahead and put them in the users to include set. We print out the length of that list and we've got 196 people to visualize. That is a tiny drop in the bucket compared to the 104,000 that we were looking at, but that's okay, because remember, with network analysis it's just hard to visualize a lot of stuff. So you can see here that we are literally reducing our nodes by more than 99%, and we need to make sure that we present that as such.
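Here is a minimal sketch of that two-condition filter. The sample dictionary and the constant names MIN_TWEETS and MIN_FOLLOWERS are made up for illustration; the thresholds (at least two tweets, more than 100,000 followers) are the ones described above.

```python
# Hypothetical user dictionary in the shape built during the counting pass.
unique_users = {
    1: {"tweet_count": 2, "id": 1, "followers": 250_000},
    2: {"tweet_count": 1, "id": 2, "followers": 900_000},  # tweeted only once
    3: {"tweet_count": 5, "id": 3, "followers": 120},      # too few followers
    4: {"tweet_count": 3, "id": 4, "followers": 100_001},
}

MIN_TWEETS = 2           # tweeted more than once
MIN_FOLLOWERS = 100_000  # must have MORE than this many followers

users_to_include = {
    user_id
    for user_id, info in unique_users.items()
    if info["tweet_count"] >= MIN_TWEETS and info["followers"] > MIN_FOLLOWERS
}

share_kept = len(users_to_include) / len(unique_users)
print(sorted(users_to_include), f"{share_kept:.0%} of users kept")
```

With the numbers from the lecture, 196 users out of roughly 104,000 is about 0.2 percent kept, which is where the "more than 99%" reduction comes from.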
One thing I'll say is, hey, there are probably still some bots in here. Probably not many with 100,000 followers, since there are very few bots that have that many followers. But if it were me, I would probably do some other things to figure out who I want to include and who I don't. One thing I might do is run each user through Botometer and just see what it says. If it thinks someone looks like a bot, I probably wouldn't include them in the graph, because I don't really know what advantage there would be in making sense of bots. Now, you could do a separate network analysis of just bots and see what bots are tweeting about Nike. That would be a fascinating analysis. But for now we're going to focus on people. Every time we iterate through the JSON file, we've got to reopen it, and that's why we're reopening it here. If we try to iterate over the dataset again without reopening it, Python will just say there's nothing left to iterate over. So we've got to reopen that file, and we are opening up something in NetworkX called a DiGraph. What do you think DiGraph means? It doesn't mean dimensional graph; it actually means directed graph. It's a graph with direction, because remember, with Twitter mentions there's a direction. If I tweet at you, I am sending and you're receiving. We want to capture these relationships in our social network analysis. So once again, let's go through the raw tweet data and pull out the relevant fields that we're going to need. We're going to need the screen name, we're going to need their ID, and we also want to know their follower count. So we're saying to Python here: if the ID of the person who tweeted is in that list of IDs, then we're going to do something, right? If they're not, we're just going to pass over the tweet. But if they do fall inside of that list, then we're going to pull out all of the users that they're mentioning in their tweet.
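The point about reopening the file can be seen in a small sketch. The temporary file and its two fake records are assumptions for the demo; the behavior shown (an open file handle is exhausted after one full pass) is standard Python.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical tiny JSONL file used only for this demonstration.
    path = os.path.join(tmp_dir, "tweets.jsonl")
    with open(path, "w") as out:
        out.write('{"id": 1}\n{"id": 2}\n')

    handle = open(path)
    first_pass = sum(1 for _ in handle)   # reads both lines
    second_pass = sum(1 for _ in handle)  # handle is exhausted: nothing left
    handle.close()

    # Reopening gives a fresh handle positioned at the start of the file.
    with open(path) as handle:
        reopened_pass = sum(1 for _ in handle)

print(first_pass, second_pass, reopened_pass)
```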
And we can actually pull out the users that are being mentioned in a tweet by going inside of the entities dictionary and then pulling out the user mentions. Now, user mentions is a list. If the length of that list is greater than zero, then let's go ahead and iterate through the users that are being mentioned one at a time. Each time, we take the screen name and the ID, and we add it as an edge in our edge list. We created this graph instance, and this graph instance accepts edges with the add_edge method. So if we use add_edge and specify the direction, which is from the user who tweeted to the screen name of the person being mentioned, that's all we have to do for a social network analysis, right? We just tell the network graph every single interaction, every single connection that exists in the network. When people mention each other multiple times, it will just keep summing that up. So remember, we have a directed, valued network here that we've created in just a few lines of code. It takes a while to iterate, but you can see we end up with 126 nodes and 210 edges. If we visualize this, which we can do quite easily with the plot functionality, you can see that we actually have a picture of that network and who's talking to whom. Now, you can see that I've set the figure size here to 300 by 300, which is quite large for NetworkX. It's really helpful for you to just go ahead and download this. You can right click on it, click save as, and then open it up in your application of choice, Preview on Mac, or Paint, or whatever you use to look at photos in Windows. Then you can zoom in all the way to see these network structures and the names of the nodes, and you can actually see who is tweeting at whom. Right here it's kind of zoomed out, right? You really can't capture that, and that's a limitation, again, of big networks. However, if you zoom in, you really can.
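A minimal sketch of building the mention network follows. The sample tweets and the explicit weight bookkeeping are assumptions: calling add_edge twice for the same pair on a NetworkX DiGraph does not add counts by itself, so this version keeps a weight attribute to hold the number of mentions. Whether the notebook tracks weights this way is not shown in the lecture.

```python
import json

import networkx as nx

# Hypothetical inputs: the filtered user IDs and a few fake tweets.
users_to_include = {1}
raw_lines = [
    json.dumps({"user": {"id": 1, "screen_name": "sneaker_news"},
                "entities": {"user_mentions": [
                    {"screen_name": "adidas", "id": 10},
                    {"screen_name": "Nike", "id": 11}]}}),
    json.dumps({"user": {"id": 1, "screen_name": "sneaker_news"},
                "entities": {"user_mentions": [
                    {"screen_name": "adidas", "id": 10}]}}),
    json.dumps({"user": {"id": 2, "screen_name": "casual_user"},
                "entities": {"user_mentions": [
                    {"screen_name": "adidas", "id": 10}]}}),
]

G = nx.DiGraph()  # directed: edges run from the tweeter to the mentioned user
for line in raw_lines:
    tweet = json.loads(line)
    if tweet["user"]["id"] not in users_to_include:
        continue  # pass over tweets from users we filtered out
    source = tweet["user"]["screen_name"]
    for mention in tweet["entities"]["user_mentions"]:
        target = mention["screen_name"]
        if G.has_edge(source, target):
            G[source][target]["weight"] += 1  # repeated mention: bump the weight
        else:
            G.add_edge(source, target, weight=1)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```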
One thing I'll say is that this network is good in that it's capturing relationships, but it's not perfect, right? I had to play around with these parameters quite a bit to even get this to be visualizable. I had to change the font color of the labels, I had to change the font size, I had to change the size of the nodes, I had to change the width of the actual edges or ties, and I had to change the size of the arrows so they were visible. There's more stuff that you can tweak in your visualization here at this URL, a lot more you can do with visualizing these nodes. But for now, just keep in mind that the most important thing is that this graph is not too crowded. If you see that your nodes are overlapping each other to an extreme extent, change the figure size to be larger on your visualization. When you plot, just make the figure size larger and that'll push the nodes out further. And again, I see a little bit of overlap, but when you zoom in on this, you can actually see that they look good. So let's go ahead and do that. I'm going to go ahead and save this on my desktop, open it up, and check it over. You can't see anything at first blush, but when you get in there, and this is a big image file, you can actually see the names of the different people in the network. As we get closer to the center here, you can see that there is some overlap in the nodes, but there's not a ton. Now remember, visualizing the network is the first step in interpreting the network. Refer to the rubric of the project for some of the questions that I would like you to answer with these networks. You can see that in this network, Adidas is actually the most central node, which is really fascinating. I can't even find Nike on this graph, and I've been looking for it, so it must be here somewhere. Nike must be here somewhere, and Lululemon must be here.
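The tweaks listed above map onto keyword arguments of NetworkX's draw_networkx function. This is only a sketch: the toy graph, the specific values, the spring layout, and the small figure size are assumptions chosen so the example runs quickly, not the notebook's actual settings.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical toy mention network.
G = nx.DiGraph()
G.add_edges_from([("fan_a", "adidas"), ("fan_b", "adidas"), ("fan_b", "Nike")])

# A larger figsize spreads the nodes out; increase it if nodes overlap.
fig, ax = plt.subplots(figsize=(12, 12))
pos = nx.spring_layout(G, seed=42)  # layout choice is an assumption
nx.draw_networkx(
    G,
    pos=pos,
    ax=ax,
    with_labels=True,
    font_color="black",  # label color
    font_size=8,         # label size
    node_size=300,       # node size
    width=0.5,           # edge (tie) width
    arrowsize=10,        # arrowhead size
)
ax.set_axis_off()

# Save to an in-memory buffer here; pass a file name instead to write a PNG.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)
```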
But it is clear that for the influentials out there on Twitter, the most talked about account is actually Adidas. So this is a decent network. I think it could be improved a little bit. I don't like the self loops, and I would get rid of them, but for now it's pretty good and it's a nice clean data viz. Of course, you can also graph this kind of organically using what is called, and I get all of these package names confused, pyvis. pyvis gives you an interactive visualization. It is kind of slow and you will have to be patient as this loads, but you can actually zoom into your network and get a feel for what the nodes are. Let's see if I can get it to zoom. I'm zooming, and once you get in there you can see that all of these nodes are visualized here. This package is definitely still a work in progress and something that you have to play with, and it really only works well with a few nodes. It doesn't work well with lots of nodes, and what you're seeing here is just too many nodes being visualized. So this is one way that you can create an interactive visualization, by using pyvis, and you can play around with some of this stuff, such as how you want the network to be clustered. You can turn the physics off to freeze the network, and as soon as you freeze the network, the movement goes away. So there's Nike appearing. It is a tool that can be used inside of the browser. You have to be patient, again, for it to load, because it is not perfect, and when you turn physics off, the network freezes and you can begin to make sense of it there. Every network analysis data viz program is different, and this one is no exception. But it's another way to do interactive network analysis, and there really are some nice features that you can use here inside of the pyvis package.
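Two of the points above, dropping the self loops and checking who is mentioned most, can be sketched with standard NetworkX calls. The toy graph is invented, and using in-degree as the measure of "most talked about" is an assumption about what was meant.

```python
import networkx as nx

# Hypothetical toy mention network with one self loop (adidas mentions itself).
G = nx.DiGraph()
G.add_edges_from([
    ("fan_a", "adidas"),
    ("fan_b", "adidas"),
    ("fan_b", "Nike"),
    ("adidas", "adidas"),
])

# Collect the self loops into a list first, then remove them from the graph.
self_loops = list(nx.selfloop_edges(G))
G.remove_edges_from(self_loops)

# In-degree = number of distinct accounts that mention a node.
most_mentioned = sorted(G.in_degree(), key=lambda pair: pair[1], reverse=True)
print(most_mentioned[0])
```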
We're just barely scratching the surface here, but it is a nice way for you to visualize a network interactively and play with it.