Hello and welcome back. We're continuing our analysis of Elon Musk's tweet. This time slightly revisited a little bit. If some of you are taking a look at this, you might notice something a little different. Our tweet, dataframe/table has changed slightly and we're pulling from a JSON instead of a CSV. Part of the reason for that is that I got into the middle of working through my data set and was realizing that the data set I had had truncated tweets and really I'm looking for full tweets, full text so that we can go through into some text analysis later. So hence we've got a slightly different data set. This data set, I'm going to go through and just move through some of these things here, is a little bit bigger. So now we're dealing with about four times the size. It's been refreshed to be very current as of just a few days ago, early December, and I've already done some data-prep. As things have come in, pandas wants to go through and convert some numbers into floats, but that doesn't really helped me as I'm going through and wanting to use IDs to link and whatnot. So I needed to fix up some of those columns. I've also added a couple columns. Because now I have that data where I've got the beginning and end of the actual tweets, so minus all of it mentions and the media that gets loaded in. So moving past that. Here we're going out and getting quoted tweets because we have that ability as well. We've got some code that I was able to find in a GitHub repo. We'll point that out to you. So in case you want to go and take a look at that, that's right here. That was a Twitter scraping repo that utilizes Selenium, Python, and Tweepy which is a library that allows you to go off and grab real-time using the Twitter API. So this maybe helpful for anybody that's taking a look at tweets. You can take a look at that if you've got questions. As you're going through your project, feel free to contact me. So let's just get into here. I'm going to pick this up where I've already gone through and loaded up some of the metadata. I guess one thing to cover before we go too far, you'll notice that my ERD has been modified and this is something that you'll have to do as you're going through your project. As you discover more data or as you introduce new features, you're going to want to make sure that you keep your ERD up to date. This is very helpful in making sure that you know how your data is connected and what's connected to what and how you can link across all of those different tables. So next I'm going to go through and set up a quoted tweets table. This is all primarily just to get out my retweet information. This segment is really focused in on taking a look at some descriptive stats. So we're going to get through some of these. I'm setting up a new em_tweet which is a subset of the em_tweet_full which has lots and lots of other data. We have new start and end dates as I was telling you before. So now we're going out to 2019-12-10. So we can see what fields we've got available now that match. I'm also going to add a column that strips just the text out so that I can do some analysis on that later. So I'll go ahead and do that now and just take a look and see what we've got. Okay, here we go. So here's my full count and just a sampling of some of the things that are out there. So my created, favorites, retweet counts, full text, start, stop, and then the actual texts where it's stripped out. Some of the mentions and things like that. I've also got a few other tables that I am going to take a look at just for descriptive stats purposes. I'm going to look at just tweet frequencies per hour, per month, per year. So I'm using that created_at field and extracting out the hour, month, and year, and then grabbing account and grouping by the hour, month, and year. So here's just an example. Here's what the tweet_freq_year looks like, so you can see. Then we're going to get into some descriptive stats. So these are some functions but really the functions that are just wrapped around SQL. So you can take a look at this and see. Here we go. Basically this function is taking in as an input, a table name, and a column. This is just to facilitate the ability to call this over and over again. All of these are usable as just regular SQL. So if you're looking for the account, you can see here the mean, the median, the mode, the minimum and maximum, etc. Then we dumped those into a couple of different dataframes. So we'll go ahead and execute that. Then we've got a function that gets our interquartile ranges. So if you're familiar with stats, your 25th, 50th, 75th, 100th percentile. This is just for SQL. So if you want to take a look at this, we break that down, we are using as part of SQLite ntile, which is like percentile. We're breaking out into four which gives us our quartiles. Then from there, we're selecting out each of the percentages and looking specifically at each percent mark that we're looking at. So one would correspond to 25th to 50th, 75th, etc. Then we just bring all those back together. So we're going to put that together. Now we're going to go through and actually run our stats against the tweet_freq tables. Plus I'm going to run it across em_tweet. So this basically is saying for every column that's in em_tweet, I want to go through and run my descriptive stats passing em_tweet as the table and then each column. Then I'm also going to get the interquartile ranges as well. So we'll run that real quick and then we can take a look. So here's just an output of those descriptive stats. So I can see my count, my mean, my median, min, max, etc. This will start to give you a feel for what your data looks like. How does it break down? Is it heavily weighted towards the top or the bottom? What does it look like? You can also see interesting things. Let's take a look as we start to get into the em_tweets. So IDs. Okay, IDs has to make a whole lot of sense necessarily for us to take a look at that the mean, median, mode because that's not that kind of variable. We're looking at the created. Created, that's interesting. But it also, as we get past some of these, here you'll say, okay, really there's account but there's no mean, median, min or max. So this is a Boolean field. So I don't think we're really capturing for that at the moment. So you can see that reflected. But it gives you lots of information here available to you. Now, this is all via SQL. We're going to show you a little bit later another segment how we do the same thing in pandas. I'm going to create just a few other tables here and we can take a look at our user table and our tweet users, here we go, and then I can go through and just take a look now at some of the stats that are coming out of each of our tables. So if I'm looking at our user table specifically now and I'm trying to understand, who is in the public tweet list? So here's all the people that he's publicly tweeted. Who was at the top of the mentions lists? So we've got Tesla, and then Erdayastronaut, and SpaceX. So we're still Tesla and SpaceX like our hypothesis. However, we've got somebody that's crept up to the top. Then in our retweets, Tesla and SpaceX, and then Teslarati is right up there as well. Then these mentions and retweets just shows us a similar thing here. So we're going to end our segment there. We will jump in to showing you how we can do a similar stats using pandas in our next segment.