Hello again. In this section I'm going to give you a quick example of a widely used Python library called Pandas. By now, you've already been exposed to the concept of a data frame within Spark and possibly in R. Pandas is the de facto standard Python library for creating and working with data frames, and it's a dependency for many other machine learning libraries. Today, the definition of a data frame differs slightly between Python, Spark and R; however, the industry is quickly moving towards a standard, unified definition.

Just a quick note: if you're coming from a programming or software engineering background as I did, the concept and approach to data frames may take a minute to digest. However, once you understand how data frame functions can make your life easier, you will wonder why you waited so long. Data frame functions are designed to operate primarily on the data frame itself, with little or no code around them. Built into a data frame are functions that iterate over columns or rows, otherwise known as axes; that utilize indexes, which can be defined or reset on the fly; and that provide the ability to go back and forth between multiple object types with the help of NumPy, another core library in machine learning. Okay, enough talk, here are some quick examples.

In this segment, I'm going to talk about Pandas, which is a Python library that's core to what data scientists do in Python. It's part of the Anaconda distribution, and it's also installable via the Python pip installer. It's used in lots and lots of other libraries, and it's the core foundation for data frames, which is what we've been accessing via Pandas SQL. If you recall, last time we were going through descriptive statistics and taking a look at how to calculate those statistics using SQL: our counts, our means and medians, mins and maxes. We were also taking a look at interquartile ranges and how to calculate those kinds of things.
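To make the earlier point about data frame functions a little more concrete, here is a minimal sketch of working along axes, setting and resetting an index, and going back and forth with NumPy. The toy data and the column names (`year`, `tweets`, `likes`) are made up for illustration and are not the tables used in the course.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration only.
df = pd.DataFrame(
    {"year": [2015, 2016, 2017], "tweets": [10, 25, 40], "likes": [100, 300, 900]}
)

# Functions run along an axis: axis=0 works down each column,
# axis=1 works across each row.
column_means = df[["tweets", "likes"]].mean(axis=0)
row_totals = df[["tweets", "likes"]].sum(axis=1)

# Indexes can be defined and reset on the fly.
by_year = df.set_index("year")            # 'year' becomes the index
tweets_2016 = by_year.loc[2016, "tweets"]  # label-based lookup on the new index
restored = by_year.reset_index()           # 'year' is an ordinary column again

# Going back and forth between object types with NumPy.
as_array = df[["tweets", "likes"]].to_numpy()  # DataFrame -> NumPy array
back_to_frame = pd.DataFrame(as_array, columns=["tweets", "likes"])  # and back

print(column_means)
print(row_totals.tolist())
print(tweets_2016)
```

Notice that there are no explicit loops here; the iteration over rows or columns happens inside the data frame functions themselves.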
So a fair amount of SQL just to get some of those things. We created some additional tables and then took a look at some of those things. So in this segment, we're really going to take a look at just Pandas and how to do the same thing, but using Pandas functions instead.

So before we get started, we'll set an option here for the display format. I don't know about you, but when I see scientific notation, it takes me a second to read it; I want to just see the number rather than have to convert it. So I'm going to set this option to show me just two decimal places and otherwise expand everything out, so I can quickly go through and do my analysis.

All right, with that, we're going to take a look at the EM tweet table. Just to refresh you, we're going to go look at the ERD. So we've got a number of different fields here. We've also got text fields and whatnot. All of our text fields will get dropped, meaning they won't come up in our describe. Describe is going to give us all those same kinds of stats that we were looking at before, but just for the numerical fields. So we'll go ahead and run that, and with a quick click I can now see my counts, my means, standard deviations, min, max, and the quartiles for my IQR, all at my fingertips, very quick, very easy.

I also have access to each of those individually if I want. This is a print statement that will just display the word mean against its value. So we're going to take a look at the count field specifically in the tweet frequency table, the tweet frequency by year table specifically: the mean, median, standard deviation, min and max. And then we'll also take a look at the describe for that one. Another function you can use is plot, which is based off matplotlib. So a quick little visualization; we're not going to get too deep into that in this segment, but we will be following up on that in another segment.
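The notebook code itself isn't reproduced in this transcript, so the following is only a rough sketch of the steps just described: setting a float display format, calling describe, and pulling out the individual statistics. The data frame, its values, and the column names (`year`, `count`, `note`) are hypothetical stand-ins for the tweets-per-year table.

```python
import pandas as pd

# Show floats with two decimal places instead of scientific notation.
pd.set_option("display.float_format", "{:.2f}".format)

# Hypothetical stand-in for the tweets-per-year table (values are made up).
df = pd.DataFrame(
    {
        "year": [2012, 2013, 2014, 2015, 2016],
        "count": [10, 20, 30, 40, 100],
        "note": ["a", "b", "c", "d", "e"],  # text column, skipped by describe()
    }
)

# describe() summarizes the numeric columns only: count, mean, std,
# min, 25%, 50%, 75%, max.
summary = df.describe()
print(summary)

# Each statistic is also available on its own. Bracket access is needed here
# because df.count is already a DataFrame method.
counts = df["count"]
print("mean  ", counts.mean())
print("median", counts.median())
print("std   ", counts.std())
print("min   ", counts.min())
print("max   ", counts.max())

# The interquartile range comes from the same quartiles describe() reports.
iqr = counts.quantile(0.75) - counts.quantile(0.25)
print("iqr   ", iqr)

# With matplotlib installed, a quick plot would be:
# df.plot(x="year", y="count")
```

The plot call is left commented out so the sketch runs without matplotlib.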
So stay tuned for that. Let me go ahead and run this. You'll see here that I have access to each of the individual metrics as well as the overall describe, so I can take a look at any of those things very, very quickly. And then in addition, here's the plot. We were looking at year versus the number of tweets per year that Elon was posting, so you can see this amping up as we get closer to the current time.

Just a quick note before we leave this segment, on some of the favorite things that I use all the time: reading CSVs in, reading JSON files, and dumping data out to Excel very quickly; creating data frames; converting things from a NumPy array to lists; grabbing values from a Pandas Series, which also dumps to a list; using the concat and merge functions to do joins and unions; and doing things with the DataFrame apply. If you recall, we had some functions up there, the anonymous functions that we were using. So these are just some of my daily go-tos for using Pandas.

Okay, with that, we're going to stop here, and we'll meet you back up when we start talking about visualizations.
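For reference, here is a small sketch of those daily go-tos in one place. It reads CSV and JSON from in-memory strings so it runs without any files, and every name and value in it (`tweets`, `labels`, `count_doubled`, and so on) is made up for illustration rather than taken from the course notebook.

```python
import io

import pandas as pd

# Reading CSV and JSON (from in-memory text here; a file path works the same way).
csv_text = "year,count\n2015,10\n2016,25\n"
tweets = pd.read_csv(io.StringIO(csv_text))

json_text = '[{"year": 2015, "label": "early"}, {"year": 2016, "label": "later"}]'
labels = pd.read_json(io.StringIO(json_text))

# Creating a data frame directly from a dictionary.
more = pd.DataFrame({"year": [2017], "count": [40]})

# concat stacks rows (a union); merge matches rows on a key (a join).
all_tweets = pd.concat([tweets, more], ignore_index=True)
joined = all_tweets.merge(labels, on="year", how="left")

# DataFrame.apply with an anonymous (lambda) function, one row at a time.
joined["count_doubled"] = joined.apply(lambda row: row["count"] * 2, axis=1)

# Grabbing values out as plain Python lists.
years = joined["year"].tolist()                     # from a Pandas Series
count_values = joined["count"].to_numpy().tolist()  # via a NumPy array

print(joined)
print(years, count_values)

# Dumping to Excel needs an Excel engine such as openpyxl installed:
# joined.to_excel("tweets.xlsx", index=False)
```

The left merge keeps the 2017 row even though it has no matching label, so its `label` value comes back as missing.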