Sometimes we want to select data based on groups and understand aggregated data at the group level. We've seen that even though Pandas allows us to iterate over every row in a DataFrame, it's generally very slow to do this. Fortunately, Pandas has a groupby() function to speed up such tasks. The idea behind groupby() is that it takes some DataFrame, splits it into chunks based on some key values, applies computation on those chunks, and then combines the results back together into another DataFrame. In Pandas this is referred to as the split-apply-combine pattern.

Let's take a look at an example. First, we'll bring in our Pandas and NumPy libraries: import pandas and import numpy. Let's look at some US Census data, which we'll read from datasets/census.csv. We're going to exclude the state-level summarizations, which have a sum level value of 40, and keep only the rows that have a sum level value of 50. It turns out there are only two sum levels in this data file, 40 and 50. Let's look at the head of that; this is the data we're going to work with.

In the first example of groupby(), I want to use this census data. Let's get a list of all of the unique states; then we can iterate over those states, and for each state reduce the DataFrame and calculate the average. Let's run this task three times and time it. For this, we'll use the cell magic function %%timeit. I'll say %%timeit -n 3, so Jupyter will run the cell three times and present the averaged timings. We write for state in df['STNAME'].unique(): this projects the state-name column of the DataFrame, gives us a list of all the unique values in it, and we just iterate over that. For each state, we calculate the average using NumPy: the average equals np.average() of df.where(df['STNAME'] == state), using a Boolean mask to exclude the rows for other states, followed by .dropna(). We project just the Census 2010 population column and pass that into np.average(). Then we print it to the screen: counties in this state have an average population of whatever the average is. When we get that up and running, if you scroll down to the bottom of the output you can see it takes a fair bit of time to finish.

Now, let's try another approach using groupby(). Again, we'll say %%timeit -n 3. For this method, we start by telling Pandas we're interested in grouping by the state name; this is our split step: for group, frame in df.groupby('STNAME'). You'll notice that we unpack two values here. groupby() actually returns a tuple, where the first value is the value of the key we're grouping by, in this case a specific state name, and the second is the projected DataFrame that was found for this group. So we get a tuple back from groupby(). Now we include our logic in the apply step, and that's just to calculate an average of the Census 2010 population. Here it's easy: the average equals np.average(), projecting just the one column we're interested in from the frame. We don't have to do any further reduction here, because we know the frame is all for a given state. Then we print the results, which looks the same as before: counties in this group have an average population of whatever the string of the average is.
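Reconstructed as a sketch from the description above, here's the setup cell, assuming the census columns carry that dataset's usual names SUMLEV, STNAME, and CENSUS2010POP:

```python
import pandas as pd
import numpy as np

# Load the census data and keep only the county-level rows (SUMLEV == 50),
# excluding the state-level summarizations (SUMLEV == 40)
df = pd.read_csv('datasets/census.csv')
df = df[df['SUMLEV'] == 50]
df.head()
```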
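The first timing cell might then look like this; each pass through the loop scans the whole DataFrame once per state:

```python
%%timeit -n 3
for state in df['STNAME'].unique():
    # Boolean mask for one state, drop the non-matching rows, then average
    avg = np.average(df.where(df['STNAME'] == state)
                       .dropna()['CENSUS2010POP'])
    print('Counties in state ' + state +
          ' have an average population of ' + str(avg))
```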
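And a sketch of the groupby() version just described:

```python
%%timeit -n 3
# groupby does the split for us: each iteration yields a
# (key, sub-DataFrame) tuple, one tuple per state
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group +
          ' have an average population of ' + str(avg))
```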
We don't have to worry about the combine step in this case, because all of our data "transformation" is actually just printing out the results. Let's run this. That's a huge difference in speed, an improvement of roughly two orders of magnitude.

Now, 99 percent of the time you'll use groupby() on one or more columns. But you can also provide a function to groupby() and use that to segment your data. This is a bit of a fabricated example, but let's say you have a big batch job with lots of processing and you want to work on only a third or so of the states at a given time. We could create some function which returns a number between zero and two based on the first character of the state name. Then we can tell groupby() to use this function to split up our DataFrame. It's important to note that in order to do this, you need to set the index of the DataFrame to be the column that you want to group by first.

We'll create a new function called set_batch_number. If the first letter of the parameter is less than a capital 'M', we'll return a zero; if it's less than a capital 'Q', we'll return a one; otherwise, we'll return a two. First we set the index: df = df.set_index('STNAME'). Then here's our function, set_batch_number: if item[0] is less than 'M', we return a zero. If it's less than 'Q', it's in our second batch, and we return a one. Otherwise, we just lump it into our last batch and return a two. Now the DataFrame can be grouped according to the batch number, and we loop through each batch group: for group, frame in df.groupby(set_batch_number), where we just pass our function in. Then we print out: there are so many records (the length of the frame) in this group for processing. Notice that this time I didn't pass a column name to groupby(); instead, I set the index of the DataFrame to be STNAME. If no column identifier is passed, groupby() will automatically use that index.

Let's take one more look at an example of how we might group data. In this example, I want to use a dataset of housing listings from Airbnb. In this dataset, there are two columns of interest: one is the cancellation policy, and the other is the review scores value. We'll bring this in with df = pd.read_csv('datasets/listings.csv'), and let's print out the head of that.

How would we group by both of these columns? A first approach might be to promote them to a multi-index and then just call groupby(). Here, I'll say df = df.set_index() with the cancellation policy and the review scores value. When we have a multi-index, we need to pass in the levels that we're interested in grouping by; by default, groupby() does not know, and does not assume, that you want to group by all levels. Here, we just say for group, frame in df.groupby() with levels zero and one, and print that out. There, we can see that we get lots of different groups, split by both levels we're interested in.

This seems to work out okay. But what if we wanted to group by the cancellation policy and review scores, but separate all of the 10s from those under 10? In this case, we could use a function to manage the groupings. We'll define some function called grouping_fun and pass in an item; we'll check the review_scores_value portion of the index. The item is in the format (cancellation policy, review_scores_value), so it's a tuple. If item[1] equals 10, we return item[0] and 10.0, leaving that one basically alone. Otherwise, we return item[0] and just the string 'not 10.0', because we only care about breaking the listings into two groups. Then, for group, frame in df.groupby() with our grouping_fun passed in, we print out the groups.
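Backing up, here's a sketch of the batch-number grouping described earlier, under the same column-name assumptions:

```python
df = df.set_index('STNAME')

def set_batch_number(item):
    # item is one index value (a state name); bucket it by first letter
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2

# With no column name given, groupby applies the function to the index
for group, frame in df.groupby(set_batch_number):
    print('There are ' + str(len(frame)) +
          ' records in group ' + str(group) + ' for processing.')
```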
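Next, the multi-index approach for the Airbnb data; the column names cancellation_policy and review_scores_value are assumptions based on the usual listings schema:

```python
df = pd.read_csv('datasets/listings.csv')
df = df.set_index(['cancellation_policy', 'review_scores_value'])

# With a MultiIndex we must name the levels we want to group on
for group, frame in df.groupby(level=(0, 1)):
    print(group)
```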
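And finally a sketch of grouping_fun. Because the DataFrame still has its multi-index, groupby hands the function each (cancellation_policy, review_scores_value) tuple:

```python
def grouping_fun(item):
    # item is one MultiIndex entry: (cancellation_policy, review_scores_value)
    if item[1] == 10.0:
        return (item[0], '10.0')
    return (item[0], 'not 10.0')

for group, frame in df.groupby(by=grouping_fun):
    print(group)

df.head()
```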
There we see that we've grouped by some cancellation policy (flexible, moderate, strict, or super strict) together with some score, either 10 or not 10. Let's look at the head of our DataFrame; so that's what our DataFrame looks like.

To this point, we've applied very simple processing to our data after splitting, really just outputting some print statements to demonstrate how the splitting works. The Pandas developers describe three broad categories of data processing that happen during the apply step: aggregation of group data, transformation of group data, and filtration of group data.

The most straightforward apply step is the aggregation of data, which uses a method called agg() on the groupby object. Thus far, we've only iterated through the groupby object, unpacking it into a label (the group name) and a DataFrame. But with agg(), we can pass in a dictionary of the columns we are interested in aggregating, along with the functions we're looking to apply.

Let's reset the index for our Airbnb data: df = df.reset_index(). Remember, you have to assign this back to the DataFrame, or use inplace=True; that still catches me out quite a bit, so keep it in mind. Now let's group by the cancellation policy and find the average review scores value by group. We want df.groupby() on the cancellation policy, as it was before, but now on the groupby object that gets returned we call .agg(), and to .agg() we give a dictionary. The keys are the columns we're interested in targeting, and the values are the operations or functions we want applied. Here I'll pass in the review scores value column and np.average, straight from NumPy. That didn't seem to work at all: just a bunch of NaNs. The issue is actually in the function that we sent to agg(), and this is me just showing a mistake that I made as I was preparing this lecture: np.average does not ignore NaN values. However, there is a function that we can use for this. I'll write the statement all over again, just the same as it was, df.groupby() on the cancellation policy followed by .agg(), but now I'll pass in the function np.nanmean, which gets the mean values (the average) but excludes NaN values. There we can see that we actually have our review scores values for each category in nice aggregate form. You can see how simple and readable that was to write.

We can extend this dictionary to aggregate by multiple functions, or by multiple columns, if we want to. So df.groupby() on the cancellation policy, then .agg(), and now in our dictionary we pass in all of the different functions and columns that we're looking to create values from: for the review scores value I'll send in a tuple of NumPy's nanmean and NumPy's nanstd (the NaN-aware standard deviation), and then for reviews per month I'll pass in NumPy's nanmean.
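Here's a sketch of the aggregation step just described; the np.average line is shown only to illustrate the NaN mistake:

```python
df = df.reset_index()

# np.average propagates NaNs, so every group comes back as NaN:
# df.groupby('cancellation_policy').agg({'review_scores_value': np.average})

# np.nanmean ignores the NaNs and gives the per-group averages
df.groupby('cancellation_policy').agg({'review_scores_value': np.nanmean})
```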
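And the extended dictionary with multiple functions and columns, assuming a reviews_per_month column is present in the listings data:

```python
# A tuple of functions for one column, plus a single function for another
df.groupby('cancellation_policy').agg(
    {'review_scores_value': (np.nanmean, np.nanstd),
     'reviews_per_month': np.nanmean})
```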
Take a moment to make sure you understand the previous cell, since it's somewhat complex. First, we're doing a groupby() on the DataFrame object by the column cancellation policy; this creates a new groupby object. Then we're invoking the agg() function on that object. The agg() function is going to apply one or more functions that we specify to the grouped DataFrames and return a single row per DataFrame/group. When we called this function, we sent it two dictionary entries, each with a key indicating which column we wanted functions applied to. For the first column, we actually supplied a tuple of two functions. Note that these are not function invocations, like np.nanmean with parentheses after it, or function names, like "nanmean" as a string. They are references to functions which will return single values. The groupby object will recognize the tuple and call each function in order on the same column. The results will then be in a hierarchical index, but since they are columns they don't show up as an index per se. Then we indicated another column, and a single function we wanted to be run on it. It's really important that you understand what happened here and how that statement was created. If you understand that, you're flying; that's great. If not, you really should go back and review that statement and this part of the text or the video, to better understand how this is being called underneath.

Transformation is different from aggregation. Where agg() returns a single value per column, so one row per group, transform() returns an object that is the same size as the group. Essentially, it broadcasts the function you supply over the grouped DataFrame, returning a new DataFrame. This makes combining data later quite easy. For instance, suppose we wanted to include the average rating value for a given group, grouped by cancellation policy, but preserve the DataFrame's shape so that we could generate the difference between an individual observation and the group mean.

First, let's define the subset of columns we're interested in. Here I'll make a variable cols holding the cancellation policy and the review scores value; this just makes things a little cleaner as I type. Now let's transform, storing the result in its own DataFrame: transform_df = df[cols].groupby() on the cancellation policy, followed by .transform() with np.nanmean. Let's take a look at the head of this DataFrame. We can see that its index is actually the same as the original DataFrame's, so let's join it in. Before we do that, let's rename the column in the transformed version: transform_df.rename(), taking the review scores value, because it's not actually the review scores value anymore, and renaming it to the mean review scores, with axis set to columns and inplace set to True. Then I merge the DataFrames, passing in transform_df as the other one and merging on the indices, because the indices between our two DataFrames are the same. Let's take a look at the new df we've created; we can see that our new column, the mean review scores, is in place.

Now we can create, for instance, the difference between a given row and its group's (the cancellation policy's) mean. We do that with df['mean_diff'], the new column we're creating, and assign it np.absolute() of the review scores value minus the mean review scores. There are a number of things going on here: I'm taking our review scores value and applying the subtraction operator, vectorized, across it with the mean review scores, making the difference there. Then I'm calling the absolute value, also vectorized, across that and assigning it back to the DataFrame as the mean difference column. Let's take a look at the head.
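Putting that together, a sketch of the transform, rename, merge, and difference sequence, under the same column-name assumptions:

```python
cols = ['cancellation_policy', 'review_scores_value']

# transform() broadcasts the group mean back over every row, so the
# result has the same index (and length) as the original DataFrame
transform_df = df[cols].groupby('cancellation_policy').transform(np.nanmean)
transform_df.rename({'review_scores_value': 'mean_review_scores'},
                    axis='columns', inplace=True)

# The indices line up, so we can merge on them
df = df.merge(transform_df, left_index=True, right_index=True)

# Vectorized deviation of each listing from its group mean
df['mean_diff'] = np.absolute(df['review_scores_value'] -
                              df['mean_review_scores'])
df.head()
```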
The groupby object has built-in support for filtering groups as well. Often you'll want to group by some feature, make some transformations to the groups, and then drop certain groups as part of your cleaning routine. The filter() function takes a function which it applies to each group DataFrame; the function returns either True or False, depending on whether that group should be included in the results. For instance, say we wanted only those groups which have a mean rating above 9.2 included in our results. We'd say df.groupby() on the cancellation policy, then .filter(). Here we pass in a function, a lambda x, and we just want to take the mean of x; again, remember we have to use np.nanmean because we have NaNs in there. We keep only the groups where the mean review scores value is greater than 9.2. Notice that the results are still indexed, but any rows which were in a group with a mean review score of less than or equal to 9.2 were not copied over, because of how we did our projection.

By far, the most common operation I invoke on groupby objects is the apply() function. This allows you to apply an arbitrary function to each group, and stitches the results of each apply back together into a single DataFrame where the index is preserved. So let's look at an example using our Airbnb data. I'm going to get a clean copy of the DataFrame, loading it again from datasets/listings.csv, and keep just the columns we were interested in previously, dropping everything except the cancellation policy and the review scores value. Let's look at that.

In previous work, we wanted to find the average review score of a listing and its deviation from the group mean. This was a two-step process: first we used transform() on the groupby object, and then we had to broadcast to create a new column. With apply(), we can wrap this logic in one place. I'm going to write a function, def calc_mean_review_scores, and pass in a group. This is working on a DataFrame: group is a DataFrame of just whatever we've grouped by, in this case the cancellation policy, so we can treat it as the complete DataFrame that we're operating on. It's a bit of a mind shift. Here I create an average value, which is just np.nanmean of the group's review scores value. Then we broadcast our formula and create a new column on the group: the review scores mean, equal to np.abs (the absolute value) of the average minus the group's review scores value. And then we return the group. Now we just apply this to all of the groups: df.groupby() on the cancellation policy, then .apply(), sending in the function that we want to work on all of the groups, calc_mean_review_scores. Let's look at the head of this.
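A sketch of the filter step described above:

```python
# Keep only the rows whose cancellation-policy group has a mean
# review score above 9.2; np.nanmean skips the NaN scores
df.groupby('cancellation_policy').filter(
    lambda x: np.nanmean(x['review_scores_value']) > 9.2)
```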
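And the apply() version, again assuming the listings column names used earlier:

```python
df = pd.read_csv('datasets/listings.csv')
df = df[['cancellation_policy', 'review_scores_value']]

def calc_mean_review_scores(group):
    # group is the sub-DataFrame for a single cancellation policy
    avg = np.nanmean(group['review_scores_value'])
    # Broadcast the deviation from the group mean onto every row
    group['review_scores_mean'] = np.abs(avg - group['review_scores_value'])
    return group

df.groupby('cancellation_policy').apply(calc_mean_review_scores).head()
```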
Using apply() can be slower than using some of the specialized functions, especially agg(). But if your DataFrames are not huge, it's a solid general-purpose approach.

groupby() is a powerful and commonly used tool for data cleaning and data analysis. Once you've grouped the data by some category, you have a DataFrame of just those values, and you can conduct aggregate analysis on the segments that you're interested in. The groupby() function follows a split-apply-combine approach: first the data is split into groups, then you apply some transformation, filtering, or aggregation, and then the results are automatically combined for you by Pandas.