As we've seen, both Series and DataFrames can have indices applied to them. The index is essentially a row-level label, and in Pandas the rows correspond to axis zero. Indices can either be autogenerated, such as when we created a new Series without an index, in which case we get numeric values, or they can be set explicitly, like when we use a dictionary object to create a Series, or when we load data from a CSV file and set the appropriate parameters. Another option for setting an index is to use the set_index function. This function takes a list of columns and promotes those columns to an index. In this lecture, we'll explore more about how indices work in Pandas.

The set_index function is a destructive process: it doesn't keep the current index. If you want to keep the current index, you need to manually create a new column and copy into it the values from the index attribute.

So let's import Pandas and our admissions dataset. So import pandas as pd. We'll create a new DataFrame, df equals pd.read_csv 'datasets/Admissions_Predict.csv', and we'll set the index column to zero, and let's look at the head of this. So let's say that we don't want to index the DataFrame by serial numbers, but instead by the chance of admission. But let's assume we want to keep the serial number for later. So let's preserve the serial number in a new column. We can do this using the indexing operator with the string that has the column label, then we can use set_index to set the index to the 'Chance of Admit' column. So we copy the indexed data into its own column: df sub 'Serial Number' equals df.index. So we just make a copy of that into a column that we've labeled Serial Number, then we set the index to another column. So df equals df.set_index 'Chance of Admit', and df.head. You'll see that when we create a new index from an existing column, the index has a name, which is the original name of the column. We can get rid of the index completely by calling the function reset_index.
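The steps just described can be sketched as follows. Since the CSV file may not be at hand, this uses a tiny stand-in DataFrame with assumed column names rather than the real datasets/Admissions_Predict.csv:

```python
import pandas as pd

# Tiny stand-in for the admissions data (column names assumed; with the
# real file you would instead write
# pd.read_csv('datasets/Admissions_Predict.csv', index_col=0)).
df = pd.DataFrame(
    {'GRE Score': [337, 324, 316],
     'Chance of Admit': [0.92, 0.76, 0.72]},
    index=pd.Index([1, 2, 3], name='Serial No.'))

# Preserve the current index in a plain column first ...
df['Serial Number'] = df.index

# ... because set_index is destructive: the old index is discarded.
df = df.set_index('Chance of Admit')

print(df.index.name)        # the new index is named after the column
print(df['Serial Number'])  # the old index survives as a column
```

Note that the new index keeps the name of the column it was promoted from, just as described above.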
This promotes the index into a column and creates a default numbered index. So df equals df.reset_index, and then df.head, and we see that Chance of Admit is now promoted back into a column, and we have a numeric index.

One nice feature of Pandas is multi-level indexing. This is similar to composite keys in relational database systems. To create a multi-level index, we simply call set_index and give it a list of columns that we're interested in promoting to an index. Pandas will search through these in order, finding the distinct data, and form a composite index. A good example of this is often found when dealing with geographical data, which is sorted by region or demographics.

Let's change datasets and look at some census data for a better example. This data is stored in the file census.csv and comes from the United States Census Bureau. In particular, this is a breakdown of population data for the US at the county level. It's a great example of how different kinds of datasets might be formatted when you're trying to clean them. So let's import it and see what the data looks like. So df equals pd.read_csv 'datasets/census.csv', and we'll look at the head.

In this dataset there are two summary levels: one that contains summary data for the whole country, and one that contains summary data for each state. Suppose we want to see a list of all the unique values in a given column. We can see the possible values for the summary level by using the unique function of the DataFrame. This is similar to the SQL DISTINCT operator. So here we can run unique on the summary level of our current DataFrame. So df, and we'll just project our SUMLEV column, and .unique to see the unique values there. We see that there are actually only two different values: 40 and 50. Let's exclude all of the rows that are summaries at the state level, and just keep the county data.
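A short sketch of both steps, assuming stand-in data with illustrative values rather than the real census.csv:

```python
import pandas as pd

# Stand-in for the admissions frame after set_index (values assumed).
df = pd.DataFrame({'Serial Number': [1, 2, 3]},
                  index=pd.Index([0.92, 0.76, 0.72],
                                 name='Chance of Admit'))

# reset_index promotes the index back into a column and installs a
# default numeric RangeIndex.
df = df.reset_index()
print(df.columns.tolist())   # ['Chance of Admit', 'Serial Number']

# unique() works like SQL's DISTINCT. In the census data, SUMLEV 40
# marks summary rows and 50 marks county-level rows.
census = pd.DataFrame({'SUMLEV': [40, 50, 50, 40, 50]})
print(census['SUMLEV'].unique())   # [40 50]
```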
So I'll overwrite our DataFrame: df equals df sub, and we'll want to do that where df sub 'SUMLEV' is equal to 50, and let's look at the head of that.

Also, while this dataset is interesting for a number of different reasons, let's reduce the data that we're going to look at to just the total population estimates and the total number of births. We can do this by creating a list of column names that we want to keep, then projecting those and assigning the resulting DataFrame to our df variable. So columns_to_keep, and we'll list STNAME, CTYNAME, the BIRTHS columns for 2010 through 2015, and then our POPESTIMATE columns for 2010 through 2015. Then our DataFrame is just df sub columns_to_keep, and let's look at the head of that. So a smaller DataFrame, but still plenty big.

The US Census data breaks down population estimates by both state and county. We can load the data and set the index to be a combination of the state and county values, and see how Pandas handles it in a DataFrame. We do this by creating a list of the column identifiers that we want to make up the index, then calling set_index with this list, and assigning the output as appropriate. We can have a dual index: first the state name, and second the county name. So df equals df.set_index, and I'm going to pass it a little list with first the state name and then the county name, and let's look at the head of that. That's a nice rendering of how the counties are kept within each state.

The immediate question that comes up is how we can query this DataFrame. We saw previously that the loc attribute of the DataFrame can take multiple arguments, and it can query both the rows and the columns. When you use a MultiIndex, you must provide the arguments in order by the level you wish to query. Inside of the index, each column is called a level, and the outermost column is level zero.
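Putting the filtering, projection, and multi-level indexing together, here is a sketch on a miniature stand-in for the census data (the population and birth figures are made up for illustration; the column names follow the lecture):

```python
import pandas as pd

# Miniature stand-in for datasets/census.csv (numbers illustrative).
df = pd.DataFrame({
    'SUMLEV': [40, 50, 50],
    'STNAME': ['Michigan', 'Michigan', 'Michigan'],
    'CTYNAME': ['Michigan', 'Washtenaw County', 'Wayne County'],
    'BIRTHS2010': [28000, 850, 5500],
    'POPESTIMATE2010': [9877000, 345000, 1820000]})

# Keep only the county-level rows (SUMLEV 50).
df = df[df['SUMLEV'] == 50]

# Project just the columns we care about.
columns_to_keep = ['STNAME', 'CTYNAME', 'BIRTHS2010', 'POPESTIMATE2010']
df = df[columns_to_keep]

# Promote state then county to a two-level (hierarchical) index.
df = df.set_index(['STNAME', 'CTYNAME'])
print(df.index.nlevels)   # 2
```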
So if I wanted to see the population results for Washtenaw County in the state of Michigan, which is where I live, the first argument would be 'Michigan' and the second would be 'Washtenaw County'. So df.loc sub 'Michigan', 'Washtenaw County'.

If you're interested in comparing two counties, for example Washtenaw and Wayne County, we can pass a list of tuples describing the indices we wish to query into the loc attribute. Since we have a MultiIndex of two values, the state and the county, we need to provide two values as each element of our filtering list. Each tuple should have two elements: the first element being the first index, and the second element being the second index. In this case, we want a list of two tuples; in each tuple, the first element is 'Michigan', and the second element is either 'Washtenaw County' or 'Wayne County'. So df.loc, and remember we use the indexing operator on loc, and then we create a list inside there, where the first item is the tuple 'Michigan' and 'Washtenaw County', and the second is the tuple 'Michigan' and 'Wayne County'.

So that's how hierarchical indices work in a nutshell. They're a special part of the Pandas library which I think can make management and reasoning about data easier. Of course, hierarchical labeling isn't just for rows. For example, you can transpose this matrix and now have hierarchical column labels. Projecting a single column which has these labels works exactly the way that you would expect it to. Now, in reality, I don't tend to use hierarchical indices very much; instead, I just keep everything as columns and manipulate those. But it's a unique and sophisticated aspect of Pandas that's useful to know, especially when viewing your data in tabular form.
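The querying patterns above can be sketched on a small stand-in MultiIndexed frame (population figures illustrative, not real census values):

```python
import pandas as pd

# Stand-in for the MultiIndexed census frame (values illustrative).
df = pd.DataFrame(
    {'POPESTIMATE2010': [345000, 1820000]},
    index=pd.MultiIndex.from_tuples(
        [('Michigan', 'Washtenaw County'), ('Michigan', 'Wayne County')],
        names=['STNAME', 'CTYNAME']))

# One county: give loc the levels in order, outermost first.
washtenaw = df.loc['Michigan', 'Washtenaw County']

# Several counties: pass a list of (state, county) tuples.
both = df.loc[[('Michigan', 'Washtenaw County'),
               ('Michigan', 'Wayne County')]]
print(len(both))   # 2

# Transposing the frame moves the hierarchy onto the columns instead.
print(df.T.columns.nlevels)   # 2
```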