In this lecture, we'll talk about one of the primary data types of the pandas library, the Series. You'll learn about the structure of the Series, how to query and merge Series objects together, and the importance of thinking about parallelization when engaging in data science programming. A pandas Series can be queried either by the index position or the index label. If you don't give the Series an index when you create it, the position and the label are effectively the same values. To query by numeric location, starting at 0, use the iloc attribute. To query by the index label, you can use the loc attribute. So let's start with an example. We'll use students enrolled in classes coming from a dictionary, so first we'll import pandas as pd, as normal. And then we'll create a dictionary called student_classes, and we'll take Alice and put her in Physics, Jack in Chemistry, Molly in English, and Sam in History. And now we'll create a new Series from this dictionary with pd.Series and print that to the screen. So, for this Series, if we wanted to see the fourth entry, we could use the iloc attribute with the parameter 3, so s.iloc[3]. Remember, we always start from zero. If we wanted to see what class Molly has, we would use the loc attribute with a parameter of Molly, so s.loc['Molly']. So keep in mind that iloc and loc are not methods, they are attributes, so you don't use parentheses to query them, but square brackets instead, and this is called the indexing operator. In Python this calls get or set for an item depending on the context of its use. This might seem a bit confusing if you're used to languages where encapsulation of attributes, variables, and properties is common, such as Java. Pandas tries to make our code a bit more readable and provides a sort of smart syntax using the indexing operator directly on the Series itself.
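Here's a minimal sketch of the position and label queries just described. The names fourth_entry and molly_class are only there for illustration and weren't part of the lecture.

```python
import pandas as pd

# Students and the class each one is enrolled in, as described in the lecture.
student_classes = {
    "Alice": "Physics",
    "Jack": "Chemistry",
    "Molly": "English",
    "Sam": "History",
}

# Dictionary keys become the index labels; dictionary values become the data.
s = pd.Series(student_classes)
print(s)

# Query by numeric position, counting from 0, so 3 is the fourth entry.
fourth_entry = s.iloc[3]
print(fourth_entry)

# Query by index label.
molly_class = s.loc["Molly"]
print(molly_class)
```

Notice that both iloc and loc take square brackets rather than parentheses, since they are attributes being indexed, not methods being called.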
For instance, if you pass an integer to the indexing operator on the Series itself, it will behave as if you want to query via the iloc attribute, so we'll do s[3], and it's just as if we used s.iloc[3]. And if you pass in an object, it will query as if you wanted to use the label-based loc attribute, so s['Molly'] actually queries as if we did s.loc['Molly']. So what happens if your index is actually a list of integers? This is a bit complicated, and pandas can't determine automatically whether you're intending to query by index position or index label. So you need to be careful when you're using the indexing operator on the Series itself. The safer option is to be more explicit and to use the iloc and loc attributes directly. Here's an example using classes and their class code information, where classes are indexed by class codes in the form of integers. So let's create a new dictionary, class_code, and we'll say 99 maps to Physics, 100 to Chemistry, 101 to English, and 102 to History, and we'll create a new Series. If we try to call s[0], we're going to get a key error, because there's no item in the class list with an index of 0. Instead, we have to call iloc explicitly if we want the first item. So s[0], and that gives us this nasty-looking key error. So that didn't call s.iloc[0] underneath as one might expect, and instead it generated this error. Now that we know how to get data out of the Series, let's talk about working with the data. A common task is to want to consider all of the values inside of a Series and do some sort of operation. This could be trying to find a certain number, or summarizing the data, or transforming the data in some kind of way. A typical programmatic approach to this would be to iterate over all of the items in the Series and invoke the operation one is interested in. For instance, we could create a Series of integers representing student grades, and just try to get the average grade.
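The class code example from a moment ago can be sketched roughly like this. The names raised_key_error, first_by_position, and physics_by_label are assumptions added for illustration.

```python
import pandas as pd

# Integer class codes mapped to class names, as in the lecture.
class_code = {99: "Physics", 100: "Chemistry", 101: "English", 102: "History"}
s = pd.Series(class_code)

# With an integer index, s[0] is treated as a label lookup,
# and there is no label 0 in this Series.
try:
    s[0]
    raised_key_error = False
except KeyError:
    raised_key_error = True
print(raised_key_error)

# Being explicit removes the ambiguity.
first_by_position = s.iloc[0]  # the first item, by position
physics_by_label = s.loc[99]   # the item whose label is 99
print(first_by_position, physics_by_label)
```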
So let's build that grades example: grades = pd.Series, and we'll pass in a list of all the grades, 90, 80, 70, 60. Then we create a counting variable, total, and then we'll just iterate. So for grade in grades, total is equal to itself plus the grade, and then we're just going to print out the total divided by the length of grades. So just a very simple averaging function. This works, but it's slow. Modern computers can do many tasks simultaneously, especially, but not only, tasks involving mathematics. Pandas and the underlying NumPy support a number of methods for computation, and vectorization in particular works with most of the functions in the NumPy library, including the sum function. So here's how we would really write the code, using the NumPy sum function. First we need to import the NumPy module, so import numpy as np. Then we'll just call np.sum and pass in an iterable item, in this case our pandas Series. So we say total = np.sum(grades), and then we just print out the total divided by the length of grades. Now both of the methods I just showed create the same value, but is one actually faster? The Jupyter notebook has a magic function which can help. So first let's create a big Series of random numbers, and this is actually used a lot when demonstrating techniques with pandas, so you should get used to seeing this. So numbers equals, and we'll just do pd.Series, and then here we just call np.random.randint and we pass it some parameters indicating how many random numbers we want and what we want those numbers between. So here I'm asking for 10,000 random numbers between 0 and 1,000. Now let's look at the top five items in this Series to make sure they actually seem random, and we can do this with the head function. So if we do numbers.head(), remember this is a Series, we're able to see what the first five numbers are. And we can actually verify the length of the Series is correct using the len function.
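As a rough sketch of the steps above, here are the two averaging approaches and the big random Series together. The names loop_mean and vector_mean are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# Iterative approach: add up the grades one at a time.
total = 0
for grade in grades:
    total += grade
loop_mean = total / len(grades)
print(loop_mean)

# Vectorized approach: let NumPy sum the whole Series in one call.
total = np.sum(grades)
vector_mean = total / len(grades)
print(vector_mean)

# A big Series of 10,000 random integers. Note that the upper bound
# passed to randint is exclusive, so the values run from 0 to 999.
numbers = pd.Series(np.random.randint(0, 1000, 10000))
print(numbers.head())  # first five values
print(len(numbers))
```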
So if we do len(numbers), we get 10,000, which is what we were looking for. Okay, so now we're confident that we have a big Series. The IPython interpreter has something called magic functions that begin with a percentage sign. If we type this sign and hit the Tab key, we can see a list of the available magic functions. You could write your own magic functions too, but that's a little bit outside of the scope of this course. So here we're actually going to use a cell magic function. These start with two percentage signs and wrap the code in the current Jupyter cell. The function we're going to use is called timeit. This function will run our code a few times to determine, on average, how long it takes. So let's run timeit with our original iterative code. You can give timeit the number of loops that you would like to run with the -n option; if you don't, it picks a number for you. I'll ask timeit here to use 100 runs because we're recording this. Note that in order to use a cell magic function, it has to be the first line of the cell, so two percentage signs, timeit is the function that we're interested in running, and we'll pass in a parameter of 100, so %%timeit -n 100. Now we'll just write our cell as normal, so we'll set total equal to 0, for number in numbers, total we're just incrementing, and now we're going to divide total by the length. So this is just the function we saw before. All right, not bad: timeit ran the code, and it doesn't seem to take very long at all. Now let's try with vectorization. So we'll just use timeit with two percentage signs at the beginning, and we'll pass in a parameter of 100 again. We'll make total equal to np.sum(numbers), and we'll divide total by the length of numbers, and we'll give this a run. Wow, this is a pretty shocking difference in speed, and it demonstrates why one should be aware of parallel computing features and start thinking in functional programming terms.
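The %%timeit cell magic only runs inside IPython or Jupyter, so here's a rough sketch of the same comparison using Python's standard timeit module instead. That's a swapped-in technique, not what was typed in the lecture, and the helper names loop_mean and vectorized_mean are made up for illustration. The timings you see will depend on your machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))


def loop_mean(series):
    """Average the values by iterating over them one at a time."""
    total = 0
    for number in series:
        total += number
    return total / len(series)


def vectorized_mean(series):
    """Average the values with a single vectorized NumPy sum."""
    return np.sum(series) / len(series)


# Run each version 100 times, mirroring the -n 100 option from the lecture.
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=100)
vector_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=100)

print(f"loop:       {loop_seconds:.4f} s for 100 runs")
print(f"vectorized: {vector_seconds:.4f} s for 100 runs")
```

In a notebook, the equivalent would be a cell whose first line is %%timeit -n 100, followed by the code to be timed.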
Put more simply, vectorization is the ability for a computer to apply one operation to many values at once. With high-performance chips, especially graphics cards, you can get dramatic speedups. Modern graphics cards can run thousands of instructions in parallel. A related feature in pandas and NumPy is called broadcasting. With broadcasting, we can apply an operation to every value in the Series, changing the Series. For instance, if we wanted to increase every random variable by two, we could do so quickly using the plus-equals operator directly on the Series object. Let's look at the head of our Series, so s.head(). And now let's just increase everything in the Series by two, so s += 2. So here we're applying the plus-equals operator directly to the Series object, not a single value. And now let's look at the head. The procedural way of doing this would be to iterate through all of the items in the Series and increase the values directly. Pandas does support iterating through a Series much like a dictionary, allowing you to unpack values easily. So we can use the iteritems function in particular, which returns a label and value. So for label, value in s.iteritems(), now for the item which is returned, let's call set_value. So s.set_value, we indicate the label, and for the value we just want to increment that by two. And then let's check the result of this computation by looking at the head. So the result is the same, though you may notice a warning depending on the version of pandas being used. But if you find yourself iterating pretty much anytime in pandas, you should question whether you're doing things in the best possible way. Let's take a look at some speed comparisons. First, let's try a few loops using the iterative approach. So we'll call timeit -n, and I'll use 10 here. We'll create a blank new Series of items to deal with. It's always good when timing to create a new Series. So we use our np.random.randint with our three parameters.
The first two are the range of our data values, and the next one is the number of values; we'll add 1,000 items in there. And we'll just rewrite our loop from above. So for label, value in s.iteritems(), s.loc[label] = value + 2. Now let's try using that broadcasting method. So we'll time it with 10 loops again. We need to recreate the Series, so s = pd.Series, and we'll pass in our values. And now we're just going to broadcast with +=, so s += 2. Amazing. Not only is it significantly faster, but it's more concise, and even easier to read too. The typical mathematical operations that you would expect are vectorized, and the NumPy documentation outlines what it would take to create vectorized functions of your own. One last note on using the indexing operators to access Series data. The .loc attribute lets you not only modify data in place, but also add new data as well. If the value you pass in as the index doesn't exist, then a new entry is created. And keep in mind, indices can have mixed types. While it's important to be aware of the typing going on underneath, pandas will automatically change the underlying NumPy types as appropriate. Here's an example using a Series of a few numbers. So s = pd.Series([1, 2, 3]). Let's add a new value, maybe a university course, so s.loc['History'] = 102. Let's look at s. We see that mixed types for data values or index labels are no problem for pandas. Since History is not in the original list of indices, s.loc['History'] essentially creates a new element in the Series, with the index name of History and the value of 102. Up until now, I've shown only examples of a Series where the index values were unique. I want to end this lecture by showing an example where index values are not unique. And this makes the pandas Series a little different conceptually than, for instance, a relational database. Let's create a Series with students and courses which they have taken.
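Before building that Series, here's a rough sketch of the broadcasting comparison and the .loc assignment described above. iteritems and set_value were removed from newer pandas releases, so this sketch assumes a recent version and uses items() with .loc instead. The names looped, broadcast, same_result, and t are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# 1,000 random integers from 0 to 999.
s = pd.Series(np.random.randint(0, 1000, 1000))

# Procedural approach: visit each label and write back the value plus two.
looped = s.copy()
for label, value in s.items():
    looped.loc[label] = value + 2

# Broadcasting: apply += to the whole Series in one step.
broadcast = s.copy()
broadcast += 2

# Both approaches should produce the same values.
same_result = bool((looped == broadcast).all())
print(same_result)

# .loc can also add a new entry when the label doesn't exist yet,
# even if the label's type differs from the existing index labels.
t = pd.Series([1, 2, 3])
t.loc["History"] = 102
print(t)
```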
So for our students and courses, student_classes = pd.Series, with Alice in Physics, and we'll put Jack in Chemistry, Molly in English, and Sam in History, and let's print out student_classes. Now let's create a Series just for a new student, Kelly, which lists all of the courses that she's taken. We'll set the index to Kelly, and the data to be the names of the courses. So we'll create a new Series, kelly_classes. We'll say she's taken Philosophy, Arts, and Math, and the index for this will just be Kelly, Kelly, and Kelly. And let's take a look at what that looks like. Okay, finally we can append all of the data in this new Series to the first using the .append function. So we'll create a variable, all_student_classes, and it's just equal to student_classes.append(kelly_classes). And this creates a Series which has our original people in it as well as all of Kelly's courses. So let's take a look at that. There are a couple of important considerations when using .append. First, pandas will take the Series and try to infer the best data types to use. In this example, everything is a string, so there are no problems here. Second, the append method doesn't actually change the underlying Series objects. It instead returns a new Series which is made up of the two appended together. And this is actually a common pattern in pandas: by default, returning a new object instead of modifying one in place. And it's one that you should come to expect. By printing the original Series, we can see that that Series hasn't changed. So here it is, and we'll take a look at it. Finally, we can see that when we query the appended Series for Kelly, we don't get a single value, but a Series itself. So if we take all_student_classes.loc['Kelly'], we actually get a Series itself. In this lecture, we focused on one of the primary data types of the pandas library, the Series. You learned how to query the Series with loc and iloc, and that the Series is an indexed data structure.
You also learned how to merge two Series objects together with append, and the importance of vectorization. There are many more methods associated with the Series object that we haven't talked about, but with these basics down, we'll move on to talking about the pandas two-dimensional data structure, the DataFrame. The DataFrame is very similar to the Series object, but includes multiple columns of data. It's the structure you'll spend the majority of your time working on when cleaning and aggregating data.
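As a recap of the append example from this lecture, here's a minimal sketch. Series.append was removed in pandas 2.0, so this sketch swaps in pd.concat, which likewise returns a new Series and leaves the originals alone. The name kelly_result is an assumption added for illustration.

```python
import pandas as pd

# The original students, one class each, so the index labels are unique.
student_classes = pd.Series(
    {"Alice": "Physics", "Jack": "Chemistry", "Molly": "English", "Sam": "History"}
)

# Kelly has taken three classes, so her label appears three times.
kelly_classes = pd.Series(
    ["Philosophy", "Arts", "Math"], index=["Kelly", "Kelly", "Kelly"]
)

# Combine the two into a new Series; neither input is modified.
all_student_classes = pd.concat([student_classes, kelly_classes])
print(all_student_classes)

# The original Series is unchanged.
print(student_classes)

# Querying a repeated label returns a Series rather than a single value.
kelly_result = all_student_classes.loc["Kelly"]
print(kelly_result)
```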