In the previous lecture, we used the NHL data to introduce how to perform regression analysis in Jupyter Notebook, and we focused on team-level performance analysis. In this lecture, we'll continue the discussion of regression analysis with a focus on player performance. In particular, we would like to examine the relationship between players' performance and their salaries using cricket data. In Week 1, we briefly introduced cricket data from the Indian Premier League and looked at the game-level statistics for the 2018 IPL games. This week, we will look at the player-level statistics.

Now please open the Jupyter Notebook on regression analysis with cricket data. In our data repository, there's a dataset called IPL18Player.csv. This dataset contains performance statistics as well as salary information for cricket players in the Indian Premier League in 2018. We'll investigate whether player performance impacts salary. First, let's import some useful libraries into Jupyter Notebook. We'll import pandas, numpy, matplotlib.pyplot, seaborn, as well as statsmodels.formula.api. We'll also import the player data. Let's name this data frame IPLPlayer.

Before we move on to analyzing this dataset, let's talk a little bit about the basic measurement of player performance in cricket. There are two main types of players in cricket, batsmen and bowlers. There are also players who are good at both batting and bowling, and they are usually called all-rounders. Additionally, there's a wicketkeeper on each team. For a batsman, similar to baseball, the more runs he scores, the better he performs. We should note that when we count the number of runs, we also have to consider how many balls the batsman faced and the number of times he was out. Therefore, there are two measures that are commonly used to describe the performance of a batsman. One is the batting average, which is the total number of runs divided by the number of times the batsman was out.
Batting average represents how many runs, on average, a batsman scores before getting out. If all of a batsman's innings were completed, in other words, if the batsman was out in every innings, the batting average would be the average number of runs he scores per innings. If the batsman did not complete all his innings, in other words, if he finished some innings without being out, then the batting average would be an estimate of the unknown average number of runs he scores per innings. If a batsman has scored runs but has never been dismissed, his batting average is technically infinite. Another measure of a batsman's performance is the batting strike rate, which is the average number of runs scored per 100 balls faced. This measures how quickly a batsman scores runs.

For bowlers, on the other hand, the fewer runs conceded, the better the performance. There are usually three major measurements of a bowler's performance: the bowling average, the economy rate, and the strike rate. Bowling average is the number of runs conceded per wicket taken. The lower the bowling average, the better the bowler is performing. Economy rate is defined as the number of runs conceded per over bowled. Similarly, the lower the economy rate, the better the bowler is performing. Bowling strike rate is the average number of balls bowled per wicket taken. This measures how quickly a bowler achieves the primary goal of bowling, which is to take wickets, or in other words, to get batsmen out. The lower the strike rate, the more effective a bowler is at taking wickets quickly.

Now let's analyze our data. We first explore the dataset and do some necessary data cleaning and preparation. We can first use the shape attribute to get a general idea of the size of our dataset. You can see that there are 149 observations in the dataset. Since each observation represents a player, this means that we have 149 players in this dataset, and there are 35 variables.
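The loading and sizing step just described can be sketched as follows. This is only an illustration, not the course notebook: to stay self-contained it reads a tiny made-up CSV from memory instead of IPL18Player.csv, and the column names (`player`, `innings`, `runs`, `salary`) and all values are assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for IPL18Player.csv; names and values are made up.
toy_csv = io.StringIO(
    "player,innings,runs,salary\n"
    "A,10,320,500000\n"
    "B,0,0,\n"
    "C,7,150,120000\n"
)

# In the notebook this would be something like pd.read_csv("IPL18Player.csv"),
# with the path adjusted to wherever the data repository lives.
IPLPlayer = pd.read_csv(toy_csv)

# shape is an attribute, not a method: (number of rows, number of columns).
n_players, n_variables = IPLPlayer.shape
print(n_players, n_variables)
```

With the real file, the same `shape` call is what reports the 149 players and 35 variables mentioned above.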
Let's also check if there are any missing values in our dataset using the info function. We can see that we only have salary information for 141 players, while we have all other information for all 149 players. Since salary is the main variable that we will be analyzing, we will drop the observations without information on salary, using the dropna function.

As we mentioned earlier, in cricket there are batsmen and bowlers, and there are players who can do both, the all-rounders. For a player who only bowls, his batting statistics would be zero, but he still gets paid for his ability to bowl. If we do not distinguish the different types of players, we will not be able to get meaningful estimates in our regression analysis. Let's create variables to indicate whether a player ever batted and whether a player ever bowled in any match. The variable innings in our dataset indicates how many innings a player batted in. We'll use this variable to create a batsman dummy variable. We will define batsman equal to one if innings is positive and batsman equal to zero if innings is zero. We will use the where function from the NumPy library to create this variable. The variable matches bowled in our dataset indicates the number of games a player has played as a bowler, and we will define a bowler variable in a similar fashion. We also use the describe function to calculate some basic summary statistics of these two newly created dummy variables. As we can see, more than 90 percent of the players batted in at least one match, but only about 60 percent of the players bowled during this season. The last type of player, not captured by either the batsman or the bowler variable, is the wicketkeeper. In our dataset, the variable matches keeper indicates the number of matches a player played as a wicketkeeper.

Now let's turn to some performance measures for the players. Ideally, we would like to create the five performance measures described above.
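For reference, the five measures defined earlier can be written compactly as follows. The quantity names are shorthand for this summary only and are not column names from the dataset.

```latex
\begin{align*}
\text{batting average} &= \frac{\text{runs scored}}{\text{times out}} \\
\text{batting strike rate} &= 100 \times \frac{\text{runs scored}}{\text{balls faced}} \\
\text{bowling average} &= \frac{\text{runs conceded}}{\text{wickets taken}} \\
\text{economy rate} &= \frac{\text{runs conceded}}{\text{overs bowled}} \\
\text{bowling strike rate} &= \frac{\text{balls bowled}}{\text{wickets taken}}
\end{align*}
```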
However, we don't have the number of overs bowled. Thus, we're not able to calculate the economy rate for bowlers. For the other four measures, notice that we may create some missing values or some infinite values. This is because some players were never out, some players faced zero balls, and some bowlers took zero wickets. In other words, when we create these new measures, we would be dividing by zero.

There are two ways we can deal with this issue. The first is to add one to the number of outs, the number of balls faced, and the number of wickets taken when calculating the above variables. The second is, instead of creating the above measures, to simply include the total number of runs, the total number of outs, and the number of balls faced to measure a batsman's performance, and to include the total number of runs conceded, the number of balls bowled, and the number of wickets taken to measure a bowler's performance.

Let's try the first approach and add one to the number of outs, balls faced, and wickets taken to calculate the above four measures. In our dataset, there is no variable that indicates the number of outs. But we can calculate this variable by subtracting the number of not outs from the number of innings, and we only need to create this variable for batsmen. Again, we'll use the where function from the NumPy library to create this variable. Now we can create the four measures by adding one to the denominators to make sure that we're not dividing by zero. We'll also take a look at the summary statistics of these four newly created variables. Our slightly modified batting average has a mean of 15.09 and a maximum of 65. The batting strike rate averages 0.01 with a maximum of 0.025. The bowling average has a mean of 17.49 and a maximum of 72, while the bowling strike rate averages 11.48 with a maximum of 42.
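The preparation steps described in this lecture, from dropping players without salary to building the four adjusted measures, can be sketched roughly as below. This is not the course notebook: the data frame is a four-row toy example, and every column name and value in it is hypothetical. The batting strike rate here is scaled per 100 balls, following the definition given earlier; the scaling used in the actual notebook may differ.

```python
import numpy as np
import pandas as pd

# Toy data; every column name and value here is hypothetical.
IPLPlayer = pd.DataFrame({
    "salary":         [500.0, 120.0, np.nan, 80.0],
    "innings":        [10, 0, 5, 3],
    "not_outs":       [2, 0, 1, 3],
    "runs":           [320, 0, 90, 40],
    "balls_faced":    [250, 0, 100, 30],
    "matches_bowled": [0, 8, 4, 0],
    "runs_conceded":  [0, 210, 95, 0],
    "balls_bowled":   [0, 180, 72, 0],
    "wickets":        [0, 9, 0, 0],
})

# Drop players with no salary information, since salary is the outcome variable.
IPLPlayer = IPLPlayer.dropna(subset=["salary"]).reset_index(drop=True)

# Dummy variables: 1 if the player ever batted / ever bowled, 0 otherwise.
IPLPlayer["batsman"] = np.where(IPLPlayer["innings"] > 0, 1, 0)
IPLPlayer["bowler"] = np.where(IPLPlayer["matches_bowled"] > 0, 1, 0)

# Number of outs = innings minus not-outs, computed only for batsmen.
IPLPlayer["outs"] = np.where(
    IPLPlayer["batsman"] == 1,
    IPLPlayer["innings"] - IPLPlayer["not_outs"],
    0,
)

# Adding 1 to each denominator avoids division by zero for players who were
# never out, faced no balls, or took no wickets.
IPLPlayer["batting_average"] = IPLPlayer["runs"] / (IPLPlayer["outs"] + 1)
IPLPlayer["batting_strike"] = 100 * IPLPlayer["runs"] / (IPLPlayer["balls_faced"] + 1)
IPLPlayer["bowling_average"] = IPLPlayer["runs_conceded"] / (IPLPlayer["wickets"] + 1)
IPLPlayer["bowling_strike"] = IPLPlayer["balls_bowled"] / (IPLPlayer["wickets"] + 1)

measures = ["batting_average", "batting_strike", "bowling_average", "bowling_strike"]
print(IPLPlayer[measures].describe())
```

The last toy player was never dismissed, so the unadjusted batting average would divide by zero. With the adjustment it is simply runs divided by one. The cost of this approach is that every player's measure is shifted slightly downward relative to the textbook definition.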