In the following lesson, we introduce the notion of centering variables, in particular mean centering of variables in a regression model. We discuss what mean centering is and how it changes the interpretations in our regression model. We demonstrate using data on the heights and weights of some Olympic athletes.

Let us use an example to introduce mean centering of explanatory variables. The file Height and Weight.xlsx was used in course one of this specialization on business statistics and analysis. We had used the data in this file to illustrate the construction and interpretation of scatter plots in Excel. This file contains data on some athletes from some past Olympic Games. Specifically, it has data on the height, weight and gender of the athletes. Columns A and B are the name of the athlete. Column C is the person's gender, either a man or a woman. Column D is his or her height in centimeters. Column E is the weight of the Olympian in kilograms. And column F is the country that the person represents.

In course one of the specialization, we had used these data to construct a scatter plot of the weights and heights of these athletes, and found a positive correlation between the two. Let us explore this further and develop a relationship between the weight of these athletes and their heights. Specifically, we will estimate a regression model as shown: Weight is equal to beta 0 + beta 1 times Male + beta 2 times Height. This equation relates the weight of an athlete to the person's gender and his or her height. The variable Male is a dummy variable representing the categorical variable gender. It takes a value 1 for a male athlete and a 0 for a female athlete.

Before we estimate this regression, a word of caution. It is tempting to interpret regression results as causation; this may or may not be true. We had discussed causation versus correlation in some detail in course two of the specialization. Regression is essentially a correlation technique.
It establishes a correlation between the y variable and a set of x variables. To interpret that relation as causation would require some additional requirements for causation to be met, which we will not go into. For example, in this regression, it would be wrong to interpret the results as saying Height causes Weight. Height may be an important factor causing Weight, but there could be many other factors causing Weight, and to establish the strength of causality would require a careful analysis.

Let us estimate the regression and introduce the notion of mean centering of variables and what impact it has on our interpretation. Let's estimate the regression with Weight as my dependent variable and Gender and Height as my independent variables. The variable Gender is a categorical variable, so I'll need a dummy variable to represent gender. Gender has two categories, so I'll need a single dummy variable. So I'll insert a dummy variable, and I'll call it Male. And I'll code it as: is equal to IF, open parenthesis, whatever is in the gender column, if that is equal to quote unquote m, which stands for a male, then I'll put a 1, else I will put a 0, close parenthesis. So that's my dummy variable coding for Male, and I can copy and paste it all the way down. So that codes up my dummy variable.

So let's run the regression model. I go to Data, Data Analysis, Regression. My Y variable is the weight, and I make sure the Labels box is checked. My X variables are the dummy variable Male and the height of the Olympian; with Shift+Ctrl+Down arrow, I select both the columns. I select an empty cell in my spreadsheet to put my output in. So that's my estimated regression model.

Let us interpret these coefficients. The coefficient on height is 0.97, if I round it off to two decimal places, and the p-value tells me that this is a significant estimate. The weight of the person increases by 0.97 kilograms, all other variables remaining the same.
That is, whether a person is male or female, for every one centimetre increase in height the person's weight increases by 0.97 kilograms.

Now the interpretation of the coefficient on Male. Male Olympians, as compared to female Olympians, on average have a weight which is 5.53 kilograms more, all other variables remaining at the same level. That is, if you consider two Olympians of the same height, one male and one female, you'll expect the weight of the male Olympian to be 5.53 kilograms more.

Finally, let's interpret the intercept. In all cases, the intercept is interpreted as the value of my y variable when all x variables are 0. So in this case it implies the weight of an Olympian. What kind of an Olympian? A person whose height is zero and who is a female. Why? Because when the dummy variable Male is equal to zero, it implies it's a female. So the intercept is the weight of a female Olympian with zero height. It clearly does not make sense, because talking about heights being 0 does not have a managerial interpretation. So we'll tell ourselves the intercept does not have a managerial interpretation. It is there simply to fit the model to the data.

If I wish to have a situation where the intercept has a meaningful interpretation in this case, I'll have to make sure that all the x variables being zero is meaningful in some way. Male being zero is meaningful because that implies a female Olympian; however, height being zero is not meaningful. But I can make that meaningful by mean centering the height.

So let's see how I mean center this variable height. I insert a column next to height, and I'll create another height column. And just so that this is distinguished from my original height column, I'll enclose its name in square brackets. This is only for the purposes of separating it out from my original height values.
Now in this column I'll create a mean-centered height variable. How do I create a mean-centered height variable? I'll take each individual value of height, and from that height I'll subtract the average of the entire height column. Is equal to: I'll pick up the height from the height column, minus, I'll subtract the average of the heights across all the athletes observed in the data, which is AVERAGE, open parenthesis, I select the entire height column with Shift+Control+Down arrow, close parenthesis. So that gives me my mean-centered height: the particular value of height minus the average of the entire height column. Enter.

And I can copy and paste this formula all the way down. However, before I copy and paste the formula down, I need to put in dollar signs appropriately in this formula. So I'll put a dollar sign in front of the number 2 there, so that I'm fixing the second row and relative referencing does not shift it when the formula is copied and pasted. I also fix the last row, so that every time this formula calculates the average, it always refers to F2 through F1580. So that's how I fix those particular rows. Enter. And now I can copy and paste this formula all the way down. So this is my mean-centered height.

Now let me run my regression with weight as the dependent variable. And as independent variables, I'll use Male and the mean-centered height. So I go to Data, Data Analysis, Regression. My Y variable is weight. The X variables now are the dummy variable Male and the mean-centered height. And I put the output beneath my earlier output, and click OK. So this is my estimated regression model. Notice that in this estimated regression model, the coefficients on the dummy variable Male and on height are exactly the same as we got in the earlier regression model. However, the intercept, the beta zero coefficient, has changed.
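A short derivation suggests why the two slope coefficients stay the same and only the intercept moves. It is a sketch that assumes we write the average height as \bar{H}, a notation not used in the lecture, and then add and subtract beta 2 times that average in the original equation:

```latex
\begin{align*}
\text{Weight} &= \beta_0 + \beta_1\,\text{Male} + \beta_2\,\text{Height} \\
              &= \beta_0 + \beta_1\,\text{Male} + \beta_2\,(\text{Height} - \bar{H}) + \beta_2\,\bar{H} \\
              &= (\beta_0 + \beta_2\,\bar{H}) + \beta_1\,\text{Male} + \beta_2\,(\text{Height} - \bar{H})
\end{align*}
```

So the two regressions describe the same fitted relationship. The coefficients on Male and on height are untouched, and the intercept of the mean-centered model equals beta 0 plus beta 2 times the average height.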
The new intercept is 69.63, and the interpretation of this intercept is that it is the value of my Y variable, which is the weight, when all my X variables are zero. In this case, all my X variables being zero implies that the dummy variable Male is equal to 0, implying that it's a female Olympian, and that the mean-centered height variable is zero. Now when is the mean-centered height variable equal to zero? It is equal to zero when the height of the person is equal to the average height observed in the entire data. So the intercept does have a managerial interpretation, and that interpretation is the weight of a female Olympian whose height is equal to the average height observed in the entire data. The weight of such a female Olympian is equal to 69.63 kilograms, if I round it off to two decimal places. And what is the weight of a male Olympian whose height is equal to the average height observed in the entire data? That would be the intercept plus the coefficient on the dummy variable Male, that is, 69.63 + 5.53, or about 75.16 kilograms.

As we saw, mean centering simply means subtracting the mean from every value of a variable. What it does is redefine the zero point for that variable. We could also center the variable around some value other than the mean, as long as that value has a meaningful interpretation. When should you mean center your X variables? There is no one answer here. Centering primarily aids the interpretation of coefficients, in our example the intercept. So it really depends on the kind of interpretations you're looking for.
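For readers who would like to try the same steps outside Excel, here is a minimal Python sketch of the workflow described in this lesson. It rests on several assumptions: it uses NumPy rather than Excel's Data Analysis tool, the data are randomly generated for illustration and are not the Olympic athletes file, and the helper `fit_ols` and all variable names are hypothetical. Because the data are made up, the estimates will not match the 0.97, 5.53 or 69.63 reported above.

```python
import numpy as np


def fit_ols(y, *columns):
    """Least-squares fit of y on an intercept plus the given columns.

    Returns an array: [intercept, coefficient of each column in order].
    """
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Made-up illustrative data (an assumption, NOT the lecture's data set).
rng = np.random.default_rng(0)
n = 200
gender = rng.choice(["M", "F"], size=n)
height = rng.normal(175.0, 10.0, size=n)  # centimetres
noise = rng.normal(0.0, 5.0, size=n)

# Dummy variable: 1 for a male athlete, 0 for a female athlete.
male = (gender == "M").astype(float)

# Simulated weights in kilograms; the true coefficients here are arbitrary.
weight = -100.0 + 5.0 * male + 1.0 * height + noise

# Mean centering: subtract the column average from every height.
height_centered = height - height.mean()

raw = fit_ols(weight, male, height)                 # uses original height
centered = fit_ols(weight, male, height_centered)   # uses centered height

print("original model :", raw)
print("centered model :", centered)
print("raw intercept + height slope * mean height:",
      raw[0] + raw[2] * height.mean())
```

Running it should show the same coefficients on the dummy variable and on height in both fits, while the intercept of the centered fit matches the last printed quantity, which is the predicted weight of a female athlete of average height in the simulated data.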