Here are a few additional examples illustrating some of the points we discussed regarding the normal distribution in lecture set three. First, let's look at some data from the 2015 Youth Risk Behavior Survey. This survey contains self-reported weight and height values for a large sample of US residents 12 to 18 years of age. Since we have weight and height, we can use these to compute body mass index (BMI) for each of these individuals. For this exercise, we're going to focus on the data for the 1,860 16-year-olds in the sample. The mean BMI for this group is 23.6 kilograms per meter squared, with a standard deviation of 4.9 kilograms per meter squared. I looked at a histogram, and I could show it to you, but in the interest of time I did not put it in these slides. These BMI values are approximately normally distributed in the sample: roughly symmetric and bell-shaped.
In youth less than 18 years old, there's no single BMI cutoff for obesity like there is in adults. The standard approach is to use the 95th percentile of the BMI values for children of a given age as the cutoff. This ensures that, by definition, five percent of any age group under 18 is classified as obese and the remaining 95 percent is not.
So, based on the information given on the previous slide, the mean and standard deviation of BMI for the sample of 16-year-olds, let's estimate the cutoff for obesity in the population of 16-year-old US residents based on these 1,800-plus observations in the sample.
In order to do this using only the mean and standard deviation given, we need to know the value on a standard normal curve, in other words the number of standard deviations, that corresponds to the 95th percentile. How many standard deviations above the mean would this be on a normal curve? We could go to an old-school table in the back of a statistics textbook, or find a PDF copy of one by googling, but let's use R again to help us with this. R has another function that lets us use it as a calculator, or as a standard normal table. If we give it the percentile of interest, it will return the number of standard deviations we would need to add to or subtract from the mean to get that percentile. In our example, the percentile of interest is the 95th percentile. We put this in decimal form, and if I type qnorm and then in parentheses 0.95, this tells me that I would need to go 1.644854 standard deviations above the mean to get the 95th percentile in a normal distribution. I'm going to round that to 1.64. Based on our data, I'm going to take our observed mean of 23.6 kilograms per meter squared and add 1.64 times the standard deviation, which is 4.9 kilograms per meter squared. Rounding this to one decimal, it comes out to approximately 31.6. So this would be the cutoff for obesity in 16-year-olds. If a 16-year-old presented with a BMI greater than 31.6, they would be classified as obese; if they presented with a BMI less than or equal to this, they would not be classified as obese.
Now let's use the cutoffs for adult BMIs, even though this isn't necessarily protocol, just for the exercise. Let's apply these adult cutoffs to the sample of 16-year-olds and see what we get. In adults, BMIs between 18.5 and 24.9 are considered indicative of a healthy weight.
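For readers who want to reproduce the 95th-percentile cutoff worked out earlier in this passage, here is a minimal sketch in Python rather than R. It assumes that `statistics.NormalDist().inv_cdf` plays the role of R's `qnorm`; the variable names are illustrative and not from the lecture.

```python
from statistics import NormalDist

# Summary statistics for the 16-year-olds, as quoted in the lecture
mean_bmi = 23.6  # kg/m^2
sd_bmi = 4.9     # kg/m^2

# Number of standard deviations above the mean at the 95th percentile
# of a standard normal curve (the analogue of qnorm(0.95) in R)
z_95 = NormalDist().inv_cdf(0.95)

# Cutoff using the z value rounded to two decimals, as in the lecture
cutoff_rounded_z = mean_bmi + round(z_95, 2) * sd_bmi

# Cutoff using the unrounded z value, for comparison
cutoff_exact_z = mean_bmi + z_95 * sd_bmi

print(f"z at the 95th percentile: {z_95:.6f}")
print(f"cutoff with z rounded to 1.64: {cutoff_rounded_z:.1f}")
print(f"cutoff with unrounded z: {cutoff_exact_z:.2f}")
```

Rounding the multiplier to 1.64 before multiplying is what gives 31.6. With the unrounded multiplier the cutoff is about 31.66, which would round to 31.7, so the two approaches can differ in the first decimal.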
So let's estimate the percentage of 16-year-olds who have BMIs in the range of 18.5 to 24.9. Let's think about what we're doing pictorially. We're assuming a normal curve, and in the sample it would be centered at the sample mean of 23.6. If we smoothed a curve over the histogram of our values, we'd expect it to be roughly bell-shaped and centered at the mean. Sorry for the not-quite-drawn-to-scale picture. What we want to know is the proportion of BMIs in the sample that are between 18.4 on the low end and 24.9 on the upper end. We want to capture the percentage of observations between these two values.
One way to get at this is to compute the Z-scores for these boundary values and then use the pnorm function to help us figure out what's going on. The Z-score for the lower boundary, the number of standard deviations that 18.4 is below the mean of 23.6, is found by taking that value, 18.4, subtracting the mean of 23.6, and dividing by the standard deviation, 4.9 kilograms per meter squared. We get something that comes in a little more than one standard deviation below the mean: the Z-score is negative 1.06. Similarly, on the upper end, our value of 24.9 kilograms per meter squared, the upper cutoff, converts to a Z-score of 0.27, that is, 0.27 standard deviations above the mean.
So I'm going to go to R now and compute the areas associated with these two values. Remember, pnorm gives us the area, or the percentage of values, less than a given number of standard deviations above or below the mean. So, coming to pnorm: now we have a standard normal curve with mean zero, and we give it the Z-score of negative 1.06. What R tells us, and I'll draw this up here on the curve, is roughly 14.89 percent, and I'll round it for ease of computation to 0.15, or 15 percent.
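The whole calculation being set up here (convert both boundaries to Z-scores, look up the area below each, subtract) can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecturer's R session: `NormalDist().cdf` stands in for R's `pnorm`, and `proportion_between` is a hypothetical helper name. The spoken working uses 18.4 for the lower boundary even though the adult range is stated as 18.5 to 24.9, so the boundary is a parameter and the call below follows the spoken working.

```python
from statistics import NormalDist

def proportion_between(low, high, mean, sd):
    """Estimate the share of a normal population between low and high.

    Hypothetical helper for illustration; returns both z-scores and
    the estimated proportion.
    """
    # Convert each boundary to a number of standard deviations from the mean
    z_low = (low - mean) / sd
    z_high = (high - mean) / sd

    # Area under the standard normal curve below each z-score
    # (the analogue of pnorm(z) in R)
    std_normal = NormalDist()
    area_below_low = std_normal.cdf(z_low)
    area_below_high = std_normal.cdf(z_high)

    # The share between the boundaries is the difference of the two areas
    return z_low, z_high, area_below_high - area_below_low

# Lower boundary 18.4 as in the spoken working; mean and SD from the lecture
z_low, z_high, share = proportion_between(18.4, 24.9, mean=23.6, sd=4.9)
print(f"z-scores: {z_low:.2f} and {z_high:.2f}")
print(f"estimated share between the boundaries: {share:.1%}")
```

Because the areas here are not rounded before subtracting, the result can differ by a percentage point or so from a hand calculation with rounded inputs. Passing 18.5 instead of 18.4 changes the estimate only slightly.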
Fifteen percent of the observations fall below the point 1.06 standard deviations below the mean. Similarly, for the other piece, with a Z-score of 0.27, R tells us that the percentage of observations less than 0.27 standard deviations above the mean is on the order of 60.4 percent, and again, just for ease of computation, I'll round that to 60 percent. So, if we want the percentage between these two values, we take the 60 percent below 0.27 and subtract the 15 percent below negative 1.06 standard deviations from the mean: 60 minus 15 is 45 percent, approximately, because I rounded the inputs. So we estimate that approximately 45 percent of 16-year-olds in this US population have BMIs between 18.4 and 24.9. We did that by converting those values to their respective Z-scores and using the properties of the normal curve.
Next, let's look at data from a random sample from a citywide population of HIV-positive subjects, 1,000 HIV-positive individuals, and let's split this into the group that did not respond to treatment and the group that did. I should have put the n's here, but the groups are roughly the same size; I think it's 498 versus 502. Here are some summary statistics and visual displays of the data for the two groups. In the group that did not respond to treatment, the mean CD4 count at the start of treatment was 292 cells per millimeter cubed, versus a mean of 234 in the group that did respond. Here are the other numerical summary statistics, and you can see the histograms as well. If we want to quantify the difference in these two CD4 count distributions, we can use the mean difference.
I've superimposed a bold vertical line on each histogram at its respective mean, just to reiterate that the difference in means numerically captures the shift to the right in the distribution for those who did not respond compared to those who did. If we compare the non-responders to the responders, in that direction, the difference is 292 minus 234, or 58 cells per millimeter cubed. If we go in the other direction and compare the responders to the non-responders, it's just the opposite, negative 58 cells per millimeter cubed. This is just to remind you that the direction of comparison is arbitrary, and we'll get the same overall result and interpretation, but it's important to specify and acknowledge the direction in order to interpret the difference correctly.
Now, assuming, incorrectly, that the CD4 counts are normally distributed in both groups, let's figure out the range of values that covers the middle 95 percent of the data for the non-responders. I'll refresh your memory so you don't have to jump back in the video: the mean for the non-responders was 292 cells per millimeter cubed and the standard deviation was 194 cells per millimeter cubed. If we do our standard calculation, taking the sample mean and subtracting two sample standard deviations to get an estimate of the 2.5th percentile, we get a negative value, negative 96 cells per millimeter cubed, and that's unrealistic because CD4 counts cannot be negative. They may be close to zero in very sick patients, but they can't be negative. On the upper end, we get 680, which is a realistic value. But let's see how this compares to the observed 2.5th and 97.5th percentiles, using the computer to get them: that was actually a range from 14 cells, so not much above zero but certainly not negative, up to 723 cells. On the upper end, that's a larger value than we estimated by incorrectly assuming normality and using the mean plus two standard deviations.
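The comparison just described can be laid out as a short Python sketch. The mean, standard deviation, and observed percentiles are the values quoted in the lecture; the raw CD4 data are not available here, so the observed percentiles are entered as given rather than computed, and the variable names are illustrative.

```python
# Summary statistics for the non-responders, as quoted in the lecture
mean_cd4 = 292  # cells per millimeter cubed
sd_cd4 = 194    # cells per millimeter cubed

# Normal-theory estimate of the middle 95 percent: mean plus or minus 2 SDs
normal_lower = mean_cd4 - 2 * sd_cd4
normal_upper = mean_cd4 + 2 * sd_cd4

# Observed 2.5th and 97.5th percentiles reported in the lecture
observed_lower, observed_upper = 14, 723

print(f"normal-theory interval: {normal_lower} to {normal_upper}")
print(f"observed percentiles:   {observed_lower} to {observed_upper}")

# A negative lower bound is impossible for a count, which signals that
# the normality assumption does not fit these data
if normal_lower < 0:
    print("lower bound is negative: the normal approximation is a poor fit")
```

A lower bound below zero for a quantity that cannot be negative is a quick check that the data are skewed, and that the observed percentiles are the better summary.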
I hope these have been helpful. I look forward to moving on in the course to the next lecture set, and I'll see you in lecture four.