So, step number one is sorting the data.
And so, I've already done this for us in the next slides,
so you'll notice that we have the same number of at-bats,
it's just that I've taken them and ordered them from lowest to highest,
and what we're trying to do with quantiles is figure out,
again, where the cutoff is such that X percent of our data lies below it.
And so for example, I could say, "Okay,
I want to create five different quantiles."
And so, the user decides the number of quantiles.
So let's say, we want to have five quantiles.
What this means is I'm going to calculate the cutoff at 20 percent,
40 percent, 60 percent,
80 percent, and 100 percent.
And so, the first quantile really means
the cutoff point such that 20 percent of the samples are less than this value.
So, how many is 20 percent of the samples?
Well, I first need to figure out how many samples I have in my data set.
So all that really is, is a counting process: I just count
how many baseball players I put in my data set here.
So, I have one, two, three, four, five, six, seven, eight, nine, 10,
11, 12, 13, 14,
15, 16, 17, 18, 19, 20,
we have 20 baseball players in our data set.
All right. So, if I have 20 baseball players and I want five quantiles,
so 20 divided by five is four.
So since I've sorted the data,
and this is position number four,
then this is where my cutoff value,
my quantile is, so 514.
So I want to guarantee that four of my samples,
so four samples from my data set,
are going to have an at-bat count less than this number of at-bats.
So since it's 514, and since the position, four, came out as an integer,
I really want to take the average between sample four and sample five,
because at 514 itself, sample four is equal to the cutoff, not less than it.
So really, I can think of my 20th percentile here as being something like 515 or 516.
So at 515, I can guarantee that four samples are less than 515.
At 40 percent,
so that was my quantile one,
and now for my quantile two,
I have to take two times my number of samples,
so two times 20, divided by the number of quantiles I want.
So now, this is going to be eight.
So, I go to my eighth position, five, six,
seven, eight, this is my cutoff there.
So again, I could take the average between these two numbers,
and I could get something like 550.
So, I can guarantee that eight samples in my data set are less than 550.
And so to continue, for 60 percent,
it's the quantile number times the number of
samples divided by the number of quantiles you have.
And so, for the 60th percentile,
this is our third quantile out of five, so it's three times 20 samples divided by
the five quantiles we're going to have,
and so we wind up with 12.
So again, we go to our 12th number here,
and we could say something like
579 and a half because we take the average between those two.
And we can continue filling out the rest.
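
To make the rule concrete, here is a minimal Python sketch of the procedure just described: sort, find position k times n over q, and average that sample with the next one. The at-bat values are hypothetical, chosen only so the cutoffs match the numbers quoted here (515, 550, 579.5); they are not the real data from the slides. The function name `quantile_cutoffs`, the handling of the 100 percent cutoff, and the handling of non-integer positions are assumptions for illustration, not something the lecture specifies.

```python
import math

# Hypothetical at-bat counts (NOT the lecture's actual data), picked so the
# first three cutoffs match the values quoted in the lecture.
at_bats = [480, 495, 503, 514, 516, 530, 541, 548, 552, 560,
           571, 579, 580, 590, 601, 612, 625, 640, 655, 670]


def quantile_cutoffs(values, num_quantiles):
    """Cutoffs at position k * n / q in the sorted data, for k = 1..q."""
    data = sorted(values)                    # step one: sort the data
    n = len(data)
    cutoffs = []
    for k in range(1, num_quantiles + 1):
        pos = k * n / num_quantiles          # 1-based position in sorted data
        if k == num_quantiles:
            # Assumption: the 100 percent cutoff is simply the largest value.
            cutoffs.append(data[-1])
        elif pos == int(pos):
            # Whole-number position: average this sample with the next one.
            i = int(pos)
            cutoffs.append((data[i - 1] + data[i]) / 2)
        else:
            # Assumption: for a fractional position, round up to the next sample.
            cutoffs.append(data[math.ceil(pos) - 1])
    return cutoffs


cutoffs = quantile_cutoffs(at_bats, 5)
print(cutoffs)
```

With these made-up values, the first three cutoffs come out as 515, 550, and 579.5, the same numbers worked out by hand above.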
And again, I didn't have to take five quantiles,
I could have taken 10,
I could have taken 15, I could have taken 100.
All of those are just going to try to split this up into a more refined space.
The nice thing about this is that I'm sort of
guaranteeing that each range holds an equal share of the samples.
The problem with this is
that my data samples could actually all be identical.
So all these people could have taken the exact same number of at-bats, and then,
I would have, let's say 500, 500, 500,
500, 500, and so forth.
So, there's no "less than" if I split at my fourth sample.
If I had five quantiles and 20 samples again,
and I split here, I wouldn't be able to
calculate meaningful quantiles for a data set that has all identical numbers.
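
A small Python sketch of that failure case, assuming 20 identical samples of 500 at-bats and five quantiles as in the example. The variable names are made up for illustration.

```python
# Twenty players who all have exactly 500 at-bats (the degenerate case).
identical = [500] * 20
num_quantiles = 5

# Position of the first cutoff: 20 / 5 = 4.
pos = len(identical) // num_quantiles

# Average of sample four and sample five, as in the procedure above.
cutoff = (identical[pos - 1] + identical[pos]) / 2

# Count how many samples are strictly less than the cutoff.
below = sum(1 for v in identical if v < cutoff)

print(cutoff, below)
```

The cutoff is 500, but no sample is strictly below it, so the "four samples are less than the cutoff" guarantee cannot hold.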
So, just keep in mind that there are
some issues here. But this is exactly what
a quantile represents: the percentage chance
that a sample is going to be below this number.
So, we have a 20 percent chance that one of our samples is going to be below 515 at-bats.
We have a 40 percent chance that one of
our samples is going to be below 550, and so forth.
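
As a quick check of that interpretation, this Python sketch counts the fraction of samples below each cutoff. The at-bat values are hypothetical, chosen to be consistent with the cutoffs 515 and 550 quoted here, and `fraction_below` is an illustrative helper name.

```python
# Hypothetical at-bat counts (NOT the lecture's actual data).
at_bats = [480, 495, 503, 514, 516, 530, 541, 548, 552, 560,
           571, 579, 580, 590, 601, 612, 625, 640, 655, 670]


def fraction_below(values, cutoff):
    """Fraction of samples strictly less than the cutoff."""
    return sum(1 for v in values if v < cutoff) / len(values)


frac_515 = fraction_below(at_bats, 515)   # 4 of 20 samples
frac_550 = fraction_below(at_bats, 550)   # 8 of 20 samples
print(frac_515, frac_550)
```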
And quantiles are really useful because
they are less susceptible to long-tailed distributions and outliers.
Outliers don't really influence the quantiles.
If we think about our histogram,
if we have one book with a really huge number of pages,
I can wind up with one bin over here,
but then the rest of the data winds up smashed over into the lower end of the range.
Quantiles don't have this problem because
this one data point would just wind up in the upper quantile.
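
The contrast can be sketched in Python with a made-up list of page counts containing one very long book. The data, the choice of five bins, and the variable names are all assumptions for illustration.

```python
# Hypothetical page counts with one extreme outlier (5000 pages).
pages = [120, 150, 180, 200, 220, 250, 280, 300, 320, 5000]
num_bins = 5

# Equal-width histogram bins across the full range of the data.
lo, hi = min(pages), max(pages)
width = (hi - lo) / num_bins
hist_counts = [0] * num_bins
for p in pages:
    # Clamp so the maximum value lands in the last bin.
    idx = min(int((p - lo) / width), num_bins - 1)
    hist_counts[idx] += 1

# Quantile-style groups: sort, then split into equal-sized chunks.
data = sorted(pages)
per_group = len(data) // num_bins
quantile_groups = [data[i * per_group:(i + 1) * per_group]
                   for i in range(num_bins)]
quantile_counts = [len(g) for g in quantile_groups]

print(hist_counts)       # counts per equal-width bin
print(quantile_counts)   # counts per quantile group
```

In the histogram, nine of the ten books pile into the first bin and the outlier sits alone in the last one. In the quantile split, every group still holds two books, and the 5000-page book just lands in the top group.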
Often, quantiles are more descriptive statistics than means,
standard deviations, and so forth. So again, they help us
do more than just take the moments of the data. Quantiles