Hello. This lesson introduces distributions, both

empirical and theoretical, which provide concise representations of a data set.

Often you will be given a data set and you may do

some basic statistical analysis with descriptive statistics.

You may visualize the data set and use, for instance,

box plots or scatter plots or

even histograms to try to understand how data are distributed.

But, usually, you're going to also want to look

at theoretical distributions to understand,

is there some sort of physical basis for what I see in this data?

We can use these theoretical distributions to

gain insight into modeling and interpreting a data set

based on what we know about different distributions

such as a Poisson distribution or a normal distribution.

In this lesson, you're going to look at a visual website from Seeing Theory that

explores random variables and both continuous and discrete distributions,

as well as at the Introduction to Distributions notebook.

So first, the distributions site.

It starts by talking about a random variable and

goes through how to play with random variables.

So for instance, you can enter values and

submit them, and you can select cells to generate

different types of random variables

and see how the probability space can generate a distribution.

You also can play with continuous and discrete variables.

So for instance, at discrete,

you can look at Bernoulli, binomial, etc.

Or if we click continuous,

you can look at uniform, normal, exponential, etc.

Lastly, there's the central limit theorem.

This is a very important concept that says that even

if we have a distribution that is not normally distributed,

if we average together enough independent samples,

the averages will approximately follow a normal distribution.

And that's what this particular part of the website shows.

So I encourage you to play with these different ideas on this website, where

you can build some deeper physical intuition by seeing them visually demonstrated.

Now, these same concepts are also

demonstrated in the Introduction to Distributions notebook.

We look at theoretical distributions that can be discrete or continuous.

First, we'll look at a uniform distribution where we

have probability uniformly spread between

two endpoints and we see how to do this in both a discrete and a continuous case.

The code here actually demonstrates these by making plots.

So on the left, we have

a discrete uniform distribution and on the right, we have a continuous one.

Notice that because it's continuous,

we actually bin the data and that's shown by this soft gray line.

We also show our continuous and frozen distributions.

Now, one thing to keep in mind

is that, to make this plot, we're using the SciPy library,

scipy.stats, to get these functions.

The way it works in SciPy is we create the distribution and we

can effectively have a frozen distribution where we specify the parameters.

So, here we say we want

a discrete distribution of uniformly distributed integers,

where the probability is uniform between low and high.

Thus whenever we call this,

this is now a function that is predefined with these parameters.

So that saves us time and it makes it easier to compute things.

So for instance we have UDRV here,

we can actually compute things from it as we go through this particular code cell.

And that's what we do here:

we pass this frozen distribution into some functions we defined, and we

sample from that distribution inside these other functions to create these plots.
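
As a rough sketch of the frozen-distribution pattern just described, and not the notebook's actual code, the following assumes `scipy.stats.randint` for the discrete case and `scipy.stats.uniform` for the continuous case. The endpoint values and the names `low`, `high`, `udrv`, and `ucrv` are illustrative assumptions.

```python
from scipy import stats

# Hypothetical endpoints, chosen only for illustration.
low, high = 0, 10

# Frozen discrete uniform over the integers low, ..., high - 1
# (in scipy.stats.randint the upper endpoint is exclusive).
udrv = stats.randint(low, high)

# Frozen continuous uniform on [loc, loc + scale], here [low, high].
ucrv = stats.uniform(loc=low, scale=high - low)

# The parameters are now baked in, so later calls do not repeat them.
p_mass = udrv.pmf(3)        # probability mass at one integer: 1 / (high - low)
p_density = ucrv.pdf(3.5)   # probability density inside the interval
print(p_mass, p_density)
print(udrv.mean(), ucrv.mean())

# Sampling from the frozen distribution; random_state makes the draw repeatable.
samples = udrv.rvs(size=5, random_state=0)
print(samples)
```

Because the frozen object carries its own parameters, it can be handed to any helper function that only needs to call methods such as `pmf`, `pdf`, or `rvs`.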

The rest of the notebook looks at some other distributions, like the Poisson.

We briefly mention other discrete distributions

before moving on to the Gaussian distribution.

Here we see different versions of the Gaussian.

We also talk about some other continuous distributions.

I want to just focus on the plots themselves so you

can see these other distributions demonstrated here.

All in all, I believe we look at eight different distributions.

Here are the plots themselves. So there's the Power Law.

Note that this is a logarithmic scaling.

These are very interesting distributions.

A lot of times you see things in queuing theory that may follow a Power Law.

So the number of calls that come into a call center or the time it takes for shipping,

things like that may follow a Power Law.

A related one is the exponential distribution.

You may have heard of the Pareto distribution. The Cauchy is a very interesting one:

it looks like a normal distribution, but there's a lot more probability out in

the tails, so it is broader than a normal distribution.
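
To put a number on how much heavier the Cauchy tails are, here is a small sketch that compares the probability of landing more than five units above the center for a standard normal and a standard Cauchy. The cutoff of 5 and the variable names are arbitrary choices for illustration.

```python
from scipy import stats

# Standard (loc=0, scale=1) versions of both distributions.
normal = stats.norm(loc=0, scale=1)
cauchy = stats.cauchy(loc=0, scale=1)

# sf(x) is the survival function, P(X > x): the probability in the upper tail.
tail_normal = normal.sf(5)
tail_cauchy = cauchy.sf(5)

print(f"normal tail beyond 5: {tail_normal:.2e}")
print(f"Cauchy tail beyond 5: {tail_cauchy:.2e}")
print(f"ratio: {tail_cauchy / tail_normal:.1e}")
```

The Cauchy tail probability comes out at roughly six percent, while the normal one is well under one in a million, which is why the two curves can look alike near the peak and still behave very differently far from it.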

Next, we look at random sampling: how to actually,

given a distribution, draw samples from it.

So here's data we've drawn from a model, which is shown in red.

And in blue we see the actual data that we've drawn.

We can do this with other distributions as well.

So here's an exponential distribution shown in red and in blue,

soft blue, is our actual data.

Now this is kind of hard to see because it's so

strongly peaked, so we can change the axis to be

logarithmic, and then you can see that it's just a straight line.
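
The straight line on a logarithmic axis can be checked numerically without any plotting. This sketch draws samples from an exponential model, bins them, and fits a line to the logarithm of both the model density and the binned data. The scale of 2, the sample size, and the binning are assumptions made for the example, not values from the notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical exponential model with scale (mean) 2.
scale = 2.0
model = stats.expon(scale=scale)

# Draw samples from the model; random_state makes the draw repeatable.
data = model.rvs(size=10_000, random_state=42)

# Bin the samples into a normalized histogram.
heights, edges = np.histogram(data, bins=20, range=(0, 10), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# log(pdf) = -log(scale) - x / scale, so on a log axis the model is a line
# with slope -1 / scale.
slope_model = np.polyfit(centers, model.logpdf(centers), 1)[0]

# The binned data should follow nearly the same line, up to sampling noise.
slope_data = np.polyfit(centers, np.log(heights), 1)[0]

print(slope_model, slope_data, -1 / scale)
```

Both fitted slopes should sit close to `-1 / scale`, which is the numerical counterpart of the straight line seen on the log-scaled plot.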

We also look at some alternative distribution forms, including the CDF,

the percent point function, and the survival function.

The SciPy module provides methods to calculate these very easily,

so here we show a Gaussian PDF.

And for that same function we show the CDF,

the percent point function,

the survival function, and

the inverse survival function and talk a little bit about why these are important.

Fundamentally, the CDF gives the probability that our variable is at or below a given value, so we can read off a probability and ask,

what is the value of our variable at which the cumulative probability reaches that level?

So we're going to be able to say,

what's the value below which a given fraction of the probability lies?

So this tells us quickly,

there's your median, right?

And there is your seventy-fifth percentile.
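
Reading the median and the seventy-fifth percentile off the CDF corresponds to calling the percent point function, which is the inverse of the CDF. A minimal sketch, assuming a standard Gaussian (the notebook's parameters may differ):

```python
from scipy import stats

# Standard Gaussian, chosen only for illustration.
gauss = stats.norm(loc=0, scale=1)

# ppf(q) returns the value below which a fraction q of the probability lies.
median = gauss.ppf(0.50)
q75 = gauss.ppf(0.75)

# cdf is the inverse operation: it maps a value back to its probability.
p_back = gauss.cdf(q75)

print(median, q75, p_back)
```

For this symmetric distribution the median coincides with the mean, and feeding the seventy-fifth percentile back into `cdf` returns 0.75.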

The other functions behave similarly.

The survival function, for instance, tells you

the probability that you've lasted at least this long.

So if you think about this in terms of time,

you can read off, for any time T,

the probability of surviving beyond T, for example the time past which only 50 percent survive.

This is important for manufacturing, for instance,

where you want to understand how long is

a given piece of equipment going to survive if it's operating at a certain rate.
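
As a hedged illustration of the equipment example, suppose lifetimes follow an exponential distribution with a mean of 1000 hours. That model and every number here are assumptions for the sketch, not something taken from the notebook.

```python
from scipy import stats

# Hypothetical lifetime model: exponential with a mean of 1000 hours.
lifetime = stats.expon(scale=1000.0)

# Survival function: P(lifetime > t), which equals 1 - cdf(t).
t = 500.0
p_survive = lifetime.sf(t)

# Inverse survival function: the time beyond which only half the units last.
t_half = lifetime.isf(0.5)

print(f"P(survive beyond {t:.0f} h) = {p_survive:.3f}")
print(f"time at which survival drops to 50%: {t_half:.1f} h")
```

Here `sf` answers "what fraction is still running at time t?", while `isf` answers the reverse question, "by what time has survival fallen to a given fraction?".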

Other things that we're going to look at include

the central limit theorem, which you saw on the visual website.

The notebook demonstrates the central limit theorem by sampling coin flips.

We take 10 coins; if we only flip them once, the distribution looks very irregular.

Here, we're basically averaging over all the flips we do.

With 10 coins and 10 flips,

it starts to look a little more like a Gaussian, and more so with 100 flips.

And by the time we've done a thousand flips of 10 coins,

it looks very close to a Gaussian.

And this is the central limit theorem in action.
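
The coin-flip experiment can be approximated in a few lines. This is a simplified sketch rather than the notebook's code: instead of 10 coins, it repeats each averaging experiment many times (the count `n_experiments` is an assumption) so that the spread of the averages can be compared with the central limit theorem's prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experiments = 10_000  # hypothetical number of repeated experiments

results = {}
for n_flips in (1, 10, 100, 1000):
    # Number of heads in n_flips fair flips, for every experiment at once.
    heads = rng.binomial(n_flips, 0.5, size=n_experiments)
    # Average outcome per experiment (fraction of heads).
    means = heads / n_flips
    results[n_flips] = (means.mean(), means.std())
    # CLT prediction: averages center on 0.5 with spread 0.5 / sqrt(n_flips).
    predicted_std = 0.5 / np.sqrt(n_flips)
    print(n_flips, means.mean(), means.std(), predicted_std)
```

With a single flip the averages can only be 0 or 1, but as the number of flips grows they pile up around 0.5 and the measured spread tracks the predicted one.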

We also talk about QQ plots and fitting distributions.

The idea is, can we actually take a data set and infer

whether a given distribution would be a good approximation to it?

And then we can say,

let's fit a model to our data,

a theoretical model, and see how well it matches.

So here we generate data;

the model is this dot-dash line.

The fit is this red line,

and you can see they're very close.

If you just looked at this data, you might have wondered

what kind of distribution it is,

and yet we've recovered the underlying distribution very accurately.
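
A minimal sketch of the fit-and-check idea, assuming the data come from a Gaussian whose parameters we pretend not to know. The true values, the sample size, and the variable names are illustrative assumptions. `scipy.stats.norm.fit` gives maximum likelihood estimates, and `scipy.stats.probplot` computes the points of a QQ plot without drawing them.

```python
from scipy import stats

# Hypothetical "true" model used only to generate example data.
true_model = stats.norm(loc=5.0, scale=2.0)
data = true_model.rvs(size=2000, random_state=1)

# Fit a Gaussian to the data: returns estimates of loc (mean) and scale (std).
mu_hat, sigma_hat = stats.norm.fit(data)

# QQ comparison against a normal distribution. With no plot argument,
# probplot returns the quantile pairs and a least-squares line through them.
(theoretical_q, ordered_data), (slope, intercept, r) = stats.probplot(
    data, dist="norm"
)

print(f"fitted mean = {mu_hat:.2f}, fitted std = {sigma_hat:.2f}")
print(f"QQ line: slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.4f}")
```

If the chosen distribution is a good approximation, the QQ points fall close to a straight line, so `r` is near 1 and the fitted parameters land near the values used to generate the data.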

So I hope this has given you a better feel for theoretical distributions,

how we can use them both in calculating probabilities and in interpreting data.

If you have any questions let us know in the course forums. And good luck.