0:10

The basic form of a model in an associational analysis will look

something like this.

You'll have an outcome, y, and you'll have a key predictor, x.

And then you have maybe some potential confounder that we'll call z.

And then you have some independent random error that we call epsilon.

In addition to those factors,

we have a number of parameters that we wanna estimate.

So alpha here is the intercept.

It's the expected value of y when x and z are both zero.

We have beta, which is the change in y associated with a one-unit increase in x,

adjusting for z.

And then gamma is the change in y associated with a one-unit increase in z,

adjusting for x.
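Written out, the model described here would take roughly the following form. This is a reconstruction from the spoken definitions of y, x, z, epsilon, alpha, beta, and gamma, so the exact notation on the slide is an assumption.

```latex
y = \alpha + \beta x + \gamma z + \varepsilon
```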

So, this is a linear model, and

the parameter that we're trying to estimate here is beta.

That's what tells us how our outcome changes along with our key predictor.

Now, there are other parameters, gamma and alpha, that are in the model, and

we need to have them in the model for the model to work.

But we're actually not interested in those parameters.

So we sometimes call them nuisance parameters,

because we have to estimate them, but we don't actually care about their value.

So this is what I might consider to be a primary model.

It's very simple: there's a key predictor and there's only one confounder.

And so you may need to consider other things later on, but

this is sometimes good as a primary model.

On the other hand,

sometimes we'll use a primary model that doesn't have any confounders,

and then slowly add things to the model to see how our results change.

So the example I'm going to use here is going to be an advertising campaign for

a new product.

Imagine you're selling something and you're thinking of buying ads on

Facebook, and you wanna know how effective those ads are gonna be.

So ultimately what you wanna do is sell more products and

make more money from this.

And so one thing you might do is pilot a one-week experiment where you buy

Facebook ads for a week and see how it does.

So this is a very simple design: you look at the one week before

the ad campaign, the one week during the ad campaign,

and then the one week after the ad campaign, just to see

how the sales numbers change while you're running the ads.

Then you could compare the total sales for the week during the campaign with

the week before and the week after,

and see if there is any reasonable increase in the total sales.

So using this type of design and

this kind of experiment, what would you expect to see?

So here's a data set that's not real, but it kind of represents the ideal scenario

for what you might see in an experiment like this.

So in the first seven days you have an average of about $200 per day.

Then the next seven days,

this is during the campaign, you have an average of about $300.

And then after the campaign finishes you have an average of, again, about $200.

So it's possible to tell just from this graph, without doing anything fancy,

that the ad campaign seems to add about $100 per day to the total daily sales.

So, in this case, your primary model might be very simple,

it might look something like this,

where y is the outcome, which would be the daily sales,

and x is just an indicator of whether a given day fell during

the ad campaign or not.

And then your primary interest is still in the coefficient beta,

which tells you how your total daily sales increase with the ad campaign in place.

So for example, the data for

this might look something like this table here, where you have 21 days.

You have seven days without the campaign, seven days with the campaign,

and seven days without.

You can see that the daily sales change as you go in and out of the campaign.

So it's a very basic setup, very simple, and

this is kinda what you would love to see in your data in an ideal world.
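As a minimal sketch of this ideal case, the primary model y = alpha + beta * x + error can be fit by ordinary least squares. The data values below are hypothetical, made up only to match the averages described ($200, $300, $200), and the variable names are illustrative rather than taken from the lecture.

```python
import numpy as np

# Hypothetical idealized data: 7 days before, 7 days during, 7 days after.
# The exact values are made up to match the averages described in the talk.
sales = np.array([200.0] * 7 + [300.0] * 7 + [200.0] * 7)
campaign = np.array([0.0] * 7 + [1.0] * 7 + [0.0] * 7)

# Design matrix for the primary model: y = alpha + beta * x + error.
X = np.column_stack([np.ones_like(campaign), campaign])

# Ordinary least squares estimates of (alpha, beta).
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
alpha_hat, beta_hat = coef

print(f"alpha (baseline daily sales): {alpha_hat:.2f}")
print(f"beta (change during campaign): {beta_hat:.2f}")
```

With data this clean, beta is just the difference between the average daily sales during the campaign and the average outside it.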

Of course, in reality, you will never see data like this.

There will always be something more complicated.

So here is a picture of what more realistic

data might look like from an experiment like this.

Typically, real-world data are noisier.

There are other trends in the background that are kind of messing up your

relationship,

so it makes it harder to analyze the data.

So you'll notice that in this picture there does appear

to be an increase in sales during the campaign period.

But the problem is that the increase seems to have started

before the campaign even began, when the sales were already going up.

And so you might wanna ask, well, are there background trends that for

some reason increase sales over this three-week period?

So it's possible that we would've seen higher sales of the product

even without the ad campaign, just because of these background trends.

And so the question you really wanna answer is:

did the ads cause an increase in sales

over and above whatever background trends might have been going on that

you're not aware of?

So let's take our primary model,

which is just gonna be a simple model with the outcome and an indicator of the campaign.

If we use that model, what you'll see is that we'll estimate beta,

the increase in the daily sales due to the ad campaign, to be $44.75, okay?

Now let's suppose we add a background trend into our model.

So instead of the primary model where we just had the key predictor and

the outcome, we fit the following model,

which has a quadratic trend for time. This allows for a little curvature,

and allows for a rising and falling of the daily total sales.

Okay, so if we use this model, and we still try to estimate beta,

what we get is that our estimate of beta is $39.86.

So that's somewhat less than the beta that we estimated for the primary model.
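The comparison between the two models can be sketched as follows. The lecture's data set is not available here, so this uses synthetic, noise-free data with a made-up quadratic background trend and a made-up campaign effect of $50 per day. It will not reproduce the $44.75 and $39.86 estimates, and names such as `fit_beta` are hypothetical.

```python
import numpy as np

# Synthetic, noise-free data (an assumption for illustration only).
day = np.arange(1, 22, dtype=float)                 # days 1..21
campaign = ((day >= 8) & (day <= 14)).astype(float)  # 1 during the ad week
true_beta = 50.0                                    # hypothetical campaign effect
trend = -0.5 * (day - 11.0) ** 2                    # hypothetical background trend
sales = 250.0 + trend + true_beta * campaign


def fit_beta(columns, y):
    """Least-squares fit; returns the coefficient on the campaign indicator.

    The campaign indicator must be the second column (index 1).
    """
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


ones = np.ones_like(day)
t = (day - day.mean()) / 10.0  # centered, rescaled time for numerical stability

# Primary model: y = alpha + beta * x + error
beta_primary = fit_beta([ones, campaign], sales)
# Secondary model: y = alpha + beta * x + g1 * t + g2 * t^2 + error
beta_trend = fit_beta([ones, campaign, t, t ** 2], sales)

print(f"primary model:        beta = {beta_primary:.2f}")
print(f"with quadratic trend: beta = {beta_trend:.2f}")
```

In this constructed example the primary model credits part of the background rise to the campaign and overstates the effect, while the model with the quadratic trend recovers the $50 that was built into the data. Real data would include noise, so neither estimate would be exact.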

6:07

So there are a number of ways to evaluate your formal modeling and

your examination of primary and secondary models.

The first thing you want to look at is the effect size.

So the effect size is essentially the value of beta that you estimate.

Is it big or is it small, and

how do the estimates compare to each other across the different models?

So the three different models represent a range going from roughly $39 to $49.

And so you might wanna ask yourself, is that a big range?

Do you care about those differences?

It's a little hard to answer that question without knowing

what the context is and what your situation is.

For example, if the cost of the ad campaign were really low, and

it didn't really matter how much you made as long as you made something,

then maybe you don't care whether it's $39 or $49.

As long as you make something back, then the ad campaign's worth it.

On the other hand, if the ad campaign were really expensive, so

maybe it's $20 a day to run these ads,

then you might care which end of the range you fall on when you

run this ad campaign, because maybe $39 is not worth it, but $49 would be worth it.

So ultimately,

the question you're asking is, is it worth the risk of buying these ads,

given the evidence that you've seen from these different models showing you that

there might be a range of, say, a $39 to $49 increase in total daily sales

during the period of the ad campaign?
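One way to make the "is it worth it" question concrete is a break-even check. The $39 to $49 range and the $20 per day ad cost come from the example above. The profit margin is a purely hypothetical assumption, chosen here so that the two ends of the range land on opposite sides of break-even. The function name is also made up for this sketch.

```python
# All numbers below except the $39 to $49 range and the $20/day ad cost
# are hypothetical assumptions for illustration.
AD_COST_PER_DAY = 20.0   # from the example in the talk
PROFIT_MARGIN = 0.45     # assumed fraction of each sales dollar kept as profit


def net_gain_per_day(beta, margin=PROFIT_MARGIN, cost=AD_COST_PER_DAY):
    """Extra daily profit from the campaign, after paying for the ads."""
    return margin * beta - cost


for beta in (39.0, 49.0):
    gain = net_gain_per_day(beta)
    verdict = "worth it" if gain > 0 else "not worth it"
    print(f"beta = ${beta:.0f}/day -> net gain ${gain:.2f}/day ({verdict})")
```

With a different margin or ad cost the whole range could fall on one side of break-even, in which case the differences between the models would not change the decision.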

The second factor you wanna think about is plausibility.

So even though we fit many different models,

not all of them are equally plausible, okay?

So the primary model may not be plausible, because it didn't incorporate

the possibility of any sort of background trend, okay?

And likely with real-world data,

there's gonna be lots of things going on in the background.

And you don't want to just forget about them, because they could affect how your

sales go, and you may think your campaign is having a big effect

when actually there's just something going on in the background that's moving your

sales numbers.

Now the third model that we fit, which had this kind of fourth-order polynomial,

was a relatively complex model to capture a trend.

And it may be too complex, or

more than we need, to capture a very simple smooth trend in the background.

And so model two seems kind of reasonable.

We allow for this background trend.

It's a polynomial model,

but it's not very complicated, or

at least not overly complicated, to try to capture this trend.

So whether a model could be considered more or

less plausible will depend on your knowledge of the subject

and your ability to map real-world events to mathematical formulations of the model.

The last concept to think of is parsimony.

When different models tell the same story, for example,

if you don't care about the range of $39 to $49, then it's

sometimes better to choose the simpler model, or the model with fewer parameters.

And the reason is it's usually easier to tell a story about the data

when you have fewer parameters, and so this makes the model more useful.

And then finally, a simpler model would be more efficient,

because you're using the same data to estimate fewer parameters, and so

that's generally good from a statistical standpoint.

That will leave you with less uncertainty about those

parameter estimates, because you're able to use more data for each of the parameters.
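The efficiency point can be illustrated with a rough sketch. Under the usual least-squares assumptions, the variance of the estimate of beta equals the error variance times a factor that depends only on the design matrix. The code below computes that factor for three designs on a 21-day layout. It assumes the same error variance in every model and that the models have 2, 4, and 6 parameters (intercept, campaign indicator, and polynomial trend terms). The time rescaling and the names are hypothetical choices for this sketch.

```python
import numpy as np

day = np.arange(1, 22, dtype=float)                  # days 1..21
campaign = ((day >= 8) & (day <= 14)).astype(float)  # 1 during the ad week
ones = np.ones_like(day)
t = (day - day.mean()) / 10.0  # centered, rescaled time for numerical stability

# Each design lists its columns; the campaign indicator is always column 1.
designs = {
    "primary (2 parameters)": [ones, campaign],
    "quadratic trend (4 parameters)": [ones, campaign, t, t**2],
    "fourth-order trend (6 parameters)": [ones, campaign, t, t**2, t**3, t**4],
}


def beta_variance_factor(columns):
    """Var(beta_hat) / sigma^2 under the usual least-squares assumptions."""
    X = np.column_stack(columns)
    return np.linalg.inv(X.T @ X)[1, 1]


factors = {name: beta_variance_factor(cols) for name, cols in designs.items()}
for name, factor in factors.items():
    print(f"{name}: variance factor for beta = {factor:.3f}")
```

The factor does not shrink as trend terms are added, which is the sense in which a simpler model gives less uncertain estimates. The caveat is that this only helps if the simpler model is adequate: if a real background trend is left out, the estimate of beta can be biased even though it looks more precise.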

9:17

So in this example, this was an associational analysis, and

it focused on estimating the association between two features,

total sales and the ad campaign, while adjusting for other confounding factors,

like this potential background trend.

The primary model captures the basic relationship,

while the secondary models are used to adjust for

different factors in different ways.

So we had two different models adjusting for the background trends.

So what conclusion you make ultimately may depend on outside factors like cost,

or timing issues.

And so you have to factor all this in when you ultimately draw your

conclusions about what to do or what decisions to make.

And also, you may have to think about the plausibility of the various

models that you've chosen

to determine what evidence you're gonna weight more heavily than other evidence.

So this is the basic outline of an associational analysis.

Obviously there'll often be many more things that you're gonna wanna do.

But they'll generally follow this framework, and you're gonna wanna iterate

between fitting primary models and secondary models

until you get to an answer that you can use or can make a decision on.

Â