0:00

I want to finish this module with a look at a different form of regression. Regression actually takes many different forms. We saw linear regression, and we talked about multiple regression. Now I'm going to briefly introduce you to what is termed logistic regression. It's useful to know about logistic regression because it's appropriate for a certain type of problem.

Linear regression, the type of regression that we have been discussing so far, is appropriate when the outcome variable, Y, is continuous. We saw the price of a diamond, we saw the fuel economy of a vehicle, and we saw the amount of time it takes to do a job. Those are all examples of continuous variables.

But not every variable you're going to come across in a business setting is going to be continuous. In fact, some of them are discrete.

So what are some examples of discrete variables that you might well come across? Well, within the marketing context, there's a classic one: did that consumer buy my product, yes or no? That's a two-level outcome, yes or no, purchase or don't purchase, and I might well be interested in modeling such a variable, for example, as a function of the age of the consumer, their sex, their income, etc. So sometimes we find ourselves wanting to model a categorical, or discrete, outcome. In this particular case, I would say dichotomous: it can take on one of two values, a dichotomous outcome.

Another example of a dichotomous outcome is a pretty brutal one, but it's certainly out there if you run drug trials or medical experiments: is the patient alive after five years, yes or no? I might like to model that as a function of the severity of the illness, the drugs that the patient was able to take for the illness, etc. So that's another example of a dichotomous outcome variable that we'd like to model.

And within the internet or web-based world, those dichotomous outcome variables are very common. For one of them, you go onto a website and immediately a page pops up which says something like, "Would you give us your email address, please?" As a consumer, you're either going to say yes or no to that. And as the person who runs the website, I might be very interested in understanding what the drivers are

2:24

as to whether or not you chose to give me your email address, to sign up for some email newsletter, for example.

Another place where these dichotomous outcome variables are used in web-based businesses all the time is conversion. Conversion could be understood as: did you buy my product, yes or no? You were on the web page, you went through the shopping cart, did you actually get to the checkout, yes or no? So these dichotomous variables are absolutely commonplace in business processes. They are not continuous, and we're going to need a slightly different methodology if we're going to create a realistic model for such outcomes. And here's the methodology: it's logistic regression.

3:24

Logistic regression is used to estimate the probability that a Bernoulli random variable is a success, for example, the probability that a consumer buys my product. But we will estimate that probability as a function of a set of predictive variables, as a function of a set of X's. So it is a regression setup. X is trying to tell us something about Y; we believe there is an association. But the model is formulated in terms of the probability of a success. So we might ask, and here is the example I'm going to work with:

4:01

How does the probability that a website is compromised vary as a function of the number of plugins that the website has installed? If you have a website, you'd like it to be functional for your consumers. If it's nice and highly functional, then it should be engaging, and people stay on the site. That's usually viewed as a good outcome. But the more functionality we want to offer the user, the more plugins you typically have to have going on your website. And the unfortunate truth is that the more functionality you provide, the more plugins you put on your website (that might be a shopping cart, for example, or it might be a blog, all sorts of plugins), the higher the probability of compromise is, and that's not good.

So we're thinking of our outcome variable as: is this website compromised, yes or no? And we'll think of the predictive variable as the number of plugins that the site has installed.

5:00

Now, we went out and collected some data here. We looked at a large number of websites, counted the number of plugins that they've got, and looked to see how many of them are compromised. In this particular example, we've got 100 websites with no plugins. Of those websites, 16 are compromised and 84 are not compromised. That gives you a compromise percentage of 16%, or a proportion of 0.16. Those are the numbers in the second column of this table. Picking out another one, for websites that had five plugins, I looked at 100 of those: 55 were compromised and 45 were not compromised. By the time we get up to 10 plugins on the website, 88 of those sites were compromised and 12 were not compromised, so we've got an 88% compromise rate.
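
As a quick check of that arithmetic, here is a minimal Python sketch that recomputes the proportions for the three rows read out in the lecture. The variable names are illustrative, not from the lecture, and the remaining rows of the table are left out because they are not quoted in the transcript.

```python
# Counts quoted in the lecture: number of plugins -> (compromised, not compromised).
# Only the three rows read out in the transcript are included; names are illustrative.
counts = {0: (16, 84), 5: (55, 45), 10: (88, 12)}

proportions = {}
for plugins, (compromised, not_compromised) in counts.items():
    total = compromised + not_compromised      # 100 sites in each group
    proportions[plugins] = compromised / total

for plugins, p in proportions.items():
    print(f"{plugins:>2} plugins: proportion compromised = {p:.2f} ({p:.0%})")
```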

So remember, our outcome here is: was the website compromised, yes or no? Now, you might take data that looks like this, and when I say data, I'm going to look at the proportion compromised as a function of the number of plugins. If I choose to plot it like this,

6:12

I can put a line through the data. There's nothing that stops me running a linear regression. But what I'm saying now is that that's not totally appropriate here. The thing to realize is that the outcome that I'm modeling, the probability of compromise, has to lie between zero and one. Probabilities must lie between zero and one. Proportions must lie between zero and one.

So if I put a line through data that looks like this, you can see that something odd is going to happen, especially if I extrapolate the line. First of all, the line doesn't fit the data too well. But absolutely, if I extrapolate, if I took an extrapolation out to 12 plugins, for example, it's going to predict a proportion compromised, or a probability of compromise, greater than one. It's nonsensical. So the underlying issue here is that my outcome has to lie within the range [0, 1], and unfortunately, my line doesn't respect that range.
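
To see the extrapolation problem in numbers, here is a small Python sketch that fits an ordinary least-squares line by hand. It is an illustration under a stated assumption: it uses only the three proportions quoted in the lecture (0, 5, and 10 plugins), so the fitted line will differ somewhat from the one drawn on the slide, and the function and variable names are hypothetical.

```python
# Proportions for the three groups quoted in the lecture. The full table is not
# in the transcript, so this line is not identical to the one on the slide.
plugins = [0, 5, 10]
proportion = [0.16, 0.55, 0.88]

# Ordinary least squares for a single predictor, written out by hand.
n = len(plugins)
x_bar = sum(plugins) / n
y_bar = sum(proportion) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(plugins, proportion))
sxx = sum((x - x_bar) ** 2 for x in plugins)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

def linear_prediction(num_plugins):
    """Predicted proportion compromised from the straight-line fit."""
    return intercept + slope * num_plugins

print(f"fitted line: {intercept:.3f} + {slope:.3f} * plugins")
print(f"prediction at 12 plugins: {linear_prediction(12):.3f}")
```

With these three points the line works out to roughly 0.17 + 0.072 × plugins, so at 12 plugins it predicts a proportion of about 1.03, which is above one.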

So what am I going to do? This slide shows you that linear regression isn't necessarily a smart thing to do. There are alternatives, and the alternative that I would typically use would be a logistic regression. Let me show you what a logistic regression looks like. A logistic regression actually fits on a transformed scale, and what I'm showing you here is the fit back-transformed to the original scale of the data.
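
For reference, the transformed scale in a logistic regression is normally the log-odds (logit) scale. In the standard single-predictor form, with p(x) the probability of compromise at x plugins, the fit and its back-transformation can be written as below. The symbols β₀ and β₁ are notation assumed here for the intercept and slope; they do not appear in the lecture.

```latex
\log\!\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x
\quad\Longleftrightarrow\quad
p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
```

The left-hand equation is a straight line on the log-odds scale; the right-hand one is the same relationship expressed as a probability.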

I'm not really going to dig into the details here. The most important point that I'm making for you is that if you're looking at dichotomous outcome data, like live or die, buy or don't buy, you might find a logistic regression model much, much more appropriate. If we fit a logistic model to this data, which I've done here, it provides a different sort of fit. This sort of fit I would often term an S-shaped curve. In fact, it's a logistic curve, to be more precise, and hence we call it logistic regression. It has some very good features associated with it, and the main feature is that it never goes above one and it can never go beneath zero. So it provides a more suitable model when you're trying to predict outcomes that are things like probabilities and proportions, which should live between zero and one.
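
The bounded, S-shaped behaviour can be checked directly. The sketch below evaluates the standard logistic function at a range of inputs, including very extreme ones. Here `z` stands for a value on the transformed scale; no fitted coefficients are assumed, and the function name is just a convenient label.

```python
import math

def logistic(z):
    """Standard logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# However extreme the input, the output stays between zero and one,
# and it passes through 0.5 at z = 0.
for z in (-1000, -10, -2, 0, 2, 10, 1000):
    print(f"z = {z:>6}: logistic(z) = {logistic(z):.6f}")
```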

So here's the fit of the logistic regression model, and once you have got that fit, you can see how you can use it for prediction. If I have, for example, four plugins and I want to predict the probability of site compromise, I just take four, go up to the curve, and read off the value. That's what this regression methodology will give me. It's a prediction methodology that's more suitable for these dichotomous outcomes.
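
Reading a value off the curve amounts to plugging the number of plugins into the fitted model. The following is a minimal sketch of that, not the fit shown on the slide: it assumes only the three groups quoted in the lecture (0, 5, and 10 plugins), estimates the two coefficients by maximum likelihood using Newton's method written out by hand, and uses hypothetical names throughout. In practice a statistics package would normally do the fitting.

```python
import math

# Grouped data for the three plugin counts quoted in the lecture:
# (plugins, sites compromised, sites examined). The rest of the table is not
# in the transcript, so this fit will not match the slide exactly.
groups = [(0, 16, 100), (5, 55, 100), (10, 88, 100)]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(data, iterations=25):
    """Maximum-likelihood fit of p(x) = logistic(b0 + b1 * x) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, successes, trials in data:
            p = logistic(b0 + b1 * x)
            resid = successes - trials * p     # observed minus expected count
            w = trials * p * (1.0 - p)         # binomial variance weight
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system (information matrix) * step = gradient by hand.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(groups)

def predict(num_plugins):
    """Predicted probability of compromise for a given number of plugins."""
    return logistic(b0 + b1 * num_plugins)

print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"P(compromised | 4 plugins)  = {predict(4):.3f}")
print(f"P(compromised | 12 plugins) = {predict(12):.3f}")
```

On these three groups the prediction at four plugins should come out a bit below one half, and the prediction at 12 plugins stays below one, unlike the straight-line extrapolation.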
