I just want to give you a little bit of a thought exercise. I've picked a couple of companies here that deal a lot in consumer data: Apple (or any online business, really), Netflix, and then Whole Foods, the one brick-and-mortar company in the bunch. Think for a second about the types of decisions consumers make with these businesses, and frame them in terms of choices that consumers make.

So, for example, at Whole Foods it might be: am I going to buy a particular brand on a given shopping trip, yes or no? From the company's standpoint, it would be helpful to know which of those brands are going to be popular. Which ones are people going to buy on different trips? Is there seasonality associated with their products that matters when they're making their ordering decisions? Am I going to come to Whole Foods when I need groceries, yes or no? I could go to one of the other grocery stores that's available to me. So what is it about the people who choose to shop at Whole Foods that makes them more likely to go there when they're making their shopping trips?

With Netflix: do I retain service this month, yes or no? Do I choose to watch the recommended series, yes or no? Do I choose a larger plan this month? Do I add on the DVD service this month, yes or no? You can imagine consumers making similar types of decisions, whether it's Apple or Amazon or any other business.

So there are a lot of customer choices that are driving these businesses, which again highlights the importance of understanding the right way for us to analyze this choice data. And that's the reason I wanted to talk up front about distributional assumptions. We're used to using the normal distribution, and what we're really going to be changing is that distributional assumption. When it comes to binary choices, we're not going to be using the normal distribution.
We're going to assume that a customer's choice between a yes and a no outcome follows a Bernoulli distribution. There are only two values allowed under a Bernoulli distribution, 1 or 0, yes or no, and the only parameter associated with the Bernoulli distribution is the probability p. With probability p you get a 1; with probability 1 - p you get a 0. Framed differently: with probability p there is a yes outcome, and with probability 1 - p there is a no outcome.

Now, we can calculate the mean and the variance associated with the Bernoulli distribution, and we've done that here. For the expected value, we take the outcomes, 1 and 0, and weight them by the probabilities associated with those outcomes. So the mean of the Bernoulli distribution is the probability p. We can also calculate the variance under the Bernoulli distribution.

So when it comes to writing out the likelihood of a single observation from the Bernoulli distribution, this is the form that it takes on: the probability p raised to the power y, times (1 - p) raised to the power of 1 - y. Now, it looks a little bit foreign, but let's break it down based on the values that y can take on. Suppose we observe a 1, so y = 1. Well, p raised to the power of y means I have a value of p. And (1 - p) raised to the power of 1 - y is raised to the power of 0, so that term equals 1 and goes away. So for a single draw from a Bernoulli distribution, if I observe a 1, the likelihood is p. Well, what if I observe y = 0? If y = 0, p raised to the power of y is p raised to the 0. That term equals 1, so it essentially goes away, and I'm left with (1 - p) raised to the power of 1 - 0, which is 1 - p. So when I observe a 1, the likelihood is p. When I observe a 0, the likelihood is 1 - p. That just maps onto the two values that we talked about earlier.
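The slide being described here isn't part of the transcript, so what follows is only a reconstruction from the spoken description. The symbols Y (the outcome as a random variable) and L (the likelihood) are assumed notation; the lecture itself only names p and y.

```latex
\begin{align*}
P(Y = 1) &= p, \qquad P(Y = 0) = 1 - p \\
E[Y] &= 1 \cdot p + 0 \cdot (1 - p) = p \\
\operatorname{Var}(Y) &= E[Y^2] - (E[Y])^2 = p - p^2 = p(1 - p) \\
L(p \mid y) &= p^{y} (1 - p)^{1 - y}
  = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0 \end{cases}
\end{align*}
```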
And then for the full data set we take the product: we multiply that likelihood function over all the data points that we observe.

All right, how do we go about bringing covariates, or marketing activity, into this? Recall that when we looked at linear regression, what we said was that the outcomes y follow a normal distribution with a mean mu, and that mu was a function of marketing activity. What we're going to do here is say that my outcome is a function of the parameter p, and my probability p is going to be a function of marketing activity. We're just going to change the form in which that marketing activity affects the probability p. So we talked about this piece already: outcomes follow a Bernoulli distribution, and we can write out the likelihood function. When we bring in marketing activity, we're going to change that a little bit and say that the probabilities p are going to be a function of the marketing activity.

All right, so we're going to look at an example for customer acquisition. Marketing actions are going to affect the acquisition probability. So the acquisition probability may be affected by: did I send you an email? Did I send you a coupon? We want to bring that in and say those factors influence the acquisition probability, and whether or not someone decides to acquire the product is going to be driven by that probability.

The technique we're using is GLM, if you haven't seen the abbreviation: generalized linear model. What we're saying is that a function of the expectation is going to look like a regression equation. So we can think using the same logic as in linear regression; it's just going to look slightly different when we put it into math. All right, so there are two different models that are commonly used. One is the logit model, and you can see here, this is the functional form that we're going to use.
So the probability p is e raised to the power of x transpose beta, divided by 1 + e raised to the power of x transpose beta. One thing to keep in mind: we're talking about a probability, so p is always going to be a value between 0 and 1. This x transpose beta term, well, that's actually our regression equation. Our regression equation previously looked like an intercept beta 0, plus coefficient beta 1 times x1, plus coefficient beta 2 times x2, and so on for however many coefficients we have. That's our regression term. So every time you see that x transpose beta, just plug in your regression equation, because that's all we're doing. Think of this as rescaling your regression equation. That regression equation can take on negative and positive values, and we've got to somehow make that into a probability, bounded between 0 and 1. Taking e raised to that power, divided by 1 + e raised to that power, guarantees that the result is going to be between 0 and 1. That's the logit model.

Another model that we could use is referred to as the probit model, where we plug the regression equation into the normal CDF, and that's going to give us our probability between 0 and 1. For the most part you're going to get very similar predictions from these two approaches, with the exception of when we get far out into the tails of the distribution.

All right, just to give you a sense, this is consistent with economic theory, random utility theory, where you choose the option that provides you the highest utility. Utility is made up of two components. There's x transpose beta, our deterministic component, and that's the place where the marketing activity comes in. And then there's the random component. Depending on what assumptions we make about the distribution that random component comes from, we're going to end up with either the logit model or the probit model, all right?
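To make the two rescalings concrete, here is a minimal Python sketch of how a regression term gets turned into a probability under each model. The function names, the coefficient values, and the email/coupon inputs are all made up for illustration; nothing here is estimated from data.

```python
import math

def linear_predictor(x, beta):
    """Regression term x'beta: beta[0] is the intercept, beta[1:] pair up with x."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

def logit_probability(z):
    """Logit model: e^z / (1 + e^z), arranged so large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def probit_probability(z):
    """Probit model: the standard normal CDF evaluated at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

if __name__ == "__main__":
    # Hypothetical coefficients: intercept, email effect, coupon effect.
    beta = [-1.0, 0.8, 0.5]
    # Hypothetical customer: got an email (1), did not get a coupon (0).
    z = linear_predictor([1, 0], beta)
    print("x'beta:", z)
    print("logit p:", logit_probability(z))
    print("probit p:", probit_probability(z))
    # Both mappings stay between 0 and 1 no matter how extreme x'beta gets.
    for extreme in (-10.0, -2.0, 0.0, 2.0, 10.0):
        print(extreme, logit_probability(extreme), probit_probability(extreme))
```

One caution on reading the output: feeding the same x transpose beta into both functions is only meant to show that each one lands between 0 and 1. When the two models are actually fit to data, their coefficients come out on different scales, so the "very similar predictions" point is about fitted models, not about plugging identical coefficients into both.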
So, we have the logit model on one side and the probit model on the other. They're just different ways of translating that utility into a probability. For our demonstration purposes, we're going to stick with the logit model, but very similar intuition carries through for implementing the probit model. And in fact, that's something that can also be done within Excel using, I believe, the normal CDF function (NORM.S.DIST in current versions of Excel).
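To tie the pieces together, here is a rough Python sketch of the logit likelihood for an acquisition example like the one described earlier: each customer's probability comes from the regression term, each observation contributes p or 1 - p, and the contributions are multiplied across customers. The data, coefficient values, and function names are invented purely for illustration. The log-likelihood is included because sums of logs are what software usually works with in practice, although the lecture itself only describes the product.

```python
import math

def acquisition_probability(x, beta):
    """Logit probability for one customer; beta[0] is the intercept."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bernoulli_likelihood(y, p):
    """Likelihood of one observation: p if y == 1, 1 - p if y == 0."""
    return (p ** y) * ((1.0 - p) ** (1 - y))

def data_likelihood(beta, X, y):
    """Product of the single-observation likelihoods over all customers."""
    return math.prod(
        bernoulli_likelihood(yi, acquisition_probability(xi, beta))
        for xi, yi in zip(X, y)
    )

def log_likelihood(beta, X, y):
    """Same quantity on the log scale: a sum instead of a product."""
    return sum(
        math.log(bernoulli_likelihood(yi, acquisition_probability(xi, beta)))
        for xi, yi in zip(X, y)
    )

if __name__ == "__main__":
    # Made-up toy data: columns are (email sent?, coupon sent?).
    X = [[1, 0], [0, 0], [1, 1], [0, 1]]
    # Made-up outcomes: 1 = customer acquired, 0 = not acquired.
    y = [1, 0, 1, 0]
    # Two hypothetical coefficient vectors to compare.
    for beta in ([0.0, 0.0, 0.0], [-1.0, 2.0, 0.0]):
        print(beta, data_likelihood(beta, X, y), log_likelihood(beta, X, y))
```

Estimating the model amounts to searching for the coefficients that make this likelihood as large as possible. The sketch only evaluates the likelihood at two candidate coefficient vectors and leaves the search itself out.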