Joining me is Casey, a faculty member at Darden, a colleague of mine focused on analytics and data science, and currently a visiting researcher at Google. Casey, thanks for joining us.

>> Students of this course have seen Drew Conway's Venn diagram. They're familiar with it. Can you talk a little bit more about the danger zone? For example, for the generalist who is tinkering with analytics, how do they self-diagnose when they're headed into the danger zone and need to get advice or counsel from somebody with a data science or statistics background?

>> Well, first of all, Alex, I love the Venn diagram that Drew Conway put together. I start every class I teach on data science with that Venn diagram, and we have a robust conversation about the three circles and all the intersections, and I get lots of examples from the students in the class about when they've found themselves in these different areas of the Venn diagram. And the danger zone is clearly one of the worst places to be, which is why it has the word danger assigned to it. That intersection between computer science and having some business domain knowledge can be pretty troubling from a statistical perspective. You can draw lots of conclusions in that space that simply do not hold up when you go and try to apply the learnings to the future, or to some product design that you're then going to unleash into the world, and then things don't hold up for one reason or another.

There are several problems that occur statistically. One of the biggest is over-fitting. You build a model, and your model is able to explain everything that's gone on in the past. But of course the world is time-varying, and what's coming in the future is going to be different from what was in the past, and you've over-fit yourself to idiosyncrasies of the past that won't repeat themselves in the future. So when you make decisions on the basis of this model that explains the world you were once in, assuming things are stationary and that past world is going to repeat itself in the future, it doesn't, and then you're surprised by that. You can really do some harm by launching products on the basis of learnings from the past when your mental model, or even your statistical model, is over-fit to that past data.

So what you need to do is back off these conclusions in some way. You need to leave some room for flexibility, because what you're seeing in the past may not repeat itself in the future. And there are ways to protect yourself against that, ideas you can borrow from the world of statistics, that math and stats bubble, so that you get yourself out of the danger zone and into the data science zone. The biggest one is to set up training and testing sets. You set up a training set, which is a slice of your total past that you will use to predict another piece of your past, a piece you do not use to fit your model. That second piece is your testing set. It's your pretend future, your hypothetical future. If your model does well on that set, then maybe the model you fit to the smaller piece of the past is a good one to deploy into the future.
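A minimal sketch of the training/testing idea Casey describes, assuming a Python and scikit-learn workflow; the data and the linear model here are invented purely for illustration, not anything from the conversation.

```python
# Sketch: fit on one slice of the past, evaluate on a held-out slice
# that stands in for the "pretend future" the model has never seen.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                # illustrative features
y = X[:, 0] * 2.0 + rng.normal(size=500)     # illustrative outcome with noise

# Hold out 25% of the data as the testing set (the hypothetical future).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on the training slice:", r2_score(y_train, model.predict(X_train)))
print("R^2 on the held-out slice:", r2_score(y_test, model.predict(X_test)))
```

The point is simply that performance on the held-out slice, not on the data the model was fit to, is what hints at how the model might behave in the future.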
>> And what do you find it takes to bring along students who don't come into your classes with an analytics or data science background? What's most helpful for intuitively understanding that concept, and the importance and relevance of that train/test split?

>> This mindset of simulating what might happen in a future that you can't predict is really what we're getting at. And it's not all that different from what architects do when they build little models, or what aerospace engineers do when they build a small version of their plane and put it in a wind tunnel. It's hard to anticipate, just by working with a bunch of formulas and a bunch of past data, what it's going to be like when it faces real-world conditions. So you build your little prototype and you subject it to some real-world forces. That's the basic idea behind training and testing.

>> And can you talk a little bit about sampling? In a world where places like Google have made it so easy to run A/B tests, and everyone's doing this stuff, which is great, how do you think about sampling issues as a general manager and make sure that you're not stepping into the danger zone?

>> Yeah, small sample sizes are a big problem. And even when you've got bigger sample sizes, you think you've got more ability to draw a valid conclusion, but that can break down if you've got heterogeneity in the sample you've drawn. Maybe you've really got ten segments and you thought you just had one. Then you don't have much information on each of the ten segments: even though you drew a hundred observations, you've only got ten on each segment, and that may not be enough to draw a conclusion about any of them. And drawing a conclusion about the hundred people as a whole may be faulty to begin with, because of the multi-segment problem. So there are things like that that can come up. Generally there are additional covariates available, other independent variables, that can help you identify segments, and then you really try to collect enough samples within each segment so that you can draw valid conclusions about each one. In a commercial business there are bound to be many segments of customers, and if you can't identify them, it's really hard to design your product in a way that meets the needs of multiple segments.

>> And you've got so many great examples of this that have helped me think about this, and specifically watch out for certain things in my work. Can you talk about some specific examples that you've seen?

>> Which one did you have in mind, Alex?

>> One of my favorites was the one you mentioned about sampling people at a certain time of day, so you end up getting a segment that uses the service when they come home, and they're materially different from people who use the service during the day.

>> Yeah.

>> Did I get that one right?

>> Yeah, that's a good one. You think you're learning about the whole population, but if you just sample at a certain day part, you're only learning about a particular segment: those who have the time, say after work if that's when you're sampling, to come, relax, and enjoy your product or service. Whereas if you sampled at a different day part, like the middle of the day, you'd potentially tap into a different segment. It can be costly to sample more people, or to sample at multiple points during the day, but it may be worth it to get a more representative sample for each of your segments.
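A small sketch of the heterogeneity check Casey is pointing at, assuming pandas is available; the column names (day_part, converted) and the minimum-count threshold are hypothetical, chosen only to illustrate counting observations per segment before trusting an overall conclusion.

```python
# Sketch: before reading an overall rate, look at how many observations
# you actually have in each segment (here, the day part when users were sampled).
import pandas as pd

df = pd.DataFrame({
    "day_part":  ["morning", "evening", "evening", "midday", "evening", "morning"],
    "converted": [0, 1, 1, 0, 1, 0],
})

per_segment = df.groupby("day_part")["converted"].agg(n="size", rate="mean")
print(per_segment)

MIN_PER_SEGMENT = 30  # illustrative floor; in practice it depends on the effect size you care about
too_small = per_segment[per_segment["n"] < MIN_PER_SEGMENT]
if not too_small.empty:
    print("Segments with too few observations to trust on their own:")
    print(too_small.index.tolist())
```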
>> Another one that I see a lot is when you run a test that's opt-in and you get go-getters, or people who are not representative of the rest of your population. That's one you've mentioned in the past that I've seen a lot in practice. Are there any others you think are useful for the learner who's just dipping their toe in this water? How should they think about whether they're in an okay place on sample size or not, and what else might they do to work with their analytics team to make that determination?

>> Well, maybe moving away from the sample size thing for just a second, and backing up to even thinking about doing an A/B test. This idea that you're going to expose random subsets of your customers to different treatments is sort of a big deal, even before you get to how big a sample to take for each treatment. If you're going to try to learn from just the current iteration of your product, observe the behaviors of your customers, and build out a giant data set that says, okay, here's how our current product is satisfying our current customer base, the learnings you draw from that are potentially very suspect. This is the issue of correlation versus causation. It could be that you're just getting to a decent place with your product by chance, and you may not really be putting your finger on the thing that would move the needle with more customers, or with your existing customers. And really the only way to discover whether you can move the needle with an improvement to your product is to put two treatments out: a product that's the status quo and a product that's the improved version, and collect large enough samples, maybe over multiple segments, so that you can really understand whether that potential improvement moves the needle.

>> Some great perspective on avoiding the danger zone from Casey. Casey, thanks again for joining us.

>> Thank you, Alex.
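For readers who want the two-treatment comparison Casey describes in concrete terms, here is a sketch of reading out a simple A/B test with a two-proportion z-test; the conversion counts are made up, and the sketch assumes users were assigned to the two treatments at random, with enough samples in each segment you care about.

```python
# Sketch: status-quo product (A) vs. improved version (B), compared on conversion rate.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 120, 2400   # conversions and sample size for A (illustrative numbers)
conv_b, n_b = 156, 2400   # conversions and sample size for B (illustrative numbers)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under "no difference"
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))                    # two-sided test

print(f"A: {p_a:.3%}  B: {p_b:.3%}  lift: {p_b - p_a:.3%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```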