When we are interested in predicting when an event will happen, we very often rely on survival analysis, a set of techniques originally coming from the life sciences. As Keynes said, in the long run everybody dies. But the pragmatic question is: okay, but how long will I enjoy life before that happens? The answer is provided by survival analysis. While it has many applications in the life sciences, it translates easily into business analytics. Replace the event of "death" by another event and you can apply it to many different fields. For instance, we could model how long before an employee leaves. We have already identified the drivers of attrition and who was likely to leave, but not exactly when. Since we have discussed HR analytics a lot already, we could instead apply the same question to marketing analytics: how long before a customer churns? From the modeling perspective, it's exactly the same question. Or even in credit scoring: how long before a borrower defaults? But let's take an example in predictive maintenance, a topic we haven't discussed until now: how long before a certain mechanical element breaks down? This question is actionable in practice, because if you can anticipate failures you can allocate your maintenance resources more efficiently and reduce the downtime of your processes by replacing the broken pieces proactively. Let's take an example where we have statistics about 1,000 mechanical elements. Our dataset contains the lifetime of each element in weeks: how long it has been used until now, or until it broke. It also records whether the element is broken (a one) or still working (a zero); this indicates whether the "event" we are interested in, here the "death" of the element, has happened already or not. Then we have some measures about its environment: a pressure index, a moisture index, and a temperature index.
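To make the record layout concrete, here is a small sketch in Python (the course itself works in R); the column names follow the description above, but every value is simulated for illustration and none of this is the actual course data:

```python
import random

random.seed(42)

# Simulate 1,000 mechanical elements (illustrative only, not the course data).
# Each record: lifetime in weeks so far, an event indicator (1 = broken,
# 0 = still working), and the environment/maintenance covariates.
elements = []
for _ in range(1000):
    elements.append({
        "lifetime": random.randint(1, 100),       # weeks in service so far
        "broken": random.random() < 0.4,          # did the "death" happen yet?
        "pressure": random.uniform(0, 100),       # pressure index
        "moisture": random.uniform(0, 100),       # moisture index
        "temperature": random.uniform(0, 100),    # temperature index
        "team": random.choice(["A", "B", "C"]),   # maintenance team
        "provider": random.choice([1, 2, 3, 4]),  # element provider
    })

broken_share = sum(e["broken"] for e in elements) / len(elements)
print(f"Share of broken pieces: {broken_share:.1%}")
```

The key field is the event indicator: for the still-working elements, `lifetime` is not a time of death but only a lower bound on it.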
We also have information about the team in charge of maintaining each specific piece: team A, B, or C. And finally, we know the provider of the element. We can compute that 39.7% of the pieces in the sample are broken. Those have been replaced already. But that also means that more than 60% of them have been working well since their installation. One could think of relying on standard linear regression to find the causes leading to a failure. If we do so, we obtain these results. We see that only one or two variables have a significant effect. But actually, this is neither reliable nor accurate! Many elements haven't broken yet, and we don't know how much longer they will last. So for the 60% of observations that haven't broken until now, we cannot really say when it will happen. With standard linear regression, we cannot estimate what's driving an event that hasn't happened yet, so we cannot say anything about those observations; and we cannot include only the elements that broke either, because we would then have a biased model. This is what we call, in statistics, a right-censored problem. We know who is "alive" until now, but even if we suppose that in the long run everybody dies, we don't know when it will happen for those who are alive today. Let me take an example of four employees who started to work with us at the same time, say a month ago. Employee one left after a while, so we can measure the "time to death" for this employee. The same for employee two. But employees three and four are still around. It may be that in the long run they leave, but that part of the timeline is unknown. It's censored. That's why it's called a right-censored problem. So we cannot say exactly what the time to death of employees three and four would be in the end. But we can make some assumptions.
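A tiny simulation (my own illustration, not from the course material) makes the bias concrete. Suppose lifetimes are exponentially distributed with a true mean of 30 weeks, but we only watched each element for 26 weeks. Both naive treatments of the censored observations, counting them as deaths or dropping them, underestimate the true mean lifetime:

```python
import random

random.seed(0)

true_mean = 30.0       # true average lifetime in weeks (assumed for the demo)
study_length = 26.0    # we only observed each element for 26 weeks

true_times = [random.expovariate(1 / true_mean) for _ in range(100_000)]

# Right-censoring: for elements still "alive" at the end of the study,
# we only know the lifetime exceeds study_length.
observed = [min(t, study_length) for t in true_times]
event = [t <= study_length for t in true_times]

# Naive option 1: treat every observed duration as a death time.
naive_mean = sum(observed) / len(observed)

# Naive option 2: drop the censored observations entirely.
events_only = [t for t, e in zip(observed, event) if e]
events_mean = sum(events_only) / len(events_only)

print(f"true mean        : {sum(true_times) / len(true_times):.1f} weeks")
print(f"censored-as-dead : {naive_mean:.1f} weeks (biased low)")
print(f"events only      : {events_mean:.1f} weeks (biased even lower)")
```

Dropping the censored observations is worst of all, because it keeps only the elements that failed early, which is exactly the biased-model problem mentioned above.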
Hence, we need to rely on a model adapted to right-censored situations: a model that can estimate what's driving the probability to die, or to break, or to default, even for the observations that haven't died yet. Because the fact that they are still alive today still conveys some information. So let's use the survival library in R. As usual, I won't enter into the details of how these models work; I'll just focus on how we can interpret and use the results in a business-oriented way. We'll use the survreg function, which is basically a regression adapted to a right-censored context. As the dependent variable, we have to provide, for each observation, the lifetime and whether the element broke or not. And once we provide the pressure, moisture, temperature, team, and provider information as the explanatory variables, we get these results. Now, in practice we should first assess the quality of the model. One could for instance rely on an out-of-sample dataset, as we did with credit scoring, to assess whether or not it does a good job at fitting observations that were not used for the estimation of the model. Or we could use only the data up to a certain point in the past, and look at the quality of the predictions after that point. But each time, we'll be limited by this right-censoring issue: we need to wait for the death of an observation to know exactly its total time to death. You can do some tests in R yourself if you want, and you would see that the model we chose actually does a very good job at fitting the data we have. So let's assume that we trust the model here and go on. As usual, we identify the significant variables by looking at the p-values: the smaller, the better. Here we emphasize in bold the variables with a p-value smaller than 0.05. Those are the moisture index, the temperature index, whether the element is taken care of by team C, and all the provider variables.
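The core idea behind survreg-style models is a censored likelihood: elements that broke contribute the density of their failure time, while still-working elements contribute only the probability of surviving past their current lifetime. The sketch below, in Python rather than R, fits a simple exponential accelerated-failure-time model by maximum likelihood on simulated data; the single "moisture" covariate, the sample size, and the coefficients are all assumptions made for the demo, not results from the course:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulate right-censored data: expected lifetime depends on a moisture index.
n = 5000
moisture = rng.uniform(0, 1, n)
true_b0, true_b1 = 3.0, 0.8                 # assumed: moisture extends lifetime
mean_life = np.exp(true_b0 + true_b1 * moisture)
t_true = rng.exponential(mean_life)
censor_time = 40.0                          # observation window in weeks
time = np.minimum(t_true, censor_time)
event = (t_true <= censor_time).astype(float)

def neg_log_lik(beta):
    # Exponential AFT model: hazard rate = 1 / exp(b0 + b1 * moisture).
    rate = np.exp(-(beta[0] + beta[1] * moisture))
    # Broken elements contribute log f(t) = log(rate) - rate * t;
    # censored elements contribute log S(t) = -rate * t.
    return -np.sum(event * np.log(rate) - rate * time)

fit = minimize(neg_log_lik, x0=[0.0, 0.0])
b0_hat, b1_hat = fit.x
print(f"intercept: {b0_hat:.2f} (true {true_b0}), "
      f"moisture effect: {b1_hat:.2f} (true {true_b1})")
```

Because the censored observations still enter the likelihood through the survival term, the coefficients are recovered without bias, which is exactly what a plain linear regression on the same data could not do.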
We then look at the sign of the estimates to assess whether the effect on the expected lifetime is positive or negative. Positive means that the factor increases the chance of survival; negative means the opposite. We see that moisture has, on average, a positive effect on the lifetime, and that, everything else being equal, the units from providers two and four have a better expected lifetime than the others. Now, that doesn't mean that we should only work with providers two and four. They may be more expensive, or they may provide elements that are used in less demanding conditions. But that should certainly be investigated. And the same goes for the negative effects: temperature, and team C being in charge of the maintenance. There may be an explanation, but we should investigate further. And if there is no satisfying explanation, we should act upon it. Now that we have a model, we can use its predictive power to anticipate failures and do some predictive maintenance. The predict function allows us to use the estimated survival model to predict the expected median "time to death" of each individual element. Hence, for each observation, we can compare this expected time to death with the current lifetime and compute the expected remaining lifetime, which is just the difference between the expected time to death and the current lifetime. We can now predict which elements are the most likely to break in the near future. We obviously need to remove those that broke already. And if we then rank the observations by their expected "remaining time until death", we obtain this table. It can consequently be used to prioritize our maintenance actions and replace the pieces that will break soon, before it happens. By anticipating failures and acting proactively, we decrease the downtime of the process and benefit from a more efficient resource allocation. We avoid unnecessary tasks.
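The ranking step itself is simple arithmetic once the model has produced a predicted median time to death per element. A minimal Python sketch, with made-up element IDs, ages, and predicted medians standing in for the model output:

```python
# Hypothetical model output for still-working elements: current age in weeks
# and predicted median time to death in weeks. All numbers are illustrative.
elements = [
    {"id": "E1", "age": 30, "predicted_median": 34},
    {"id": "E2", "age": 10, "predicted_median": 55},
    {"id": "E3", "age": 48, "predicted_median": 50},
    {"id": "E4", "age": 20, "predicted_median": 60},
]

for e in elements:
    # Expected remaining lifetime = predicted median time to death - current age.
    e["remaining"] = e["predicted_median"] - e["age"]

# Rank by shortest remaining lifetime to prioritize maintenance actions.
priority = sorted(elements, key=lambda e: e["remaining"])
for e in priority:
    print(f'{e["id"]}: {e["remaining"]} weeks left (currently {e["age"]} weeks old)')
```

The element at the top of the list is the one to replace first, even though, as here with E3, it is not necessarily the oldest one.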
Here, the maintenance of pieces that still have a long expected lifetime. This, in turn, frees up resources to focus on actions that are more impactful: here, replacing elements that would otherwise have broken soon. In this video, we only provided an introduction to survival analysis. We didn't enter into the details of the different models available, or how to assess and compare their accuracy. But we saw that using a library "out of the box" already provided a lot of insightful information and, at the end of the day, did the job very well. Now, you should be careful in practice, because many hypotheses need to be respected for the model's conclusions and predictions to be valid and reliable. But here, since we emphasize the managerial usefulness of those tools, I focused on the interpretation and actionability of the standard approach. If you like the topic, I would certainly advise you to learn more about survival analysis on the web, in books, and the like. As I explained before, if you're just starting to deal with computer science and statistics, you can see this training as an introduction to analytics and a first step in your data science journey. In that case, you should go further in your understanding of those models. In contrast, if you're already a computer scientist or a statistician, you can see this training as a way to revisit all those analytics techniques you've learned in the past, but applied to a business context.