One of the challenging things about managing machine-learning-based products in production is that machine learning products are subject to a very large number of risks and possible causes of failure. You have all the normal failure causes of software-based products, but you also have a number of additional risk factors relating to the model itself. A recent Google study of one of their models, which has been running in production over the course of 15 years, actually found that the majority of failure causes for the system were not due to the model itself but were due to a number of other factors relating to the software, the infrastructure, or perhaps the input data coming in. However, when a machine-learning-based system fails due to model-related causes, the failure can be particularly dangerous because it's often very difficult to detect.

Let's take a look at an example. A normal software failure can often be somewhat easy to detect. You might see a screen such as this, which gives you an immediate indication that there's an issue. Model-related failures, however, can be notoriously difficult to identify. They might manifest themselves as predictions which are slightly off, or maybe significantly off, but it's hard or even impossible for the user to know that.

Machine learning models perform at their peak when they're initially released into production. However, over time their performance degrades. It may degrade very quickly, for example, when the environment into which the model is released does not match the environment in which it was trained, and the input data coming in now that the model is in production is mismatched with the data you used for training. Or it may degrade very slowly over time, as the environment around your system and the model changes and the data itself starts to drift very, very slowly, changing the performance of the model. It's important to identify, for your particular model, what the rate of decay is and what level of decay is acceptable before you need to take action.

Let's look at some of the model-related issues that come up once models are released into production. We'll start with training-serving skew, which is a mismatch between the data you've used for training and the data that the model is now receiving as input. We'll then discuss excessive latency issues, and we'll talk about the concept of drift, both data drift and concept drift.

Let's start with training-serving skew. Training-serving skew is a mismatch between the data you've used for training your model and the data that the model is now receiving as input to generate a prediction on. This often results from using data for training purposes that does not match the data that the model receives in real life in the hands of your users. For example, this can come about by training your model on an artificially constructed or augmented data set, or by using a highly cleaned and tuned data set for training when you're not performing the same level of cleaning or filtering on the input data once the model is in production.

Let's look at a couple of examples. Suppose we're training a computer vision model to detect skin cancer based on images of different areas of the skin. We train our model using a large set of high-resolution imagery taken under perfect lighting. However, once we release our model into production and it's in the hands of doctors, the lighting can vary, the angle the pictures are taken from can vary, and the resolution of the images may vary. We may end up with a significant mismatch between the input images we're now feeding to our model to generate predictions and the high-quality, clean images that we used to train on.

Another example might be the development of a natural language processing model which is trained to answer questions from your users. We may train the model initially on a limited subset of questions about specific topics. However, we might find that after we launch our model into production, we failed to anticipate the large variety of different questions that users are actually asking, and so our model's performance is not sufficient on the diverse set of questions users are actually asking in production.

Training-serving skew typically manifests itself right away when you release the model into production. You often find out very quickly that the data that's now being fed into your model is mismatched with your training data, and it can show itself in a significant degradation of your model's performance in real life in the hands of your users.
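As a rough illustration of how a team might watch for this, here is a minimal sketch of one way to check for training-serving skew: compare the distribution of a feature computed from the training set against the same feature computed from recent serving traffic. The feature, sample sizes, and threshold below are hypothetical, and a two-sample Kolmogorov-Smirnov test is just one of several reasonable checks.

```python
import numpy as np
from scipy import stats

def check_feature_skew(train_values, serving_values, p_threshold=0.01):
    """Compare one numeric feature between training data and serving traffic.

    Runs a two-sample Kolmogorov-Smirnov test and reports basic summary
    statistics so a human can see how far the distributions have moved.
    """
    ks_stat, p_value = stats.ks_2samp(train_values, serving_values)
    return {
        "train_mean": float(np.mean(train_values)),
        "serving_mean": float(np.mean(serving_values)),
        "ks_statistic": float(ks_stat),
        "p_value": float(p_value),
        "skew_suspected": p_value < p_threshold,
    }

# Hypothetical example: image brightness under studio lighting (training)
# versus under variable clinic lighting (serving).
rng = np.random.default_rng(0)
train_brightness = rng.normal(loc=0.80, scale=0.05, size=5000)
serving_brightness = rng.normal(loc=0.55, scale=0.20, size=500)
print(check_feature_skew(train_brightness, serving_brightness))
```

In practice you would run a check like this per feature, and keep in mind that with large samples even tiny, harmless differences produce small p-values, so the size of the shift (the KS statistic or a simple difference in means) usually matters as much as the test result itself.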
Excessive latency is another common issue with model-based products. Latency in generating predictions from a model can vary significantly based on a number of factors, such as the volume of the input data coming into your system, the extent of the data pipeline you've created, and how long it takes data to flow through your pipeline. The choice of algorithm and the model itself can also have an impact on the model's ability to quickly generate predictions for a user. Sometimes latency is less of a concern, but other times it's a major concern. For example, when we're dealing with online or edge models, latency may be critical to the functionality of your product. A model that unlocks a phone using a thumbprint or an image of your face, for example, is a situation where latency is a critical issue. Likewise, in designing systems for autonomous driving, we can't afford any excessive latency at all. We have to think carefully about the design of our system and the choice of the algorithms, models, and data pipelines that we're using to minimize any possible latency.

Another common issue with machine learning models in production is drift, which comes in two forms: data drift and concept drift. With data drift, we have a shift in the input data that's coming into your system. Your model is initially trained on a static training data set, but after it's released into production, the environment around your model changes over time. You may have shifts in the distribution of your input features due to population shifts or even adversarial reactions to your model. For example, if you've built a model to detect spam by picking up on certain keywords in spam emails, spammers may identify what your model is doing and may shift the language they're using in their emails, using a different set of vocabulary to prevent your model from picking up on it. Data drift results in a shifting distribution of features, which can occur very quickly or very slowly over time. But the issue that data drift causes is that it changes the feature space of the inputs to your model. It may move you into a feature space where your model's performance degrades.
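One common way to watch for data drift is to compare the distribution of each input feature over a recent window of serving traffic against a fixed reference window, such as the training data or the model's first weeks in production. The sketch below is a minimal illustration using the Population Stability Index; the bucket count and the 0.2 alert threshold are conventional rules of thumb rather than universal constants, and the example numbers are made up.

```python
import numpy as np

def population_stability_index(reference, current, n_buckets=10):
    """Population Stability Index between a reference and a current sample.

    Buckets are defined by the reference distribution's quantiles, so the
    reference falls roughly evenly into them; PSI then measures how far the
    current sample has shifted. A common rule of thumb: < 0.1 is stable,
    0.1-0.2 is a moderate shift, > 0.2 is a shift worth investigating.
    """
    edges = np.quantile(reference, np.linspace(0, 1, n_buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid division by zero and log(0) in empty buckets.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Hypothetical example: daily ride counts before and after a sudden demand drop.
rng = np.random.default_rng(0)
reference_window = rng.poisson(lam=120, size=365)  # a typical historical year
current_window = rng.poisson(lam=45, size=30)      # the last 30 days
psi = population_stability_index(reference_window, current_window)
print(f"PSI = {psi:.3f}, drift suspected: {psi > 0.2}")
```

In a real system you would compute something like this per feature and track it over time, since, as the next example shows, drift can arrive overnight or build up slowly.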
Let's look at a real-life example of data drift. Suppose we've built a model to predict demand for rental bicycles in a city's bike-share system. We might see data drift for a variety of different reasons with this type of system. For example, we may have trained our model based on typical patterns of usage in the city over a number of years, which are driven by things such as tourists in the city, or business people working in the city using bicycles to go out to lunch or to run errands after work. However, once COVID hit, we might have seen a very quick change in our input feature distribution. The number of tourists in the city might have gone down to almost zero, and the number of business people using bicycles at the lunch hour may have significantly changed. This may have degraded our model's ability to predict the number of bicycles that were going to be used. Or we may have seen a case of slow data drift. For example, a change in demographics in a particular neighborhood may have caused the demand for bicycles in that neighborhood to go up or down slowly over time.

The other type of drift that we commonly see is called concept drift. Unlike data drift, with concept drift the distributions of your input data may stay the same, but the patterns that the model has learned between the inputs to your model and the output predictions may no longer apply or may significantly change. This generally results from some shift in the relationships between the inputs and outputs due to things such as changing consumer preferences or shifts in human behavior.

One example of this might be with fraud detection models. A common feature used in fraud detection models is identifying purchases of one-way flight tickets as a sign of potential credit card fraud. Again, due to COVID, when tourism significantly declined and people were no longer buying round-trip tickets for purposes of visiting relatives or tourism, there were a larger number of people purchasing one-way tickets to get someplace, knowing that they expected to stay there in lockdown for some amount of time. The purchase of a one-way ticket could no longer be considered a potential sign of credit card fraud at that time. The relationship between the input to our model, the purchase of one-way flight tickets, and the output of our model, the identification of possible fraud, had significantly changed. This is a good example of concept drift.
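Because the input distributions can look unchanged under concept drift, checks on the input features alone won't necessarily catch it; you generally also have to track the model's actual performance against ground-truth labels as they arrive (confirmed fraud reports in this example, which often come in with a delay). Here is a minimal sketch of that idea: compute an evaluation metric over a rolling window of labeled predictions and flag when it falls meaningfully below the level measured at launch. The window size, baseline, and tolerance are hypothetical choices you would tune for your own product.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class PerformanceMonitor:
    """Rolling-window monitor that flags possible concept drift when live
    accuracy drops well below the accuracy measured at launch."""
    baseline_accuracy: float      # accuracy on held-out data at launch time
    window_size: int = 500        # number of recent labeled predictions to keep
    tolerance: float = 0.05       # acceptable absolute drop before flagging
    _window: deque = field(default_factory=deque, repr=False)

    def record(self, predicted_label, true_label):
        """Add one (prediction, delayed ground-truth label) pair."""
        self._window.append(predicted_label == true_label)
        if len(self._window) > self.window_size:
            self._window.popleft()

    def status(self):
        current = sum(self._window) / max(len(self._window), 1)
        return {
            "window_accuracy": current,
            "baseline_accuracy": self.baseline_accuracy,
            "possible_concept_drift": (
                len(self._window) == self.window_size
                and current < self.baseline_accuracy - self.tolerance
            ),
        }

# Hypothetical usage: record fraud predictions as confirmed labels arrive.
monitor = PerformanceMonitor(baseline_accuracy=0.92)
monitor.record(predicted_label=1, true_label=0)  # one-way ticket flagged, not fraud
print(monitor.status())
```

A drop in a monitor like this is a symptom rather than a diagnosis: it could reflect concept drift, data drift, or a broken upstream pipeline, which is why it's usually read alongside input-distribution checks like the ones sketched earlier.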