Where do we get data to use for our models? There are a number of possible sources. One is internal data, coming from things like log files or user data that we've collected from a website that we maintain, or from our own operations, or, if we're a hardware company, from machinery that we're operating. Our ERP systems can also be a rich source of data for models.

Secondly, we can collect data from our customers. We might deploy sensors with our customers that collect data we can use to build models. We can also collect data coming off the operational systems or hardware that they're using. Or, if we're a web company, we may be collecting web data in the form of votes or ratings, or asking our users to fill out online forms, all of which generate data that we can build models from.

We might also consider external third-party data sources that are relevant to a modeling problem. For example, in many cases weather data is useful in modeling, as is demographic data for modeling a certain user base, or social media data that we can obtain and use for generating model predictions.

Let's talk about a couple of best practices in collecting data. First, it's important to collect data intentionally. We're not trying to collect every single data point that we can find on every single feature imaginable; we should focus our data collection on only what we really believe we need for a model. There are a couple of reasons for doing this. One is that we need to store all the data that we're collecting, so we need to think about storage costs and processing costs. The second is that we want to be aware of any potential privacy or ethics concerns. So it's important to home in as early as possible on the specific features and data that we need for the model and focus our data collection only on them.
Secondly, as we collect data, we want to be careful not to introduce bias. We need to be very thoughtful about how we're collecting data and where we're collecting it from, so that we're not introducing bias into our datasets.

Thirdly, we want to make sure the data is representative. What does this mean? It means that we want to collect data that's representative of the population we're trying to model. We want to avoid situations where we're collecting data only on a small subset of users, for example, where that subset is not representative of the entire population we might be modeling.

It's also important to continue to update our data as the environment changes. In the real world, things change around our model; the environment doesn't stay constant. If we build a model once using a set of historical data and allow it to run forever, its performance is likely to degrade over time because the environment and the data around it are changing. It's important to continue collecting data and to use the new data to periodically retrain our model.

Finally, it's critical to document what we're doing when we're collecting data. We should be documenting the sources our data comes from, as well as metadata, or data about our data. As we go forward into modeling, and as our team grows to add new people, it's important for them to understand where the data comes from, the attributes it has, and the relationships that can be found in it. If we don't document this as we're collecting our data, it can cause a lot of pain later on when we have to go back and figure things out, for example where certain sets of data came from.

User data is a popular source of data to use in modeling, and there are a number of options for collecting it. You might have forms on a website that users fill out to provide data.
You may collect data on user behavior from your website logs, for example, looking at the behavior of people coming to your site. You may also collect user data through things such as votes, rankings, or ratings on your site.

If you're collecting user data, ideally your data collection should be unobtrusive and not painful for your user. You want to focus on collecting data as an integrated part of a user's workflow: as they naturally engage with your site or your product, you're collecting data in a way that's embedded in that workflow. It's ideal if you can provide some benefit to the user through the data that you're collecting.

Let's take a look at a couple of creative examples of how to accomplish this. The first one is the CAPTCHA product from Google, which I'm sure we're all familiar with. Often when you submit a form or try to download something from a website, you'll be presented with a CAPTCHA. The point of the CAPTCHA is to ensure that you're actually a human being trying to accomplish what you want to do, rather than an automated bot that's crawling the website. What we have on screen is an example of a CAPTCHA that's asking us to select, from a set of nine images, all of the images that contain a crosswalk. By doing this, we verify that we're a human user, so we're allowed to proceed with the action we're trying to take. But behind the scenes, Google is actually collecting the information we've provided, the images we've marked as containing a crosswalk, and using it to label images. For example, if Google has an interest in training a model to recognize crosswalks, we've now provided it with labeled data, images with crosswalks labeled in them, that it can then use to train this model.

Another creative example is a company called Stitch Fix.
Stitch Fix sends a weekly box of clothes to its users' homes, and the user has the choice of either keeping the clothes and paying for them, or returning them to the company, and the following week they'll get a new box. Obviously, Stitch Fix has an interest in sending users clothes that they want to keep, so it's interested in understanding the personal preferences of each user when it comes to buying clothes. To do this, they build a machine learning model which gets trained to understand users' preferences. Just by deciding which items of clothing to keep and which ones to return from a box each week, the user is actually providing valuable information back to Stitch Fix. The company can treat each item as data labeled either keep or return and use it to fine-tune its model, so it can build a high-quality model that really understands your preferences and which items of clothing you're more likely to keep versus return.

Often when we're collecting data from our users, we can generate over time what's called a flywheel effect. The idea of a flywheel effect is that users generate data by interacting with an AI-enabled system, typically through a website. As our users generate this data, it can be fed into our AI systems to further strengthen their quality, and it also presents new opportunities to use that data to build AI in other places in our product.

Let's illustrate this with the example of Amazon. Amazon collects a lot of information about its users: what they search for, the purchases they make, the ratings they give. It can use all of this information to train models which accomplish different things to help its users have a better experience. For example, it might use data on searches for a product, and on what users ultimately end up purchasing, to reorder its listings.
It might, for instance, look at all searches users have done for a flashlight, then look at which flashlight model those users most commonly ended up purchasing, and reorder the listings so the most commonly purchased items appear right at the top, giving users an easier experience. It might also take data on the purchases and ratings of an individual user to identify that user's preferences and provide personalized recommendations back to the user. Finally, it may use purchase records, identifying when customers have purchased multiple items together, to present what are called "shoppers also bought" items. This means that if you're looking to buy a flashlight, for example, the "shoppers also bought" menu might also present the batteries that you need for your flashlight. These are items that are commonly purchased together, and Amazon can figure this out by mining the data from purchase records. The structure behind this is generally called a co-occurrence matrix, which identifies items that are commonly purchased together.

One of the challenges, however, in relying on user-supplied data for our model is that when you initially start, you may not have enough data to build a high-quality model. This is a particular issue with recommendation systems, because for every new user, we don't know anything about their preferences when they first start engaging with our service. Because we don't know anything about them, it's very difficult to use a machine learning model to provide personalized recommendations. There are a couple of ways of approaching this cold-start problem. For example, we might start with a simple heuristics-based approach rather than a machine learning approach to deliver what we're trying to deliver to our customer. Or we might add a calibration step to our process to gather enough data about the customer to train a simple machine learning model.
Then, as the user engages with the model, we can continue to retrain and improve the model over time.

Let's consider the case of Netflix, for example. Netflix is famous for its movie recommendation algorithm, which provides recommended movies for a user to watch. However, if you're a new user who has just signed up for Netflix and you haven't yet watched a single movie, Netflix doesn't know anything about you on which to base recommendations. If they were to employ a heuristics-based approach, they might simply recommend to you the highest-rated or most popular movies among all users on Netflix. Or they might choose to add a calibration step, so that when you first sign up for Netflix, you're presented with a number of movies and asked to rate each one. By having you rate a small number of movies, Netflix can start to develop a sense of what your preferences are and can then present more relevant movie recommendations to you.
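
To make the last two ideas a bit more concrete, here is a minimal Python sketch that combines them: it counts how often items are purchased together (a co-occurrence matrix stored as nested counters) and falls back to a simple popularity heuristic when a user has no history yet, which is the cold-start case. This is only an illustration under stated assumptions, not how Amazon or Netflix actually implement their systems. The function names (`build_cooccurrence`, `most_popular`, `recommend`) and the toy `orders` data are hypothetical and made up for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(orders):
    """Count how often each pair of items appears in the same order."""
    co = defaultdict(Counter)
    for order in orders:
        # Deduplicate and sort so each unordered pair is visited once per order.
        for a, b in combinations(sorted(set(order)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def most_popular(orders, k):
    """Cold-start heuristic: the k items that appear in the most orders."""
    counts = Counter(item for order in orders for item in set(order))
    return [item for item, _ in counts.most_common(k)]


def recommend(user_history, orders, k=2):
    """Recommend up to k items, falling back to popularity for new users."""
    if not user_history:
        # We know nothing about this user yet, so use the heuristic.
        return most_popular(orders, k)
    co = build_cooccurrence(orders)
    scores = Counter()
    for item in user_history:
        scores.update(co.get(item, {}))
    # Do not recommend items the user already has.
    for item in user_history:
        scores.pop(item, None)
    return [item for item, _ in scores.most_common(k)]


# Hypothetical purchase records, one list of items per order.
orders = [
    ["flashlight", "batteries"],
    ["flashlight", "batteries", "tent"],
    ["tent", "sleeping bag"],
    ["flashlight", "batteries"],
    ["batteries", "radio"],
]

print(recommend(["flashlight"], orders, k=1))  # ['batteries']
print(recommend([], orders, k=2))              # ['batteries', 'flashlight']
```

A user who has bought a flashlight gets batteries suggested because the two appear together in three orders, while a brand-new user simply gets the most commonly purchased items. In a real system the fallback would typically be replaced, as history accumulates, by a model that is periodically retrained on the new data.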