Since the beginning of this course, and really the entire specialization, we've talked about how data is essential for machine learning: when there is no data, there's no machine learning. But we've been glossing over one very important detail. You can't do that machine learning until the data lives, somehow, on a computer you have access to. In this video, we'll talk about data warehousing and getting data into your hands. By the end, you'll understand the steps required for identifying and joining data sources.

In course one of the specialization, we talked about the importance of a data pipeline process. After all, data never arrives in exactly the form you want. To recap that video, we discussed the three typical stages of a data pipeline: data extraction, data transformation, and data loading, often known as ETL. This video focuses on the extraction part of ETL, but before diving into that, here's a quick review of all three stages. First, data extraction. This is the stage where you think about where you're going to get the data you plan to use. Next, data transformation. Different data sources may have different characteristics, and your data never arrives in the format you need. The data transformation phase is all about, you'll never guess, transforming your data into a schema specific to where it's being loaded. Finally, there's data loading. Data must be stored somewhere, and the data loading stage is where you decide exactly what you want to save and where. No matter what you choose, data must be stored at least temporarily so it can be accessed by your machine learning model.

Which tools and processes you use in each of these stages, and overall, is an important decision that impacts every stage of the project. As we mentioned in previous courses, it's good to consult an expert and worthwhile to spend some time exploring what technology works best for you. Using the ETL concepts as a framework helps guide you overall, but there are still lots of details to consider. For example, you probably want to set up your data pipeline in a way that can be repeated easily, but don't let setting up a perfect pipeline hold you back from getting your machine learning project started. Experiment with the files you have to get a sense of what's helpful to your project. You might be able to work without some of your available data sources.

Let's talk more about data extraction. Most real-life use cases for machine learning are ongoing, so you'll likely want to automate data retrieval, but this automation is rarely simple. You need to identify the sources of data and put all the files in the same place. Obvious, but it can be overlooked, in particular because this pretty much always involves getting permission from the appropriate source. Here at Amii, we've seen cases where a company had some data, but it wasn't at all related to the problem they wanted to tackle. They also didn't have access to the data they needed for their specific problem, because it was owned by a different part of the company. Because of this data permission issue, there was a huge delay in getting the machine learning project started. The point is, thinking about where to get the right data for your problem and how to obtain permission for that data is a really important step. When it comes to data storage, you should also consider data privacy and intellectual property protection.
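To make the three ETL stages concrete, here is a minimal sketch in Python. It is only an illustration, not the course's prescribed pipeline: the file name, column names, and the choice of pandas with a SQLite target are all hypothetical stand-ins for whatever sources and storage you actually use.

```python
# Minimal ETL sketch. "sales_export.csv", the column names, and the SQLite
# target are hypothetical examples, not part of the course material.
import sqlite3
import pandas as pd


def extract(path: str) -> pd.DataFrame:
    """Extraction: pull raw data from a source you have permission to use."""
    return pd.read_csv(path)


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Transformation: reshape the raw data into the schema the target expects."""
    df = raw.rename(columns=str.lower)           # consistent column names
    df["amount"] = df["amount"].astype(float)    # consistent types/units
    return df[["order_id", "amount", "region"]]  # keep only what you plan to load


def load(df: pd.DataFrame, db_path: str) -> None:
    """Loading: store the transformed data where your model can reach it."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql("orders", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    load(transform(extract("sales_export.csv")), "warehouse.db")
```

In practice each stage is usually far more involved, but keeping the three steps as separate, repeatable functions is what lets you rerun the pipeline easily as new data arrives.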
You have to think about whether anonymizing the data is a necessary step for your project. Keep in mind that certain data is allowed only within certain political boundaries; sometimes data isn't allowed to reside outside a particular province, state, or country. This means that when you're hosting your data, you have to ensure it's warehoused according to the relevant policies. This is especially true for cloud computing: you must make sure the data lives under the policies required by the owner of the data. Then there are extreme cases, like a project requiring no internet connection, or data that can only be looked at on custom hardware. This is a barrier for the machine learning system and should be avoided if you can. It's worth at least attempting to solve the problem you care about with somewhat easily accessible data.

Gathering data all in one place might also involve scanning and pre-processing. This includes everyone's favorite step, cleaning the data. It's a big topic, of course, but here are some quick checks you should do. Make sure there are no typographical errors and that the use of upper- and lowercase letters is consistent. If there are categories, make sure those are consistent. If there are numbers, make sure the units are consistent. In other words, make sure your data schema is consistent. You'll also want to check for unwanted observations in your data, including duplicate or irrelevant observations. Note that duplicate observations are particularly common if you're combining datasets from multiple sources. Irrelevant observations are ones that don't actually fit the specific problem you're trying to solve. Often data files will have fields for old, irrelevant characteristics that don't actually hold different values; there's no way to learn anything from static signals. Or the data might just contain a lot more detail than you need. It's important to think about your data in the context of your specific problem. You'll also want to think about detecting outliers and deciding how to handle them. What to do with outliers varies from problem to problem. On the one hand, they may hold key information precisely because they're so different from the main group. On the other hand, they may just hold meaningless noise or mistakes that can throw off your model. In general, if possible, it's a good idea to explore your data both with and without the outliers. Filling in missing values is such a significant topic that we're going to defer it for now. As a first pass, you should find out how complete your dataset is. By complete, I mean: is there an actual value for every characteristic in every example?

Another important consideration for consolidating and identifying the data you're going to use is availability. Your machine learning system will need to access the data somehow, so choose an option that integrates well with your current systems. I also want to highlight that just because data is accessible at a certain point in time doesn't mean it's always going to be there. Having access to data today but not tomorrow is a problem, a big problem: we need the operational data to be consistent with the learning data. So an important factor for data decisions is continued accessibility. Building strong data pipelines involves a lot of programming and integration between many members of your team, so it's good to have an idea of what coding standards and testing procedures you're going to hold to.
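To make the quick cleaning checks above concrete, here is a small pandas sketch. It assumes a DataFrame with hypothetical columns named "category" and "temperature_c"; the column names and the IQR rule for flagging outliers are illustrative choices, not requirements from the video.

```python
# Minimal cleaning sketch; "category" and "temperature_c" are hypothetical
# column names used only for illustration.
import pandas as pd


def quick_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Consistent upper/lowercase and trimmed whitespace in a categorical field.
    df["category"] = df["category"].str.strip().str.lower()

    # Remove duplicate observations, common when combining multiple sources.
    df = df.drop_duplicates()

    # Drop static signals: columns that hold the same value for every example.
    static_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=static_cols)

    # Report completeness: how many values are missing per characteristic.
    print(df.isna().sum())

    # Flag (rather than silently drop) outliers in a numeric column, so you can
    # explore the data both with and without them.
    q1, q3 = df["temperature_c"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["is_outlier"] = ~df["temperature_c"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df
```

Notice that the outliers are flagged instead of deleted; that keeps both versions of the dataset available for the "with and without" comparison mentioned above.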
Testing data flows or pipelines creates challenges that are different from traditional software testing like unit testing. For data pipelines, we need to test both the code in the usual way, making sure the transformation and cleaning steps work on test cases, and that the data itself conforms to the expected schema (a small sketch of such a check appears below). It's best to have all data go through these sanity checks automatically, so that downstream processes can use the data with confidence. You'll still need to occasionally check in manually and make sure no strange cases have thrown the whole thing off. Many data management platforms exist that can consolidate data from several different sources and integrate it into a machine learning system. We won't go into the data management platform options in this video; pick one that works well for your business. You should consider things like how you want to do version control, and whether or not you want to use the cloud, encrypted hard drives, or distributed storage.

In this video, we talked about the considerations you should take into account for data warehousing. We briefly reviewed the three main stages of the data pipeline process: extract, transform, and load, or ETL. We then looked more closely at the extract phase of ETL, discussing privacy, pre-processing, and accessibility considerations. Now you have a good starting point for consolidating your own data sources.
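As a concrete illustration of the schema sanity checks discussed above, here is a minimal sketch of a validation step that could run automatically before downstream code trusts an incoming batch. The expected columns, types, and the negative-amount rule are made-up examples, not a schema from the course.

```python
# Hypothetical schema check; the column names, dtypes, and value rules are
# illustrative assumptions only.
import pandas as pd

EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "region": "object"}


def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means the batch passed."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    if "amount" in df.columns and (df["amount"] < 0).any():
        problems.append("amount contains negative values")
    return problems


# Example usage: fail loudly so strange cases can't silently flow downstream.
batch = pd.DataFrame({"order_id": [1, 2], "amount": [9.5, 12.0], "region": ["ab", "bc"]})
issues = validate(batch)
if issues:
    raise ValueError("; ".join(issues))
```

Checks like this complement, rather than replace, ordinary unit tests on the transformation code, and the occasional manual inspection still catches the strange cases an automated rule doesn't anticipate.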