So now you have relevant data about various clients, collected from different sources and locations, all together in front of you. What next? Although these data are stored digitally, they're still not ready to be consumed by a learning algorithm. In this video, we'll show you examples of converting different forms of data to a standard format. By the end, you'll have a good idea of how to transform your data after consolidating it. Let's start with an example. Bills or books can be scanned and stored digitally as images. Although a human can easily read the content of these images, to a computer it's just a bunch of pixel values, as we've discussed before. A machine doesn't automatically understand that there's text, let alone what information the text is meant to convey. So for us to conduct data analysis and apply machine learning, we have to extract the textual content of these scanned images using a technique called optical character recognition. As we discussed in earlier courses, optical character recognition, or OCR, translates images of written words into digital text. This has been actively studied for a long time, and there are a lot of different implementations. For beginning projects, you should use an existing program to handle the text recognition, and realize that no system is perfect. A related example is transforming audio recordings to a machine-readable text format. For this, we again use automatic speech recognition systems or services. Unless your whole project is about improving speech or character recognition specifically in your domain, you'll want to use an existing package to perform these tasks; it's a lively area of research. So you can think of converting images or recordings of words into text strings as a component of data preparation, then use the output from whatever tool you choose as the input for your machine learning workflow.
As the product matures, you might want to revisit this conversion because, of course, more precise and accurate conversion translates into better data for your model. A slightly simpler example is creating CSV or text files, both standard text formats: CSV, or comma-separated values, for spreadsheet data, and TXT for generic text. TXT files are the simplest kind of text files, unlike .doc, PDF, or even rich text (RTF) files. When your data comes from vendor-specific programs such as Excel or Word, you have to convert it to a standard, non-proprietary format. CSV files are a relatively consistent way of storing spreadsheets or databases of information, although you have to be aware of whether column names are included and how that particular conversion handles missing values. In general, at this stage, you need to think about converting your raw data into a consistent data format that your machine learning algorithms and analysis tools can directly and correctly read. So at this point, we have a set of data files all converted into some machine-understandable data format. The next major thing we should look into is how to integrate different kinds of data to get a unified view. If you're doing supervised learning, you can think of it as creating a standard matrix of your data. This means rows of specific examples, with particular features as the columns, and of course, for the learning data you need to have each example paired with the appropriate label. But right now, we have data coming from multiple sources, all of it machine readable. How do we get a set of examples with their particular features and appropriate label all in one place? There are two different ways the data could be split up. It could be that we have information about each example coming from different places, for example, when we're combining readings from different sensors with camera data for a security system, or the results of medical tests from different labs.
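To make the caveats about column names and missing values concrete, here is a minimal sketch using Python's standard csv module; the client IDs, names, and balances are invented for illustration. Notice that we have to decide explicitly what a blank cell means.

```python
import csv
import io

# A small CSV export, as it might come from a spreadsheet program.
# Note the header row and the empty cell (a missing value) for Bob.
raw = """client_id,name,balance
101,Alice,2500.00
102,Bob,
103,Carol,1200.50
"""

rows = []
reader = csv.DictReader(io.StringIO(raw))  # first row becomes the column names
for record in reader:
    # Decide explicitly how this conversion represents missing values:
    # here an empty string becomes None rather than 0.
    balance = float(record["balance"]) if record["balance"] else None
    rows.append({"client_id": int(record["client_id"]),
                 "name": record["name"],
                 "balance": balance})

print(rows[1])  # {'client_id': 102, 'name': 'Bob', 'balance': None}
```

If the export had no header row, DictReader would silently treat the first data row as column names, which is exactly the kind of inconsistency to check for at this stage.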
On the other hand, we could have different examples coming from different places, as when we have patient records from different systems, even pre-digitized examples combined with digital or environmental readings collected by different teams from different locations. And of course, it could be split up on both dimensions at once. So first we need to make sure that we've identified what the rows are: what does an example actually consist of? Do we have a unique identifier for it? And what characteristics does each example have? With that, you've identified the rows and columns for your data. Let's look at the case when we have a set of examples with different aspects, or feature values, coming from different sources. We need to make sure that we have consistent rows. The most important question is: what identifies a particular example? Is that identifier present in every data source? It could be something like a serial number identifying a piece of hardware, or a timestamp falling into a particular region of time. In the simplest case, you have that same identifier correctly paired with each data source, and, for the database nerds among us, you have a good old inner join. All you have to do is match up the unique identifiers across the files. If the identifier is not already there, you have to construct it, which might be a lengthy process in itself or involve a combination of steps. In the other case, we're integrating data where each example is complete but comes from different sources. Then you have to watch out for whether the column dimension matches. Do you have the same features for each example? If not, which ones are you going to use? Often different teams will have slightly different data collection standards, or different hospitals might order different tests on their patients. You have to decide what set of features you're going to use. Then you have to make sure you know how they're represented in each source.
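The inner join described above can be sketched in plain Python; the serial numbers and feature values here are hypothetical stand-ins for the sensor-plus-camera security example.

```python
# Readings about the same devices arrive from two sources, each keyed
# by the device's serial number (the unique identifier for an example).
sensor_data = {"SN-001": {"temp": 21.4},
               "SN-002": {"temp": 19.8},
               "SN-003": {"temp": 22.1}}
camera_data = {"SN-001": {"motion": True},
               "SN-003": {"motion": False},
               "SN-004": {"motion": True}}

# An inner join: keep only identifiers present in *both* sources,
# and merge their feature columns into one row per example.
joined = {
    sn: {**sensor_data[sn], **camera_data[sn]}
    for sn in sensor_data.keys() & camera_data.keys()
}

print(sorted(joined))    # ['SN-001', 'SN-003'] -- SN-002 and SN-004 are dropped
print(joined["SN-001"])  # {'temp': 21.4, 'motion': True}
```

Note the design choice an inner join forces on you: examples missing from either source are silently dropped, so it's worth logging how many rows survive the match.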
Maybe there's a blood type column at one hospital, but it's BT in another, or worse, it's not even labeled, because of course a doctor understands what ABO means, but the computer doesn't. So besides matching the number of columns, you have to know where each value comes from in each data source. Sometimes this will take extra care and involvement from a domain expert. Often, you'll find yourself relying heavily on the metadata associated with the files. And one last thing: units. The hidden differences are the most dangerous. Maybe you correctly identified where length was recorded in the files from California and in the files from Alberta, but are they both using millimeters, or inches, or parsecs? Check, and then convert to whatever standard suits you best. Now that we have a unified dataset which can be understood by the machine, the next task is to make sure the data we have are relevant to the question we're trying to solve. If not, eliminate the irrelevant data before putting any additional effort into cleaning it. Removing irrelevant data can be highly domain specific. For instance, in text analysis: is the text all in the same language? Do we throw away the examples that aren't in the language we're most interested in, or do we translate them? If we translate them, what tool do we use? There are potentially many different transformation steps, and it would be impossible to cover them all. The main thing to remember is that your ultimate goal is converting your raw data into a well-defined, structured set. Know what defines an example, and then, whatever sources you use, it always comes back to fitting them into that structure, matched to the appropriate example. In the next video, we'll talk more about how you can monitor the quality of your data and ensure that you really are working on the data that you think you are.
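As a recap, the column renaming and unit conversion steps from this video can be sketched together; the hospital records, the column names, and the choice of millimeters as the standard unit are all invented for illustration.

```python
# Hypothetical patient records from two hospitals that name and
# measure the same features differently.
hospital_a = [{"patient_id": 1, "blood type": "A", "height_mm": 1750.0}]
hospital_b = [{"patient_id": 2, "BT": "O", "height_in": 66.0}]

# Map each source's column names onto one agreed-upon schema...
RENAMES_A = {"blood type": "blood_type", "height_mm": "height"}
RENAMES_B = {"BT": "blood_type", "height_in": "height"}
# ...and each source's units onto one agreed-upon unit (millimeters).
MM_PER_INCH = 25.4

def standardize(record, renames, height_factor):
    """Rename columns to the shared schema and rescale height."""
    out = {renames.get(key, key): value for key, value in record.items()}
    out["height"] = out["height"] * height_factor
    return out

unified = ([standardize(r, RENAMES_A, 1.0) for r in hospital_a] +
           [standardize(r, RENAMES_B, MM_PER_INCH) for r in hospital_b])

print(unified[1])  # blood_type 'O', height of 66 inches converted to ~1676.4 mm
```

The rename tables are exactly the kind of mapping you would build with a domain expert and the files' metadata, one table per source.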