So, learning from big data. In the textbooks that I read and all the research I did to prepare for this segment, both on the machine learning side and on the data analytics side, the authors drove one point home repeatedly: if you want to build a solid foundation, you have to build it on something stable. For machine learning and big data analytics, that means preparing the data. I'll talk more about that, and we'll see why it's important when we get to my example at the end. The data preparation portion of any machine learning or data analytics project can consume roughly 80 percent of the time. So there's a whole lot of human work involved in deciding on, working with, and processing the data. We'll look at what some of those steps are here. Don't underestimate data preparation if you embark on a machine learning or data analytics project in the future.

Processing your data. The first step is identifying what data you want to look at. You want to clean that data. You want to generate derived data, if any; in my experiment I actually derived data, and we'll get into the rationale for that.

Reducing the dimensionality of your data. As I alluded to on Tuesday, when you're dealing with a very large dataset that has many, many dimensions, it can be very difficult to see key insights in that data, or to extract key insights from it. So if you're having trouble making forward progress and you're discouraged, say you've been working on a machine learning or data analytics project for weeks or months and you're not getting where you want, you can try to reduce the dimensionality of your data.

Hierarchical data needs to be flattened to remove any duplication. Hierarchical data might look something like this: I'll just use letters, say an A with a B and a C under it, then another A with a D, an E, and an F under it, and maybe another F somewhere in there. So there might be data values that are repeated in some hierarchical fashion. The authors of the material I read said it was important to flatten that hierarchy. If you think about it, the algorithms we've looked at want your data in some kind of matrix, so it needs to get flattened in order to fit into a matrix, at least for the algorithms we looked at. I didn't come across any algorithms that take a matrix of pointers to other matrices or arrays of features. That doesn't mean they're not out there; I just didn't come across them in my learning and my research. There's a small flattening sketch just below.

You can perform principal component analysis, and we'll see more on that coming up. PCA studies a dataset to learn the most relevant features, the ones responsible for the highest variation in the dataset. That may or may not give you some insight into what your data is telling you. It can be used for visualization, and it can be used to reduce the number of features in your dataset before they feed into your machine learning algorithm or algorithms. A short PCA sketch follows below as well.

Obtain meaningful data, a.k.a. ground truth: data that someone has correctly measured, labeled, and categorized. Some human being has looked at it, validated it, and done the work to determine that there's meaning in the data, that this is good data. And acquiring enough data is very, very important.
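To make the flattening idea concrete, here's a minimal sketch in Python. The nested record, the field names, and the flatten_record helper are all hypothetical, just to show hierarchical data being turned into flat rows that can sit in a matrix.

```python
# Minimal sketch: flattening a hypothetical hierarchical record into a flat row.
# The record structure and field names are made up for illustration.

def flatten_record(record, parent_key="", sep="."):
    """Recursively flatten a nested dict into a single-level dict."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# Example hierarchical record (hypothetical sensor reading with nested metadata).
nested = {
    "sensor": {"id": "A", "location": {"building": "B", "room": "C"}},
    "reading": {"temperature": 21.5, "humidity": 0.43},
}

row = flatten_record(nested)
print(row)
# {'sensor.id': 'A', 'sensor.location.building': 'B', 'sensor.location.room': 'C',
#  'reading.temperature': 21.5, 'reading.humidity': 0.43}
```

Once every record is a flat dict with the same keys, the values line up as one row per record in a matrix.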
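And here's a small PCA sketch using scikit-learn, just to show the mechanics of reducing the number of features. The random data and the choice of two components are assumptions for the example, not anything from my experiment.

```python
# Minimal PCA sketch with scikit-learn; the synthetic data and the choice of
# 2 components are assumptions for illustration only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # 200 samples, 10 features (synthetic)

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)                     # keep the 2 highest-variance directions
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                        # (200, 2)
print(pca.explained_variance_ratio_)          # fraction of variance each component explains
```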
Only after testing and validation will you be able to determine whether you're suffering from the bias or variance issues we touched on briefly on Tuesday. You arrange the data into a matrix, as we've seen, and you want to deal with bad data, because that can happen: you can get bad data. You get missing cases when dealing with huge datasets. You get distorted distributions of data, and data with high variance: you might have most of your data clustered along an axis and then a couple of points way out at the end. You need to go take a look at those and see what they are, see what they represent. Are they valid? Is the data valid? Remember the six V's; validity was one of them. So you might have these distorted distributions that you need to contend with. Redundancies: there may be redundant features in there, and again, like the hierarchical case, you want to flatten those data structures so they can be turned into an array. And anomalous examples: those data points or features way out on the right, sitting apart from the rest, might be important or they might be anomalous. Only we, human beings, can look at those and say, "Yeah, that makes sense," or, "No, it doesn't make sense; those samples should be excluded from the dataset." Then there's noise. Noise can get introduced into sensor samples. We looked at filtering techniques earlier; you can filter noise out of samples if noise is an issue in the particular machine learning or data analytics problem you're trying to solve. There's a small cleaning sketch below.

You then want to extract features. You may have a multidimensional dataset; I know I had one, as we'll see coming up, and I had to make choices about what data, what features, I wanted to use in my exercise. Feature extraction can come from database queries, you can use principal component analysis, or you can do it manually. I did mine manually.

Then you rank features: what are the most important features? There are statistical and entropy-based approaches to help you identify these. I haven't used any of them; I just came across them in my own learning and education about this area of computer science. Gain ratio. Information gain. Chi-squared. SVM ranking is another way to analyze data. I'm not covering any of that in class, but I wanted to mention them so you can go off and pursue them on your own at some point in the future if they're of interest to you. A chi-squared ranking sketch follows below.

When you get out into the working world, as machine learning and data analytics become more and more widely deployed, you may hear the term extraction, transformation, and load, or ETL. I found a nice little article from Oracle, which we're not going to go through, where they talk about this process. Extraction collects raw data from operational systems. Transformation is the cleaning, the repairing, and the processing of the data; again, principal component analysis could perhaps be applied, along with other techniques, to prepare your data and transform it into something a machine learning algorithm can consume. Then load: place all the data in a designated location for the algorithms to work on, perhaps an SSD array, or a server on a Hadoop or Lustre file system. So, you can go read that article. There's a tiny ETL sketch below, too.
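To make the cleaning step concrete, here's a minimal sketch using pandas. The synthetic readings, the median fill strategy, and the 3-standard-deviation flag are all assumptions for illustration, not a prescription from the material.

```python
# Minimal data-cleaning sketch with pandas; the synthetic data, the median fill,
# and the 3-sigma outlier flag are assumptions for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical sensor readings: 50 samples with one missing value and one
# wild outlier injected on purpose.
temperature = rng.normal(loc=21.5, scale=0.5, size=50)
temperature[10] = np.nan      # a missing case
temperature[25] = 95.0        # an anomalous example

df = pd.DataFrame({"temperature": temperature})

# 1. Handle missing cases: fill with the column median (one common choice).
df["temperature"] = df["temperature"].fillna(df["temperature"].median())

# 2. Flag high-variance points: more than 3 standard deviations from the mean.
z = (df["temperature"] - df["temperature"].mean()) / df["temperature"].std()
flagged = df[z.abs() > 3]

# A human still has to look at the flagged rows and decide: valid, or exclude?
print(flagged)
```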
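Here's a small sketch of statistical feature ranking using scikit-learn's chi-squared scorer. The synthetic count data and the "top 2" cutoff are assumptions for the example; note that chi-squared requires non-negative feature values, which is why the sketch uses counts.

```python
# Minimal feature-ranking sketch using the chi-squared test in scikit-learn.
# The synthetic count data and k=2 are assumptions for illustration.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 300

y = rng.integers(0, 2, size=n)                       # binary labels
informative = rng.poisson(lam=2 + 3 * y)             # counts that depend on the label
noise1 = rng.poisson(lam=4, size=n)                  # counts unrelated to the label
noise2 = rng.poisson(lam=1, size=n)
X = np.column_stack([informative, noise1, noise2])   # chi2 needs non-negative features

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print(selector.scores_)                    # higher score = stronger dependence on the label
print(selector.get_support(indices=True))  # indices of the top-ranked features
```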
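And just to give ETL a shape, here's a toy extract-transform-load sketch. The CSV source, the cleaning rules, and the Parquet destination are hypothetical stand-ins for whatever operational system and designated storage you would actually use.

```python
# Toy extract-transform-load (ETL) sketch. The file names, the cleaning rules,
# and the Parquet output are hypothetical stand-ins, not a real pipeline.
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Extract: collect raw data from an operational system (here, a CSV export)."""
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform: clean and repair the data so an algorithm can consume it."""
    cleaned = raw.dropna()                          # drop missing cases
    cleaned = cleaned.drop_duplicates()             # remove redundant rows
    return cleaned.select_dtypes(include="number")  # keep numeric features for the matrix

def load(prepared: pd.DataFrame, destination: str) -> None:
    """Load: place the prepared data where the algorithms will read it."""
    prepared.to_parquet(destination)                # could just as well be HDFS, Lustre, etc.

if __name__ == "__main__":
    load(transform(extract("raw_measurements.csv")), "prepared_measurements.parquet")
```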