Hello everyone. Today we're going to talk about data preprocessing, which is an important component in the whole data mining pipeline. The learning objectives are to, one, identify the potential issues in your datasets, and two, be able to apply specific techniques to preprocess your data so that they're ready for specific data mining tasks. As we talked about earlier, if you look at the whole data mining pipeline, you start with the raw dataset, then you need to spend some good time to really understand your dataset. With that, you're ready to proceed to the next stage, which is data preprocessing. When you talk about data preprocessing, there are various aspects. You want to start by asking what issues there may be in the dataset. These could be missing values, inconsistencies, or errors. With that understanding, you then need to think about what you need to do to prepare your dataset for your data mining process. Typically, we'll look at four different components. The first one is cleaning, that is, cleaning up your data. The second is integration; this is about when you have multiple data sources and you want to combine datasets. You'll also look at various methods to transform your dataset, and finally data reduction. All of these are important mechanisms to ensure that you have good data for your data mining tasks. Always keep in mind that you need good data: if you don't have good datasets, then you cannot really do good data mining. Let's start with a discussion of data quality. The general idea is that you want to get some notion of what kind of data you have, but also specifically what kind of quality issues you need to pay attention to. Of course, in a specific real-world data setting, you may not be considering all the possible quality scenarios, but it is more about knowing the general aspects you need to pay attention to. Let's first think at the dataset level.
The first one, of course, is relevance, because if you want to do a particular data mining task, you want data that are relevant. That's the first thing to think about: whether you have the relevant data to proceed. The next one is accessibility. You may know that a dataset is highly relevant, but then the challenge is whether you can easily get it; whether you can access the specific types of data you need is an important question to explore. The next one is interpretability. You can get the data, but does the data come with good descriptions or metadata, so that you can interpret what you're looking at? You may be looking at, say, many millions of columns, and those columns are really just numerical values. Well, what are you looking at? What do those values mean? You need a good description of the dataset so you can interpret it. The next one is reliability. This is generally more related to the data sources and the providers: whether good quality control was already done by the provider, and whether you know what process they went through to produce the data. The next one is timeliness. That's about how quickly you can get the data; if you're looking for the latest information, apparently you want something that's up-to-date, rather than having to wait a full year before you get a particular type of information. All of those aspects relate to the overall dataset: various general aspects of what kind of dataset you're dealing with. The other metrics are more about the specific data values. If you go into the dataset, one question you may ask is how accurate those data points are, or how likely you are to see errors in your dataset. That's the general understanding of the accuracy of the specific data.
The next one is consistency. This is particularly relevant when you're looking at multiple data points, multiple dimensions, or multiple data sources, because consistency is about more than one thing and whether they agree with each other or not. The next one is precision. Precision is generally about how precise the values can be. This relates to accuracy, but it's slightly different. Think about sensors. When you get your sensor readings, on one side your sensor might be making errors; that's the error or accuracy part. On the other side, different sensors may have different sensitivities, meaning how precisely those sensors can report values can also vary. For example, expensive sensors may give you very precise readings, while cheaper, more convenient ones may not give you information that is as precise. A related metric is granularity. This is about the resolution of the data you can get. Think, for example, about spatiotemporal datasets. On the spatial scale, am I talking about city level, country level, or meter-level granularity? Or kilometers? Those are important spatial granularities. On the temporal side, do I have minute-level, hourly, weekly, or monthly data? All of those are very relevant, depending on what you need for your specific types of applications. Another metric is completeness. That basically asks: you're getting certain types of data, but how complete is it? How reasonably sure are you that you actually have a good dataset that captures all of, say, your customers' transactions, or at least a good sample of them?
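As a small illustration of the temporal granularity idea, here is a sketch (with made-up minute-level readings) of coarsening data to hourly averages:

```python
from statistics import mean

# Hypothetical minute-level readings as (minute_offset, value) pairs.
readings = [(0, 10.0), (30, 12.0), (60, 20.0), (90, 22.0), (150, 30.0)]

# Coarsen to hourly granularity: group readings by hour, then
# average each group.
by_hour = {}
for minute, value in readings:
    by_hour.setdefault(minute // 60, []).append(value)
hourly_means = {hour: mean(vals) for hour, vals in sorted(by_hour.items())}
```

Going the other way, from coarse to fine granularity, is generally not possible, which is why you want to collect data at the finest resolution your application may need.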
As you can see, those data quality metrics cover different perspectives, and you don't necessarily have to consider all of them for every application scenario. But you do need to consider which ones are more important, or more relevant, for your particular scenario. Keeping this in mind will always help you look at the dataset and identify potential data quality issues. Then, of course, you can proceed to address those issues. Let's consider some really common issues in real-world data. The starting point is that real-world data are messy in many different ways. The first issue is incomplete data. That just means you don't have all the information you need: there are some missing values or missing attributes in your dataset. Usually one main step to consider when you look at your data and preprocess it is addressing such missing-value scenarios. The next one is noisy data. That means there may be errors, outliers, or imprecision; those are potential issues you may need to deal with in your particular dataset. As mentioned earlier, your sensor readings may just not have the specific precision you need. That is one case, but there may also be plain errors: you may have an age of minus 10, and apparently that is an error. The other issue is inconsistency. As I said, this is where you're comparing across different attributes or different data sources; individually they may all look reasonable, but when you put two and two together, you realize something's not right. Think about my example here. If you look at the age information, you may have a list of ages that are all positive discrete values, and that looks right. Then you also have the birthday information, which also seems reasonable.
But when you put them together, you know that for a particular person, the age and the birthday should obviously be consistent, aligned with each other. The other example is rating scales. You may have seen various kinds of survey ratings, let's say from one to five. But in one scenario, somehow one seems to be the highest rating, while in another scenario, people have been using five as the highest rating. Those are potential inconsistencies. Now the question is: we are all trying to make good use of our datasets, but why are we still seeing so many issues in real-world data? What are the potential causes? Think about the general process of how data is collected. The data is transmitted somewhere, it may be shared among certain parties, and then various kinds of processing may take place on the dataset. If you think about all those different steps, you can ask: what can go wrong? Think about the human involvement. Sometimes you need a human to enter the data or to preprocess some of the data, and apparently there can be human errors. But also think about all the possible hardware failures or software errors. All of those contribute to potential issues you can see in the dataset. Another big factor is simply change over time. When you get new sensors, the new sensors and the older ones may give you different types of information. Or if you're designing a survey, the first iteration of the survey may have slightly different questions, or you're adding new questions in the second survey. There are all kinds of scenarios like that: when things change over time, you tend to see various issues popping up in your dataset. Now let's start with the first component in data preprocessing. This is cleaning: you say, okay, I have my dataset and I have some issues with it.
Of course, I want to try to clean up those issues. If you have incomplete data, you say, okay, I need to first identify the missing values, and then hopefully I can find a way to either remove them or even fill in those missing values. Then if you have noisy information, if your data looks noisy, what can you do? You can try to smooth out your noisy data, or alternatively, you can identify the outliers and, knowing that they may be errors, fix them or remove them. If the data is inconsistent, then of course you first need to identify that there are inconsistencies in your dataset, and once you know what the inconsistency is, you try to hopefully resolve it. Let's look specifically at how you would address each of those scenarios. First one: incomplete data. This basically says: I have my dataset, and I typically know how many objects I should have and how many attributes I'm looking at. You would expect all the fields to be entered, so you have all the information. First, of course, you can quickly check whether you have all that information. If a value is just not there, it's empty, then you say, okay, that's incomplete. When you have incomplete data, what can you do? The first option is simple: you can just remove the objects or attributes that are incomplete. That basically says: as you go through your objects, if an object is missing one or multiple attribute values, maybe I just don't include that object in my dataset. Or, looking across attributes, I may find that most of my attributes are actually complete, but a few attributes are missing some values. In this case, you don't have to remove all the objects just because they have a missing value.
You can actually keep all the attributes that have complete values, and just remove the few attributes that are missing something. That's easy, and it's actually widely used in many real-world settings. But what's the potential problem with that? Depending on your particular dataset, you may find that doing this removes the majority of your dataset if there are a lot of missing values. In that case, maybe I shouldn't just remove them altogether; instead, I'll try to fix some of them. I try to figure out what those missing values should be and then fill them in. To do that, the first option is a manual process: you look at your dataset, you see where things are missing, you manually examine what is missing, hopefully provide a good guess of that information, and then fill it in manually. That works, and it is used regularly, but as you can see, this is not a scalable solution. If you have a large dataset and quite a few missing values, you cannot have a person manually examine all of that and fill in the values. More typically, we're talking about automated methods: I want an automated method that examines all the missing values and tries to provide a good guess for each. Now the question is, what is the best guess? What would you do if you were writing this automated approach to fill in missing values? There are some default-value scenarios. You can say, I'm using a global constant: if it's a numerical value, whenever I see a missing value I put in zero. That may work, but it may not be good; depending on the particular attribute, zero may actually not be a good value to use.
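A minimal sketch of these missing-value options, using a made-up table of (age, income) records where None marks a missing value, might look like this (the "student"/"worker" labels for the class-mean refinement, discussed next, are also made up):

```python
from statistics import mean

# Made-up (age, income) records; None marks a missing value.
rows = [(25, 50000), (31, None), (29, 42000), (None, 61000)]

# Option 1: remove any object (row) with a missing value.
complete_rows = [r for r in rows if None not in r]

# Option 2: fill missing ages with a global constant, e.g. 0.
ages = [a for a, _ in rows]
filled_const = [a if a is not None else 0 for a in ages]

# Option 3: fill with the attribute mean over the observed values.
age_mean = mean(a for a in ages if a is not None)
filled_mean = [a if a is not None else age_mean for a in ages]

# Option 4: class mean, averaging within each object's own group
# (hypothetical class labels).
groups = ["student", "student", "worker", "worker"]
def group_mean(g):
    return mean(a for a, grp in zip(ages, groups) if grp == g and a is not None)
filled_class = [a if a is not None else group_mean(g)
                for a, g in zip(ages, groups)]
```

Note how option 1 throws away half the rows here, which is exactly the problem with removal when missing values are common.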
Then you may say, I can use the attribute mean, because for a particular attribute, for example age, zero may not be good, but if I know the average age of my population, I can use that as a default value. That works. One step further, you can consider not only the attribute mean, but also the class mean. What that means is that you're not only looking at that particular attribute, but also at which classes your objects belong to. For example, think about high-school students versus college students versus a senior group. You could use the aggregated average, but that's not as precise. Instead, you look at that particular age attribute and also at which group each object belongs to, and then you have a better-tailored average. That class mean would again be more precise, if you know which classes you have. These are automated but fairly straightforward approaches; they're not particularly, say, dynamic with respect to your dataset. There are other methods that actually estimate the value; as you will see later on, when we learn the different data mining tasks and techniques, you can use those to estimate the best value. You can use regression, or kNN, that's k-Nearest Neighbors: you look at which objects are most similar to your particular object, and then, since they are similar, you can use, say, your nearest neighbors' values to fill in the missing value. There are also probabilistic mechanisms for estimating the most likely value for that particular attribute. Next, let's look at how you would de-noise noisy data. You can usually detect noisy data through visualization or, as we discussed for data understanding, by looking at the distribution of your data. In the example shown here, you have those specific black dots.
Those are the specific data points you have. As you can see, they're roughly aligned, but there are some fluctuations. In this case, if you have a reasonable understanding of your dataset, some regression may actually work really well. You use a regression function and say, well, my data should follow this. For example, if you expect your dataset to have a roughly linear relationship, then you can use a linear regression model to find a better fit, so that you can smooth out the noisiness in your dataset. Another approach is clustering. As we mentioned earlier, clustering basically tries to look for similar objects: similar objects fall into the same cluster and dissimilar objects fall into different clusters. But here in particular, you're looking for things that don't fit well into any of those clusters; that means they are outliers. If you look at my example here, I have reasonably centered clusters: the green ones, the blue ones, and the red ones, roughly. But you can see there are actually quite a few red points that are really spread out all over the place. With that clustering approach, you can easily identify data points that may just not be good and should be removed, while the others may be smoothed out within their cluster. Now the third piece: we talked about incomplete data and noisy data; the other issue is inconsistent data. Here, as we said, the issue is that when I look at, say, multiple attributes or multiple data sources, I see them being inconsistent. The age and birth date information was just inconsistent. Many of those issues can be detected through your semantic understanding; that's also tied closely to the interpretability of your dataset. Because if you know what you're looking at, then you can interpret the data and identify scenarios where things should be consistent.
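Going back to the smoothing idea for a moment, here is a minimal least-squares sketch, assuming made-up readings that should follow a roughly linear relationship:

```python
from statistics import mean

# Made-up readings that should be roughly linear, with some noise.
xs = [0, 1, 2, 3, 4, 5]
ys = [0.3, 2.2, 3.6, 6.4, 7.9, 9.6]

# Fit y = a*x + b by ordinary least squares.
mx, my = mean(xs), mean(ys)
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Smooth: replace each noisy reading with its value on the fitted line.
smoothed = [a * x + b for x in xs]
```

This only works well when the linear assumption is reasonable; if the true relationship is nonlinear, you would fit a different model.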
If you know this is the age column and this is the birth date column, apparently you can easily check whether they're consistent. If you know this is a rating, then there should be a clear definition of the scale: you're using one through five, and which end is the highest and which is the lowest. If you have that information, then you can easily compare, identify those inconsistent scenarios, and also fix them. For example, if one rating scale is one-to-five and the other is five-to-one, then all you need is a reordering and relabeling. Those are the more semantics-based scenarios, while in other cases you can use a pure data-driven approach to identify scenarios where things that should be correlated are not, or the other way around. As we mentioned earlier, statistical analysis and visualization are actually very useful here, because if you just visualize some of your data or look at the overall distribution, you get a quick view that something doesn't look right. Those are mechanisms you can leverage to identify potentially inconsistent data. So far we've talked about data cleaning; data cleaning addresses specific issues in your dataset. Another related part is data integration. This is particularly relevant for data mining because, with data mining, you are typically leveraging data from various sources, so you need to integrate data across multiple sources. In this process, you should particularly look for issues that may arise in terms of consistency, because different data sources may have different ways of managing their data, collecting data, or naming their attributes. All of those are important things for you to figure out so that you can combine the datasets. One important aspect to consider here is entity identification.
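The rating-scale and age/birth-date fixes just described can be sketched like this (the scales, the dates, and the fixed "today" are all made-up assumptions):

```python
from datetime import date

# Fix 1: align rating scales. Assumption: source B uses 1..5 with 1
# as the best rating, so its values must be flipped to match a
# 5-is-best convention before combining.
def flip_scale(rating, low=1, high=5):
    return high + low - rating

ratings_b = [1, 2, 5]
aligned_b = [flip_scale(r) for r in ratings_b]   # 1 becomes 5, etc.

# Fix 2: a crude age vs. birth-date consistency check, using a
# fixed, made-up "today" so the result is reproducible.
def age_consistent(age, birth, today=date(2020, 6, 1)):
    years = today.year - birth.year - \
        ((today.month, today.day) < (birth.month, birth.day))
    return years == age
```

A record failing the second check is flagged, not silently fixed: you still need to decide whether the age or the birth date is the wrong one.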
That means if you're talking about different customers, and you now have information from multiple sources, you want to know which records refer to the same customer; you may have a customer ID, or a customer name and date of birth, that allows you to map customers occurring across the multiple datasets. Or it could be a particular product or a particular term. For all those things, you need to figure out that they actually represent the same entity so that you can combine them. This also gets to the point about redundant data, because the multiple data sources may be keeping redundant information. Integration is actually also a good opportunity to identify such redundancy, for example through correlation analysis. We will briefly talk about correlation analysis here, and later on we'll come back to it, because correlation analysis is a big part of data mining tasks. If you want to identify correlations, remember earlier we talked about the scatter plot, right? You can just plot x and y and see whether they're roughly correlated, positively or negatively, or not correlated at all. For numerical values, you can typically use the notion of the correlation coefficient. What you're calculating is this: you have two attributes, a and b, and you want to see whether these two attributes are correlated. You calculate the mean of each attribute, that's a-bar and b-bar. Then you go through the individual values and look at the difference of a_i from a-bar, that's the relative difference to the mean for attribute a, and likewise the relative difference for attribute b. Then you can see whether they have correlated changes, meaning they're both above their means together (positive) or one is above while the other is below (negative). Using that, you can calculate the correlation coefficient. Here I have three different examples.
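Before looking at those examples, the coefficient itself (Pearson's r) can be sketched with made-up attribute values:

```python
from math import sqrt

# Pearson's correlation coefficient between two numeric attributes:
# covariance of the deviations from the means, normalized so the
# result always falls in [-1, 1].
def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a) *
               sum((y - mb) ** 2 for y in b))
    return num / den

# Made-up attribute values.
A = [1, 2, 3, 4, 5]
B = [2, 4, 6, 8, 10]   # moves with A: r should be +1
C = [10, 8, 6, 4, 2]   # moves against A: r should be -1
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate no linear relationship.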
First, look at the positive correlation. That basically says that if one attribute has a higher value, the other attribute also tends to have higher values; you tend to see this upward relationship. Then, of course, there may be no correlation: in the middle example, the points are really spread across the space, with no particular positive or negative trend. The third one is the downward relationship: if your A is increasing, you see B decreasing. As you can see, you can use scatter plot visualizations to identify such correlations, but the quantified measure, the correlation coefficient, gives you a specific measure of how positively or negatively correlated, or uncorrelated, two attributes are. That's for numerical values, but if you're talking about nominal attributes, you don't have continuous values to work with. Instead, we use a categorical comparison. Think about two attributes, A and B, where A has, say, two different values and B has two possible values. You want to see whether A and B are related or correlated. For this we typically use the chi-square test. What it's doing is looking across the different possible values of A and B and examining the likelihood of those values occurring together. There are two important notions here. One is o_ij, the observed count of values i and j occurring together: across all the possible values of A and all the possible values of B, how often do I see a_i and b_j occurring together? In our case, you can see there are four possible combinations, two-by-two: you can have the observed count of A being a1 and B being b1, and similarly a2 b1, a1 b2, and a2 b2. Those are all the combinations you can observe. That's the observed count, but then you compare it to the expected count, e_ij.
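The observed counts o_ij and the expected counts e_ij (computed here under the independence assumption explained next) can be sketched on a made-up two-by-two table:

```python
# Made-up 2x2 table of observed co-occurrence counts o_ij.
observed = [[30, 10],   # (a1,b1), (a1,b2)
            [10, 50]]   # (a2,b1), (a2,b2)

total = sum(sum(row) for row in observed)
row_sums = [sum(row) for row in observed]
col_sums = [observed[0][j] + observed[1][j] for j in range(2)]

# Chi-square statistic: sum of (o_ij - e_ij)^2 / e_ij, where e_ij
# is the count we would expect if A and B were independent.
chi2 = 0.0
for i in range(2):
    for j in range(2):
        e = row_sums[i] * col_sums[j] / total
        chi2 += (observed[i][j] - e) ** 2 / e
```

For a 2x2 table (one degree of freedom), a statistic well above the 3.84 critical value at the 5% level suggests A and B are not independent.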
Here, e_ij is the count you would expect if A and B occurred independently. You can say, okay, if they're independent, if they have no correlation, I would expect to see this number of co-occurrences. Now you compare your observed counts to the expected counts. Intuitively, if your observed counts and the expected counts are similar, meaning the squared differences are small, that supports the idea that they really are independent, because what I'm seeing matches what I'd expect assuming independence. But if you see a big difference, meaning the observed counts deviate significantly from the expected counts, then maybe they are not independent; there is some relationship, some correlation, across these attributes. That's just the high-level intuition and a quick introduction to correlation analysis; we will come back later and talk a lot more about how to use correlation analysis for various kinds of data mining tasks. Now, one quick question we want to ask: does correlation imply causality? Correlation is basically what you can see from the data: I tend to see that if A occurs, B is more likely to occur, or if A occurs, B is less likely to occur. Causality refers to the question: you know they're correlated, but is one causing the other or not? Think for yourself for a little bit about various scenarios that are likely to be correlated. To what extent do you think correlation may imply causality? Is your answer yes or no? Now let's look at some specific examples. The first one: sleeping with one's shoes on is strongly correlated with waking up with a headache. Here, you may have collected data about people reporting having a headache in the morning when they wake up, and also whether they had their shoes on when they went to bed the night before.
If you have that dataset, your data may actually show you a pretty strong correlation. You say, yes, they are strongly correlated. But in this case, is it causality? Meaning: if you put your shoes on when you go to bed, will you have a headache in the morning? Is wearing shoes to bed causing the headache? That's one. Let's think about another one: the more firemen fighting a fire, the more damage there is going to be. Again, if you look at the historical data of various fires that have happened, and you look at the fire damage along with how many firefighters were actually deployed to fight each particular fire, you will again see a strong correlation: more firefighters on the scene, higher damage. Now the question is: does the fact that more firemen are there actually cause the larger damage? I hope your answer is becoming clear now. One more example: as ice cream sales increase, the rate of drowning deaths increases sharply. In this case, if you do your correlation analysis, you will see from your dataset a strong correlation: when ice cream sales increase, you also see drowning deaths increase. Again you can ask: I see a correlation, but are the ice cream sales causing the drownings? I hope your answers to all three questions are clear. Of course, these are extreme examples, but I hope they really convey this key point: correlation does NOT imply causality. That is very important, especially in data mining. If you're dealing with a lot of data, when you're doing correlation analysis, you're likely to see some very strongly correlated cases.
We really need to be careful not to infer causality from correlation, because, as I said, for those extreme examples it may be easy to say, I would never think there is causality there; but for more plausible, common, or subtle scenarios, if you're not careful, you may start thinking in terms of causality. That's very important to keep in mind: correlation does not imply causality.