In this lesson, you're going to learn how to normalize and discretize data. As we said, data can be noisy, and in order to handle the noise in the data, we can transform it globally in two main ways. One is to reduce the grain of the data, from fine grain to coarser grain; this is called discretization. For example, going from numeric to nominal. The other main way is to change the scale or range of the data; this is called normalization. It might also be necessary to discretize in order to apply certain data analytics models and methods, because some prediction methods require a nominal attribute instead of a numeric, continuous attribute.

So let's first see how to normalize data. Normalization consists in changing the scale of the data, which matters when you have data of mixed scales. For example, you may have merged data from different data sources. We've talked about merging clinical data with gene expression data in the same dataset; in this case, you're going to have data of mixed scales. Many data analytics methods generally don't behave very well with attributes on very different scales, and you want to deal with that. For example, age and income may have widely different ranges. It is frequent to scale all data into the range [-1, 1] or [0, 1], so that all data values fall within these bounds, and to accomplish that you normalize your data. Generally, data are scaled into a smaller range. The main methods include min-max normalization, z-score normalization, and decimal scaling.

Min-max normalization transforms data from a range [min, max] into a new range [min', max'] using the formula val' = (val - min) / (max - min) * (max' - min') + min'. So the new value is the original value minus the minimum of the original range, divided by the maximum minus the minimum of the original range; you multiply this ratio by the new maximum minus the new minimum, and at the end you add the new minimum.

Example: we have age values between 0 and 150 (to be sure to include everyone), and we want to normalize them into [0, 1].
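As a quick sketch (not part of the lesson itself; the function name min_max_normalize is our own illustration), the min-max formula can be written in a few lines of Python:

```python
def min_max_normalize(val, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map val from [old_min, old_max] into [new_min, new_max]."""
    ratio = (val - old_min) / (old_max - old_min)
    return ratio * (new_max - new_min) + new_min

# Age 50 in [0, 150] mapped into [0, 1]:
print(round(min_max_normalize(50, 0, 150), 2))  # 0.33
```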
Intuitively, you see that someone of age 50, in the range [0, 150], is about one-third of the way through the range. So intuitively you can say: if I map that into the range [0, 1], the age of this person will be about 0.33. You can also use the formula to calculate it. You take the value, 50, minus the minimum of the original range, 0, and divide by 150 minus 0, the original maximum minus the original minimum. You multiply this ratio by the new maximum minus the new minimum, which is 1 minus 0, and at the end you add the new minimum, which is 0. That factor of course gives 1 and the added term gives 0, so it just ends up being 50 divided by 150, which is about 0.33. And of course, in statistical programs and libraries you're going to have functions that do this automatically.

The second method is z-score normalization. This one is used quite a lot. The new value val' is the original value val minus the mean, divided by the standard deviation: val' = (val - mean) / stddev. Example: you want to normalize age values in [0, 150], and you know that the mean age in the population you are using is 36.8 and the standard deviation is 12. What would be the new value of age 50? It would be val' = (50 - 36.8) / 12, which is 1.1. Again, you have functions to take care of this automatically.

One of the goals of normalization, and particularly of z-score normalization, is not only to change the scale or range of the data: this particular method also centers and rescales the data so that it has mean 0 and standard deviation 1, like a standard normal distribution. A lot of statistical methods have requirements about the shape of your data, and they perform better if the data is standardized, for example. And what we see here is called a bell-shaped curve; it represents the normal distribution of the data.
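The z-score calculation above can likewise be sketched in Python (again an illustrative helper of our own, not the lesson's code):

```python
def z_score_normalize(val, mean, stddev):
    """Standardize val to (val - mean) / stddev."""
    return (val - mean) / stddev

# Age 50, with population mean 36.8 and standard deviation 12:
print(round(z_score_normalize(50, 36.8, 12), 1))  # 1.1
```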
You see that 95% of the data is in the gray area here. The main characteristic of this curve is that it is centered at 0, and as you can see, it is symmetric around 0. The interval between the mean minus one standard deviation and the mean plus one standard deviation contains about 68% of all measurements. If you go two standard deviations below and above the mean, you get 95% of the values, and if you go three standard deviations below and above, you arrive at 99.7% of the values. So the effect of z-score normalization is that your data are all going to be standardized: centered at zero, with unit standard deviation, and close to this particular shape if the original data were roughly bell-shaped. That's the reason why it's often used, particularly if you intend to use some statistical data analysis method after that.

Another method for normalizing data is called decimal scaling. The new value val' is equal to the original value divided by 10^n, where n is chosen such that the largest |val'| is less than 1. This formula transforms the values into the interval (-1, 1) if there are negative values, and into [0, 1) otherwise. Example: normalizing age values in [0, 150]. We want the highest age to be less than 1, so again we project into the interval [0, 1). Therefore, we need to divide by 1,000: if we divided by 100, 150 would become 1.5, which is not below 1, so we need to move up to 1,000, which is 10^3. All the values will be divided by 10^3, so for example age 50 will become 0.05.

Now, how do these methods compare? The method that best preserves the original data distribution is decimal scaling; it preserves, more than the others, the original shape of your data. It acts similarly to image resizing in photo editing software: you shrink, you magnify, but your data is pretty much intact.
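The decimal scaling example above can be sketched as well. Here we pick n from the largest absolute value in the data, as described; the helper decimal_scale and its log10-based choice of n are our own illustration, not code from the lesson:

```python
import math

def decimal_scale(values):
    """Divide every value by 10**n, with n the smallest power
    such that the largest absolute scaled value is below 1."""
    largest = max(abs(v) for v in values)
    n = math.floor(math.log10(largest)) + 1 if largest > 0 else 0
    return [v / 10**n for v in values]

# Ages in [0, 150]: n = 3, so every value is divided by 1,000.
print(decimal_scale([50, 150]))  # [0.05, 0.15]
```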
Z-score normalization is the most used, because the resulting data are standardized to mean 0 and standard deviation 1, which is advantageous with certain statistical methods; however, it does not preserve the original location and spread of the data. The advantage of min-max normalization is that it can accommodate any new range we want, not only [0, 1] and [-1, 1] like the other ones. And this concludes our lesson on data normalization.