This week we've talked about how using all the features can be a good starting point to build a machine learning model. However, real-world data is very large and complex. We often have too many features, or dimensions, so is it still a good idea to use all those features? Not really. Think about those scatter plots we keep showing you with data plotted in two-dimensional space. We keep using two-dimensional examples because they're so easy to picture. Three dimensions? Sure, we can do that, but look: where before we had a flat square, now we have a whole box, a three-dimensional box, where all those data points can live. It's bigger. Obvious observation, right? Yep, but there are profound consequences. As the dimensionality of the data increases, the volume of space where our data points can live increases as well. Ten data points were pretty reasonable when we were looking at a square, but each new dimension drastically increases the space. In fact, each new dimension exponentially increases the amount of space we're trying to fill with our training data. If we don't get a corresponding increase in the number of data points, we're obviously going to have less and less coverage. In other words, as the dimensionality of our data increases, our coverage gets more and more sparse, and for most statistical or machine learning modeling techniques, sparsity is a huge problem. The amount of data needed to maintain the same level of performance increases exponentially as we add feature dimensions. Most of the time, when we look for patterns in the data, we want to capture areas with high density, that is, groups of data points having similar properties. However, with high dimensionality those points just get more spread out. This makes them look really different in the higher dimension, even if they truly lie near or on some lower-dimensional subset of the space. This phenomenon is popularly referred to as the curse of dimensionality.
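To make that sparsity concrete, here's a minimal sketch (a toy calculation of our own, not from the lecture): cover each axis of the unit hypercube with 10 bins, count the grid cells, and see what fraction of them a fixed budget of data points could possibly occupy.

```python
def cells_needed(dims, bins_per_axis=10):
    """Grid cells in a dims-dimensional unit hypercube at fixed resolution."""
    return bins_per_axis ** dims

def max_coverage(n_points, dims, bins_per_axis=10):
    """Upper bound on the fraction of cells that n_points could occupy."""
    return min(1.0, n_points / cells_needed(dims, bins_per_axis))

# With 1,000 points: full coverage is conceivable in 2 or 3 dimensions,
# but by 6 dimensions the same points touch at most 0.1% of the cells.
for dims in (2, 3, 6, 10):
    print(dims, cells_needed(dims), max_coverage(1000, dims))
```

Each added dimension multiplies the number of cells by 10 here, which is exactly the exponential blow-up in space, and the corresponding loss of coverage, described above.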
There are some algorithms that are more affected by the curse of dimensionality than others. For instance, distance-measuring algorithms like KNN or k-means are greatly impacted, since adding dimensions to the data literally increases the distances between examples. On the other hand, in random forest algorithms, individual trees look at subsets of features at a time. They can ignore the vastness of empty space, and for better and for worse, this focus can make it easier to optimize each tree. So they don't feel the curse of dimensionality to the same extent as distance-based algorithms, but there's no getting away from the fact that more features means exponentially more space to explore. In a perfect world we could identify exactly the minimum feature set we need for the model to perform well; then the machine learning model is not as complex and is easier to interpret. In the next lesson we'll talk about some techniques to filter out low-impact features, such as chi-squared analysis, but we'll get to that later. When working on day-to-day problems, domain knowledge is one of the most important factors in finding critical features that are most likely to impact model performance. It's one of the reasons we stress the importance of advice from domain experts on the data features, and awareness of the business process overall. By just looking at the data you can consider things like correlation between features. It's not always bad to have correlated features, but it's also not always good. Of course, it really depends on the specifics of the problem and the learning algorithm, in terms of the number of features and the degree of correlation, and, of course, the curse of dimensionality. Let's take a look at a sample data table with four features and a label.
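Before looking at that table, here's a quick numeric preview of what "correlation between features" means; this `pearson` helper is our own plain-Python sketch of the quantity such heat maps are built from (the next lesson covers Pearson's correlation properly).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

a = [1, 2, 3, 4, 5]
print(pearson(a, [2, 4, 6, 8, 10]))   # perfectly positive, value near 1
print(pearson(a, [5, 4, 3, 2, 1]))    # perfectly negative, value near -1
```

Values near zero would mean the two features move independently of each other.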
The correlated features are visualized on the right using a heat map. Features that are highly positively correlated will have values close to 1, features that are highly negatively correlated will have values close to negative 1, and uncorrelated features will have values close to zero. We'll look in detail at a specific correlation technique, called Pearson's correlation, in the next lesson. Lastly, is there some rule of thumb that states how many features one should use given a fixed number of data points? Well, of course, there isn't a single answer that works in all situations. For instance, in the case of a linear model with uncorrelated features, an equal number of features and data points can be sufficient, although I'd be pretty uncomfortable, personally, in that situation. You usually want way more than that; after all, in our two-dimensional examples we've been showing you way more than two points just to explain the concepts. If you have more features than data points, you're in a very particular learning regime: it's underconstrained. You couldn't possibly solve for a linear model in this case; there's just not enough data to tell you what all the weights should be. But we do see this scenario in bioinformatics and precision medicine: a great study may have thousands of participants, but genetic information is quite a bit more complicated than that. Graphical models, Bayesian learning methods, and dimensionality reduction are all possible ways of dealing with this, but without a thorough grounding in those techniques, I would look for a different problem to solve, one where you do have more examples than features of interest, or rely heavily on domain expertise to get down to a reasonably sized feature space. So there you have it: the curse of dimensionality explains how the cold vastness of a large feature space makes learning intrinsically more difficult.
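That vastness can actually be measured. As a small standard-library sketch (the helper name and defaults here are our own, not the lecture's), this estimates the average Euclidean distance between uniform random points in the unit cube, which keeps growing as dimensions are added, the very effect that hurts KNN and k-means.

```python
import math
import random

def mean_pairwise_distance(dims, n_points=100, seed=0):
    """Monte Carlo estimate of the mean Euclidean distance between
    uniform random points in the dims-dimensional unit cube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(pts[i], pts[j])))
            total += dist
            pairs += 1
    return total / pairs

# The same 100 points drift farther apart as dimensions are added.
for dims in (2, 10, 100):
    print(dims, round(mean_pairwise_distance(dims), 2))
```

With the point budget held fixed, every added dimension pushes neighbors farther apart, so "nearest" neighbors become less and less meaningful.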
The right number of features for a problem really depends on the data, the techniques, and what knowledge you can bring to bear. Now that we understand the basics of feature engineering, we're going to get into more sophisticated methods for creating and choosing features. See you there.