It's really great to dig into data by visualizing it in different ways. But are there any techniques that give us a quantitative measure of which variables should help us in our task and which would not? Well, in fact, there are many ways to do that. Let's take a look at some of them. There are many tactics for selecting features for a predictive model. Here are some you might already be familiar with. We already implemented correlation analysis in the previous video to remove collinear variables. Another thing we should be interested in is finding variables that have different distributions between our two classes, so that we can use them in the classification task. There are also other ways to choose significant variables, like stepwise selection of variables based on some criterion, such as the F statistic of the model or an information criterion like the Akaike criterion, or regularization, which is also a very powerful technique. You can learn about those from appropriate machine learning courses, which are listed in the lecture notes in our additional reading section. What is important to know here is the t-test and the analysis of variance, also referred to as ANOVA, because these are most commonly used to determine whether two sets of data are significantly different from each other. In our case, we are checking whether a particular feature of an ECG complex can help us recognize noisy complexes. We assume that our variables have a normal distribution in order to use this analysis with the mean value as the statistic. If we find that the difference is significant, we take the variable for further analysis. In our case, we are dealing with one-factor multidimensional ANOVA. Given the above, it can be said that the t-test is a special case of ANOVA that can be used when we have only two classes and want to compare their means.
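The claim that the t-test is a special case of ANOVA for two groups can be checked numerically: for two groups, the one-way ANOVA F statistic equals the square of the pooled-variance two-sample t statistic. A minimal sketch in Python (pure standard library; the two groups below are made-up illustration data, not the ECG features from the lecture):

```python
from statistics import mean

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal-variance t-test)."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sp2 = (ssa + ssb) / (na + nb - 2)          # pooled variance
    return (ma - mb) / (sp2 * (1 / na + 1 / nb)) ** 0.5

def anova_f(groups):
    """One-way ANOVA F statistic: between-group MS over within-group MS."""
    all_x = [x for g in groups for x in g]
    grand = mean(all_x)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_x) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

a = [4.1, 3.9, 4.5, 4.2, 3.8]   # group 1 (illustrative)
b = [5.0, 5.3, 4.8, 5.1, 5.2]   # group 2 (illustrative)
t = pooled_t(a, b)
F = anova_f([a, b])
print(abs(F - t * t) < 1e-9)    # F equals t^2 for two groups
```

Because F = t², the two tests always give the same p-value in the two-class setting, which is exactly what we observe later when running both in MATLAB.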
Within-group variation, sometimes called error variance, is a term that refers to variation caused by differences within individual groups or levels. In other words, not all the values within one group are the same; these are differences not caused by the independent variable. In our particular case, we have only two classes, so we could get along with just a t-test. Nevertheless, ANOVA can be handy in many situations where a t-test cannot be used. Here is an example of the output of the anova1 function in MATLAB. On the left, we see a number of statistics: SS, which is the sum of squares; df, the degrees of freedom; MS, the mean square, which is actually SS divided by df for each source of variation. Then we also have the F statistic, which is the ratio of the mean squares, and the last one, probably the most important, is the so-called p-value, or probability value. It is also often called asymptotic significance, and it is the probability, assuming that the null hypothesis is true for a given statistical model, of observing a result at least as extreme as the one we got. In our case, the null hypothesis is that the means of the two compared groups are the same. A small p-value, where people usually take 0.05 as the threshold, indicates strong evidence against the null hypothesis, so you reject the null hypothesis; otherwise, a large p-value, more than 0.05, indicates weak evidence against the null hypothesis, so you fail to reject it. In other words, if we see that the p-value is very small, the characteristic we are looking at has a different mean value in the two groups belonging to the two different classes. On the right, there is a boxplot, slightly different from the one we've seen before. This time it is a notched boxplot, which also shows the difference between the confidence intervals of the medians of the two groups. If you want to learn more about this kind of analysis, please refer to the very nice books on statistics that we've provided in the reference section of our lecture notes.
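The quantities in that anova1 table can be reproduced by hand. Here is a sketch in Python (standard library only; the two groups are invented illustration data) that computes SS, df, MS, and F, and checks the table identity SS_total = SS_between + SS_within:

```python
from statistics import mean

groups = [[2.0, 2.4, 1.9, 2.2], [3.1, 2.9, 3.4, 3.0]]   # illustrative data
all_x = [x for g in groups for x in g]
grand = mean(all_x)

# Sum of squares for each source of variation
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)  # group effect
ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)  # "error" row
ss_total = sum((x - grand) ** 2 for x in all_x)

# Degrees of freedom
df_between = len(groups) - 1
df_within = len(all_x) - len(groups)

# Mean squares: MS = SS / df, then F is the ratio of the mean squares
ms_between = ss_between / df_between
ms_within = ss_within / df_within
F = ms_between / ms_within

print(round(ss_between, 5), round(ss_within, 5), round(F, 2))
```

The within-group sum of squares here is exactly the "error variance" discussed above: variation that the class label does not explain.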
Now we are ready to perform an analysis of variance in order to select useful features in our dataset. Let's load the data first, and then let's try to do both the t-test and ANOVA for every variable in the dataset and compare the results. To do so, we need to write a simple for loop. So, let's try that. It should iterate over the features in our dataset. Let's display each iteration. Okay. Then we can use the t-test function as well as the ANOVA function; in our case, anova1 and ttest2. These two functions have slightly different notations, so the arguments are going to be a little bit different, but in fact they do the same thing in our setting. So, let's take a look at that. You see that the arguments of ttest2 are the two arrays that we want to compare, passed as the first two arguments, while in anova1 we first have to specify the dependent variable and then, as the second argument, the column with information on classes, the column with labels. We specify the third argument as 'off' to suppress the display output and save some time here. Let's just display, for each feature, the p-values that we got from this analysis. Before we run this script, let's take a look at the help, and we see why it is called ttest2: it's because we use a two-sample test in this case. Here is the order of its outputs. We can get the p-value from here; let's call it p1, because we are most interested in the p-value now, since we want to find a significant difference. And anova1 is called that because it is a one-way analysis of variance. It has slightly different outputs; let's take them also. Actually, we have to note that the order of the outputs is different. Let's call this one p2 and display both of them. Okay, we see here that we actually get the same results. So, this is the number of our column, and these are the p-values that we got from the ttest2 and anova1 functions.
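For readers following along outside MATLAB, the same per-feature loop can be sketched in Python, assuming scipy is available: scipy.stats.ttest_ind plays the role of ttest2 and scipy.stats.f_oneway the role of anova1. The feature matrix and labels below are made-up illustration data, not the ECG dataset from the lecture:

```python
from scipy.stats import ttest_ind, f_oneway

# rows = observations, columns = features; y holds the two class labels
X = [
    [1.0, 5.1], [1.2, 4.9], [0.9, 5.3], [1.1, 5.0],   # class 0
    [1.0, 7.2], [1.3, 7.0], [0.8, 6.8], [1.1, 7.1],   # class 1
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

p_ttest, p_anova = [], []
for j in range(len(X[0])):                      # iterate over feature columns
    a = [row[j] for row, lbl in zip(X, y) if lbl == 0]
    b = [row[j] for row, lbl in zip(X, y) if lbl == 1]
    _, p1 = ttest_ind(a, b)                     # two-sample t-test (like ttest2)
    _, p2 = f_oneway(a, b)                      # one-way ANOVA (like anova1)
    p_ttest.append(p1)
    p_anova.append(p2)
    print(j, p1, p2)                            # the two p-values agree
```

With two classes the two columns of p-values match, mirroring what we see in the MATLAB output; the second (well-separated) feature gets a tiny p-value, the first does not.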
We see that when the p-value looks like this, with e to the power of minus 20, it is definitely a very significant variable. This variable is so significant that its p-value is not even shown as an exponential number; it is really just zero to numerical precision. We also see that there are variables that are not that significant, like this one. Well, 0.1 might be significant in some cases, but this one is definitely not; it is really close to one. This one is also, I would say, not that significant, and variable 15 as well. So, actually, if you removed these variables, you probably wouldn't lose any information. But whether you should remove them or not really depends on your future goals. And if you want to trust this test, remember that we made the assumption that our variables are normally distributed, which might not always be the case, especially if the distribution has many outliers and so on. When we go on to build a logistic regression, we really have to keep that in mind.
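The removal step described here is just a threshold filter on the p-values. A minimal sketch, where the feature names and p-values are invented for illustration (they are not the lecture's actual results):

```python
# Hypothetical per-feature p-values from the significance tests above
p_values = {
    "amplitude": 3.2e-20,   # extremely significant
    "width": 0.012,         # significant at the usual threshold
    "slope": 0.38,          # weak evidence against the null
    "offset": 0.94,         # essentially no evidence, candidate for removal
}

alpha = 0.05                # the conventional significance threshold
selected = [name for name, p in p_values.items() if p < alpha]
print(selected)             # → ['amplitude', 'width']
```

Remember the caveat above: this filter inherits the normality assumption of the t-test and ANOVA, so it is a screening step, not a final verdict on a variable's usefulness.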