Different problems that might be found in a dataset can lead us to a wrong decision afterwards. Therefore, it is necessary to take a very careful look at the data we would like to use for our decision-making system. Here are some of the ways to explore the structure of the dataset. Let us explore some of the basic data visualization techniques available in MATLAB that will help us get a better view of the data we have. On this slide, you can see some of the typical approaches to exploring the input data. As we've already seen, all the data inputs in our system are numeric variables. More precisely, after we performed normalization, these are comma-separated values ranging from minus one to one. You might already be familiar with some of these techniques, but let's just use them and see how they work on our data. A nice way to visualize the distribution of a variable is to make a boxplot. It is especially useful to plot boxplots of the same feature for two different classes next to each other. If you see a significant difference in the shape of the boxplots belonging to the two classes, this variable might be a very useful identifier of a particular class. We plot a boxplot for noisy records on top and for clean records on the bottom. Additionally, we can see that ECG complexes that were classified as noise tend to have more outliers. We can also analyze the distribution of a variable across the classes more precisely by plotting two histograms of the variable, one for each class, next to each other. On this plot, we put together histograms of the feature called "the length of an ECG complex" for our two classes. Although the mean values seem to be equal for the two classes, the noisy complexes clearly show smaller variation. We can go further and plot the distributions of two variables together with a scatter plot.
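The per-class boxplots and overlaid histograms described above can be sketched as follows. This is a minimal sketch, assuming the normalized features sit in a matrix `X` with a 0/1 label vector `y`; both names are hypothetical, not taken from the course files.

```matlab
% Sketch: compare one feature across the two classes.
% X (features) and y (0 = normal, 1 = noisy) are assumed variable names.
noisy  = X(y == 1, 1);   % first feature, noisy class
normal = X(y == 0, 1);   % first feature, normal class

figure;
subplot(2,1,1); boxplot(noisy);  title('Noisy complexes'); grid on;
subplot(2,1,2); boxplot(normal); title('Clean complexes'); grid on;

figure;
histogram(noisy,  30, 'FaceColor', 'r'); hold on;
histogram(normal, 30, 'FaceColor', 'b');
legend('noisy', 'normal');
```

Stacking the two boxplots in one figure with subplot() makes the difference in spread and outliers easy to see at a glance.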
You can see that for our two variables, which are in this case "the relation of the length of complexes" on the x axis and "the length of the ECG line to the magnitude of a cardiac complex" on the y axis, there is clear evidence of a difference between our two classes. Using these variables together in our classifier might give us better results than using them alone. Another way is to plot a complex graph like this, which combines several different graphs in one figure. It gives you a three-in-one solution and might be very useful for quickly digging into the data, but be careful: computing this kind of graph for a big set of variables may take a long time. Note that to use corrplot(), an additional MATLAB toolbox has to be installed on your system. This time we just load the data for the whole dataset together to identify multicollinear variables, which might be a possible problem for the future model that we want to create. Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. We have perfect multicollinearity if the correlation between two independent variables is equal to 1 or to -1. In practice, we rarely face perfect multicollinearity in a dataset. More commonly, the issue of multicollinearity arises when there is an approximate linear relationship among two or more independent variables. Now, let's go to MATLAB and try that on our dataset. Let's start by making two different datasets, separating the initial dataframe into two according to our labels. We can do that using the find() command. Okay. Now we see that the "bad" class contains only those records labeled with "1", and the "norm" class those labeled with "0". Now that we have separated our data frame into two, let's make some boxplots. We can plot more than one graph in a figure by utilizing the subplot() command.
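The split with find() and the subplot() boxplots can be sketched as follows, assuming a data matrix `data` and a label vector `labels` (hypothetical names):

```matlab
% Sketch: split the dataset by class label with find().
badIdx  = find(labels == 1);     % indices of noisy ("bad") records
normIdx = find(labels == 0);     % indices of normal records

bad   = data(badIdx,  :);
norm_ = data(normIdx, :);        % "norm" shadows a built-in, hence norm_

% Several grouped boxplots in one figure via subplot().
figure;
for k = 1:4                      % first four features, as an example
    subplot(2,2,k);
    boxplot([bad(:,k); norm_(:,k)], ...
            [ones(size(bad,1),1); zeros(size(norm_,1),1)]);
    title(sprintf('Feature %d', k)); grid on;
end
```

Passing a grouping vector as the second argument of boxplot() draws the two classes side by side inside each subplot.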
You see that we're also using the ylim() command to specify the limits of the scale, and we also use the "grid on" command to put a grid on the graph. Well, we see that for some variables the distributions are slightly different. This is the broad picture. What you would be better off doing is plotting these boxplots next to each other, like "this is the boxplot for the normal class, this is for the pathological one": take two of them, put them together, and analyze them in detail to get a better view. Let's now plot our histograms for the first feature, as we've seen on the slide. So, we see that even though our classes are distributed on quite the same scale, they really have different shapes. Red is the biased QRS complexes and blue is the normal ones. So, you might think of some thresholds or other characteristics of this feature that can help you distinguish between the two classes. Now let's make a scatter plot to analyze the joint distribution of some of the features. Let's take, for example, the second feature, which is the width of our QRS complex, and the third, which is the relative length of the complex. Let's change the colors: we'll use blue for the normal class and red for the biased class. Okay. From this picture, you can see that in these two characteristics the normal QRS complexes clearly have a wider spread. Now, let's make a bigger graph that actually combines all three of the above, histograms, scatter plots, and a correlation matrix, with the corrplot() command. Okay. As you see, I made this graph only for the first 10 variables because, first of all, as you've already seen, this is quite a performance-heavy operation and takes some time, and also, if we plot more than 10 variables this way, we can't actually see the individual plots inside it. What we can see here is that we definitely have some highly correlated data, namely the 4th variable... and the 9th...
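The colored scatter plot and the corrplot() call can be sketched like this, reusing the assumed class matrices `bad` and `norm_` and the full matrix `data` from before:

```matlab
% Sketch: joint distribution of features 2 and 3, colored by class.
figure;
scatter(norm_(:,2), norm_(:,3), 10, 'b'); hold on;
scatter(bad(:,2),   bad(:,3),   10, 'r');
xlabel('QRS width'); ylabel('relative length'); grid on;
legend('normal', 'biased');

% corrplot() requires the Econometrics Toolbox. Limiting the call to
% the first 10 variables keeps it fast and readable.
corrplot(data(:, 1:10));
```

Each panel of the corrplot() grid shows a pairwise scatter plot with its correlation coefficient, and the diagonal shows the histograms, which is exactly the three-in-one view mentioned above.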
The 9th, the 10th, and the 4th actually have a very high correlation between them, almost one, and we might be better off removing two of them from further analysis, as they would not give us more information but might in fact only lead to problems at the stage of constructing a regression model. Now, let's just save the data that we got here to work with it afterwards. Okay. So, that's all for now, and let's go to the next step.
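As a final sketch, the near-perfect correlations can be confirmed numerically with corrcoef(), and the split datasets saved for the next step. The 0.95 threshold, the variable names, and the file name are illustrative assumptions:

```matlab
% Sketch: list highly correlated variable pairs and save the results.
R = corrcoef(data);
[i, j] = find(abs(R) > 0.95 & ~eye(size(R,1)));   % off-diagonal pairs
disp([i j]);                                       % e.g. pairs like 4 and 9

save('ecg_features.mat', 'bad', 'norm_');          % hypothetical file name
```

Dropping all but one variable from each such pair removes the redundancy before building the regression model.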