So, let's continue the exploration. We wanted to determine the types of the variables, and to do that we will first use the nunique function to find how many unique values each feature has. We pass dropna=False to make sure the function counts NaNs as well; otherwise it will not count NaN as a unique value, it will just omit it. What we see here is that ID has a lot of unique values, and then the numbers in this series are not that huge: we have about 150,000 rows but only 6,000 unique values here, or 25,000 there, which is not a huge number either. So, let's aggregate this information and plot a histogram of the values from above. It is not a histogram of these exact values but of normalized values: we divide each value by the number of rows in the table, which is the maximum number of unique values a feature could possibly have. What we see is that a lot of features have only a few unique values, and there are several that have a lot, but not as many as these. These features have an almost unique value in every row.
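As a rough sketch of the steps just described, assuming the training table sits in a file called train.csv and pandas is available, the unique-value counts and the normalized histogram could look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed setup: the competition's training table, loaded from train.csv.
train = pd.read_csv('train.csv')

# Unique values per column; dropna=False makes NaN count as a value
# of its own instead of being silently skipped.
nunique = train.nunique(dropna=False)

# Normalize by the number of rows, the maximum number of unique
# values a column could possibly have, and plot the distribution.
(nunique / train.shape[0]).hist(bins=100)
plt.xlabel('fraction of unique values per feature')
plt.ylabel('number of features')
plt.show()
```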
So, let's actually explore these. ID having a lot of unique values is no problem, but what are the others? What we actually see is that they are integers: huge numbers, but integers. I would expect a real-valued variable to have a lot of unique values, not an integer-typed one. So, what could these variables represent? They could be counters again, but what else? They could be a time in, say, milliseconds or nanoseconds. Then we would have a lot of unique values and no overlap between them, because it is really unlikely for two rows in our data set to have exactly the same time, say the time of creation, when the time precision is that good. So, that could be our guess.

Next, let's explore this group of features. With some manipulations I found them, and they are presented in this table. What's interesting here are the names: the first one is 541 and the second one is 543, and then we have 1,081 and 1,082, so they stand really close to each other. If the column order were random, if the columns were shuffled, that would be really unlikely. So, probably the columns are grouped together according to something, and we could explore that something. What's more interesting, if we take a look at the values corresponding to one row, we'll find that this value is equal to this value, and this value is equal to this one and this one, and this is basically the same value we had here. So, we have five features out of this group sharing the same value. If you examine other objects, some of them will have the same thing happening and some will not. So, it could be something really essential to the objects, a property that separates the objects from each other, and it's something we should really investigate and do some feature engineering on. For, say, a tree-based model it will be really hard to find those patterns; it will struggle to discover on its own that two features, or five features, are equal.

So, if we create a feature that calculates how many of these features have the same value, say the value five for object zero and something else for other rows, then this feature could be discriminative. And then we can create other features: set it to one if the values in this column, this one, this one and this one are the same, and zero otherwise, and so on. If you go through the rows, you will find that the patterns differ, and sometimes the same values appear in different columns. For example, for this row we see that this value equals this value, and this value differs from the previous ones but equals this one. It's really fascinating, isn't it? And if it actually works and improves the model, I will be happy.

Another thing we see here are some strange values that look like NaNs: something a human typed in or a machine autofilled. So, let's go further. Oh yeah, and the last thing: let's pick one variable from this group and see what values it has. Let's pick variable 15; here are its values. Minus 999 is probably how the NaNs were filled in, and we have 56 of them; all other values are non-negative, so probably these are counters, meaning how many events happened in, I don't know, a month or something like that.

Okay. And finally, let's filter the columns and separate them into categorical and numeric. It's really easy to do with the select_dtypes function: all the columns that have the object dtype (as reported by the dtypes attribute) we think of as categorical variables, and if a column has an integer, float, or other numeric dtype, we think of it as a numeric column. Now we could go through the features one by one, as I actually did during the competition. Well, we have about 2,000 features in this data set, and it is unbearable to go through them all one by one; I stopped at about 250 features. You can find this in my notebook in the reading materials if you're interested. It's a little bit messy, but you can take a look.
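Going back to the group of nearly identical features, here is a minimal sketch of the two engineered features described above. The column names VAR_0541, VAR_0543, VAR_1081, VAR_1082 are my guess at what the real names might look like and are only placeholders:

```python
# Hypothetical group of columns that often share one value per row;
# the names below are placeholders guessed from the lecture.
group = ['VAR_0541', 'VAR_0543', 'VAR_1081', 'VAR_1082']

# For each row: how many columns in the group repeat the row's most
# frequent value (len(group) means "all equal").
train['group_match_count'] = train[group].apply(
    lambda row: row.value_counts(dropna=False).iloc[0], axis=1)

# A binary indicator: 1 if every column in the group agrees, else 0.
train['group_all_equal'] = (train[group].nunique(axis=1) == 1).astype(int)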
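And the dtype-based split, together with a peek at one variable's value counts, could look like this; VAR_0015 is again an assumed column name:

```python
# Object-typed columns are treated as categorical, everything with a
# numeric dtype (int, float, ...) as numeric.
cat_cols = list(train.select_dtypes(include=['object']).columns)
num_cols = list(train.select_dtypes(exclude=['object']).columns)

# Peek at one numeric variable; -999 is presumably the NaN fill value.
print(train['VAR_0015'].value_counts().head())
```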
So, what we will do here is look at just a few examples of what I was trying to investigate in this data set. Let's take the numeric columns we computed previously, work with only the first 42 of them, and build the following matrix. It looks like a correlation matrix, and all matrices of that type, with features along the y-axis and features along the x-axis. It's really huge, yeah. In this case, each value is the fraction of elements of one feature that are greater than the elements of the second feature. For example, this cell shows that all values in variable 50 are less than the values in variable ID, which is expected; and it's the opposite over here. So, if we see a one here, it means that, for example, variable 45 is always greater than variable 24. We would expect this matrix to look somewhat random if the column order were random, but here we see, for example, this kind of square. It means that the (i+1)-th feature is always greater than the i-th feature. It could be that this is information about, for example, counters over different periods of time: the first feature is how many events happened in the first month, the second feature is how many events happened in the first two months, and so on, a kind of cumulative values. That is why one feature is always greater than the other. And basically, what we can extract from this kind of matrix is that we have this group, and we can generate new features from it, for example the difference between two consecutive features. That is how we would extract, say, the number of events in each month, going from cumulative values back to normal values. Linear models and, say, neural networks could do this themselves, but tree-based algorithms could not, so it can be really helpful. In the notebook attached to the reading materials you will see a lot of patterns of this kind: we have one here, one here; and this is also a pattern, isn't it?

Now we will just go through several variables that stand out. For example, variables 2 and 3 are interesting. If you build a histogram of them, you will see something like this, and the most interesting parts are these spikes. You see, they are not random; there is something there. If we take variable 2 and use the value_counts function, we get each value and how many times it occurs in this variable, and we see that the top values are 12, 24, 36, 60 and so on. They are all divisible by 12, so probably this variable is somehow connected to time, isn't it? To hours. And what can we do? We want to generate features, so we will generate features like the value of this variable modulo 12, or, for example, its integer division by 12. This could really help. In another competition, you could build a similar histogram and see something like this again. What happened there is that the organizers actually had quantized data: they only had values that, in our case, would be divisible by 12, say 12, 24 and so on. But they probably wanted to obfuscate the data and added some noise. That is why, if you plot a histogram, you still see the spikes, but you also see something in between them. Again, these features worked quite well in that competition: you could dequantize the values, and it really helped. The same is happening with variable 3: basically 0, 12, 24 and so on.

And variable 4. I don't have a plot for variable 4 itself here, but we do the same thing: we take variable 4 and create a new feature, variable 4 modulo 50. Now we plot this kind of histogram. There are actually two histograms here: one for the objects from class 0 and one for the objects from class 1; one is drawn in light green and the other in dark green. You see the difference only in this bar, but you do see a difference. It means that this new feature, variable 4 modulo 50, can be really discriminative when it takes the value 0. One could say this is kind of strange; I mean, some people would never do that. Like, why would we take this variable modulo 50? But you see, sometimes this can really help, probably because of the way the organizers prepared the data.
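Here is a sketch of how such a matrix could be computed, continuing from the snippets above; taking num_cols[:42] mirrors the "first 42 columns" from the lecture:

```python
import numpy as np

# For every ordered pair of features (i, j), the fraction of rows
# where feature i is strictly greater than feature j.
cols = num_cols[:42]
m = len(cols)
gt_matrix = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        gt_matrix[i, j] = (train[cols[i]] > train[cols[j]]).mean()

plt.imshow(gt_matrix, cmap='viridis')
plt.colorbar()
plt.show()
```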
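Once a cumulative block is spotted in the matrix, the differencing idea might look like this; block is a hypothetical list, since the real columns come from inspecting the plot:

```python
# `block` is a hypothetical, ordered list of columns that the matrix
# suggests hold cumulative counters (month 1, months 1 to 2, ...).
block = ['VAR_0031', 'VAR_0032', 'VAR_0033']  # placeholder names

# Differencing consecutive columns recovers per-period counts,
# something tree-based models cannot easily derive on their own.
for a, b in zip(block[:-1], block[1:]):
    train[b + '_minus_' + a] = train[b] - train[a]
```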
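And the modulo features could be sketched as follows; VAR_0002, VAR_0004 and the label column name target are assumptions on my part:

```python
# Values of the assumed column VAR_0002 cluster at multiples of 12,
# hinting at hours, so split it into a remainder and a quotient part.
train['VAR_0002_mod_12'] = train['VAR_0002'] % 12
train['VAR_0002_div_12'] = train['VAR_0002'] // 12

# Same trick for VAR_0004 modulo 50, plotted per class (the label
# column is assumed to be called 'target') to check how
# discriminative the new feature is.
train['VAR_0004_mod_50'] = train['VAR_0004'] % 50
for label, color in [(0, 'lightgreen'), (1, 'darkgreen')]:
    train.loc[train['target'] == label, 'VAR_0004_mod_50'].hist(
        bins=50, alpha=0.6, color=color)
plt.show()
```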
So, let's get to the categorical features; we actually don't have a lot of them. We have some labels here and some binary variables. I don't know what this one is; it is probably some encoding problem on my side. Then we have some time variables: this one is actually not a time; this one is a time; not a time; not a time; this one is a time. Whoa, this is interesting. These look like cities, right? Or towns, I mean, city names. And if you remember what features we can generate from geolocation, this is the place to generate them. Then again there is some time, some labels, and once again, these are the states, aren't they? So, again, we can generate some geographic features. But particularly interesting are the date features we had here. These are all the columns I found containing date information, and they gave some of the best features in this competition, actually. You could do the following: make a scatter plot between two particular date features and find that they have some relation, with one always greater than the other. It means that these are probably dates of some events, and one event always happens after the other. So, we can extract features like the difference between these two dates, and in this competition it really helped a lot.
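A minimal sketch of this date idea, assuming two hypothetical date columns; depending on how the dates are stored, pd.to_datetime may need an explicit format argument:

```python
# Hypothetical pair of date columns; the real names come from the
# list of date-like columns found above.
first, second = 'VAR_0073', 'VAR_0075'
for c in (first, second):
    train[c] = pd.to_datetime(train[c], errors='coerce')

# Scatter one date against the other: if the points stay on one side
# of the diagonal, one event always happens after the other.
plt.scatter(train[first], train[second], s=2)
plt.show()

# The gap between the two events, in days, as a new feature.
train['date_diff_days'] = (train[second] - train[first]).dt.days
```

So, be sure to do exploratory data analysis and to extract powerful features like these. If you don't look into the data, you will not find anything like that. And it's really interesting. So, thank you for listening.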