All right, so here's our list. Good feature columns must be related to the objective, known at prediction time, numeric with a meaningful magnitude, and have enough examples present, and lastly, you're going to bring your own human insight to the problem. First up, a good feature needs to be related to what you're actually predicting. You need some kind of reasonable hypothesis for why a particular feature value might matter for this particular problem. You can't just throw arbitrary data in there and hope that there's some kind of relationship somewhere for your model to figure out, because the larger your data set is, the more likely it is that there are lots of spurious or strange correlations that your model is going to learn. So take a look at this: what are the good features shown here for horses? Well, it's a trick question. If you said "it depends on what you're predicting," you're exactly right; I didn't tell you what objective we're after. If the objective is to find what features make for a good race horse, you might go with the data points on breed and age. Does the color of the horse's eyes really matter that much for racing? However, if the objective was to determine whether certain horses are more predisposed to eye disease, eye color may indeed be a valid feature. The point here is that you can't look at your feature columns in isolation and say whether one is a good feature; it all depends on what you're trying to model, on what your objective ultimately is. All right, number two: you need to know the value at the time that you're doing the prediction. Remember, the whole reason to build the ML model is so that you can predict with it. If you can't predict with it, there's no point in building and training it. So a common mistake that you're going to see a lot out there is to just look at the data warehouse that you have, take all of that data, all the related fields, and then throw them all into a model.
So if you take all these fields and just throw them into an ML model, what's going to happen when you go to predict with it? At prediction time, you may discover a problem. Say your warehouse has all kinds of good historical sales data, perfectly clean, so you use it as an input to your model. Say you want how many items were sold on the previous day; that's now an input to your model. But here's the tricky part: it turns out that daily sales data actually comes into your system a month later. It takes time for the information to flow in from your stores; there's a delay in collecting and processing this data. Your data warehouse has this information because somebody went through all the trouble of taking the data, joining all the tables, and putting it all in there. But at prediction time, in real time, you don't have that data. Now, the third key aspect of a good feature is that all your features have to be numeric, and they have to have a meaningful magnitude. Why is that, you ask? Well, ML models are simply adding, multiplying, and weighing machines. When you're training your model, it's just doing arithmetic operations, computing trigonometric and algebraic functions, on your input variables. So your inputs need to be numbers, and crucially, the magnitudes need to have a useful meaning, such that a value of two really is twice as much as a value of one. Let's do a quick example. Here we're trying to predict the number of promo coupons that are going to be used, and we look at the different features of that promotional coupon. First up is the discount percentage, like 10% off, 20% off, etc. Is that numeric? Sure, absolutely, that's a number. Is the magnitude meaningful? Yeah, in this case, absolutely: a 20% off coupon is worth twice as much as a 10% off coupon, so that's not a problem. This is a perfect example of a numeric input.
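One way to guard against this kind of mistake is to record, for each candidate field, how long after the event its value actually lands in your serving system, and only train on fields available at prediction time. Here's a minimal sketch of that idea; the catalog of fields and lags is hypothetical, just mirroring the sales example above:

```python
from datetime import timedelta

# Hypothetical catalog: how long after the event each warehouse field
# actually becomes available to the live prediction system.
FEATURE_LAG = {
    "store_id": timedelta(0),             # known immediately
    "day_of_week": timedelta(0),          # known immediately
    "prev_day_sales": timedelta(days=30), # arrives a month late!
}

def usable_at_prediction_time(features, max_lag=timedelta(0)):
    """Keep only the features whose values exist when we need to predict."""
    return [f for f in features if FEATURE_LAG[f] <= max_lag]

print(usable_at_prediction_time(["store_id", "day_of_week", "prev_day_sales"]))
# prev_day_sales is dropped: its 30-day lag makes it unusable in real time
```

The point of the sketch is simply that training-time availability (the warehouse has it) and prediction-time availability (the live system has it) are two different questions.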
What about the size of the coupon? Say I define it as four square centimeters, 24 square centimeters, and 48 square centimeters. Is that numeric? Yeah, sure. Is a coupon that's 24 square centimeters six times as large, or six times as visible, as one that's four square centimeters? You can imagine an argument for that, so this is numeric, but it's unclear whether the magnitude is really meaningful. If this were an ad you're placing, the size of a banner ad, larger ads are better, and you could argue the magnitude makes sense. But if it's a physical coupon, something that goes out in the newspaper, then you have to wonder whether a 48 square centimeter coupon really is twice as good as a 24 square centimeter coupon. Now let's change the problem a little bit. Suppose we define the size of the coupon as small, medium, and large. At that point, are small, medium, and large numeric? No, not at all. Now, I'm not saying you can't have categorical variables as inputs to your models; you can. You just can't use small, medium, and large directly; we have to do something smart with them, and we'll take a look at how to do that shortly. All right, let's go with the font of an advertisement: Arial 18, Times New Roman 24. Is this numeric just because it has numbers in it? No. So how do you convert something like Times New Roman to numeric? Well, you could say Arial is number one, Times New Roman is number two, Roboto is number three, and Comic Sans is number four, but that's just a number code. Those codes don't have meaningful magnitudes: if we set Arial as one and Times New Roman as two, Times New Roman is not twice as good as Arial. So the meaningful-magnitude part is really important. How about the color of the coupon: red, black, blue? Again, these aren't numeric values, and they don't have meaningful magnitudes. We could come up with RGB values to make the colors numbers, but again, they're not going to be meaningful numerically.
If I subtract two colors and the difference between them is three, and I subtract two other colors and the difference is also three, are those two pairs the same? Are they commensurate? No, and that's the problem with magnitude. All right, how about item category: one for dairy, two for deli, three for canned goods? As I just said, these are categorical, not numeric. Again, I'm not saying you can't use non-numeric values; we just need to do something to them first, and we'll look at what that something is. To use an example, suppose you have words in a natural language processing system. One thing you can do to make words numeric is to run something called Word2Vec. It's a very standard technique: you take all of your words and apply this technique to turn each word into a numerical vector, which, as you know, has a magnitude. At the end of Word2Vec, the vectors have a special property: if you take the vector for man and the vector for woman and subtract them, the difference you get is very similar to the difference between the vector for king and the vector for queen. That's what Word2Vec does. So changing a non-numeric input variable into a numeric one is not a simple matter; it's a bit of work. Sure, you could just throw some random encoding in there, but your ML model won't be as good as if you started with a vector encoding that's nice and understands the context of things like male and female, man and woman, king and queen. So that's what we're talking about when we say numeric features and meaningful magnitudes: they have to be useful so you can do arithmetic operations on them during your ML processing phase.
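The two fixes described above can be sketched in a few lines. The first part one-hot encodes the hypothetical small/medium/large coupon sizes, so no fake magnitudes are introduced. The second part illustrates the Word2Vec analogy property with tiny made-up 2-D vectors; real embeddings are learned from a corpus and have hundreds of dimensions, so these numbers are purely for illustration:

```python
import numpy as np

# --- 1. One-hot encoding for a categorical column like coupon size ---
SIZES = ["small", "medium", "large"]  # hypothetical vocabulary

def one_hot(value, vocabulary=SIZES):
    """Map a category to a vector of 0s and 1s, with no implied ordering."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

print(one_hot("medium"))  # [0.0, 1.0, 0.0]

# --- 2. The Word2Vec analogy property, with toy vectors ---
vec = {
    "man":   np.array([1.0, 2.0]),
    "woman": np.array([1.0, 4.0]),
    "king":  np.array([5.0, 2.0]),
    "queen": np.array([5.0, 4.0]),
}
# The man -> woman offset matches the king -> queen offset:
print(np.allclose(vec["woman"] - vec["man"], vec["queen"] - vec["king"]))  # True
```

One-hot encoding sidesteps the number-code problem entirely: "medium" is not twice "small", it's just a different direction in the vector.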
All right, point number four: you need to have enough examples of each feature value in your data set. A good starting point for experimentation is to have at least five examples of any value before you use it in your model: at least five examples of a value before it goes into training, validation, and so on. Going back to our promo code example, if you want to train an ML model on promotion codes, you may well have lots of examples of 10%-off coupons in your training data set. Ah, but what if you gave a few users a one-time discount code of 87% off? Do you think you have enough instances of an 87% discount code in your data set for your model to learn from? Likely not, so you ought to avoid values for which you don't have enough examples to learn from. And notice I'm not saying you need at least five categories, like 10% off, 20% off, 30% off, and I'm not saying you need at least five rows total in a column. I'm saying that for every value of a particular column, you need at least five examples. So in this case, we'd want at least five instances of the 87%-off discount code having been used before we even consider using that value for ML. Last but not least, bring your human insight to the problem. Recall how we verbalized and reasoned through all of the responses about what makes a good feature or not. You need subject matter expertise and a curious mind to think of all the ways you could construe a data field as a feature, and remember that feature engineering is not done in a vacuum. After you train your first model, you can always come back and add or remove features for model number two.
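The five-examples-per-value check is easy to automate. Here's a minimal sketch using a hypothetical list of discount values mimicking the example above, where 10% and 20% coupons are common but the 87% code appeared only twice:

```python
from collections import Counter

# Hypothetical discount values observed in the training data.
discounts = [10] * 200 + [20] * 80 + [87] * 2

MIN_EXAMPLES = 5  # the starting-point heuristic from the lecture
counts = Counter(discounts)

usable = sorted(v for v, n in counts.items() if n >= MIN_EXAMPLES)
rare   = sorted(v for v, n in counts.items() if n < MIN_EXAMPLES)

print(usable)  # [10, 20]
print(rare)    # [87] -- too rare to learn from; drop or bucket it
```

Note the check is per value, not per column: a column with thousands of rows can still contain individual values that are too rare to trust.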