For our first machine learning exploration, we're going to build an extremely simple form of object recognition system. Now, although the example we'll use is very simple, it does reflect many of the same key machine learning concepts that go into building real-world commercial systems. The dataset we're going to use is a small, very simple example dataset derived from one originally created by Dr. Iain Murray at the University of Edinburgh for the task of training a classifier to distinguish between different types of fruit. To create the original dataset, Dr. Murray went to a nearby store, bought a few dozen oranges, lemons, and apples of different varieties, and recorded their measurements in a table: he measured the height and width of each fruit, estimated its mass, and so forth. We've reformatted his original data slightly and added one or two extra simulated features, such as a color score, for instructional purposes. This dataset is called fruit_data_with_colors.txt, and it's included in the folder of materials that you downloaded for this course. Now, you might think that fruit prediction is a silly and impractical scenario, and given the limited nature of this dataset, it is a bit of a toy example. But food companies do indeed now rely on machine learning systems for automated quality control that aren't all that different in concept from the one we're about to build. Real systems do exist, for example, systems used by fruit shipping companies to screen for rotten oranges during processing. Now, the features that go into building these systems are a little more sophisticated than the ones we're looking at here. For example, quality control systems for rotten orange detection use ultraviolet light to detect interior decay, which is often much less visible from the surface.
Anyway, to solve machine learning problems, you can think of the input data as a table where each object, so in our case a piece of fruit, is represented by a row, and the attributes of the object, its features such as the measurements, the color, and the size, are represented by the values that you see across the columns. In a supervised learning problem, the dataset will also typically contain a special column with the label of the object. If the dataset does not have such a field already, sometimes you can derive it from information that's in one or more other columns. To make sure you're ready to continue, run the following code snippet, which loads the libraries we're going to need to proceed, and we'll show those here now. The first thing we're going to do is load the fruit dataset file using the very handy read_table function in pandas. This will read the dataset from disk and store it in a DataFrame variable that we'll call fruits here. Let's look at this dataset and dump out the first few rows of the DataFrame. Here we can see that each row of the dataset represents one piece of fruit, as represented by several features that are in the table's columns. In order, the columns we see are: fruit_label, which is the training label that we'll use. It's a number that corresponds to the general type of fruit, so for example, 1 is an apple, 2 is a mandarin orange, 3 is an orange, and so forth. This label was supplied by the human creator of the dataset. The fruit_name and fruit_subtype columns contain text descriptions of the general and specific fruit categories. The fruit_name is simply the text form of the fruit_label in the same row. Now, we won't be using these name columns as features; I've just included them here to make the dataset a bit more readable for our purposes.
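To make the loading step concrete, here's a minimal sketch. In the course materials you would simply call pd.read_table on the provided file; since that file may not be at hand, the code below builds a small stand-in DataFrame with the same column layout (the values shown are made up for illustration, not taken from the real dataset).

```python
import pandas as pd

# In the course materials, the dataset is loaded directly from the file:
#   fruits = pd.read_table('fruit_data_with_colors.txt')
# Below is a small illustrative stand-in with the same columns
# (values are invented for demonstration, not from the actual file).
fruits = pd.DataFrame({
    'fruit_label':   [1, 1, 2, 3, 4],
    'fruit_name':    ['apple', 'apple', 'mandarin', 'orange', 'lemon'],
    'fruit_subtype': ['granny_smith', 'braeburn', 'mandarin',
                      'spanish_jumbo', 'unknown'],
    'mass':          [192, 180, 86, 362, 118],    # grams
    'width':         [8.4, 8.0, 6.2, 9.6, 5.9],   # centimeters
    'height':        [7.3, 6.8, 4.7, 9.2, 8.0],   # centimeters
    'color_score':   [0.55, 0.59, 0.80, 0.74, 0.72],
})

# Dump the first few rows, as we would with the real DataFrame
print(fruits.head())
```

Each row is one piece of fruit; the first three columns hold the label and its text descriptions, and the remaining columns hold the numeric features.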
After that, the features in this representation include measurements for each fruit that capture its mass in grams and its width and height in centimeters. Finally, there's a feature stored in a column called color_score, a single number that's meant to capture a rough idea of the color of the fruit. In a real system, this would actually be something more sophisticated, like a histogram of the distribution of colors, or maybe the pixels from an actual image or video of the fruit. But for our purposes, we're going to summarize the color along a single spectrum scale that's easy to visualize: scores close to 1 mean the fruit is red, scores around 0.7 indicate yellow, and so forth. Looking at this DataFrame, we can see that it contains 59 rows corresponding to 59 different pieces of fruit that have been measured and entered into the table. Our goal here is to build a classifier from this data that can predict the correct type of fruit for any given observation of features, such as mass, height, width, and color score. For example, can we tell, based on the color score and the dimensions, the difference between an orange and a lemon? In other words, can the classifier predict the type of a piece of fruit correctly just from its observed measurements? Now, assuming for the moment that we already had a classifier ready to go, how would we know if its predictions were likely to be accurate? Well, we could choose a fruit sample, called a test sample, for which we already had a label. We could feed the features of that piece of fruit into the classifier and then compare the label that the classifier predicts with the actual true label of that fruit. Here's a very important point though: if we use one of our labeled fruit examples in the data that we use to train the classifier, we can't also use that same fruit sample later as a test sample to evaluate the classifier. Why is that?
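The evaluation idea described above, comparing predicted labels against known true labels, can be sketched in a few lines of plain Python. The labels and predictions here are made-up values standing in for a hypothetical classifier's output, just to show the comparison:

```python
# A sketch of the evaluation idea: compare a classifier's predicted
# labels against the known true labels of held-out test samples.
# (These values are invented for illustration.)
true_labels      = [1, 2, 3, 3, 4]   # actual fruit types of test samples
predicted_labels = [1, 2, 3, 1, 4]   # what a hypothetical classifier returned

# Accuracy: the fraction of test samples where the predicted
# label matches the true label
correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
accuracy = correct / len(true_labels)
print(accuracy)  # 4 of 5 predictions match, so accuracy is 0.8
```

This fraction of correct predictions on held-out samples is the basic accuracy measure we'll use to judge a trained classifier.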
Well, a key ability that our classifier needs to have is that it works well on any new input sample, any new pieces of fruit that we might see in the future, not just on the ones that we have in our training set. Our classifier could simply memorize every sample in the training set; it would then be trivial to give back the correct label for any of those same samples later. So measuring the classifier's performance using the same samples that we used to train it in the first place doesn't tell us anything about how well the classifier is likely to work on a fruit that we haven't seen before; it only tells us what we already know about what's in the training set. Since our only source of labeled data is the dataset we've been given, to estimate how well the classifier will do on future samples, what we'll do is split the original dataset into two parts. We'll have an array of labeled samples called the training set that will be used to train the classifier, and then we'll hold out the remaining labeled samples and put them into a second separate array called the test set, which will be used to evaluate the trained classifier. To create training and test sets from an input dataset, scikit-learn provides a handy function that will do this split for us, called, not surprisingly, train_test_split, and here's an example of how we'll use it. This function randomly shuffles the dataset, splits off a certain percentage of the input samples for use as a training set, and puts the remaining samples into a different variable for use as a test set. In this example, we're using a 75%/25% split of training versus test data. That's a pretty standard split and a good rule of thumb for deciding the proportion of training versus test data.
As a reminder, when we're using scikit-learn, we'll denote the feature data using different flavors of the variable X, which is typically a two-dimensional array or DataFrame. The notation we'll use for the labels will typically be based on y, which is usually a one-dimensional array or Series. Now, note the use of the random_state parameter in the train_test_split function. This parameter provides a seed value to the function's internal random number generator. If we choose different seed values, we'll get different randomized splits into training and test data. If we want to get the same training and test split each time, we just make sure to pass in the same value of random_state, and so here we're going to set that parameter to 0 for all our examples. The train_test_split function will put the training set into X_train, the test set into X_test, the training labels into y_train, and the test labels into y_test. This is a 75%/25% partitioning of the original data into these two parts. This is the variable naming convention we'll use for pretty much all of our code: we'll put the rows of the data without the label, the training instances, into the X variable, and the list of corresponding labels for those rows into a variable called y. When using a training set and a test set, we'll then use X_train, which holds the training instances, to train the classifier, and X_test to evaluate the classifier after it's been trained. We're going to show a short snippet of code now that illustrates how to do that. Here we can see the results of applying this train_test_split function. You can see that it has indeed split our fruit dataset into training and test sets with the correct proportion of samples. Now that we have a training set and a test set, we're ready for the next step, and we'll look in more depth at the data itself before we proceed with giving it to a machine learning algorithm.
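Putting the X/y naming convention and the split together, a minimal version of that snippet might look like the following. Since the real file isn't included here, we build a tiny stand-in fruits DataFrame with invented values so the example is self-contained; with the real dataset you would select the same columns from the loaded DataFrame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in for the fruits DataFrame (illustrative values only,
# not from the real fruit_data_with_colors.txt file)
fruits = pd.DataFrame({
    'fruit_label': [1, 1, 2, 2, 3, 3, 4, 4],
    'mass':        [192, 180, 86, 84, 362, 140, 118, 120],
    'width':       [8.4, 8.0, 6.2, 6.0, 9.6, 7.1, 5.9, 6.0],
    'height':      [7.3, 6.8, 4.7, 4.6, 9.2, 7.6, 8.0, 8.4],
    'color_score': [0.55, 0.59, 0.80, 0.79, 0.74, 0.70, 0.72, 0.71],
})

# X holds the feature columns; y holds the corresponding labels
X = fruits[['mass', 'width', 'height', 'color_score']]
y = fruits['fruit_label']

# train_test_split defaults to a 75%/25% train/test split;
# random_state=0 seeds the shuffle so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)  # 6 training rows, 2 test rows
```

Running the same code again with random_state=0 reproduces exactly the same partition, which is what makes results repeatable across the course examples.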