0:10

So what is supervised learning and how is it different from unsupervised learning?

In supervised learning, we work with a set of labelled training data, where we have pairs of inputs and desired outputs.

The goal here is to infer a function from the labelled training data so that we can produce the desired output given an input, even for examples that the supervised learning algorithm has never seen before.

This availability of labelled training data is what distinguishes supervised from unsupervised learning.

There are two main types of supervised learning: classification and regression.

In classification, the predicted output is discrete. In other words, we use our training data to build a model that assigns unseen observations to one or more classes.

Examples here include deciding whether an e-mail is spam or not, or classifying images as cats, dogs or fish.

In regression, on the other hand, the predicted output is a continuous numerical value. Continuous means that there aren't any gaps in the values that the output can take.

Examples here include predicting the price at which a house will sell, or predicting the age of a person based on their image.

Supervised learning has been successfully applied in biomedicine and healthcare, for both classification and regression problems.

For instance, classification techniques have been employed to predict the occurrence of an asthma exacerbation based on data from preceding telemonitoring reports.

Another example involves applying regression techniques to brain MRI images of patients with Alzheimer's, to predict the clinical stage of the disease, as expressed through continuous clinical variables.

There is a wide range of supervised learning techniques available, such as decision trees, random forests, neural networks, support vector machines, K-nearest neighbors, linear regression, logistic regression and others.

Let's look at a few of these in more detail.

K-nearest neighbors is one of the simplest machine learning algorithms, used for both classification and regression.

The main idea here is that the prediction for a new data point, or a new case, is the mode or mean of the values of its K closest data points. In other words, its K nearest neighbors.

Let's demonstrate this visually.

Here, we have blue circles and red boxes. Now suppose that a mysterious grey star is introduced, and we want to predict whether it is a blue circle or a red box.

If we're following the K-nearest neighbors algorithm, where K equals one, then all we need to do is find the closest data point and assign its value to our star. In this case, we predict that our star should be a red box.

We could try with different values for K, such as three. In this case, our star is predicted to be a blue circle. And that's all there is to it: all we do is look at the K closest data points and take their mode, in the case of classification, or their mean, in the case of regression.
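The whole algorithm fits in a few lines. Here is a minimal pure-Python sketch; the toy points and labels mirror the blue-circles and red-boxes picture but are made up for illustration:

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k, task="classification"):
    # Keep the k training points closest to the query.
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pair: euclidean(pair[0], query))[:k]
    values = [label for _, label in neighbors]
    if task == "classification":
        # Mode of the k nearest labels.
        return Counter(values).most_common(1)[0][0]
    # Mean of the k nearest values (regression).
    return sum(values) / k

# Made-up data: a cluster of blue circles and a cluster of red boxes.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
labels = ["blue", "blue", "blue", "red", "red", "red"]
print(knn_predict(points, labels, (6.5, 6.5), k=1))  # red
print(knn_predict(points, labels, (1.5, 1.5), k=3))  # blue
```

Passing `task="regression"` with numeric labels returns the mean of the neighbours instead of the mode.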

And how do we decide what value K should take? We can split our training data set into different segments, build K-nearest neighbors models using different values of K, and choose the value that gives the best accuracy.
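That selection procedure can itself be sketched in a few lines. The data set, the held-out segment and the candidate values of K below are all made up for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(points, labels, query, k):
    nearest = sorted(zip(points, labels),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def best_k(train, train_labels, held_out, held_out_labels, candidates):
    # Score each candidate K by its accuracy on the held-out segment
    # and keep the value that classifies that segment best.
    def accuracy(k):
        hits = sum(knn_classify(train, train_labels, q, k) == y
                   for q, y in zip(held_out, held_out_labels))
        return hits / len(held_out)
    return max(candidates, key=accuracy)

# Made-up training data with one mislabelled point near (1, 1).
train = [(0, 0), (0, 1), (1, 0), (1.2, 1.2), (5, 5), (5, 6), (6, 5)]
train_labels = ["A", "A", "A", "B", "B", "B", "B"]
chosen = best_k(train, train_labels, [(1, 1)], ["A"], candidates=(1, 3, 5))
print(chosen)  # 3: with K=1 the noisy neighbour wins; with K=3 it is outvoted
```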

You might also be wondering, how do we measure the distance between different data points? A commonly used distance metric for continuous variables is Euclidean distance, while for discrete variables, the Hamming distance is used.
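Both metrics are easy to state in code. A minimal sketch (the example feature values are made up):

```python
import math

def euclidean_distance(a, b):
    # Continuous features: straight-line distance between the vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(a, b):
    # Discrete features: the number of positions where the values differ.
    return sum(x != y for x, y in zip(a, b))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
print(hamming_distance(("smoker", "urban"), ("non-smoker", "urban")))  # 1
```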

Note that the classification example that we saw earlier contained two-dimensional training examples, but if we have more complex data with several features, then the training examples are vectors in a multidimensional feature space, each with a class label.

This could be the case when classifying patients with asthma as high risk or low risk, where several risk factors, such as smoking, air pollution and allergies, are used as features in the model.

The K-nearest neighbors algorithm is easy to understand and often gives reasonable performance without a lot of adjustments. Building the model is fast, but when a large number of features is involved, prediction can be rather slow. So this algorithm is worth trying as a baseline method before considering more advanced options.

Decision trees allow us to predict the value of a target variable based on a set of input variables.

Here's a fictitious example of a decision tree for predicting whether a flat is in Edinburgh or New York, and hence one used for classification.

As demonstrated in this example, decision trees have a tree-like structure consisting of nodes and edges. Each node is associated with one of the input variables, while the edges coming out of that node represent the possible values of that variable.

Here, for instance, the root node is about the flat price, and the edges distinguish between two mutually exclusive cases covering the entire range of possible values: greater than or lower than £400,000.

When we are presented with a previously unseen flat, all we need to do is follow the corresponding path down the tree to reach the predicted value at the leaf node.

So, suppose that we want to predict the location of a flat with these characteristics. The corresponding path is the one highlighted here, and the prediction is that it's an Edinburgh flat.
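Following a path down a tree like this is simple to code. Below is a hedged sketch: the tree is a nested dictionary, and the attribute names, thresholds and answers are invented, not the exact ones from the lecture's figure:

```python
# Hypothetical flat-location tree: each internal node maps one attribute
# to a dict of {answer: subtree}; leaves are plain strings (the prediction).
tree = {
    "price_over_400k": {
        "yes": {"built_after_2000": {"yes": "New York", "no": "Edinburgh"}},
        "no":  {"bedrooms_over_2":  {"yes": "Edinburgh", "no": "New York"}},
    }
}

def predict(node, flat):
    # Walk down the tree until we reach a leaf (a plain string).
    while isinstance(node, dict):
        attribute = next(iter(node))             # the question at this node
        node = node[attribute][flat[attribute]]  # follow the matching edge
    return node

flat = {"price_over_400k": "no", "bedrooms_over_2": "yes"}
print(predict(tree, flat))  # Edinburgh
```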

But how do we build a decision tree like that? Learning a decision tree is about learning the sequence of questions, in other words, the sequence of variables used and the splits of their possible values, that gets us to the correct answer most quickly.

Let's explain this using the flat example that we just saw. Suppose that we've been given a data set about flats in Edinburgh and New York with these attributes.

To learn a tree, the algorithm searches over all possible attributes, or questions, and picks the one that is most informative about the target variable. In our example, price is the first attribute chosen to split the data into two subsets, as it is a big determinant of flat location.

If the price of a flat is greater than 400k, then it's most likely to be in New York, and the other way round. However, there are still some cases of Edinburgh flats that cost more than 400k, as well as cases of New York flats that are cheaper than 400k.

We can build a more accurate model by repeating the search for the next attribute within each of the two subsets. The most informative attribute for the subset on the left-hand side of the tree is the number of bedrooms, while for the right-hand side of the tree, it's the year in which the flat was built. These new splits improve the prediction accuracy of the tree.
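The lecture doesn't name the criterion behind "most informative", but one common choice is to pick the attribute whose split has the lowest weighted Gini impurity. A sketch under that assumption, with a made-up flats data set:

```python
from collections import Counter

def gini(labels):
    # Impurity of a set of labels: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_attribute(rows, target):
    # Pick the attribute whose split yields the lowest weighted impurity.
    def split_impurity(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * gini(g) for g in groups.values())
    attributes = [a for a in rows[0] if a != target]
    return min(attributes, key=split_impurity)

# A tiny hypothetical data set: price separates the cities perfectly,
# bedrooms doesn't, so price should be chosen for the root split.
flats = [
    {"price": "high", "bedrooms": "2", "city": "New York"},
    {"price": "high", "bedrooms": "3", "city": "New York"},
    {"price": "low",  "bedrooms": "2", "city": "Edinburgh"},
    {"price": "low",  "bedrooms": "3", "city": "Edinburgh"},
]
print(best_attribute(flats, "city"))  # price
```

Growing the full tree amounts to applying the same search recursively to each subset, as described above.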

We could have kept going, introducing further attributes until the tree's accuracy reached 100 percent. However, this is not recommended, as our extended decision tree would then not perform well on previously unseen test data. This would be due to overfitting to the training data, a phenomenon that may occur in supervised learning, and which reduces the ability of the model to generalize from the training data to the test data.

Decision trees have been used in the medical field to diagnose blood infections, or even to predict heart attack outcomes in chest pain patients.

An extension of decision trees, called random forests, has also been used, in which several decision trees are built from separate subsets of the training data set and then combined to improve the prediction accuracy.

The main advantage of decision trees is that they can handle large amounts of data while remaining easy to read and communicate. On the other hand, they may lead to overly complex models, and overfitting can be an issue.

Neural networks, or artificial neural networks as they're also called, are another technique that is used for both classification and regression problems.

A neural network is an interconnected group of nodes, called neurons, which are captured here as circles. The arrows here represent connections between the neurons, which can transmit a signal from one neuron to another. The neuron that receives the signal can process it and then signal other neurons that are connected to it.

Neural networks are typically organized in layers. There is an input and an output layer, and then maybe one or more hidden layers.

Before looking at how neural networks work, let's explain how a single neuron processes its inputs. A neuron calculates a weighted sum of its inputs and then applies a threshold, or activation, function to determine whether the signal will be sent or not.

In a neural network, this process is repeated several times, first computing hidden units as an intermediate processing step, which are then combined again to yield the final result.

So, if we know the inputs, the weights between the different layers and the threshold functions, then we can compute the output, even manually.
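To make that "compute the output manually" point concrete, here is a minimal sketch of one neuron and a tiny two-layer forward pass. The weights, biases and the choice of a sigmoid activation are made-up illustrations, not values from the lecture:

```python
import math

def sigmoid(z):
    # A smooth activation function squashing any input into (0, 1).
    return 1 / (1 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus a bias, passed through the activation.
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
x = [1.0, 0.5]
hidden = [neuron(x, [0.4, -0.2], 0.1, sigmoid),
          neuron(x, [0.7, 0.9], -0.3, sigmoid)]
output = neuron(hidden, [1.5, -1.1], 0.2, sigmoid)
print(round(output, 3))
```

Knowing all the weights, we could trace every multiplication by hand; learning is about finding those weights when only the inputs and outputs are given.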

However, all we know in supervised learning is the inputs and outputs. So the purpose of neural networks is to learn the weights and the threshold functions from existing data, so as to make predictions about previously unseen data.

We should note that an important parameter that needs to be set by the user is the number of nodes in the hidden layer.

It is also possible to add more layers, as shown here. Having large neural networks that consist of many of these layers is referred to as deep learning. Deep learning has attracted a lot of attention lately, as it has brought significant advances in several challenging domains, such as image analysis and natural language processing.

Neural networks and deep learning are extremely powerful. They can deal with large amounts of data and build incredibly complex models. This, however, comes at the cost of interpretability: the more hidden layers are added, the harder it is to understand what is happening inside the neural network. Training large and powerful neural networks can also be a challenging task.
