Hello everyone! I'm Meng Wang from Center for Bioinformatics, Peking University.

Now let me introduce the support vector machine (SVM).

First I will show you how machine learning is done. Here is how the supervised machine learning is carried out.

Supervised machine learning picks a data set as the training data set.

Then it uses machine learning methods to analyze and model the data,

Common machine learning methods include the support vector machine,

the Hidden Markov Model, the Bayesian methods, and artificial neural networks.

Using the training data set to train these methods

is just to get the exact values of parameters of these models. That's how to get the whole model.

Then we can give new data to this model to compute the predicted result.

Here, the classification problem is one of the important problems addressed by machine learning.

We take as example the binary classification here.

It classifies data into two classes, one of which labelled with 1 and the other with -1.

Each data point in the training set can be classified by various features.

For example, here the X1, X2, and X3 denote three features, and Y denotes the class for each data point.

In this way we can map the data to a 3D space. The three features correspond to

three axes of this 3D space. Y represents the color of these points.

As you can see, these points are classified into two clusters, the Class 1 of which colored in blue and Class -1 in red.

By learning these known points, we can determine which cluster (and thus the color) a new given point should be assigned to.

A straightforward idea is to plot a plane to completely separate these two clusters.

Then we can determine whether a new given point is above or below this plane,

and tell the class it belongs to.

For example, if it is above this plane, then it belongs to the first class; if it is below this plane, then it belongs to the second class.

What the support vector machine wants to do is to find such a plane to finish the classification.

The support vector machine can be used not only in classification, but also in regression analysis, pattern recognition, etc.

The goal of support vector machine is to choose in the space certain key points, which are the support vectors.

These points will be used to construct the plane for classification to separate the two data points as much as possible.

SVM can handle more than the linear classification problems.

What makes it strong is that it can handle non-linear cases.

Briefly, it uses kernel functions to map non-linear cases to a hyper-dimensional space, and use linear method to finish it.

For example, assume in a 2D space are two classes of points, one in red and the other in green.

We want to find in this plane a straight line to separate these two classes of points.

As you can see, there are infinite number of lines between these two clusters of points that can separate them completely, as shown here.

Which one is the best?

As you can see, for points that are near the center of the two clusters, such as this one, the line in the figure above will classify it as Class 2.

But it will be classified as Class 1 if the line in the figure below is adopted. So which class does this point belong to?

What SVM wants to do is to find the optimal line among all the possible lines. The optimality is defined as the fairest.

Which is the best among all these planes for classification?

Intuitively, for the sake of fairness,

the one that lies in the very center of the two clusters is the fairest and the most reasonable one.

What SVM wants to do is to find among all possible planes the fairest one that lies in the very center of the two clusters.

As shown on the right, for these two clusters, once the blue and the pink lines are given,

the final red line can be determined unambiguously.

The blue and the pink lines are determined

by these three key points.

In other words, we can unambiguously determine the plane for classification in the center once these three points are given.

The plane is independent with other points beyond these two lines.

In other words, the final plane for classification is not affected by any addition or deletion of points beyond this region.

Thus this classification method is stable; it is not affected by outliers.

The three key points are called as support vectors in support vector machine.

In other words, the plane for classification is unambiguously determined once the support vectors are determined.

Thus the support vector machine is just to find the support vectors to determine unambiguously the plane for classification.

Therefore, SVM is to find in the space the fairest plane for classification to separate the two clusters of points as fair as possible.

The fairest plane is the plane whose distances to the two clusters are equal.

Next I will explain briefly how the support vector machine is expressed mathematically.

If you are interested, you can deduce and learn them by the hints provided in the slides and related materials.

Here is an expression for a plane in a space.

Mathematically speaking, we can determine whether a point is above or below the plane by assigning to the point values of its coordinates.

The value of f(x) can be worked out. The sign of f(x) can thus be used to determine whether the point is above or below the plane.

If f(x)>0, then the point is above the plane and belongs to the first class; if f(x)<0, then it is below the plane and belongs to the other class.

Then how can we evaluate the reliability of the classification result?

A straightforward idea is to use the absolute value of f(x). A larger value implies that the point is relatively far away from the plane.

The classification result is thus more reliable.

Likewise, if the absolute value of f(x) is small and approximates 0, it will be closer to the plane.

The classification result is thus less reliable.

Therefore, we can evaluate the how reliable the classification result is by the absolute value of f(x).

As expressed above, f(x) can be formulated as yi*f(xi). This formula is called as the functional margin of this point.

Next we define the functional margin of a hyperplane as the minimum of the functional margins over all points in the space.

This is defined as the functional margin of a hyperplane.

The functional margin of a point (or a hyperplane) has a very important property.

That is, they can be scaled arbitrarily. The reason is that

the equation always hold no matter by what number the two sides of the equation are multiplied.

For example, we multiply the w and b in the equation by 2 to get 2w and 2b.

The equation still holds anyway.

The functional margins of the point and the corresponding hyperplane are both doubled.

Therefore, by multiplying the two sides of the equation by any same number

the functional margin of a point and the hyperplane margin can be scaled to any number.

Thus, the hyperplane margin is not a static value. It is a relative value.

To make it stable, it is divided by the length of vector w, wich is the 2-norm of w. So the result is a absolute value, which is represented by r, called geometrical margin.

Here the r means the minimum of all the distances from each point to the hyperplane.

As we mentioned, the hyperplane margin could be scaled to any value.

Thus, we can set it to be 1, by multiplying a particular number at the two sides of equation.

So the formula is simplified by setting the hyperplane margin to be 1.

Here we get the following results. Because the r represents the minimum distance of all the points to the hyperplane,

when we maximize it, as you could imagine, the hyperplane will be located in the middle of the two clusters, which is the fairest plane.

In SVM, the objective is to maximize this formula.

At the same time, the hyperplane margin is set to be 1, and it is the minimum distance, so all the other points' distance to the plane is greater than or equal to 1.

Now we have convert the problem of finding the fairest plane to a problem of finding the maximum value of a function under constraints.

To facilitate calculations, we transform the formula, putting the 2-norm of w in the numerator position and squired and divided by 2,

which is equivalent to the original formula.

Finally, the problem becomes to get the minimum value of a function under constraints.

This problem can be solved by quadratic programming.

Meanwhile, there is another general approach to solve this sort of question,

that is lagrangian multiplier method.

The lagrangian multiplier method is to construct a formula about L and lagrange multipliers alpha, to get the extremum of a function.

According the knowledge of differential calculus, the extremum value of a function will be found at the points which their derivative is 0.

So by calculating each partial derivative and set them to be 0, we get the values of parameters by solving the system of equations.

So we get the results about x, y.

We replace the w in the formula about f(x) with above results.