In that last video,

we had already mentioned the ensemble learning methods and presented random forests,

as one of the most known and most successful ensemble learning method.

It turns out that there are even more powerful methods in these classic approaches

which is collectively thought to as Boosting method or the Boosting methods.

Broadly speaking, Boosting is any ensemble method

that produces a strong learner out of a several weak learners.

Boosting was developed in 1990's in the work of Schapire and Freund.

And since then, it became one of

the most successful and practical methods in the machine learning community.

Leo Breiman himself, once called the boosting where

a weak learner is a shallow CART tree,

the best off the shelf classifier in the world.

Given that most of top places in open data science competitions

such as Kaggle are one using some combinations of ensemble methods,

and boosting trees in particular.

This opinion expressed in 1998 remains largely valid today as well.

Even though there are some exceptions such as deep learning models and

certain types of data such as images or texts,

where there are lots of such data.

As you could guess, there are two main questions in many boosting algorithm.

How to choose a basic learner?

And how to construct a strong learner out of a bunch of weak learners?

While many boosting algorithms are agnostic about the first question,

that is they can work with arbitrary weak learners in practice.

People usually use shallow CART trees weak learners.

And more important question is how to construct a strong learner

and this is where different boosting methods differ from each other,

dependent both from the type of the loss function and the type of the algorithm.

Among many possible boosting algorithms,

two most known or popular ones are AdaBoost and Gradient boosting.

Boosting algorithms are meta models that have their own hyper parameters.

In particular, these hyper parameters include the number of weak learners,

which is needed to control model complexity and hence,

out of sample performance.

Further, they inherit hyper parameters or

base learners such as hyper-parameters over Cart tree.

Furthermore, they often have learning rate parameters,

which control the speed of weight updates in boosting and are similar in their meaning to

learning rates we used for gradient descent methods in the crash course.

In view of their popularity and practical importance,

it's worth for us to spend

the remaining time allocated to supervise learning in our course,

on taking the look at boosting.

Due to availability of a implementation

of different boosting algorithms in libraries such as scikit-learn.

Once you understand what they do and how they are used,

you can start experimenting with boosting methods in a matter of minutes.

Now, what I want to do in this video is to present you a unified view of

different Boosting methods by viewing them as a problem of

optimization with a gradient acting in the function space.

This view of Boosting was originally suggested by the same Leo Breiman in 1990s.

And since then, it became the most popular way of interpretation of Boosting algorithms.

Boosting is viewed as a problem of minimization of

empirical laws over the functional form of the model outputs.

Let's see how it works.

Let's assume that we defined a loss function,

which is a function of observations why I and modal outputs f of Xi,

were Xi are model inputs.

Dependent on whether your problem is regression or classification,

you can use different loss functions.

For aggression, you can use this squared loss or absolute loss functions.

[inaudible] respectively to algorithms called L2Boosting and Gradient Boosting.

For classification, you can use the exponential loss function shown here,

which leads to AdaBoost.

We can however, proceed to the general way.

So, this now will hold for either regression or classification.

The objective function minimized by boosting is

just some of the loss functions over all data points.

This should be minimized over all possible functions f of x.

But such problem is to general.

And also, not very practical because it's an infinite dimensional problem.

But, we can convert it to a finite dimensional problem.

If we restrict the functional form of function f of x to be

an expansion in some set of features or basis functions Q of x.

With weights w, m plus a constant term w not.

In this formulation function optimization amounts to

optimization with respect to m plus one parameters w sub i,

where i runs from zero to m plus parameters gamma sub m,

that parameterize Bayes's functions.

The algorithm starts with the minimization of

the objective function with respect to all model parameters.

So for example, if we deal with regression and use this square loss function

shown here we can set w not equal to y bar,

which is the mean value of outputs y,

while setting all the rest of weights w,

i from one to m to 0.

Then, on each iteration M, we do the following.

First, we compute parameters w,

m and gamma m,

for this iteration by solving the first equation shown here.

In this equation, we substitute in the objective function,

the sum of the previous step of approximation for function f of x and a sum of a

Basis learner times weight w. When these parameters are computed,

on the next step, they are substituted back into

the second formula to find an updated function approximation for the output.

This is called forward stage-wise additive modeling.

Because in this procedure,

there is no going back and updating earlier parameters.

You can read more on boosting methods for example in chapter 16.4 in the book by Murphy.

And in particular, you can follow the deviation of the other Boost

algorithm when the loss function used is so-called exponential loss function,

that is often used for classification tasks.

Another example is regression where the loss is given by

the absolute value of the difference between the model prediction and the data.

This produces the same as Gradient Boosting algorithm.

The example I want to specifically mention here is

the example of regression with the regular square loss function.

In this case, as we just said before,

the function f of x can be initialized by the mean value or the target value y.

Then, at iteration m,

we have the previous optimization problem for parameters w,

m and gamma, m. But this time,

use with the quadratic loss function.

As this equation suggests,

we can interpret such procedure as a method of adding

the next Basis function along with its fitted parameters and its weights w,

by fitting it to the residual between the data and

the model which is absorbed in the previous iteration.

This is called least square boosting or L2Boosting.

And it was suggested by Buhlmann and Yu in 2003.

Again, I refer you to the book of Murphy and additional reading for this week.

For more information on boosting methods.

And on this note,

we finish our excursion into Supervise learning,

which was the topic of the first week of our course.

Then next week, we will move on and start looking into unsupervised learning algorithms.

Good luck with your homework and see you the next big.