The split procedure we just described applies to continuous variables.

For continuous variables we can order records according to the value

of the chosen variable,

and split them at the midpoints between consecutive values.

Consider a data set with four values, 3.3, 1.3, 2.0, and 3.7.

After we sort the data from smallest to largest,

we obtain (1.3, 2.0, 3.3, 3.7).

The possible break points are the middle points, 1.65, 2.65, and 3.5.
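As a quick sketch (plain Python; the variable names are my own), the candidate break points can be computed from the sorted values:

```python
# Sorted values of the chosen continuous variable
values = sorted([3.3, 1.3, 2.0, 3.7])  # -> [1.3, 2.0, 3.3, 3.7]

# Candidate break points: midpoints between consecutive sorted values
breaks = [round((a + b) / 2, 4) for a, b in zip(values, values[1:])]
print(breaks)  # -> [1.65, 2.65, 3.5]
```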

For categorical data, the splitting procedure is slightly different.

Categorical data with multiple categories can be split into any pairs of subsets.

Note that this is not supported in some software implementations,

including XLMiner.

Instead, we first convert the categorical variable into multiple binary variables,

and then perform the split on the binary variables.
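As an illustrative sketch (pure Python; the category names are hypothetical), converting a categorical variable into binary dummy variables looks like this:

```python
# Hypothetical categorical variable with three categories
colors = ["red", "blue", "red", "green"]

categories = sorted(set(colors))  # -> ['blue', 'green', 'red']

# One binary (0/1) variable per category
dummies = {c: [1 if v == c else 0 for v in colors] for c in categories}
print(dummies["red"])  # -> [1, 0, 1, 0]
```

Each binary variable can then be split on its own two values, 0 and 1.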

We mentioned earlier that the goal of

recursive partitioning is to find pure partitions.

How do we measure purity, then?

There are many alternative measures, including the Gini index and entropy.

Let me briefly discuss the Gini index as an example.

The Gini index for a set of data is 1 minus the sum of Pk squared,

where Pk is the proportion of cases that belong to class k.

For binary classification, the number of classes is 2.

When all cases belong to the same class, Pk is 1 for

one class and 0 for all other classes.

It follows that the Gini index is 0.

It can be verified that the Gini index takes its maximum value when all classes

are equally represented.

For binary classification, this means that Pk = 0.5 for each class.
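A minimal sketch of this computation (plain Python; the function name is my own):

```python
def gini(proportions):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1 - sum(p ** 2 for p in proportions)

# All cases in one class -> perfectly pure
print(gini([1.0, 0.0]))  # -> 0.0

# Binary classes equally represented -> maximum impurity
print(gini([0.5, 0.5]))  # -> 0.5
```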

For the appointment data, lag takes integer values, 0, 1, 2, etc.

Therefore, the possible breaks are 0.5, 1.5, 2.5, etc.

Let's consider a split at the break 20.5.

For lag less than 20.5, 384 cases canceled and

3,356 cases arrived.

The proportion of cancellation is 0.1027,

and the proportion of arrival is 0.8973.

The Gini index is 0.1843.

Note that this index measures the purity of the data where lag is less than 20.5.

Similarly, the Gini index for the data with lag greater than 20.5 is 0.4509.
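We can check this number with a short sketch (plain Python) using the counts given above:

```python
canceled, arrived = 384, 3356   # cases with lag less than 20.5
total = canceled + arrived      # 3740 cases

p_cancel = canceled / total     # about 0.1027
p_arrive = arrived / total      # about 0.8973

gini_left = 1 - p_cancel ** 2 - p_arrive ** 2
print(round(gini_left, 4))      # -> 0.1843
```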

The overall Gini index is a weighted average across the two sets,

where the weights are the relative sizes of the two sets.

We can verify that the overall weighted Gini index is 0.3173.

Our goal is to find a split where the overall Gini index is as low as possible.

Specifically, we obtain overall impurity measure for each split and

compare this measure across all possible splits in all variables.

We choose a split that reduces impurity the most.
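On a tiny made-up dataset, this search over all candidate splits of one variable might be sketched like this (plain Python; the data and names are my own):

```python
def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Hypothetical data: one continuous variable x with binary labels y
x = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]

xs = sorted(set(x))
best_break, best_impurity = None, float("inf")
for a, b in zip(xs, xs[1:]):
    brk = (a + b) / 2  # candidate break: midpoint between values
    left = [yi for xi, yi in zip(x, y) if xi < brk]
    right = [yi for xi, yi in zip(x, y) if xi >= brk]
    # Overall impurity: Gini of each side, weighted by relative size
    overall = (len(left) * gini(left) + len(right) * gini(right)) / len(x)
    if overall < best_impurity:
        best_break, best_impurity = brk, overall

print(best_break, best_impurity)  # -> 6.5 0.0
```

In a real tree, the same comparison runs across all variables, not just one.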

The chosen split points then become nodes of the tree.

In the end nodes, or leaf nodes, we would like to make predictions and

label cases accordingly.

There are two alternative ways to make the prediction.

The first one is majority voting, where we predict the majority class in the node.

For example, when there are 6 successes out of 9, we'll predict success.
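A minimal sketch of majority voting in a leaf node (plain Python; the labels are illustrative):

```python
from collections import Counter

# Hypothetical leaf node with 6 successes out of 9 cases
leaf = ["success"] * 6 + ["failure"] * 3

# Predict the most common class in the leaf
prediction = Counter(leaf).most_common(1)[0][0]
print(prediction)  # -> success
```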