0:40

The candidate explanatory variables include gender; race; alcohol, marijuana, cocaine, and inhalant use; availability of cigarettes in the home; whether or not either parent was on public assistance; any experience with being expelled from school; age; alcohol problems; deviance; violence; depression; self-esteem; parental presence; activities with parents; family and school connectedness; and grade point average.

Following the LIBNAME statement and the DATA step, which I am using to call in this data set called triad health, I include PROC HPFOREST. Next, I name my response, or target, variable, TREG1, and indicate with a forward slash and the LEVEL option that it is a categorical variable by including the word NOMINAL following the equal sign. Categorical, quantitative, and even ordinal explanatory variables for my random forest need to be included in separate INPUT statements. Here I list my categorical explanatory variables after the word INPUT, and end the statement with a forward slash, the LEVEL option, and the word NOMINAL following the equal sign.

Â 1:54

Then a second INPUT statement lists my quantitative explanatory variables, indicating that they are on an interval scale. As always, every statement ends with a semicolon. Finally, I end my program with a RUN statement, so let's run the program and take a look at the output.
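The program just described might look like the following sketch. Only TREG1 and the statement structure come from the walkthrough; the library path, data set name, and predictor names are placeholders for illustration:

```sas
/* Hypothetical library path and data set/variable names; adjust to your own.
   Only TREG1 and the statement structure come from the walkthrough. */
libname mydata "C:\mydata";

data new_data;
    set mydata.triad_health;
run;

proc hpforest data=new_data;
    target TREG1 / level=nominal;                 /* categorical response        */
    input gender race alcevr1 marever1
          / level=nominal;                        /* categorical predictors      */
    input age gpa1 devscore
          / level=interval;                       /* quantitative predictors     */
run;
```

Note that each measurement level gets its own INPUT statement, exactly as described above.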

We can see in the model information section that "variables to try" is equal to 5, indicating that a random selection of five explanatory variables was tested at each possible split for each node in each tree within the forest. By default, SAS will grow 100 trees and select 60% of the sample when performing the bagging process; that is the inbag fraction.

The prune fraction specifies the fraction of training observations that are available for pruning a split. The value can be any number from 0 to 1, although a number close to 1 would leave little data to grow the tree. The default value is 0; in other words, the default is not to prune. Leaf size specifies the smallest number of training observations that a new branch can have. The default value is 1.
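These defaults can be overridden with options on the PROC HPFOREST statement. The sketch below sets each one explicitly to the default described above; I believe these are the correct option names, but verify them against the HPFOREST documentation for your SAS release, and the data set and input names are placeholders:

```sas
/* Setting the defaults described above explicitly (values shown ARE the
   defaults). Data set and input variable names are placeholders. */
proc hpforest data=new_data
    maxtrees=100        /* number of trees grown in the forest      */
    vars_to_try=5       /* variables randomly sampled at each split */
    inbagfraction=0.6   /* fraction of the sample used for bagging  */
    prunefraction=0     /* 0 = no pruning (the default)             */
    leafsize=1;         /* smallest leaf a new branch can have      */
    target TREG1 / level=nominal;
    input marever1 alcevr1 / level=nominal;   /* placeholder predictors */
run;
```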

Â 3:07

The split criterion used in HPFOREST is the Gini index. In terms of missing data, if the value of our target, or response, variable is missing, the observation is excluded from the model. If the value of an explanatory variable is missing, PROC HPFOREST uses the missing value as a legitimate value by default. Notice, too, that the number of observations read from my data set was 6,504, while the number of observations used was 6,500.

Within the baseline fit statistics output, you can see that the misclassification rate of the random forest is displayed. Here we see that the forest misclassified 19.8% of the sample, suggesting that the forest correctly classified 80.2% of the sample.

Now I'll show the first ten and last ten observations of the fit statistics table. PROC HPFOREST computes fit statistics for a sequence of forests that have an increasing number of trees. As the number of trees increases, the fit statistics usually improve; that is, they decrease at first and then level off and fluctuate in a small range. Forest models provide an alternative estimate of average squared error and misclassification rate, called the out-of-bag, or OOB, estimate. The OOB estimate is a convenient substitute for an estimate based on test data and is a less biased estimate of how the model will perform on future data. We end up with near-perfect prediction in the training samples as the number of trees grown gets closer to 100. When those same models are tested on the out-of-bag sample, the misclassification rate is around 16%.
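If you want to examine this leveling-off pattern yourself, the fit statistics table can be captured to a data set and plotted. The ODS table name FitStatistics and the column names in the plot are assumptions based on common SAS conventions; run ODS TRACE to confirm them for your release:

```sas
/* Capture the per-tree fit statistics so the OOB error curve can be
   plotted. The ODS table name "FitStatistics" and the column names
   below are assumed; run "ods trace on;" first to confirm them. */
ods output FitStatistics=fitstats;
proc hpforest data=new_data;                    /* placeholder data set */
    target TREG1 / level=nominal;
    input marever1 alcevr1 / level=nominal;     /* placeholder predictors */
run;

proc sgplot data=fitstats;
    series x=Ntrees y=MiscOob;   /* OOB misclassification vs. trees grown */
run;
```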

The final table in our output represents arguably the largest contribution of random forests: the variable importance rankings. The number-of-rules column shows the number of splitting rules that use a variable. Each measure is computed twice, once on the training data and once on the out-of-bag data. As with the fit statistics, the out-of-bag estimates are less biased. The rows are sorted by the out-of-bag, or OOB, Gini measure. The variables are listed from highest to lowest importance in predicting regular smoking. In this way, random forests are sometimes used as a data reduction technique, where variables are chosen, in terms of their importance, for inclusion in regression and other types of future statistical models.
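Under the same ODS-capture approach, the importance rankings can be saved and the top-ranked predictors pulled out for a follow-up model. The table name VariableImportance and the Gini column name are assumptions to verify with ODS TRACE; the data set and input names remain placeholders:

```sas
/* Save the variable importance table, then keep the ten highest-ranked
   predictors for a later regression model. The names "VariableImportance"
   and "Gini" are assumed; confirm them with "ods trace on;". */
ods output VariableImportance=varimp;
proc hpforest data=new_data;                    /* placeholder data set */
    target TREG1 / level=nominal;
    input marever1 alcevr1 / level=nominal;     /* placeholder predictors */
run;

proc sort data=varimp;
    by descending Gini;          /* sort by the Gini importance measure */
run;

data top10;
    set varimp(obs=10);          /* keep the ten most important variables */
run;
```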

Here we see that some of the most important variables in predicting regular smoking include marijuana use, alcohol use, race, cigarette availability in the home, cocaine use, deviant behavior, etc.

To summarize: like decision trees, random forests are a type of data mining algorithm that can select, from among a large number of variables, those that are most important in determining the target, or response, variable to be explained. Also like decision trees, the target variable in a random forest can be categorical or quantitative, and the group of explanatory variables can be categorical or quantitative, or any combination. Unlike decision trees, however, the results of random forests generalize well to new data, since the strongest signals are able to emerge through the growing of many trees. Further, small changes in the data do not substantially impact the results of random forests. In my opinion, the main weakness of random forests is simply that the results are somewhat less satisfying, since no individual trees are actually interpreted. Instead, the forest of trees is used to rank the importance of variables in predicting the target. Thus, we get a sense of the most important predictive variables, but not of their relationships to one another.
