Random Forest¶
Background¶
Random Forest is a decision tree algorithm for both classification and regression. It was first described by [Breiman2001]. It is a recursive algorithm which partitions the training dataset by doing binary splits based on the labels of the data. It builds an ensemble of models, and uses randomization to increase the variance between the trees of the ensemble. It performs remarkable well in practice, and is widely used.
Random Forest in Shark¶
For this tutorial, the following includes are needed:
#include <shark/Data/Csv.h> //importing the file
#include <shark/Algorithms/Trainers/RFTrainer.h> //the random forest trainer
#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h> //zero one loss for evaluation
Sample classification problem: Protein fold prediction¶
Let us consider the same bioinformatics problem as in the
Nearest Neighbor Classification and Linear Discriminant Analysis tutorial, namely the prediction of the
secondary structure of proteins. The goal is to assign a protein to
one out of 27 SCOP fold types [DingDubchak2001a]. We again consider
the descriptions of amino-acid sequences provided by
[DamoulasGirolami2008a]. The data C.csv
provide a description of the amino-acid compositions of 695 proteins
together with the corresponding fold type. Each row corresponds to a
protein. After downloading the data C.csv
we
can read, inspect and split the data as described in the
Nearest Neighbor Classification tutorial:
ClassificationDataset data;
importCSV(data, "data/C.csv", LAST_COLUMN, ' ');
//Split the dataset into a training and a test dataset
ClassificationDataset dataTest = splitAtElement(data,311);
Model and learning algorithm¶
We can use the RFTrainer class to train a RFClassifier which represents a Random Forest:
RFTrainer<unsigned int> trainer;
RFClassifier<unsigned int> model;
trainer.train(model, data);
Evaluating the model¶
After training the model we can evaluate it. As a performance measure, we consider the standard 0-1 loss. The corresponding risk is the probability of error, the empirical risk is the average classification error. When measured on set of sample patterns, it simply computes the fraction of wrong predictions:
ZeroOneLoss<> loss;
auto prediction = model(data.inputs());
cout << "Random Forest on training set accuracy: " << 1. - loss.eval(data.labels(), prediction) << endl;
prediction = model(dataTest.inputs());
cout << "Random Forest on test set accuracy: " << 1. - loss.eval(dataTest.labels(), prediction) << endl;
Of course, the accuracy is given by one minus the error. In this example, Random Forest outperforms LDA and KNN.
Parameters of the trainer¶
The trainer has some properties that can be set to tweak the learning process. All parameters have meaningful default values. The parameters are set by the following methods:
Method | Effect |
---|---|
setMTry(size_t mtry) |
Number of random attribute to try at each inner node of each tree. |
setNTrees(size_t nTrees) |
Number of trees to be built. Typically this would be 100+. |
setNodeSize(size_t nodeSize) |
The maximum nodesize, before a node is classified as a leaf. Lowering this value makes the trees in the ensemble larger and increasing this value makes the trees smaller. |
setOOBratio(double ratio) |
Controls the ratio determining the number of OOB (out-of-bag) samples is sampled from the training dataset. |
Full example program¶
The full example program is RFTutorial.cpp.
References¶
[Breiman2001] | L. Breiman. Random Forests. Machine Learning, vol. 45, issue 1, p. 5-32, 2001 |