Classification And Regression Trees (CART)

Background

CART is a decision tree algorithm for both classification and regression, first described by [Breiman1984]. It recursively partitions the training dataset by binary splits. It is a conceptually simple decision tree algorithm that performs reasonably well in practice.

CART in Shark

For this tutorial the following includes are needed:

#include <shark/Data/Csv.h>                            // importing CSV files
#include <shark/Algorithms/Trainers/CARTTrainer.h>     // the CART trainer
#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h> // 0/1 loss for evaluation

Sample classification problem: Protein fold prediction

Let us consider the same bioinformatics problem as in the Nearest Neighbor Classification tutorial, namely the prediction of the secondary structure of proteins. The goal is to assign a protein to one of 27 SCOP fold types [DingDubchak2001a]. We again consider the descriptions of amino-acid sequences provided by [DamoulasGirolami2008a]. The data file C.csv provides a description of the amino-acid compositions of 695 proteins together with the corresponding fold type. Each row corresponds to a protein. After downloading C.csv we can read, inspect, and split the data as described in the Nearest Neighbor Classification tutorial:

// read data
ClassificationDataset dataTrain;
importCSV(dataTrain, "data/C.csv", LAST_COLUMN, ' ');


// Split the dataset into a training and a test dataset
ClassificationDataset dataTest = splitAtElement(dataTrain, 311);

cout << "Training set - number of data points: " << dataTrain.numberOfElements()
         << " number of classes: " << numberOfClasses(dataTrain)
         << " input dimension: " << inputDimension(dataTrain) << endl;

cout << "Test set - number of data points: " << dataTest.numberOfElements()
         << " number of classes: " << numberOfClasses(dataTest)
         << " input dimension: " << inputDimension(dataTest) << endl;

Model and learning algorithm

Training the CART tree is simple: a single call to the trainer creates the tree:

CARTTrainer trainer;
CARTClassifier<RealVector> model;
trainer.train(model, dataTrain);

Evaluating the model

After training the model we can evaluate it. As a performance measure, we consider the standard 0-1 loss. The corresponding risk is the probability of error; the empirical risk is the average classification error. When measured on a set of sample patterns, it is simply the fraction of wrong predictions:

ZeroOneLoss<unsigned int, RealVector> loss;
Data<RealVector> prediction = model(dataTrain.inputs());
cout << "CART on training set accuracy: " << 1. - loss.eval(dataTrain.labels(), prediction) << endl;

prediction = model(dataTest.inputs());
cout << "CART on test set accuracy:     " << 1. - loss.eval(dataTest.labels(), prediction) << endl;

Of course, the accuracy is given by one minus the error. In this example, CART performs slightly better than LDA and much better than random guessing.

Full example program

The full example program is CARTTutorial.cpp.

References

[Breiman1984] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.

[DingDubchak2001a] C. H. Q. Ding, I. Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17(4), 2001.

[DamoulasGirolami2008a] T. Damoulas, M. A. Girolami. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics 24(10), 2008.