Support Vector Machines: Model Selection Using Cross-Validation and Grid-Search¶
Please read the Support Vector Machines: First Steps tutorial first to follow the SVM example. However, the part on cross-validation and grid-search of course also applies to other classifiers.
The performance of your SVM classifier depends on the choice of the regularization parameter \(C\) and the kernel parameters. For a standard radial Gaussian kernel \(k(x,z)=\exp(-\gamma\|x-z\|^2)\),
the bandwidth parameter \(\gamma\) (or, equivalently, \(\sigma\) with \(\gamma = 1/(2\sigma^2)\)) is the only kernel parameter. Adapting these “hyperparameters” is referred to as SVM model selection.
The Shark library offers many algorithms for SVM model selection. In this tutorial, we consider the most basic approach.
Cross-validation¶
Cross-validation (CV) is a standard technique for adjusting hyperparameters of predictive models. In K-fold CV, the available data \(S\) is partitioned into K subsets \(S_1,\dots, S_K\). Each data point in \(S\) is randomly assigned to one of the subsets such that these are of almost equal size (i.e., \(\lfloor |S|/K\rfloor \leq |S_i|\leq \lceil |S|/K\rceil\)). Further, we define \(S_{\setminus i}=\bigcup_{j=1,\dots,K \wedge j\neq i} S_j\) as the union of all data points except those in \(S_i\). For each \(i=1,\dots,K\), an individual model is built by applying the algorithm to the training data \(S_{\setminus i}\). This model is then evaluated by means of a cost function using the test data in \(S_i\). The average of the K outcomes of these model evaluations is called the cross-validation (test) performance or cross-validation (test) error and is used as a predictor of the performance of the algorithm when applied to \(S\). Typical values for K are 5 and 10 [HastieTibshiraniFriedman-2008].
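Written out (using \(f_{S_{\setminus i}}\) for the model trained on \(S_{\setminus i}\) and \(L\) for the underlying loss function, notation introduced here only for this formula), the K-fold cross-validation error is the average test error over the folds:
\[ \text{err}_{\text{CV}} = \frac{1}{K} \sum_{i=1}^{K} \frac{1}{|S_i|} \sum_{(x,y)\in S_i} L\big(f_{S_{\setminus i}}(x),\, y\big) \]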
To choose \(C\) and \(\gamma\) using K-fold CV, we first split the available data into K subsets. Then we compute the CV error based on this split for SVM classifiers with different values of \(C\) and \(\gamma\). Finally, we pick the \(C\) and \(\gamma\) with the lowest CV error and use them for training an SVM on the complete data set \(S\).
Jaakkola’s heuristic¶
Jaakkola’s heuristic [JaakkolaDiekhausHaussler1999] provides a reasonable initial guess for the bandwidth parameter \(\gamma\) or \(\sigma\) of a Gaussian kernel.
To estimate a good value for \(\sigma\), consider all pairs consisting of a training input vector from the positive class and a training input vector from the negative class. Compute the distances in input space between all these pairs. The median of these distances can be used as a measure of scale and therefore as a guess for \(\sigma\). More formally, compute
\[ G = \big\{ \|x_i - x_j\| \;\big|\; (x_i, y_i), (x_j, y_j) \in S \wedge y_i \neq y_j \big\} \]
based on your training data \(S\). Then set \(\sigma_{\text{Jaakkola}}\) equal to the median of the values in \(G\):
\[ \sigma_{\text{Jaakkola}} = \operatorname{median}(G) \]
The corresponding value of the bandwidth parameter is \(\gamma_{\text{Jaakkola}} = 1/(2\sigma_{\text{Jaakkola}}^2)\).
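For illustration only, the following minimal sketch computes the heuristic on plain std::vector data with made-up toy inputs; it is not Shark code (in Shark, the class JaakkolaHeuristic used below does this for you):
// Sketch of Jaakkola's heuristic on plain C++ containers (toy data, illustration only)
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
	// toy inputs with binary labels (0 = negative class, 1 = positive class)
	std::vector<std::vector<double> > x = {{0.0, 0.0}, {0.1, 0.2}, {1.0, 1.1}, {0.9, 1.0}};
	std::vector<unsigned int> y = {0, 0, 1, 1};

	// collect the distances between all pairs of points from different classes
	std::vector<double> dist;
	for (std::size_t i = 0; i < x.size(); ++i) {
		for (std::size_t j = i + 1; j < x.size(); ++j) {
			if (y[i] == y[j]) continue;
			double d2 = 0.0;
			for (std::size_t k = 0; k < x[i].size(); ++k) {
				double diff = x[i][k] - x[j][k];
				d2 += diff * diff;
			}
			dist.push_back(std::sqrt(d2));
		}
	}

	// sigma = median of the inter-class distances; gamma = 1/(2 sigma^2)
	std::nth_element(dist.begin(), dist.begin() + dist.size() / 2, dist.end());
	double sigma = dist[dist.size() / 2];
	double gamma = 1.0 / (2.0 * sigma * sigma);
	std::cout << "sigma = " << sigma << ", gamma = " << gamma << std::endl;
}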
SVM Model Selection in Shark¶
We consider the same toy problem and the same models as in the tutorial Support Vector Machines: First Steps. We additionally include:
#include <shark/ObjectiveFunctions/CrossValidationError.h>
#include <shark/Algorithms/DirectSearch/GridSearch.h>
#include <shark/Algorithms/JaakkolaHeuristic.h>
for computing the cross-validation error, for optimizing the parameters using grid-search, and for calculating Jaakkola’s heuristic, respectively.
Preparing the SVM for unconstrained optimization¶
Our model and trainer are now given by:
GaussianRbfKernel<> kernel(0.5, true); // initial gamma = 0.5; true: encode the parameter as ln(gamma) for unconstrained optimization
KernelClassifier<RealVector> svm;
bool offset = true;
bool unconstrained = true;
CSvmTrainer<RealVector> trainer(&kernel, 1.0, offset,unconstrained);
The Boolean flags set to true in the constructors of GaussianRbfKernel and CSvmTrainer indicate that the kernel parameter \(\gamma\) and the regularization parameter \(C\), which “belongs” to the trainer, are internally encoded as \(\ln \gamma\) and \(\ln C\). Because both parameters have to be positive, this encoding allows for unconstrained optimization (e.g., if the model parameter encoding the kernel width is set to -1, we have \(\gamma = 1/e\)). This encoding affects the interface between model, objective function, and optimizer, but not functions such as setGamma, setSigma, or setC. These always behave the same regardless of the internal encoding.
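To illustrate the encoding, here is a small sketch; it assumes the kernel object from above is in scope and accesses RealVector elements with operator(), as elsewhere in Shark:
RealVector p(1);                  // the kernel exposes a single parameter, here ln(gamma)
p(0) = -1.0;                      // set ln(gamma) = -1 ...
kernel.setParameterVector(p);     // ... so gamma becomes exp(-1), about 0.368
std::cout << kernel.gamma() << std::endl;
kernel.setGamma(0.5);             // setGamma works on the natural scale ...
std::cout << kernel.parameterVector()(0) << std::endl; // ... while the encoded parameter is ln(0.5)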
Cross-validation¶
Now we define the cross-validation error. First, let us define the folds we need:
const unsigned int K = 5; // number of folds
ZeroOneLoss<unsigned int> loss;
CVFolds<ClassificationDataset> folds = createCVSameSizeBalanced(dataTrain, K);
The first line sets the number of folds \(K\) to 5. Then we define the basic error measure underlying the cross-validation error, here the standard 0-1 loss. Note that we do classification here, so we need to use unsigned int as the template parameter for the 0-1 loss. After that, we split the available training data into \(K\) folds using the function createCVSameSizeBalanced() from Data/CVDatasetTools.h.
Then we need to instantiate CrossValidationError:
CrossValidationError<KernelClassifier<RealVector>, unsigned int> cvError(
folds, &trainer, &svm, &trainer, &loss
);
The template arguments of CrossValidationError specify the model and that the given labels are unsigned integers (encoding classes). The first parameter of the constructor is just the data in the form of the folds we defined. The last parameter is the loss function on which the CV error is based. But what about the other three parameters? The CrossValidationError works as follows. A new parameter configuration is written into a “meta” object A that is IParameterizable (such as a regularizer or a trainer, in our case the trainer we defined earlier). Then the specified model B is trained with the specified trainer C. The pointers to A, B, and C are the arguments 2, 3, and 4 of the constructor. In our case of SVM model selection, the meta object and the trainer are the same; this is why the trainer occurs twice as a constructor argument.
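Conceptually, evaluating cvError at a given parameter vector amounts to the following loop. This is a rough sketch of the idea, not the actual implementation; it assumes the folds, trainer, svm, and loss objects defined above, and a hypothetical RealVector parameters holding (ln gamma, ln C):
trainer.setParameterVector(parameters);                          // write C and gamma into the "meta" object
double sum = 0.0;
for (std::size_t i = 0; i != folds.size(); ++i) {
	trainer.train(svm, folds.training(i));                       // build a model on S \ S_i
	Data<unsigned int> output = svm(folds.validation(i).inputs()); // predict on S_i
	sum += loss.eval(folds.validation(i).labels(), output);      // 0-1 error on S_i
}
double cvErrorValue = sum / folds.size();                        // average over the K folds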
Jaakkola’s heuristic¶
To find a good starting point for \(\gamma\), we apply Jaakkola’s heuristic
JaakkolaHeuristic ja(dataTrain);
double ljg = log(ja.gamma()); // natural log of the gamma value suggested by the heuristic
as defined above.
Grid-search¶
We have two hyperparameters. To adapt them using grid-search, we have to define a two-dimensional grid. Let us consider 9 grid points for \(\ln \gamma\) and 11 for \(\ln C\). Let
\[ \ln \gamma \in \{\ln \gamma_{\text{Jaakkola}} - 4,\, \ln \gamma_{\text{Jaakkola}} - 3,\, \dots,\, \ln \gamma_{\text{Jaakkola}} + 4\} \]
and
\[ \ln C \in \{0, 1, \dots, 10\} \]
and define the grid accordingly:
GridSearch grid;
vector<double> min(2);
vector<double> max(2);
vector<size_t> sections(2);
// kernel parameter gamma
min[0] = ljg-4.; max[0] = ljg+4; sections[0] = 9;
// regularization parameter C
min[1] = 0.0; max[1] = 10.0; sections[1] = 11;
grid.configure(min, max, sections);
The optimizer GridSearch “sees” the parameters in the logarithmic encoding we activated in the model and trainer definitions above. Therefore, we specify a linear grid while effectively searching on a logarithmic scale. Now we do the grid search by
grid.step(cvError);
and finally train the model using all data and the best parameters
trainer.setParameterVector(grid.solution().point);
trainer.train(svm, dataTrain);
and evaluate the model as described in Support Vector Machines: First Steps.
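For completeness, the evaluation looks roughly as in the First Steps tutorial; this sketch assumes a labeled test set dataTest in addition to the objects defined above:
// evaluate the tuned SVM on training and test data using the 0-1 loss
Data<unsigned int> output = svm(dataTrain.inputs());
double trainError = loss.eval(dataTrain.labels(), output);
output = svm(dataTest.inputs());
double testError = loss.eval(dataTest.labels(), output);
std::cout << "training error: " << trainError << "\ntest error:     " << testError << std::endl;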
Full example program¶
The full example program for this tutorial is CSvmGridSearchTutorial.cpp.
References¶
[HastieTibshiraniFriedman-2008] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, chapter 7. Springer-Verlag, 2008.
[JaakkolaDiekhausHaussler1999] T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB), AAAI Press, 1999.