Please read the Support Vector Machines: First Steps tutorial first to follow the SVM example. However, the part on cross-validation and grid-search works of course also for other classifiers.
The performance of your SVM classifier depends on the choice of the regularization parameter \(C\) and the kernel parameters. For a standard radial Gaussian kernel
the bandwidth parameter \(\gamma\) (or \(\sigma\)) is the only kernel parameter. Adapting the “hyperparameters” is referred to as SVM model selection.
The Shark library offers many algorithms for SVM model selection. In this tutorial, we consider the most basic approach.
Cross-validation (CV) is a standard technique for adjusting hyperparameters of predictive models. In K-fold CV, the available data \(S\) is partitioned into K subsets \(S_1,\dots, S_K\). Each data point in \(S\) is randomly assigned to one of the subsets such that these are of almost equal size (i.e., \(\lfloor |S|/K\rfloor \leq |S_i|\leq \lceil |S|/K\rceil\)). Further, we define \(S_{\setminus i}=\bigcup_{j=1,\dots,K \wedge j\neq i} S_i\) as the union of all data points except those in \(S_i\). For each \(i=1,\dots,K\), an individual model is built by applying the algorithm to the training data \(S_{\setminus i}\). This model is then evaluated by means of a cost function using the test data in \(S_i\). The average of the K outcomes of the model evaluations is called cross-validation (test) performance or cross-validation (test) error and is used a predictor of the performance of the algorithm when applied to \(S\). Typical values for K are 5 and 10 [HastieTibshiraniFriedman-2008].
To choose \(C\) and \(\gamma\) using K-fold CV, we first split the availbale data into K subsets. Then we compute the CV error using this split error for the SVM classifiers using different values for \(C\) and \(\gamma\). Finally, we pick the \(C\) and \(\gamma\) with the lowest CV error and use it for training an SVM on the complete data set \(S\).
Jaakkola’s heuristic [JaakkolaDiekhausHaussler1999] provides a reasonable initial guess for the bandwidth parameter \(\gamma\) or \(\sigma\) of a Gaussian kernel.
To estimate a good value for \(\sigma\), consider all pairs consisting of an training input vector from the positive class and a training input vector from the negative class. Compute the difference in input space between all pairs. The median of these distances can be used as a measure of scale and therefore as a guess for \(\sigma\). More formally, compute
based on your training data \(S\). Then set \(\sigma_{\text{Jaakkola}}\) equal to the median of the values in \(G\):
We consider the same toy problem and the same models as in the tutorial Support Vector Machines: First Steps. We additionally include
#include <shark/ObjectiveFunctions/CrossValidationError.h>
#include <shark/ObjectiveFunctions/JaakkolaHeuristic.h>
#include <shark/Algorithms/DirectSearch/GridSearch.h>
for computing the cross-validation error, for calculating Jaakkola’s Heuristic, and for optimizing the parameters using grid-search, respectively.
Our model and trainier are now given by:
GaussianRbfKernel<> kernel(0.5, true);
KernelExpansion<RealVector> svm(true);
CSvmTrainer<RealVector> trainer(&kernel, 1.0, true);
The Boolean flags set to true in the constructors of GaussianRbfKernel and CSvmTrainer indicate that the kernel parameter \(\gamma\) and the regularization parameter \(C\), which “belongs” to the trainer, are internally encoded as \(\ln \gamma\) and \(\ln C\). Because both parameters have to be positive, this encoding allows for unconstraint optimization (e.g., if the model parameter encoding the kernel width is set to -1, we have \(\gamma =1/e\)). This encoding affects the interface between model, objective function, and optimizer, but not functions such as setGamma, setSigma or setC. These behave always the same regardless of the internal encoding.
Now we define the cross-validation error by instantiating CrossValidationError :
const unsigned int K = 5;
ZeroOneLoss<unsigned int, RealVector> loss;
createCVSameSizeBalanced(dataTrain, K);
CrossValidationError<RealVector, RealVector, unsigned int> cvError(dataTrain,
&trainer,
&svm,
&trainer,
&loss,
K);
The first line sets defines the number of folds. Then we define the basic error measure underlying the cross-validation error, here the standard 0-1 loss. After that we split the available training data into K folds using the function createCVSameSizeBalanced() from Data/CVDatasetTools.h. The template arguments of CrossValidationError specify that the input data points are real vectors, the output of the models are real vectors, and that the given labels are unsigned integers (encoding classes). The first and the last two parameters of the constructor are clear. First we have to pass the training data to the CrossValidationError. The final two parameters specify the loss function on which the CV error is based and the number of folds, respectively. But what about the other parameters? The CrossValidationError works as follows. A new parameter configuration is written into an “meta” object A that is IParameterizable (such as a regularizer or a trainer). Then the specified model B is trained with the specified trainer C. The pointers to A, B, and C are the arguments 2, 3, and 4 of the constructor. In our case of SVM model selection, the meta object and the trainier are the same.
To find a good starting point for \(\gamma\), we apply Jaakkola’s heuristic
JaakkolaHeuristic ja(dataTrain);
double ljg = log(ja.gamma());
as defined above.
We have two hyperparameters. To adapt them using grid-search, we have to define a two-dimensional grid. Let us consider 17 grid points for \(\ln \gamma\) and 11 for \(\ln C\). Let
and
and define the grid accordingly:
GridSearch grid;
vector<double> min(2);
vector<double> max(2);
vector<size_t> sections(2);
min[0] = ljg-4.; max[0] = ljg+4; sections[0] = 17; // kernel parameter gamma
min[1] = 0.0; max[1] = 10.0; sections[1] = 11; // regularization parameter C
grid.configure(min, max, sections);
The optimizer GridSearch “sees” the parameter in the logarithmic encoding we activated in the model and trainier definition above. Therefore, we specify a linear grid while searching on logarithmic scale. Now we do the grid search by
grid.step(cvError);
and finally train the model using all data and the best parameters
trainer.setParameterVector(grid.solution().point);
trainer.train(&svm, dataTrain);
and evaluate the model as described in Support Vector Machines: First Steps.
The full example program for tutorial is CSvmGridSearchTutorial.cpp.
| [HastieTibshiraniFriedman-2008] | T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning, section 4.3. Springer-Verlag, 2008. |
| [JaakkolaDiekhausHaussler1999] |
|