Support Vector Machines: Likelihood-based Model Selection

Please first read the Support Vector Machines: First Steps tutorial, and possibly also the Support Vector Machines: Model Selection Using Cross-Validation and Grid-Search tutorial for traditional approaches to SVM model selection.

Motivation

The previous tutorial explained that the performance of an SVM classifier depends on the choice of regularization parameter \(C\) and the kernel parameters. We also presented the most common method to find SVM hyperparameters: grid search on the cross-validation error. This is suited for kernels with only one or two parameters, because a two- or three-dimensional SVM hyperparameter search space can still be sufficiently covered by a fixed grid of search points. Using naive heuristics like NestedGridSearch, where the search resolution increases iteratively, a four- or perhaps even five-dimensional SVM hyperparameter space can maybe still be sampled sufficiently. But we do not get around the fact that grid search-based SVM model selection suffers from the curse of dimensionality.

Naturally, much research has been directed toward differentiable estimates of, substitutes for, or bounds on the generalization error. A good overview is given in our paper [GlasmachersIgel2010]. There, we also present a novel model selection criterion that is differentiable (in almost all practical cases) with respect to the regularization and kernel parameters. In practical experiments, it compared very favorably to other gradient-based model selection criteria. We consider it the current state-of-the-art for gradient-based SVM model selection, and especially when the number of examples is relatively small. In the next paragraphs, we explain how to use this maximum-likelihood based approach to SVM model selection in Shark. For theoretical details and background, please consult the original article.

The toy problem

Assume we have a higher- or high-dimensional kernel, for example an “Automatic Relevance Detection” (ARD) kernel, which has one parameter for each input dimension:

\[k(x, z) = \exp( - \sum_i \gamma_i (x_i - z_i)^2 )\]

Such a kernel can be useful when the individual features correlate much differently with the labels, hence calling for individual bandwidths \(\gamma_i\) per feature. (From another angle, learning the ARD kernel bandwidths corresponds to learning a linear transformation of the input space.)

In [GlasmachersIgel2010], a toy problem is introduced which well lends itself to an ARD kernel and the optimization of its parameters. It creates a binary classification dataset of dimension \(2d\) in the following way: first, fix a positive or negative label \(y\), i.e., \(1\) or \(0\), respectively. Then, fill the first \(d\) dimensions by

\[y - 0.5 + \mathcal N(0.0,1.0) \enspace .\]

That is, produce Gaussian distributed noise around \(+0.5\) for positive label and \(-0.5\) for negative label. The second \(d\) dimensions are simply filled with only Gaussian noise \(\sim \mathcal N(0.0,1.0)\). Overall, there will be \(d\) dimensions which are correlated with the labels and hence informative, and \(d\) dimensions which are not correlated with the labels and uninformative.

By design, this toy problem is well tailored to an ARD kernel. The ARD kernel weights corresponding to the uninformative dimensions would best be optimized out to be zero, since these dimensions on average hold no information relevant to the classification problem. In the following, we will use our maximum-likelihood model selection criterion to optimize the hyperparameters of an SVM using an ARD kernel on such a toy problem. Ideally, the kernel weights will afterwards reflect the nature of the underlying distribution. (And we will see that they do.)

Likelihood-based model selection in Shark

You can find the source code for the following example in CSvmMaxLikelihoodMS.cpp (as generated by its according .tpp file). There, one trial is wrapped by the function run_one_trial(), which takes a verbosity preference as argument. The first trial is carried out verbosely, the 100 aggregating trials (which take a long time) silently and only the overall hyperparameter averages are printed. The tutorial below mostly covers the functionality of the run_one_trial() function. For the complete program, see the example .cpp file.

The key class for maximum-likelihood based SVM model selection in Shark is SvmLogisticInterpretation, and we include its header. To create the toy problem via the aptly named PamiToy distribution, we also include the header for data distributions; and the gradient-based optimizer “Rprop”, with which we will optimize the SVM hyperparameters under the SvmLogisticInterpretation criterion. With various other helpers, our complete list of includes and namespaces becomes:

#include <shark/Data/Dataset.h>
#include <shark/Data/CVDatasetTools.h>
#include <shark/Data/DataDistribution.h>
#include <shark/Data/Statistics.h>
#include <shark/Models/Kernels/ArdKernel.h>
#include <shark/Algorithms/QP/QuadraticProgram.h>
#include <shark/Algorithms/Trainers/CSvmTrainer.h>
#include <shark/Algorithms/GradientDescent/Rprop.h>
#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h>
#include <shark/ObjectiveFunctions/SvmLogisticInterpretation.h>
#include <shark/Algorithms/Trainers/NormalizeComponentsUnitVariance.h>

using namespace std;
using namespace shark;

Creating the toy problem

First, define the basic dimensionalities, here using \(d=5\):

// define the basic dimensionality of the problem
unsigned int useful_dim = 5;
unsigned int noise_dim = 5;
unsigned int total_dim = useful_dim + noise_dim;

Then set up the above described classification problem:

// set up the classification problem from a DataDistribution
PamiToy problem( useful_dim, noise_dim );

// construct training and test sets from the problem distribution
unsigned int train_size = 500;
unsigned int test_size = 5000;
ClassificationDataset train = problem.generateDataset( train_size );
ClassificationDataset test = problem.generateDataset( test_size );

and normalize the data to unit variance in the training set as usual:

// normalize data as usual
Normalizer<> normalizer;
NormalizeComponentsUnitVariance<> normalizationTrainer(false);
normalizationTrainer.train( normalizer, train.inputs() );
train = transformInputs( train, normalizer );
test = transformInputs( test, normalizer );

Then create the ARD kernel with appropriate dimensions (kernel parameter initialization will come later):

// set up the ArdKernel
DenseARDKernel kernel( total_dim, 0.1 ); //for now with arbitrary value for gamma (gets properly initialized later)

Data folds and model selection criterion

Before we go ahead and declare our model selection criterion (i.e., objective funtion), we first have to partition the training data into folds: the SvmLogisticInterpretation class requires to be passed data in the form of a CVFolds object. That is, it demands an existing partitioning for cross-validation. This way, control over the type of data partitioning (e.g., stratified vs. IID, etc.) strictly remains with the user:

// set up partitions for cross-validation
unsigned int num_folds = 5;
CVFolds<ClassificationDataset> cv_folds = createCVIID( train, num_folds );

The next three lines now finally set up the maximum-likelihood based objective function for model selection:

// set up the learning machine
bool log_enc_c = true; //use log encoding for the regularization parameter C
QpStoppingCondition stop(1e-12); //use a very conservative stopping criterion for the individual SVM runs
SvmLogisticInterpretation<> mlms( cv_folds, &kernel, log_enc_c, &stop ); //the main class for this tutorial
//SvmLogisticInterpretation<> mlms( cv_folds, &kernel, log_enc_c ); //also possible without stopping criterion

The first line specifies that in this case, we want to allow for unconstrained optimization of the regularization parameter (i.e., we do not want to bother with the possibility of the optimizer accidentally driving \(C\) into the negative half-space). However, true is also the default, so we could have omitted it had we not passed a custom stopping criterion. The second line sets up a QpStoppingCondition with a very conservative (= small) stopping criterion value. This gets used by all SVMs that the SvmLogisticInterpretation will train internally.

Note on the stopping criterion

Here, the QpStoppingCondition is set to a rather small, or conservative, value for the final KKT violation. In general, the computation of the SvmLogisticInterpretation criterion is somewhat volatile and requires high computational accuracy. For that reason, we use a very conservative stopping criterion in this tutorial. In a real-world setting this can be relaxed somewhat, as long as the signs of the gradient of the SvmLogisticInterpretation will be correct “often enough”. To date, we do not have an airtight method to properly choose the stopping criterion so that it is loose enough to allow fast optimization, but tight enough to ensure a proper optimization path. A well-performing heuristic used in [GlasmachersIgel2010] was to set the maximum number of iterations to 200 times the input dimension. This proved robust enough to have produced the state-of-the-art results given in the paper.

In the last line, we finally find the declaration of our objective function, which takes as arguments the CVFolds object, kernel, log-encoding information, and the stopping criterion (optional).

The optimization process

Now we only need to set a starting point for the optimization process, and we choose \(C=1\) and \(\gamma_i = 0.5/(2d)\) as motivated in the paper:

// set up a starting point for the optimization process
RealVector start( total_dim+1 );
if ( log_enc_c ) start( total_dim ) = 0.0; else start( total_dim ) = 1.0; //start at C = 1.0
for ( unsigned int k=0; k<total_dim; k++ )
    start(k) = 0.5 / total_dim;

(Note that by convention, the CSvmTrainer stores the regularization parameter \(C\) last in the parameter vector, and the SvmLogisticInterpretation honors this convention.)

One single evaluation of the objective function at this current point looks like this:

// for illustration purposes, we also evalute the model selection criterion a single time at the starting point
double start_value = mlms.eval( start );

A simple cout command can tell us that the value we get from that last call (on our development machine) is 0.337388.

Next, we set up an IRpropPlus optimizer, choosing the same parameters for it as in the original paper, except with a lower number of total iterations:

// set up the optimizer
IRpropPlus rprop;
double stepsize = 0.1;
double stop_delta = 1e-3;
mlms.init();
rprop.init( mlms, start, stepsize );
unsigned int its = 50;

The main process of this tutorial – optimizing the SVM hyperparameters under the SvmLogisticInterpretation objective function – is now straightforward and follows the general optimization schemes in Shark (see General Optimization Tasks as well as Optimizers and Trainers and following):

// start the optimization loop
for (unsigned int i=0; i<its; i++) {
    rprop.step( mlms );
    if ( verbose )
        std::cout << "iteration " << i << ": current NCLL = " <<  rprop.solution().value << " at parameter: " << rprop.solution().point << std::endl;
    if ( rprop.maxDelta() < stop_delta ) {
        if ( verbose ) std::cout << "    Rprop quit pecause of small progress " << std::endl;
        break;
    }
}

Evaluation after optimization

After the optimization loop, we would like to do several things: query the final objective function value, view the final hyperparameters, train a final SVM with them, and view the train and test errors obtained from that. For the latter tasks, there are at least two different ways to transfer the final hyperparameters from the model selection process to the final SVM. In both cases, care must be taken at one spot or another to correctly specify the encoding style for the regularization parameter (namely, the same as previously used by the SvmLogisticInterpretation object). These slightly error-prone lines are below marked with an //Attention comment. Before presenting each of the two approaches, we declase some general helper variables:

double C_reg; //will hold regularization parameter
double test_error_v1, train_error_v1; //will hold errors determined via method 1
double test_error_v2, train_error_v2; //will hold errors determined via method 2

Option 1: Implicit/manual copy

The first variant is to exploit an implicit parameter copy that takes place when calling SvmLogisticInterpretation::eval(...). This copies (only) the kernel parameters from the RProp solution vector into the kernel function used by the CSvmTrainer. But we still need to take care of the regularization parameter C. For this, we manually obtain the value of C, but carefully minding the parameter encoding…

// copy final parameters, variant one
double end_value = mlms.eval( rprop.solution().point ); //this at the same time copies the most recent parameters from rprop to the kernel.
C_reg = ( log_enc_c ? exp( rprop.solution().point(total_dim) ) : rprop.solution().point(total_dim) ); //ATTENTION: mind the encoding

… and print the parameter set:

if ( verbose ) {
    std::cout << "    Value of model selection criterion at final point: " << end_value << std::endl;
    std::cout << "    Done optimizing the SVM hyperparameters. The final parameters (true/unencoded) are:" << std::endl << std::endl;
    std::cout << "        C = " << C_reg << std::endl;
    for ( unsigned int i=0; i<total_dim; i++ )
        std::cout << "        gamma(" << i << ") = " << kernel.parameterVector()(i)*kernel.parameterVector()(i) << std::endl;
    std::cout << std::endl << "    (as also given by kernel.gammaVector() : " << kernel.gammaVector() << " ) " << std::endl;
}

The objective function value we get (on our development machine) is 0.335099, so the initial parameter guess in this case was already quite good (in terms of the associated objective function value).

For C and the gamma parameters, the output says:

C = 1.71335
gamma(0) = 0.460517
gamma(1) = 0.0193955
gamma(2) = 0.0277312
gamma(3) = 0.0235109
gamma(4) = 0.0308288
gamma(5) = 0
gamma(6) = 0.000977712
gamma(7) = 0
gamma(8) = 0.0171233
gamma(9) = 0

In the majority of cases, the ARD kernel parameters corresponding to uninformative feature dimensions were learned to be (close to) zero. However, for some reason, the value of gamma(8) is almost in the range of its informative counterparts (on our development machine).

With the SVM hyperparameters, we can now set up and train the final SVM, in order to see the “best” performance by our newly found “best” hyperparameters. As a sanity check, we print the hyperparameters again as accessed through the SVM trainer after its construction:

// construct and train the final learner
KernelClassifier<RealVector> svm_v1;
CSvmTrainer<RealVector> trainer_v1( &kernel, C_reg, true, log_enc_c ); //encoding does not really matter in this case b/c it does not affect the ctor
if ( verbose ) {
    std::cout << std::endl << std::endl << "    Used mlms.eval(...) to copy kernel.parameterVector() " << kernel.parameterVector() << std::endl;
    std::cout << "    into trainer_v1.parameterVector() " << trainer_v1.parameterVector() << std::endl;
    std::cout << "    , where C (the last parameter) was set manually to " << trainer_v1.C() << std::endl << std::endl << std::endl;
}
trainer_v1.train( svm_v1, train ); //the kernel has the right parameters, and we copied C, so we are good to go

Now that the final SVM was trained, we only need to pipe training and test set through it and a proper loss function to get the training and test errors:

// evaluate the final trained classifier on training and test set
ZeroOneLoss<unsigned int> loss_v1;
Data<unsigned int> output_v1; //real-valued output
output_v1 = svm_v1( train.inputs() );
train_error_v1 = loss_v1.eval( train.labels(), output_v1 );
output_v1 = svm_v1( test.inputs() );
test_error_v1 = loss_v1.eval( test.labels(), output_v1 );
if ( verbose ) {
    std::cout << "    training error via possibility 1:  " <<  train_error_v1 << std::endl;
    std::cout << "    test error via possibility 1:      " << test_error_v1 << std::endl << std::endl << std::endl;
}

On our development machine, we obtain:

training error:  0.116
test error:      0.1374

Our mission is now finished, and we present a second variant to copy the hyperparameters – namely via solution().point. We prefer this second variant, as it does not rely on calling eval(...) on the objective function first.

Option 2: Using solution().point

For this alternative take, we copy all the hyperparameters found by the optimizer into the CSvmTrainer. This is simply done via the setParameterVector method of the CSvmTrainer:

KernelClassifier<RealVector> svm_v2;
CSvmTrainer<RealVector> trainer_v2( &kernel, 0.1, true, log_enc_c ); //ATTENTION: must be constructed with same log-encoding preference
trainer_v2.setParameterVector( rprop.solution().point ); //copy best hyperparameters to svm trainer

Again, we print the trainer’s parameter vector for comparison:

if ( verbose ) {
    std::cout << "    Copied rprop.solution().point = " << rprop.solution().point << std::endl;
    std::cout << "    into trainer_v2.parameterVector(), now = " << trainer_v2.parameterVector() << std::endl << std::endl << std::endl;
}

Training is now as simple as:

trainer_v2.train( svm_v2, train );

To evaluate this second SVM’s prediction, again pipe all data through the SVM and a proper loss:

// evaluate the final trained classifier on training and test set
ZeroOneLoss<unsigned int> loss_v2;
Data<unsigned int> output_v2; //real-valued output
output_v2 = svm_v2( train.inputs() );
train_error_v2 = loss_v2.eval( train.labels(), output_v2 );
output_v2 = svm_v2( test.inputs() );
test_error_v2 = loss_v2.eval( test.labels(), output_v2 );
if ( verbose ) {
    std::cout << "    training error via possibility 2:  " <<  train_error_v2 << std::endl;
    std::cout << "    test error via possibility 2:      " << test_error_v2 << std::endl << std::endl << std::endl;
    std::cout << std::endl << "That's all folks - we are done!" << std::endl;
}

And we are happy to get the same results as above:

training error:  0.116
test error:      0.1374

Repetition over 100 trials

We now examine the distribution of hyperparameter values over several trials on different realizations of the toy problem distribution. We repeat the experiment 100 times, and note the means and variances of the SVM hyperparameters. This also mostly follows the methodology in [GlasmachersIgel2010]. We obtain the following results (where the last/11th entry is the regularization parameter C):

avg-param(0)    = 0.0174454  +- 0.000372237
avg-param(1)    = 0.0243765  +- 0.00276891
avg-param(2)    = 0.0170669  +- 0.000236762
avg-param(3)    = 0.0148257  +- 0.000139686
avg-param(4)    = 0.0175333  +- 0.000225192
avg-param(5)    = 0.00810077 +- 0.000397033
avg-param(6)    = 0.00831601 +- 0.000484481
avg-param(7)    = 0.0134892  +- 0.000909667
avg-param(8)    = 0.00652671 +- 0.000238294
avg-param(9)    = 0.00863524 +- 0.000432687
avg-param(10)   = 1.68555    +- 0.971377

avg-error-train = 0.12594    +- 0.000294276
avg-error-test  = 0.137724   +- 4.49206e-05

We see that on average, the SvmLogisticInterpretation objective clearly selects a meaningful model with an emphasis on the informative parameters. At the same time, some tendency still exists for the uninformative parameters to be different from completely zero. Note that the mean test error is well below 14%, which is an excellent value for an SVM on this toy problem.

References

[GlasmachersIgel2010](1, 2, 3, 4) T. Glasmachers and C. Igel. Maximum Likelihood Model Selection for 1-Norm Soft Margin SVMs with Multiple Parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.