Support Vector Machines: Likelihood-based Model Selection
=========================================================

Please first read the :doc:`svm` tutorial, and possibly also the
:doc:`svmModelSelection` tutorial for traditional approaches to SVM model
selection.

Motivation
----------

The previous tutorial explained that the performance of an SVM classifier
depends on the choice of regularization parameter :math:`C` and the kernel
parameters. We also presented the most common method to find SVM
hyperparameters: grid search on the cross-validation error. This is suited for
kernels with only one or two parameters, because a two- or three-dimensional
SVM hyperparameter search space can still be covered sufficiently by a fixed
grid of search points. Using simple heuristics like :doxy:`NestedGridSearch`,
where the search resolution increases iteratively, a four- or perhaps even
five-dimensional SVM hyperparameter space might still be sampled sufficiently.
But there is no way around the fact that grid search-based SVM model selection
suffers from the curse of dimensionality.

Naturally, much research has been directed toward differentiable estimates of,
substitutes for, or bounds on the generalization error. A good overview is
given in our paper [GlasmachersIgel2010]_. There, we also present a novel
model selection criterion that is differentiable (in almost all practical
cases) with respect to the regularization and kernel parameters. In practical
experiments, it compared very favorably to other gradient-based model
selection criteria. We consider it the current state of the art for
gradient-based SVM model selection, especially when the number of examples is
relatively small. In the next paragraphs, we explain how to use this
maximum-likelihood based approach to SVM model selection in Shark. For
theoretical details and background, please consult the original article.
The toy problem
---------------

Assume we have a kernel with many parameters, for example an "Automatic
Relevance Detection" (ARD) kernel, which has one parameter for each input
dimension:

.. math::
  k(x, z) = \exp\left( - \sum_i \gamma_i (x_i - z_i)^2 \right)

Such a kernel can be useful when the individual features correlate very
differently with the labels, hence calling for an individual bandwidth
:math:`\gamma_i` per feature. (From another angle, learning the ARD kernel
bandwidths corresponds to learning a linear transformation of the input
space.)

In [GlasmachersIgel2010]_, a toy problem is introduced which lends itself
well to an ARD kernel and the optimization of its parameters. It creates a
binary classification dataset of dimension :math:`2d` in the following way:
first, fix a positive or negative label :math:`y`, i.e., :math:`1` or
:math:`0`, respectively. Then, fill the first :math:`d` dimensions by

.. math::
  y - 0.5 + \mathcal N(0.0, 1.0) \enspace .

That is, produce Gaussian distributed noise around :math:`+0.5` for a
positive label and around :math:`-0.5` for a negative one. The second
:math:`d` dimensions are simply filled with Gaussian noise
:math:`\sim \mathcal N(0.0, 1.0)`. Overall, there are :math:`d` dimensions
which are correlated with the labels and hence informative, and :math:`d`
dimensions which are uncorrelated with the labels and hence uninformative.

By design, this toy problem is well tailored to an ARD kernel. The ARD kernel
weights corresponding to the uninformative dimensions would best be optimized
out to be zero, since these dimensions on average hold no information relevant
to the classification problem. In the following, we will use our
maximum-likelihood model selection criterion to optimize the hyperparameters
of an SVM with an ARD kernel on such a toy problem. Ideally, the kernel
weights will afterwards reflect the nature of the underlying distribution.
(And we will see that they do.)
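To make the toy distribution and the ARD kernel above concrete, here is a
minimal sketch in plain C++ (without Shark); the helper names
``make_toy_sample`` and ``ard_kernel`` are purely illustrative and not part
of any library API::

    #include <cmath>
    #include <random>
    #include <vector>

    // Draw one 2d-dimensional sample from the toy distribution:
    // the first d entries are N(y - 0.5, 1), the last d are pure N(0, 1) noise.
    std::vector<double> make_toy_sample(unsigned int d, unsigned int label,
                                        std::mt19937 &rng) {
        std::normal_distribution<double> gauss(0.0, 1.0);
        std::vector<double> x(2 * d);
        for (unsigned int i = 0; i < d; ++i)
            x[i] = (label ? 0.5 : -0.5) + gauss(rng);  // informative dimensions
        for (unsigned int i = d; i < 2 * d; ++i)
            x[i] = gauss(rng);                         // uninformative dimensions
        return x;
    }

    // ARD kernel: k(x, z) = exp( -sum_i gamma_i * (x_i - z_i)^2 )
    double ard_kernel(const std::vector<double> &x, const std::vector<double> &z,
                      const std::vector<double> &gamma) {
        double sum = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i)
            sum += gamma[i] * (x[i] - z[i]) * (x[i] - z[i]);
        return std::exp(-sum);
    }

Note that setting a weight :math:`\gamma_i` to zero makes the kernel ignore
dimension :math:`i` entirely, which is exactly the behavior we hope the model
selection procedure will discover for the noise dimensions.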
Likelihood-based model selection in Shark
-----------------------------------------

You can find the source code for the following example in
:doxy:`CSvmMaxLikelihoodMS.cpp` (as generated by its corresponding .tpp
file). There, one trial is wrapped by the function ``run_one_trial()``, which
takes a verbosity preference as argument. The first trial is carried out
verbosely; the 100 trials over which the results are aggregated (which take a
long time) run silently, and only the overall hyperparameter averages are
printed. The tutorial below mostly covers the functionality of the
``run_one_trial()`` function. For the complete program, see the example .cpp
file.

The key class for maximum-likelihood based SVM model selection in Shark is
:doxy:`SvmLogisticInterpretation`, and we include its header. To create the
toy problem via the aptly named ``PamiToy`` distribution, we also include the
header for data distributions; and the gradient-based optimizer "Rprop", with
which we will optimize the SVM hyperparameters under the
:doxy:`SvmLogisticInterpretation` criterion.
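Rprop adapts one step size per coordinate, using only the sign of the partial
derivative. Shark's :doxy:`IRpropPlus` class implements a refined variant;
purely as an illustration of the underlying idea, here is a self-contained
sketch of the basic iRprop- update rule minimizing a toy quadratic (the
function, constants, and the name ``irprop_minus`` are assumptions for this
example, not Shark code)::

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Minimize f(x) = sum_i (x_i - t_i)^2 with the basic iRprop- rule.
    // Per-coordinate step sizes grow while the gradient sign stays the same
    // and shrink when it flips.
    std::vector<double> irprop_minus(std::vector<double> x,
                                     const std::vector<double> &t,
                                     unsigned int iterations) {
        std::vector<double> delta(x.size(), 0.1);   // per-coordinate step sizes
        std::vector<double> old_grad(x.size(), 0.0);
        for (unsigned int it = 0; it < iterations; ++it) {
            for (std::size_t i = 0; i < x.size(); ++i) {
                double grad = 2.0 * (x[i] - t[i]);  // partial derivative of f
                if (grad * old_grad[i] > 0.0)
                    delta[i] = std::min(delta[i] * 1.2, 50.0);  // same sign: speed up
                else if (grad * old_grad[i] < 0.0) {
                    delta[i] = std::max(delta[i] * 0.5, 1e-9);  // sign flip: back off
                    grad = 0.0;                                 // iRprop-: forget gradient
                }
                if (grad > 0.0)      x[i] -= delta[i];
                else if (grad < 0.0) x[i] += delta[i];
                old_grad[i] = grad;
            }
        }
        return x;
    }

Because only gradient signs are used, the method is robust to the scale of
the objective, which is convenient for a volatile criterion like the one used
in this tutorial.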
With various other helpers, our complete list of includes and namespaces
becomes::

    #include <shark/ObjectiveFunctions/SvmLogisticInterpretation.h>
    #include <shark/Data/DataDistribution.h>
    #include <shark/Data/CVDatasetTools.h>
    #include <shark/Models/Kernels/ArdKernel.h>
    #include <shark/Models/Normalizer.h>
    #include <shark/Algorithms/GradientDescent/Rprop.h>
    #include <shark/Algorithms/Trainers/NormalizeComponentsUnitVariance.h>
    #include <shark/Algorithms/Trainers/CSvmTrainer.h>
    #include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h>

    using namespace std;
    using namespace shark;

Creating the toy problem
&&&&&&&&&&&&&&&&&&&&&&&&

First, define the basic dimensionalities, here using :math:`d=5`::

    // define the basic dimensionality of the problem
    unsigned int useful_dim = 5;
    unsigned int noise_dim = 5;
    unsigned int total_dim = useful_dim + noise_dim;

Then set up the classification problem described above::

    // set up the classification problem from a DataDistribution
    PamiToy problem( useful_dim, noise_dim );

    // construct training and test sets from the problem distribution
    unsigned int train_size = 500;
    unsigned int test_size = 5000;
    ClassificationDataset train = problem.generateDataset( train_size );
    ClassificationDataset test = problem.generateDataset( test_size );

and normalize the data to unit variance in the training set as usual::

    // normalize data as usual
    Normalizer<> normalizer;
    NormalizeComponentsUnitVariance<> normalizationTrainer(false);
    normalizationTrainer.train( normalizer, train.inputs() );
    train = transformInputs( train, normalizer );
    test = transformInputs( test, normalizer );

Then create the ARD kernel with appropriate dimensions (proper kernel
parameter initialization will come later)::

    // set up the ArdKernel
    DenseARDKernel kernel( total_dim, 0.1 ); //for now with arbitrary value for gamma (gets properly initialized later)

Data folds and model selection criterion
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

Before we go ahead and declare our model selection criterion (i.e., the
objective function), we first have to partition the training data into folds:
the :doxy:`SvmLogisticInterpretation` class requires data to be passed in the
form of a :doxy:`CVFolds` object. That is, it demands an existing
partitioning for cross-validation. This way, control over the type of data
partitioning (e.g., stratified vs. IID, etc.)
strictly remains with the user::

    // set up partitions for cross-validation
    unsigned int num_folds = 5;
    CVFolds<ClassificationDataset> cv_folds = createCVIID( train, num_folds );

The next three lines now finally set up the maximum-likelihood based
objective function for model selection::

    // set up the learning machine
    bool log_enc_c = true; //use log encoding for the regularization parameter C
    QpStoppingCondition stop(1e-12); //use a very conservative stopping criterion for the individual SVM runs
    SvmLogisticInterpretation<> mlms( cv_folds, &kernel, log_enc_c, &stop ); //the main class for this tutorial
    //SvmLogisticInterpretation<> mlms( cv_folds, &kernel, log_enc_c ); //also possible without stopping criterion

The first line specifies that we want to allow for unconstrained optimization
of the regularization parameter (i.e., we do not want to bother with the
possibility of the optimizer accidentally driving :math:`C` into the negative
half-space). However, ``true`` is also the default, so we could have omitted
it had we not passed a custom stopping criterion. The second line sets up a
:doxy:`QpStoppingCondition` with a very conservative (i.e., small) stopping
criterion value. It gets used by all SVMs that the
:doxy:`SvmLogisticInterpretation` will train internally.

.. admonition:: Note on the stopping criterion

   Here, the :doxy:`QpStoppingCondition` is set to a rather small, or
   conservative, value for the final KKT violation. In general, the
   computation of the :doxy:`SvmLogisticInterpretation` criterion is somewhat
   volatile and requires high computational accuracy. For that reason, we use
   a very conservative stopping criterion in this tutorial. In a real-world
   setting this can be relaxed somewhat, as long as the signs of the gradient
   of the :doxy:`SvmLogisticInterpretation` criterion are correct "often
   enough".
To date, we do not have an airtight method to properly choose the stopping
criterion such that it is loose enough to allow fast optimization, but tight
enough to ensure a proper optimization path. A well-performing heuristic used
in [GlasmachersIgel2010]_ was to set the maximum number of iterations to 200
times the input dimension. This proved robust enough to produce the
state-of-the-art results given in the paper.

In the last line, we finally find the declaration of our objective function,
which takes as arguments the CVFolds object, the kernel, the log-encoding
information, and the (optional) stopping criterion.

The optimization process
&&&&&&&&&&&&&&&&&&&&&&&&

Now we only need to set a starting point for the optimization process, and we
choose :math:`C=1` and :math:`\gamma_i = 0.5/(2d)` as motivated in the
paper::

    // set up a starting point for the optimization process
    RealVector start( total_dim+1 );
    if ( log_enc_c ) start( total_dim ) = 0.0; else start( total_dim ) = 1.0; //start at C = 1.0
    for ( unsigned int k=0; k<total_dim; k++ )
        start(k) = 0.5/total_dim; //ARD kernel parameters

With this starting point, we set up the Rprop optimizer and let it optimize
the model selection criterion for a fixed number of iterations::

    // set up the optimizer and carry out the optimization
    IRpropPlus rprop;
    rprop.init( mlms, start );
    unsigned int its = 50;
    for (unsigned int i=0; i<its; i++)
        rprop.step( mlms );

Training and evaluating the final SVM
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

After the optimization, we want to train a final SVM with the best
hyperparameters found and compute its training and test errors. There are two
ways to copy the found hyperparameters into a fresh CSvmTrainer.

Option 1: Using eval(...)
^^^^^^^^^^^^^^^^^^^^^^^^^

Evaluating the objective function at the optimal point writes the kernel
parameters into the kernel; the regularization parameter then has to be read
from the solution vector manually, minding its encoding::

    // evaluate the final solution: this copies the kernel parameters into the kernel
    mlms.eval( rprop.solution().point );
    double C_reg = ( log_enc_c ? exp( rprop.solution().point(total_dim) ) : rprop.solution().point(total_dim) );

    // set up the final SVM and its trainer
    KernelClassifier<RealVector> svm_v1;
    CSvmTrainer<RealVector> trainer_v1( &kernel, C_reg, true, log_enc_c ); //encoding does not really matter in this case b/c it does not affect the ctor

    if ( verbose ) {
        std::cout << std::endl << std::endl << " Used mlms.eval(...) to copy kernel.parameterVector() " << kernel.parameterVector() << std::endl;
        std::cout << " into trainer_v1.parameterVector() " << trainer_v1.parameterVector() << std::endl;
        std::cout << " , where C (the last parameter) was set manually to " << trainer_v1.C() << std::endl << std::endl << std::endl;
    }

    trainer_v1.train( svm_v1, train ); //the kernel has the right parameters, and we copied C, so we are good to go

Now that the final SVM is trained, we only need to pipe training and test
sets through it and a proper loss function to obtain the training and test
errors::

    // evaluate the final trained classifier on training and test set
    ZeroOneLoss<unsigned int, RealVector> loss_v1;
    Data<RealVector> output_v1; //real-valued output
    output_v1 = svm_v1( train.inputs() );
    train_error_v1 = loss_v1.eval( train.labels(), output_v1 );
    output_v1 = svm_v1( test.inputs() );
    test_error_v1 = loss_v1.eval( test.labels(), output_v1 );
    if ( verbose ) {
        std::cout << " training error via possibility 1: " << train_error_v1 << std::endl;
        std::cout << " test error via possibility 1: " << test_error_v1 << std::endl << std::endl << std::endl;
    }

On our development machine, we obtain:

.. code-block:: none

    training error: 0.116
    test error: 0.1374

Our mission is thereby accomplished, but we also present a second variant of
copying the hyperparameters -- namely via ``solution().point``. We prefer
this second variant, as it does not rely on calling ``eval(...)`` on the
objective function first.

Option 2: Using solution().point
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For this alternative take, we copy all the hyperparameters found by the
optimizer into the CSvmTrainer.
This is simply done via the setParameterVector method of the CSvmTrainer::

    KernelClassifier<RealVector> svm_v2;
    CSvmTrainer<RealVector> trainer_v2( &kernel, 0.1, true, log_enc_c ); //ATTENTION: must be constructed with the same log-encoding preference
    trainer_v2.setParameterVector( rprop.solution().point ); //copy best hyperparameters to svm trainer

Again, we print the trainer's parameter vector for comparison::

    if ( verbose ) {
        std::cout << " Copied rprop.solution().point = " << rprop.solution().point << std::endl;
        std::cout << " into trainer_v2.parameterVector(), now = " << trainer_v2.parameterVector() << std::endl << std::endl << std::endl;
    }

Training is now as simple as::

    trainer_v2.train( svm_v2, train );

To evaluate this second SVM's predictions, we again pipe all data through the
SVM and a proper loss::

    // evaluate the final trained classifier on training and test set
    ZeroOneLoss<unsigned int, RealVector> loss_v2;
    Data<RealVector> output_v2; //real-valued output
    output_v2 = svm_v2( train.inputs() );
    train_error_v2 = loss_v2.eval( train.labels(), output_v2 );
    output_v2 = svm_v2( test.inputs() );
    test_error_v2 = loss_v2.eval( test.labels(), output_v2 );
    if ( verbose ) {
        std::cout << " training error via possibility 2: " << train_error_v2 << std::endl;
        std::cout << " test error via possibility 2: " << test_error_v2 << std::endl << std::endl << std::endl;
        std::cout << std::endl << "That's all folks - we are done!" << std::endl;
    }

And we are happy to get the same results as above:

.. code-block:: none

    training error: 0.116
    test error: 0.1374

Repetition over 100 trials
&&&&&&&&&&&&&&&&&&&&&&&&&&

We now examine the distribution of hyperparameter values over several trials
on different realizations of the toy problem distribution. We repeat the
experiment 100 times, and record the means and variances of the SVM
hyperparameters. This mostly follows the methodology in
[GlasmachersIgel2010]_.
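The per-parameter means and variances reported below can be accumulated
online over the trials without storing all samples. As a generic sketch (this
is Welford's algorithm, not the actual bookkeeping code of the example
program)::

    #include <cstddef>
    #include <vector>

    // Welford's online algorithm: accumulate mean and (population) variance
    // of one quantity, e.g. one hyperparameter, over repeated trials.
    struct RunningStats {
        std::size_t n = 0;
        double mean = 0.0, m2 = 0.0;
        void add(double x) {
            ++n;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);  // second factor uses the updated mean
        }
        double variance() const { return n > 1 ? m2 / n : 0.0; }
    };

One ``RunningStats`` instance per hyperparameter (plus two for the training
and test errors) suffices to produce a table like the one below.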
We obtain the following results (where the last, i.e., 11th, entry is the
regularization parameter C)::

    avg-param(0)    = 0.0174454  +- 0.000372237
    avg-param(1)    = 0.0243765  +- 0.00276891
    avg-param(2)    = 0.0170669  +- 0.000236762
    avg-param(3)    = 0.0148257  +- 0.000139686
    avg-param(4)    = 0.0175333  +- 0.000225192
    avg-param(5)    = 0.00810077 +- 0.000397033
    avg-param(6)    = 0.00831601 +- 0.000484481
    avg-param(7)    = 0.0134892  +- 0.000909667
    avg-param(8)    = 0.00652671 +- 0.000238294
    avg-param(9)    = 0.00863524 +- 0.000432687
    avg-param(10)   = 1.68555    +- 0.971377
    avg-error-train = 0.12594    +- 0.000294276
    avg-error-test  = 0.137724   +- 4.49206e-05

We see that on average, the :doxy:`SvmLogisticInterpretation` objective
clearly selects a meaningful model with an emphasis on the informative
parameters (entries 0 through 4). At the same time, some tendency remains for
the uninformative parameters not to be driven completely to zero. Note that
the mean test error is below 14%, which is an excellent value for an SVM on
this toy problem.

References
----------

.. [GlasmachersIgel2010] T. Glasmachers and C. Igel. Maximum Likelihood Model
   Selection for 1-Norm Soft Margin SVMs with Multiple Parameters. IEEE
   Transactions on Pattern Analysis and Machine Intelligence, 2010.