Support Vector Machines: Likelihood-based Model Selection
=========================================================

Please first read the :doc:`svm` tutorial, and possibly also the
:doc:`svmModelSelection` tutorial for traditional approaches to SVM model
selection.

Motivation
----------

The previous tutorial explained that the performance of an SVM classifier
depends on the choice of regularization parameter :math:`C` and the kernel
parameters. We also presented the most common method to find SVM
hyperparameters: grid search on the cross-validation error. This is suited for
kernels with only one or two parameters, because a two- or three-dimensional
SVM hyperparameter search space can still be covered sufficiently by a fixed
grid of search points. Using simple heuristics like :doxy:`NestedGridSearch`,
where the search resolution increases iteratively, a four- or perhaps even
five-dimensional SVM hyperparameter space might still be sampled sufficiently.
But there is no way around the fact that grid search-based SVM model selection
suffers from the curse of dimensionality.

Naturally, much research has been directed toward differentiable estimates of,
substitutes for, or bounds on the generalization error. A good overview is
given in our paper [GlasmachersIgel2010]_. There, we also present a novel
model selection criterion that is differentiable (in almost all practical
cases) with respect to the regularization and kernel parameters. In practical
experiments, it compared very favorably to other gradient-based model
selection criteria. We consider it the current state of the art for
gradient-based SVM model selection, especially when the number of examples is
relatively small. In the next paragraphs, we explain how to use this
maximum-likelihood based approach to SVM model selection in Shark. For
theoretical details and background, please consult the original article.
The toy problem
---------------

Assume we have a kernel with many parameters, for example an "Automatic
Relevance Detection" (ARD) kernel, which has one parameter for each input
dimension:

.. math::
  k(x, z) = \exp\left( - \sum_i \gamma_i (x_i - z_i)^2 \right)

Such a kernel can be useful when the individual features correlate very
differently with the labels, hence calling for an individual bandwidth
:math:`\gamma_i` per feature. (From another angle, learning the ARD kernel
bandwidths corresponds to learning a linear transformation of the input
space.)

In [GlasmachersIgel2010]_, a toy problem is introduced which lends itself
well to an ARD kernel and the optimization of its parameters. It creates a
binary classification dataset of dimension :math:`2d` in the following way:
first, fix a positive or negative label :math:`y`, i.e., :math:`1` or
:math:`0`, respectively. Then, fill the first :math:`d` dimensions by

.. math::
  y - 0.5 + \mathcal N(0.0, 1.0) \enspace .

That is, produce Gaussian distributed noise around :math:`+0.5` for a
positive label and around :math:`-0.5` for a negative one. The second
:math:`d` dimensions are simply filled with Gaussian noise
:math:`\sim \mathcal N(0.0, 1.0)`. Overall, there are :math:`d` dimensions
which are correlated with the labels and hence informative, and :math:`d`
dimensions which are uncorrelated with the labels and hence uninformative.

By design, this toy problem is well tailored to an ARD kernel. The ARD kernel
weights corresponding to the uninformative dimensions would best be optimized
out to be zero, since these dimensions on average hold no information relevant
to the classification problem. In the following, we will use our
maximum-likelihood model selection criterion to optimize the hyperparameters
of an SVM with an ARD kernel on such a toy problem. Ideally, the kernel
weights will afterwards reflect the nature of the underlying distribution.
(And we will see that they do.)
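To make the toy distribution and the ARD kernel above concrete, here is a
minimal sketch in plain C++ (without Shark); the helper names
``make_toy_sample`` and ``ard_kernel`` are purely illustrative and not part
of any library API::

    #include <cmath>
    #include <random>
    #include <vector>

    // Draw one 2d-dimensional sample from the toy distribution:
    // the first d entries are N(y - 0.5, 1), the last d are pure N(0, 1) noise.
    std::vector<double> make_toy_sample(unsigned int d, unsigned int label,
                                        std::mt19937 &rng) {
        std::normal_distribution<double> gauss(0.0, 1.0);
        std::vector<double> x(2 * d);
        for (unsigned int i = 0; i < d; ++i)
            x[i] = (label ? 0.5 : -0.5) + gauss(rng);  // informative dimensions
        for (unsigned int i = d; i < 2 * d; ++i)
            x[i] = gauss(rng);                         // uninformative dimensions
        return x;
    }

    // ARD kernel: k(x, z) = exp( -sum_i gamma_i * (x_i - z_i)^2 )
    double ard_kernel(const std::vector<double> &x, const std::vector<double> &z,
                      const std::vector<double> &gamma) {
        double sum = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i)
            sum += gamma[i] * (x[i] - z[i]) * (x[i] - z[i]);
        return std::exp(-sum);
    }

Note that setting a weight :math:`\gamma_i` to zero makes the kernel ignore
dimension :math:`i` entirely, which is exactly the behavior we hope the model
selection procedure will discover for the noise dimensions.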
Likelihood-based model selection in Shark
-----------------------------------------

You can find the source code for the following example in
:doxy:`CSvmMaxLikelihoodMS.cpp` (as generated by its corresponding .tpp
file). There, one trial is wrapped by the function ``run_one_trial()``, which
takes a verbosity preference as argument. The first trial is carried out
verbosely; the 100 trials over which the results are aggregated (which take a
long time) run silently, and only the overall hyperparameter averages are
printed. The tutorial below mostly covers the functionality of the
``run_one_trial()`` function. For the complete program, see the example .cpp
file.

The key class for maximum-likelihood based SVM model selection in Shark is
:doxy:`SvmLogisticInterpretation`, and we include its header. To create the
toy problem via the aptly named ``PamiToy`` distribution, we also include the
header for data distributions; and the gradient-based optimizer "Rprop", with
which we will optimize the SVM hyperparameters under the
:doxy:`SvmLogisticInterpretation` criterion.
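Rprop adapts one step size per coordinate, using only the sign of the partial
derivative. Shark's :doxy:`IRpropPlus` class implements a refined variant;
purely as an illustration of the underlying idea, here is a self-contained
sketch of the basic iRprop- update rule minimizing a toy quadratic (the
function, constants, and the name ``irprop_minus`` are assumptions for this
example, not Shark code)::

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Minimize f(x) = sum_i (x_i - t_i)^2 with the basic iRprop- rule.
    // Per-coordinate step sizes grow while the gradient sign stays the same
    // and shrink when it flips.
    std::vector<double> irprop_minus(std::vector<double> x,
                                     const std::vector<double> &t,
                                     unsigned int iterations) {
        std::vector<double> delta(x.size(), 0.1);   // per-coordinate step sizes
        std::vector<double> old_grad(x.size(), 0.0);
        for (unsigned int it = 0; it < iterations; ++it) {
            for (std::size_t i = 0; i < x.size(); ++i) {
                double grad = 2.0 * (x[i] - t[i]);  // partial derivative of f
                if (grad * old_grad[i] > 0.0)
                    delta[i] = std::min(delta[i] * 1.2, 50.0);  // same sign: speed up
                else if (grad * old_grad[i] < 0.0) {
                    delta[i] = std::max(delta[i] * 0.5, 1e-9);  // sign flip: back off
                    grad = 0.0;                                 // iRprop-: forget gradient
                }
                if (grad > 0.0)      x[i] -= delta[i];
                else if (grad < 0.0) x[i] += delta[i];
                old_grad[i] = grad;
            }
        }
        return x;
    }

Because only gradient signs are used, the method is robust to the scale of
the objective, which is convenient for a volatile criterion like the one used
in this tutorial.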
With various other helpers, our complete list of includes and namespaces
becomes::

    #include <shark/ObjectiveFunctions/SvmLogisticInterpretation.h>
    #include <shark/Data/DataDistribution.h>
    #include <shark/Data/CVDatasetTools.h>
    #include <shark/Models/Kernels/ArdKernel.h>
    #include <shark/Models/Normalizer.h>
    #include <shark/Algorithms/GradientDescent/Rprop.h>
    #include <shark/Algorithms/Trainers/NormalizeComponentsUnitVariance.h>
    #include <shark/Algorithms/Trainers/CSvmTrainer.h>
    #include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h>

    using namespace std;
    using namespace shark;

Creating the toy problem
&&&&&&&&&&&&&&&&&&&&&&&&

First, define the basic dimensionalities, here using :math:`d=5`::

    // define the basic dimensionality of the problem
    unsigned int useful_dim = 5;
    unsigned int noise_dim = 5;
    unsigned int total_dim = useful_dim + noise_dim;

Then set up the classification problem described above::

    // set up the classification problem from a DataDistribution
    PamiToy problem( useful_dim, noise_dim );

    // construct training and test sets from the problem distribution
    unsigned int train_size = 500;
    unsigned int test_size = 5000;
    ClassificationDataset train = problem.generateDataset( train_size );
    ClassificationDataset test = problem.generateDataset( test_size );

and normalize the data to unit variance in the training set as usual::

    // normalize data as usual
    Normalizer<> normalizer;
    NormalizeComponentsUnitVariance<> normalizationTrainer(false);
    normalizationTrainer.train( normalizer, train.inputs() );
    train = transformInputs( train, normalizer );
    test = transformInputs( test, normalizer );

Then create the ARD kernel with appropriate dimensions (proper kernel
parameter initialization will come later)::

    // set up the ArdKernel
    DenseARDKernel kernel( total_dim, 0.1 ); //for now with arbitrary value for gamma (gets properly initialized later)

Data folds and model selection criterion
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

Before we go ahead and declare our model selection criterion (i.e., the
objective function), we first have to partition the training data into folds:
the :doxy:`SvmLogisticInterpretation` class requires data to be passed in the
form of a :doxy:`CVFolds` object. That is, it demands an existing
partitioning for cross-validation. This way, control over the type of data
partitioning (e.g., stratified vs. IID, etc.)
strictly remains with the user::

    // set up partitions for cross-validation
    unsigned int num_folds = 5;
    CVFolds<ClassificationDataset> cv_folds = createCVIID( train, num_folds );

The next three lines now finally set up the maximum-likelihood based
objective function for model selection::

    // set up the learning machine
    bool log_enc_c = true; //use log encoding for the regularization parameter C
    QpStoppingCondition stop(1e-12); //use a very conservative stopping criterion for the individual SVM runs
    SvmLogisticInterpretation<> mlms( cv_folds, &kernel, log_enc_c, &stop ); //the main class for this tutorial
    //SvmLogisticInterpretation<> mlms( cv_folds, &kernel, log_enc_c ); //also possible without stopping criterion

The first line specifies that we want to allow for unconstrained optimization
of the regularization parameter (i.e., we do not want to bother with the
possibility of the optimizer accidentally driving :math:`C` into the negative
half-space). However, ``true`` is also the default, so we could have omitted
it had we not passed a custom stopping criterion. The second line sets up a
:doxy:`QpStoppingCondition` with a very conservative (i.e., small) stopping
criterion value. It gets used by all SVMs that the
:doxy:`SvmLogisticInterpretation` will train internally.

.. admonition:: Note on the stopping criterion

   Here, the :doxy:`QpStoppingCondition` is set to a rather small, or
   conservative, value for the final KKT violation. In general, the
   computation of the :doxy:`SvmLogisticInterpretation` criterion is somewhat
   volatile and requires high computational accuracy. For that reason, we use
   a very conservative stopping criterion in this tutorial. In a real-world
   setting this can be relaxed somewhat, as long as the signs of the gradient
   of the :doxy:`SvmLogisticInterpretation` criterion are correct "often
   enough".
To date, we do not have an airtight method to properly choose the stopping
criterion such that it is loose enough to allow fast optimization, but tight
enough to ensure a proper optimization path. A well-performing heuristic used
in [GlasmachersIgel2010]_ was to set the maximum number of iterations to 200
times the input dimension. This proved robust enough to produce the
state-of-the-art results given in the paper.

In the last line, we finally find the declaration of our objective function,
which takes as arguments the CVFolds object, the kernel, the log-encoding
information, and the (optional) stopping criterion.

The optimization process
&&&&&&&&&&&&&&&&&&&&&&&&

Now we only need to set a starting point for the optimization process, and we
choose :math:`C=1` and :math:`\gamma_i = 0.5/(2d)` as motivated in the
paper::

    // set up a starting point for the optimization process
    RealVector start( total_dim+1 );
    if ( log_enc_c ) start( total_dim ) = 0.0; else start( total_dim ) = 1.0; //start at C = 1.0
    for ( unsigned int k=0; k<total_dim; k++ )
        start(k) = 0.5/total_dim; //ARD kernel parameters

With this starting point, we set up the Rprop optimizer and let it optimize
the model selection criterion for a fixed number of iterations::

    // set up the optimizer and carry out the optimization
    IRpropPlus rprop;
    rprop.init( mlms, start );
    unsigned int its = 50;
    for (unsigned int i=0; i<its; i++)
        rprop.step( mlms );

Training and evaluating the final SVM
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

After the optimization, we want to train a final SVM with the best
hyperparameters found and compute its training and test errors. There are two
ways to copy the found hyperparameters into a fresh CSvmTrainer.

Option 1: Using eval(...)
^^^^^^^^^^^^^^^^^^^^^^^^^

Evaluating the objective function at the optimal point writes the kernel
parameters into the kernel; the regularization parameter then has to be read
from the solution vector manually, minding its encoding::

    // evaluate the final solution: this copies the kernel parameters into the kernel
    mlms.eval( rprop.solution().point );
    double C_reg = ( log_enc_c ? exp( rprop.solution().point(total_dim) ) : rprop.solution().point(total_dim) );

    // set up the final SVM and its trainer
    KernelClassifier<RealVector> svm_v1;
    CSvmTrainer<RealVector> trainer_v1( &kernel, C_reg, true, log_enc_c ); //encoding does not really matter in this case b/c it does not affect the ctor

    if ( verbose ) {
        std::cout << std::endl << std::endl << " Used mlms.eval(...) to copy kernel.parameterVector() " << kernel.parameterVector() << std::endl;
        std::cout << " into trainer_v1.parameterVector() " << trainer_v1.parameterVector() << std::endl;
        std::cout << " , where C (the last parameter) was set manually to " << trainer_v1.C() << std::endl << std::endl << std::endl;
    }

    trainer_v1.train( svm_v1, train ); //the kernel has the right parameters, and we copied C, so we are good to go

Now that the final SVM is trained, we only need to pipe training and test
sets through it and a proper loss function to obtain the training and test
errors::

    // evaluate the final trained classifier on training and test set
    ZeroOneLoss<unsigned int, RealVector> loss_v1;
    Data<RealVector> output_v1; //real-valued output
    output_v1 = svm_v1( train.inputs() );
    train_error_v1 = loss_v1.eval( train.labels(), output_v1 );
    output_v1 = svm_v1( test.inputs() );
    test_error_v1 = loss_v1.eval( test.labels(), output_v1 );
    if ( verbose ) {
        std::cout << " training error via possibility 1: " << train_error_v1 << std::endl;
        std::cout << " test error via possibility 1: " << test_error_v1 << std::endl << std::endl << std::endl;
    }

On our development machine, we obtain:

.. code-block:: none

    training error: 0.116
    test error: 0.1374

Our mission is thereby accomplished, but we also present a second variant of
copying the hyperparameters -- namely via ``solution().point``. We prefer
this second variant, as it does not rely on calling ``eval(...)`` on the
objective function first.

Option 2: Using solution().point
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For this alternative take, we copy all the hyperparameters found by the
optimizer into the CSvmTrainer.
This is simply done via the setParameterVector method of the CSvmTrainer::

    KernelClassifier<RealVector> svm_v2;
    CSvmTrainer<RealVector> trainer_v2( &kernel, 0.1, true, log_enc_c ); //ATTENTION: must be constructed with the same log-encoding preference
    trainer_v2.setParameterVector( rprop.solution().point ); //copy best hyperparameters to svm trainer

Again, we print the trainer's parameter vector for comparison::

    if ( verbose ) {
        std::cout << " Copied rprop.solution().point = " << rprop.solution().point << std::endl;
        std::cout << " into trainer_v2.parameterVector(), now = " << trainer_v2.parameterVector() << std::endl << std::endl << std::endl;
    }

Training is now as simple as::

    trainer_v2.train( svm_v2, train );

To evaluate this second SVM's predictions, we again pipe all data through the
SVM and a proper loss::

    // evaluate the final trained classifier on training and test set
    ZeroOneLoss<unsigned int, RealVector> loss_v2;
    Data<RealVector> output_v2; //real-valued output
    output_v2 = svm_v2( train.inputs() );
    train_error_v2 = loss_v2.eval( train.labels(), output_v2 );
    output_v2 = svm_v2( test.inputs() );
    test_error_v2 = loss_v2.eval( test.labels(), output_v2 );
    if ( verbose ) {
        std::cout << " training error via possibility 2: " << train_error_v2 << std::endl;
        std::cout << " test error via possibility 2: " << test_error_v2 << std::endl << std::endl << std::endl;
        std::cout << std::endl << "That's all folks - we are done!" << std::endl;
    }

And we are happy to get the same results as above:

.. code-block:: none

    training error: 0.116
    test error: 0.1374

Repetition over 100 trials
&&&&&&&&&&&&&&&&&&&&&&&&&&

We now examine the distribution of hyperparameter values over several trials
on different realizations of the toy problem distribution. We repeat the
experiment 100 times, and record the means and variances of the SVM
hyperparameters. This mostly follows the methodology in
[GlasmachersIgel2010]_.
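The per-parameter means and variances reported below can be accumulated
online over the trials without storing all samples. As a generic sketch (this
is Welford's algorithm, not the actual bookkeeping code of the example
program)::

    #include <cstddef>
    #include <vector>

    // Welford's online algorithm: accumulate mean and (population) variance
    // of one quantity, e.g. one hyperparameter, over repeated trials.
    struct RunningStats {
        std::size_t n = 0;
        double mean = 0.0, m2 = 0.0;
        void add(double x) {
            ++n;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);  // second factor uses the updated mean
        }
        double variance() const { return n > 1 ? m2 / n : 0.0; }
    };

One ``RunningStats`` instance per hyperparameter (plus two for the training
and test errors) suffices to produce a table like the one below.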
We obtain the following results (where the last, i.e., 11th, entry is the
regularization parameter C)::

    avg-param(0)    = 0.0174454  +- 0.000372237
    avg-param(1)    = 0.0243765  +- 0.00276891
    avg-param(2)    = 0.0170669  +- 0.000236762
    avg-param(3)    = 0.0148257  +- 0.000139686
    avg-param(4)    = 0.0175333  +- 0.000225192
    avg-param(5)    = 0.00810077 +- 0.000397033
    avg-param(6)    = 0.00831601 +- 0.000484481
    avg-param(7)    = 0.0134892  +- 0.000909667
    avg-param(8)    = 0.00652671 +- 0.000238294
    avg-param(9)    = 0.00863524 +- 0.000432687
    avg-param(10)   = 1.68555    +- 0.971377
    avg-error-train = 0.12594    +- 0.000294276
    avg-error-test  = 0.137724   +- 4.49206e-05

We see that on average, the :doxy:`SvmLogisticInterpretation` objective
clearly selects a meaningful model with an emphasis on the informative
parameters (entries 0 through 4). At the same time, some tendency remains for
the uninformative parameters not to be driven completely to zero. Note that
the mean test error is below 14%, which is an excellent value for an SVM on
this toy problem.

References
----------

.. [GlasmachersIgel2010] T. Glasmachers and C. Igel. Maximum Likelihood Model
   Selection for 1-Norm Soft Margin SVMs with Multiple Parameters. IEEE
   Transactions on Pattern Analysis and Machine Intelligence, 2010.