Deciding When to Stop

In the previous tutorial, we set up a general optimization task that was trained iteratively. This approach has two notable usability downsides compared to the convenience of a one-step LDA or SVM trainer:

  1. We have to decide when the accuracy achieved by the optimization steps is high enough.
  2. We need to write more code to set up all parts.

While the second point is just a nuisance, the first point is a “real” structural problem. In the LDA example, we did not have to worry about whether the solution was sufficiently exact, as the LDA problem can be solved analytically.

Motivation

In general, choosing a good number of iterations for an iterative optimizer touches on two issues:

  • First, from a purely computational point of view: we do not want to perform more iterations than necessary to reach a “good” solution, but we also do not want to stop before we have found one.
  • Second, stopping early also constitutes a way of regularizing the adaptation of a model to a training set. Hence, stopping even earlier than the training dataset alone would indicate can be desirable in machine learning anyway.

One means of early stopping that goes beyond picking an arbitrary number of iterations is monitoring the performance on a validation split, which needs to be created from the dataset in addition to the training and test splits.

Neural-network training example

Overview

This tutorial will introduce different stopping criteria. As an example, we consider a slightly more complex learning task than in the previous tutorials, namely classification with a simple feed-forward neural network. You can learn more about neural networks in Shark in the Training Feed-Forward Networks tutorial. The code for this example can be found in StoppingCriteria.cpp.

We show how to create a trainer for this task, which generalizes important concepts and saves us manual work. Then, we construct and compare three different stopping criteria for that trainer. To this end, we introduce the AbstractStoppingCriterion, another interface of Shark. In addition to this tutorial, a concept tutorial on Stopping Criteria exists, which gives a more detailed explanation of how stopping criteria are implemented in Shark.

Building blocks & includes

We first list all includes for this tutorial and then motivate their usage for each one:

#include <shark/Data/Csv.h>
#include <shark/Models/LinearModel.h>//single dense layer
#include <shark/Models/ConcatenatedModel.h>//for stacking layers to a feed forward neural network
#include <shark/Algorithms/GradientDescent/Rprop.h> //Optimization algorithm
#include <shark/ObjectiveFunctions/Loss/CrossEntropy.h> //Loss used for training
#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h> //The real loss for testing.
#include <shark/Algorithms/Trainers/OptimizationTrainer.h> // Trainer wrapping iterative optimization
#include <shark/Algorithms/StoppingCriteria/MaxIterations.h> //A simple stopping criterion that stops after a fixed number of iterations
#include <shark/Algorithms/StoppingCriteria/TrainingError.h> //Stops when the algorithm seems to converge
#include <shark/Algorithms/StoppingCriteria/GeneralizationQuotient.h> //Uses the validation error to track the progress
#include <shark/Algorithms/StoppingCriteria/ValidatedStoppingCriterion.h> //Adds the validation error to the value of the point
#include <shark/ObjectiveFunctions/ErrorFunction.h> //Binds model, loss and dataset; also used for the validation error

As before, Csv.h is included for reading in data. The headers LinearModel.h and ConcatenatedModel.h are needed because we build the feed-forward neural network for our two-class problem by stacking dense layers. Rprop is a fast and stable algorithm for gradient-based optimization of a differentiable objective function. Since the 0-1 loss is not differentiable, and would thus not be compatible with any gradient-based method including Rprop, we instead use the CrossEntropy as a surrogate loss. For testing, however, we still want to use, and hence include, the ZeroOneLoss. As in the last tutorial, the ErrorFunction binds together the model, the dataset and the loss function: for a given set of parameters, it returns the error of the model with these parameters, as measured by the loss function on the dataset. The remaining includes are needed for the different stopping criteria we will examine.

Using an AbstractStoppingCriterion

We want to use a feed-forward neural network with one hidden layer and two output neurons for classification, and train it under three different stopping criteria: a fixed number of iterations, progress on the training error, and progress on a validation set. To facilitate our experiments, we create a single local auxiliary function that takes an AbstractStoppingCriterion – the base class of all stopping criteria – as an argument. This function creates and trains a neural network using the given stopping criterion. In addition, instead of manually coding an explicit optimization loop as in the previous examples, we use a so-called OptimizationTrainer, which encapsulates the entire training process given an ObjectiveFunction, an Optimizer and a StoppingCriterion. Overall, we use the following function to create, train and evaluate our neural network under a given stopping criterion:

template<class T>
double experiment(
        AbstractModel<RealVector, RealVector>& network,
        AbstractStoppingCriterion<T> & stoppingCriterion,
        ClassificationDataset const& trainingset,
        ClassificationDataset const& testset
){
        initRandomUniform(network,-0.1,0.1);

        //The Cross Entropy maximises the activation of the cth output neuron
        // compared to all other outputs for a sample with class c.
        CrossEntropy loss;

        //we use IRpropPlus for network optimization
        IRpropPlus optimizer;

        //create an optimization trainer and train the model
        OptimizationTrainer<AbstractModel<RealVector, RealVector>,unsigned int > trainer(&loss, &optimizer, &stoppingCriterion);
        trainer.train(network, trainingset);

        //evaluate the performance on the test set using the classification loss;
        //we choose 0.5 as threshold since logistic neurons have values between 0 and 1
        ZeroOneLoss<unsigned int, RealVector> loss01(0.5);
        Data<RealVector> predictions = network(testset.inputs());
        return loss01(testset.labels(),predictions);
}

To run the experiment, we need to load a dataset and split it into training, validation and test set:

ClassificationDataset data;
importCSV(data, "data/diabetes.csv", LAST_COLUMN, ',');
data.shuffle();
ClassificationDataset test = splitAtElement(data, static_cast<std::size_t>(0.75 * data.numberOfElements()));
ClassificationDataset validation = splitAtElement(data, static_cast<std::size_t>(0.66 * data.numberOfElements()));

Evaluation

Now it is time to try out the different stopping criteria.

Fixed number of iterations

The simplest stopping heuristic is halting after a fixed number of iterations. MaxIterations is the subclass of choice here; it simply wraps this trivial functionality in the AbstractStoppingCriterion interface. We try out several different numbers of steps:

MaxIterations<> maxIterations(10);
double resultMaxIterations1 = experiment(network, maxIterations,data,test);
maxIterations.setMaxIterations(100);
double resultMaxIterations2 = experiment(network, maxIterations,data,test);
maxIterations.setMaxIterations(500);
double resultMaxIterations3 = experiment(network, maxIterations,data,test);

Progress on training error

Next we employ a stopping criterion that monitors progress on the training error \(E\). The stopping criterion TrainingError takes in its constructor a window size (a number of time steps) \(T\) together with a threshold value \(\epsilon\). If the improvement over the last \(T\) time steps does not exceed \(\epsilon\), that is, if \(E(t-T)-E(t) < \epsilon\), the criterion becomes active and tells the optimizer to stop (because it assumes that progress over subsequent optimization steps will be negligible as well). A danger when using this criterion is that it may stop the optimization even when the algorithm is merely traversing a plateau or saddle point. However, the optimizer used here, IRpropPlus, dynamically adapts its step size and is hence somewhat less vulnerable to these problems. With the groundwork done above, we can test this stopping criterion with only two lines of code:

TrainingError<> trainingError(10,1.e-5);
double resultTrainingError = experiment(network, trainingError,data,test);

Progress on a validation set

To use validation error information, we need to define an additional validation error function. In the simplest case, this is an error function built from the same model and loss as the training error function, but evaluated on a different dataset; for simplicity, we will just create it from scratch here. The class that takes the current point of the search space from the optimizer and passes it on to the validation error function is the so-called ValidatedStoppingCriterion. Its constructor takes as arguments not only the validation error function, but also another stopping criterion, to which the result of the validation run is passed and which makes its decision based on both training and validation information. In this example, we use the GeneralizationQuotient as that stopping criterion. It calculates the ratio of two other quantities to reach its decision; for an exact description, we refer to the class documentation and the scientific publication mentioned therein.

In summary, this code uses the progress on a validation set to decide when to stop:

CrossEntropy loss;
ErrorFunction validationFunction(validation,&network,&loss);
GeneralizationQuotient<> generalizationQuotient(10,0.1);
ValidatedStoppingCriterion validatedLoss(&validationFunction,&generalizationQuotient);
double resultGeneralizationQuotient = experiment(network, validatedLoss,data,test);

Printing the results

Printing all variables of type double defined in the snippets above, we get:

RESULTS:
========

10 iterations           : 0.375
100 iterations          : 0.348958
500 iterations          : 0.302083
training error          : 0.333333
generalization quotient : 0.375

So stopping after 500 iterations yielded the lowest error on the test set. The TrainingError criterion, as one would expect, waits considerably longer than ten iterations before stopping. The GeneralizationQuotient does in fact stop too early in this case, most likely due to the small size of the dataset used in the example code.

What you learned

You should have learned the following aspects in this tutorial:

  • How to train a feed-forward neural network
  • How to create a trainer from a general optimization task
  • That the choice of stopping criterion matters

What next?

Now you should be ready to leave the “first steps” section of the tutorials and read through its other sections, which will tell you about various aspects of the library in more detail.