Deciding When to Stop
In the previous tutorial, we set up a general optimization task that was solved iteratively. This approach has two notable usability downsides compared to the convenience of the one-step LDA or SVM trainers:
- We have to decide when the accuracy achieved by the optimization steps is high enough.
- We need to write more code to set up all parts.
While the second point is just a nuisance, the first point is a “real” structural problem. In the LDA example, we did not have to worry about whether the solution was sufficiently exact, as the LDA problem can be solved analytically.
Motivation
In general, choosing a good number of iterations for an iterative optimizer touches on two issues:
- First, from a purely computational point of view, we do not want to perform more iterations than necessary to reach a “good” solution. But we also do not want to stop before we have found a “good” solution.
- Second, stopping early also constitutes a way of regularizing the adaptation of a model to a training set. Hence, stopping even earlier than the training dataset alone would indicate might be desirable in machine learning usage anyway.
One means of early stopping that goes beyond picking an arbitrary number of iterations is monitoring the performance on a validation split, which needs to be created from the dataset in addition to the training and test splits.
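The idea of monitoring a validation split can be sketched independently of Shark. The helper below is hypothetical (its name and the `patience` parameter are not part of the library): given a recorded curve of validation errors, it reports the iteration at which training would halt because the best validation error is `patience` steps old.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper, not part of Shark: returns the first iteration at
// which the best validation error seen so far is at least `patience` steps
// old. Returns the last index if that never happens.
std::size_t earlyStoppingIteration(std::vector<double> const& validationError, std::size_t patience){
    assert(!validationError.empty() && patience > 0);
    double best = std::numeric_limits<double>::infinity();
    std::size_t bestIteration = 0;
    for(std::size_t t = 0; t != validationError.size(); ++t){
        if(validationError[t] < best){
            // new best validation error: remember when it occurred
            best = validationError[t];
            bestIteration = t;
        }else if(t - bestIteration >= patience){
            // no improvement for `patience` steps: stop here
            return t;
        }
    }
    return validationError.size() - 1;
}
```

A larger `patience` tolerates longer stretches without improvement, at the cost of more iterations; the stopping criteria discussed later in this tutorial use more refined rules than this simple one.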
Neural-network training example
Overview
This tutorial will introduce different stopping criteria. As an example, we consider a slightly more complex learning task than in the previous tutorials, namely classification with a simple feed-forward neural network. You can learn more on neural networks in Shark in the Training Feed-Forward Networks tutorial. The code for this example can be found in StoppingCriteria.cpp.
We show how to create a trainer for this task which generalizes
important concepts and saves us manual work. Then, we construct and compare
three different stopping criteria for that trainer. To this end, we introduce
the AbstractStoppingCriterion, another interface of Shark. In addition to
this tutorial, a concept tutorial on Stopping Criteria exists,
which gives a more detailed explanation of how stopping criteria are implemented in Shark.
Building blocks & includes
We first list all includes for this tutorial and then motivate each of them:
#include <shark/Data/Csv.h>
#include <shark/Models/LinearModel.h>//single dense layer
#include <shark/Models/ConcatenatedModel.h>//for stacking layers to a feed forward neural network
#include <shark/Algorithms/GradientDescent/Rprop.h> //Optimization algorithm
#include <shark/ObjectiveFunctions/Loss/CrossEntropy.h> //Loss used for training
#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h> //The real loss for testing.
#include <shark/Algorithms/Trainers/OptimizationTrainer.h> // Trainer wrapping iterative optimization
#include <shark/Algorithms/StoppingCriteria/MaxIterations.h> //A simple stopping criterion that stops after a fixed number of iterations
#include <shark/Algorithms/StoppingCriteria/TrainingError.h> //Stops when the algorithm seems to converge
#include <shark/Algorithms/StoppingCriteria/GeneralizationQuotient.h> //Uses the validation error to track the progress
#include <shark/Algorithms/StoppingCriteria/ValidatedStoppingCriterion.h> //Adds the validation error to the value of the point
As before, Csv.h is included for reading in data. The headers LinearModel.h
and ConcatenatedModel.h are needed because we want to build a feed-forward
neural network from stacked layers and train it to distinguish between two classes.
Rprop is a fast and stable algorithm for gradient-based optimization of
a differentiable objective function. Since the 0-1-loss is not differentiable,
and would thus not be compatible with any gradient descent method including
Rprop, we instead use the CrossEntropy as surrogate loss. For testing, however,
we still want to measure the 0-1-loss and hence include the ZeroOneLoss.
As in the last tutorial, the ErrorFunction binds together the model, the dataset
and the loss function. For a given set of parameters, it returns the error of
the model with these parameters, measured by the loss function on the dataset.
The remaining includes are needed for the different stopping
criteria we will examine.
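To see why a surrogate loss is needed, it can help to compare the two losses on a single sample. The following is a minimal sketch in plain C++, not Shark code: the function names are hypothetical, and the cross-entropy is computed on a softmax of the raw outputs, which is assumed here to correspond to what the CrossEntropy loss does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cross-entropy of a softmax over raw outputs z for the true class `label`:
// -log( exp(z_label) / sum_j exp(z_j) ). It varies smoothly with z.
double crossEntropy(std::vector<double> const& outputs, std::size_t label){
    assert(label < outputs.size());
    double maxOutput = *std::max_element(outputs.begin(), outputs.end());
    double sum = 0.0;
    for(double z : outputs) sum += std::exp(z - maxOutput); // shifted for numerical stability
    return std::log(sum) - (outputs[label] - maxOutput);
}

// 0-1 loss: 0 if the largest output belongs to the true class, 1 otherwise.
// It is piecewise constant, so its gradient is zero almost everywhere.
double zeroOne(std::vector<double> const& outputs, std::size_t label){
    assert(label < outputs.size());
    auto largest = std::max_element(outputs.begin(), outputs.end());
    std::size_t predicted = static_cast<std::size_t>(largest - outputs.begin());
    return predicted == label ? 0.0 : 1.0;
}
```

For the outputs (1, 0) and (2, 0) with true class 0, the 0-1 loss is 0 in both cases, while the cross-entropy is smaller for (2, 0). Only the latter gives a gradient-based optimizer a direction in which to improve.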
Using an AbstractStoppingCriterion
We want to use a feed-forward neural network with one hidden layer and two output
neurons for classification, and train it under three different stopping criteria:
a fixed number of iterations, progress on the training error, and progress on a
validation set. To facilitate our experiments, we create a single auxiliary
function that takes an AbstractStoppingCriterion (the base class of all
stopping criteria) as an argument. This auxiliary function initializes and
trains a neural network using the given stopping criterion. In
addition, instead of manually and explicitly coding an optimization loop as in
the previous examples, we use a so-called OptimizationTrainer that encapsulates
the entire training process given an ObjectiveFunction, Optimizer, and StoppingCriterion.
Overall, we use the following function to train and evaluate our neural
network under a given stopping criterion:
template<class T>
double experiment(
    AbstractModel<RealVector, RealVector>& network,
    AbstractStoppingCriterion<T> & stoppingCriterion,
    ClassificationDataset const& trainingset,
    ClassificationDataset const& testset
){
    initRandomUniform(network,-0.1,0.1);

    //The Cross Entropy maximises the activation of the cth output neuron
    // compared to all other outputs for a sample with class c.
    CrossEntropy loss;

    //we use IRpropPlus for network optimization
    IRpropPlus optimizer;

    //create an optimization trainer and train the model
    OptimizationTrainer<AbstractModel<RealVector, RealVector>,unsigned int > trainer(&loss, &optimizer, &stoppingCriterion);
    trainer.train(network, trainingset);

    //evaluate the performance on the test set using the classification loss.
    //we choose 0.5 as threshold since Logistic neurons have values between 0 and 1.
    ZeroOneLoss<unsigned int, RealVector> loss01(0.5);
    Data<RealVector> predictions = network(testset.inputs());
    return loss01(testset.labels(),predictions);
}
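What a trainer of this kind encapsulates can be sketched with a toy problem. The interfaces below are hypothetical and heavily simplified; Shark's real AbstractStoppingCriterion and OptimizationTrainer have richer signatures. The point is only that the loop asks the criterion after every optimizer step whether to stop, so swapping the criterion changes the stopping behaviour without touching the loop.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified interface for illustration only.
struct StoppingCriterion{
    virtual ~StoppingCriterion() = default;
    virtual void reset() = 0;
    // Called after every step with the iteration count and current error.
    virtual bool stop(std::size_t iteration, double error) = 0;
};

// Stops after a fixed number of iterations, ignoring the error.
struct FixedIterations : StoppingCriterion{
    explicit FixedIterations(std::size_t maxIterations) : m_max(maxIterations){}
    void reset() override {}
    bool stop(std::size_t iteration, double) override { return iteration >= m_max; }
    std::size_t m_max;
};

// Toy "trainer": minimizes f(x) = x^2 by plain gradient descent until the
// criterion fires. Returns the number of steps taken; x holds the result.
std::size_t train(double& x, double learningRate, StoppingCriterion& criterion){
    assert(learningRate > 0.0);
    criterion.reset();
    std::size_t iteration = 0;
    do{
        x -= learningRate * 2.0 * x; // gradient of x^2 is 2x
        ++iteration;
    }while(!criterion.stop(iteration, x * x));
    return iteration;
}
```

Calling `train` with `FixedIterations(10)` performs exactly ten steps; any other class derived from the hypothetical `StoppingCriterion` could be passed in its place.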
To run the experiment, we need to load a dataset and split it into training, validation and test set:
ClassificationDataset data;
importCSV(data, "data/diabetes.csv", LAST_COLUMN, ',');
data.shuffle();
ClassificationDataset test = splitAtElement(data,static_cast<std::size_t>(0.75*data.numberOfElements()));
ClassificationDataset validation = splitAtElement(data,static_cast<std::size_t>(0.66*data.numberOfElements()));
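The resulting split sizes are easy to misread, because the second fraction applies to what is left after the first split. The sketch below mimics the arithmetic on a plain vector. It assumes that splitAtElement keeps the first elements in the original dataset and returns the remainder; `splitAt` is a hypothetical stand-in, and the sample count used in the note that follows is only illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in: keeps the first `index` elements in `data` and
// returns the rest, which is how splitAtElement is assumed to behave.
std::vector<int> splitAt(std::vector<int>& data, std::size_t index){
    assert(index <= data.size());
    std::vector<int> tail(data.begin() + static_cast<std::ptrdiff_t>(index), data.end());
    data.resize(index);
    return tail;
}
```

Under these assumptions and with 768 samples, the first call at 75% leaves 576 elements and returns 192 for testing. The second call at 66% of the remaining 576 leaves 380 for training and returns 196 for validation.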
Evaluation
With the data loaded and split, we can now try out different stopping criteria.
Fixed number of iterations
The simplest stopping heuristic is halting after a fixed number of iterations.
MaxIterations is the subclass of choice here: it simply provides this
trivial functionality within the framework of an AbstractStoppingCriterion.
We try out several different numbers of steps:
MaxIterations<> maxIterations(10);
double resultMaxIterations1 = experiment(network, maxIterations,data,test);
maxIterations.setMaxIterations(100);
double resultMaxIterations2 = experiment(network, maxIterations,data,test);
maxIterations.setMaxIterations(500);
double resultMaxIterations3 = experiment(network, maxIterations,data,test);
Progress on training error
Next we employ a stopping criterion that monitors progress on the
training error \(E\). The stopping criterion TrainingError
takes in its constructor a window size (or number of time steps)
\(T\) together with a threshold value \(\epsilon\). If the
improvement over the last \(T\) time steps does not exceed
\(\epsilon\), that is, \(E(t-T)-E(t) < \epsilon\), the
stopping criterion becomes active and tells the optimizer to stop
(because it assumes that progress over subsequent optimization steps
will be negligible as well). Note that a danger when using this
stopping criterion is that it may stop optimization even when the
algorithm only traverses a plateau or saddle
point. However, the optimizer used here, IRpropPlus, dynamically
adapts its step size and hence is somewhat less vulnerable to these
problems. After all the groundwork has been done, we can test this
stopping criterion with only two lines of code:
TrainingError<> trainingError(10,1.e-5);
double resultTrainingError = experiment(network, trainingError,data,test);
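The windowed rule \(E(t-T)-E(t) < \epsilon\) described above can be written down in a few lines. The class below is a hypothetical re-implementation for illustration, not Shark's TrainingError; it keeps the last \(T\) errors and compares the oldest one with the current one.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical sketch of the rule E(t-T) - E(t) < epsilon.
class WindowedProgress{
public:
    WindowedProgress(std::size_t window, double epsilon)
    : m_window(window), m_epsilon(epsilon){
        assert(window > 0 && epsilon >= 0.0);
    }
    // Feed the training error of the current step; returns true to stop.
    bool stop(double error){
        m_history.push_back(error);
        if(m_history.size() <= m_window) return false; // E(t-T) not yet available
        double improvement = m_history.front() - error; // E(t-T) - E(t)
        m_history.pop_front();
        return improvement < m_epsilon;
    }
private:
    std::size_t m_window;
    double m_epsilon;
    std::deque<double> m_history;
};
```

Fed with a sequence that flattens out, the sketch fires as soon as the error gained over the window drops below the threshold. This is also why a plateau can trigger it prematurely: it cannot distinguish a plateau from convergence.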
Progress on a validation set
To use validation error information, we need to define an additional validation error
function. In the simplest case, this is just an error function using the same objects
as that on the training set, but a different dataset. For simplicity of the tutorial,
we will instead just create it from scratch. The class that takes the current point
of the search space from the optimizer and passes it on to the validation error function
is the so-called ValidatedStoppingCriterion. Its constructor takes as arguments not
only the validation error function, but also another stopping criterion, to which the
result of the validation run is passed and which is prepared to make its decision based
on both training and validation information. In this example, we will use the
GeneralizationQuotient as such a stopping criterion. In detail, it calculates the
ratio of two other criteria to reach its decision, and hence we refer to the class
documentation for an exact description, as well as the scientific publication
mentioned therein.
In summary, this code uses the progress on a validation set to decide when to stop:
CrossEntropy loss;
ErrorFunction validationFunction(validation,&network,&loss);
GeneralizationQuotient<> generalizationQuotient(10,0.1);
ValidatedStoppingCriterion validatedLoss(&validationFunction,&generalizationQuotient);
double resultGeneralizationQuotient = experiment(network, validatedLoss,data,test);
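The class documentation remains the authoritative description of this criterion. As a rough illustration of what a ratio-based rule of this kind can look like, the sketch below follows the generalization-loss and training-progress definitions from Prechelt's "Early Stopping - but when?". That Shark's GeneralizationQuotient matches these formulas term by term is an assumption, the function names are hypothetical, and the sliding window used here simplifies the strip-wise evaluation of the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Generalization loss in percent: how far the current validation error
// lies above the best validation error seen so far.
double generalizationLoss(std::vector<double> const& validation){
    assert(!validation.empty());
    double best = *std::min_element(validation.begin(), validation.end());
    assert(best > 0.0);
    return 100.0 * (validation.back() / best - 1.0);
}

// Training progress over the last k steps, in per mille: how much the mean
// training error in the window exceeds the minimum in the window.
double trainingProgress(std::vector<double> const& training, std::size_t k){
    assert(k > 0 && training.size() >= k);
    auto first = training.end() - static_cast<std::ptrdiff_t>(k);
    double sum = std::accumulate(first, training.end(), 0.0);
    double smallest = *std::min_element(first, training.end());
    assert(smallest > 0.0);
    return 1000.0 * (sum / (static_cast<double>(k) * smallest) - 1.0);
}

// Stop once loss / progress > alpha; written as a product so that zero
// training progress does not cause a division by zero.
bool quotientExceeds(std::vector<double> const& training,
                     std::vector<double> const& validation,
                     std::size_t k, double alpha){
    return generalizationLoss(validation) > alpha * trainingProgress(training, k);
}
```

The design intuition is that a rising validation error is tolerated while the training error is still dropping quickly, and triggers a stop once training progress slows down.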
Printing the results
Printing all variables of type double defined in the snippets above, we get:
RESULTS:
========
10 iterations : 0.375
100 iterations : 0.348958
500 iterations : 0.302083
training Error : 0.333333
generalization Quotient : 0.375
So stopping after 500 iterations yielded the lowest error on the test set among the criteria tried. The TrainingError criterion will, as predicted, wait a lot longer. The GeneralizationQuotient does in fact stop too early in this case, which is very likely due to the small size of the dataset used in the example code.
What you learned
You should have learned the following in this tutorial:
- How to train a feed forward neural network
- How to create a trainer from a general optimization task
- That the choice of stopping criterion matters.
What next?
Now you should be ready to leave the “first steps” section of the tutorials and read through its other sections, which will tell you about various aspects of the library in more detail.