Pretraining of Deep Neural Networks


This is an advanced topic

Training deep neural networks is challenging because standard gradient-based training easily gets stuck in undesired local optima, which prevents the lower layers from learning useful features. This problem can be partially circumvented by pretraining the layers in an unsupervised fashion, thereby initialising them in a region of the error function from which the network is easier to train (or fine-tune) using steepest descent techniques.

In this tutorial we will implement the architecture presented in “Deep Sparse Rectifier Neural Networks” [Glorot11]. The authors propose a multi-layered feed-forward network with rectified linear hidden neurons, which is first pre-trained layer-wise using denoising autoencoders [Vincent08]. Afterwards, the full network is trained in a supervised fashion with an L1 regularisation to enforce additional sparsity.
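To make the role of the rectified linear neurons concrete, here is a minimal sketch in plain C++. It is not Shark code, and the function names are made up for illustration. It applies the rectifier max(0, x) to a vector of pre-activations and measures how many units end up exactly zero, which is the kind of sparsity this architecture aims for.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Rectifier activation: passes positive inputs, clamps everything else to zero.
double rectifier(double x) {
    return std::max(0.0, x);
}

// Apply the rectifier element-wise to a vector of pre-activations.
std::vector<double> rectify(const std::vector<double>& preActivations) {
    std::vector<double> out(preActivations.size());
    std::transform(preActivations.begin(), preActivations.end(), out.begin(), rectifier);
    return out;
}

// Fraction of units whose activation is exactly zero (a simple sparsity measure).
double fractionInactive(const std::vector<double>& activations) {
    assert(!activations.empty());
    auto zeros = std::count(activations.begin(), activations.end(), 0.0);
    return static_cast<double>(zeros) / static_cast<double>(activations.size());
}
```

Unlike sigmoid or tanh units, a rectifier unit with a negative pre-activation is switched off completely, so a hidden layer can represent an input with only a subset of its units.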

Training denoising autoencoders is outlined in detail in the Denoising Autoencoders tutorial, and supervised training of a feed-forward neural network is explained in Training Feed-Forward Networks. This tutorial provides the glue to bring both together.

Due to the complexity of the task, a number of includes are needed:

//noisy AutoencoderModel model and deep network
#include <shark/Models/FFNet.h>// neural network for supervised training
#include <shark/Models/Autoencoder.h>// the autoencoder to train unsupervised
#include <shark/Models/ConcatenatedModel.h>// to concatenate Autoencoder with noise adding model

//training the  model
#include <shark/ObjectiveFunctions/ErrorFunction.h>//the error function performing the regularisation of the hidden neurons
#include <shark/ObjectiveFunctions/Loss/SquaredLoss.h> // squared loss used for unsupervised pre-training
#include <shark/ObjectiveFunctions/Loss/CrossEntropy.h> // loss used for supervised training
#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h> // loss used for evaluation of performance
#include <shark/ObjectiveFunctions/Regularizer.h> //L1 and L2 regularisation
#include <shark/Algorithms/GradientDescent/SteepestDescent.h> //optimizer: simple gradient descent.
#include <shark/Algorithms/GradientDescent/Rprop.h> //optimizer for autoencoders

Deep Network Pre-training

We will use the code of the denoising autoencoder tutorial to pre-train a deep neural network, and we will create another helper function which initialises a deep neural network from the trained denoising autoencoders. A supervised fine-tuning step follows: gradient-based optimisation of the supervised learning goal, using the pre-trained network as the starting point. The types of networks we use are:

typedef Autoencoder<RectifierNeuron,LinearNeuron> AutoencoderModel;//type of autoencoder
typedef FFNet<RectifierNeuron,LinearNeuron> Network;//final supervised trained structure

First, we create a function to initialise the network. We start by training the autoencoders for the two hidden layers. We train the first autoencoder on the original dataset. Next, we take its encoder layer, that is, the connection from the inputs to the hidden units, and compute the feature vectors for every point in the dataset using evalLayer, a method specific to autoencoders and feed-forward networks. Finally, we create the autoencoder for the next layer by training it on the feature dataset:

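//unsupervised pre-training of a network with two hidden layers
Network unsupervisedPreTraining(
        UnlabeledData<RealVector> const& data,
        std::size_t numHidden1,std::size_t numHidden2, std::size_t numOutputs,
        double regularisation, std::size_t iterations
){
        //train the first hidden layer
        //(the argument order of trainAutoencoderModel is assumed here; see the Denoising Autoencoders tutorial)
        std::cout<<"training first layer"<<std::endl;
        AutoencoderModel layer =  trainAutoencoderModel<AutoencoderModel>(
                data,numHidden1,
                regularisation, iterations
        );
        //compute the mapping onto the features of the first hidden layer
        UnlabeledData<RealVector> intermediateData = layer.evalLayer(0,data);

        //train the next layer on the features of the first one
        std::cout<<"training second layer"<<std::endl;
        AutoencoderModel layer2 =  trainAutoencoderModel<AutoencoderModel>(
                intermediateData,numHidden2,
                regularisation, iterations
        );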
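Conceptually, the call evalLayer(0,data) pushes every input through the encoder of the first autoencoder and collects the hidden activations as a new dataset. The plain C++ sketch below illustrates that data flow. It is not Shark code, and the names EncoderLayer, encode and encodeDataset are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vector;
typedef std::vector<Vector> Matrix; // one row per hidden unit

// Encoder half of an autoencoder: h = max(0, W x + b)
struct EncoderLayer {
    Matrix weights; // numHidden x numInputs
    Vector bias;    // numHidden
};

// Map a single input onto the hidden features of the layer.
Vector encode(const EncoderLayer& layer, const Vector& x) {
    assert(layer.weights.size() == layer.bias.size());
    Vector h(layer.bias.size());
    for (std::size_t i = 0; i != h.size(); ++i) {
        assert(layer.weights[i].size() == x.size());
        double sum = layer.bias[i];
        for (std::size_t j = 0; j != x.size(); ++j)
            sum += layer.weights[i][j] * x[j];
        h[i] = std::max(0.0, sum); // rectifier hidden neurons
    }
    return h;
}

// Map a whole dataset; the result is the training set for the next autoencoder.
std::vector<Vector> encodeDataset(const EncoderLayer& layer, const std::vector<Vector>& data) {
    std::vector<Vector> features;
    features.reserve(data.size());
    for (std::size_t n = 0; n != data.size(); ++n)
        features.push_back(encode(layer, data[n]));
    return features;
}
```

The second autoencoder never sees the raw inputs. Its input dimension is the number of hidden units of the first layer, which is why the layers can be trained one after another.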

We can now create the pre-trained network from the autoencoders by creating a network with two hidden layers, initialising all weights randomly, and then setting the first and second hidden layers to the encoding layers of the autoencoders:

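//create the final network
Network network;
network.setStructure(dataDimension(data),numHidden1,numHidden2, numOutputs);
//random initialisation, then overwrite both hidden layers with the trained encoders
//(these three calls are reconstructed from the description above; check them against DeepNetworkTraining.cpp)
initRandomNormal(network,0.1);
network.setLayer(0,layer.encoderMatrix(),layer.hiddenBias());
network.setLayer(1,layer2.encoderMatrix(),layer2.hiddenBias());
return network;
}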
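The initialisation step can be pictured as follows. This is a library-independent sketch with hypothetical names (Layer, initRandom, fromEncoders), not the Shark API, and the standard deviation of 0.1 is an arbitrary value chosen for illustration. All layers start from small random weights, after which the two hidden layers are overwritten with the pre-trained encoder parameters, so that only the output layer keeps its random initialisation.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// One fully connected layer: weights[i][j] connects input j to unit i.
struct Layer {
    std::vector<std::vector<double> > weights;
    std::vector<double> bias;
};

// Draw all weights and biases of a layer from a zero-mean normal distribution.
Layer initRandom(std::size_t numInputs, std::size_t numUnits, double stddev, std::mt19937& rng) {
    std::normal_distribution<double> dist(0.0, stddev);
    Layer layer;
    layer.weights.assign(numUnits, std::vector<double>(numInputs));
    layer.bias.resize(numUnits);
    for (std::size_t i = 0; i != numUnits; ++i) {
        for (std::size_t j = 0; j != numInputs; ++j)
            layer.weights[i][j] = dist(rng);
        layer.bias[i] = dist(rng);
    }
    return layer;
}

// Build a network with two hidden layers taken from pre-trained encoders
// and a randomly initialised output layer.
std::vector<Layer> fromEncoders(const Layer& encoder1, const Layer& encoder2,
                                std::size_t numOutputs, std::mt19937& rng) {
    std::size_t numInputs = encoder1.weights.at(0).size();
    std::size_t numHidden1 = encoder1.bias.size();
    std::size_t numHidden2 = encoder2.bias.size();
    assert(encoder2.weights.at(0).size() == numHidden1); // layers must fit together

    std::vector<Layer> network;
    network.push_back(initRandom(numInputs, numHidden1, 0.1, rng));
    network.push_back(initRandom(numHidden1, numHidden2, 0.1, rng));
    network.push_back(initRandom(numHidden2, numOutputs, 0.1, rng));
    network[0] = encoder1; // overwrite hidden layer 1
    network[1] = encoder2; // overwrite hidden layer 2
    return network;
}
```

The decoder halves of the autoencoders are discarded. They are only needed to define the reconstruction error during pre-training.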

Supervised Training

The supervised training part is largely the same as in previous tutorials, so we only show the code here. We use the CrossEntropy loss for classification and a OneNormRegularizer, applied to the network parameters, to encourage sparsity. We again optimize using IRpropPlusFull:

//model parameters
std::size_t numHidden1 = 8;
std::size_t numHidden2 = 8;
//unsupervised hyper parameters
double unsupRegularisation = 0.001;
std::size_t unsupIterations = 100;
//supervised hyper parameters
double regularisation = 0.0001;
std::size_t iterations = 200;

//load data and split into training and test
LabeledData<RealVector,unsigned int> data = createProblem();
LabeledData<RealVector,unsigned int> test = splitAtElement(data,static_cast<std::size_t>(0.5*data.numberOfElements()));

//unsupervised pre-training
Network network = unsupervisedPreTraining(
        data.inputs(),numHidden1, numHidden2,numberOfClasses(data),
        unsupRegularisation, unsupIterations
);

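//create the supervised problem. Cross Entropy loss with one norm regularisation
CrossEntropy loss;
ErrorFunction error(data, &network, &loss);
OneNormRegularizer regularizer(error.numberOfVariables());
//attach the regulariser, weighted by the supervised regularisation strength
error.setRegularizer(regularisation,&regularizer);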

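//optimize the model
std::cout<<"training supervised model"<<std::endl;
IRpropPlusFull optimizer;
optimizer.init(error);
for(std::size_t i = 0; i != iterations; ++i){
        optimizer.step(error);
        std::cout<<i<<" "<<optimizer.solution().value<<std::endl;
}
//copy the optimised parameters back into the network
network.setParameterVector(optimizer.solution().point);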
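For reference, the quantity minimised in this step can be written down in a few lines. The sketch below is plain C++ rather than Shark code. It assumes that the network outputs are turned into class probabilities with a softmax and that the loss is averaged over the dataset. The names crossEntropy, oneNorm and regularisedError are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cross-entropy of a softmax over the network outputs for one labelled example:
// -log( exp(o[label]) / sum_k exp(o[k]) ), computed with the log-sum-exp trick.
double crossEntropy(const std::vector<double>& outputs, std::size_t label) {
    assert(label < outputs.size());
    double maxOut = *std::max_element(outputs.begin(), outputs.end());
    double sumExp = 0.0;
    for (std::size_t k = 0; k != outputs.size(); ++k)
        sumExp += std::exp(outputs[k] - maxOut);
    return std::log(sumExp) + maxOut - outputs[label];
}

// L1 norm of the parameter vector; penalising it drives parameters towards zero.
double oneNorm(const std::vector<double>& parameters) {
    double sum = 0.0;
    for (std::size_t i = 0; i != parameters.size(); ++i)
        sum += std::abs(parameters[i]);
    return sum;
}

// Mean cross-entropy over the data plus the weighted L1 penalty.
double regularisedError(const std::vector<std::vector<double> >& outputs,
                        const std::vector<std::size_t>& labels,
                        const std::vector<double>& parameters,
                        double regularisation) {
    assert(!outputs.empty() && outputs.size() == labels.size());
    double loss = 0.0;
    for (std::size_t n = 0; n != outputs.size(); ++n)
        loss += crossEntropy(outputs[n], labels[n]);
    return loss / static_cast<double>(outputs.size()) + regularisation * oneNorm(parameters);
}
```

With a small regularisation strength such as 0.0001, the penalty term only becomes noticeable once the classification loss itself is small.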


In the original paper, the networks are optimized using stochastic gradient descent instead of Rprop.
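The practical difference between the two optimisers is easiest to see in their update rules. The sketch below is a simplified illustration in plain C++, not Shark code. sgdStep scales the gradient by a learning rate, whereas rpropStep uses only the sign of each partial derivative together with a per-parameter step size. It is a basic rule without the weight backtracking that the IRpropPlus variants add. All names and default constants are chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Plain gradient step: the update is proportional to the gradient magnitude.
void sgdStep(std::vector<double>& w, const std::vector<double>& grad, double learningRate) {
    assert(w.size() == grad.size());
    for (std::size_t i = 0; i != w.size(); ++i)
        w[i] -= learningRate * grad[i];
}

// Per-parameter state carried between Rprop iterations.
struct RpropState {
    std::vector<double> stepSize; // current step size per parameter
    std::vector<double> prevGrad; // gradient of the previous iteration
    RpropState(std::size_t n, double initialStep)
        : stepSize(n, initialStep), prevGrad(n, 0.0) {}
};

// Sign-based step: grow the step size while the gradient keeps its sign,
// shrink it (and skip the update) when the sign flips.
void rpropStep(std::vector<double>& w, std::vector<double> grad, RpropState& state,
               double increase = 1.2, double decrease = 0.5,
               double minStep = 1e-6, double maxStep = 50.0) {
    assert(w.size() == grad.size() && w.size() == state.stepSize.size());
    for (std::size_t i = 0; i != w.size(); ++i) {
        double signChange = state.prevGrad[i] * grad[i];
        if (signChange > 0.0) {
            state.stepSize[i] = std::min(state.stepSize[i] * increase, maxStep);
        } else if (signChange < 0.0) {
            state.stepSize[i] = std::max(state.stepSize[i] * decrease, minStep);
            grad[i] = 0.0; // no update directly after a sign flip
        }
        if (grad[i] > 0.0) w[i] -= state.stepSize[i];
        else if (grad[i] < 0.0) w[i] += state.stepSize[i];
        state.prevGrad[i] = grad[i];
    }
}
```

Because the Rprop rule ignores gradient magnitudes, it is meant for full-batch gradients. With noisy mini-batch gradients the sign can flip from step to step, which is one reason stochastic training usually relies on plain gradient steps instead.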

Full example program

The full example program is DeepNetworkTraining.cpp. As an alternative route, DeepNetworkTrainingRBM.cpp shows how to do unsupervised pretraining using the RBM module.


[Glorot11] X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse Rectifier Neural Networks. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, JMLR W&CP (15), 2011.
[Vincent08] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pages 1096-1103, ACM, 2008.