Serialization

Most objects in Shark can be serialized, meaning that their internal state can be transferred from and to a stream, e.g., for loading and saving. This short tutorial demonstrates how to use this feature.

Let us start with a basic machine learning example, similar to the one developed in the Support Vector Machines: First Steps tutorial:

#include <shark/Algorithms/Trainers/SvmTrainer.h>
#include <shark/Models/Kernels/GaussianRbfKernel.h>
#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h>
#include <shark/Data/DataDistribution.h>
#include <fstream>

using namespace shark;
using namespace std;

int main(int argc, char** argv)
{
        // generate synthetic data
        Chessboard prob;
        ClassificationDataset training(prob, 500);

        // define a model
        GaussianRbfKernel<> kernel(0.5, true);
        KernelExpansion<RealVector> ke(&kernel, true);

        // train the model
        CSvmTrainer<RealVector> trainer(&kernel, 1000.0);
        trainer.train(&ke, training);

        // evaluate the trained model on the training set
        Data<RealVector> output;
        ke.eval(training.inputs(), output);
        ZeroOneLoss<unsigned int, RealVector> loss;
        double trainError = loss.eval(training.labels(), output);
        cout << "training error of the original model:\t" << trainError << endl;
}

This program trains a support vector machine and outputs its training error. Now let’s assume we want to store the trained model for later use, e.g., as a recovery point in a long running process. We extend the above program:

// save the model to the file "svm.model"
ofstream ofs("svm.model");
boost::archive::polymorphic_text_oarchive oa(ofs);
ke.write(oa);
ofs.close();

Shark makes heavy use of templates. This has many great advantages, but in this case makes life a bit harder. The kernel expansion model internally holds a list of all support vectors, and they are objects of an arbitrary type that comes as a template argument. In other words, the KernelExpansion code does not know anything about this type and how to serialize it. Now, this unknown and possibly user defined type needs to be serialized to a file, since it is an important part of the model’s state. This is where the serialization capability of boost comes into play, since the boost serialization library offers a principled solution to this problem.

Use of this feature is easy. We construct a boost archive object and call the write method of the kernel expansion. The model stores its internal state in the archive. Another interesting aspect of this construction is the handling of the kernel parameters, in this case the bandwidth parameter of the Gaussian RBF kernel. This parameter has been set to 0.5 in the above example, and since the kernel is an integral part of the kernel expansion, the kernel state it stored alongside the other parameters.

Now let’s assume disaster has happened: our long running process was killed, maybe by a power outage. We are lucky, because we have stored the kernel expansion model to disk. So let’s continue the process with the stored model, instead of going through the possibly lengthy training process again:

// load the file "svm.model" into a new model
GaussianRbfKernel<> kernelLoad(true);
KernelExpansion<RealVector> keLoad(&kernelLoad, true);
ifstream ifs("svm.model");
boost::archive::polymorphic_text_iarchive ia(ifs);
keLoad.read(ia);
ifs.close();

That’s all. We construct a boost archive for input and invoke the read method of a fresh kernel expansion model. Note that we have provided the kernel expansion object already with the right type of kernel object, but we have not set its parameters. All parameters (support vectors, weights and bias of the kernel expansion and bandwidth of the kernel) are restored from disk, and the model is straight away ready for evaluation:

// evaluate the loaded model on the training set
keLoad.eval(training.inputs(), output);
trainError = loss.eval(training.labels(), output);
cout << "training error of the loaded model:\t" << trainError << endl;