This short tutorial will demonstrate how data can be normalized using Shark. Read the basic data tutorials first if you are not familiar with the Data containers.
Shark normalizes data by training a LinearModel. Two trainers are available, one for each type of normalization: NormalizeComponentsUnitInterval and NormalizeComponentsUnitVariance. The first rescales every input dimension to the range [0,1]; the second removes the mean and scales each component to unit variance. This is not whitening, because each component is rescaled independently and correlations between components are left unchanged.
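To make the two normalizations concrete, here is a minimal sketch in plain C++ that does not use Shark. It assumes the population variance (division by the number of samples) and leaves constant components unscaled; Shark's trainers may differ in both details. The names AffineMap, fitUnitVariance, fitUnitInterval and applyMap are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Sample = std::vector<double>;

// Hypothetical helper type (not part of Shark):
// one affine map y = scale * x + offset per component.
struct AffineMap {
    std::vector<double> scale;
    std::vector<double> offset;
};

// Fit a map that gives every component zero mean and unit variance.
// Assumption: population variance (divide by n, not n - 1).
AffineMap fitUnitVariance(const std::vector<Sample>& data) {
    assert(!data.empty());
    const std::size_t dim = data.front().size();
    const double n = static_cast<double>(data.size());
    AffineMap map{std::vector<double>(dim, 1.0), std::vector<double>(dim, 0.0)};
    for (std::size_t j = 0; j < dim; ++j) {
        double mean = 0.0;
        for (const Sample& x : data) mean += x[j];
        mean /= n;
        double variance = 0.0;
        for (const Sample& x : data) variance += (x[j] - mean) * (x[j] - mean);
        variance /= n;
        // A constant component has zero variance; keep its scale at 1.
        if (variance > 0.0) map.scale[j] = 1.0 / std::sqrt(variance);
        map.offset[j] = -mean * map.scale[j];
    }
    return map;
}

// Fit a map that sends the minimum of every component to 0 and the maximum to 1.
AffineMap fitUnitInterval(const std::vector<Sample>& data) {
    assert(!data.empty());
    const std::size_t dim = data.front().size();
    AffineMap map{std::vector<double>(dim, 1.0), std::vector<double>(dim, 0.0)};
    for (std::size_t j = 0; j < dim; ++j) {
        double lo = data.front()[j];
        double hi = lo;
        for (const Sample& x : data) {
            lo = std::min(lo, x[j]);
            hi = std::max(hi, x[j]);
        }
        // A constant component has zero range; keep its scale at 1.
        if (hi > lo) map.scale[j] = 1.0 / (hi - lo);
        map.offset[j] = -lo * map.scale[j];
    }
    return map;
}

// Apply the component-wise affine map to a single sample.
Sample applyMap(const AffineMap& map, const Sample& x) {
    assert(x.size() == map.scale.size());
    Sample y(x.size());
    for (std::size_t j = 0; j < x.size(); ++j)
        y[j] = map.scale[j] * x[j] + map.offset[j];
    return y;
}
```

For the two samples (1, 10) and (3, 30), the unit-variance map sends them to (-1, -1) and (1, 1), and the unit-interval map sends them to (0, 0) and (1, 1), up to floating-point rounding. In both cases the map is diagonal, which is why correlations between components survive.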
In the following we will normalize data to unit variance. First we have to train our linear model so that it can perform the normalization:
#include <shark/Data/Dataset.h>
#include <shark/Models/LinearModel.h>
#include <shark/Algorithms/Trainers/NormalizeComponentsUnitVariance.h>
using namespace shark;
int main()
{
    //load data from a file or generate it
    //(loadData() is a placeholder for your own loading code)
    UnlabeledData<RealVector> trainingData = loadData();
    std::size_t dataSize = dataDimension(trainingData); //size of a data vector
    //the normalizer needs input and output dimension equal to the size of a
    //data vector and also a bias weight!
    LinearModel<> normalizer(dataSize, dataSize, true); //last argument is for the offset
    //estimate the scaling and offset from the training data
    NormalizeComponentsUnitVariance<> normalizingTrainer;
    normalizingTrainer.train(normalizer, trainingData);
}
Now the normalizer is ready to use. Data::transform() can be applied to normalize the previously declared training data:
trainingData.transform(normalizer);
Applying transform copies the training data and disconnects it from previously created subsets of this set. Subsets created before the call are therefore not normalized, whereas subsets created afterwards are. To apply such a normalization to LabeledData, use the methods transformInputs and transformLabels. Test data can be normalized as well, either together with the training data or separately. The following example trains a normalizer on the labels of a regression task and applies it to them:
int main()
{
    //load data somehow from a file or generate it
    LabeledData<RealVector,RealVector> trainingData = loadData();
    std::size_t labelSize = labelDimension(trainingData); //size of label vector
    //train normalizer on the labels only
    LinearModel<> labelNormalizer(labelSize, labelSize, true);
    NormalizeComponentsUnitVariance<> normalizingTrainer;
    normalizingTrainer.train(labelNormalizer, trainingData.labels());
    //apply normalizer to the labels (not the inputs)
    trainingData.transformLabels(labelNormalizer);
}
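The remark that test data can be normalized as well can be illustrated without Shark. The sketch below is a hypothetical scalar helper, not a Shark class. It estimates the statistics from the training values only and then applies the same map to any later value, which is one common way to treat held-out data. It assumes the population variance.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not part of Shark): standardizes scalars
// with statistics that are estimated once and then reused.
struct Standardizer {
    double mean = 0.0;
    double stddev = 1.0;

    // Estimate mean and standard deviation from the training values only.
    // Assumption: population variance (divide by n).
    void fit(const std::vector<double>& train) {
        assert(!train.empty());
        const double n = static_cast<double>(train.size());
        double sum = 0.0;
        for (double v : train) sum += v;
        mean = sum / n;
        double squares = 0.0;
        for (double v : train) squares += (v - mean) * (v - mean);
        const double variance = squares / n;
        // Constant data has zero variance; fall back to a scale of 1.
        stddev = variance > 0.0 ? std::sqrt(variance) : 1.0;
    }

    // Apply the stored statistics to a training or test value.
    double transform(double v) const { return (v - mean) / stddev; }
};
```

After fitting on the training values 1 and 3, the stored mean is 2 and the standard deviation is 1, so a test value of 5 is mapped to 3. The test value is not forced to lie within the range of the normalized training data.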
You can concatenate a normalizer with another model. This comes in handy when a model has to process a stream of new input data: a single call to eval applies the normalization and then the trained model:
#include <shark/Models/ConcatenatedModel.h>
//...
YourModel model;
ConcatenatedModel<RealVector,RealVector> completeModel = normalizer >> model;
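Conceptually, the concatenation is function composition. The following plain C++ sketch does not use Shark. Model and chain are hypothetical names, and scalar functions stand in for real models. It only illustrates why one evaluation of the combined object is enough.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for a model: any callable from double to double.
using Model = std::function<double(double)>;

// Chain two models so that a single call evaluates `first` and then `second`.
Model chain(Model first, Model second) {
    assert(first && second); // both stages must hold a callable
    return [first, second](double x) { return second(first(x)); };
}
```

For example, chaining a normalizer x -> (x - 2) / 4 with a model y -> 3 * y + 1 yields a combined callable that maps 10 to 7 in one call.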
For a more complex example of how normalization can be used, see the tutorial about training the Extreme Learning Machine with the complete example source elmTutorial.cpp.