Kernels

The term kernel function, or kernel for short, is overloaded. Here, we refer to positive semi-definite kernels, that is, the functions inducing reproducing kernel Hilbert spaces [Aronszajn1950]. They underlie the “kernel trick” [Schölkopf2002], which is used, for instance, in non-linear support vector machines (SVMs).

This tutorial covers the knowledge required for using existing kernel-based predictors and learning algorithms as well as for writing your own. That is, it explains the kernel interface from a user perspective. Writing a specialized kernel function itself is explained in the tutorial Writing Kernel Functions.

Background

Given some set \(\mathcal X\), a positive semi-definite kernel \(k:\mathcal X\times\mathcal X\to\mathbb R\) is a symmetric function for which

\[\sum_{i=1}^N\sum_{j=1}^N a_i a_j k(x_i, x_j) \ge 0\]

for all \(N\), all \(x_1,...,x_N\in\mathcal X\), and all \(a_1,...,a_N\in\mathbb R\).

A kernel \(k\) on \(\mathcal X\) corresponds to a scalar product in a dot product space \(\mathcal H\), the so-called feature space:

\[k(x,y) = \langle \phi(x),\phi(y) \rangle_{\mathcal H}\]

where \(x\) and \(y\) are elements of \(\mathcal X\), \(\phi\) is a map from \(\mathcal X\) to \(\mathcal H\), and \(\langle \cdot, \cdot \rangle_{\mathcal H}\) is the scalar product in \(\mathcal H\). For details we refer to [Aronszajn1950] and [Mercer1909].
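
A standard example for \(\mathcal X=\mathbb R^2\) is the kernel \(k(x,y)=\langle x,y\rangle^2\), which corresponds to the feature map \(\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)\):

\[\langle \phi(x),\phi(y) \rangle = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = \langle x,y \rangle^2 .\]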

Many machine learning algorithms can be written in such a way that the only operations involving input elements are scalar products between those elements. A common strategy in machine learning is to map the input data into a feature space \(\mathcal H\) and to do the learning in this feature space. If the only operations in \(\mathcal H\) are scalar products, these can be replaced by kernel function evaluations, rendering explicit computation of the mapping \(\phi\) to feature space unnecessary. This has some advantages:

  • Typically, the kernel can be evaluated much more efficiently than the scalar product in feature space could be computed explicitly. This allows for working in very high-dimensional feature spaces.
  • The kernel provides a clean interface between general and problem specific aspects of the learning machine.

Thus, the “kernel trick” allows efficient formulation of nonlinear variants of any algorithm that can be expressed in terms of dot products. The choice of the kernel function is crucial for the performance of the machine learning algorithm.

The generic distance between two points mapped to a kernel-induced feature space is given by

\[d(x,y) = \sqrt{\langle \phi(x)-\phi(y), \phi(x)-\phi(y) \rangle_{\mathcal H}} =\sqrt{k(x,x) - 2k(x,y) + k(y,y)}\]

where \(d\) is the distance between the points \(x\) and \(y\). We call a kernel normalized if \(k(x,x)=1\) for all \(x\). In this case the distance computation reduces to \(d(x,y) =\sqrt{2 - 2k(x,y)}\).
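
For example, the Gaussian RBF kernel introduced below satisfies \(k(x,x)=e^{-\gamma ||x-x||^2}=1\) for all \(x\), so it is normalized and the induced feature-space distance is simply \(d(x,y)=\sqrt{2-2k(x,y)}\).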

Kernels in Shark

Shark provides strong support for kernel-based algorithms. The base class of all kernel functions is AbstractKernelFunction. A weighted sum of kernel functions evaluated at fixed points, plus an optional offset, is represented in Shark as a KernelExpansion

\[\sum_{i=1}^N \alpha_i k(x_i, . ) + b\]

with \(x_1,...,x_N\in\mathcal X\), \(\alpha_1,...,\alpha_N\in\mathbb R\), and optional bias/offset parameter \(b\in\mathbb R\).
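
To make this concrete, the following sketch evaluates such an expansion for a single input by hand, using only the single-argument kernel evaluation introduced later in this tutorial. The kernel choice, its constructor argument, and the containers holding the expansion data are placeholders for illustration; in practice, the KernelExpansion class encapsulates this computation.

GaussianRbfKernel<RealVector> kernel(0.5); // example kernel; the constructor argument (gamma) is an assumption
std::vector<RealVector> points;            // expansion points x_1, ..., x_N
std::vector<double> alpha;                 // coefficients alpha_1, ..., alpha_N
double b = 0.0;                            // optional offset
RealVector x;                              // input at which the expansion is evaluated

// f(x) = sum_i alpha_i * k(x_i, x) + b
double f = b;
for (std::size_t i = 0; i != points.size(); ++i)
    f += alpha[i] * kernel(points[i], x);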

Many kernel-based algorithms repeatedly evaluate the kernel on training points \(x_1,\dots,x_N\), or they operate directly on the kernel (Gram) matrix \(K\) with entries \(K_{ij}=k(x_i,x_j)\). To save computation time, the matrix \(K\) is ideally precomputed and stored in memory. Depending on the hardware, however, even training sets with a few hundred thousand points can make this prohibitive. Therefore, often only parts of \(K\) are computed at a time, most often single rows or blocks of the matrix. In Shark, the classes KernelMatrix and CachedMatrix, as well as some derived and sibling classes, encapsulate kernel Gram matrices. The CachedMatrix additionally takes care of the memory handling automatically.

The base class ‘AbstractKernelFunction<InputTypeT>’

The interface of kernels can be understood as a generalization of the interface of Models to functions taking two arguments of the same type. All kernels are derived from the abstract class AbstractKernelFunction. Since computations involving kernel evaluations are demanding, the interface is optimized for speed and designed so that different parts of the kernel Gram matrix can be evaluated in parallel. In the following, the basic design decisions are outlined and explained. Since kernels and models have much in common, consider reading the Models tutorial first.

Types

First, we introduce the templated types of a kernel, which are all inferred from the single template argument InputType using several metafunctions. As for Models, there are the InputType and the BatchInputType, which is a batch of inputs. In contrast to Models, kernels also introduce special reference types:

  • InputType: Argument type of the kernel
  • BatchInputType: Batch of arguments; the same as Batch<InputType>::type
  • ConstInputReference: Constant reference to InputType as returned by ConstProxyReference<InputType>::type; by default this is InputType const&
  • ConstBatchInputReference: Constant reference to BatchInputType as returned by ConstProxyReference<BatchInputType>::type

The reason for the ConstInputReference and ConstBatchInputReference types is that we want to exploit the structure of the arguments to avoid unnecessary copying. Consider the common case that the kernel is evaluated on single elements of a batch of data. If the batch type is a matrix, the argument is a row of this matrix and not a vector. Without proxy references, the row would be copied into a temporary vector, which is then fed into the kernel. This is unnecessary, and for fast kernels the copying can exceed the running time of the kernel evaluation itself. Therefore we use proxy references for vectors, which treat matrix rows and vectors in the same way. Note that this optimization currently only works for dense vectors, and not, for example, for sparse vectors or more complex types.


Flags

Like a Model, every kernel has a set of flags and convenience access functions which indicate the traits and capabilities of the kernel:

  • HAS_FIRST_PARAMETER_DERIVATIVE, hasFirstParameterDerivative: If set, the kernel can evaluate the first derivative w.r.t. its parameters
  • HAS_FIRST_INPUT_DERIVATIVE, hasFirstInputDerivative: If set, the kernel can evaluate the first derivative w.r.t. its left input argument; this is no restriction, since kernel functions are symmetric
  • IS_NORMALIZED, isNormalized: If set, \(k(x,x)=1\) holds for all \(x\)
  • SUPPORTS_VARIABLE_INPUT_SIZE, supportsVariableInputSize: If set, the number of input dimensions is allowed to vary between different calls to \(k(x,y)\); this is needed for kernel evaluation of inputs with missing features
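
A program can query the convenience functions listed above at runtime before relying on a capability. A minimal sketch (the concrete kernel and its constructor argument are placeholders):

GaussianRbfKernel<RealVector> kernel(0.5); // example kernel; constructor argument assumed

if (kernel.hasFirstParameterDerivative()) {
    // the kernel parameters can be tuned by gradient-based optimization
}
if (kernel.isNormalized()) {
    // k(x,x) = 1 holds for all x, so feature-space distances reduce to sqrt(2 - 2 k(x,y))
}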

Evaluation

Next, we introduce the functions evaluating kernels. There are three versions. The first simply calculates the kernel value for two single inputs. The second computes the kernel evaluation of two batches of inputs: the inner product in Hilbert space is calculated between all points of the first and all points of the second batch, so the result is a matrix of inner products, that is, a block of the kernel Gram matrix. The third version also takes two batches, plus an additional state object. The state is a data structure in which the kernel can store intermediate results of the evaluation; these can later be reused in the computation of the derivatives. Thus, when derivatives are to be computed, this third version must be called beforehand to fill the state object with the correct values.

There is no derivative version for two single inputs, because this is a rare use case; if needed, batches of size one can be used. The state object is external to the kernel class because this design allows concurrent evaluation of the kernel from different threads, with each thread holding its own state object.

With this in mind, we now present the list of functions for eval, including the convenience operator(). In the following, let I denote a ConstInputReference and B a ConstBatchInputReference.

  • double eval(I x, I z): Calculates \(k(x,z)\)
  • void eval(B X, B Z, RealMatrix& K): Calculates \(K_{ij}=k(x_i,z_j)\) for all elements \(x_i\) of X and \(z_j\) of Z
  • void eval(B X, B Z, RealMatrix& K, State&): Calls eval(X,Z,K) while storing intermediate results needed for the derivative functions
  • double operator()(I x, I z): Calls eval(x,z)
  • RealMatrix operator()(B X, B Z): Calls eval(X,Z,K) and returns K

For a kernel, it is sufficient to implement the batch version of eval that stores the state, since all other functions can rely on it. However, if speed is relevant, all three eval functions should be implemented in order to avoid unnecessary copy operations.
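
As an illustration, the following sketch evaluates a kernel on two single inputs and on two batches of inputs. The kernel choice and its constructor argument are placeholders, and filling x, z, X, and Z with data is omitted.

GaussianRbfKernel<RealVector> kernel(0.5); // example kernel; constructor argument assumed

RealVector x, z;  // two single inputs
RealMatrix X, Z;  // two batches of inputs, one input per row

double k1 = kernel.eval(x, z); // k(x,z)
double k2 = kernel(x, z);      // the same via operator()

RealMatrix K;                  // block of the kernel Gram matrix
kernel.eval(X, Z, K);          // K(i,j) = k(x_i, z_j)
RealMatrix K2 = kernel(X, Z);  // the same, returning the matrix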

Distances

As outlined before, kernels can also be used to compute distances between points in \(\mathcal H\):

  • double featureDistanceSqr(I x, I z): Returns the squared distance between x and z in feature space
  • double featureDistance(I x, I z): Returns the distance between x and z in feature space
  • RealMatrix featureDistanceSqr(B X, B Z): Returns the squared distances between all points in X and all points in Z
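
For example, with the same placeholder kernel and inputs as in the evaluation sketch above:

double d2 = kernel.featureDistanceSqr(x, z); // k(x,x) - 2 k(x,z) + k(z,z)
double d  = kernel.featureDistance(x, z);    // the square root of the above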

Derivatives

Some kernels are differentiable with respect to their parameters. This can, for example, be exploited in gradient-based optimization of these parameters, which in turn amounts to a computationally efficient way of finding a suitable feature space \(\mathcal H\) in which to solve a given learning problem. Further, if the inputs come from a differentiable space, even the derivative with respect to the inputs can be computed.

The derivatives are weighted as outlined in Shark Conventions for Derivatives. The parameter derivative is a weighted sum of the derivatives over all elements of the computed block of the kernel Gram matrix; the weight matrix therefore has one entry per element of that block. The input derivative is taken with respect to the left argument and carries weights only for the inputs of the right argument.
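
In formulas, with \(x_i\) the elements of the left batch and \(z_j\) the elements of the right batch (as in the eval functions above), \(W\) a weight matrix with one entry per element of the computed block, and \(\theta\) the vector of kernel parameters, the parameter derivative is

\[\frac{\partial}{\partial \theta} \sum_{i}\sum_{j} W_{ij}\, k(x_i, z_j),\]

and the input derivative with respect to a left input \(x_i\) is

\[\frac{\partial}{\partial x_i} \sum_{j} W_{ij}\, k(x_i, z_j).\]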


The methods for evaluating the derivatives are:

  • weightedParameterDerivative: Computes the weighted derivative w.r.t. the parameters, summed over all elements of a block of the kernel Gram matrix
  • weightedInputDerivative: Computes the derivative with respect to the left argument, weighted over all right arguments

Putting everything together, we can calculate the derivative of a kernel like this:

BatchInputType X; // first batch of inputs
BatchInputType Y; // second batch of inputs
RealMatrix K;     // resulting block of the kernel Gram matrix
MyKernel kernel;  // the differentiable kernel

// evaluate K for X and Y, storing the intermediate results in the state object
boost::shared_ptr<State> state = kernel.createState();
kernel.eval(X, Y, K, *state);

// somehow compute some weights; the weight matrix has the same dimensions as K
RealMatrix weights = someFunction(K, X, Y);
RealVector derivative;
kernel.weightedParameterDerivative(X, Y, weights, *state, derivative);
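
If the kernel also supports input derivatives (see the HAS_FIRST_INPUT_DERIVATIVE flag), the derivative with respect to the left batch of inputs can be computed in the same fashion. The following call is only a sketch that assumes a signature analogous to weightedParameterDerivative:

// derivative w.r.t. the inputs in X, weighted over the inputs in Y
BatchInputType inputDerivative;
kernel.weightedInputDerivative(X, Y, weights, *state, inputDerivative);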


Other

Kernels support several further concepts: they have parameters, they can be serialized, and they use an external state object. The corresponding member functions are:

  • numberOfParameters: The number of parameters which can be optimized
  • parameterVector: Returns the current parameter vector of the kernel object
  • setParameterVector: Sets a new parameter vector
  • createState: Returns a newly created State object holding the state to be stored in eval
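
For example, the parameters of a kernel can be read and written through this interface. The concrete kernel below is a placeholder, and the meaning and encoding of the parameter vector entries depend on the respective kernel class.

GaussianRbfKernel<RealVector> kernel(0.5);    // example kernel; constructor argument assumed

std::size_t n = kernel.numberOfParameters();  // number of tunable parameters
RealVector theta = kernel.parameterVector();  // current parameter values, e.g., as seen by an optimizer
// ... modify theta, for example inside an optimization loop ...
kernel.setParameterVector(theta);             // write the (modified) parameters back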

Kernel Helper Functions

The file KernelHelpers.h defines some free functions that help with common tasks in kernel usage. Currently, it offers the following functions:

  • calculateRegularizedKernelMatrix: Evaluates the full kernel Gram matrix given a kernel and a dataset; optionally, a regularization value is added to the main diagonal
  • calculateKernelMatrixParameterDerivative: Computes the weighted parameter derivative of a kernel Gram matrix defined by a kernel, a dataset, and a weight matrix
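
A rough sketch of their use, assuming a dataset of input points is already available and that both functions return their result by value; the regularization value and the weight matrix below are placeholders:

GaussianRbfKernel<RealVector> kernel(0.5);  // example kernel; constructor argument assumed
Data<RealVector> data;                      // some dataset of input points

// full kernel Gram matrix with 1e-4 added to the main diagonal
RealMatrix K = calculateRegularizedKernelMatrix(kernel, data, 1e-4);

// weighted derivative of the Gram matrix w.r.t. the kernel parameters
RealMatrix W(K.size1(), K.size2());         // weight matrix, one entry per matrix element
RealVector derivative = calculateKernelMatrixParameterDerivative(kernel, data, W);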

List of Kernels

Shark implements a number of general purpose kernels:

  • LinearKernel: Standard Euclidean inner product \(k(x,y) = \langle x,y \rangle\)
  • MonomialKernel: For a given exponent n, computes \(k(x,y) = \langle x,y \rangle^n\)
  • PolynomialKernel: For a given exponent n and offset b, computes \(k(x,y) = \left(\langle x,y \rangle+b\right)^n\)
  • DiscreteKernel: A kernel on a discrete space, explicitly defined by a symmetric, positive semi-definite Gram matrix
  • GaussianRbfKernel: Gaussian isotropic ("radial basis function") kernel \(k(x,y) = e^{-\gamma ||x-y||^2}\)
  • ARDKernelUnconstrained: Gaussian kernel \(k(x,y) = e^{-(x-y)^T C(x-y)}\) with diagonal parameter matrix C

Valid positive semi-definite kernels can be formed, among others, by adding and multiplying kernels. This leads to a range of what we call combined kernels listed below:

  • WeightedSumKernel: For a given set of kernels, computes \(k(x,y) = k_1(x,y)+\dots + k_n(x,y)\)
  • ProductKernel: For a given set of kernels, computes \(k(x,y) = k_1(x,y) \cdots k_n(x,y)\)
  • NormalizedKernel: Normalizes a given kernel; computes \(k(x,y) = k_1(x,y) / \sqrt{k_1(x,x)\, k_1(y,y)}\)
  • ScaledKernel: Scales a given kernel by a fixed constant
  • SubrangeKernel: Weighted sum kernel for vector spaces; every kernel receives only a subrange of the input
  • MklKernel: Weighted sum kernel for heterogeneous input tuples; every kernel receives one part of the input tuple
  • GaussianTaskKernel (see MultiTaskKernel.h): Specialization of the DiscreteKernel for multi-task learning
  • MultiTaskKernel (see MultiTaskKernel.h): Framework kernel for multi-task learning with kernels
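
As an illustration, a sum of two base kernels could be set up roughly as follows; the exact WeightedSumKernel constructor (here assumed to take a vector of pointers to the base kernels) and the Gaussian kernel's constructor argument are assumptions and should be checked against the class documentation:

LinearKernel<RealVector> linear;        // k_1(x,y) = <x,y>
GaussianRbfKernel<RealVector> rbf(0.5); // k_2(x,y) = exp(-gamma ||x-y||^2); gamma assumed

std::vector<AbstractKernelFunction<RealVector>*> base; // the base kernels
base.push_back(&linear);
base.push_back(&rbf);
WeightedSumKernel<RealVector> sum(base); // k(x,y) = k_1(x,y) + k_2(x,y)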

References

[Aronszajn1950] Aronszajn, N. Theory of Reproducing Kernels. Transactions of the American Mathematical Society 68(3): 337–404, 1950.
[Mercer1909] Mercer, J. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, 1909.
[Schölkopf2002] Schölkopf, B. and Smola, A. Learning with Kernels. MIT Press, 2002.