# Loss and Cost Functions¶

Shark uses the notion of loss and cost functions to define machine learning tasks.

## Loss functions¶

Consider a model (a hypothesis) \(f\) mapping inputs \(x\)
to predictions \(y=f(x)\in Y\). Let \(t\in Y\) be the true
label of input pattern \(x\). Then a *loss function*
\(L:Y\times Y\to\mathbb{R}^+_0\) measures the quality of the
prediction. If the prediction is perfectly accurate, the loss function
is zero (\(t=y\Rightarrow L(t, y)=0\)). If not, the loss
function measures “how bad” the mistake is. The loss can be
interpreted as a penalty or error measure.

For a classification tast, a fundamental loss function is the 0-1-loss:

For regression, the squared loss is most popular:

Using the concept of a loss function, the goal of supervised learning
can be described as finding a model \(f\) minimizing the *risk*:

Here the expectation \(\mathbb{E}\) is over the joint distribution underlying the observations of inputs and corresponding labels.

## Cost functions¶

Now let us consider a collection of observations
\(S=\{(x_1,t_1),(x_2,t_2),\dots,(x_N,t_N)\}\in(X\times Y)^N\) and
corresponding predictions \(y_1.y_2,\dots,y_N\) by a model \(f\).
A *cost function* \(C\) is a mapping assigning
an overall cost value, which can be interpreted as an overall error,
to \(\{(y_1,t_1),(y_2,t_2),\dots,(y_N,t_N)\}\in(Y\times Y)^N\).
Every loss function induces a cost function, namely the *empirical
risk*:

The cost function induced by the 0-1-loss is the average misclassification error and the cost function induced by the squared loss is the mean squared error (MSE).

However, there are cost functions which cannot be decomposed using a loss
function. For example, the *area under the curve* (AUC).
In other words, all loss functions generate a cost function, but not all cost
functions must be based on a loss function.

Todo

i think what is missing here is a comment on the normalization conventions. i can’t even quite figure them out on my own from the code. it seems that for losses, when the type passed to eval is data-of-something, then the result is normalized by the number of samples, and if i use the batch interface of eval, it is not normalized, correct? so, using the data interface is a statement that one wants normalization and otherwise on always does it oneself? i am aware that this could also go down to where the eval-interfaces are explained, but i think it should be at a more prominent place actually.

### Derivatives¶

When both the loss function and the model are differentiable, it is possible to calculate the derivative of the empirical risk with respect to the model parameters \(w\):

This allows embarrassingly parallelizable gradient descent on the cost function. Please see the tutorial Shark Conventions for Derivatives for learning more about the handling of derivatives in Shark.

## The base class ‘AbstractCost<LabelTypeT,OutputTypeT>’¶

The base class AbstractCost is templatized with respect to both the label and output type. Using batches, that is, collections of input elements, is an important concept in Shark, see the tutorial Data Batches. The proper batch types are inferred from the label and output types:

Types | Description |
---|---|

LabelType | Type of a label \(t_i\) |

OutputType | Type of a model output \(z_i\) |

BatchLabelType | Batch of Labels; same as Batch<LabelType>::type |

BatchOutputType | Batch of Outputs; same as Batch<OutputType>::type |

Like all other interfaces in Shark, cost functions have flags indicating their internal capabilities:

Flag, Accessor function | Description |
---|---|

HAS_FIRST_DERIVATIVE, hasFirstDerivative | Can the cost function calculate its first derivative? |

IS_LOSS_FUNCTION, isLossFunction | Is the cost function a loss in the above terms (i.e., separable)? |

The interface of AbstractCost reflects the fact that costs can only be evaluated
on a complete set of data. The following functions can be used for evaluation of
`AbstractCost`

. For brevity let `L`

be the `LabelType`

and `O`

the
`OutputType`

:

Method | Description |
---|---|

`double eval(Data<L> const& label, Data<O> const& predictions)` |
Returns the cumulated cost of the predictions \(z_i\) given the label \(t_i\). The loss is normalized by the number of points in the datasets, thus the mean loss is returned. |

`double operator()(Data<L> const& label, Data<O> const& predictions)` |
Convenience function Returning eval(label,predictions) |

`double evalDerivative(Data<L> const&label, Data<O> const& predictions, Data<O>& gradient)` |
Returns the error of the predictions \(z_i\) given the label \(t_i\) and computes \(\frac 1 N \frac {\partial}{\partial z_i}L(z_i,t_i)\), where \(N\) is the number of data points. This function also returns the mean of the loss as its return value. |

## The base class ‘AbstractLoss<LabelTypeT,OutputTypeT>’¶

The base class AbstractLoss is derived from AbstractCost. It implements
all methods of its base class and offers several additional methods. Shark code is
allowed to read the flag `IS_LOSS_FUNCTION`

via the public method `isLossFunction()`

and to downcast an AbstractCost object to an AbstractLoss. This enables the use of the
following much more efficient interface:

Method | Description |
---|---|

`double eval(LabelType const& t, InputType const& z)` |
Returns the error of the prediction \(z\) given the label \(t\). |

`double eval(BatchLabelType const& T, BatchInputType const& Z)` |
Returns the error of the predictions \(z_i \in Z\) given the label \(t_i \in T\). |

`double operator()(LabelType const& t, InputType const& z)` |
Calls eval(t,z) |

`double operator()(BatchLabelType const& T, BatchInputType const& Z)` |
Calls eval(T,Z) |

`double evalDerivative(BatchLabelType const& T, BatchInputType const& Z, BatchInputType const& gradient)` |
Returns the error of the predictions \(z_i\) given the label \(t_i\) and computes \(\frac {\partial}{\partial z_i}L(z_i,t_i)\) |

## List of Cost and Loss functions¶

Currently only one instance of AbstractCost is implemented:

Model | Description |
---|---|

NegativeAUC | Area under the ROC (receiver operating characteristic) curve. Value is negated so that it plays well with optimizers (which perform minimization by convention) |

Loss Functions:

Model | Description |
---|---|

AbsoluteLoss | Returns the \(L_2\)-norm of the distance, \(|t-z|_2\) |

SquaredLoss | Returns the squared distance in two-norm \(|t-z|_2^2\); standard regression loss |

ZeroOneLoss | Returns 0 if \(t_i=z_i\) otherwise standard classification loss |

DiscreteLoss | Uses a cost matrix to calculate losses in a discrete output and label space (general classification loss) |

CrossEntropy | Logarithmic likelihood function if the model outputs are interpreted as exponents of a softmax classifier; useful, e.g., for training of neural networks with linear outputs |

HingeLoss | Loss used in Maximum margin classification. Binary and multiclass implemented. |

SquaredHingeLoss | Loss used in Maximum margin classification. It is the pointwise Square of the HingeLoss. It is differentiable everywhere. |

EpsilonHingeLoss | Loss for regression. It can be underestood as the 1-norm loss which is cut off to 0 in a box of size epsilon around the label. |

SquaredEpsilonHingeLoss | Maximum margin regression. It is zero in a ball of size epsilon around the label and outside the squared two-norm of the distance of prediction and label. Thus very close points are not punished. |

HuberLoss | Robust loss for rgression. It is quadratic close to 0 and becomes a linear function for big discrepancies between model prediction and target. |

TukeyBiweightLoss | Robust loss for regression. It is similar to the Huber loss, but instead of becoming linear, it becomes constant. |

Todo

i think the descriptions in the right table need some update. for example, the one for CrossEntropyIndependent does not make sense;