belief_flows

# Belief Flows

## Paper

Ortega, P.A., Crammer, K., Lee, D.D.
“Belief Flows for Robust Online Learning.”
Information Theory and Applications (ITA), February 2015.
[PDF] [Slides]

## In a nutshell

Belief flows chooses the most conservative belief update given a single observation of the error gradient at a location chosen through Thompson sampling.

We introduce an online learning method, which we call belief flows, that scales to very large datasets and complex model classes while avoiding overfitting. This is done by combining ideas from stochastic gradient descent, Bayesian filtering, and multi-armed bandits. Simply stated the algorithm:

1. tracks the uncertainty by placing distributions over the model parameters;
2. updates these distributions by minimizing the Kullback-Leibler divergence, subject to a class of possible updates and a noisy observation of gradient descent step;
3. and evaluates inputs using randomly drawn parameters (i.e. Thompson sampling).

Our preliminary experiments show that belief flows outperforms state-of-the-art regularization techniques, such as DROPOUT in deep learning.

## Motivation

Many problems in machine learning must cope with continual learning tasks involving very large datasets or even data streams. As more data becomes available, we naturally start asking more complex questions about it, requiring richer models to answer them such as a deep neural network. In turn, these models demand learning algorithms that scale in accuracy and computational resources. In fact, recent work has shown that complex models require regularization to avoid overfitting even when the data is abundant 1).

## Leveraging the best of the old

How do we formulate efficient online learning algorithms that scale well to complex models? Consider two well-known approaches:

1. Stochastic Gradient Descent: Many learning tasks can be cast as optimization problems, and stochastic gradient descent (SGD) is simple, scalable and enjoys strong theoretical guarantees in convex problems. Unfortunately, SGD overfits easily if not properly regularized—especially in rich, non-convex model classes such as deep neural networks.
2. Bayesian Filtering: On the other hand, Bayesian filtering methods, such as the Kalman filter, track belief distributions over the optimal parameters to avoid overfitting. On the downside, they are typically computationally prohibitive when the model class is too rich.

Let's think about this quickly. How would you design an algorithm that leverages the best parts of both aforementioned methods? This is what we came up with:

1. Track parameter uncertainty to avoid overfitting. Like in Bayesian methods, we place distributions over the model parameters.
2. Update parameters based on the gradient information to accelerate learning. Like in SGD, we want to guide the exploration of the parameter space using the error gradient.
3. Reduce computational complexity by replacing marginalization with reinforcement learning. To make predictions, Bayesian methods must marginalize over the model parameters. However, marginalization becomes quickly intractable when the model class is large. Instead, let's think of the learning and prediction problem differently, by treating it as a reinforcement learning problem: the goal of the learning algorithm is to choose a sequence of model parameters in order to reduce the expected prediction error—just like a choosing a sequence of arms in order to minimize the expected reward in a bandit problem.

Schematic comparison: a) SGD; b) Bayesian filtering; c) Belief flows.

## Belief flows

Let's illustrate the idea using Gaussian distributions for representing parameter uncertainty. Assume we focus on supervised prediction tasks using parametrized models $F_w(x)$ with inputs $x \in \mathbb{R}^p$ and parameters $w \in \mathbb{R}^d$. A deep neural network would be an example. Then, in each round, we do the following.

1. Prior: We place a Gaussian distribution $P(w)$ to represent our parameter uncertainty. To simplify our exposition, we assume that the covariance matrix is diagonal, and so $P(w) = N(w; \mu, \Sigma) = \prod_n N(w_n; \mu_n, \sigma^2_n),$ where $w_n$, $\mu_n$ are the $n$-th components of the parameter and mean vectors respectively, and $\sigma^2_n$ is the $n$-th diagonal element of the covariance matrix $\Sigma$.
2. Parameter choice: The learning algorithm now has to choose model parameters to minimize the prediction error. It does so using Thompson sampling, that is, by sampling a parameter vector $w'$ from the prior distribution: $\bar{w} \sim P(w).$
3. Evaluation of Loss and Local Update: Once the parameter is chosen, the learning algorithm is given a supervised pair $(x, y)$ that is can use to evaluate the loss $\ell(y, \hat{y})$, where $\hat{y} = F_{\bar{w}}(x)$ is the predicted output. Based on this loss, the learning algorithm can calculate the update of the parameter $\bar{w}$ using SGD: $\bar{w}' = \bar{w} - \eta \cdot \frac{\partial}{\partial w} \ell(y,\hat{y}),$ where $\eta > 0$ is the learning rate.
4. Global Update: Now, the algorithm has to change its prior beliefs $P(w)$ into posterior beliefs $P'(w)$. To do so, it must infer the SGD update over the whole parameter space based solely on the local observation $\bar{w} \rightarrow \bar{w}'$.
1. If we assume a quadratic error function with uncorrelated coordinates, then the class of possible SGD updates becomes the class of linear flow fields in parameter space that transforms each component as $w'_n = a_n w_n + b_n,$ preserving the Gaussian shape of the resulting posterior. However, there are many such transformation that are consistent with the observed SGD update $\bar{w} \rightarrow \bar{w}'$, so which one should the algorithm choose?
2. Fortunately, there's a principled answer to this question: The algorithm should pick the most conservative information update2) by minimizing $D(P'\|P) = \int P'(w) \log \frac{ P'(w) }{ P(w) } \, dw$ subject to the shape of the flow field and the observed update.
3. In our paper we show that in this case, the optimal posterior $P^\ast$ is another Gaussian with diagonal covariance matrix having posterior hyperparameters $\mu_n' = a^\ast_n (\mu_n-w_n) + w_n' \qquad \sigma'_n = a^\ast \sigma_n,$ where the optimal transformation $a^\ast_n$ is $a^\ast = \frac{u_n v_n + \sqrt{4 + u_n^2 (4+v_n^2)}}{2(1+u_n^2)}$ and where we have defined the normalized deviations $u_n := (w_n-\mu_n)/\sigma_n$ and $v_n = (w_n'-\mu_n)/\sigma_n$. The algorithm then picks the aforementioned optimal posterior, and the next round starts.

This idea can be generalized. For instance, we can extend this to Gaussian distributions with a full covariance matrix and the spherical Gaussians:

Or, we can make different choices for the belief distributions, flow fields, and local update rules.

## Evaluation

We ran preliminary experiments to validate the Gaussian belief flows model. For this, we trained a neural network on the MNIST handwritten digits database and some of its variants. We've chosen a modest architecture of 784 inputs (for the 28×28 pixel raw images), 200 hidden units, and 10 outputs with aggressive updates. All units had logistic sigmoid activation functions, and error gradients were evaluated on the binary KL-divergence averaged over the outputs (see the details in the paper). The results are as follows:

Online Classification Error in %
Plain Random Images
SGD 11.25 89.14 72.41
DROPOUT 9.84 52.87 50.68
Belief Flows 11.01 37.94 47.71
Test Classification Error in %
Plain Random Images
SGD 7.01 89.17 65.17
DROPOUT 5.52 53.42 46.67
Belief Flows 5.00 29.11 41.55
1) Welling, M. and Teh, Y.-W., “Bayesian Learning via Stochastic Gradient Langevin Dynamics”, in ICML, 2011.
2) More precisely, this is an application of Kullback's principle of minimum discrimination information.