Arg-Max Prior
Paper
Ortega, P.A., Grau-Moya, J., Genewein, T., Balduzzi, D. and Braun, D.A.
“A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function.”
Neural Information Processing Systems (NIPS) 2012
[PDF]
Download Code
What is the Arg-Max Prior?
The arg-max prior is a Bayesian model of the maximizing argument (argmax, for short) of a noisy function. For instance, suppose you are given noisy observations of a function's y-values at a number of x-locations. The model then allows you to compute a posterior probability density over the location of the argmax.
Why do we need such a Prior?
The Arg-Max Prior is useful for optimizing a noisy black-box function, a problem that arises in many real-world applications. The brute-force approach is to write down a probabilistic model over all the possible functions implemented by this black box (e.g. a Gaussian process prior), calculate the posterior distribution given the input-output data, and choose the next test location (e.g. using a heuristic such as UCB). This entails a lot of additional modelling effort when, in reality, all we are looking for is a model of the optimum's location.
The Arg-Max Prior addresses this problem by providing a direct probabilistic model over the location of the optimum. The model is inspired by simulated annealing, with a temperature determined by a measure of the effective number of data points observed so far (e.g. the Rényi entropy). Given this probabilistic model, we can choose the next test point simply by sampling from it (i.e. Thompson sampling). This surprisingly simple approach turns out to work very well in practice.
The Model
The model is intuitively straightforward and easy to implement. Let $h: \mathcal{X} \rightarrow \mathcal{R}$ be an estimate of the mean $\bar{f}(x)$ constructed from data $\mathcal{D}_t := \{(x_i, y_i)\}_{i=1}^t$. This estimate can easily be converted into a posterior pdf over the location of the maximum by first multiplying it by a precision parameter $\alpha>0$ and then taking the normalized exponential \[ P(x^\ast|\mathcal{D}_t) \propto \exp\{ \alpha \cdot h(x^\ast) \}. \] In this transformation, the precision parameter $\alpha$ controls the certainty we have over our estimate of the maximizing argument: $\alpha \approx 0$ expresses almost no certainty, while $\alpha \rightarrow \infty$ expresses complete certainty. The rationale for the precision is: the more distinct inputs we test, the higher the precision; testing the same (or similar) inputs only provides local information and therefore should not increase our knowledge about the global maximum. A simple and effective way of implementing this idea is given by \[ P(x^\ast| \mathcal{D}_t) \propto \exp \biggl\{ \rho \underbrace{ \biggl( \xi + t \cdot \frac{ \sum_i K(x_i, x_i) }{ \sum_{i,j} K(x_i, x_j) } \biggr) }_\text{effective number of observations} \underbrace{ \biggl( \frac{ \sum_i K(x_i, x^\ast) y_i + K_0(x^\ast) y_0(x^\ast) } { \sum_i K(x_i, x^\ast) + K_0(x^\ast) } \biggr) }_\text{mean function estimate} \biggr\} \] where $\rho$, $\xi$, $K$, $K_0$ and $y_0$ are parameters of the estimator: $\rho > 0$ is the precision gained for each new distinct observation; $\xi > 0$ is the number of prior points; $K: \mathcal{X} \times \mathcal{X} \rightarrow \mathcal{R}^{+}$ is a symmetric kernel function; $K_0: \mathcal{X} \rightarrow \mathcal{R}^{+}$ is a prior precision function; and $y_0: \mathcal{X} \rightarrow \mathcal{R}$ is a prior estimate of $\bar{f}$.
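As a concrete illustration, here is a minimal sketch of the unnormalized log-density above, assuming a one-dimensional input space and a squared-exponential kernel; the function name `log_posterior_unnorm` and the default parameter values are illustrative choices, not part of the released code.

```python
import numpy as np

def log_posterior_unnorm(x_star, X, y, rho=1.0, xi=1.0,
                         lengthscale=0.1, K0=1.0, y0=0.0):
    """Unnormalized log-density of P(x* | D) at x_star for 1-D inputs.

    X, y : arrays of observed inputs and noisy function values.
    rho, xi, lengthscale, K0, y0 : estimator parameters (illustrative defaults).
    """
    def kernel(a, b):
        # Squared-exponential kernel, one possible choice of K.
        return np.exp(-0.5 * (a - b) ** 2 / lengthscale ** 2)

    t = len(X)
    gram = kernel(X[:, None], X[None, :])             # Gramian (K(x_i, x_j))_{i,j}
    n_eff = xi + t * np.trace(gram) / gram.sum()      # effective number of observations

    k_star = kernel(X, x_star)                        # kernel weights at the query point
    h = (k_star @ y + K0 * y0) / (k_star.sum() + K0)  # kernel-regression estimate of the mean

    return rho * n_eff * h                            # log of the unnormalized posterior
```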
The two components of the model are:
- Mean function. The mean function $\bar{f}$ is estimated with a kernel regressor that combines the function observations with a prior estimate of the function.
- Precision. The precision is given by a scaling factor $\rho$ times an estimate of the effective number of observations (e.g. a Rényi entropy). This is calculated as the sum of the number of prior observations $\xi$ and the number of distinct locations in the data $\mathcal{D}_t$. The latter is estimated by multiplying the number of data points $t$ by the coefficient \[ \frac{ \sum_i K(x_i, x_i) }{ \sum_i \sum_j K(x_i, x_j) } \in (0,1], \] that is, the ratio between the trace of the Gramian matrix $(K(x_i,x_j))_{i,j}$ and the sum of its entries. Inputs that are very close to each other have overlapping kernels, resulting in large off-diagonal entries of the Gramian matrix and hence a smaller number of distinct locations. For example, if we have $t$ observations from $n \ll t$ locations, each location has $t/n$ observations, and the kernel overlap between distinct locations is negligible, then the coefficient equals $n/t$ and hence the number of distinct locations is exactly $n$, as expected (see the numerical check below).
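As a quick numerical check of the $n/t$ example, assuming the same squared-exponential kernel as in the sketch above (all values illustrative):

```python
import numpy as np

# t = 12 observations taken at n = 3 distinct locations (4 repeats each).
locations = np.array([0.1, 0.5, 0.9])
X = np.repeat(locations, 4)
t, n = len(X), len(locations)

lengthscale = 0.01   # narrow kernel, so distinct locations barely overlap
gram = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / lengthscale ** 2)

coeff = np.trace(gram) / gram.sum()   # trace divided by the sum of entries
print(coeff, n / t)                   # both approximately 0.25
print(t * coeff)                      # effective number of distinct locations, about 3
```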
The expression for the posterior can be calculated up to a constant factor in $\mathcal{O}(t^2)$ time, while the computation of the normalizing constant is in general intractable. Since Markov chain Monte Carlo (MCMC) methods only require the density up to a constant, the proposed posterior can easily be combined with MCMC to implement stochastic optimizers. For more information, please refer to the paper.
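For instance, a plain random-walk Metropolis sampler over the unnormalized log-density is already enough to draw approximate samples of $x^\ast$, and hence to pick the next test point by Thompson sampling. The sketch below assumes the `log_posterior_unnorm` helper from the earlier sketch; the Gaussian proposal, step size, and chain length are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def sample_argmax(log_density, x_init=0.5, n_steps=2000, step=0.05, rng=None):
    """Random-walk Metropolis over an unnormalized log-density of x*.

    log_density : callable returning the unnormalized log-posterior at a point,
                  e.g. lambda x: log_posterior_unnorm(x, X, y).
    Returns the full chain (burn-in not removed).
    """
    rng = np.random.default_rng() if rng is None else rng
    x, logp = x_init, log_density(x_init)
    chain = []
    for _ in range(n_steps):
        x_prop = x + step * rng.standard_normal()      # Gaussian random-walk proposal
        logp_prop = log_density(x_prop)
        if np.log(rng.random()) < logp_prop - logp:    # Metropolis accept/reject
            x, logp = x_prop, logp_prop
        chain.append(x)
    return np.array(chain)

# Thompson sampling: the next test location is simply one draw from the chain,
# e.g. x_next = sample_argmax(lambda x: log_posterior_unnorm(x, X, y))[-1]
```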