Pages
Topics
Other Sites
Institutions
Univ. of Chile
Univ. of Cambridge
MPI for Intelligent Systems
MPI for Biological Cybernetics
The Hebrew University of Jerusalem
University of Pennsylvania
Ortega, P.A., Grau-Moya, J., Genewein, T., Balduzzi, D. and Braun, D.A.
“A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function.”
Neural Information Processing Systems (NIPS) 2012
[PDF]
The arg-max prior is a Bayesian model of the maximizing argument (argmax in short) of a noisy function. For instance, consider the noisy function depicted in the next figure, and assume you are given a set of observations of the y-value for a number of x-locations. Then, our model allows computing the posterior probability density over the argmax.
The model is intuitively straightforward and easy to implement. Let $h(x): \mathcal{X} \rightarrow \mathcal{R}$ be an estimate of the mean $\bar{f}(x)$ constructed from data $\mathcal{D}_t := \{(x_i, y_i)\}_{i=1}^t$. This estimate can easily be converted into a posterior pdf over the location of the maximum by first multiplying it with a precision parameter $\alpha>0$ and then taking the normalized exponential \[ P(x^\ast|\mathcal{D}_t) \propto \exp\{ \alpha \cdot h(x^\ast) \}. \] In this transformation, the precision parameter $\alpha$ controls the certainty we have over our estimate of the maximizing argument: $\alpha \approx 0$ expresses almost no certainty, while $\alpha \rightarrow \infty$ expresses certainty. The rationale for the precision is: the more distinct inputs we test, the higher the precision—testing the same (or similar) inputs only provides local information and therefore should not increase our knowledge about the global maximum. A simple and effective way of implementing this idea is given by \[ P(x^\ast| \mathcal{D}) \propto \exp \biggl\{ \rho \underbrace{ \biggl( \xi + t \cdot \frac{ \sum_i K(x_i, x_i) }{ \sum_{i,j} K(x_i, x_j) } \biggr) }_\text{effective number of observations} \underbrace{ \biggl( \frac{ \sum_i K(x_i, x^\ast) y_i + K_0(x^\ast) y_0(x^\ast) } { \sum_i K(x_i, x^\ast) + K_0(x^\ast) } \biggr) }_\text{mean function estimate} \biggr\} \] where $\rho$, $\xi$, $K$, $K_0$ and $y_0$ are parameters of the estimator: $\rho > 0$ is the precision we gain for each new distinct observation; $\xi > 0$ is the number of prior points; $K: \mathcal{X} \times \mathcal{X} \rightarrow \mathcal{R}^{+}$ is a symmetric kernel function; $K_0: \mathcal{X} \rightarrow \mathcal{R}^{+}$ is a prior precision function; and $y_0: \mathcal{X} \rightarrow \mathcal{R}$ is a prior estimate of $\bar{f}$.
The two components of the model are:
The expression for the posterior can be calculated up to a constant factor in $\mathcal{O}(t^2)$ time. The computation of the normalizing constant is in general intractable. Therefore, our proposed posterior can be easily combined with Markov chain Monte Carlo methods (MCMC) to implement stochastic optimizers. For more information, please refer to the paper.