People have asked me to write a tutorial about the Bayesian control rule. If the number of requests exceeds a critical mass, then I will take the time to do it. So, if you're interested, drop me an email! For now, I'm just expanding this page slowly whenever I find myself in a coffee shop with some spare minutes.
The Bayesian control rule is an extension to Bayes' Rule that is obtained by combining probability theory with causal interventions. Simply stated, it says \[ P(\theta|\hat{A},O) = \frac{ P(\theta) P(\hat{A}, O|\theta) }{ P(\hat{A}, O) }, \] where the “hat”-notation $\hat{A}$ denotes a causal intervention rather than a condition. At first glance, it seems to be just Bayes' rule. But there is a subtle difference: the Bayesian control rule makes a distinction between seeing (= observing = conditioning) and doing (= acting = manipulating). Or, in other words, it allows us to model learning when we have the ability to control our world - hence the name.
…and why do we even need it?
The quintessential example is the barometer. Imagine you have a barometer in your house. The atmospheric pressure changes the height of the (say) mercury: if it rises, we expect good weather; and if it drops rapidly, we expect rain. A simple Bayesian model captures this relation: \[ P(W|B) = \frac{ P(B|W) P(W) }{ P(B) }, \] where $P(W)$ is the prior probability of the weather (say, good or bad) and $P(B|W)$ is the likelihood of the barometer change given the weather - more likely to be high when good weather approaches and more likely to be low when bad weather approaches.
Choosing reasonable values for the prior probabilities and the likelihoods, we can calculate the posterior probability of good weather given a rapid rise of the mercury level: \[ P(W=good|B=high) = \frac{ P(B=high|W=good) P(W=good) }{ P(B=high) } \] where $P(B=high)$ is just the normalizing constant \[ P(B=high) = P(B=high|W=bad) P(W=bad) + P(B=high|W=good) P(W=good). \]
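To make the calculation concrete, here is a minimal Python sketch of the observational posterior. The function name and the numbers (a 50/50 prior, likelihoods of 0.8 and 0.2) are illustrative assumptions, not values taken from this page.

```python
def posterior_good_given_high(prior_good, lik_high_given_good, lik_high_given_bad):
    """P(W=good | B=high) by Bayes' rule, where B=high is *observed*."""
    prior_bad = 1.0 - prior_good
    # Normalizing constant P(B=high): sum over both weather states.
    evidence = lik_high_given_good * prior_good + lik_high_given_bad * prior_bad
    return lik_high_given_good * prior_good / evidence


# Illustrative numbers only (assumed for this example).
p_good = posterior_good_given_high(
    prior_good=0.5, lik_high_given_good=0.8, lik_high_given_bad=0.2
)
print(p_good)  # approximately 0.8
```

With these numbers, seeing the mercury rise moves the belief in good weather from 0.5 to about 0.8.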
Now, imagine you decide to change the level of the mercury yourself, say with (using a bit of imagination) a pressurizing device. Now, you set the value of the random variable - and intuition tells us that we cannot predict the weather anymore. Apparently, our previous Bayesian model is useless now. Intervening on the barometer changes the joint probability distribution $P(B,W) = P(B|W) P(W)$ to \[ P(\hat{B},W) = \delta(B) P(W), \] that is, $P(B|W)$ has been replaced by $\delta(B)$ (i.e. the Kronecker delta function that evaluates to one whenever the value of $B$ is the value we have chosen, and to zero otherwise). The hat-notation is just a shorthand referring to this particular transformation of the probability distribution. Hence, the posterior, assuming we set $B \leftarrow high$, is \[ P(W=good|\hat{B}=high) = \frac{ 1 \cdot P(W=good) }{ 1 \cdot P(W=bad) + 1 \cdot P(W=good) } = P(W=good). \] In other words, we don't gain knowledge about the weather - as expected.
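The contrast between seeing and doing can be checked numerically. The sketch below builds both joint distributions for the barometer and computes the weather posterior from each; the dictionary layout, the function names and the numbers are assumptions made for illustration.

```python
def joint_observed(prior, lik):
    """Joint P(B, W) = P(B|W) P(W), keyed by (b, w)."""
    return {(b, w): lik[w][b] * prior[w] for w in prior for b in lik[w]}


def joint_intervened(prior, lik, b_set):
    """Joint P(B-hat, W) = delta(B = b_set) P(W): P(B|W) is replaced by a delta."""
    return {
        (b, w): (1.0 if b == b_set else 0.0) * prior[w]
        for w in prior
        for b in lik[w]
    }


def posterior_weather(joint, b):
    """Condition a joint over (B, W) on B = b and normalize over W."""
    norm = sum(p for (bb, _), p in joint.items() if bb == b)
    return {w: p / norm for (bb, w), p in joint.items() if bb == b}


# Illustrative numbers only (assumed for this example).
prior = {"good": 0.3, "bad": 0.7}
lik = {
    "good": {"high": 0.8, "low": 0.2},
    "bad": {"high": 0.2, "low": 0.8},
}

seen = posterior_weather(joint_observed(prior, lik), "high")
done = posterior_weather(joint_intervened(prior, lik, "high"), "high")
print(seen["good"])  # belief moves away from the prior
print(done["good"])  # belief stays at the prior
```

Observing a high reading raises the belief in good weather above the prior of 0.3, whereas setting the reading by hand leaves it at the prior.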
The reason for this special treatment of actions is that when we set the value of a random variable, we change Nature's probability law. This has important consequences for the philosophy behind Bayesian modelling. The Bayesian interpretation of probability theory interprets probabilities as degrees of belief, but it is incomplete if not enriched with causal interventions: it only models belief updates after observations, not actions. This is rather striking, considering that understanding how to update belief after both actions and observations is one of the central themes in artificial intelligence!
If you have convinced yourself now that interventions are important, then you can find more material here:
The most important application of the Bayesian control rule is adaptive control. That is, the rule says that \[ P(\theta|\hat{A},O) = \frac{ P(\theta) P(\hat{A},O|\theta) }{ P(\hat{A}, O) }, \] i.e. the posterior of the hypothesis $\theta$ is given by multiplying the prior $P(\theta)$ with a likelihood $P(\hat{A}, O|\theta)$ obtained from $P(A, O|\theta)$ by intervening on the random variable $A$, and then normalizing. Tagging actions as causal interventions makes sure that we do not learn the hypothesis $\theta$ from our actions, but from the effects of our actions.
Consider as an example a multi-armed bandit. This is a slot machine with many levers. Each time we pull a lever, the machine produces a reward (say, a coin) with a probability specific to the lever. Our goal is to produce a sequence of actions $A_1, A_2, \ldots$ in order to maximize the cumulative sum of the rewards $R_1, R_2, \ldots$. In this case, if we knew the biases of the levers, then the optimal strategy would be to always pull the lever with the highest bias.
This knowledge of the optimal strategy given the characterization of the multi-armed bandit (which in this case corresponds to a vector of biases) allows us to use the Bayesian control rule. Let $\theta = [\theta_1, \theta_2, \ldots, \theta_N]$ be the vector of biases characterizing an $N$-armed bandit with Bernoulli-distributed rewards. If we know this vector, then we can completely characterize the optimal interaction between a player and the multi-armed bandit. This interaction is a stochastic process $A_1, R_1, A_2, R_2, \ldots$ of actions $A_t \in \{1, 2, \ldots, N\}$ followed by rewards $R_t \in \{0, 1\}$ (the rewards play the role of the observations $O_t$). The causal model of this process is chronological (i.e. each random variable depends only on the previous ones), hence \[ P(A_1, R_1, A_2, R_2, \ldots, A_T, R_T|\theta) = P(A_1|\theta) \times P(R_1|\theta, A_1) \times P(A_2|\theta, A_1, R_1) \times \cdots \times P(R_T|\theta, A_{1:T}, R_{1:T-1}). \]
The optimal action $A_t$ depends functionally only on the vector of biases $\theta$: \[ P(A_t|\theta, A_{1:t-1}, R_{1:t-1}) = P(A_t|\theta) = \delta[A_t = a^\ast(\theta)], \] where $a^\ast(\theta)$ is the index of the lever with the highest bias in $\theta$, and where $\delta$ is the Kronecker delta function. Also, the reward $R_t$ depends functionally only on the bias vector $\theta$ and the current action $A_t$: \[ P(R_t = 1|\theta, A_{1:t}, R_{1:t-1}) = P(R_t = 1|\theta, A_t) = \theta_{A_t}, \] where $\theta_{A_t}$ denotes the bias component corresponding to lever $A_t$ (so $R_t = 0$ has probability $1 - \theta_{A_t}$). As you can see, we have essentially constructed a likelihood model describing the optimal interaction given that $\theta$ is the true bias vector of the multi-armed bandit.
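The likelihood model can be sketched in a few lines of Python. The function names, the 0-based lever indices and the example numbers are assumptions for illustration. The `intervened` flag replaces each action factor by one, mirroring the replacement of $P(B|W)$ by $\delta(B)$ in the barometer example.

```python
def optimal_arm(theta):
    """a*(theta): index (0-based here) of the lever with the highest bias."""
    return max(range(len(theta)), key=lambda n: theta[n])


def action_prob(a, theta):
    """P(A_t = a | theta): a Kronecker delta on the optimal lever."""
    return 1.0 if a == optimal_arm(theta) else 0.0


def reward_prob(r, a, theta):
    """P(R_t = r | theta, A_t = a): Bernoulli with bias theta[a]."""
    return theta[a] if r == 1 else 1.0 - theta[a]


def likelihood(history, theta, intervened=False):
    """Likelihood of a list of (action, reward) pairs under hypothesis theta.

    With intervened=True the action factors are dropped (set to one),
    so only the rewards carry information about theta.
    """
    p = 1.0
    for a, r in history:
        if not intervened:
            p *= action_prob(a, theta)
        p *= reward_prob(r, a, theta)
    return p


# Illustrative hypothesis and history (assumed for this example).
theta = [0.2, 0.7]
history = [(0, 1), (1, 0)]  # pulled lever 0 and won, pulled lever 1 and lost

as_observed = likelihood(history, theta)                      # 0.0
as_intervened = likelihood(history, theta, intervened=True)   # about 0.06
print(as_observed, as_intervened)
```

If the actions were treated as observations, a single pull of a lever that is suboptimal under $\theta$ would rule out that hypothesis entirely; treated as interventions, the same history only weighs $\theta$ by how well it explains the rewards.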
We now need a prior over $\theta$. This is easy: since each lever behaves like a Bernoulli distribution, we can place a Beta prior on its bias - because it is conjugate to the likelihood. Since we need one Beta per lever, the resulting prior is just a product of $N$ Beta distributions: \[ P(\theta) = \prod_{n=1}^N Beta(\theta_n; \alpha_n, \beta_n). \]
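As a worked step (a standard conjugacy calculation, with counts $s_n$ and $f_n$ introduced here for illustration): when the actions are treated as interventions, each action factor in the likelihood equals one, so only the rewards update the prior. Let $s_n$ and $f_n$ be the number of times lever $n$ has produced reward $1$ and reward $0$, respectively, up to time $t$. Then the posterior is again a product of Beta distributions: \[ P(\theta|\hat{A}_{1:t}, R_{1:t}) \propto \prod_{n=1}^N Beta(\theta_n; \alpha_n, \beta_n) \, \theta_n^{s_n} (1-\theta_n)^{f_n} \propto \prod_{n=1}^N Beta(\theta_n; \alpha_n + s_n, \beta_n + f_n). \]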
This completes the model. Now, how do we use it?
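In Bayesian statistics, there are two important distributions: the posterior distribution and the predictive distribution. The latter tells us how to predict the next observations given past data. But in the control case, it also tells us how to choose actions: \[ P(A_{t+1}|\hat{A}_{1:t}, O_{1:t}) = \int P(A_{t+1}|\theta, \hat{A}_{1:t}, O_{1:t}) P(\theta|\hat{A}_{1:t}, O_{1:t}) \, d\theta, \] that is, actions are drawn stochastically from this predictive distribution! In the bandit example, the observations $O_t$ are the rewards $R_t$, and because the action probability given $\theta$ is the delta $\delta[A_{t+1} = a^\ast(\theta)]$, drawing an action amounts to sampling a bias vector $\theta$ from the posterior and pulling the lever $a^\ast(\theta)$ that is optimal for that sample.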
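For the Bernoulli bandit with Beta priors, drawing an action from this predictive distribution can be sketched as below. This is a minimal illustration under stated assumptions, not a reference implementation: the uniform Beta(1, 1) prior, the function names, the 0-based lever indices and the simulated biases are all chosen for the example. The resulting procedure has the form of Thompson sampling.

```python
import random


def optimal_arm(theta):
    """a*(theta): index (0-based here) of the lever with the highest bias."""
    return max(range(len(theta)), key=lambda n: theta[n])


def run_bandit(true_theta, steps, seed=0):
    """Simulate acting on a Bernoulli bandit by sampling from the posterior."""
    rng = random.Random(seed)
    n_arms = len(true_theta)
    alpha = [1.0] * n_arms  # assumed prior: Beta(1, 1) for every lever
    beta = [1.0] * n_arms
    pulls = [0] * n_arms
    total_reward = 0
    for _ in range(steps):
        # Sample a hypothesis theta from the current posterior ...
        theta = [rng.betavariate(alpha[n], beta[n]) for n in range(n_arms)]
        # ... and play the action that is optimal under that hypothesis.
        a = optimal_arm(theta)
        # The (simulated) environment answers with a Bernoulli reward.
        r = 1 if rng.random() < true_theta[a] else 0
        # Actions are interventions: only the reward updates the posterior.
        alpha[a] += r
        beta[a] += 1 - r
        pulls[a] += 1
        total_reward += r
    return pulls, total_reward, alpha, beta


# Illustrative simulation with made-up biases.
true_theta = [0.2, 0.5, 0.8]
pulls, total_reward, alpha, beta = run_bandit(true_theta, steps=2000)
print(pulls, total_reward)
```

Early on the sampled hypotheses vary widely, so all levers get tried; as the posterior concentrates, the lever with the highest bias is pulled most of the time.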
There are several remarks that are needed at this point:
[under construction]
Ortega, P.A. and Braun, D.A. A minimum relative entropy principle for learning and acting. Journal of Artificial Intelligence Research 38, pp. 475-511, 2010.
Ortega, P.A. and Braun, D.A. Adaptive Coding of Actions and Observations. NIPS Workshop on Information in Perception and Action, 2012.
Ortega, P.A. and Braun, D.A. Generalized Thompson Sampling for Sequential Decision-Making and Causal Induction. arXiv:1303.4431, 2013.
Ortega, P.A. Bayesian Control Rule - Talk Slides.