Belief flows

  - **Prior:** We place a Gaussian distribution $P(w)$ to represent our parameter uncertainty. To simplify our exposition, we assume that the covariance matrix is diagonal, and so $P(w) = \mathcal{N}(w; \mu, \Sigma) = \prod_n \mathcal{N}(w_n; \mu_n, \sigma_n^2)$, where $w_n$, $\mu_n$ are the $n$-th components of the parameter and mean vectors respectively, and $\sigma_n^2$ is the $n$-th diagonal element of the covariance matrix $\Sigma$.
  - **Parameter choice:** The learning algorithm now has to choose model parameters to minimize the prediction error. It does so using Thompson sampling, that is, by sampling a parameter vector $\bar{w}$ from the prior distribution: $\bar{w} \sim P(w)$.
  - **Evaluation of Loss and Local Update:** Once the parameter is chosen, the learning algorithm is given a supervised pair $(x, y)$ that it can use to evaluate the loss $\ell(y, \hat{y})$, where $\hat{y} = F_{\bar{w}}(x)$ is the predicted output. Based on this loss, the learning algorithm can calculate the update of the parameter $\bar{w}$ using SGD: $\bar{w}' = \bar{w} - \eta \nabla_w \ell(y, \hat{y})$, where $\eta > 0$ is the learning rate.
  - **Global Update:** Now, the algorithm has to change its prior beliefs $P(w)$ into posterior beliefs $P'(w)$. To do so, it must infer the SGD update over the whole parameter space based solely on the local observation $\bar{w} \to \bar{w}'$:
    - If we assume a quadratic error function with uncorrelated coordinates, then the class of possible SGD updates becomes the class of linear flow fields in parameter space that transform each component as $w_n' = a_n w_n + b_n$, preserving the Gaussian shape of the resulting posterior. However, there are many such transformations that are consistent with the observed SGD update $\bar{w} \to \bar{w}'$, so which one should the algorithm choose? (A code sketch of one full update step follows below.)
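
The question above is left open at this point; one natural answer, assumed in the following sketch, is to pick per component the flow that minimizes the KL divergence between posterior and prior subject to the constraint $a_n \bar{w}_n + b_n = \bar{w}'_n$. Here is a minimal NumPy sketch of a single update under that assumption, with a hypothetical `grad_loss` helper standing in for the model's loss gradient:

<code python>
import numpy as np

def belief_flow_step(mu, sigma2, x, y, grad_loss, eta=0.1, rng=None):
    """One belief-flows update for a diagonal Gaussian P(w) = N(mu, diag(sigma2)).

    grad_loss(w, x, y) is an assumed helper returning the gradient of the
    loss l(y, y_hat) with respect to the parameters w.
    """
    rng = rng if rng is not None else np.random.default_rng()

    # 1. Thompson sampling: draw a parameter vector from the prior.
    w_bar = mu + np.sqrt(sigma2) * rng.standard_normal(mu.shape)

    # 2. Local SGD update at the sampled point.
    w_bar_new = w_bar - eta * grad_loss(w_bar, x, y)

    # 3. Global update: per component, among all linear flows
    #    w' = a*w + b with a*w_bar + b = w_bar_new, pick the one
    #    minimizing KL(posterior || prior). Setting the derivative of
    #    the Gaussian KL to zero gives the quadratic
    #        a^2 * (1 + u^2/sigma2) - a * (u*v)/sigma2 - 1 = 0,
    #    where u and v are the sampled and updated points centered at mu.
    u = w_bar - mu
    v = w_bar_new - mu
    c2 = 1.0 + u**2 / sigma2
    c1 = -(u * v) / sigma2
    a = (-c1 + np.sqrt(c1**2 + 4.0 * c2)) / (2.0 * c2)  # positive root
    b = w_bar_new - a * w_bar

    # 4. A linear map sends N(mu, sigma2) to N(a*mu + b, a^2 * sigma2),
    #    so the posterior stays Gaussian with diagonal covariance.
    return a * mu + b, a**2 * sigma2
</code>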
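
For instance, with a linear model and squared loss (a hypothetical setup, shown only to illustrate the calling convention), the step can be iterated over a data stream; each update both translates the mean and rescales the per-component variance by $a_n^2$:

<code python>
# Hypothetical usage: linear regression with squared loss l(y, y_hat) = (y_hat - y)^2.
grad_loss = lambda w, x, y: 2.0 * (w @ x - y) * x
mu, sigma2 = np.zeros(3), np.ones(3)
for x, y in [(np.array([1.0, 2.0, 3.0]), 4.0),
             (np.array([0.5, -1.0, 2.0]), 1.0)]:
    mu, sigma2 = belief_flow_step(mu, sigma2, x, y, grad_loss)
</code>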