====== And, Or, and the Two KL Projections ======

> I discuss the difference between minimizing the KL-divergence with respect to the first and second argument, and will conclude that they correspond to AND and OR operations on distributions, respectively.

//Cite as: Ortega, P.A. "And, Or, and the Two KL Projections", 2024.//

Oftentimes I see people wondering about the meaning of the two KL-projections.
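
In the notation used below, these are the minimizations of the divergence with respect to its first and with respect to its second argument, i.e., for a given distribution $$q$$ and a candidate $$p$$ ranging over some set of distributions,

$$
\arg\min_p D(p \| q)
\qquad\text{versus}\qquad
\arg\min_p D(q \| p).
$$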

Personally, I find the usual explanation of this difference somewhat unsatisfying.

===== A tale of two coordinate systems =====

Understanding the distinction between the two projections solely by examining
their application on two distributions can be quite challenging. Instead, a
clearer grasp of the difference can be attained through the examination
of mixture distributions. Let's review the two basic kinds of mixtures.

==== Linear mixture ====

Let's say we have $$N$$ distributions $$q_1, q_2, \ldots, q_N$$ over a finite set $$\mathcal{X}$$.
Given a set of positive weights $$w_1, w_2, \ldots, w_N$$ that sum up to one, their
//linear mixture// is

$$
\sum_i w_i\, q_i(x).
$$

The //linear mixture// expresses $$N$$ mutually exclusive hypotheses $$q_i(x)$$ that
could be true with probabilities $$w_i$$. That is, either $$q_1$$ **or** $$q_2$$ **or**
... **or** $$q_N$$ is true, with probability $$w_1, w_2, \ldots, w_N$$ respectively,
expressing a **disjunction** of probability distributions.

==== Exponential mixture ====

Given a set of positive coefficients $$\alpha_1, \alpha_2, \ldots, \alpha_N$$ (not necessarily summing up to one), their //exponential mixture// (a.k.a. geometric mixture) is

$$
\frac{\prod_i q_i(x)^{\alpha_i}}{\sum_{x'} \prod_i q_i(x')^{\alpha_i}}.
$$

It's important to highlight that in order for the exponential mixture to yield a valid probability distribution, the product $$\prod_i q_i(x)^{\alpha_i}$$ has to be renormalized, hence the denominator.

The //exponential mixture// expresses $$N$$ constraints $$q_i(x)$$ that must be true
simultaneously with precisions $$\alpha_i$$. That is, $$q_1$$ **and** $$q_2$$ **and** ...
**and** $$q_N$$ are true, with precisions $$\alpha_1, \alpha_2, \ldots, \alpha_N$$
respectively, expressing a **conjunction** of probability distributions.
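
To make the two operations concrete, here is a small numerical sketch (my own illustration in numpy, with arbitrarily chosen example distributions and weights):

<code python>
import numpy as np

def linear_mixture(qs, ws):
    """Disjunction: sum_i w_i * q_i(x), with weights ws summing to one."""
    qs, ws = np.asarray(qs, dtype=float), np.asarray(ws, dtype=float)
    return ws @ qs

def exponential_mixture(qs, alphas):
    """Conjunction: prod_i q_i(x)^alpha_i, renormalized to sum to one."""
    qs, alphas = np.asarray(qs, dtype=float), np.asarray(alphas, dtype=float)
    unnormalized = np.prod(qs ** alphas[:, None], axis=0)
    return unnormalized / unnormalized.sum()

# Two example distributions over a finite set with four elements.
q1 = np.array([0.5, 0.4, 0.1, 0.0])
q2 = np.array([0.0, 0.1, 0.4, 0.5])

print(linear_mixture([q1, q2], [0.5, 0.5]))       # [0.25 0.25 0.25 0.25]
print(exponential_mixture([q1, q2], [1.0, 1.0]))  # [0.  0.5 0.5 0. ]
</code>

Note how the disjunction keeps everything that is possible under either distribution, while the conjunction keeps only what is possible under both.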

===== Building conjunctions and disjunctions =====

So now that we've seen how to express conjunctions and disjunctions of distributions as
exponential and linear mixtures, let's see how we can use the two KL-projections (or
divergence measures, if you prefer) to build them.

**Disjunctions:** Say we have $$N$$ mutually exclusive hypotheses $$q_1$$ through $$q_N$$,
which are true with probabilities $$w_1$$ through $$w_N$$ respectively. Given a candidate distribution $$p$$,
how divergent is $$p$$ from this knowledge?

The answer is

$$
\sum_i w_i\, D(q_i \| p),
$$

That is, using the KL-divergences where $$p$$ is the second argument. Indeed,
the minimizer is precisely the linear mixture:

$$
\arg\min_p \sum_i w_i D(q_i \| p) = \sum_i w_i q_i(x).\qquad(1)
$$
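
A quick way to verify equation (1) (a standard argument, sketched here): grouping terms, only a cross-entropy depends on $$p$$,

$$
\sum_i w_i D(q_i \| p) = -\sum_x \Big(\sum_i w_i q_i(x)\Big) \log p(x) + \text{const},
$$

and the cross-entropy $$-\sum_x \bar{q}(x) \log p(x)$$ with $$\bar{q}(x) = \sum_i w_i q_i(x)$$ is minimized over $$p$$ exactly at $$p = \bar{q}$$ by Gibbs' inequality.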

**Conjunctions:** Now say we have $$N$$ constraints $$q_1$$ through $$q_N$$ that must hold simultaneously
with precisions $$\alpha_1$$ through $$\alpha_N$$: how far off is $$p$$ from this knowledge?

The answer is

$$
\sum_i \alpha_i\, D(p \| q_i),
$$

That is, using the KL-divergences where $$p$$ is in the first argument. And in fact,
the minimizer is precisely the exponential mixture:

$$
\arg\min_p \sum_i \alpha_i D(p \| q_i) \propto \prod_i q_i(x)^{\alpha_i}.\qquad(2)
$$
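
Equation (2) can be checked with the same kind of argument (again only a sketch; write $$A = \sum_i \alpha_i$$ and $$\tilde{q}(x) \propto \prod_i q_i(x)^{\alpha_i / A}$$):

$$
\sum_i \alpha_i D(p \| q_i)
= A \sum_x p(x) \log \frac{p(x)}{\big(\prod_i q_i(x)^{\alpha_i}\big)^{1/A}}
= A\, D(p \| \tilde{q}) + \text{const},
$$

which is minimized at $$p = \tilde{q}$$. This coincides with the normalization of $$\prod_i q_i(x)^{\alpha_i}$$ whenever the precisions sum to one; for general precisions the minimizer carries the normalized exponents $$\alpha_i / A$$.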

Equations (1) and (2) contain the essence
of my argument. Basically, we have found a relation between the two KL-projections
and the two logical operators **and** and **or**. The two KL-divergences then measure
the divergence of $$p$$ from a conjunction and disjunction, respectively.
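
As an informal numerical check of this relation, the following snippet (my own, with arbitrary example distributions and weights summing to one) compares the objectives of (1) and (2) at the claimed minimizers against a large batch of random candidates:

<code python>
import numpy as np

rng = np.random.default_rng(0)

def kl(a, b):
    """KL-divergence D(a || b) for strictly positive distributions on a finite set."""
    return np.sum(a * np.log(a / b))

# Arbitrary example distributions and weights (the weights sum to one).
q1 = np.array([0.6, 0.3, 0.1])
q2 = np.array([0.2, 0.3, 0.5])
w = np.array([0.3, 0.7])

# Claimed minimizers: linear mixture for (1), exponential mixture for (2).
p_or = w[0] * q1 + w[1] * q2
p_and = q1 ** w[0] * q2 ** w[1]
p_and /= p_and.sum()

def or_objective(p):
    return w[0] * kl(q1, p) + w[1] * kl(q2, p)

def and_objective(p):
    return w[0] * kl(p, q1) + w[1] * kl(p, q2)

# Neither objective should be beaten by random candidate distributions.
candidates = rng.dirichlet(np.ones(3), size=10_000)
assert or_objective(p_or) <= min(or_objective(p) for p in candidates)
assert and_objective(p_and) <= min(and_objective(p) for p in candidates)
print("minimizer checks passed")
</code>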

===== Final thoughts =====

As a final comment, I'd like to point out the connection to the Bayesian framework.

Let's consider hypotheses $$h \in \mathcal{H}$$ with i.i.d. likelihoods $$q(x|h)$$ over
$$\mathcal{X}$$ and priors $$q(h)$$.

**Prediction:** The first step is to understand how we make predictions given our uncertainty
over the hypotheses. The predictor is the disjunction of the likelihoods weighted by the prior:

$$
\arg\min_p \sum_h q(h) D(q(x|h) \| p) = \sum_h q(h) q(x|h).
$$

The resulting estimator $$\sum_h q(h) q(x|h)$$ is the Bayesian estimator.

**Belief update:** The second step is to understand how to update our beliefs after
making an observation. It turns out that an observation is a constraint on the prior
beliefs, i.e. we obtain the posterior through a conjunction between the prior
and the likelihood function. If $$\dot{x}$$ is our observation,
then the conjunction is

$$
q(h)^{\alpha}\, q(\dot{x}|h)^{\beta}.
$$

In this operation we use precisions equal to $$\alpha=\beta=1$$ so that the result
is proportional to $$q(h)q(\dot{x}|h)$$. We can easily verify that the right hand
side is indeed the Bayesian posterior after normalization.
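
To make the two steps concrete, here is a tiny coin example of my own (not from the discussion above): two hypotheses about the probability of heads, with the predictive distribution computed as the prior-weighted disjunction of the likelihoods and the posterior as the conjunction of prior and likelihood with $$\alpha=\beta=1$$.

<code python>
import numpy as np

# Two hypotheses about a coin: h=0 says heads with prob 0.9, h=1 says heads with prob 0.3.
likelihood = np.array([[0.9, 0.1],    # q(x | h=0) over x in {heads, tails}
                       [0.3, 0.7]])   # q(x | h=1)
prior = np.array([0.5, 0.5])          # q(h)

# Prediction (disjunction over hypotheses): sum_h q(h) q(x|h).
predictive = prior @ likelihood
print(predictive)                     # [0.6 0.4]

# Belief update (conjunction of prior and likelihood) after observing heads (x_dot = 0).
x_dot = 0
posterior = prior * likelihood[:, x_dot]  # q(h) q(x_dot|h), with alpha = beta = 1
posterior /= posterior.sum()              # normalize
print(posterior)                          # [0.75 0.25]
</code>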

Notice how building the predictor involves taking the disjunction of distributions
over the observations $$x$$, while computing the posterior amounts to computing
the conjunction of functions over the hypotheses $$h$$. In doing so, we interpret
the likelihood as two different functions: in the disjunction, it is a function
of $$x$$ whereas in the conjunction it is a function of $$h$$.

Thus, it turns out that sequential predictions can be regarded as an alternation
between OR and AND operations, first to express our uncertainty over the hypotheses,
and second to incorporate new evidence, respectively.