====== And, Or, and the Two KL Projections ======
> I discuss the difference between minimizing the KL-divergence with respect to the first and second argument, and will conclude that they correspond to AND and OR operations on distributions, respectively.

//Cite as: Ortega, P.A. "And, Or, and the Two KL Projections", 2024.//
Oftentimes I see people wondering about the meaning of the two KL-projections:
Personally, I find this explanation somewhat unsatisfying. These intuitions are abstract, and
their application on two distributions can be quite challenging. Instead, a
clearer grasp of the difference can be attained through the examination
of mixture distributions. Let's take a look.
==== Linear mixture ====
Let's say we have N distributions q1,q2,…,qN over a finite set X.
Given a set of positive weights w1,w2,…,wN that sum up to one, their
//linear mixture// is
$$
p(x) = \sum_{i=1}^N w_i \, q_i(x).
$$
The //linear mixture// expresses N mutually exclusive hypotheses qi(x) that
could be true with probabilities wi. That is, either q1 **or** q2 **or**
... **or** qN is true, with probability w1, w2, ..., wN respectively.
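As a quick illustration, here is a minimal numerical sketch of the linear mixture; the two distributions and the weights below are made up for the example:

<code python>
import numpy as np

# Two made-up distributions over a finite set X = {0, 1, 2}
q1 = np.array([0.7, 0.2, 0.1])
q2 = np.array([0.1, 0.3, 0.6])
w  = np.array([0.4, 0.6])        # positive weights that sum up to one

# Linear ("or") mixture: p(x) = sum_i w_i * q_i(x)
p_or = w[0] * q1 + w[1] * q2
print(p_or, p_or.sum())          # already a valid distribution: sums to one
</code>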
==== Exponential mixture ====
Given a set of positive coefficients α1,α2,…,αN (not necessarily summing up to one), their //exponential mixture// (a.k.a. geometric mixture) is
$$
p(x) = \frac{1}{Z} \prod_{i=1}^N q_i(x)^{\alpha_i},
\qquad\text{where}\qquad
Z = \sum_{x' \in X} \prod_{i=1}^N q_i(x')^{\alpha_i}.
$$
It's important to highlight that in order for the exponential mixture to yield a valid probability distribution, the product must be divided by the normalizing constant Z.
The //exponential mixture// expresses N constraints qi(x) that must be true
simultaneously with precisions αi. That is, q1 **and** q2 **and** ...
**and** qN are true, with precisions α1, α2, ..., αN, respectively.
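Again as a sketch with made-up inputs, the exponential mixture can be computed as follows; note the explicit renormalization:

<code python>
import numpy as np

q1 = np.array([0.7, 0.2, 0.1])
q2 = np.array([0.1, 0.3, 0.6])
alpha = np.array([1.0, 2.0])     # precisions, not necessarily summing up to one

# Exponential ("and") mixture: p(x) proportional to prod_i q_i(x)^alpha_i
unnormalized = q1 ** alpha[0] * q2 ** alpha[1]
p_and = unnormalized / unnormalized.sum()   # divide by Z to obtain a valid distribution
print(p_and)
</code>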
Now, here is the connection to the two KL-projections. The linear mixture can be characterized as the solution of a variational problem:

$$
\arg\min_{p} \sum_{i=1}^N w_i \, D_{KL}(q_i \,\|\, p) \;=\; \sum_{i=1}^N w_i \, q_i,
$$

that is, it minimizes a weighted sum of KL-divergences where p is in the second argument.
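This identity is easy to check numerically. The sketch below reuses the made-up inputs from above and verifies that perturbing the linear mixture increases the objective:

<code python>
import numpy as np
from scipy.special import rel_entr   # elementwise q * log(q / p)

q1 = np.array([0.7, 0.2, 0.1])
q2 = np.array([0.1, 0.3, 0.6])
w  = np.array([0.4, 0.6])

def or_objective(p):
    # sum_i w_i KL(q_i || p): the candidate p sits in the SECOND argument
    return w[0] * rel_entr(q1, p).sum() + w[1] * rel_entr(q2, p).sum()

p_or  = w[0] * q1 + w[1] * q2                  # the claimed minimizer
other = p_or + np.array([0.05, -0.02, -0.03])  # a perturbed, still valid distribution
print(or_objective(p_or) < or_objective(other))   # expected: True
</code>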
Similarly, the exponential mixture is obtained by minimizing the weighted sum

$$
\sum_{i=1}^N \alpha_i \, D_{KL}(p \,\|\, q_i).
$$
That is, using the KL-divergences where p is in the first argument. And in fact,
the minimizer is precisely the exponential mixture, with the precisions rescaled so that they sum up to one:
$$
\arg\min_{p} \sum_{i=1}^N \alpha_i \, D_{KL}(p \,\|\, q_i)
\;=\; \frac{1}{Z'} \prod_{i=1}^N q_i^{\alpha_i / A},
\qquad
A = \sum_{j=1}^N \alpha_j,
$$

where Z' is the corresponding normalizing constant.
These two identities are the crux of my argument. Basically, we have found a relation between the two KL-projections
and the two logical operators **and** and **or**. The two KL-divergences then measure
how far a candidate distribution is from realizing each of these two operations.
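The second identity can be checked the same way; the sketch below (again with made-up inputs) builds the exponential mixture with rescaled precisions and verifies that perturbing it increases the objective:

<code python>
import numpy as np
from scipy.special import rel_entr

q1 = np.array([0.7, 0.2, 0.1])
q2 = np.array([0.1, 0.3, 0.6])
alpha = np.array([1.0, 2.0])

def and_objective(p):
    # sum_i alpha_i KL(p || q_i): the candidate p sits in the FIRST argument
    return alpha[0] * rel_entr(p, q1).sum() + alpha[1] * rel_entr(p, q2).sum()

# Exponential mixture with the precisions rescaled to sum up to one
g = q1 ** (alpha[0] / alpha.sum()) * q2 ** (alpha[1] / alpha.sum())
p_and = g / g.sum()

other = p_and + np.array([0.05, -0.02, -0.03])
print(and_objective(p_and) < and_objective(other))   # expected: True
</code>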
Thus, it turns out that sequential predictions can be regarded as an alternation
between OR and AND operations, first to express our uncertainty over the hypotheses,
and second to incorporate new evidence, respectively.
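To make this concrete, here is a minimal sketch of the alternation for a made-up coin-flip example with two bias hypotheses: the OR step forms the predictive (linear) mixture, and the AND step multiplies in the likelihood of the new observation and renormalizes:

<code python>
import numpy as np

theta = np.array([0.3, 0.8])   # made-up P(heads) under each hypothesis
w = np.array([0.5, 0.5])       # prior weights over the hypotheses

for obs in [1, 1, 0, 1]:       # observed flips (1 = heads, 0 = tails)
    # OR step: the predictive distribution is the linear mixture of the hypotheses
    p_heads = float(np.sum(w * theta))
    print(f"predict P(heads) = {p_heads:.3f}, observe {obs}")
    # AND step: incorporate the evidence by multiplying in the likelihood
    # of the observation and renormalizing (Bayes' rule)
    likelihood = theta if obs == 1 else 1.0 - theta
    w = w * likelihood / np.sum(w * likelihood)
</code>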