A Comparison of Probabilistic and Statistical Views of Deep Learning Models

8 minute read

Published:

Background

We use lower case $x$ to denote an observed sample, and upper case $X$ to denote the associated random variable. We denote by $X, Y, Z$ the observation, the label, and the latent, respectively. In most cases, only the variable $X$ can be fully observed, the label $Y$ might be partially observed, and the latent $Z$ is unobservable.

Evidence Lower Bound (ELBO)

Given an observed variable $X$ and a latent variable $Z$, the joint distribution $p(X, Z)$ can be parameterized as

\[p(X, Z) = p (Z) p (X|Z) \approx \theta (Z) \theta (X|Z), \text{ where } \theta (Z) = \mathcal{N} (Z; 0, I), \; \theta (X|Z) = \mathcal{N} (X; \mu_{\theta}, \Sigma_{\theta})\]

or

\[p(X, Z) = p (X) p (Z \vert X) \approx \phi (X) \phi (Z \vert X), \text{ where } \phi (Z \vert X) = \mathcal{N} (Z; \mu_{\phi}, \Sigma_{\phi})\]

Now we have three probabilistic worlds:

  • Real world: $p(X)$;
  • Observed world: $\hat{p} (X)$;
  • Approximated world: $\phi (X), \theta(Z), \theta(X\vert Z), \phi(Z\vert X)$.

The real world is never observable, and the approximated world is built from our prior and posterior knowledge. The goal is therefore to bring our knowledge as close to the ground truth as possible, i.e., to minimize the discrepancy between the approximation $\phi(x)$ and the observed $\hat{p} (x)$, which serves as a proxy for $p(x)$. In cases where the evidence $\log p (x)$ cannot be computed directly (e.g., for a mixture of distributions), we instead express it as a marginal over the latent:

\[\log p(x) = \log \int p (x, z) \text{d} z = \log \int \theta (x \vert z) \theta (z) \text{d} z\]

If the integral over $z$ is tractable, the EM algorithm can be applied. Otherwise, we optimize a lower bound, called the evidence lower bound. For any observed sample $x$, Jensen's inequality gives the lower bound

\[\log p (x) \ge \mathbb{E}_{Z \sim \phi (\cdot \vert x)} \bigg[\log \frac{p (x, Z)}{\phi (Z \vert x)}\bigg] = ELBO (x)\]

In fact, $\log p (x) = ELBO (x) + \mathcal{D}_{KL} (\phi (\cdot\vert x) \vert \vert p(\cdot\vert x))$, which means we pay a price whenever the true posterior $p(\cdot\vert x)$ is replaced by an approximation. Furthermore, the ELBO can be written as the difference between a reconstruction error and a prior matching error:

\[ELBO (x) = \underbrace{\mathbb{E}_{Z \sim \phi (\cdot \vert x)} [\log \theta (x \vert Z)]}_{\text{reconstruction error}} - \underbrace{\mathcal{D}_{KL} (\phi (\cdot \vert x) \vert\vert \theta (\cdot))}_{\text{prior matching error}}\]

On the other hand,

\[ELBO (x) = \underbrace{\log p(x)}_{\text{likelihood}} - \underbrace{\mathcal{D}_{KL} (\phi (\cdot \vert x) \vert\vert p (\cdot \vert x))}_{\text{posterior matching error}}\]

We see that, by maximizing the ELBO, we simultaneously minimize the KL-divergence between the pairs $(\phi(\cdot\vert x), \theta(\cdot))$ and $(\phi (\cdot\vert x), p(\cdot\vert x))$. The former ensures that $\phi$ is indeed learning some distribution, and the latter ensures that $\phi$ approaches the true posterior. In practice, since both $\phi (\cdot \vert x)$ and $\theta (\cdot)$ are assumed to be Gaussian, the KL divergence between them can be computed in closed form. The reconstruction error is estimated via Monte Carlo sampling:

\[\mathbb{E}_{Z \sim \phi (\cdot \vert x)} [\log \theta (x \vert Z)] \approx \frac{1}{N} \sum_{i = 1}^N \log \theta (x \vert z_i).\]
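To make this concrete, here is a minimal sketch (in NumPy) of how the two ELBO terms are typically evaluated: the prior matching error in closed form for a diagonal Gaussian, and the reconstruction error by Monte Carlo over reparameterized samples. The names `mu_phi`, `logvar_phi`, and the callable `log_theta_x_given_z` are illustrative placeholders, not from the post.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    # summed over latent dimensions.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def elbo_estimate(x, mu_phi, logvar_phi, log_theta_x_given_z, n_samples=16):
    # Monte Carlo estimate of E_{Z ~ phi(.|x)}[log theta(x|Z)] using
    # the reparameterization z = mu + sigma * eps, eps ~ N(0, I).
    std = np.exp(0.5 * logvar_phi)
    recon = 0.0
    for _ in range(n_samples):
        z = mu_phi + std * np.random.randn(*mu_phi.shape)
        recon += log_theta_x_given_z(x, z)
    recon /= n_samples
    # ELBO = reconstruction error - prior matching error.
    return recon - gaussian_kl_to_standard_normal(mu_phi, logvar_phi)
```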

Density Estimation

With observation $X$, we minimize the KL-divergence between $\hat{p} (X)$ and $\phi(X)$:

\[\min \mathcal{D}_{KL} (\hat{p} \vert\vert \phi) = \max \mathbb{E}_{X \sim \hat{p}} [\log \phi(X)]\]
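For a simple parametric family this maximization can even be solved in closed form; a small NumPy sanity check (the toy data below is purely illustrative):

```python
import numpy as np

# Minimize KL(p_hat || phi) over phi = N(mu, sigma^2), i.e., maximize
# the average log-likelihood; for a Gaussian family the maximizer is
# the empirical mean and standard deviation.
x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=1000)
mu_hat, sigma_hat = x.mean(), x.std()
avg_loglik = (-0.5 * np.log(2 * np.pi * sigma_hat**2)
              - (x - mu_hat)**2 / (2 * sigma_hat**2)).mean()
print(mu_hat, sigma_hat, avg_loglik)
```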

Classification/Regression

With observation $X, Y$, we minimize the KL-divergence between $\hat{p} (Y\vert X)$ and $\phi(Y\vert X)$:

\[\min \mathcal{D}_{KL} (\hat{p} \vert\vert \phi) = \max \mathbb{E}_{(X, Y) \sim \hat{p}} [\log \phi(Y \vert X)]\]
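When $\phi(Y \vert X)$ is a categorical distribution given by softmax logits, maximizing $\mathbb{E}[\log \phi(Y \vert X)]$ is exactly minimizing the familiar cross-entropy loss; a minimal NumPy sketch:

```python
import numpy as np

def cross_entropy(logits, labels):
    # Negative average log-likelihood -E_{(X,Y) ~ p_hat}[log phi(Y|X)]
    # for a categorical phi(Y|X) = softmax(logits).
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```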

Variational Autoencoder (VAE)

With observation $X$ associated with latent variable $Z$, we minimize the KL-divergence between $\hat{p} (X)$ and $\phi(X)$:

\[\begin{align*} & \min \mathcal{D}_{KL} (\hat{p} \vert\vert \phi) \\ = & \max \mathbb{E}_{X \sim \hat{p}} [\log \phi(X)] \\ \ge & \max \mathbb{E}_{X \sim \hat{p}} [ELBO (X)] \\ = & \max \mathbb{E}_{X \sim \hat{p}} \big[\mathbb{E}_{Z \sim \phi (\cdot \vert X)} [\log \theta (X \vert Z)] - \mathcal{D}_{KL} (\phi (\cdot \vert X) \vert\vert \theta (\cdot))\big] \end{align*}\]
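A minimal sketch of how this objective is wired into a trainable model, assuming PyTorch, a Gaussian encoder $\phi(Z \vert X)$, and a Bernoulli decoder $\theta(X \vert Z)$; the architecture and sizes are illustrative assumptions, not from the post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, d_x=784, d_z=16, d_h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x, d_h), nn.ReLU())
        self.mu = nn.Linear(d_h, d_z)
        self.logvar = nn.Linear(d_h, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.ReLU(),
                                 nn.Linear(d_h, d_x))

    def loss(self, x):  # x assumed to lie in [0, 1]
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        recon = -F.binary_cross_entropy_with_logits(
            self.dec(z), x, reduction="none").sum(-1)            # log theta(x|z)
        kl = 0.5 * (logvar.exp() + mu**2 - 1 - logvar).sum(-1)   # KL(phi||theta)
        return -(recon - kl).mean()                              # negative ELBO
```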

Markovian Hierarchical VAE (MHVAE)

With observation $X$ associated with latent variables $Z_{1:T}$, we minimize the KL-divergence between $\hat{p} (X)$ and $\phi(X)$:

\[\begin{align*} & \min \mathcal{D}_{KL} (\hat{p} \vert\vert \phi) \\ = & \max \mathbb{E}_{X \sim \hat{p}} [\log \phi(X)] \\ \ge & \max \mathbb{E}_{X \sim \hat{p}} [ELBO (X)] \\ = & \max \mathbb{E}_{X \sim \hat{p}} \bigg[\mathbb{E}_{Z_{1:T} \sim \phi (\cdot \vert X)} \bigg[\log \frac{p (X, Z_{1:T})}{\phi (Z_{1:T} \vert X)}\bigg]\bigg] \\ = & \max \mathbb{E}_{X \sim \hat{p}} \bigg[\mathbb{E}_{Z_{1:T} \sim \phi (\cdot \vert X)} \bigg[\log \frac{\theta (Z_T) \theta (X \vert Z_1) \prod_{t = 2}^T \theta (Z_{t-1} \vert Z_t)}{\phi (Z_1 \vert X) \prod_{t = 2}^T \phi (Z_t \vert Z_{t-1})}\bigg]\bigg] \end{align*}\]
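Since every factor above is just a log-density, a one-sample Monte Carlo estimate of this ELBO can be assembled generically. The callable interfaces below are hypothetical placeholders standing in for encoder/decoder networks:

```python
def mhvae_elbo_one_sample(x, sample_q, log_q, log_prior, log_dec, log_trans, T):
    # Hypothetical interfaces:
    #   sample_q(t, cond): z_1 ~ phi(.|x) if t == 1, else z_t ~ phi(.|z_{t-1})
    #   log_q(t, z, cond): log phi(z_t | cond)
    #   log_prior(z):      log theta(z_T)
    #   log_dec(x, z):     log theta(x | z_1)
    #   log_trans(zp, z):  log theta(z_{t-1} | z_t)
    z = [sample_q(1, x)]
    logq = log_q(1, z[0], x)
    for t in range(2, T + 1):
        z.append(sample_q(t, z[-1]))
        logq += log_q(t, z[-1], z[-2])
    logp = log_prior(z[-1]) + log_dec(x, z[0])
    logp += sum(log_trans(z[t - 2], z[t - 1]) for t in range(2, T + 1))
    return logp - logq  # log p(x, z_{1:T}) - log phi(z_{1:T} | x)
```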

Variational Diffusion Model (VDM)

With observation $X_0$ associated with latent variables $X_{1:T}$, we additionally know (pre-define) the forward encoding distribution $p (X_t \vert X_{t-1})$. We minimize the KL-divergence between $\hat{p} (X_0)$ and $\phi(X_0)$:

\[\begin{align*} & \min \mathcal{D}_{KL} (\hat{p} \vert\vert \phi) \\ = & \max \mathbb{E}_{X_0 \sim \hat{p}} [\log \phi(X_0)] \\ \ge & \max \mathbb{E}_{X_0 \sim \hat{p}} [ELBO (X_0)] \\ = & \max \mathbb{E}_{X_0 \sim \hat{p}} \bigg[\mathbb{E}_{X_{1:T} \sim \phi (\cdot \vert X_0)} \bigg[\log \frac{p (X_{0:T})}{\phi (X_{1:T} \vert X_0)}\bigg]\bigg] \\ = & \max \mathbb{E}_{X_0 \sim \hat{p}} \bigg[\mathbb{E}_{X_{1:T} \sim \phi (\cdot \vert X_0)} \bigg[\log \frac{\theta (X_T) \prod_{t = 1}^T \theta (X_{t-1} \vert X_t)}{\prod_{t = 1}^T p (X_t \vert X_{t-1})}\bigg]\bigg] \end{align*}\]

The ELBO of VDM has two different decompositions. The first one is

\[\begin{align*} ELBO (x_0) = & \underbrace{\mathbb{E}_{X_1 \sim p (\cdot \vert x_0)} [\log \theta (x_0 \vert X_1)]}_{\text{reconstruction error}} - \underbrace{\mathbb{E}_{X_{T-1} \sim p (\cdot \vert x_0)} [\mathcal{D}_{KL} (p (\cdot \vert X_{T-1}) \vert\vert \theta)]}_{\text{prior matching error of } X_T} \\ & - \sum_{t = 1}^{T-1} \underbrace{\mathbb{E}_{(X_{t-1}, X_{t+1}) \sim p (\cdot \vert x_0)} [\mathcal{D}_{KL} (p (\cdot \vert X_{t-1}) \vert\vert \theta (\cdot \vert X_{t+1}))]}_{\text{consistency error of } X_t} \end{align*}\]

The second one is

\[\begin{align*} ELBO (x_0) = & \underbrace{\mathbb{E}_{X_1 \sim p (\cdot \vert x_0)} [\log \theta (x_0 \vert X_1)]}_{\text{reconstruction error}} - \underbrace{\mathcal{D}_{KL} (p (\cdot \vert x_0) \vert\vert \theta)}_{\text{prior matching error of } X_T} \\ & - \sum_{t = 2}^{T} \underbrace{\mathbb{E}_{X_t \sim p (\cdot \vert x_0)} [\mathcal{D}_{KL} (p (\cdot \vert X_{t}, x_0) \vert\vert \theta (\cdot \vert X_{t}))]}_{\text{denoising matching error of } X_{t-1}} \end{align*}\]

Understanding Denoising Matching Error

The denoising matching error of $X_{t-1}$ can be interpreted using three different formulations:

\[\begin{align*} \mathcal{D}_{KL} (p (\cdot \vert X_{t}, x_0) \vert\vert \theta (\cdot \vert X_{t})) = \begin{cases} C_1 \cdot \vert\vert\hat{x}_{\theta} (X_t, t) - x_0\vert\vert_2^2, & \text{matching the input}; \\ C_2 \cdot \vert\vert\hat{\varepsilon}_{\theta} (X_t, t) - \varepsilon_0\vert\vert_2^2, & \text{matching the noise}; \\ C_3 \cdot \vert\vert\hat{s}_{\theta} (X_t, t) - \nabla \log p (X_t)\vert\vert_2^2, & \text{matching the score}. \end{cases} \end{align*}\]
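The "matching the noise" form is the objective most implementations train on (the DDPM loss with the constant $C_2$ dropped). A hedged PyTorch sketch, where the network `eps_model` and the cumulative schedule `alpha_bar` ($\bar{\alpha}_t$) are assumptions standing in for concrete choices:

```python
import torch

def noise_matching_loss(eps_model, x0, alpha_bar):
    # Sample a timestep, corrupt x0 to x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps,
    # and regress the injected noise (the "matching the noise" objective).
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (B,))
    abar = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return ((eps_model(x_t, t) - eps) ** 2).mean()
```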

Diffusion Model from the Physical Point of View

The forward process in diffusion models can be viewed as an Ornstein-Uhlenbeck process, described by the Stochastic Differential Equation (SDE):

\[\text{d} X_t = - \frac{1}{2} g(t) X_t \text{d} t + \sqrt{g (t)} \text{d} W_t, \text{ for } g(t) > 0,\]

where $(W_t)_{t \ge 0}$ is a standard Wiener process, and $g(t)$ is a non-decreasing weighting function. Given initial $X_0$, the conditional distribution of $X_t$ is

\[X_t \vert X_0 \sim \mathcal{N} (\alpha (t) X_0, h(t) I),\]

where $\alpha (t) = \exp (- \frac{1}{2} \int_0^t g (s) \text{d} s)$ and $h(t) = 1 - \alpha^2 (t)$. Under mild conditions, the distribution of $X_t$ converges to a standard Gaussian, i.e., $X_{\infty} \vert X_0 \sim \mathcal{N} (0, I)$. The forward process is terminated at a sufficiently large $T > 0$, and the backward (time-reversed) process, here with $g \equiv 1$, is described by

\[\text{d} X_t = \bigg(\frac{1}{2} X_t + \nabla \log p_{T-t} (X_t)\bigg) \text{d} t + \text{d} W_t\]
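A short simulation makes the forward process tangible: run Euler-Maruyama on the OU SDE and compare against the closed-form conditional $\mathcal{N}(\alpha(t) X_0, h(t) I)$; the constant schedule `g` below is an illustrative choice:

```python
import numpy as np

def forward_ou(x0, g, T=1.0, n_steps=1000, seed=0):
    # Euler-Maruyama for dX = -0.5 g(t) X dt + sqrt(g(t)) dW.
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        x = (x - 0.5 * g(t) * x * dt
             + np.sqrt(g(t) * dt) * rng.standard_normal(x.shape))
    return x

# Closed-form check with constant g == 1 and T = 1:
# alpha(1) = exp(-0.5) ~ 0.607 and h(1) = 1 - exp(-1) ~ 0.632.
samples = np.stack([forward_ou(np.ones(2), g=lambda t: 1.0, seed=s)
                    for s in range(2000)])
print(samples.mean(axis=0), samples.var(axis=0))
```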

Diffusion Model for Optimization [Li et al. 2024]

  • Input: Datasets $\mathcal{T}$ (unlabeled) and $\mathcal{T}_{label}$ (labeled), target reward value $a$, early-stopping time $t_0$, noise level $\nu$.

  • Reward learning: $\hat{f} \in \arg\min_{f \in \mathcal{F}} L (f, \mathcal{T}_{label})$.

  • Pseudo labeling:

\[\hat{\mathcal{T}} = \lbrace (x_i, \hat{y}_i) : \hat{y}_i = \hat{f} (x_i) + \xi_i\rbrace_{i=1}^N, \; \xi_i \overset{i.i.d.}{\sim} \mathcal{N} (0, \nu^2).\]
  • Conditional score matching:
\[\hat{s} \in \arg\min_{s \in \mathcal{S}} \int_{t_0}^T \hat{\mathbb{E}}_{(x, \hat{y}) \in \hat{\mathcal{T}}} \big[\mathbb{E}_{x' \sim \mathcal{N} (\alpha(t)x, h(t) I)} [\vert\vert\nabla_{x'} \log \phi_t (x'\vert x) - s (x', \hat{y}, t)\vert\vert^2_2] \big] \text{d} t\]
  • Conditional generation: sample from the backward SDE,
\[\text{d} X_t^y = \bigg(\frac{1}{2} X_t^y + \hat{s} (X_{k\eta}^y, a, T-k\eta)\bigg) \text{d} t + \text{d} W_t, \; t\in [k\eta, (k+1)\eta]\]
  • Return: generated population $\mathbf{Pr} (\cdot \vert \hat{y} = a)$, learned subspace representation $V$ contained in $\hat{s}$.
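To illustrate the conditional-generation step, below is a hedged Euler-type discretization of the listed backward SDE, holding the score at the left endpoint of each interval $[k\eta, (k+1)\eta]$ as in the update above. This is an illustrative reading of the algorithm, not the authors' code; `s_hat` stands for the learned conditional score:

```python
import numpy as np

def conditional_backward_sample(s_hat, a, T, t0, eta, dim, seed=0):
    # Discretize dX = (0.5 X + s_hat(X, a, T - t)) dt + dW on [0, T - t0],
    # starting from the stationary N(0, I) distribution and stopping
    # early at time t0, as in the algorithm's inputs.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    for k in range(int((T - t0) / eta)):
        drift = 0.5 * x + s_hat(x, a, T - k * eta)
        x = x + drift * eta + np.sqrt(eta) * rng.standard_normal(dim)
    return x  # an approximate draw from Pr(. | y_hat = a)
```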