Differential Geometry and Generative Models
Background
In deep learning, models can be cast into three categories:
Encoder-type models that map data to a latent space, and use a simple model (e.g., linear regression) on the latent representation to perform downstream tasks. Such models include neural networks for classification, regression, segmentation, etc.
Decoder-type models that parameterize data using low-dimensional latent vectors, and sample latent vectors from a simple prior distribution to generate data. Such models include AE, VAE, diffusion models, etc.
Adversarial-type models that train both an encoder as a discriminator and a decoder as a generator. Such models include GAN, WGAN, etc.
Note that training an encoder does not need a decoder, while training a decoder usually requires training an encoder simultaneously. See my previous post about why we need an encoder to learn a decoder, and also [Schuster et al. 2021] about how to train a decoder without an encoder.
It’s usually assumed that data in a high-dimensional space are distributed near a low-dimensional manifold $\mathcal{M}$. An encoder acts like a local chart $(U, \varphi_U)$ of the hidden manifold, mapping an open set $U$ on the manifold to an open set $\varphi_U (U)$ in the latent space (a low-dimensional Euclidean space). The corresponding decoder maps the flattened open set $\varphi_U (U)$ in the latent space back onto the manifold, through $\varphi_U^{-1}$.
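To make the analogy concrete, here is a minimal autoencoder sketch in PyTorch; the architecture and dimensions are illustrative placeholders, not taken from any of the cited works:

```python
import torch
import torch.nn as nn

# Minimal autoencoder: the encoder plays the role of a chart map phi_U,
# and the decoder plays the role of its inverse phi_U^{-1}.
class AutoEncoder(nn.Module):
    def __init__(self, data_dim=784, latent_dim=2):
        super().__init__()
        # chart map phi_U: U (subset of M) -> R^m (latent space)
        self.encoder = nn.Sequential(
            nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )
        # inverse chart phi_U^{-1}: latent space -> M
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim)
        )

    def forward(self, x):
        z = self.encoder(x)     # local coordinates of x
        return self.decoder(z)  # point back on the (learned) manifold

x = torch.randn(16, 784)
model = AutoEncoder()
loss = nn.functional.mse_loss(model(x), x)  # reconstruction enforces phi^{-1} o phi = id
```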
Riemannian Geometry
Briefly speaking, a manifold $\mathcal{M}$ is a set that is locally homeomorphic to the Euclidean space $\mathbb{R}^m$. Here locally means each point $p \in \mathcal{M}$ is associated with a chart $(U, \varphi_U)$, such that $p \in U$ and $\varphi_U: U \to \varphi_U (U)$ is a homeomorphism. The collection of charts $\lbrace (U, \varphi_U) \rbrace$ is called an atlas, and $\mathbf{x} = \varphi_U$, with component functions $x^i$, is called the local coordinates. An atlas on a manifold is called differentiable if all charts are consistent, i.e., the transitions
\[\varphi_V \circ \varphi_U^{-1}: \varphi_U (U \cap V) \to \varphi_V (U \cap V)\]are $C^{\infty}$-differentiable. As we see, each proper definition on a manifold is translated through its local coordinates into a Euclidean space. For example, a function $f: \mathcal{M} \to \mathbb{R}$ with charts $(U, \varphi_U)$ is called differentiable if all $f \circ \varphi_U^{-1}$ are differentiable. A map $h: \mathcal{M} \to \mathcal{N}$ with charts $(U, \varphi_U)$ and $(V, \psi_V)$ is called differentiable if all $\psi_V \circ h \circ \varphi_U^{-1}$ are differentiable.
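As a standard concrete example, consider the circle $\mathcal{M} = \mathbb{S}^1$ covered by the two stereographic charts $U = \mathbb{S}^1 \setminus \lbrace (0, 1) \rbrace$ and $V = \mathbb{S}^1 \setminus \lbrace (0, -1) \rbrace$ with
\[\varphi_U (x, y) = \frac{x}{1 - y}, \quad \varphi_V (x, y) = \frac{x}{1 + y}.\]On the overlap $U \cap V$, the transition map is $(\varphi_V \circ \varphi_U^{-1}) (u) = 1 / u$ for $u \neq 0$, which is $C^{\infty}$, so these two charts form a differentiable atlas.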
Tangent Space
Defining derivative operators on a manifold is much more complicated. At a high level, for a given point $p \in \mathcal{M}$, the tangent space $T \mathcal{M}$ (of derivatives) and the cotangent space $T^* \mathcal{M}$ (of differentials) are mutually dual spaces of finite dimension. Since the dual of the dual recovers the original finite-dimensional space, we can either first define the tangent space and then the cotangent space, or the other way around. Whereas most textbooks introduce the tangent space first (because it’s easier to illustrate), we borrow the insight from [Chern et al. 1999] and do the reverse, i.e., first define the cotangent space $T^* \mathcal{M}$ and then $T \mathcal{M}$ as the dual of $T^* \mathcal{M}$.
The space of $C^{\infty}$-differentiable functions at $p$, denoted by $C^{\infty}_p$, is an infinite-dimensional vector space. From an algebraic viewpoint, the space $C^{\infty}_p$ is not a concise object to study, because it contains many “redundant” elements that obscure the algebraic structure. For instance, we would like to regard two functions from $C^{\infty}_p$ as the same if they coincide on a neighborhood of $p$, since we only care about the local behavior at $p$. To this end, we introduce the quotient space $\mathcal{F}_p := C^{\infty}_p / \sim$, where the equivalence relation $\sim$ is defined by
\[f \sim g \Longleftrightarrow f\vert_U = g\vert_U \text{ for some neighborhood } U \text{ of } p.\]The space $\mathcal{F}_p$ is still an infinite-dimensional vector space, but only contains the local information of functions at $p$. An element of $\mathcal{F}_p$ is called a germ. More interestingly, the vector space $\mathcal{F}_p$ is also a local ring, with the unique maximal ideal $\mathfrak{m}_p := \lbrace [f]: f(p) = 0 \rbrace$ consisting of the germs vanishing at $p$. Indeed, we can factorize $\mathcal{F}_p$ as $\mathbb{R} \oplus \mathfrak{m}_p$, hence $\mathcal{F}_p / \mathfrak{m}_p \cong \mathbb{R}$. Similarly, we can further factorize $\mathcal{F}_p$ as $\mathbb{R} \oplus (\mathfrak{m}_p / \mathfrak{m}^2_p) \oplus \mathfrak{m}^2_p = \mathcal{H}_p \oplus (\mathfrak{m}_p / \mathfrak{m}^2_p)$, where $\mathcal{H}_p := \mathbb{R} \oplus \mathfrak{m}^2_p$. Now we arrive at the first definition of the cotangent space:
\[T^* \mathcal{M} := \mathcal{F}_p / \mathcal{H}_p \cong \mathfrak{m}_p / \mathfrak{m}^2_p.\]As we see, the differential operator is defined as the canonical homomorphism $\text{d}_p: \mathcal{F}_p \to \mathcal{F}_p / \mathcal{H}_p$. The cotangent space is spanned by the natural basis $\lbrace \text{d}_p (x^i) \rbrace$, i.e., the differentials of the coordinate functions, and we have
\[\text{d}_p ([f]) = \sum_{i = 1}^m \partial_i f(p) \text{d}_p (x^i), \text{ where } \partial_i f(p) := \frac{\partial (f \circ \varphi_U^{-1})}{\partial x^i} \bigg \vert_{\varphi_U (p)}.\]In other words, $\mathcal{H}_p$ is the linear space of germs of smooth functions whose first-order partial derivatives with respect to the local coordinates all vanish at $p$.
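To see why this definition recovers the usual differential, consider the simplest case $\mathcal{M} = \mathbb{R}^m$ with the identity chart. Taylor’s theorem decomposes any germ as
\[[f] = \underbrace{f(p)}_{\in \mathbb{R}} + \underbrace{\sum_{i = 1}^m \partial_i f(p) \big( x^i - x^i (p) \big)}_{\in \mathfrak{m}_p} + \underbrace{[r]}_{\in \mathfrak{m}^2_p},\]where the remainder $r$ vanishes to second order at $p$ and hence (by Hadamard’s lemma) is a sum of products of two functions vanishing at $p$. The first and last terms lie in $\mathcal{H}_p = \mathbb{R} \oplus \mathfrak{m}^2_p$, so $\text{d}_p$ annihilates them and keeps exactly the linear part, which is the formula above.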
We can also describe $\mathcal{H}_p$ through parameterized curves passing through $p$. Denote by $\Gamma_p$ the set of $C^{\infty}$-curves
\[\gamma: (-\delta, \delta) \to \mathcal{M}, \; \gamma (0) = p.\]Then the directional derivative of $[f] \in \mathcal{F}_p$ along $\gamma \in \Gamma_p$ at $p$ is defined as a linear functional over $\mathcal{F}_p$:
\[\ll \gamma, \cdot \gg: \mathcal{F}_p \to \mathbb{R}, \; [f] \mapsto \frac{\text{d} (f \circ \gamma)}{\text{d} t} \big\vert_{t = 0}.\]It’s easy to show that the space $\mathcal{H}_p$ is exactly the set of germs on which all the linear functionals $\ll \gamma, \cdot \gg$ vanish. Since each $\ll \gamma, \cdot \gg$ vanishes on $\mathcal{H}_p$, it descends naturally to a linear functional on $T^* \mathcal{M}$, linear in the second variable. In order to make the form $\ll \cdot, \cdot \gg: \Gamma_p \times \mathcal{F}_p \to \mathbb{R}$ bilinear, we introduce an equivalence relation over $\Gamma_p$ as follows:
\[\gamma \sim \gamma' \Longleftrightarrow \ll \gamma, \text{d}_p ([f]) \gg = \ll \gamma', \text{d}_p ([f]) \gg \text{ for all } [f] \in \mathcal{F}_p.\]We denote by $T \mathcal{M}$ the quotient space $\Gamma_p / \sim$, which is precisely the tangent space and also the dual space of $T^* \mathcal{M}$ (actually this is not trivial and needs some justification). Now we have an induced bilinear form
\[\langle \cdot, \cdot \rangle: T \mathcal{M} \times T^* \mathcal{M} \to \mathbb{R}, \; ([\gamma], \text{d}_p ([f])) \mapsto \sum_{i = 1}^m a_i \xi^i,\]where $a_i = \partial_i f(p)$ and $\xi^i = \frac{\text{d} (x^i \circ \gamma)}{\text{d} t} \big \vert_{t = 0}$. The tangent space $T \mathcal{M}$ is spanned by the partial derivative operators $\lbrace \partial_i \big \vert_p \rbrace$ acting on function germs, and we have
\[\big\langle \partial_k \big \vert_p, \text{d}_p (x^i) \big\rangle = \delta^i_k.\]
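This pairing can be checked numerically: below is a small sketch (the test germ and curve are arbitrary choices) verifying that $\langle [\gamma], \text{d}_p ([f]) \rangle$, computed from the coefficients $a_i$ and $\xi^i$, matches the directional derivative of $f$ along $\gamma$:

```python
import torch
from torch.autograd.functional import jacobian

# Test germ f and curve gamma with p = gamma(0) in R^2 (arbitrary choices).
def f(x):
    return x[0] ** 2 * torch.sin(x[1])

def gamma(t):
    return torch.stack([torch.cos(t), t + t ** 2])

t0 = torch.tensor(0.0, requires_grad=True)

# Left-hand side: directional derivative d(f o gamma)/dt at t = 0.
lhs = torch.autograd.grad(f(gamma(t0)), t0)[0]

# Right-hand side: sum_i a_i xi^i, with a_i = partial_i f(p) and xi = gamma'(0).
p = gamma(t0).detach().requires_grad_(True)
a = torch.autograd.grad(f(p), p)[0]      # coefficients of d_p([f])
xi = jacobian(gamma, torch.tensor(0.0))  # curve velocity at t = 0
rhs = (a * xi).sum()

print(torch.allclose(lhs, rhs))  # True, up to floating-point tolerance
```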
Tangent Map
As we see, a tangent vector $v \in T \mathcal{M}$ is a real-valued linear functional $v: C_p ^{\infty} \to \mathbb{R}$. Let $f: \mathcal{M} \to \mathcal{N}$ be a differentiable map; the tangent map is defined as the linear map
\[T_f \vert_p: T \mathcal{M} \to T \mathcal{N}, \; T_f (v) (g) = v(g\circ f),\]for all $v \in T \mathcal{M}$ and $g \in C_{f(p)}^{\infty}$. If $v \in T \mathcal{M}$ is represented by a curve $\gamma: (-\delta, \delta) \to \mathcal{M}$, i.e., $v (f) = \frac{\text{d} (f \circ \gamma)}{\text{d} t} \big\vert_{t = 0}$, then $T_f (v)$ is represented by the curve $f \circ \gamma: (-\delta, \delta) \to \mathcal{N}$, i.e., $T_f (v) (g) = \frac{\text{d} (g \circ f \circ \gamma)}{\text{d} t} \big\vert_{t = 0}$. It is convenient to define the pull-back and push-forward: consider the following chain,
\[(-\delta, \delta) \overset{\gamma}{\to} \mathcal{M} \overset{f}{\to} \mathcal{N} \overset{g}{\to} \mathbb{R}.\]The curve $\gamma$ defined on $\mathcal{M}$ can be pushed forward to a curve on $\mathcal{N}$: $f_* \gamma := f \circ \gamma$, and the function $g$ defined on $\mathcal{N}$ can be pulled back to a function on $\mathcal{M}$: $f^* g := g\circ f$. If we define the push-forward on tangent vectors as $f_* v = T_f (v)$ for all $v \in T \mathcal{M}$, then the definition of the tangent map can be rewritten as $(f_* v) (g) = v (f^* g)$. The chain rule for manifolds also holds: given manifolds $\mathcal{M}, \mathcal{N}, \mathcal{O}$, and differentiable maps $\mathcal{M} \overset{f}{\to} \mathcal{N} \overset{g}{\to} \mathcal{O}$, we have
\[T_{g \circ f} \vert_p = T_g \vert_{f(p)} \circ T_f \vert_p.\]As a special case, if $\mathcal{M} \subseteq \mathbb{R}^m$ and $\mathcal{N} \subseteq \mathbb{R}^n$, then the tangent map $T_f$ at a point $p \in \mathcal{M}$ coincides with the Jacobian matrix $J_f$ at $p$, and the chain rule becomes the product of two Jacobian matrices: $J_{g \circ f} (p) = J_g (f(p)) \cdot J_f (p)$.
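This special case is easy to verify with automatic differentiation; the maps below are arbitrary placeholders:

```python
import torch
from torch.autograd.functional import jacobian

# Toy differentiable maps f: R^3 -> R^2 and g: R^2 -> R^2 (arbitrary placeholders).
def f(x):
    return torch.stack([x[0] * x[1], torch.sin(x[2])])

def g(y):
    return torch.stack([y[0] + y[1] ** 2, torch.exp(y[0])])

p = torch.tensor([0.5, -1.0, 2.0])

J_f = jacobian(f, p)                    # 2 x 3, tangent map of f at p
J_g = jacobian(g, f(p))                 # 2 x 2, tangent map of g at f(p)
J_gf = jacobian(lambda x: g(f(x)), p)   # 2 x 3, tangent map of g o f at p

print(torch.allclose(J_gf, J_g @ J_f))  # True: J_{g o f}(p) = J_g(f(p)) J_f(p)
```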
Generative Model
Now, in the geometric view, an encoder is like a chart map $\varphi_U$ and a decoder is its inverse $\varphi_U^{-1}$. Learning an autoencoder-type model is essentially learning a parameterization of a hidden manifold using a low-dimensional Euclidean latent space. Whether the manifold we learn is deterministic or stochastic depends on whether we use the reparameterization trick; the stochastic variant is usually called a variational autoencoder (VAE). In this section, we are going to understand VAE and diffusion models through the lens of differential geometry.
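For reference, a minimal sketch of the standard reparameterization trick:

```python
import torch

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I): sampling stays differentiable
    # with respect to the encoder outputs (mu, log_var).
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)

# Deterministic AE: z = encoder(x).
# VAE: z = reparameterize(mu(x), log_var(x)), a stochastic chart.
```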
Variational Autoencoder [Shao et al. 2017, Arvanitidis et al. 2021, Chadebec et al. 2022]
We parameterize a curve on $\mathcal{M}$ through a generative model $g: \mathcal{Z} \to \mathcal{M}$, as follows:
\[\gamma: [0, 1] \to \mathcal{Z}, \; \gamma_g = g \circ \gamma: [0, 1] \to \mathcal{M}.\]The arc length of $\gamma$ on $\mathcal{Z}$ is defined as the arc length of $\gamma_g$ on $\mathcal{M}$:
\[L (\gamma) := \int_0^1 \vert\vert \gamma_g' (t) \vert\vert \text{d} t = \int_0^1 \vert\vert J_g (\gamma(t)) \gamma' (t) \vert\vert \text{d} t = \int_0^1 \sqrt{\gamma' (t)^T G (\gamma(t)) \gamma' (t)} \text{d} t,\]where $G (z) := J_g (z)^T J_g (z)$ is called the Riemannian metric (the pullback metric), and defines a local Mahalanobis-type distance on $\mathcal{Z}$. The curve that minimizes the arc length is called a geodesic curve, and the geodesic distance between two points $p_1, p_2 \in \mathcal{M}$ is defined as the arc length of the geodesic curve connecting them.
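The pullback metric can be computed directly from the decoder with autograd; in the sketch below, the decoder `g` is a toy stand-in for a trained generator:

```python
import torch
from torch.autograd.functional import jacobian

# Toy decoder g: Z = R^2 -> R^10, a stand-in for a trained generator.
g = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 10))

def metric(z):
    J = jacobian(g, z)  # 10 x 2 Jacobian J_g(z) of the decoder at z
    return J.T @ J      # 2 x 2 pullback (Riemannian) metric G(z)

z = torch.zeros(2)
G = metric(z)
dz = 0.01 * torch.randn(2)
print(dz @ G @ dz)      # squared local length of the latent displacement dz
```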
In practice, we discretize the curve $\gamma$ and minimize an approximation of an equivalent energy functional (whose minimizers coincide with constant-speed geodesics):
\[E (\gamma) := \frac{1}{2} \int_0^1 \vert\vert \gamma_g' (t) \vert\vert^2 \text{d} t, \; \hat{E} (z_{0:N}) := \frac{1}{2N} \sum_{i = 1}^N \vert\vert g(z_i) - g(z_{i-1}) \vert\vert^2,\]where $z_i = \gamma (\frac{i}{N})$. The gradient of $\hat{E}$ with respect to an interior point $z_i$ is
\[\nabla_{z_i} \hat{E} = - \frac{1}{N} J_g (z_i) ^T \big(g (z_{i+1}) - 2g(z_i) + g(z_{i-1})\big).\]
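Rather than implementing this gradient by hand, one can minimize $\hat{E}$ directly with autograd; a sketch reusing the toy decoder `g` from above, with the endpoints held fixed:

```python
import torch

# Discrete geodesic between two latent points; only the interior points move.
for p in g.parameters():
    p.requires_grad_(False)  # the decoder is frozen; we only optimize the curve

z_start, z_end = torch.randn(2), torch.randn(2)
N = 10
ts = torch.linspace(0, 1, N + 1)[1:-1].unsqueeze(1)
z_mid = ((1 - ts) * z_start + ts * z_end).requires_grad_(True)  # straight-line init

opt = torch.optim.Adam([z_mid], lr=1e-2)
for _ in range(500):
    z = torch.cat([z_start.unsqueeze(0), z_mid, z_end.unsqueeze(0)])
    diffs = g(z[1:]) - g(z[:-1])           # g(z_i) - g(z_{i-1}) on the manifold
    energy = (diffs ** 2).sum() / (2 * N)  # discrete energy E_hat
    opt.zero_grad()
    energy.backward()
    opt.step()
```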
Diffusion Model [Park et al. 2023]
A diffusion model, or more generally a hierarchical autoencoder, is a “deep” version of the autoencoder that learns a complex manifold through a composition of simple maps. The backward process of a diffusion model is a chain of differentiable maps between manifolds:
\[\mathcal{M} = \mathcal{Z}_0 \overset{f_1}{\to} \mathcal{Z}_1 \overset{f_2}{\to} \cdots \overset{f_K}{\to} \mathcal{Z}_K,\]which induces a chain of tangent spaces through tangent maps:
\[T \mathcal{M} = T \mathcal{Z}_0 \overset{T_{1}}{\to} T \mathcal{Z}_1 \overset{T_{2}}{\to} \cdots \overset{T_{K}}{\to} T \mathcal{Z}_K.\]Assume all manifolds are subsets of $\mathbb{R}^n$; then the tangent maps coincide with the Jacobian matrices $J_i$. For each $J_i \in \mathbb{R}^{n \times n}$, we rewrite it as $J_i = U_i \Lambda_i V_i^T$ using the SVD, where $U_i$ and $V_i$ contain the left and right singular vectors of $J_i$ respectively. Since $J_i$ maps $T \mathcal{Z}_{i-1}$ to $T \mathcal{Z}_i$, the right singular vectors in $V_i$ (with nonzero singular values) span the relevant directions of the tangent space $T \mathcal{Z}_{i-1}$, and the corresponding left singular vectors in $U_i$ span their images in the tangent space $T \mathcal{Z}_i$.
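A sketch of extracting these singular directions for a single step of the chain; `step_fn` below is a hypothetical stand-in for one learned backward map $f_i$:

```python
import torch
from torch.autograd.functional import jacobian

# Hypothetical backward step f_i: Z_{i-1} -> Z_i (stand-in for a denoising network).
step_fn = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Tanh(), torch.nn.Linear(8, 8))

x = torch.randn(8)              # a point of Z_{i-1}, embedded in R^n with n = 8
J = jacobian(step_fn, x)        # n x n Jacobian J_i at x
U, S, Vh = torch.linalg.svd(J)  # J_i = U_i Lambda_i V_i^T

k = int((S > 1e-6).sum())       # numerical rank of J_i
V_k = Vh.T[:, :k]               # directions in T Z_{i-1} with nonzero stretch
U_k = U[:, :k]                  # their images, spanning directions in T Z_i
```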
Atlas Generative Model [Larsen et al. 2021, Schonsheck et al. 2019]
Note that, in theory, it’s generally not possible to parameterize a manifold globally using a single chart. For instance, there doesn’t exist a diffeomorphism (or even a homeomorphism) between the circle $\mathbb{S}^1$ and an interval $[0,1]$. Fortunately, we don’t have access to all possible data on the hidden manifold; we only have a finite number of observations. Therefore, it is indeed possible to learn an invertible mapping between a finite set $\mathcal{T} \subseteq \mathbb{S}^1$ and $\mathcal{Z} \subseteq [0,1]$. Nevertheless, several works show that atlas adaptations of generative models are indeed helpful. Those works include Chart AE (CAE) [Alberti et al. 2023], Atlas VAE [Kingma et al. 2014, Pineau et al. 2018], Atlas WAE [Eric O. Korman 2018], Atlas GAN [Spurr et al. 2017], etc.