Dataset Condensation: Theory, Practice, and Beyond

Background

Given a dataset $\mathcal{T} = \{\mathbf{x}_i\}_{i=1}^N \subseteq \mathbb{R}^n$ of size $N$ drawn from some unknown distribution $\mathcal{D}$, we would like to find a synthetic dataset $\mathcal{S} = \{\mathbf{s}_j\}_{j=1}^M \subseteq \mathbb{R}^n$ of size $M \ll N$, such that the gap between the generalization losses of models trained over $\mathcal{T}$ and over $\mathcal{S}$ is minimized:

\[\min_{\mathcal{S}} \; \vert L_{\mathcal{T}} (\theta_{\mathcal{S}}) - L_{\mathcal{T}} (\theta_{\mathcal{T}}) \vert,\]

where $\theta_{\mathcal{T}}$ is the minimizer of $L_{\mathcal{T}} (\theta)$ and $\theta_{\mathcal{S}}$ is the minimizer of $L_{\mathcal{S}} (\theta)$. In practice, we replace the true generalization loss everywhere by its empirical estimate. Note that $L_{\mathcal{T}} (\theta_{\mathcal{S}}) \ge L_{\mathcal{T}} (\theta_{\mathcal{T}})$ and that $L_{\mathcal{T}} (\theta_{\mathcal{T}})$ does not depend on $\mathcal{S}$. Hence it is equivalent to minimize $L_{\mathcal{T}} (\theta_{\mathcal{S}})$ over $\mathcal{S}$, which yields the following bi-level optimization problem:

\[\min_{\mathcal{S}} \; L_{\mathcal{T}} (\theta_{\mathcal{S}}), \; \text{s.t. } \theta_{\mathcal{S}} = \arg\min_{\theta} \; L_{\mathcal{S}} (\theta).\]
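To make the objective concrete, here is a minimal NumPy sketch in a toy linear-regression setting (the data, dimensions, and losses are illustrative assumptions, not from any of the papers below): it fits $\theta_{\mathcal{T}}$ and $\theta_{\mathcal{S}}$ in closed form and evaluates the condensation gap $\vert L_{\mathcal{T}}(\theta_{\mathcal{S}}) - L_{\mathcal{T}}(\theta_{\mathcal{T}}) \vert$.

```python
# Toy sketch of the condensation objective |L_T(theta_S) - L_T(theta_T)|
# in a linear-regression setting (all names and data are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, N, M = 10, 1000, 20                       # feature dim, |T|, |S|

X_real = rng.normal(size=(N, n))             # real dataset T
y_real = X_real @ rng.normal(size=n) + 0.1 * rng.normal(size=N)

X_syn = rng.normal(size=(M, n))              # candidate synthetic dataset S
y_syn = rng.normal(size=M)

def fit(X, y):
    # theta = argmin_theta ||X theta - y||^2 (closed-form least squares)
    return np.linalg.lstsq(X, y, rcond=None)[0]

def loss(theta, X, y):
    # empirical loss L(theta) on a dataset
    return np.mean((X @ theta - y) ** 2)

theta_T = fit(X_real, y_real)                # minimizer of L_T
theta_S = fit(X_syn, y_syn)                  # minimizer of L_S

gap = abs(loss(theta_S, X_real, y_real) - loss(theta_T, X_real, y_real))
print(f"|L_T(theta_S) - L_T(theta_T)| = {gap:.4f}")
```

In the linear case the inner problem has a closed-form solution; the methods below handle the general case where $\theta_{\mathcal{S}}$ is obtained by iterative training.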

Performance-oriented Condensation

In this category, all methods aim to maximize the generalization performance of the model trained over the synthetic dataset.

Optimization-based

  • Dataset Distillation (Wang et al. 2018, arxiv) [link]: Backpropagation-through-time (BPTT),
\[\begin{align*} & \theta^{t+1}_{\mathcal{S}^t} \leftarrow \theta^t_{\mathcal{S}^t} - \eta \nabla_{\theta} L_{\mathcal{S}} (\theta^t_{\mathcal{S}^t}) \\ & \mathcal{S}^{t+1} \leftarrow \mathcal{S}^t - \alpha \nabla_{\mathcal{S}} L_{\mathcal{T}} (\theta^{t+1}_{\mathcal{S}^t}) \end{align*}\]
  • Dataset Distillation with Convexified Implicit Gradients (Loo et al. 2023, ICML) [link]: Implicit differentiation,
\[\begin{align*} & \mathcal{S}^{t+1} \leftarrow \mathcal{S}^t - \alpha \frac{\text{d} L_{\mathcal{T}}}{\text{d} \mathcal{S}} \\ & \frac{\text{d} L_{\mathcal{T}}}{\text{d} \mathcal{S}} = \frac{\partial L_{\mathcal{T}} (\theta_{\mathcal{S}})}{\partial \mathcal{S}} + \frac{\partial}{\partial \mathcal{S}} \bigg(\frac{\partial L_{\mathcal{S}} (\theta)}{\partial \theta}^T v\bigg), v = \bigg(\frac{\partial^2 L_{\mathcal{S}} (\theta)}{\partial \theta \partial \theta^T}\bigg)^{-1} \frac{\partial L_{\mathcal{T}} (\theta_{\mathcal{S}})}{\partial \theta} \end{align*}\]
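A minimal PyTorch sketch of the BPTT scheme above: the inner SGD on $L_{\mathcal{S}}$ is unrolled with the graph kept, and the outer loss $L_{\mathcal{T}}$ is back-propagated into $\mathcal{S}$. The linear classifier, the toy data, the hyperparameters, and the choice of keeping the synthetic labels fixed are illustrative assumptions.

```python
# BPTT-style dataset distillation sketch (toy linear classifier).
import torch

torch.manual_seed(0)
n, C, N, M = 32, 10, 512, 50                 # feature dim, classes, |T|, |S|
X_T = torch.randn(N, n); y_T = torch.randint(0, C, (N,))

X_S = torch.randn(M, n, requires_grad=True)  # synthetic images (optimized)
y_S = torch.randint(0, C, (M,))              # labels kept fixed for simplicity

alpha, eta, inner_steps = 0.1, 0.05, 10
opt_S = torch.optim.Adam([X_S], lr=alpha)

for outer in range(100):
    W = torch.zeros(n, C, requires_grad=True)             # re-init inner model theta
    for t in range(inner_steps):                           # unrolled inner SGD on L_S
        loss_S = torch.nn.functional.cross_entropy(X_S @ W, y_S)
        (g,) = torch.autograd.grad(loss_S, W, create_graph=True)
        W = W - eta * g                                    # keep the graph for BPTT
    loss_T = torch.nn.functional.cross_entropy(X_T @ W, y_T)   # outer loss L_T(theta_S)
    opt_S.zero_grad(); loss_T.backward(); opt_S.step()     # update S through the unroll
```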

Gradient-based

  • Dataset Condensation with Gradient Matching (Zhao et al. 2020, ICLR) [link]: Gradient matching,
\[\min_{\mathcal{S}} \mathbb{E}_{c} \big[d (\mathbb{E}_{\theta_{1:T}} [\nabla_{\theta} L_{\mathcal{S}} (\theta)], \mathbb{E}_{\theta_{1:T}} [\nabla_{\theta} L_{\mathcal{T}} (\theta)])\big]\]
  • Loss-Curvature Matching for Dataset Selection and Condensation (Shin et al. 2023, AISTATS) [link]: Gradient matching + regularization,
\[\min_{\mathcal{S}} \bigg(\underbrace{\mathbb{E}_{c} \big[d (\mathbb{E}_{\theta_{1:T}} [\nabla_{\theta} L_{\mathcal{S}} (\theta)], \mathbb{E}_{\theta_{1:T}} [\nabla_{\theta} L_{\mathcal{T}} (\theta)])\big]}_{\text{gradient matching}} + \underbrace{\mathcal{R} (\varepsilon, \theta, \lambda_{\max}^{\mathcal{T}, \mathcal{S}})}_{\text{curvature regularization}} \bigg)\]
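A minimal PyTorch sketch of the gradient-matching objective above for a single model initialization; the negative-cosine layerwise distance used for $d$ and the toy linear model are assumptions made for brevity.

```python
# Gradient-matching sketch: match grad L_S(theta) to grad L_T(theta) at the
# current parameters theta, then update S (toy single-class/single-init case).
import torch
import torch.nn.functional as F

def grad_match_loss(model, X_S, y_S, X_T, y_T):
    """d(grad L_S(theta), grad L_T(theta)) at the current parameters theta."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_S = torch.autograd.grad(F.cross_entropy(model(X_S), y_S), params, create_graph=True)
    g_T = torch.autograd.grad(F.cross_entropy(model(X_T), y_T), params)
    loss = 0.0
    for gs, gt in zip(g_S, g_T):
        # 1 - cosine similarity between flattened per-layer gradients
        loss = loss + 1 - F.cosine_similarity(gs.flatten(), gt.detach().flatten(), dim=0)
    return loss

# usage: sample a freshly initialised model theta, then take a step on S
model = torch.nn.Linear(32, 10)
X_S = torch.randn(50, 32, requires_grad=True); y_S = torch.randint(0, 10, (50,))
X_T = torch.randn(512, 32);                    y_T = torch.randint(0, 10, (512,))
opt_S = torch.optim.SGD([X_S], lr=0.1)
loss = grad_match_loss(model, X_S, y_S, X_T, y_T)
opt_S.zero_grad(); loss.backward(); opt_S.step()
```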

Parameter-based

  • Dataset Distillation by Matching Training Trajectories (Cazenavette et al. 2022, CVPR) [link]: Trajectory matching,
\[\min_{\mathcal{S}} \; \mathbb{E}_{\theta_{1:T}} [d (\theta_{\mathcal{S}}, \theta_{\mathcal{T}})]\]
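A minimal PyTorch sketch of trajectory matching: the student starts at an expert checkpoint $\theta_t$, is trained on $\mathcal{S}$ for a few unrolled steps, and is matched against a later expert checkpoint $\theta_{t+K}$. The linear model, the normalized squared distance, and the placeholder checkpoints are assumptions.

```python
# Trajectory-matching sketch: expert checkpoints theta_t, theta_{t+K} are
# assumed to come from a model trained on the real dataset T.
import torch
import torch.nn.functional as F

def trajectory_match_loss(theta_t, theta_tK, X_S, y_S, eta=0.05, student_steps=10):
    W = theta_t.clone().detach().requires_grad_(True)   # student starts at theta_t
    for _ in range(student_steps):                       # unrolled SGD on the synthetic set
        loss_S = F.cross_entropy(X_S @ W, y_S)
        (g,) = torch.autograd.grad(loss_S, W, create_graph=True)
        W = W - eta * g
    # normalised parameter distance d(theta_S, theta_T)
    return ((W - theta_tK) ** 2).sum() / ((theta_t - theta_tK) ** 2).sum()

# usage with placeholder checkpoints
theta_t, theta_tK = torch.randn(32, 10), torch.randn(32, 10)
X_S = torch.randn(50, 32, requires_grad=True); y_S = torch.randint(0, 10, (50,))
loss = trajectory_match_loss(theta_t, theta_tK, X_S, y_S)
loss.backward()                                          # gradient w.r.t. X_S
```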

Feature-based

  • Dataset Distillation with Infinitely Wide Convolutional Networks (Nguyen et al. 2021, NeurIPS) [link]: Regression only,
\[\min_{\mathcal{S}} \; d(\mathcal{T}_y, \theta_{\mathcal{S}} (\mathcal{T}_x)), \; \theta_{\mathcal{S}} = k(\cdot, \mathcal{S}_x) \big(k(\mathcal{S}_x, \mathcal{S}_x) + \lambda \mathbf{I}\big)^{-1} \mathcal{S}_y, \; k = \text{NTK}\]
  • Dataset Meta-Learning from Kernel Ridge-Regression (Nguyen et al. 2021, ICLR) [link]: Regression only,
\[\min_{\mathcal{S}} \; d(\mathcal{T}_y, \theta_{\mathcal{S}} (\mathcal{T}_x)), \; \theta_{\mathcal{S}} = k(\cdot, \mathcal{S}_x) \big(k(\mathcal{S}_x, \mathcal{S}_x) + \lambda \mathbf{I}\big)^{-1} \mathcal{S}_y, \; k = \text{NTK}\]
  • CAFE: Learning to Condense Dataset by Aligning Features (Wang et al. 2022, CVPR) [link]: Intermediate feature matching + output feature matching + regularization,
\[\begin{align*} \min_{\mathcal{S}} \; \bigg(\underbrace{\mathbb{E}_{\theta^{1:L-1}} \big[d(\theta_{\mathcal{S}} (\mathcal{S}), \theta_{\mathcal{T}} (\mathcal{T}))\big]}_{\text{intermediate feature matching}} + \underbrace{\mathbb{E}_{\theta^L} \big[d(\theta_{\mathcal{S}} (\mathcal{S}), \theta_{\mathcal{T}} (\mathcal{T}))\big]}_{\text{output softmax attention matching}} + \underbrace{L_{\mathcal{T}} (\theta_{\mathcal{S}})}_{\text{loss matching}}\bigg) \end{align*}\]
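A minimal NumPy sketch of the kernel ridge-regression (KIP-style) objective used by the two NTK-based entries above; an RBF kernel stands in for the NTK purely to keep the example self-contained, and the toy data, bandwidth, and ridge parameter are assumptions.

```python
# KIP-style loss: fit KRR on (Xs, Ys), evaluate the regression error on (Xt, Yt).
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)                 # RBF stand-in for the NTK

def kip_loss(Xs, Ys, Xt, Yt, lam=1e-3):
    """d(T_y, theta_S(T_x)) with theta_S the KRR solution fit on (Xs, Ys)."""
    K_ss = rbf_kernel(Xs, Xs)
    K_ts = rbf_kernel(Xt, Xs)
    pred = K_ts @ np.linalg.solve(K_ss + lam * np.eye(len(Xs)), Ys)
    return np.mean((pred - Yt) ** 2)

# usage (toy data): the actual methods differentiate this loss w.r.t. Xs, Ys
rng = np.random.default_rng(0)
Xt, Yt = rng.normal(size=(500, 32)), rng.normal(size=(500, 10))
Xs, Ys = rng.normal(size=(20, 32)),  rng.normal(size=(20, 10))
print(kip_loss(Xs, Ys, Xt, Yt))
```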

Distribution-oriented Condensation

The main issue with performance-oriented condensation methods is that the synthetic dataset is very likely to overfit to downstream performance: since performance is only one aspect along which the two datasets are matched, the synthetic dataset may follow a totally different distribution from the real one while preserving the same downstream performance.

The core of distribution-oriented condensation methods is using a metric to measure the distance between two distributions:

\[\min_{\mathcal{S}} \; d(\mu_{\mathcal{S}}, \mu_{\mathcal{T}}),\]

where $\mu_{\mathcal{T}}$ is the empirical distribution of $\mathcal{T}$ defined as $\mu_{\mathcal{T}} = \frac{1}{N} \sum_{i = 1}^N \delta_{\mathbf{x}_i}.$

  • Dataset Condensation with Distribution Matching (Zhao et al. 2021, WACV) [link]: Integral probability metric (IPM),
\[IPM (\mu_{\mathcal{S}}, \mu_{\mathcal{T}}) = \max_{\theta} \; \big\vert\mathbb{E}_{X \sim \mu_{\mathcal{S}}} [\theta(X)] - \mathbb{E}_{Y \sim \mu_{\mathcal{T}}} [\theta(Y)]\big\vert\]
  • M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy (Zhang et al. 2024, AAAI) [link]: MMD,
\[\begin{align*} MMD (\mu_{\mathcal{S}}, \mu_{\mathcal{T}}) & = \max_{\vert\vert \theta \vert\vert_{\mathcal{H}} \le 1} \big\vert\mathbb{E}_{X \sim \mu_{\mathcal{S}}} [\theta(X)] - \mathbb{E}_{Y \sim \mu_{\mathcal{T}}} [\theta(Y)]\big\vert^2 \\ & = \mathbb{E}_{X \sim \mu_{\mathcal{S}}, X' \sim \mu_{\mathcal{S}}} [k (X, X')] - 2 \mathbb{E}_{X \sim \mu_{\mathcal{S}}, Y \sim \mu_{\mathcal{T}}} [k (X, Y)] + \mathbb{E}_{Y \sim \mu_{\mathcal{T}}, Y' \sim \mu_{\mathcal{T}}} [k (Y, Y')] \end{align*}\]
  • Dataset Distillation via the Wasserstein Metric (Liu et al. 2024, arxiv) [link]: Wasserstein distance,
\[\begin{align*} W_1 (\mu_{\mathcal{S}}, \mu_{\mathcal{T}}) & = \max_{\vert\vert \theta \vert\vert_{Lip} \le 1} \big\vert\mathbb{E}_{X \sim \mu_{\mathcal{S}}} [\theta(X)] - \mathbb{E}_{Y \sim \mu_{\mathcal{T}}} [\theta(Y)]\big\vert \\ & = \min_{\gamma \in \Gamma (\mu_{\mathcal{S}}, \mu_{\mathcal{T}})} \bigg\{\mathbb{E}_{(X, Y) \sim \gamma} [d(X, Y)]: \int \gamma(\cdot, y) \text{d} y = \mu_{\mathcal{S}}, \; \int \gamma(x, \cdot) \text{d} x = \mu_{\mathcal{T}} \bigg\} \end{align*}\]
  • Dataset Condensation by Minimal Finite Covering (Chen et al. 2024, arxiv) [link]: Hausdorff distance, average Hausdorff distance,
\[\begin{align*} & d_{H} (\mu_{\mathcal{S}}, \mu_{\mathcal{T}}) = \max \bigg\{\max_{\mathbf{x} \in \mathcal{S}} \min_{\mathbf{y} \in \mathcal{T}} d(\mathbf{x}, \mathbf{y}), \max_{\mathbf{y} \in \mathcal{T}} \min_{\mathbf{x} \in \mathcal{S}} d(\mathbf{x}, \mathbf{y})\bigg\} \\ & \bar{d}_{H} (\mu_{\mathcal{S}}, \mu_{\mathcal{T}}) = \frac{1}{2} \bigg(\frac{1}{\vert \mathcal{S} \vert} \sum_{\mathbf{x} \in \mathcal{S}} \min_{\mathbf{y} \in \mathcal{T}} d(\mathbf{x}, \mathbf{y}) + \frac{1}{\vert \mathcal{T} \vert} \sum_{\mathbf{y} \in \mathcal{T}} \min_{\mathbf{x} \in \mathcal{S}} d(\mathbf{x}, \mathbf{y})\bigg) \end{align*}\]
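A minimal NumPy sketch of the (biased) empirical $MMD^2$ estimator from the M3D entry above; the RBF kernel and its bandwidth are illustrative assumptions.

```python
# Empirical MMD^2 between the synthetic and real empirical distributions.
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(S, T, gamma=0.1):
    """E[k(X,X')] - 2 E[k(X,Y)] + E[k(Y,Y')] under the empirical measures."""
    return (rbf_kernel(S, S, gamma).mean()
            - 2 * rbf_kernel(S, T, gamma).mean()
            + rbf_kernel(T, T, gamma).mean())

rng = np.random.default_rng(0)
T = rng.normal(size=(1000, 32))                 # real data
S = rng.normal(loc=0.5, size=(50, 32))          # synthetic candidate
print(mmd2(S, T))                               # shrinks as mu_S approaches mu_T
```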

Variants

Data Augmentation

  • Dataset Condensation with Differentiable Siamese Augmentation (Zhao et al. 2021, ICML) [link]: Gradient matching + data augmentation,
\[\mathcal{A}_{\omega}: \mathbb{R}^n \longrightarrow \mathbb{R}^n, \; \mathbf{x} \mapsto \mathcal{A}_{\omega} (\mathbf{x})\]
  • Dataset Condensation via Efficient Synthetic-Data Parameterization (Kim et al. 2022, ICML) [link]: Gradient matching + multi-formation + data augmentation,
\[f: \mathbb{R}^n \longrightarrow \mathbb{R}^{n'}, \; \mathbf{x} \mapsto f(\mathbf{x})\]
  • Dataset Distillation with Channel Efficient Process (Zhou et al. 2024, ICASSP) [link]: Gradient matching + channel-wise multi-formation,
\[f_c: \mathbb{R}^n \longrightarrow \mathbb{R}^{4n}, \; \mathbf{x} \mapsto f_c(\mathbf{x})\]
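A minimal PyTorch sketch of a multi-formation-style decoder $f$: each stored synthetic tensor is split into four quadrants that are upsampled back to full resolution, so one stored tensor yields four training images. The factor of 4 and the bilinear upsampling are illustrative assumptions.

```python
# Multi-formation decoder sketch: one stored tensor -> 4 decoded training images.
import torch
import torch.nn.functional as F

def multi_formation(x):
    """x: (B, C, H, W) stored synthetic tensors -> (4B, C, H, W) decoded images."""
    B, C, H, W = x.shape
    quads = [x[:, :, :H // 2, :W // 2], x[:, :, :H // 2, W // 2:],
             x[:, :, H // 2:, :W // 2], x[:, :, H // 2:, W // 2:]]
    decoded = [F.interpolate(q, size=(H, W), mode="bilinear", align_corners=False)
               for q in quads]
    return torch.cat(decoded, dim=0)

x = torch.randn(10, 3, 32, 32, requires_grad=True)   # stored synthetic set
imgs = multi_formation(x)                             # 40 images fed to the DC loss
print(imgs.shape)                                     # torch.Size([40, 3, 32, 32])
```

Because the decoder is differentiable, the condensation loss computed on the decoded images back-propagates into the stored tensors.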

Change of Space

Data space

  • Synthesizing Informative Training Samples with GAN (Zhao et al. 2022, NeurIPS workshop) [link]: Distribution matching + pre-trained GAN + data augmentation,
\[\max_{\theta} \; \big\vert\mathbb{E}_{G(Z) \sim \mu_{\mathcal{S}}} [\theta(G(Z))] - \mathbb{E}_{Y \sim \mu_{\mathcal{T}}} [\theta(Y)]\big\vert\]
  • DiM: Distilling Dataset into Generative Model (Wang et al. 2023, arxiv) [link]: Distribution matching + GAN training,
\[\min_{\mathcal{Z}} \; \bigg(\underbrace{\mathbb{E}_{\theta \sim \Theta} \big[\big\vert\mathbb{E}_{G(Z) \sim \mu_{\mathcal{S}}} [\theta(G(Z))] - \mathbb{E}_{Y \sim \mu_{\mathcal{T}}} [\theta(Y)]\big\vert\big]}_{\text{logit matching}} + \underbrace{L_{GAN} (G(\mathcal{Z}), \mathcal{T})}_{\text{GAN loss}}\bigg)\]
  • Dataset Condensation via Generative Model (Zhang et al. 2023, arxiv) [link]: Feature matching + GAN training + regularization,
\[\min_{\mathcal{Z}} \; \bigg(\underbrace{\mathbb{E}_{\theta^{1:L}} \big[d(\theta_{\mathcal{Z}} (G(\mathcal{Z})), \theta_{\mathcal{T}} (\mathcal{T}))\big]}_{\text{feature matching}} + \underbrace{L_{GAN} (G(\mathcal{Z}), \mathcal{T})}_{\text{GAN loss}} + \underbrace{\mathcal{R}_{intra}}_{\text{intra-class diversity}} + \underbrace{\mathcal{R}_{inter}}_{\text{inter-class discrimination}}\bigg)\]
  • Generalizing Dataset Distillation via Deep Generative Prior (Cazenavette et al. 2023, CVPR) [link]: DC + pre-trained GAN,
\[\min_{\mathcal{Z}} \; L_{DC} (G(\mathcal{Z}), \mathcal{T})\]
  • Dataset Distillation via Factorization (Liu et al. 2022, NeurIPS) [link]: DC + pre-trained hallucinator-extractor,
\[\min_{\mathcal{Z}} \; \bigg(L_{DC} (F(\mathcal{Z}), \mathcal{T}) + \underbrace{L_{sim} (F)}_{\text{similarity loss}}\bigg), \; \min_{F} \; \bigg(\underbrace{L_{con} (F)}_{\text{contrastive loss}} + \underbrace{L_{task} (F)}_{\text{performance loss}}\bigg)\]
  • Efficient Dataset Distillation via Minimax Diffusion (Gu et al. 2024, CVPR) [link]: Feature matching + diffusion training + regularization,
\[\min_{\mathcal{S}} \; \bigg(\underbrace{L (\theta_{1:T}, \mathcal{T})}_{\text{diffusion loss}} + \underbrace{\mathbb{E}_{\theta_{1:T}} \big[d(\theta (\mathcal{S}), \theta(\mathcal{T}))\big]}_{\text{feature matching}} + \underbrace{\mathcal{R}_{intra} (\theta (\mathcal{S}))}_{\text{intra-class diversity}}\bigg)\]
  • MGDD: A Meta Generator for Fast Dataset Distillation (Liu et al. 2023, NeurIPS) [link]: Regression only, no backpropagation on $\mathcal{S}$,
\[\begin{align*} \min_{\omega} \; & d(\mathcal{T}_y, \theta_{\omega} (\mathcal{T}_x)) \\ & \theta_{\omega} = k(\cdot, G_{\omega} (\mathcal{S}_x)) \big(k(G_{\omega} (\mathcal{S}_x), G_{\omega} (\mathcal{S}_x)) + \lambda \mathbf{I}\big)^{-1} \theta_{\mathcal{T}} (\mathcal{S}_x) \\ & \theta_{\mathcal{T}} = k(\cdot, \mathcal{T}_x) \big(k(\mathcal{T}_x, \mathcal{T}_x) + \lambda \mathbf{I}\big)^{-1} \mathcal{T}_y, \; k = \text{NTK} \end{align*}\]
  • Dataset Condensation via Image Decomposition (Chen et al. 2024, ongoing): DC + low rank decomposition,
\[\min_{\mathcal{S}_r} \; L_{DC} (\mathcal{S}_r, \mathcal{T}), \; \mathcal{S}_r = \{\mathbf{XX}^T: \text{rank} (\mathbf{X}) = r\}\]
  • Dataset Condensation via Input Subspace Projection (Chen et al. 2024, ongoing): DC + data span,
\[\min_{\mathcal{S}_p} \; L_{DC} (\mathcal{S}_p, \mathcal{T}), \; \mathcal{S}_p \in \text{span} (\mathcal{T})\]
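A minimal PyTorch sketch of the latent-space parameterization shared by the generative approaches above: the synthetic set is parameterized by latent codes $\mathcal{Z}$ fed through a frozen generator $G$, and a condensation loss is back-propagated through $G$ into $\mathcal{Z}$. The tiny MLP generator, the frozen random feature extractor, and the mean-feature-matching stand-in for $L_{DC}$ are all assumptions.

```python
# Latent-space condensation sketch: optimize latent codes Z behind a frozen G.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32))
for p in G.parameters():
    p.requires_grad_(False)                  # stands in for a pre-trained, frozen generator

Z = torch.randn(50, 64, requires_grad=True)  # latent codes: the actual variables
opt = torch.optim.Adam([Z], lr=1e-2)

X_T = torch.randn(1000, 3 * 32 * 32)         # flattened real images (placeholder)
feat = nn.Linear(3 * 32 * 32, 128)           # random feature extractor (frozen)
for p in feat.parameters():
    p.requires_grad_(False)

for step in range(200):
    X_S = G(Z)                               # decode latents into synthetic images
    loss = (feat(X_S).mean(0) - feat(X_T).mean(0)).pow(2).sum()   # L_DC stand-in
    opt.zero_grad(); loss.backward(); opt.step()
```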

Model space

  • Dataset Condensation via Expert Subspace Projection (Ma et al. 2023, Sensors) [link]: DC + parameter span,
\[\begin{align*} & \theta^{t+1}_{\mathcal{S}^t} \leftarrow \mathcal{P}_{\mathbb{S}_{\tau}} \big(\theta^t_{\mathcal{S}^t} - \eta \nabla_{\theta} L_{\mathcal{S}} (\theta^t_{\mathcal{S}^t})\big) \\ & \mathcal{S}^{t+1} \leftarrow \mathcal{S}^t - \alpha \nabla_{\mathcal{S}} L_{\mathcal{T}} (\theta^{t+1}_{\mathcal{S}^t}) \end{align*}\]
  • Efficient Dataset Distillation using Random Feature Approximation (Loo et al. 2022, NeurIPS) [link]: Replace NTK by Neural Network Gaussian Process (NNGP),
\[\min_{\mathcal{S}} \; d(\mathcal{T}_y, \theta_{\mathcal{S}} (\mathcal{T}_x)), \; \theta_{\mathcal{S}} = k(\cdot, \mathcal{S}_x) \big(k(\mathcal{S}_x, \mathcal{S}_x) + \lambda \mathbf{I}\big)^{-1} \mathcal{S}_y, \; k = \text{NNGP}\]
  • Dataset Distillation using Neural Feature Regression (Zhou et al. 2022, NeurIPS) [link]: Regression only,
\[\min_{\mathcal{S}} \; d(\mathcal{T}_y, \theta_{\mathcal{S}} (\mathcal{T}_x)), \; \theta_{\mathcal{S}} = k(\cdot, \mathcal{S}_x) \big(k(\mathcal{S}_x, \mathcal{S}_x) + \lambda \mathbf{I}\big)^{-1} \mathcal{S}_y, \; k = \text{NFK}\]
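A minimal NumPy sketch of the random-feature idea above: the exact kernel is replaced by an inner product of features $\varphi(\mathbf{x})$ from a randomly initialized ReLU layer, a crude stand-in for an NNGP/NTK approximation; the width and scaling are assumptions.

```python
# Random-feature kernel ridge regression sketch (k(x, x') ~= phi(x) . phi(x')).
import numpy as np

rng = np.random.default_rng(0)
n, width = 32, 2048
W1 = rng.normal(size=(n, width)) / np.sqrt(n)

def phi(X):
    return np.maximum(X @ W1, 0.0) / np.sqrt(width)   # random ReLU features

def rfa_kernel(A, B):
    return phi(A) @ phi(B).T                          # finite-width kernel approximation

Xs, Ys = rng.normal(size=(20, n)), rng.normal(size=(20, 10))
Xt = rng.normal(size=(500, n))
lam = 1e-3
pred = rfa_kernel(Xt, Xs) @ np.linalg.solve(rfa_kernel(Xs, Xs) + lam * np.eye(20), Ys)
print(pred.shape)   # predictions theta_S(T_x) to compare against T_y
```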

Loss Improvement

  • Dataset Condensation with Contrastive Signals (Lee et al. 2022, ICML) [link]: Variant of gradient matching,
\[\mathbb{E}_{\theta_{1:T}} \big[d (\mathbb{E}_{c} [\nabla_{\theta} L_{\mathcal{S}} (\theta)], \mathbb{E}_{c} [\nabla_{\theta} L_{\mathcal{T}} (\theta)])\big]\]
  • Improved Distribution Matching for Dataset Condensation (Zhao et al. 2023, CVPR) [link]: Distribution matching + multi-formation + data augmentation + model enrichment,
\[\max_{\theta \sim \Theta} \; \bigg(\big\vert\mathbb{E}_{X \sim \mu_{\mathcal{S}}} [\theta(\mathcal{A} \circ f(X))] - \mathbb{E}_{Y \sim \mu_{\mathcal{T}}} [\theta(\mathcal{A} \circ f (Y))]\big\vert + \underbrace{\mathcal{R}_{task} (\theta)}_{\text{performance loss}}\bigg), \; \mathcal{A} \circ f: \mathbb{R}^n \rightarrow \mathbb{R}^{l^2 n}\]
  • DataDAM: Efficient Dataset Distillation with Attention Matching (Sajedi et al. 2023, ICCV) [link]: Distribution Matching + spatial attention matching (SAM) + $\ell_2$-normalization,
\[\max_{\theta} \; \bigg(\big\vert\mathbb{E}_{X \sim \mu_{\mathcal{S}}} [\theta(X)] - \mathbb{E}_{Y \sim \mu_{\mathcal{T}}} [\theta(Y)]\big\vert + \underbrace{\mathcal{R}_{task} (\theta)}_{\text{performance loss}}\bigg)\]
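A minimal PyTorch sketch of the spatial attention matching (SAM) used by DataDAM above: per-layer feature maps are pooled over channels into attention maps, $\ell_2$-normalized, and matched between $\mathcal{S}$ and $\mathcal{T}$. The power $p = 2$ and the mean-over-batch reduction are assumptions.

```python
# Spatial attention matching sketch over a list of per-layer feature maps.
import torch
import torch.nn.functional as F

def attention_map(feat, p=2):
    """feat: (B, C, H, W) -> (B, H*W) channel-pooled, L2-normalised attention."""
    a = feat.abs().pow(p).sum(dim=1).flatten(1)        # sum |A_c|^p over channels
    return F.normalize(a, dim=1)

def sam_loss(feats_S, feats_T):
    """Sum over layers of the distance between mean attention maps of S and T."""
    loss = 0.0
    for fs, ft in zip(feats_S, feats_T):
        loss = loss + (attention_map(fs).mean(0) - attention_map(ft).mean(0)).pow(2).sum()
    return loss

feats_S = [torch.randn(50, 64, 16, 16, requires_grad=True)]   # toy feature maps
feats_T = [torch.randn(500, 64, 16, 16)]
print(sam_loss(feats_S, feats_T))
```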

Efficiency Improvement

  • DREAM: Efficient Dataset Distillation by Representative Matching (Liu et al. 2023, ICCV) [link]: Gradient matching + K-means clustering,
\[\min_{\mathcal{S}} \mathbb{E}_{c} \big[d (\mathbb{E}_{\theta_{1:T}} [\nabla_{\theta} L_{\mathcal{S}} (\theta)], \mathbb{E}_{\theta_{1:T}} [\nabla_{\theta} L_{\mathcal{T}_K} (\theta)])\big], \; \mathcal{T}_K = \text{centers of K-means clustering}\]
  • Embarrassingly Simple Dataset Distillation (Feng et al. 2023, NeurIPS workshop) [link]: Randomly truncated/subsampled BPTT unrolling.
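A minimal sketch of DREAM-style representative matching from the list above: instead of sampling a random real mini-batch per class, the K-means cluster centers of that class serve as $\mathcal{T}_K$ in the matching loss. scikit-learn's KMeans is used for brevity, and the batch size $K = 64$ is an assumption.

```python
# Representative matching sketch: K-means centers replace random real batches.
import numpy as np
from sklearn.cluster import KMeans

def representative_batch(X_class, k=64, seed=0):
    """X_class: (N_c, n) real samples of one class -> (k, n) cluster centers."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_class)
    return km.cluster_centers_

rng = np.random.default_rng(0)
X_class = rng.normal(size=(5000, 32))        # flattened real images of one class
T_K = representative_batch(X_class, k=64)    # T_K in the objective above
print(T_K.shape)                             # (64, 32)
```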

Applications

Privacy

  • Private Set Generation with Discriminative Information (Chen et al. 2022, NeurIPS) [link]: Gradient matching for private set generation.

  • Connect the dots: Dataset Condensation, Differential Privacy, and Adversarial Uncertainty (Kenneth Odoh 202, arxiv) [link]

  • DP-MERF: Differentially Private Mean Embeddings with Random Features for Practical Privacy-Preserving Data Generation (Harder et al. 2021, AISTATS) [link]: MMD for DP.

  • Privacy for Free: How does Dataset Condensation Help Privacy? (Dong et al. 2022, ICML) [link]

  • No Free Lunch in “Privacy for Free: How does Dataset Condensation Help Privacy” (Carlini et al. 2022, arxiv) [link]

Robustness

  • Robustness May Be at Odds with Accuracy (Tsipras et al. 2019, ICLR) [link]

  • Adversarial Examples Are Not Bugs, They Are Features (Ilyas et al. 2019, NeurIPS) [link]

  • Can we achieve robustness from data alone? (Tsilivis et al. 2022, arxiv) [link]

  • Towards Robust Dataset Learning (Wu et al. 2022, arxiv) [link]

  • Towards Adversarially Robust Dataset Distillation by Curvature Regularization (Xue et al. 2024, arxiv) [link]

Graph Neural Network

  • Graph Condensation for Graph Neural Networks (Jin et al. 2022, ICLR) [link]: Gradient matching for GNN.

  • Condensing Graphs via One-Step Gradient Matching (Jin et al. 2022, NeurIPS workshop) [link]: Gradient matching for GNN.

  • Graph Condensation via Receptive Field Distribution Matching (Liu et al. 2022, arxiv) [link]: Distribution matching for GNN.

  • Kernel Ridge Regression-Based Graph Dataset Distillation (Xu et al. 2023, KDD) [link]: KIP for GNN.

Medical Image Analysis

  • Progressive trajectory matching for medical dataset distillation (Yu et al. 2024, arxiv) [link]: Trajectory matching for MedMNIST.

Reviews

  • Dataset Distillation: A Comprehensive Review (Yu et al. 2023, TPAMI) [link]

  • A Survey on Dataset Distillation: Approaches, Applications and Future Directions (Geng et al. 2023, IJCAI) [link]

  • A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness (Chen et al. 2023, arxiv) [link]

  • Data Distillation: A Survey (Sachdeva et al. 2023, TMLR) [link]

  • A comprehensive survey of dataset distillation (Lei et al. 2023, TPAMI) [link]

  • A review of dataset distillation for deep learning (Le et al. 2022, PlatCon) [link]

Benchmarks

  • DC-BENCH: Dataset Condensation Benchmark (Cui et al. 2022, NeurIPS) [link]

  • DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation (Wu et al. 2024, arxiv) [link]

  • Awesome Dataset Distillation [link]

  • DDCV: 1st CVPR Workshop on Dataset Distillation [link]