Dataset Condensation: Theory, Practice, and Beyond
Background
Given a dataset $\mathcal{T} = \{\mathbf{x}_i\}_{i=1}^N \subseteq \mathbb{R}^{n}$ of size $N$ drawn from some unknown distribution $\mathcal{D}$, we would like to find a synthetic dataset $\mathcal{S} \subseteq \mathbb{R}^{n}$ of size $M \ll N$, such that the gap between the generalization losses of models trained on $\mathcal{T}$ and on $\mathcal{S}$ is minimized:
\[\min_{\mathcal{S}} \; \vert L_{\mathcal{T}} (\theta_{\mathcal{S}}) - L_{\mathcal{T}} (\theta_{\mathcal{T}}) \vert,\]where $\theta_{\mathcal{T}}$ is the minimizer of $L_{\mathcal{T}} (\theta)$ and $\theta_{\mathcal{S}}$ is the minimizer of $L_{\mathcal{S}} (\theta)$. In practice, we replace the true generalization loss everywhere by its empirical estimate. Note that $L_{\mathcal{T}} (\theta_{\mathcal{S}}) \ge L_{\mathcal{T}} (\theta_{\mathcal{T}})$ and that $L_{\mathcal{T}} (\theta_{\mathcal{T}})$ does not depend on $\mathcal{S}$. Hence it is equivalent to minimize $L_{\mathcal{T}} (\theta_{\mathcal{S}})$ over $\mathcal{S}$, which gives the following bi-level optimization problem:
\[\min_{\mathcal{S}} \; L_{\mathcal{T}} (\theta_{\mathcal{S}}), \; \text{s.t. } \theta_{\mathcal{S}} = \arg\min_{\theta} \; L_{\mathcal{S}} (\theta).\]
Performance-oriented Condensation
In this category, all methods aim to maximize the generalization performance of a model trained on the synthetic dataset.
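To make the bi-level objective concrete, below is a minimal sketch of the unrolled, BPTT-style approach used by the optimization-based methods listed next. It uses a linear classifier written functionally so plain PyTorch autograd can differentiate through the inner SGD steps; the shapes, learning rates, and variable names are illustrative assumptions, not any paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def inner_loss(w, b, x, y):
    # Cross-entropy of a linear classifier on a batch.
    return F.cross_entropy(x @ w + b, y)

def bptt_outer_loss(syn_x, syn_y, real_x, real_y, dim, num_classes,
                    inner_steps=10, inner_lr=0.1):
    # Inner problem: train a fresh model on the synthetic set, keeping the
    # computation graph (create_graph=True) so gradients flow back to syn_x.
    w = torch.zeros(dim, num_classes, requires_grad=True)
    b = torch.zeros(num_classes, requires_grad=True)
    for _ in range(inner_steps):
        g_w, g_b = torch.autograd.grad(inner_loss(w, b, syn_x, syn_y),
                                       (w, b), create_graph=True)
        w, b = w - inner_lr * g_w, b - inner_lr * g_b
    # Outer objective: loss of the inner solution on real data, i.e. L_T(theta_S).
    return inner_loss(w, b, real_x, real_y)

# Usage sketch: optimize the synthetic images directly with an outer optimizer.
dim, num_classes, m = 784, 10, 100
syn_x = torch.randn(m, dim, requires_grad=True)
syn_y = torch.arange(m) % num_classes                 # fixed synthetic labels
real_x, real_y = torch.randn(256, dim), torch.randint(0, num_classes, (256,))
opt = torch.optim.Adam([syn_x], lr=0.01)
loss = bptt_outer_loss(syn_x, syn_y, real_x, real_y, dim, num_classes)
opt.zero_grad(); loss.backward(); opt.step()
```

Unrolling the inner loop in this way is exactly what makes these methods memory-hungry, which motivates the implicit-gradient and matching-based alternatives below.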
Optimization-based
- Dataset Distillation (Wang et al. 2018, arxiv) [link]: Backpropagation-through-time (BPTT),
- Dataset Distillation with Convexified Implicit Gradients (Loo et al. 2023, ICML) [link]: Implicit differentiation,
Gradient-based
- Dataset Condensation with Gradient Matching (Zhao et al. 2020, ICLR) [link]: Gradient matching (a minimal sketch follows this list),
- Loss-Curvature Matching for Dataset Selection and Condensation (Shin et al. 2023, AISTATS) [link]: Gradient matching + regularization,
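A hedged sketch of the gradient-matching idea referenced above, in the spirit of Zhao et al. (2020): compare the gradients a network produces on the synthetic batch with those it produces on a real batch, layer by layer. The layer-wise cosine distance and the helper names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, syn_x, syn_y, real_x, real_y):
    params = [p for p in model.parameters() if p.requires_grad]
    # Gradient on the synthetic batch; keep the graph so the matching loss
    # stays differentiable with respect to syn_x.
    g_syn = torch.autograd.grad(F.cross_entropy(model(syn_x), syn_y),
                                params, create_graph=True)
    # Gradient on the real batch serves as a fixed target.
    g_real = torch.autograd.grad(F.cross_entropy(model(real_x), real_y), params)
    # Sum of layer-wise cosine distances between the two gradient sets.
    loss = 0.0
    for gs, gr in zip(g_syn, g_real):
        loss = loss + (1.0 - F.cosine_similarity(gs.flatten(),
                                                 gr.detach().flatten(), dim=0))
    return loss

# Usage sketch with a linear model and random data.
model = torch.nn.Linear(784, 10)
syn_x = torch.randn(100, 784, requires_grad=True)
syn_y = torch.arange(100) % 10
real_x, real_y = torch.randn(256, 784), torch.randint(0, 10, (256,))
gradient_matching_loss(model, syn_x, syn_y, real_x, real_y).backward()  # populates syn_x.grad
```

In the full method, this matching loss is averaged over many random network initializations and interleaved with a few training steps of the network on the synthetic set; only the matching term itself is shown here.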
Parameter-based
- Dataset Distillation by Matching Training Trajectories (Cazenavette et al. 2022, CVPR) [link]: Trajectory matching,
Feature-based
- Dataset Distillation with Infinitely Wide Convolutional Networks (Nguyen et al. 2021, NeurIPS) [link]: Regression only,
- Dataset Meta-Learning from Kernel Ridge-Regression (Nguyen et al. 2021, ICLR) [link]: Regression only,
- CAFE: Learning to Condense Dataset by Aligning Features (Wang et al. 2022, CVPR) [link]: Intermediate feature matching + output feature matching + regularization,
Distribution-oriented Condensation
The main issue with performance-oriented condensation methods is that the synthetic dataset is likely to overfit to downstream performance: performance is only one aspect along which two probability distributions can be compared, so the synthetic dataset might follow a totally different distribution from the real one while preserving the same downstream performance.
The core of distribution-oriented condensation methods is a metric that measures the distance between the two distributions:
\[\min_{\mathcal{S}} \; d(\mu_{\mathcal{S}}, \mu_{\mathcal{T}}),\]where $\mu_{\mathcal{T}}$ is the empirical distribution of $\mathcal{T}$, defined as $\mu_{\mathcal{T}} = \frac{1}{N} \sum_{i = 1}^N \delta_{\mathbf{x}_i}$, and $\mu_{\mathcal{S}}$ is defined analogously over the $M$ synthetic points.
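As an example of such a metric, the sketch below matches mean feature embeddings, i.e. the squared MMD with a linear kernel on top of a randomly initialized encoder. The encoder architecture, sizes, and training loop are illustrative assumptions rather than the setup of any specific paper listed below.

```python
import torch

def mmd_loss(feat_syn, feat_real):
    # Squared MMD with a linear kernel: distance between the mean embeddings
    # of the synthetic and real feature distributions.
    return (feat_syn.mean(dim=0) - feat_real.mean(dim=0)).pow(2).sum()

# Usage sketch with a random feature extractor as the embedding function.
encoder = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(),
                              torch.nn.Linear(256, 128))
syn_x = torch.randn(100, 784, requires_grad=True)   # synthetic set, M = 100
real_x = torch.randn(512, 784)                       # a batch from T
opt = torch.optim.Adam([syn_x], lr=0.01)
for _ in range(100):
    with torch.no_grad():
        feat_real = encoder(real_x)                  # real features are a fixed target
    loss = mmd_loss(encoder(syn_x), feat_real)
    opt.zero_grad(); loss.backward(); opt.step()
```

Distribution-matching methods typically average such a loss over many randomly initialized encoders and match the distributions class by class.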
- Dataset Condensation with Distribution Matching (Zhao et al. 2021, WACV) [link]: Integral probability metric (IPM),
- M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy (Zhang et al. 2024, AAAI) [link]: MMD,
- Dataset Distillation via the Wasserstein Metric (Liu et al. 2024, arxiv) [link]: Wasserstein distance,
- Dataset Condensation by Minimal Finite Covering (Chen et al. 2024, arxiv) [link]: Hausdorff distance, average Hausdorff distance,
Variants
Data Augmentation
- Dataset Condensation with Differentiable Siamese Augmentation (Zhao et al. 2021, ICML) [link]: Gradient matching + data augmentation,
- Dataset Condensation via Efficient Synthetic-Data Parameterization (Kim et al. 2022, ICML) [link]: Gradient matching + multi-formation + data augmentation,
- Dataset Distillation with Channel Efficient Process (Zhou et al. 2024, ICASSP) [link]: Gradient matching + channel-wise multi-formation,
Space Change
Data space
- Synthesizing Informative Training Samples with GAN (Zhao et al. 2022, NeurIPS workshop) [link]: Distribution matching + pre-trained GAN + data augmentation,
- DiM: Distilling Dataset into Generative Model (Wang et al. 2023, arxiv) [link]: Distribution matching + GAN training,
- Dataset Condensation via Generative Model (Zhang et al. 2023, arxiv) [link]: Feature matching + GAN training + regularization,
- Generalizing Dataset Distillation via Deep Generative Prior (Cazenavette et al. 2023, CVPR) [link]: DC + pre-trained GAN,
- Dataset Distillation via Factorization (Liu et al. 2022, NeurIPS) [link]: DC + pre-trained hallucinator-extractor,
- Efficient Dataset Distillation via Minimax Diffusion (Gu et al. 2024, CVPR) [link]: Feature matching + diffusion training + regularization,
- MGDD: A Meta Generator for Fast Dataset Distillation (Liu et al. 2023, NeurIPS) [link]: Regression only, no backpropagation on $\mathcal{S}$,
- Dataset Condensation via Image Decomposition (Chen et al. 2024, ongoing): DC + low rank decomposition,
- Dataset Condensation via Input Subspace Projection (Chen et al. 2024, ongoing): DC + data span,
Model space
- Dataset Condensation via Expert Subspace Projection (Ma et al. 2023, Sensors) [link]: DC + parameter span,
- Efficient Dataset Distillation using Random Feature Approximation (Loo et al. 2022, NeurIPS) [link]: Replace NTK by Neural Network Gaussian Process (NNGP),
- Dataset Distillation using Neural Feature Regression (Zhou et al. 2022, NeurIPS) [link]: Regression only,
Loss Improvement
- Dataset Condensation with Contrastive Signals (Lee et al. 2022, ICML) [link]: Variant of gradient matching,
- Improved Distribution Matching for Dataset Condensation (Zhao et al. 2023, CVPR) [link]: Distribution matching + multi-formation + data augmentation + model enrichment,
- DataDAM: Efficient Dataset Distillation with Attention Matching (Sajedi et al. 2023, ICCV) [link]: Distribution matching + spatial attention matching (SAM) + $\ell_2$-normalization,
Efficiency Improvement
- DREAM: Efficient Dataset Distillation by Representative Matching (Liu et al. 2023, ICCV) [link]: Gradient matching + K-means clustering,
- Embarrassingly Simple Dataset Distillation (Feng et al. 2023, NeurIPS workshop) [link]: Randomized subsample of BPTT.
Applications
Privacy
- Private Set Generation with Discriminative Information (Chen et al. 2022, NeurIPS) [link]: Gradient matching for private set generation.
- Connect the dots: Dataset Condensation, Differential Privacy, and Adversarial Uncertainty (Odoh 202, arxiv) [link]
- DP-MERF: Differentially Private Mean Embeddings with Random Features for Practical Privacy-Preserving Data Generation (Harder et al. 2021, AISTATS) [link]: MMD for DP.
- Privacy for Free: How does Dataset Condensation Help Privacy? (Dong et al. 2022, ICML) [link]
- No Free Lunch in “Privacy for Free: How does Dataset Condensation Help Privacy” (Carlini et al. 2022, arxiv) [link]
Robustness
- Robustness May Be at Odds with Accuracy (Tsipras et al. 2019, ICLR) [link]
- Adversarial Examples Are Not Bugs, They Are Features (Ilyas et al. 2019, NeurIPS) [link]
- Can we achieve robustness from data alone? (Tsilivis et al. 2022, arxiv) [link]
- Towards Robust Dataset Learning (Wu et al. 2022, arxiv) [link]
- Towards Adversarially Robust Dataset Distillation by Curvature Regularization (Xue et al. 2024, arxiv) [link]
Graph Neural Network
- Graph Condensation for Graph Neural Networks (Jin et al. 2022, ICLR) [link]: Gradient matching for GNN.
- Condensing Graphs via One-Step Gradient Matching (Jin et al. 2022, NeurIPS workshop) [link]: Gradient matching for GNN.
- Graph Condensation via Receptive Field Distribution Matching (Liu et al. 2022, arxiv) [link]: Distribution matching for GNN.
- Kernel Ridge Regression-Based Graph Dataset Distillation (Xu et al. 2023, KDD) [link]: KIP for GNN.
Medical Image Analysis
- Progressive trajectory matching for medical dataset distillation (Yu et al. 2024, arxiv) [link]: Trajectory matching for MedMNIST.
Reviews
- Dataset Distillation: A Comprehensive Review (Yu et al. 2023, TPAMI) [link]
- A Survey on Dataset Distillation: Approaches, Applications and Future Directions (Geng et al. 2023, IJCAI) [link]
- A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness (Chen et al. 2023, arxiv) [link]
- Data Distillation: A Survey (Sachdeva et al. 2023, TMLR) [link]
- A Comprehensive Survey of Dataset Distillation (Lei et al. 2023, TPAMI) [link]
- A Review of Dataset Distillation for Deep Learning (Le et al. 2022, PlatCon) [link]
Benchmarks
- DC-BENCH: Dataset Condensation Benchmark (Cui et al. 2022, NeurIPS) [link]
- DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation (Wu et al. 2024, arxiv) [link]