Model Compression: Theory, Practice, and Beyond


Published: April 16, 2024

Background

Given a pretrained model $f_{\theta}$ with parameters $\theta \in \mathbb{R}^D$, model compression seeks a smaller model $f_{\theta_c}$ with $\theta_c \in \mathbb{R}^{D_c}$, $D_c \ll D$, such that the drop in generalization performance is minimized:

\[\min_{\theta_c} \; \vert L (\theta_c) - L (\theta) \vert, \; \text{s.t. } \text{size} (\theta_c) \le B,\]

where the budget $B$ can be measured in parameters, FLOPs, memory, or energy. This is the model-space counterpart of dataset condensation: instead of compressing the dataset while fixing the model class, we compress the model while fixing the data distribution. The four classical families are knowledge distillation, pruning, quantization, and tensor decomposition; they are largely complementary and are often combined in practice.

Knowledge Distillation

Knowledge distillation (KD) [Hinton et al. 2015] transfers knowledge from a large teacher $f_t$ to a small student $f_s$ by matching softened output distributions. With logits $z$ and temperature $T$, define $\sigma_T (z) = \text{softmax} (z / T)$. The student is trained with

\[L_{KD} = (1 - \lambda) \, \text{CE} (y, \sigma_1 (z_s)) + \lambda T^2 \, \mathcal{D}_{KL} (\sigma_T (z_t) \vert\vert \sigma_T (z_s)).\]

The gradient of the soft term with respect to the student logits scales as $1 / T^2$, so the factor $T^2$ keeps its magnitude comparable to that of the hard term when $T$ is changed. Beyond logits, one can match:

Intermediate features (FitNets) [Romero et al. 2015]: $L_{hint} = \vert\vert \phi_t (x) - r (\phi_s (x)) \vert\vert^2$, where $r$ is a learned regressor aligning dimensions.
Attention maps [Zagoruyko et al. 2017]: match the spatial attention maps $\sum_c \vert A_c \vert^2$ of teacher and student, where $A_c$ is channel $c$ of an intermediate activation.
Relations between samples [Park et al. 2019]: match pairwise distances/angles $\psi (x_i, x_j)$ instead of individual outputs, i.e., distill the geometry of the feature space rather than the features themselves.

Note the structural similarity with distribution-oriented dataset condensation: both minimize a discrepancy between two representations of the same data, only the optimization variable differs (student weights vs. synthetic samples).
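The logit-matching loss $L_{KD}$ above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `softmax` and `kd_loss`, the defaults `T=4.0` and `lam=0.5`, and the small `eps` added inside the logarithms are choices made here, not part of the original formulation, and the loss is averaged over the batch.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax sigma_T(z) along the last axis (numerically stable)."""
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(z_s, z_t, y, T=4.0, lam=0.5, eps=1e-12):
    """(1 - lam) * CE(y, sigma_1(z_s)) + lam * T^2 * KL(sigma_T(z_t) || sigma_T(z_s)), batch mean."""
    y = np.asarray(y)
    # Hard term: cross-entropy of the student at temperature 1 against integer labels y.
    p_s1 = softmax(z_s, 1.0)
    ce = -np.log(p_s1[np.arange(len(y)), y] + eps).mean()
    # Soft term: KL divergence between softened teacher and student distributions.
    p_t = softmax(z_t, T)
    p_s = softmax(z_s, T)
    kl = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=-1).mean()
    return (1.0 - lam) * ce + lam * T**2 * kl

# Toy usage with random logits (hypothetical data, 8 samples and 10 classes).
rng = np.random.default_rng(0)
z_teacher = rng.normal(size=(8, 10))
z_student = rng.normal(size=(8, 10))
labels = rng.integers(0, 10, size=8)
print(kd_loss(z_student, z_teacher, labels))
```

When the student logits equal the teacher logits, the KL term vanishes and only the weighted cross-entropy remains, which is a quick sanity check for any implementation.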

Model Pruning

Pruning removes parameters by applying a binary mask $\mathbf{m} \in \lbrace 0, 1 \rbrace^D$:

\[\min_{\mathbf{m}, \theta} \; L (\theta \odot \mathbf{m}), \; \text{s.t. } \vert\vert \mathbf{m} \vert\vert_0 \le k.\]

This is a combinatorial problem; all practical methods are heuristics for scoring the saliency of each weight.

Magnitude pruning [Han et al. 2015]: saliency $s_i = \vert \theta_i \vert$; iteratively prune and fine-tune. Simple and still a strong baseline.
Hessian-based pruning (Optimal Brain Damage / Surgeon) [LeCun et al. 1990, Hassibi et al. 1993]: a second-order Taylor expansion of the loss around a trained minimum gives $\delta L \approx \frac{1}{2} \theta_i^2 H_{ii}$ when weight $i$ is removed (OBD, diagonal approximation), or $\delta L = \frac{\theta_i^2}{2 [H^{-1}]_{ii}}$ when the remaining weights are also updated optimally using the full inverse Hessian (OBS).
Lottery Ticket Hypothesis [Frankle et al. 2019]: a randomly initialized dense network contains a sparse subnetwork (“winning ticket”) that, trained in isolation from the original initialization, matches the dense accuracy.
Pruning at initialization: SNIP [Lee et al. 2019] uses connection sensitivity $\vert \theta_i \cdot \nabla_{\theta_i} L \vert$; GraSP [Wang et al. 2020] preserves gradient flow via the Hessian-gradient product; SynFlow [Tanaka et al. 2020] is data-free and avoids layer collapse.
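To make the saliency scores above concrete, the following sketch compares magnitude, OBD, and OBS on a toy quadratic loss, where the second-order expansion is exact. Everything here is an assumption made for illustration: the loss $L(t) = \frac{1}{2} (t - \theta)^T H (t - \theta)$, the random positive-definite `H`, and the helper names `loss` and `obs_prune`. For a real network the Hessian has to be approximated, which is the hard part in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6
A = rng.normal(size=(D, D))
H = A @ A.T + D * np.eye(D)       # symmetric positive-definite Hessian (toy)
theta = rng.normal(size=D)        # trained weights, i.e. the minimizer of the toy loss
H_inv = np.linalg.inv(H)

def loss(t):
    """Quadratic loss that is zero at the trained weights theta."""
    d = t - theta
    return 0.5 * d @ H @ d

# Three saliency scores, one entry per weight.
magnitude = np.abs(theta)
obd = 0.5 * theta**2 * np.diag(H)            # zero weight i, leave the others untouched
obs = theta**2 / (2.0 * np.diag(H_inv))      # zero weight i, re-optimize the others

def obs_prune(i):
    """Zero weight i and apply the OBS compensation to the remaining weights."""
    delta = -(theta[i] / H_inv[i, i]) * H_inv[:, i]
    return theta + delta

i = int(np.argmin(obs))           # cheapest weight to remove according to OBS
pruned = obs_prune(i)
print("pruned index:", i)
print("loss increase:", loss(pruned), "predicted by OBS:", obs[i])
```

On this toy problem the OBS score is never larger than the OBD score, because OBS is allowed to adjust the surviving weights. The two rankings can therefore disagree with each other and with plain magnitude.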

Structured pruning removes entire neurons/channels/heads instead of individual weights, trading a worse accuracy-sparsity frontier for actual hardware speedup. For an exact (but expensive) formulation of pruning as a polynomial optimization problem, see the ReLU network pruning section of my POP post.

Interactive illustration: magnitude pruning

A small MLP (1-24-24-1) is trained in your browser to fit a 1D function (gray curve: target; blue curve: network output). Drag the slider to prune the smallest-magnitude weights — the fit survives surprisingly high sparsity, then collapses.
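The masking step behind the slider can be sketched as follows. This is not the code of the interactive demo: it skips training entirely, uses random weights with the same 1-24-24-1 layer shapes, and assumes that weights are ranked globally across layers (the demo may rank them differently). The name `magnitude_masks` is hypothetical.

```python
import numpy as np

def magnitude_masks(weights, sparsity):
    """Binary masks that zero the `sparsity` fraction of smallest-|w| entries, ranked globally."""
    flat = np.concatenate([np.abs(w).ravel() for w in weights])
    k = int(round(sparsity * flat.size))            # number of weights to remove
    if k == 0:
        return [np.ones_like(w) for w in weights]
    threshold = np.partition(flat, k - 1)[k - 1]    # k-th smallest magnitude
    # Keep only weights strictly above the threshold (ties at the threshold are pruned too).
    return [(np.abs(w) > threshold).astype(w.dtype) for w in weights]

# Random stand-ins for the weight matrices of a 1-24-24-1 MLP (biases are not pruned).
rng = np.random.default_rng(0)
shapes = [(24, 1), (24, 24), (1, 24)]
weights = [rng.normal(size=s) for s in shapes]

masks = magnitude_masks(weights, sparsity=0.8)
pruned = [w * m for w, m in zip(weights, masks)]

kept = int(sum(m.sum() for m in masks))
total = sum(w.size for w in weights)
print(f"kept {kept} of {total} weights")
```

In an iterative prune-and-fine-tune loop, the masks would be recomputed at increasing sparsity levels and reapplied after every gradient step so that pruned weights stay at zero.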


Quantization

Quantization maps full-precision weights/activations to a low-bit grid. The uniform affine quantizer with scale $s$, zero-point $z$, and bit-width $b$ is

\[Q (x) = s \cdot \bigg(\text{clip} \bigg(\bigg\lfloor \frac{x}{s} \bigg\rceil + z, \; 0, \; 2^b - 1\bigg) - z\bigg).\]
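A minimal NumPy sketch of this quantizer is given below. The formula for $Q$ is taken from the text; the min/max calibration in `affine_params` is one common choice assumed here for illustration, and both function names are hypothetical. Note that `np.rint` rounds halves to the nearest even integer, and that the sketch assumes a non-degenerate range ($x_{\max} > x_{\min}$).

```python
import numpy as np

def affine_params(x_min, x_max, b):
    """Min/max calibration: choose scale s and zero-point z so the range maps onto {0, ..., 2^b - 1}."""
    # Extend the range to include 0 so that zero is exactly representable.
    x_min, x_max = min(float(x_min), 0.0), max(float(x_max), 0.0)
    s = (x_max - x_min) / (2**b - 1)
    z = int(np.rint(-x_min / s))
    return s, z

def quantize(x, s, z, b):
    """Q(x) = s * (clip(round(x / s) + z, 0, 2^b - 1) - z), i.e. quantize then dequantize."""
    q = np.clip(np.rint(x / s) + z, 0, 2**b - 1)    # integer grid index
    return s * (q - z)

# Per-tensor 8-bit quantization of a random tensor (hypothetical data).
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
b = 8
s, z = affine_params(x.min(), x.max(), b)
x_q = quantize(x, s, z, b)
max_err = np.abs(x - x_q).max()
print("scale:", s, "zero-point:", z, "max abs error:", max_err)
```

Inside the calibrated range the rounding error is at most about half a grid step, $s / 2$. Values outside the range are clipped and can incur arbitrarily large error, which is why the choice of $(s, z)$ matters.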

Two regimes:

Post-training quantization (PTQ): quantize a trained model using a small calibration set to fit $(s, z)$ per tensor or per channel; no retraining. This typically works well down to 8 bits, while accuracy tends to degrade at lower bit-widths unless more careful methods are used.
Quantization-aware training (QAT): insert fake-quantization in the forward pass and train through it. The rounding operator has zero gradient almost everywhere, so one uses the straight-through estimator (STE) [Bengio et al. 2013]: $\frac{\partial Q(x)}{\partial x} \approx \mathbf{1}_{\lbrace x \in [x_{\min}, x_{\max}] \rbrace}$.
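The STE can be illustrated with a hand-written forward and backward pass. This is a sketch, not how a deep learning framework implements QAT: the function names are hypothetical, the range $[x_{\min}, x_{\max}]$ is taken to be the representable range $[s (0 - z), \, s (2^b - 1 - z)]$ of the quantizer above, and the toy objective (pulling quantized weights toward a fixed target) is an assumption chosen only to produce a gradient.

```python
import numpy as np

def fake_quant_forward(x, s, z, b):
    """Quantize-dequantize x and return the output together with the STE mask."""
    x_min, x_max = s * (0 - z), s * (2**b - 1 - z)      # representable range
    y = s * (np.clip(np.rint(x / s) + z, 0, 2**b - 1) - z)
    mask = ((x >= x_min) & (x <= x_max)).astype(x.dtype)
    return y, mask

def fake_quant_backward(grad_y, mask):
    """STE backward pass: copy the gradient inside the range, block it outside."""
    return grad_y * mask

# One gradient step on latent full-precision weights w for the loss 0.5 * ||Q(w) - target||^2.
s, z, b = 0.1, 8, 4                       # 4-bit grid covering roughly [-0.8, 0.7]
w = np.array([-10.0, 0.1, 0.33, 10.0])
target = np.zeros_like(w)

y, mask = fake_quant_forward(w, s, z, b)
grad_y = y - target                       # dLoss/dy
grad_w = fake_quant_backward(grad_y, mask)
w_new = w - 0.5 * grad_w                  # latent weights are updated in full precision
print("Q(w):", y)
print("STE mask:", mask)
print("updated w:", w_new)
```

The two out-of-range entries receive zero gradient and do not move, which is the behavior the indicator function in the STE formula encodes.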

The extreme case is binary/ternary networks [Courbariaux et al. 2016, Rastegari et al. 2016], where weights (and possibly activations) live in $\lbrace -1, +1 \rbrace$ (binary) or $\lbrace -1, 0, +1 \rbrace$ (ternary); when both weights and activations are binary, convolutions reduce to XNOR-popcount. For modern LLM-scale PTQ, see GPTQ [Frantar et al. 2022] (Hessian-based, closely related to OBS above) and AWQ [Lin et al. 2023].

Tensor Decomposition

Linear and convolutional layers are (multi-)linear maps, hence compressible by low-rank factorization:

SVD for fully-connected layers: $\mathbf{W} \approx \mathbf{U}_r \mathbf{\Sigma}_r \mathbf{V}_r^T$ replaces one $m \times n$ layer by two layers of sizes $m \times r$ and $r \times n$; storage drops from $mn$ to $r (m + n)$.
CP decomposition of conv kernels [Lebedev et al. 2015]: a 4-way kernel $\mathcal{K} \in \mathbb{R}^{d \times d \times C_{in} \times C_{out}}$ is written as a sum of $r$ rank-one tensors, turning one convolution into a sequence of cheap 1D convolutions.
Tucker decomposition [Kim et al. 2016]: factor only the channel modes, $\mathcal{K} \approx \mathcal{G} \times_3 \mathbf{U}_{in} \times_4 \mathbf{U}_{out}$, with ranks selected by variational Bayesian matrix factorization.
Tensor-train (TT) format [Novikov et al. 2015]: reshape a weight matrix into a high-order tensor and store it as a chain of 3-way cores; for fixed TT-ranks, the parameter count becomes linear in the order of the tensor.
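The simplest entry in this list, truncated SVD of a fully-connected layer, can be sketched as follows. The helper name `low_rank_factor` and the layer sizes are assumptions for illustration; the singular values are absorbed into the left factor, which is one of several equivalent conventions.

```python
import numpy as np

def low_rank_factor(W, r):
    """Return A (m x r) and B (r x n) such that A @ B is the best rank-r approximation of W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]      # absorb the top-r singular values into the left factor
    B = Vt[:r, :]
    return A, B

# Replace one 64 x 48 layer by two thin layers of rank 8 (hypothetical sizes).
rng = np.random.default_rng(0)
m, n, r = 64, 48, 8
W = rng.normal(size=(m, n))
A, B = low_rank_factor(W, r)

x = rng.normal(size=n)
y_full = W @ x               # original layer: m * n multiply-adds
y_low = A @ (B @ x)          # factored layer: r * (m + n) multiply-adds

params_full, params_low = m * n, r * (m + n)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"parameters: {params_full} -> {params_low}, relative error: {rel_err:.3f}")
```

A random Gaussian matrix has a fairly flat spectrum, so the relative error printed here is pessimistic; the approximation is only useful when the singular values of the actual weight matrix decay quickly, which has to be checked layer by layer.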

The rank $r$ is the crucial hyperparameter and is usually tuned empirically — automating this choice with ML models is precisely the low-rank estimation problem mentioned in my research interests (ML4POP).

Training-free Compression

Most methods above require access to the training data (for fine-tuning, calibration, or distillation). A more challenging setting is compression without (real) data:

Data-free KD [Nayak et al. 2019, Yin et al. 2020 (DeepInversion)]: synthesize pseudo-samples by inverting the teacher (matching batch-norm statistics, maximizing class confidence), then distill on them. Note that the synthesis step is essentially dataset condensation with the teacher as the only supervision signal.
Data-free pruning: SynFlow [Tanaka et al. 2020] scores weights using an all-ones input, requiring no data at all; merging/averaging neurons by similarity [Srinivas et al. 2015].
Data-free quantization: DFQ [Nagel et al. 2019] equalizes weight ranges across layers and corrects biases analytically, using only batch-norm statistics.
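As an example of scoring without any data, here is a sketch of the SynFlow score for a bias-free MLP. It follows the published definition, $R = \mathbf{1}^T \big(\prod_l \vert \theta^{[l]} \vert\big) \mathbf{1}$ with per-weight score $\theta \odot \partial R / \partial \theta$, but it is a single scoring pass only: the iterative pruning schedule that SynFlow relies on to avoid layer collapse is omitted, and the function name and layer sizes are assumptions made here.

```python
import numpy as np

def synflow_scores(weights):
    """SynFlow-style scores for a bias-free MLP [W_1, ..., W_L], each W_l of shape (out_l, in_l).

    R = 1^T |W_L| ... |W_1| 1 and score_l = |W_l| * dR/d|W_l| (elementwise product).
    """
    A = [np.abs(W) for W in weights]
    # forward[l]: input to layer l when the network input is the all-ones vector.
    forward = [np.ones(A[0].shape[1])]
    for a in A:
        forward.append(a @ forward[-1])
    # backward[l]: gradient of R with respect to the input of layer l.
    backward = [np.ones(A[-1].shape[0])]
    for a in reversed(A):
        backward.append(a.T @ backward[-1])
    backward = backward[::-1]
    R = forward[-1].sum()
    # dR/d|W_l| is the outer product of the gradient at the layer output and the layer input.
    scores = [a * np.outer(backward[l + 1], forward[l]) for l, a in enumerate(A)]
    return scores, R

# Toy 4-16-16-3 network with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
shapes = [(16, 4), (16, 16), (3, 16)]
weights = [rng.normal(size=s) for s in shapes]
scores, R = synflow_scores(weights)
print("R =", R)
print("per-layer score totals:", [float(sc.sum()) for sc in scores])
```

The scores in every layer sum to the same value $R$. This conservation property means no layer can be starved of total score, which is the intuition behind the claim that iterative SynFlow avoids layer collapse.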

The interplay between data-free compression and dataset condensation is, in my view, the most interesting open direction: both ask what the minimal sufficient statistics of a dataset (or of a model trained on it) are, one from the data side and one from the model side. See my dataset condensation post for the mirror image of this question.


Tong CHEN
