TMPQ-DM: Joint Timestep Reduction and Quantization Precision Selection for Efficient Diffusion Models

Haojun Sun, Chen Tang, Zhi Wang, Yuan Meng, Jingyan Jiang, Xinzhu Ma, Wenwu Zhu

arXiv, 2024


1. Background

Diffusion models have emerged as preeminent contenders in the realm of generative models, showcasing remarkable efficacy across diverse tasks and real-world scenarios. Their distinctive sequential generative process, comprising hundreds or even thousands of timesteps, progressively reconstructs images from pure Gaussian noise, with each timestep requiring a full forward pass of the entire model. The resulting computational demands make deployment challenging, so quantization is widely used to lower the bit-width and reduce storage and computing overheads.

Nevertheless, current quantization methodologies focus primarily on model-side optimization and disregard the temporal dimension, i.e., the length of the timestep sequence. Redundant timesteps therefore continue to consume computational resources, leaving substantial room for accelerating the generative process.

Therefore, our method jointly optimizes timestep reduction and quantization to achieve a superior performance-efficiency trade-off, addressing both temporal and model optimization aspects.

2. Method

As shown in Fig. 1, we first observe that different mixed-precision settings can share the same precision for some layers. Hence, we introduce the mixed-precision solver, a pre-built super-network that shares calibration results across settings and thus needs only one-time calibration to serve as an accurate performance indicator.

We transform the original quantizer into a new quantizer that contains N sets of quantization parameters corresponding to N possible bit-widths. During calibration, we perform block-wise calibration on the entire model. Specifically, in each iteration, we randomly sample a bit-width b from the set of bit-width candidates, switch the quantization parameters of each quantizer to the corresponding bit-width for the forward pass, and then update the quantization parameters during the backward pass for the layers within this block.
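The multi-bit quantizer and the random-bit-width calibration loop described above can be sketched as follows. This is a minimal illustration, assuming a uniform affine fake-quantization scheme and a block-wise MSE calibration objective; the class and function names are ours, not the paper's code.

```python
import random
import torch
import torch.nn as nn

class MultiBitQuantizer(nn.Module):
    """Quantizer holding one set of parameters (here, a scale) per candidate
    bit-width, so all bit-widths share a single calibration run.

    Illustrative sketch: uniform symmetric fake-quantization with a learnable
    scale per bit-width and a fixed zero-point of 0.
    """

    def __init__(self, bit_candidates=(4, 6, 8)):
        super().__init__()
        self.bit_candidates = bit_candidates
        # One learnable scale per candidate bit-width.
        self.scales = nn.ParameterDict({
            str(b): nn.Parameter(torch.tensor(1.0)) for b in bit_candidates
        })
        self.active_bit = bit_candidates[-1]

    def set_bit(self, b):
        """Switch the quantizer to the parameter set of bit-width b."""
        self.active_bit = b

    def forward(self, x):
        b = self.active_bit
        s = self.scales[str(b)]
        qmax = 2 ** (b - 1) - 1
        # Fake-quantize: round onto the b-bit integer grid, then dequantize.
        q = torch.clamp(torch.round(x / s), -qmax - 1, qmax)
        return q * s


def calibration_step(block, quantizers, x, target, bit_candidates=(4, 6, 8)):
    """One block-wise calibration iteration: sample a random bit-width,
    switch every quantizer in the block to it, forward, and backpropagate
    into the parameters of that bit-width only."""
    b = random.choice(bit_candidates)
    for q in quantizers:
        q.set_bit(b)
    loss = torch.mean((block(x) - target) ** 2)
    loss.backward()
    return loss
```

Because each bit-width owns its own scale, switching `active_bit` at inference time changes the effective precision without any re-calibration.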

As shown in Fig. 2, secondly, we observe that different timesteps alter images to varying degrees. Specifically, smaller timesteps exhibit greater disparity due to their closer proximity to real images, whereas larger timesteps show relatively less divergence. Based on this observation, we partition the timesteps into groups in a non-uniform manner. The closer the timesteps are to the real images, the denser we partition them into groups. This non-uniform timestep grouping strategy effectively reduces the search space of our problem.
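One simple way to realize such a non-uniform grouping, with groups growing geometrically denser toward t = 0 (near the real images), is sketched below. The geometric growth rule is a hypothetical choice of ours for illustration; the paper's exact partition may differ.

```python
def nonuniform_groups(num_timesteps, num_groups, growth=2.0):
    """Partition timesteps 0..num_timesteps-1 into contiguous groups whose
    sizes grow geometrically with t, so timesteps close to the real images
    (small t) fall into smaller, denser groups."""
    # Group i (counting from t=0) gets raw weight growth**i.
    weights = [growth ** i for i in range(num_groups)]
    total = sum(weights)
    sizes = [max(1, round(num_timesteps * w / total)) for w in weights]
    # Absorb rounding drift so the sizes sum exactly to num_timesteps.
    sizes[-1] += num_timesteps - sum(sizes)
    groups, start = [], 0
    for s in sizes:
        groups.append(list(range(start, start + s)))
        start += s
    return groups
```

For example, `nonuniform_groups(100, 4)` puts only a handful of the smallest timesteps in the first group while the last group spans roughly half the trajectory, shrinking the search space from per-timestep to per-group decisions.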

Finally, inspired by Neural Architecture Search (NAS), we show that timestep reduction and precision selection can be integrated into a unified search space and thus searched jointly.
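A candidate in the unified search space pairs one timestep choice per group with one bit-width per layer; the pre-built solver then scores each candidate. The toy random-search loop below illustrates the idea under our own naming; the paper's actual search strategy is more sophisticated.

```python
import random

def sample_candidate(groups, bit_candidates, num_layers):
    """Sample one point in the joint search space: a representative
    timestep from each group plus a bit-width for each layer."""
    timesteps = [random.choice(g) for g in groups]
    bits = [random.choice(bit_candidates) for _ in range(num_layers)]
    return timesteps, bits

def random_search(evaluate, groups, bit_candidates, num_layers, n_iter=100):
    """Minimize evaluate() over the joint space; evaluate would be a cheap
    performance proxy (e.g. from the pre-calibrated mixed-precision solver)."""
    best, best_score = None, float("inf")
    for _ in range(n_iter):
        cand = sample_candidate(groups, bit_candidates, num_layers)
        score = evaluate(cand)
        if score < best_score:
            best, best_score = cand, score
    return best, best_score
```

Because both decisions live in one candidate, the search can trade a denser timestep schedule against lower bit-widths (or vice versa) under a single efficiency budget.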

Figure 1. Mixed-Precision Solver Building


Figure 2. Non-uniform Timestep Grouping Scheme


3. Results

We conduct experiments on 5 representative datasets with different resolutions: 32×32 CIFAR, 256×256 LSUN-Churches, 256×256 LSUN-Bedrooms, 256×256 ImageNet, and 512×512 COCO. As shown in Tab. 1, we achieve more than 10× BitOPs savings on all these tasks while maintaining the same generative performance.

Table 1. FID and IS for DDIM on CIFAR-10 in different settings, varying the number of timesteps. “TS” denotes “Timestep Search”, “MP” denotes “Mixed Precision”, “GS” denotes “Group-wise Splitting”.


More details can be found in our paper:

Sun, Haojun, et al. "TMPQ-DM: Joint Timestep Reduction and Quantization Precision Selection for Efficient Diffusion Models." arXiv preprint arXiv:2404.09532 (2024).