Chen Tang*, Kai Ouyang*, Zhi Wang, Yifei Zhu, Wen Ji, Yaowei Wang, Wenwu Zhu

1. Background

Mixed-precision quantization assigns non-uniform bit-widths across the layers of a deep model: some layers use higher bit-widths to preserve overall accuracy, while others use lower bit-widths to reduce computational cost. Although mixed-precision quantization offers a favorable accuracy-efficiency trade-off, the exponentially large discrete search space makes it hard to determine the optimal bit-width for each layer. Previous works usually resort to iterative search on the training set, which consumes hundreds or even thousands of GPU-hours. The reason is that existing methods must evaluate the goodness of the currently searched policy, namely its accuracy, to guide the next search step, and this evaluation is performed on the training set over many rounds. Our solution is to avoid the iterative search altogether: we propose metrics that estimate the importance of each layer at once, so that bit-widths can be allocated directly from these metrics instead of from time-consuming accuracy evaluations.
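To get a sense of how large the search space is, here is a quick back-of-the-envelope calculation (the layer count and candidate bit-widths below are illustrative, not the exact configuration in the paper):

```python
# Illustrative only: with B candidate bit-widths per layer and L quantizable layers,
# an exhaustive search would have to consider B**L bit-width policies.
bit_choices = [2, 4, 8]   # hypothetical candidate bit-widths
num_layers = 50           # roughly the number of quantizable layers in ResNet-50

print(len(bit_choices) ** num_layers)   # 3**50 ≈ 7.2e23 candidate policies
```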


In our work, we find that the scale factor in the quantizer can reflect the quantization sensitivity of different layers. This is because the scale factors are optimized during training to adjust the quantization mapping, so after convergence they naturally capture the quantization behavior of their layers. Consider two example layers with well-trained step sizes and weights under 2-bit quantization. The learned scale factor s controls the quantization intervals and their corresponding quantization levels (the differently colored areas in the figure below). One can see that the layer on the right, which has the larger scale-factor value, should be given more bit-width, since more continuous values are mapped to the same quantization level (e.g., the green area).

[Figure: two example layers under 2-bit quantization, whose learned scale factors s define different quantization intervals and levels]
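For concreteness, here is a minimal sketch of a learned-step-size (LSQ-style) quantizer of the kind described above, assuming symmetric 2-bit weight quantization; it is an illustration rather than the exact quantizer used in the paper:

```python
import torch

def lsq_quantize(w: torch.Tensor, s: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Quantize w using a learned scale factor (step size) s."""
    qn, qp = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # e.g., levels {-2, -1, 0, 1} for 2 bits
    # A larger s widens each quantization interval, so a wider range of
    # continuous values collapses onto the same quantization level.
    return torch.clamp(torch.round(w / s), qn, qp) * s

w = torch.randn(10000)
for s in (0.1, 0.5):
    q = lsq_quantize(w, torch.tensor(s))
    levels, counts = q.unique(return_counts=True)
    print(f"s={s}: levels={levels.tolist()}, fraction per level={(counts / w.numel()).tolist()}")
```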

2. Case Study, ILP Formulation, and Indicator-Based MPQ Search

We also conduct a comparative experiment on MobileNetV1 to observe the numerical differences between scale factors. Such a model contains two main layer types, point-wise (PW) convolutions and depth-wise (DW) convolutions, and there is a general consensus that depth-wise layers are more quantization-sensitive than point-wise layers at the same bit-width. In our experiment, the scale-factor values of the DW-conv layers (more sensitive) are indeed significantly higher than those of the PW-conv layers (less sensitive). Based on this finding, we formulate the mixed-precision quantization policy search as an integer linear programming (ILP) problem, sketched in the example below.

[Figure: learned scale-factor values of depth-wise vs. point-wise convolution layers in MobileNetV1]
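A minimal toy version of such an ILP, assuming layer importance scores have already been derived (e.g., from the learned scale factors) and using the PuLP solver; the layer sizes, budget, importance values, and simplified objective are hypothetical and do not reproduce the paper's exact formulation:

```python
# Toy bit-allocation ILP (hypothetical numbers): choose exactly one bit-width per layer,
# maximizing a layer-importance score under a total weight-size budget.
import pulp

bit_choices = [2, 4, 8]
params_m = [0.8, 0.5, 0.3, 0.2]        # hypothetical parameter counts per layer (millions)
budget_mbit = 8.0                      # total weight-size budget (Mbit)
# Hypothetical importance of assigning bit-width b to layer l.
importance = {(l, b): (l + 1) * b for l in range(len(params_m)) for b in bit_choices}

prob = pulp.LpProblem("mpq_bit_allocation", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", importance.keys(), cat="Binary")

# Objective: total importance of the selected (layer, bit-width) assignments.
prob += pulp.lpSum(importance[k] * x[k] for k in importance)
# Each layer is assigned exactly one bit-width.
for l in range(len(params_m)):
    prob += pulp.lpSum(x[(l, b)] for b in bit_choices) == 1
# Model-size constraint: the sum of (params * bits) over layers must fit the budget.
prob += pulp.lpSum(params_m[l] * b * x[(l, b)]
                   for l in range(len(params_m)) for b in bit_choices) <= budget_mbit

prob.solve(pulp.PULP_CBC_CMD(msg=False))
policy = {l: b for (l, b) in importance if pulp.value(x[(l, b)]) > 0.5}
print(policy)   # {0: 2, 1: 4, 2: 8, 3: 8} for these hypothetical numbers
```

Solving such a program takes a negligible amount of time even for deep networks, which is what allows the bit-width policy to be found almost instantly once the per-layer importance indicators are available.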

3. Results

We achieve state-of-the-art results: we compress ResNet-50, a commonly used CNN model, by about 12.2× with only 0.6% accuracy degradation. Meanwhile, we greatly improve search efficiency: the bit-width search for ResNet takes less than 1 second, whereas previous works usually consume more than 100 GPU-hours.


More details can be found in our paper:

Tang C, Ouyang K, Wang Z, et al. Mixed-Precision Neural Network Quantization via Learned Layer-Wise Importance[C]//Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI. Cham: Springer Nature Switzerland, 2022: 259-275.

https://arxiv.org/pdf/2203.08368.pdf