How to Quantize a Model


ESP-DL requires a proprietary format, .espdl, for model deployment. It is a quantized model format that supports 8-bit and 16-bit quantization. In this tutorial, we take quantize_sin_model as an example to show how to use ESP-PPQ to quantize a model and export it as .espdl. The quantization method used here is Post-Training Quantization (PTQ).

Preparation

Install ESP-PPQ

Pre-trained model

python sin_model.py

Run sin_model.py. This script trains a simple PyTorch model to fit the sine function on the range [0, 2π]. After training, it saves the corresponding .pth weights and exports the ONNX model.

Note

ESP-PPQ provides two interfaces, espdl_quantize_onnx and espdl_quantize_torch, to support ONNX models and PyTorch models. Models from other deep learning frameworks, such as TensorFlow and PaddlePaddle, must first be converted to ONNX.

Quantize and export .espdl

Refer to quantize_torch_model.py and quantize_onnx_model.py to learn how to use the espdl_quantize_torch and espdl_quantize_onnx interfaces to quantize a model and export it as .espdl.

After executing the script, three files will be exported:

  • **.espdl: ESPDL model binary file, which can be deployed on chip for inference directly, and can be visualized with Netron.

  • **.info: ESPDL model text file, used for debugging and verifying that the .espdl model was exported correctly. It contains the model structure, quantized model weights, test inputs/outputs, and other information.

  • **.json: Quantization information file, used to save and load quantization information.

Note

  1. The .espdl models of different platforms cannot be mixed; otherwise, inference results will be incorrect.

    • ESP32 uses a Per-Tensor quantization strategy; the rounding mode is ROUND_HALF_UP.

      • When quantizing ESP32 platform models with ESP-PPQ, set the target to c, because ESP-DL implements those operators in C.

      • When deploying ESP32 platform models with ESP-DL, set the project build target to esp32.

    • ESP32S3 uses a Per-Tensor quantization strategy; the rounding mode is ROUND_HALF_UP.

    • On ESP32P4, Conv and GEMM use Per-Channel quantization; other operators use Per-Tensor; the rounding mode is ROUND_HALF_EVEN.

  2. ESP-DL currently uses symmetric quantization with power-of-two scales.
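To make the strategy concrete, here is a minimal, self-contained sketch of symmetric power-of-two quantization with both rounding modes. This is illustrative only, not ESP-PPQ's internal code:

```python
import math

def quantize_pot(x, exponent, bits=8, rounding="half_even"):
    """Symmetric power-of-two quantization: q = round(x / 2**exponent),
    clamped to the signed range of the given bit width."""
    v = x / (2.0 ** exponent)
    if rounding == "half_even":
        q = round(v)                 # Python's round() is ROUND_HALF_EVEN
    else:
        q = math.floor(v + 0.5)      # ROUND_HALF_UP
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(qmin, min(qmax, q))

def dequantize_pot(q, exponent):
    return q * (2.0 ** exponent)

# With scale 2**-6, the value 0.5 maps exactly to integer 32.
print(quantize_pot(0.5, -6), dequantize_pot(32, -6))   # 32 0.5

# The two rounding modes differ only on exact .5 ties:
x = 2.5 * 2 ** -6
print(quantize_pot(x, -6, rounding="half_even"))       # 2
print(quantize_pot(x, -6, rounding="half_up"))         # 3
```

ESP32 and ESP32-S3 use the half-up behavior, while ESP32-P4 uses half-even, which is one reason models quantized for one target cannot be reused on another.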

Add test input/output

To verify that the model's inference results on the board are correct, you first need to record a set of test inputs/outputs on the PC. Enabling the export_test_values option in the API saves a set of test inputs/outputs in the .espdl model. One of the input_shape and inputs parameters must be specified: input_shape uses a random test input, while inputs can use a specific test input. The test input/output values can be viewed in the .info file; search for test inputs value and test outputs value to find them.

Quantized model inference & accuracy evaluation

The espdl_quantize_onnx and espdl_quantize_torch APIs return a BaseGraph. Use the BaseGraph to construct a TorchExecutor and run inference with the quantized model on the PC:

# quanted_graph is the BaseGraph returned by espdl_quantize_onnx / espdl_quantize_torch
executor = TorchExecutor(graph=quanted_graph, device=device)
output = executor(input)

The outputs obtained from quantized-model inference can be used to calculate various accuracy metrics. Because the board-side ESP-DL inference results are aligned with ESP-PPQ's, these metrics can be used directly to evaluate the accuracy of the quantized model.
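For instance, a simple similarity metric between float-model and quantized-model outputs can be computed as follows. This is a generic sketch, not an ESP-PPQ API, and the output values are made up for illustration:

```python
def cosine_similarity(a, b):
    """Cosine similarity between two flat lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

float_out = [0.00, 0.48, 0.84, 1.00]   # hypothetical FP32 model outputs
quant_out = [0.00, 0.50, 0.84, 0.99]   # hypothetical dequantized INT8 outputs
print(cosine_similarity(float_out, quant_out))  # close to 1.0 for a good quantization
```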

Note

  1. Currently, ESP-DL only supports a batch size of 1; multi-batch and dynamic batch sizes are not supported.

  2. The test inputs/outputs and the quantized model weights in the .info file are all 16-byte aligned; data whose length is not a multiple of 16 bytes is zero-padded.
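The alignment rule in note 2 can be sketched as:

```python
def pad_to_16(data: bytes) -> bytes:
    """Zero-pad a byte string up to the next 16-byte boundary
    (illustrative only; not ESP-DL's actual serialization code)."""
    remainder = len(data) % 16
    if remainder == 0:
        return data
    return data + b"\x00" * (16 - remainder)

print(len(pad_to_16(b"\x01" * 10)))  # 16
print(len(pad_to_16(b"\x01" * 17)))  # 32
```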

Advanced Quantization Methods

If the default 8-bit quantization does not meet your accuracy requirements, the following methods can further reduce the accuracy loss of the quantized model:

Post Training Quantization (PTQ)

Quantization Aware Training (QAT)