AutoQuant
This document describes how to use the AutoQuant toolchain to quantize and deploy deep learning models. AutoQuant is designed for Espressif chips (ESP32-P4 and ESP32-S3). You only need to define the quantization strategy and parameter search space. AutoQuant then runs quantization, accuracy evaluation, and on-device latency testing, and provides Top-K candidate models for selection. Compared with repeated manual tuning, AutoQuant can significantly reduce trial-and-error cost.
Preparation
Quick start
By default, AutoQuant uses MobileNetV2 + ImageNet for quantization and deployment. It is recommended to run this example first, get familiar with the full AutoQuant flow, and then switch to your own model.
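With the defaults left unchanged, the whole flow reduces to the two commands covered in the stage sections below:

```bash
python -m auto_quant.main             # Stage 1: quantize and evaluate accuracy
python -m auto_quant.test_candidates  # Stage 2: on-device latency test (board required)
```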
Custom model quantization and deployment
To use a custom model, modify the following three files:
- config.py — runtime parameters and search space
- calib_dataloader.py — calibration data
- eval.py — accuracy evaluation
Note
No change is needed in `main.py`. You only need to implement
`create_calib_dataloader()` and `evaluation()`; AutoQuant calls them
automatically.
1. config.py
First, edit `runtime_config`:

```python
runtime_config = {
    "onnx_path": "./quantize_mobilenetv2/models/torch/mobilenet_v2_relu.onnx",
    "export_path": "temp/model.espdl",
    "input_shape": [3, 224, 224],
    "device": "cpu",           # "cpu" or "cuda"
    "batchsz": 32,
    "target": "esp32p4",       # "esp32p4" or "esp32s3"
    "test_app_path": "../../../test_apps/autoquant_latency",
    "port": "/dev/ttyUSB0",    # use "/dev/ttyACM0" on ESP32-S3
    "num_of_candidates": 5,    # number of Top-K models to keep
    "primary_metric": "top1",  # see the eval.py section below
}
```
`calib_dataloader` is not part of `runtime_config`; it is built and
injected by `main.py` at runtime.
The same file also defines `strategy_space` and `param_space`. Each entry
in the strategy space declares whether a quantization module (e.g., mixed
precision, weight equalization, TQT) is always enabled, always disabled,
or left optional for automatic selection during the search. `param_space`
provides the parameter values that AutoQuant should try when a module is
enabled.
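The exact schema of the two dictionaries is defined in `config.py`; the snippet below is only an illustrative sketch, with assumed option names and value types, of how a module can be pinned on, pinned off, or left to the search:

```python
# Illustrative sketch: the keys and value types here are assumptions,
# not the real schema. See strategy_space / param_space in config.py.
strategy_space = {
    "mixed_precision": [True, False],  # optional: both settings are searched
    "weight_equalization": [True],     # always enabled
    "tqt": [False],                    # always disabled
}

param_space = {
    # values tried only for combinations where the module is enabled
    "mixed_precision": {"ratio": [0.1, 0.2, 0.3]},
}
```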
2. calib_dataloader.py
Replace the body of `create_calib_dataloader()` with your own
implementation. The function must return a PyTorch `DataLoader`, and the
shape of each sample must match `runtime_config["input_shape"]`.
Pre/post-processing (resize, normalize, etc.) is model-dependent and
must stay consistent with your training pipeline. The default
implementation downloads an ImageNet calibration subset and applies the
standard MobileNetV2 transforms; replace it as a whole when switching
to another model.
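As a sketch, a replacement for another ImageNet-style classifier might look like this (the dataset path is hypothetical, and the transforms must be swapped for the ones used during your training):

```python
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

def create_calib_dataloader() -> DataLoader:
    # ImageNet-style preprocessing; keep it consistent with training.
    transform = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),  # per-sample shape matches input_shape [3, 224, 224]
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],
                    std=[0.229, 0.224, 0.225]),
    ])
    dataset = ImageFolder("path/to/calib_images", transform=transform)  # hypothetical path
    return DataLoader(dataset, batch_size=32, shuffle=False)
```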
3. eval.py
Replace the body of `evaluation(quant_graph)` with your own
implementation. The function takes the PPQ-quantized graph as its only
input and must return a dict whose keys are metric names and whose
values are scalar numbers. The default implementation returns
`{"top1": ..., "top5": ...}` for ImageNet classification.
If evaluation needs a GPU or CPU, read it from `runtime_config["device"]`.
AutoQuant uses `runtime_config["primary_metric"]` to rank candidates,
so the dict returned by `evaluation()` must contain that key, and
`primary_metric` in `config.py` must match it; otherwise the program
fails immediately.
For example, to rank candidates by Top-1 accuracy:

| File | Change |
|---|---|
| `eval.py` | Implement `evaluation()` so the returned dict contains the metric key, e.g. `{"top1": ...}` |
| `config.py` | Set `"primary_metric": "top1"` to match that key |
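A minimal sketch of a matching `evaluation()` for classification follows; `val_loader` and `run_quant_graph()` are hypothetical placeholders for your validation data and for however your PPQ version executes the quantized graph:

```python
def evaluation(quant_graph) -> dict:
    # Sketch only: val_loader and run_quant_graph() are placeholders you
    # must supply; adapt the forward call to your PPQ executor.
    correct1 = correct5 = total = 0
    for images, labels in val_loader:
        logits = run_quant_graph(quant_graph, images)  # hypothetical helper
        top5 = logits.topk(5, dim=1).indices
        correct1 += (top5[:, 0] == labels).sum().item()
        correct5 += (top5 == labels.unsqueeze(1)).any(dim=1).sum().item()
        total += labels.numel()
    # The returned keys must include runtime_config["primary_metric"].
    return {"top1": correct1 / total, "top5": correct5 / total}
```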
Stage 1: search and accuracy evaluation
Run Stage 1 on any machine with calibration data and a GPU (or CPU); no
ESP board is required. AutoQuant enumerates all (strategy, parameter)
combinations, quantizes each one, and evaluates its accuracy. It then
keeps K results (K = `runtime_config["num_of_candidates"]`) with a
balanced rule (see the sketch after this list):

- Take `K//2` candidates from `low_latency_pool` (`mixed_precision=False`
  and `horizontal_layer_split=False`), picking the highest by
  `runtime_config["primary_metric"]`.
- Fill the remaining slots from the other candidates, ranked by the same
  metric.
- If the other candidates are not enough, backfill the remaining slots
  from `low_latency_pool`.
- If one side is empty, the selection degenerates to a regular Top-K
  ranked by `primary_metric`.
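To make the rule concrete, here is a minimal Python sketch; the record layout (per-result strategy flags plus a `metrics` dict) is an assumption for illustration, not AutoQuant's actual internals:

```python
def select_topk(results, k, primary_metric):
    # results: one dict per experiment with its strategy flags and metrics.
    low = [r for r in results
           if not r["mixed_precision"] and not r["horizontal_layer_split"]]
    rest = [r for r in results if r not in low]
    score = lambda r: r["metrics"][primary_metric]
    low.sort(key=score, reverse=True)
    rest.sort(key=score, reverse=True)

    picked = low[: k // 2]               # half from the low-latency pool
    picked += rest[: k - len(picked)]    # fill the rest by the primary metric
    if len(picked) < k:                  # other candidates ran out: backfill
        leftover = [r for r in low if r not in picked]
        picked += leftover[: k - len(picked)]
    return sorted(picked, key=score, reverse=True)
```

If either pool is empty, the slicing above naturally reduces to a plain Top-K by `primary_metric`.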
The following commands are expected to run in
`esp-dl/examples/tutorial/how_to_quantize_model`. If you are not in
this directory, `cd` into it first.
```bash
python -m auto_quant.main            # fresh run; resets summary/candidates
python -m auto_quant.main --resume   # continue from an interruption
```
The `--resume` flag reuses `outputs/run/summary.json` and skips
already-finished (strategy, parameter) combinations. Use it after a
crash, OOM, or manual interruption.
Stage 1 generates the following under `outputs/run/`:

- `<index>/` (e.g. `0000/`, `0001/`, …) — one directory per experiment, containing `config.json` and quantized model files (`.espdl`, `.info`, `.json`, `.native`).
- `summary.json` — append-only log of each experiment (accuracy, parameters, hash).
- `candidates.json` — the current Top-K list, rewritten after each experiment and maintained by the Stage 1 balanced rule, so different strategy families are represented and partial results stay usable.
Stage 2: on-device latency test
Stage 2 must run on a local machine with a physically connected ESP
board (remote servers cannot access USB devices). It flashes each
Top-K candidate from Stage 1, measures on-device inference latency, and
finally writes a Markdown report, `topk_report.md`, so the final model
can be chosen based on the accuracy/latency trade-off.
Before running
1. Make sure Stage 1 has finished and `outputs/run/candidates.json` exists. If Stage 1 ran on a remote machine, copy the full `outputs/run/` directory to the local machine (for example, using `rsync`).
2. Connect an ESP32-P4 or ESP32-S3 board, and verify that `runtime_config["target"]` and `runtime_config["port"]` in `config.py` match the connected hardware.
3. Activate the ESP-IDF environment.
Run
```bash
python -m auto_quant.test_candidates
python -m auto_quant.test_candidates --port /dev/ttyUSB1
python -m auto_quant.test_candidates --target esp32s3
```
The optional `--port` and `--target` flags temporarily override the
corresponding fields in `runtime_config` for a single run. This is
useful when switching boards without editing `config.py`.
Output files
- `outputs/run/candidates.json` — the same Top-K list as Stage 1, with an extra `latency_ms` field. It is updated after each candidate, so measured results are preserved even on a crash or Ctrl-C.
- `outputs/run/topk_report.md` — Markdown report. The `primary_metric` and `latency_ms` of all Top-K candidates are organized into two four-column tables (index | folder | primary_metric | latency_ms): one sorted by `primary_metric` descending, one sorted by `latency_ms` ascending, so the accuracy/latency trade-off is easy to compare. The full quantization strategy and parameters are not duplicated in the report; check `candidates.json` (full list) or `<folder>/config.json` (single-experiment snapshot). Failed candidates are listed in the "Failed measurements" section at the end. After reading the report, flash the `model.espdl` in the selected row's `folder` directory into your firmware.
Switch target chip
When `runtime_config["target"]` changes between runs (for example from
`esp32p4` to `esp32s3`), Stage 2 automatically cleans `sdkconfig`,
`build/`, `managed_components/`, and `dependencies.lock` in the test
app. No manual cleanup is needed.
Output layout
```
outputs/run/
    summary.json        # all Stage 1 experiment records (append-only)
    candidates.json     # Top-K; Stage 2 adds the latency_ms field
    topk_report.md      # Stage 2 Markdown report for manual selection
    0000/
        config.json     # quantization config used by this experiment
        model.espdl     # quantized model (can be deployed directly)
        model.info
        model.json
        model.native
    0001/
        ...
```
Troubleshooting
- `AssertionError: primary_metric=... not in metrics=...`
  - The key in `runtime_config["primary_metric"]` does not match a key returned by `evaluation()`. Align them and re-run.
- No latency printed for a candidate
  - Check `runtime_config["port"]` (`/dev/ttyUSB0` for ESP32-P4, `/dev/ttyACM0` for ESP32-S3).
  - Make sure the serial port is not occupied by another process (`idf.py monitor`, `minicom`, etc.).
  - Confirm `runtime_config["target"]` matches the connected board.
- `idf.py flash` failed
  - Wrong chip type: set `runtime_config["target"]` correctly and run again; Stage 2 automatically cleans stale build outputs.
  - Serial port busy: close all open serial monitors.
- Model file not found
  - Check `runtime_config["export_path"]` in `config.py`.
  - Confirm Stage 1 has finished and the corresponding `outputs/run/<index>/model.espdl` exists.