AI Inference
The espdl module (Model Inference) runs neural networks on the device using ESP-DL, Espressif’s deep-learning inference library. ESP-VISION wraps ESP-DL in task-specific classes so a script can go from an image to detections, poses, or class labels in a few lines, while ESP-DL handles the heavy tensor math on the chip’s vector unit (and, on ESP32-P4, additional acceleration).
The .espdl model format
Models are distributed as .espdl files. An .espdl bundle contains the network graph, the trained weights, and, crucially, quantization information. A model is converted from a floating-point framework (PyTorch, TensorFlow) into the ESP-DL format ahead of time; see Introduce a New Model for the deployment workflow. Models live on the SD card (/sdcard/...) or in the flash data partition (/flash/...) and are loaded with the wrapper constructor or espdl.load_model().
Quantization
Microcontrollers lack the floating-point throughput to run full 32-bit float inference at frame rate, so ESP-DL runs quantized models that use integer arithmetic, typically 8-bit (16-bit where accuracy demands it). Quantization maps the float range of each tensor onto integers through a scale factor, which shrinks the model and lets the hardware perform integer multiply-accumulates far faster than float ones. The accuracy cost is usually small, but this is why a model must be quantized during conversion rather than loaded directly from its training checkpoint.
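To make the scale-factor idea concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic illustration, not ESP-DL's converter: the function names `quantize` and `dequantize` are hypothetical, and the exact scheme ESP-DL uses (how the scale is chosen and stored, per-tensor or per-channel) is not described here.

```python
def quantize(values, num_bits=8):
    """Map floats onto signed integers using one scale for the whole tensor.

    Generic symmetric quantization for illustration only; ESP-DL's
    conversion tools may choose and store the scale differently.
    """
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8-bit, 32767 for 16-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0  # float value of one integer step
    # Round to the nearest step and clamp into the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integers."""
    return [x * scale for x in q]


weights = [-1.27, -0.5, 0.0, 0.64, 1.27]   # made-up example values
q, scale = quantize(weights)
restored = dequantize(q, scale)

# The rounding error per value is at most half an integer step.
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, worst_error)
```

Each stored value shrinks from four bytes to one, and the network's multiply-accumulates operate on the integers; the scale is only needed to interpret the results.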
Inference flow
A single detect() (or classify()) call hides three stages: the wrapper prepares the input, ESP-DL runs the quantized network, and the wrapper decodes the raw output into Python tuples.

Pre-processing
Before inference the input image must match what the model was trained on:
Resize to the model’s input resolution. The wrappers handle this, scaling the captured frame to the network input size.
Color order and channel count must match (most detectors expect RGB).
Normalization subtracts a per-channel mean and divides by a per-channel std. These are constructor keywords (mean= / std=) so you can match the exact preprocessing the model expects; the defaults suit the bundled models.
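The normalization step can be sketched in plain Python as follows. This only illustrates the arithmetic the wrappers perform internally; `normalize_pixel` and `normalize_image` are hypothetical helpers, and the MEAN/STD values are the widely used ImageNet statistics on a 0-255 scale, chosen as an example rather than taken from the wrappers' defaults.

```python
# Illustrative statistics only (common ImageNet values on a 0-255 scale).
# The values a given model needs depend on how it was trained.
MEAN = (123.675, 116.28, 103.53)
STD = (58.395, 57.12, 57.375)


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Subtract the per-channel mean, then divide by the per-channel std."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))


def normalize_image(rows, mean=MEAN, std=STD):
    """Normalize an image given as a list of rows of (R, G, B) tuples."""
    return [[normalize_pixel(px, mean, std) for px in row] for row in rows]


# A tiny 1x2 "image": one black pixel and one white pixel.
image = [[(0, 0, 0), (255, 255, 255)]]
normalized = normalize_image(image)
print(normalized)
```

A pixel equal to the mean maps to zero in every channel, and a pixel one standard deviation above the mean maps to one, which is the input range the network saw during training.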
Post-processing
The raw network outputs are decoded into friendly Python tuples:
Object detection (espdl.ESPDet, espdl.YOLO11) produces candidate boxes with class scores. A confidence score threshold drops weak boxes and non-maximum suppression (nms) removes overlapping duplicates, leaving (x, y, w, h, score, category) tuples. YOLO11 also caps results with topk.
Pose estimation (espdl.YOLO11nPose) adds 17 COCO keypoints per detection.
Classification (espdl.ImageNetCls) applies an optional softmax and returns the topk (label, score) pairs.
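The detection post-processing described above (score threshold, then non-maximum suppression, then an optional topk cap) can be sketched in plain Python. This is a generic illustration, not the wrappers' actual code: `iou` and `postprocess` are hypothetical names, the sketch assumes (x, y) is the top-left corner of the box, and it suppresses duplicates per category, which may differ from what ESP-DL does.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h, ...) boxes.

    Assumes (x, y) is the top-left corner; that convention is an
    assumption of this sketch.
    """
    ax, ay, aw, ah = a[:4]
    bx, by, bw, bh = b[:4]
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def postprocess(candidates, score=0.5, nms=0.45, topk=None):
    """Filter (x, y, w, h, score, category) candidates.

    1. Drop boxes whose score is below the `score` threshold.
    2. Greedy NMS: visit boxes from best to worst and keep a box only
       if it does not overlap an already kept box of the same category
       by more than `nms`.
    3. Optionally keep only the `topk` best survivors.
    """
    strong = sorted((c for c in candidates if c[4] >= score),
                    key=lambda c: c[4], reverse=True)
    kept = []
    for cand in strong:
        if all(cand[5] != k[5] or iou(cand, k) <= nms for k in kept):
            kept.append(cand)
    return kept[:topk] if topk else kept


candidates = [
    (10, 10, 100, 100, 0.90, 0),
    (12, 12, 100, 100, 0.80, 0),   # near-duplicate of the first box
    (200, 50, 40, 40, 0.30, 1),    # too weak, removed by the threshold
    (200, 50, 40, 40, 0.70, 1),
]
detections = postprocess(candidates)
print(detections)
```

Raising the score threshold trades missed objects for fewer false positives, while lowering the nms threshold merges neighbouring boxes more aggressively.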
Thresholds can be tuned at runtime with set_thresholds without reloading the model, which is handy when adapting to lighting or distance.
Memory and performance
Models and their activation buffers are large and are allocated in PSRAM. Load a model once and reuse the wrapper across frames rather than recreating it per frame. Call deinit() (or drop the last reference) to free a model when you are done. Detection cost scales with input resolution, so smaller models such as espdet_pico run at higher frame rates than full-size networks.
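As a rough back-of-the-envelope aid for the resolution trade-off, the sketch below estimates the size of the quantized input tensor and the relative compute between two input sizes. Both helpers are hypothetical, the resolutions are example values rather than those of any bundled model, and the "cost scales with pixel count" rule is only an approximation for convolutional networks; intermediate activations and the weights add to the real PSRAM footprint.

```python
def input_bytes(width, height, channels=3, bytes_per_value=1):
    """Size of the input tensor alone (1 byte per value for 8-bit models)."""
    return width * height * channels * bytes_per_value


def relative_cost(width, height, ref_width, ref_height):
    """Approximate compute ratio, assuming cost grows with pixel count."""
    return (width * height) / (ref_width * ref_height)


# Example resolutions chosen for illustration only.
small = input_bytes(224, 224)
large = input_bytes(640, 640)
ratio = relative_cost(640, 640, 320, 320)
print(small, large, ratio)
```

Doubling both the width and the height quadruples the pixel count, so under this approximation the frame rate drops to about a quarter.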