How to Deploy Streaming Models
Time series models are now widely applied in fields such as audio processing. Audio models are typically deployed in one of two modes:
Offline mode: The model receives the complete audio data (e.g., an entire speech file) at once and processes it as a whole.
Streaming mode: The model receives audio data frame by frame (or chunk by chunk) in real time, processes it, and outputs intermediate results.
In this tutorial, we will introduce how to quantize a streaming model using ESP-PPQ and deploy the quantized streaming model with ESP-DL.
Prerequisites
Model Quantization
How to Convert to a Streaming Model
There are many kinds of time series models; here we take the Temporal Convolutional Network (TCN) as an example. If you are unfamiliar with TCNs, please refer to the relevant literature, as we do not cover them in detail here. Other models should be adapted according to their specific structures.
The example code constructs a TCN model: models.py (the model is incomplete and used only for demonstration).
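Since models.py is not reproduced here, the sketch below is only an illustrative stand-in: a minimal TCN-style model built from stacked dilated causal Conv1d layers, using the [batch, channels, time] input layout assumed throughout this tutorial. It is not the actual example model.
import torch
import torch.nn as nn

class TinyTCN(nn.Module):
    """Illustrative TCN-style stack of dilated causal 1D convolutions."""
    def __init__(self, channels=16, kernel_size=3, num_layers=2):
        super().__init__()
        layers = []
        for i in range(num_layers):
            dilation = 2 ** i
            layers += [
                # left-only (causal) padding keeps the time dimension unchanged
                nn.ConstantPad1d(((kernel_size - 1) * dilation, 0), 0.0),
                nn.Conv1d(channels, channels, kernel_size, dilation=dilation),
                nn.ReLU(),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: [batch, channels, time], e.g. [1, 16, 15]
        return self.net(x)

model = TinyTCN()
print(model(torch.randn(1, 16, 15)).shape)  # torch.Size([1, 16, 15])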
ESP-PPQ provides an automatic streaming conversion feature that simplifies the process of creating streaming models. With the auto_streaming=True parameter, ESP-PPQ automatically handles the model transformation required for streaming inference.
Note
In offline mode, the model input is a complete data segment, so the input shape typically has a large size along the time dimension (e.g., [1, 16, 15]).
In streaming mode, the model input is continuous data with a smaller time dimension that matches the chunk size for real-time processing (e.g., [1, 16, 3]).
Automatic Streaming Conversion
When the auto_streaming=True flag is enabled during quantization, ESP-PPQ automatically transforms the model to support streaming inference by:
Analyzing the model structure to identify appropriate chunking points
Creating internal state management for maintaining context between chunks
Generating optimized code suitable for streaming scenarios
Here’s an example of how to use the auto streaming feature:
# Export non-streaming model
quant_ppq_graph = espdl_quantize_torch(
    model=model,
    espdl_export_file=ESPDL_MODEL_PATH,
    calib_dataloader=dataloader,
    calib_steps=32,                   # Number of calibration steps
    input_shape=INPUT_SHAPE,          # Input shape for offline mode
    inputs=None,
    target=TARGET,                    # Quantization target type
    num_of_bits=NUM_OF_BITS,          # Number of quantization bits
    dispatching_override=None,
    device=DEVICE,
    error_report=True,
    skip_export=False,
    export_test_values=True,
    verbose=1,                        # Output detailed log information
)
# Export streaming model with automatic conversion
quant_ppq_graph = espdl_quantize_torch(
    model=model,
    espdl_export_file=ESPDL_STREAMING_MODEL_PATH,
    calib_dataloader=dataloader,
    calib_steps=32,
    input_shape=INPUT_SHAPE,
    inputs=None,
    target=TARGET,
    num_of_bits=NUM_OF_BITS,
    dispatching_override=None,
    device=DEVICE,
    error_report=True,
    skip_export=False,
    export_test_values=False,
    verbose=1,
    auto_streaming=True,                  # Enable automatic streaming conversion
    streaming_input_shape=[1, 16, 3],     # Input shape for streaming mode
    streaming_table=None,
)
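The streaming input shape divides the offline input shape along the time dimension, which determines how a full input is later split into chunks on the device. A minimal sketch of that split, assuming the [1, 16, 15] offline and [1, 16, 3] streaming shapes used above:
import torch

INPUT_SHAPE = [1, 16, 15]            # offline input shape used above
STREAMING_INPUT_SHAPE = [1, 16, 3]   # streaming input shape used above

full_input = torch.randn(*INPUT_SHAPE)
chunk_len = STREAMING_INPUT_SHAPE[-1]

# Split along the time dimension into the chunks the streaming model consumes one at a time
chunks = torch.split(full_input, chunk_len, dim=-1)
print(len(chunks), chunks[0].shape)  # 5 torch.Size([1, 16, 3])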
Model Deployment
See the reference example; it uses pre-generated data to simulate a real-time data stream.
Note
For basic model loading and inference methods, please refer to the related documents.
In streaming mode, the model receives data in chunks over time rather than requiring the entire input at once. The streaming model processes these chunks sequentially while maintaining internal state between chunks. The deployment code handles splitting the input into appropriate chunks and feeding them to the model. See app_main.cpp for the following code block:
dl::TensorBase *run_streaming_model(dl::Model *model, dl::TensorBase *test_input)
{
    std::map<std::string, dl::TensorBase *> model_inputs = model->get_inputs();
    dl::TensorBase *model_input = model_inputs.begin()->second;
    std::map<std::string, dl::TensorBase *> model_outputs = model->get_outputs();
    dl::TensorBase *model_output = model_outputs.begin()->second;

    if (!test_input) {
        ESP_LOGE(TAG,
                 "Model input doesn't have a corresponding test input. Please enable export_test_values option "
                 "in esp-ppq when export espdl model.");
        return nullptr;
    }

    int test_input_size = test_input->get_bytes();
    uint8_t *test_input_ptr = (uint8_t *)test_input->data;
    int model_input_size = model_input->get_bytes();
    uint8_t *model_input_ptr = (uint8_t *)model_input->data;
    // number of chunks = full test input size / streaming model input size
    int chunks = test_input_size / model_input_size;

    for (int i = 0; i < chunks; i++) {
        // assign chunk data to model input
        memcpy(model_input_ptr, test_input_ptr + i * model_input_size, model_input_size);
        // the streaming model keeps its internal state between successive run() calls
        model->run(model_input);
    }
    return model_output;
}
This approach allows the model to process long sequences efficiently by breaking them into smaller, manageable chunks. Each chunk is fed to the model sequentially, and the internal state is maintained automatically to ensure continuity across chunks.
Note
The number of chunks is calculated based on the ratio between the full input size and the streaming model’s input size.
ESP-DL streaming models handle internal state management automatically, making deployment straightforward.
The output from the streaming model should match the final portion of the equivalent offline model’s output.
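The sketch below is not the ESP-PPQ conversion itself; it only illustrates, with a single causal Conv1d and a hand-managed context buffer, why the output for the last streaming chunk lines up with the tail of the offline output:
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv1d(16, 16, kernel_size=3)      # no padding: a simple stand-in layer

x = torch.randn(1, 16, 15)                   # full offline input
offline_out = conv(x)                        # offline mode: process everything at once

chunk = 3                                    # streaming chunk size along time
context = torch.zeros(1, 16, conv.kernel_size[0] - 1)   # state carried between chunks
streaming_out = None
for t in range(0, x.shape[-1], chunk):
    frame = torch.cat([context, x[..., t:t + chunk]], dim=-1)
    streaming_out = conv(frame)              # output for the current chunk only
    context = frame[..., -(conv.kernel_size[0] - 1):]   # update state for the next chunk

# The last chunk's output matches the final portion of the offline output
print(torch.allclose(streaming_out, offline_out[..., -streaming_out.shape[-1]:], atol=1e-5))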