Acoustic Echo Cancellation (AEC)
Overview
The ESP-SR AEC (Acoustic Echo Cancellation) module provides high-performance acoustic echo cancellation, effectively removing the echo of speaker playback captured by the microphone. It is widely used in scenarios such as voice wake-up, voice calls, and full-duplex human-machine interaction.
AEC provides three different implementations covering three application scenarios:
Application Scenario |
Mode |
Description |
|---|---|---|
Speech Recognition (SR) |
|
Low-cost mode with linear filtering only, small memory footprint and fast speed |
Full-Duplex Conversation (FD) |
|
Low-cost full-duplex mode, includes linear filtering + nonlinear processing, suitable for Full-Duplex dialogue scenarios |
Voice over IP (VOIP) |
|
Low-cost call mode, supports 8 kHz / 16 kHz, suitable for ordinary voice calls |
Note
Users should select the appropriate mode based on the actual application scenario, resource budget, and performance requirements. It is generally recommended to choose AEC_MODE_FD_LOW_COST for the best balance between performance and resource consumption.
Usage
The AEC module provides two integration methods:
Method 1: Directly Call the AEC API
Suitable for scenarios requiring fine-grained control over the AEC module. The header file is include/esp32s3/esp_aec.h.
Basic Flow:
Create an AEC instance
#include "esp_aec.h" aec_handle_t *aec = aec_create( 16000, // Sample rate (Hz), currently only 16000 is supported 4, // Filter length, recommended value is 4, larger values consume more resources 1, // Number of microphone channels AEC_MODE_SR_LOW_COST // Working mode );
Or use advanced configuration:
aec_config_t config = { .mic_num = 1, .ref_num = 1, .out_num = 1, .filter_length = 4, .sample_rate = 16000, .caps = MALLOC_CAP_PSRAM | MALLOC_CAP_8BIT, .mode = AEC_MODE_SR_LOW_COST, .nlp_level = AEC_NLP_LEVEL_AGGR, }; aec_handle_t *aec = aec_create_from_config(&config);
Get frame length
int frame_size = aec_get_chunksize(aec);
Allocate audio buffers
int16_t *mic = heap_caps_aligned_alloc(16, frame_size * sizeof(int16_t), MALLOC_CAP_8BIT); int16_t *ref = heap_caps_aligned_alloc(16, frame_size * sizeof(int16_t), MALLOC_CAP_8BIT); int16_t *out = heap_caps_aligned_alloc(16, frame_size * sizeof(int16_t), MALLOC_CAP_8BIT);
Warning
All input/output buffers must be 16-bit signed integers (int16_t), and it is recommended to allocate them with 16-byte alignment using
heap_caps_aligned_alloc(16, ...).Process audio frames
Full processing (linear filtering + nonlinear processing):
aec_process(aec, mic, ref, out);
Release resources
aec_destroy(aec); free(mic); free(ref); free(out);
Method 2: Use via the AFE Module
Suitable for scenarios requiring multiple audio front-end algorithms such as AEC, NS (noise suppression), VAD (voice activity detection), and WakeNet (wake word detection) simultaneously. Please refer to the Audio Front End for details.
NLP Level Description
Nonlinear processing (NLP) is used to further suppress residual echo and can be configured via aec_nlp_level_t. Currently, it is only effective for FD mode:
Level |
Description |
|---|---|
|
Normal level, moderate echo suppression, less damage to speech |
|
Aggressive level (default), stronger echo suppression, may affect near-end speech quality to some extent |
|
Very aggressive level, strongest echo suppression, may have a greater impact on near-end speech quality |
Resource Consumption
The following table shows typical resource usage and performance data for each mode (16 kHz sample rate, single channel):
Mode |
Internal RAM (KB) |
PSRAM (KB) |
Time per Frame (ms) |
CPU Usage (%) |
|---|---|---|---|---|
SR_LOW_COST |
18.8 |
64.0 |
2.66 / 32 |
8.3 |
SR_HIGH_PERF |
8.2 |
100.1 |
2.72 / 32 |
8.5 |
VOIP_LOW_COST |
26.9 |
64.1 |
2.34 / 16 |
14.6 |
VOIP_HIGH_PERF |
69.4 |
66.6 |
2.60 / 16 |
16.3 |
FD_LOW_COST |
18.9 |
102.1 |
3.69 / 32 |
11.5 |
FD_HIGH_PERF |
8.3 |
138.2 |
3.73 / 32 |
11.7 |
Note
SR/FD mode frame length is 32 ms, VOIP mode frame length is 16 ms.
Test setting: ESP32-P4 @ 400 MHz, CONFIG_CACHE_L2_CACHE_256KB=y, CONFIG_CACHE_L2_CACHE_LINE_128B=y.
Actual resource consumption may vary slightly depending on the chip model, compiler optimization level, and specific configuration.
Test Audio Resources
File Name |
Description |
|---|---|
Far-end signal (speaker playback reference signal) |
|
Near-end signal (microphone signal containing echo) |
|
SR mode test audio |
|
VOIP mode test audio |
|
FD mode test audio |