Acoustic Echo Cancellation (AEC)

Overview

The ESP-SR AEC (Acoustic Echo Cancellation) module removes the echo of speaker playback that the microphone picks up. It is widely used in scenarios such as voice wake-up, voice calls, and full-duplex human-machine interaction.

AEC provides three implementations, one per application scenario. Each is available in a low-cost and a high-performance mode:

Application Scenario	Mode	Description
Speech Recognition (SR)	`AEC_MODE_SR_LOW_COST`, `AEC_MODE_SR_HIGH_PERF`	Low-cost mode with linear filtering only, small memory footprint and fast speed
Full-Duplex Conversation (FD)	`AEC_MODE_FD_LOW_COST`, `AEC_MODE_FD_HIGH_PERF`	Low-cost full-duplex mode, includes linear filtering + nonlinear processing, suitable for Full-Duplex dialogue scenarios
Voice over IP (VOIP)	`AEC_MODE_VOIP_LOW_COST`, `AEC_MODE_VOIP_HIGH_PERF`	Low-cost call mode, supports 8 kHz / 16 kHz, suitable for ordinary voice calls

Note

Select the mode based on the actual application scenario, resource budget, and performance requirements. `AEC_MODE_FD_LOW_COST` is generally recommended as the best balance between performance and resource consumption.
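As a rough illustration of choosing a mode by resource budget, the sketch below filters the modes of one scenario by internal RAM and PSRAM limits and returns the one with the lowest CPU usage. The figures are copied from the Resource Consumption table on this page (ESP32-P4 @ 400 MHz, 16 kHz, single channel). The type `aec_mode_cost_t`, the table `AEC_MODE_COSTS`, and the function `pick_aec_mode` are hypothetical helpers for this example and are not part of ESP-SR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type, not part of ESP-SR. */
typedef struct {
    const char *name;      /* mode name without the AEC_MODE_ prefix */
    double internal_kb;    /* internal RAM, KB */
    double psram_kb;       /* PSRAM, KB */
    double cpu_percent;    /* CPU usage, % */
} aec_mode_cost_t;

/* Values copied from the Resource Consumption table
 * (ESP32-P4 @ 400 MHz, 16 kHz, single channel). */
const aec_mode_cost_t AEC_MODE_COSTS[] = {
    {"SR_LOW_COST",    18.8,  64.0,  8.3},
    {"SR_HIGH_PERF",    8.2, 100.1,  8.5},
    {"VOIP_LOW_COST",  26.9,  64.1, 14.6},
    {"VOIP_HIGH_PERF", 69.4,  66.6, 16.3},
    {"FD_LOW_COST",    18.9, 102.1, 11.5},
    {"FD_HIGH_PERF",    8.3, 138.2, 11.7},
};

/* Return the lowest-CPU mode of one family ("SR", "FD" or "VOIP") that fits
 * both memory budgets, or NULL if none fits. */
const aec_mode_cost_t *pick_aec_mode(const char *family,
                                     double max_internal_kb,
                                     double max_psram_kb)
{
    assert(family != NULL);
    const size_t prefix_len = strlen(family);
    const size_t count = sizeof(AEC_MODE_COSTS) / sizeof(AEC_MODE_COSTS[0]);
    const aec_mode_cost_t *best = NULL;

    for (size_t i = 0; i < count; i++) {
        const aec_mode_cost_t *m = &AEC_MODE_COSTS[i];
        /* The family matches when the name starts with "<family>_". */
        if (strncmp(m->name, family, prefix_len) != 0 || m->name[prefix_len] != '_') {
            continue;
        }
        /* Skip modes that exceed either memory budget. */
        if (m->internal_kb > max_internal_kb || m->psram_kb > max_psram_kb) {
            continue;
        }
        if (best == NULL || m->cpu_percent < best->cpu_percent) {
            best = m;
        }
    }
    return best;
}
```

With a 10 KB internal RAM budget and ample PSRAM, `pick_aec_mode("FD", 10.0, 200.0)` selects `FD_HIGH_PERF`, because the table lists 8.3 KB of internal RAM for that mode against 18.9 KB for `FD_LOW_COST`. Measured figures on other chips or configurations may differ, so treat the table values as a starting point only.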

Usage

The AEC module provides two integration methods:

Method 1: Directly Call the AEC API

Suitable for scenarios requiring fine-grained control over the AEC module. The header file is `include/esp32s3/esp_aec.h`.

Basic Flow:

Create an AEC instance

#include "esp_aec.h"

aec_handle_t *aec = aec_create(
    16000,              // Sample rate (Hz), currently only 16000 is supported
    4,                  // Filter length, recommended value is 4, larger values consume more resources
    1,                  // Number of microphone channels
    AEC_MODE_SR_LOW_COST // Working mode
);

Or use advanced configuration:

aec_config_t config = {
    .mic_num       = 1,
    .ref_num       = 1,
    .out_num       = 1,
    .filter_length = 4,
    .sample_rate   = 16000,
    .caps          = MALLOC_CAP_SPIRAM | MALLOC_CAP_8BIT, // Allocate from external PSRAM
    .mode          = AEC_MODE_SR_LOW_COST,
    .nlp_level     = AEC_NLP_LEVEL_AGGR,                  // Only takes effect in FD modes
};
aec_handle_t *aec = aec_create_from_config(&config);

Get frame length

int frame_size = aec_get_chunksize(aec);

Allocate audio buffers

// Requires #include "esp_heap_caps.h"; each buffer holds one frame of int16_t samples
int16_t *mic  = heap_caps_aligned_alloc(16, frame_size * sizeof(int16_t), MALLOC_CAP_8BIT);
int16_t *ref  = heap_caps_aligned_alloc(16, frame_size * sizeof(int16_t), MALLOC_CAP_8BIT);
int16_t *out  = heap_caps_aligned_alloc(16, frame_size * sizeof(int16_t), MALLOC_CAP_8BIT);

Warning

All input/output buffers must hold 16-bit signed integer samples (`int16_t`). Allocating them with 16-byte alignment using `heap_caps_aligned_alloc(16, ...)` is recommended.
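To make the buffer sizes concrete, the following sketch works out the number of samples and bytes in one frame and checks a pointer for 16-byte alignment. It assumes that the chunk size equals the sample rate multiplied by the frame duration (32 ms for SR/FD and 16 ms for VOIP, per the Resource Consumption note), which this page does not state explicitly. On a device, always use the value returned by `aec_get_chunksize()`. The helpers `frame_samples`, `frame_bytes`, and `is_aligned` are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Samples per channel in one frame of the given duration
 * (assumption: chunk size = sample rate * frame duration). */
size_t frame_samples(unsigned sample_rate_hz, unsigned frame_ms)
{
    return (size_t)sample_rate_hz * frame_ms / 1000u;
}

/* Bytes needed for one single-channel int16_t buffer of that many samples. */
size_t frame_bytes(size_t samples)
{
    return samples * sizeof(int16_t);
}

/* Return 1 if ptr is aligned to `alignment` bytes (must be a power of two). */
int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

Under that assumption, a 32 ms frame at 16 kHz is 512 samples (1024 bytes) and a 16 ms frame is 256 samples (512 bytes). A call such as `is_aligned(mic, 16)` after allocation is a cheap sanity check during bring-up.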

Process audio frames

Full processing (linear filtering + nonlinear processing):
```
aec_process(aec, mic, ref, out);
```
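Assuming each `aec_process` call consumes exactly one frame of `aec_get_chunksize()` samples, as the buffer sizes above suggest, a longer capture has to be fed through frame by frame. The sketch below shows that loop in plain C so it can be tried on a host. The callback type `frame_fn_t`, the function `process_in_frames`, and the stand-in `subtract_ref` are hypothetical and not part of ESP-SR. On a device the callback would wrap `aec_process(aec, mic, ref, out)`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that processes exactly one frame of frame_size samples. */
typedef void (*frame_fn_t)(void *ctx, const int16_t *mic, const int16_t *ref,
                           int16_t *out, size_t frame_size);

/* Feed mic/ref through process_frame one full frame at a time.
 * Returns the number of samples written to out. A trailing partial
 * frame is left unprocessed and out is not touched for it. */
size_t process_in_frames(frame_fn_t process_frame, void *ctx,
                         const int16_t *mic, const int16_t *ref, int16_t *out,
                         size_t total_samples, size_t frame_size)
{
    assert(process_frame != NULL && frame_size > 0);
    size_t done = 0;
    while (total_samples - done >= frame_size) {
        process_frame(ctx, mic + done, ref + done, out + done, frame_size);
        done += frame_size;
    }
    return done;
}

/* Host-side stand-in used only to exercise the loop: subtracts the
 * reference from the microphone signal. This is NOT an echo canceller. */
void subtract_ref(void *ctx, const int16_t *mic, const int16_t *ref,
                  int16_t *out, size_t frame_size)
{
    (void)ctx;
    for (size_t i = 0; i < frame_size; i++) {
        out[i] = (int16_t)(mic[i] - ref[i]);
    }
}
```

The loop drops any trailing partial frame rather than padding it, which keeps every call at the exact frame size. If the per-frame byte count is a multiple of 16, the offsets `mic + done` also stay 16-byte aligned when the base buffers are.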

Release resources

aec_destroy(aec);
free(mic); free(ref); free(out);

Method 2: Use via the AFE Module

Suitable for scenarios that run several audio front-end algorithms at the same time, such as AEC, NS (noise suppression), VAD (voice activity detection), and WakeNet (wake word detection). See the Audio Front End documentation for details.

NLP Level Description

Nonlinear processing (NLP) further suppresses residual echo and is configured via `aec_nlp_level_t`. Currently, it only takes effect in FD modes:

Level	Description
`AEC_NLP_LEVEL_NORMAL`	Normal level, moderate echo suppression, less damage to speech
`AEC_NLP_LEVEL_AGGR`	Aggressive level (default), stronger echo suppression, may affect near-end speech quality to some extent
`AEC_NLP_LEVEL_VERYAGGR`	Very aggressive level, strongest echo suppression, may have a greater impact on near-end speech quality

Resource Consumption

The following table shows typical resource usage and performance data for each mode (16 kHz sample rate, single channel):

Mode	Internal RAM (KB)	PSRAM (KB)	Processing Time / Frame Length (ms)	CPU Usage (%)
SR_LOW_COST	18.8	64.0	2.66 / 32	8.3
SR_HIGH_PERF	8.2	100.1	2.72 / 32	8.5
VOIP_LOW_COST	26.9	64.1	2.34 / 16	14.6
VOIP_HIGH_PERF	69.4	66.6	2.60 / 16	16.3
FD_LOW_COST	18.9	102.1	3.69 / 32	11.5
FD_HIGH_PERF	8.3	138.2	3.73 / 32	11.7

Note

The frame length is 32 ms in SR and FD modes and 16 ms in VOIP modes.
Test setup: ESP32-P4 @ 400 MHz, CONFIG_CACHE_L2_CACHE_256KB=y, CONFIG_CACHE_L2_CACHE_LINE_128B=y.
Actual resource consumption may vary slightly depending on the chip model, compiler optimization level, and specific configuration.
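The CPU usage column in the table is consistent with dividing the processing time per frame by the frame length. The short sketch below reproduces that arithmetic and the remaining time budget per frame. The function names `aec_cpu_percent` and `aec_headroom_ms` are hypothetical and only serve this example.

```c
#include <assert.h>

/* CPU usage in percent: time spent processing one frame divided by
 * the duration of audio that the frame covers. */
double aec_cpu_percent(double process_ms, double frame_ms)
{
    assert(frame_ms > 0.0);
    return 100.0 * process_ms / frame_ms;
}

/* Time left in each frame period for other work, in milliseconds. */
double aec_headroom_ms(double process_ms, double frame_ms)
{
    return frame_ms - process_ms;
}
```

For `SR_LOW_COST`, `aec_cpu_percent(2.66, 32.0)` gives about 8.3, and for `VOIP_LOW_COST`, `aec_cpu_percent(2.34, 16.0)` gives about 14.6, matching the table. The VOIP modes show a higher percentage mainly because their frame is half as long, not because each call takes longer.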

Test Audio Resources

File Name	Description
`aec_in_far.wav`	Far-end signal (speaker playback reference signal)
`aec_in_near.wav`	Near-end signal (microphone signal containing echo)
`aec_test_sr.wav`	SR mode test audio
`aec_test_voip.wav`	VOIP mode test audio
`aec_test_fd.wav`	FD mode test audio
