GMF-AI-Audio


gmf_ai_audio is the AI voice front-end component of ESP-GMF, wrapping the esp-sr speech algorithm library (wake word, command word, AEC, NS, VAD, DOA) into elements that can be connected to a pipeline. The component provides six elements (ai_afe / ai_aec / ai_wn / ai_ns / ai_vad / ai_doa) and an internal manager (esp_gmf_afe_manager). Among them, ai_afe is the comprehensive interface encapsulating full voice front-end capabilities, suitable for direct use as a unified entry point; ai_aec / ai_wn / ai_ns / ai_vad / ai_doa are for individual capabilities. All six elements follow the unified element interface and can be chained and combined in any order in the same pipeline. This document covers the principles, configuration, and event system in the order of manager, comprehensive element, and standalone algorithm elements; for the element base class and runtime method mechanism, see GMF Elements; for the data path, see Data Flow.

Feature List

  • ai_afe: full voice front-end element; after connecting to codec_dev IO, outputs 16-bit mono PCM and reports wakeup, VAD, and command word events via event callbacks

  • ai_aec: standalone echo cancellation element; performs AEC on input PCM and outputs 16-bit mono PCM; suitable for scenarios requiring only echo-cancelled microphone signal without wakeup / VAD / command word

  • ai_wn: standalone wake word detection element; does not create feed / fetch tasks; synchronously detects in the element process and transparently passes the input PCM to the output port

  • ai_ns: standalone noise suppression element; input format is 16 kHz, 16-bit mono PCM; supports NSNet2 model or WebRTC NS backend

  • ai_vad: standalone voice activity detection element; supports WebRTC VAD and VADNet backends; reports VAD state via callback and can pass through input PCM

  • ai_doa: standalone direction of arrival estimation element; requires input format with two microphone channels; outputs angle estimation results via callback

  • esp_gmf_afe_manager: internal manager that encapsulates the feed and fetch tasks and is responsible for esp-sr AFE data input, result retrieval, feature toggling, and pause / resume

  • NS / VAD / SE: ai_afe uses esp-sr’s noise suppression (NS), voice activity detection (VAD), and speech enhancement (SE) capabilities via the AFE manager; whether each one is enabled is controlled by afe_config_t and the runtime feature switches

  • Channel format convention: the input channel arrangement is described by a string; M for microphone, R for speaker reference, N for unused channel; e.g., MMNR means the first two channels are microphones, plus one unused and one reference channel

  • Wakeup and VAD state machine: supports three combinations - “wakeup only”, “VAD only”, “wakeup + VAD”; automatically maintains IDLE / WAKEUP / SPEECHING / WAIT_FOR_SLEEP states

  • Command word detection (VCMD): based on MultiNet; independent of the wakeup state machine; started by the application calling esp_gmf_afe_vcmd_detection_begin

  • Manual wakeup control: esp_gmf_afe_keep_awake keeps the awake state; esp_gmf_afe_trigger_wakeup / ..._trigger_sleep switch the state manually; suitable for button wakeup and other non-voice triggers

  • Event system: 6 event types covering wakeup start / end, VAD start / end, command word detection, and command word timeout

Technical Details

Component Hierarchy

The relationship between the six elements and the manager is shown below. ai_afe is the upper-level wrapper of the manager, adapting manager callbacks to the GMF element interface; ai_aec, ai_wn, ai_ns, ai_vad, and ai_doa call the corresponding esp-sr algorithms directly, without going through the manager.

        classDiagram
    direction TB

    class esp_gmf_audio_element_t
    class ai_afe {
        feed/fetch task
        wakeup state machine
        command word detection
        event callback
    }
    class ai_aec {
        standalone AEC
        reference + microphone channels
    }
    class ai_wn {
        standalone WakeNet
        wake word detection
    }
    class ai_ns {
        standalone NS
        single-channel noise suppression
    }
    class ai_vad {
        standalone VAD
        state callback
    }
    class ai_doa {
        standalone DOA
        sound source localization
    }
    class esp_gmf_afe_manager {
        feed_task
        fetch_task
        feature control
    }

    esp_gmf_audio_element_t <|-- ai_afe
    esp_gmf_audio_element_t <|-- ai_aec
    esp_gmf_audio_element_t <|-- ai_wn
    esp_gmf_audio_element_t <|-- ai_ns
    esp_gmf_audio_element_t <|-- ai_vad
    esp_gmf_audio_element_t <|-- ai_doa
    ai_afe ..> esp_gmf_afe_manager : uses
    

When element behavior is needed, choose ai_afe / ai_aec / ai_wn / ai_ns / ai_vad / ai_doa to connect to the pipeline based on the scenario; when bypassing the GMF framework to use esp-sr directly, esp_gmf_afe_manager can be used standalone.

input_format Channel String

Elements using multi-channel input use the input_format string to describe the role of each channel: M for microphone capture, R for speaker reference (used as AEC reference), N for unused channel. For detailed rules of the channel string, see esp-sr AFE Input Channel Description.

For example, "MMNR" means the input is 4-channel interleaved PCM, channels 1/2 are microphones, channel 3 is unused, and channel 4 is reference. The component automatically extracts the required channels from the input PCM and feeds them to the underlying algorithm. The ai_afe input sample rate is fixed at 16 kHz, 16-bit; ai_aec additionally supports 8 kHz when using AFE_TYPE_VC_8K; ai_doa requires exactly two M channels in the input format.
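
The channel convention above can be illustrated with a small, self-contained sketch. It is not the component's implementation: parse_input_format, extract_channel, and ch_layout_t are hypothetical names introduced here, and the sketch only assumes interleaved 16-bit PCM and the M / R / N letters described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: how many channels of each role a format string declares. */
typedef struct {
    int mic;     /* number of 'M' (microphone) channels */
    int ref;     /* number of 'R' (speaker reference) channels */
    int unused;  /* number of 'N' (unused) channels */
    int total;   /* total channel count, equal to the string length */
} ch_layout_t;

/* Parse a channel string such as "MMNR".
 * Returns 0 on success, -1 if the string contains an unknown letter. */
static int parse_input_format(const char *fmt, ch_layout_t *out)
{
    ch_layout_t layout = {0};
    for (const char *p = fmt; *p != '\0'; ++p) {
        switch (*p) {
            case 'M': layout.mic++;    break;
            case 'R': layout.ref++;    break;
            case 'N': layout.unused++; break;
            default:  return -1;
        }
        layout.total++;
    }
    *out = layout;
    return 0;
}

/* Copy one channel out of interleaved PCM.
 * frames is the number of sample frames; ch is the zero-based channel index. */
static void extract_channel(const int16_t *interleaved, size_t frames,
                            int total_ch, int ch, int16_t *dst)
{
    for (size_t i = 0; i < frames; ++i) {
        dst[i] = interleaved[i * (size_t)total_ch + (size_t)ch];
    }
}
```

For "MMNR", the parser reports two microphone channels, one reference channel, and one unused channel; the reference signal is the channel at zero-based index 3.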

AFE Manager

esp_gmf_afe_manager wraps the esp-sr AFE interface into a callback model of “data input → algorithm processing → result output”, automatically creating feed and fetch tasks.

        flowchart LR
    ReadCb[("read_cb<br/>(provided by app)")] --> Feed["feed_task"]
    Feed --> Core["AFE processing module<br/>esp-sr"]
    Core --> Fetch["fetch_task"]
    Fetch --> ResultCb[("result_cb<br/>(received by app)")]
    

The application provides read_cb and result_cb via esp_gmf_afe_manager_cfg_t; feed_task periodically calls read_cb to get one frame of multi-channel PCM and feeds it to the esp-sr AFE; fetch_task retrieves the processed result (noise-suppressed / AEC-processed mono PCM + wakeup / VAD / command word events) and calls result_cb. The two tasks default to different cores (core 0 / core 1), stack 3 KiB, priority 5; DEFAULT_GMF_AFE_MANAGER_CFG provides default values.
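
As an illustration of the read_cb side of this model, the sketch below implements a callback with the same signature as esp_gmf_afe_manager_read_cb_t that serves PCM from a block of memory. The typedef is a local copy so the example compiles without the component headers, mem_source_t and mem_read_cb are hypothetical names, and the return convention (number of bytes delivered, negative on error) is an assumption; check esp_gmf_afe_manager.h for the actual contract.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the callback signature listed in the API reference. */
typedef int32_t (*read_cb_t)(void *buffer, int buf_sz, void *user_ctx, uint32_t ticks);

/* Hypothetical PCM source: a memory block with a read cursor. */
typedef struct {
    const uint8_t *data;  /* start of the PCM bytes */
    size_t         len;   /* total number of bytes available */
    size_t         pos;   /* bytes already consumed */
} mem_source_t;

/* Copy up to buf_sz bytes into buffer.
 * Assumed convention: returns the number of bytes delivered, or -1 on bad arguments. */
static int32_t mem_read_cb(void *buffer, int buf_sz, void *user_ctx, uint32_t ticks)
{
    (void)ticks;  /* a memory source never has to wait for data */
    mem_source_t *src = (mem_source_t *)user_ctx;
    if (buffer == NULL || src == NULL || buf_sz <= 0) {
        return -1;
    }
    size_t remain = src->len - src->pos;
    size_t n = remain < (size_t)buf_sz ? remain : (size_t)buf_sz;
    memcpy(buffer, src->data + src->pos, n);
    src->pos += n;
    return (int32_t)n;
}
```

A real callback would usually read from a codec or I2S driver and honor the ticks timeout instead of returning immediately.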

Individual features can be toggled at runtime:

esp_gmf_afe_manager_enable_features(mgr, ESP_AFE_FEATURE_AEC, true);
esp_gmf_afe_manager_enable_features(mgr, ESP_AFE_FEATURE_VAD, false);

Calling esp_gmf_afe_manager_suspend() with the suspend flag set to true suspends both the feed and fetch tasks, which suits low-power scenarios; calling it with false resumes them. After initialization, esp_gmf_afe_manager_get_chunk_size() and esp_gmf_afe_manager_get_input_ch_num() return the number of samples per channel in each processing chunk and the total number of input channels, which helps the application size its IO buffer.
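
The buffer sizing mentioned above is a simple product. The sketch below assumes the chunk size is counted in samples per channel (as the API reference notes) and that samples are 16-bit; feed_buffer_bytes is a hypothetical helper, not part of the component.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to hold one processing chunk of interleaved 16-bit PCM.
 * chunk_samples: samples per channel, as reported by the chunk-size query.
 * ch_num: total number of input channels, as reported by the channel query. */
static size_t feed_buffer_bytes(size_t chunk_samples, uint8_t ch_num)
{
    return chunk_samples * (size_t)ch_num * sizeof(int16_t);
}
```

With an illustrative chunk of 512 samples per channel and 4 input channels, this gives 512 × 4 × 2 = 4096 bytes; the real chunk size depends on the AFE configuration and should be queried at runtime.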

ai_afe: Full Voice Front-End

ai_afe wraps esp_gmf_afe_manager into an element that can be connected to a pipeline: it feeds multi-channel PCM read from codec_dev IO into the manager, writes the mono PCM output by the fetch task to the output port, and converts wakeup / VAD / command word results into event callbacks for the application.

The output of ai_afe is 16-bit mono PCM. Depending on the esp-sr AFE configuration, the output audio may incorporate AEC, NS, SE, AGC processing results; wakeup, VAD, and command word detection results are reported via event callbacks and are not placed in audio payloads.

Configuration. esp_gmf_afe_cfg_t requires at least afe_manager, models (loaded esp-sr models), and event_cb. The underlying AFE type is determined by the type parameter of afe_config_init; common types in the latest esp-sr include AFE_TYPE_SR (speech recognition), AFE_TYPE_VC / AFE_TYPE_VC_8K (voice communication), and AFE_TYPE_FD (full duplex, suitable for voice interaction scenarios with simultaneous playback and capture):

afe_config_t *afe_cfg = afe_config_init("MMNR", models, AFE_TYPE_FD, AFE_MODE_LOW_COST);
afe_cfg->wakenet_init = true;
afe_cfg->vad_init     = true;
afe_cfg->aec_init     = true;

esp_gmf_afe_manager_cfg_t mgr_cfg = DEFAULT_GMF_AFE_MANAGER_CFG(afe_cfg,
    my_read_cb, &io_ctx, NULL, NULL);
esp_gmf_afe_manager_handle_t mgr = NULL;
esp_gmf_afe_manager_create(&mgr_cfg, &mgr);

esp_gmf_afe_cfg_t afe_el_cfg = DEFAULT_GMF_AFE_CFG(mgr, my_event_cb, &app_ctx, models);
afe_el_cfg.vcmd_detect_en = true;
esp_gmf_element_handle_t ai_afe = NULL;
esp_gmf_afe_init(&afe_el_cfg, &ai_afe);

Four timing parameters (default values are given in esp_gmf_afe.h) control state machine behavior:

| Parameter     | Default      | Description |
|---------------|--------------|-------------|
| wakeup_time   | 10000 ms     | How long after wakeup without any VAD event before WAKEUP_END is triggered |
| wakeup_end    | 2000 ms      | How long after VAD ends with silence before WAKEUP_END is triggered |
| vcmd_timeout  | 5760 ms      | Command word detection timeout; begin must be called again after timeout |
| delay_samples | 2048 samples | Output PCM delay to compensate for VAD detection lag; converted to time, should be greater than afe_config_t.vad_min_speech_ms |
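
The delay_samples constraint is easier to check once the value is converted to milliseconds. The sketch below does that conversion; samples_to_ms and delay_covers_vad are hypothetical helper names, and the 16 kHz rate is the ai_afe input rate stated earlier in this document.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a sample count to whole milliseconds at the given sample rate. */
static uint32_t samples_to_ms(uint32_t samples, uint32_t sample_rate_hz)
{
    return (uint32_t)(((uint64_t)samples * 1000u) / sample_rate_hz);
}

/* True when the output delay is strictly longer than the VAD minimum speech time. */
static bool delay_covers_vad(uint32_t delay_samples, uint32_t sample_rate_hz,
                             uint32_t vad_min_speech_ms)
{
    return samples_to_ms(delay_samples, sample_rate_hz) > vad_min_speech_ms;
}
```

At 16 kHz, the default of 2048 samples corresponds to 2048 × 1000 / 16000 = 128 ms, so it satisfies the constraint only when vad_min_speech_ms is configured below 128.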

Wakeup and VAD State Machine. Three combinations switch automatically. Wakeup only:

        stateDiagram-v2
    [*] --> IDLE
    IDLE --> WAKEUP : wake word / WAKEUP_START
    WAKEUP --> IDLE : wakeup_time timeout / WAKEUP_END
    

VAD only:

        stateDiagram-v2
    [*] --> IDLE
    IDLE --> SPEECHING : voice detected / VAD_START
    SPEECHING --> IDLE : silence / VAD_END
    

Wakeup + VAD combined:

        stateDiagram-v2
    [*] --> IDLE
    IDLE --> WAKEUP : wake word / WAKEUP_START
    WAKEUP --> SPEECHING : voice / VAD_START
    WAKEUP --> IDLE : wakeup_time timeout / WAKEUP_END
    SPEECHING --> WAIT_FOR_SLEEP : silence / VAD_END
    WAIT_FOR_SLEEP --> SPEECHING : voice / VAD_START
    WAIT_FOR_SLEEP --> IDLE : wakeup_end timeout / WAKEUP_END
    

The combined mode is suitable for the interaction flow of “speak wake word → voice input → return to standby after silence”, avoiding frequent VAD events being triggered outside the wakeup interval.
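
The combined diagram can be read as a transition function. The following sketch models only the transitions drawn above, with hypothetical demo_ names; it is an illustration of the diagram, not the element's internal implementation, and it treats the two timeouts as plain inputs rather than running real timers.

```c
/* States from the combined wakeup + VAD diagram. */
typedef enum { ST_IDLE, ST_WAKEUP, ST_SPEECHING, ST_WAIT_FOR_SLEEP } demo_state_t;

/* Inputs that drive the diagram; the two timeouts are modeled as discrete inputs. */
typedef enum {
    IN_WAKE_WORD,            /* wake word detected */
    IN_VOICE,                /* VAD reports speech */
    IN_SILENCE,              /* VAD reports silence */
    IN_WAKEUP_TIME_EXPIRED,  /* wakeup_time elapsed without speech */
    IN_WAKEUP_END_EXPIRED    /* wakeup_end elapsed after speech ended */
} demo_input_t;

/* Events reported on a transition. */
typedef enum { EV_NONE, EV_WAKEUP_START, EV_WAKEUP_END, EV_VAD_START, EV_VAD_END } demo_event_t;

/* Apply one input; updates *st and returns the event emitted, if any. */
static demo_event_t demo_step(demo_state_t *st, demo_input_t in)
{
    switch (*st) {
        case ST_IDLE:
            if (in == IN_WAKE_WORD) { *st = ST_WAKEUP; return EV_WAKEUP_START; }
            break;
        case ST_WAKEUP:
            if (in == IN_VOICE) { *st = ST_SPEECHING; return EV_VAD_START; }
            if (in == IN_WAKEUP_TIME_EXPIRED) { *st = ST_IDLE; return EV_WAKEUP_END; }
            break;
        case ST_SPEECHING:
            if (in == IN_SILENCE) { *st = ST_WAIT_FOR_SLEEP; return EV_VAD_END; }
            break;
        case ST_WAIT_FOR_SLEEP:
            if (in == IN_VOICE) { *st = ST_SPEECHING; return EV_VAD_START; }
            if (in == IN_WAKEUP_END_EXPIRED) { *st = ST_IDLE; return EV_WAKEUP_END; }
            break;
    }
    return EV_NONE;  /* input ignored in the current state */
}
```

Note that voice arriving in ST_IDLE produces no event, which is the property the combined mode relies on to suppress VAD events outside the wakeup interval.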

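Manual wakeup control. In addition to the automatic state machine, three APIs provide non-voice triggering: esp_gmf_afe_keep_awake keeps the awake state (while it is enabled, WAKEUP_END is not triggered automatically), and esp_gmf_afe_trigger_wakeup / ..._trigger_sleep switch the state manually.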

Command word detection (VCMD): independent of the wakeup state machine. The typical flow is to call esp_gmf_afe_vcmd_detection_begin() after receiving WAKEUP_START; upon detection, the event callback provides esp_gmf_afe_vcmd_info_t (containing phrase_id, prob, and the command string); call begin again after detection or timeout to continue. vcmd_detection_cancel clears the current detection state while preserving the feature enable flag, so begin can be called again later.

Event System

ai_afe reports six event types via esp_gmf_afe_event_cb_t; event enum values can be positive or negative, allowing command word IDs to be placed directly in the enum:

| Event                             | Value | Payload |
|-----------------------------------|-------|---------|
| ESP_GMF_AFE_EVT_WAKEUP_START      | -100  | esp_gmf_afe_wakeup_info_t (volume, wake word index, model index) |
| ESP_GMF_AFE_EVT_WAKEUP_END        | -99   | NULL |
| ESP_GMF_AFE_EVT_VAD_START         | -98   | NULL |
| ESP_GMF_AFE_EVT_VAD_END           | -97   | NULL |
| ESP_GMF_AFE_EVT_VCMD_DECT_TIMEOUT | -96   | NULL |
| ESP_GMF_AFE_EVT_VCMD_DECTECTED    | >= 0  | esp_gmf_afe_vcmd_info_t; the enumeration value equals the phrase ID |

The callback executes in the fetch_task context. Keep it to lightweight dispatching (updating state, enqueuing a message) and run time-consuming logic in the main thread or an independent task.
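
Because command word IDs share the value space with the fixed events, a callback typically starts by sorting the incoming value. The sketch below shows one way to do that cheaply; the DEMO_ constants are a local mirror of the values in the table above, and demo_kind_t and classify_event are hypothetical names, so the example compiles without the component headers.

```c
#include <stdint.h>

/* Local mirror of the event values listed in the table above. */
enum {
    DEMO_EVT_WAKEUP_START = -100,
    DEMO_EVT_WAKEUP_END   = -99,
    DEMO_EVT_VAD_START    = -98,
    DEMO_EVT_VAD_END      = -97,
    DEMO_EVT_VCMD_TIMEOUT = -96
};

/* Coarse event categories an application might dispatch on. */
typedef enum {
    KIND_WAKEUP,         /* wakeup start or end */
    KIND_VAD,            /* VAD start or end */
    KIND_VCMD_TIMEOUT,   /* command word detection timed out */
    KIND_VCMD_DETECTED,  /* value is a command phrase ID */
    KIND_UNKNOWN         /* any other negative value */
} demo_kind_t;

/* Sort an event value into a category without touching the payload. */
static demo_kind_t classify_event(int32_t type)
{
    if (type >= 0) {
        return KIND_VCMD_DETECTED;  /* non-negative values carry the phrase ID */
    }
    switch (type) {
        case DEMO_EVT_WAKEUP_START:
        case DEMO_EVT_WAKEUP_END:   return KIND_WAKEUP;
        case DEMO_EVT_VAD_START:
        case DEMO_EVT_VAD_END:      return KIND_VAD;
        case DEMO_EVT_VCMD_TIMEOUT: return KIND_VCMD_TIMEOUT;
        default:                    return KIND_UNKNOWN;
    }
}
```

In a real callback, the category and value would then be posted to an application queue so that the heavier handling runs outside fetch_task.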

ai_aec: Standalone Echo Cancellation

ai_aec only performs echo cancellation: extracts microphone + reference channels from multi-channel input PCM according to input_format, processes them through the AEC algorithm, and outputs mono PCM. No model partition is required; it consumes fewer resources than ai_afe and is suitable for recording pipelines that only need echo cancellation without wakeup / VAD / command word.

Three tuning fields in esp_gmf_aec_cfg_t:

  • filter_len: filter length; recommended 4 for ESP32-S3 / P4, 2 for ESP32-C5; higher values consume more CPU

  • type: AFE_TYPE_VC (voice communication) or AFE_TYPE_SR (speech recognition)

  • mode: AFE_MODE_LOW_COST or AFE_MODE_HIGH_PERF

esp_gmf_aec_cfg_t cfg = {
    .filter_len   = 4,
    .type         = AFE_TYPE_SR,
    .mode         = AFE_MODE_HIGH_PERF,
    .input_format = "MMNR",
};
esp_gmf_obj_handle_t aec = NULL;
esp_gmf_aec_init(&cfg, &aec);

ai_aec internally maintains a synchronization buffer for reference and microphone signals: each process accumulates one frame of aligned data before calling the underlying AEC, outputting 16-bit mono PCM. The input sample rate is typically 16 kHz; when configured with AFE_TYPE_VC_8K, the input sample rate is 8 kHz. The bit depth must be 16-bit PCM; any mismatch is rejected at the open stage.

ai_wn: Standalone Wake Word Detection

ai_wn is a lightweight wrapper of WakeNet: process synchronously runs detection on the input PCM, calls the user’s detect_cb on a hit and passes the current frame through to the output port; on a miss, it also passes through, leaving the downstream to decide how to handle it.

Differences from ai_afe:

  • Does not create feed / fetch tasks; processing occurs directly in the GMF task context

  • Does not depend on the AFE manager or the full model set; only loads the WakeNet model

  • Lower resource usage; suitable for memory-constrained scenarios or those requiring only wake word detection

esp_gmf_wn_cfg_t cfg = {
    .models       = models,
    .det_mode     = DET_MODE_2CH_90,
    .input_format = "MMNR",
    .detect_cb    = my_wakeup_cb,
    .user_ctx     = &ctx,
};
esp_gmf_element_handle_t wn = NULL;
esp_gmf_wn_init(&cfg, &wn);

ai_wn supports sample rates of 8 kHz or 16 kHz with 16-bit PCM. The number of channels is determined by det_mode when WakeNet is initialized (e.g., DET_MODE_90 for one channel); the number of M channels in input_format must match it, otherwise the model refuses to run.

ai_ns: Standalone Noise Suppression

ai_ns performs noise suppression on single-channel PCM and outputs PCM in the same format. It is suitable for recording or voice pre-processing pipelines that only need noise suppression without the full AFE state machine.

Main fields of esp_gmf_ns_cfg_t:

  • sample_rate: sample rate; currently supports 16 kHz

  • channel: channel count; currently only mono is supported

  • frame_ms: WebRTC NS frame duration; supports 10 / 20 / 30 ms

  • model_name and partition_label: NSNet2 model name and model partition label

esp_gmf_ns_cfg_t cfg = ESP_GMF_NS_CFG_DEFAULT();
esp_gmf_obj_handle_t ns = NULL;
esp_gmf_ns_init(&cfg, &ns);

When CONFIG_SR_NSN_NSNET2 is enabled, ai_ns loads the NSNet2 model from the partition specified by partition_label; when the WebRTC NS backend is enabled, model-related fields are not used.

ai_vad: Standalone Voice Activity Detection

ai_vad performs voice activity detection on single-channel PCM and reports via callback when the VAD state changes. The element can copy the input PCM to the output port for subsequent pipeline consumption of the original audio.

Main fields of esp_gmf_vad_cfg_t:

  • sample_rate: sample rate; WebRTC VAD supports 8 kHz / 16 kHz / 32 kHz

  • frame_ms: WebRTC VAD frame duration; supports 10 / 20 / 30 ms

  • vad_mode: VAD sensitivity mode

  • result_callback: state change callback, returns the underlying vad_state_t

  • model_name and partition_label: VADNet model name and model partition label

static void vad_cb(vad_state_t state, void *ctx)
{
    /* Update application logic based on VAD state */
}

esp_gmf_vad_cfg_t cfg = ESP_GMF_VAD_CFG_DEFAULT();
cfg.result_callback = vad_cb;
esp_gmf_obj_handle_t vad = NULL;
esp_gmf_vad_init(&cfg, &vad);

When the VADNet backend is selected, the element loads the VADNet model from the model partition and uses the frame length required by the model; when the WebRTC backend is selected, frame_ms controls the processing duration per call.
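
For the WebRTC backend, frame_ms and sample_rate together fix the number of samples consumed per call. The sketch below computes that number and rejects combinations outside the values listed above; webrtc_vad_frame_samples is a hypothetical helper, not a function of the component.

```c
#include <stdbool.h>
#include <stdint.h>

/* Samples per WebRTC VAD frame, or -1 if the combination is not one of the
 * documented values (8 / 16 / 32 kHz and 10 / 20 / 30 ms). */
static int webrtc_vad_frame_samples(uint32_t sample_rate_hz, uint32_t frame_ms)
{
    bool rate_ok  = sample_rate_hz == 8000u || sample_rate_hz == 16000u ||
                    sample_rate_hz == 32000u;
    bool frame_ok = frame_ms == 10u || frame_ms == 20u || frame_ms == 30u;
    if (!rate_ok || !frame_ok) {
        return -1;
    }
    /* All supported rates are multiples of 1000, so the division is exact. */
    return (int)(sample_rate_hz / 1000u * frame_ms);
}
```

For example, 16 kHz with a 30 ms frame means 480 samples (960 bytes of 16-bit mono PCM) per processing call.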

ai_doa: Standalone Direction of Arrival Estimation

ai_doa estimates the direction of the sound source based on two microphone signals; the processing result is returned as an angle value via callback without outputting new PCM data. It is suitable for applications where a microphone array needs to sense the direction of the sound source.

Main fields of esp_gmf_doa_cfg_t:

  • sample_rate: sample rate; default 16 kHz

  • resolution: direction estimation resolution

  • d_mics: physical distance between the two microphones in meters

  • frame_ms: audio duration required to produce one DOA result; default 64 ms

  • input_format: input channel arrangement; must contain exactly two M channels

  • result_callback: direction estimation result callback

static void doa_cb(float angle, void *ctx)
{
    /* angle is the direction of arrival estimation result */
}

esp_gmf_doa_cfg_t cfg = ESP_GMF_DOA_CFG_DEFAULT();
cfg.result_callback = doa_cb;
esp_gmf_obj_handle_t doa = NULL;
esp_gmf_doa_init(&cfg, &doa);

Performance

The bottleneck of AI Audio elements is concentrated in the underlying esp-sr algorithm; the GMF layer overhead is mainly acquire-release and callback dispatch. Optimization recommendations:

| Module      | Main Bottleneck | Optimization Direction |
|-------------|-----------------|------------------------|
| ai_afe      | CPU when wake model + AEC + NS run simultaneously | Assign feed / fetch to different cores (default 0 / 1); use afe_config_init to disable temporarily unneeded features |
| ai_aec      | Filter length | Use filter_len = 2 on CPU-constrained SoCs like ESP32-C5; AEC can be disabled when only recording speech in quiet environments |
| ai_wn       | WakeNet model inference | Choose 1-channel version (DET_MODE_90) for det_mode to halve computation |
| ai_ns       | NS model or WebRTC NS computation | Use mono input; choose NSNet2 or WebRTC backend based on actual noise conditions |
| ai_vad      | VAD model or WebRTC VAD computation | Use shorter frame length for WebRTC backend to reduce latency; ensure correct model partition for VADNet backend |
| ai_doa      | DOA algorithm and dual-microphone channel extraction | Reduce frame_ms to lower callback interval; set d_mics according to actual microphone spacing |
| AFE Manager | feed / fetch queue length and ringbuffer size | Watch afe_config_t.feed_buffer_size; read_cb should not block too long to avoid slowing the algorithm |

Application Examples

  • elements/gmf_ai_audio/examples/wwe: Complete wake word detection project, covering ai_afe + manager creation, event callback handling, and command word triggering

  • elements/gmf_ai_audio/examples/aec_rec: AEC recording project, demonstrating ai_aec connected to a pipeline and outputting echo-cancelled PCM

  • elements/gmf_ai_audio/examples/wwe/README_CN.md and elements/gmf_ai_audio/examples/aec_rec/README_CN.md: Board wiring, Kconfig options, and run instructions for each project

Use idf.py create-project-from-example "espressif/gmf_ai_audio=<version>:wwe" to generate a compilable project directly based on this component.

Debugging Tools

ESP Audio Analyzer is Espressif’s audio testing solution, combining a device-side test project with a web-based analysis interface. Over a WebSocket connection, it runs standardized tests on microphones, speakers, AEC, and related capabilities, and outputs metrics such as THD and SNR along with structured test reports. After the device joins the network, connect from the web page to start testing.

The test project is built on gmf_ai_audio: the recording pipeline uses ai_afe with AEC enabled in the AFE by default. When tuning AEC performance, you can verify echo cancellation in full-duplex play-and-record scenarios without manually capturing PCM or writing playback scripts. The web UI adjusts MIC gain, playback volume, and channel format (M / R / N layout, e.g. MMNR) in real time to match hardware reference wiring and observe AEC changes. Exported raw recordings and before/after comparisons in reports help troubleshoot echo residual and similar issues.

  • Covers 11 standardized audio tests across microphone, speaker, and AEC modules

  • Test project enables AEC inside ai_afe by default, consistent with the element configuration in this document

  • Web UI supports MIC gain, playback volume, and channel format adjustment for AEC comparison

  • Supports raw recording export and structured test reports

  • Companion test project: esp_audio_analyzer_app

SoC Compatibility

Different elements depend on different esp-sr models and hardware acceleration capabilities; the support matrix is as follows:

| Element | ESP32         | ESP32-S3  | ESP32-S31 | ESP32-C3      | ESP32-C5      | ESP32-P4      |
|---------|---------------|-----------|-----------|---------------|---------------|---------------|
| ai_afe  | Supported     | Supported | Supported | Not supported | Not supported | Supported     |
| ai_aec  | Supported     | Supported | Supported | Not supported | Supported     | Supported     |
| ai_wn   | Supported     | Supported | Supported | Supported     | Supported     | Supported     |
| ai_ns   | Supported     | Supported | Supported | Not supported | Supported     | Supported     |
| ai_vad  | Supported     | Supported | Supported | Supported     | Supported     | Supported     |
| ai_doa  | Not supported | Supported | Supported | Not supported | Not supported | Not supported |

Both ai_afe and ai_wn depend on the esp-sr model data partition; the application must reserve a model partition in the partition table and flash the corresponding model. For model preparation and flashing steps, refer to the esp-sr documentation and the model configuration instructions in elements/gmf_ai_audio/examples/wwe/README_CN.md.

FAQ

Q: Wake word detection sensitivity is insufficient or events are not reported. How to troubleshoot?

Check in order: whether afe_config_t.wakenet_init is true, whether the model partition is correctly flashed, whether the number of M channels in input_format matches the hardware microphone wiring, and whether the microphone sampling level is too low (use an oscilloscope or esp_gmf_afe_wakeup_info_t.data_volume to back-calculate). The wwe example’s README.md provides a complete hardware checklist.

Q: feed_task triggered a task watchdog timeout?

AFE inference has high CPU usage; feed_task and fetch_task should be assigned to different cores. On single-core ESP32 chips, feed_task easily times out when competing with other high-load application tasks; it is recommended to increase fetch_task_setting.prio or use a dedicated timer task to write input data to the AFE.

Q: ai_aec output has noticeable echo residue?

Confirm four things: whether the reference signal (R channel) is connected to the speaker output reference, whether the sample rate is 16 kHz, whether there is a timing offset between the microphone and reference, and whether filter_len is too small (recommended 4 for ESP32-S3 / P4). For specific debugging methods, see the header comments in esp_gmf_aec.c and the esp-sr AEC documentation.

Q: No event after command word detection begin?

Check whether vcmd_detect_en is set to true in esp_gmf_afe_cfg_t, whether mn_language matches the model language (cn / en), and whether a command word was input within vcmd_timeout. After timeout, ESP_GMF_AFE_EVT_VCMD_DECT_TIMEOUT is returned; begin must be called again.

Q: How to choose between ai_wn and ai_afe?

Use ai_wn for lightweight wake-word-only scenarios (Bluetooth speakers, sensor nodes); use ai_afe when full voice interaction is needed (wakeup + VAD + command word / AEC / NS). Both process raw multi-channel PCM and cannot be chained in the same pipeline.

Q: How to use esp_gmf_afe_manager standalone without connecting to a GMF pipeline?

esp_gmf_afe_manager_create() does not require the caller to be a GMF element; both read_cb and result_cb are ordinary callbacks. The manager can therefore be used standalone in non-GMF scenarios with self-managed input/output loops, but the acquire-release protocol and pipeline control capabilities are not available in that mode.

API Reference

Header files for this component:

  • esp_gmf_afe_manager.h: AFE manager configuration, feature toggling, pause / resume

  • esp_gmf_afe.h: ai_afe element initialization, command word control, manual wakeup, event callbacks

  • esp_gmf_aec.h: ai_aec element configuration

  • esp_gmf_wn.h: ai_wn element configuration and detection callbacks

  • esp_gmf_ns.h: ai_ns element configuration

  • esp_gmf_vad.h: ai_vad element configuration and result callbacks

  • esp_gmf_doa.h: ai_doa element configuration and direction estimation callbacks

  • esp_gmf_ai_audio_methods.h: runtime method name macros

Header File

Functions

esp_gmf_err_t esp_gmf_afe_manager_create(esp_gmf_afe_manager_cfg_t *cfg, esp_gmf_afe_manager_handle_t *handle)

Create an AFE Manager instance.

Parameters:
  • cfg[in] Pointer to the AFE manager configuration structure

  • handle[out] Pointer to the created AFE manager handle

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_FAIL Failed to create the AFE manager

  • ESP_GMF_ERR_MEMORY_LACK Insufficient memory allocation

esp_gmf_err_t esp_gmf_afe_manager_destroy(esp_gmf_afe_manager_handle_t handle)

Destroy an AFE Manager instance.

Parameters:

handle[in] AFE manager handle to be destroyed

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid handle

esp_gmf_err_t esp_gmf_afe_manager_set_read_cb(esp_gmf_afe_manager_handle_t handle, esp_gmf_afe_manager_read_cb_t read_cb, void *read_ctx)

Set the audio input read callback for the AFE Manager.

Note

If the read callback is set to NULL, the AFE Manager will be suspended

Parameters:
  • handle[in] AFE manager handle

  • read_cb[in] Function pointer to the read callback

  • read_ctx[in] User-defined context to be passed to the callback

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid arguments

esp_gmf_err_t esp_gmf_afe_manager_set_result_cb(esp_gmf_afe_manager_handle_t handle, esp_gmf_afe_manager_result_cb_t proc_cb, void *user_ctx)

Register a processing result callback for the AFE Manager.

Parameters:
  • handle[in] AFE manager handle

  • proc_cb[in] Function pointer to the result callback

  • user_ctx[in] User-defined context to be passed to the callback

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid arguments

esp_gmf_err_t esp_gmf_afe_manager_suspend(esp_gmf_afe_manager_handle_t handle, bool suspend)

Suspend or resume the AFE Manager.

Parameters:
  • handle[in] AFE manager handle

  • suspend[in] true to suspend, false to resume

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid arguments

esp_gmf_err_t esp_gmf_afe_manager_enable_features(esp_gmf_afe_manager_handle_t handle, esp_gmf_afe_feature_t feature, bool enable)

Enable or disable specific features in the AFE Manager.

Parameters:
  • handle[in] AFE manager handle

  • feature[in] Feature to be configured (see esp_gmf_afe_feature_t)

  • enable[in] true to enable, false to disable

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid arguments

esp_gmf_err_t esp_gmf_afe_manager_get_features(esp_gmf_afe_manager_handle_t handle, esp_gmf_afe_manager_features_t *features)

Retrieve the current feature enable states of the AFE Manager.

Parameters:
  • handle[in] AFE manager handle

  • features[out] Pointer to a structure to store the feature states

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid arguments

esp_gmf_err_t esp_gmf_afe_manager_get_chunk_size(esp_gmf_afe_manager_handle_t handle, size_t *size)

Get the processing chunk size for the AFE Manager.

Note

The chunk size represents the number of audio samples per channel. The AFE Manager processes data in fixed-size chunks.

Parameters:
  • handle[in] AFE manager handle

  • size[out] Pointer to store the chunk size (unit: samples)

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid arguments

esp_gmf_err_t esp_gmf_afe_manager_get_input_ch_num(esp_gmf_afe_manager_handle_t handle, uint8_t *ch_num)

Retrieve the number of input channels for the AFE Manager.

Parameters:
  • handle[in] AFE manager handle

  • ch_num[out] Pointer to store the number of channels

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid arguments

Structures

struct esp_gmf_afe_manager_task_setting_t

Configuration structure for the task setting.

Public Members

uint32_t stack_size

Task stack size

uint8_t core

Task core id

uint8_t prio

Task priority

struct esp_gmf_afe_manager_cfg_t

Configuration structure for the AFE manager.

Public Members

afe_config_t *afe_cfg

Configuration of ESP AFE

esp_gmf_afe_manager_task_setting_t feed_task_setting

Feed task setting

esp_gmf_afe_manager_task_setting_t fetch_task_setting

Fetch task setting

esp_gmf_afe_manager_read_cb_t read_cb

Callback function for reading audio data

void *read_ctx

Context for the read callback function

esp_gmf_afe_manager_result_cb_t result_cb

Callback function for processing AFE results

void *result_ctx

Context for the result callback function

struct esp_gmf_afe_manager_features_t

GMF AFE Manager Feature Configuration.

    This structure defines the feature enable states for the AFE manager.
    A value of `true` indicates that the feature is enabled, while `false` indicates it is disabled.

Public Members

bool wakeup

Wake-up detection

bool vad

Voice Activity Detection (VAD)

bool ns

Noise Suppression (NS)

bool aec

Acoustic Echo Cancellation (AEC)

bool se

Speech Enhancement (SE)

Macros

ESP_AFE_MANAGER_FEED_TASK_CORE

The AFE Manager aims to provide users with a simple interface for managing AFE (audio front end) functions, including WakeNet, VAD, AEC, SE, and more. This component automatically creates the feed and fetch tasks; users only need to provide a data read callback and a result processing callback. Users can configure AFE functions through the afe_config_t structure. The data fed into the AFE must be 16-bit PCM with a sampling rate of 16 kHz; the number of channels and the channel arrangement are determined by the configuration passed to afe_config_init. For details, refer to the description of afe_config_init provided by esp-sr.

ESP_AFE_MANAGER_FEED_TASK_PRIO
ESP_AFE_MANAGER_FEED_TASK_STACK
ESP_AFE_MANAGER_FETCH_TASK_CORE
ESP_AFE_MANAGER_FETCH_TASK_PRIO
ESP_AFE_MANAGER_FETCH_TASK_STACK
DEFAULT_GMF_AFE_MANAGER_CFG(_afe_cfg, _read_cb, _read_ctx, _result_cb, _result_ctx)

Type Definitions

typedef void *esp_gmf_afe_manager_handle_t

Handle for the AFE manager.

typedef void (*esp_gmf_afe_manager_result_cb_t)(afe_fetch_result_t *result, void *user_ctx)

Callback function type for processing AFE results.

Param result:

[in] Pointer to the result structure

Param user_ctx:

[in] User context to be passed to the callback function

typedef int32_t (*esp_gmf_afe_manager_read_cb_t)(void *buffer, int buf_sz, void *user_ctx, uint32_t ticks)

Callback type for reading data.

Param buffer:

[in] Pointer to the buffer to read data into

Param buf_sz:

[in] Size of the buffer

Param user_ctx:

[in] User context to be passed to the callback function

Param ticks:

[in] Number of ticks to wait for data

Return:

Enumerations

enum esp_gmf_afe_feature_t

Values:

enumerator ESP_AFE_FEATURE_WAKENET

WakeNet function

enumerator ESP_AFE_FEATURE_VAD

Voice Activity Detection function

enumerator ESP_AFE_FEATURE_AEC

Acoustic Echo Cancellation function

enumerator ESP_AFE_FEATURE_SE

Speech Enhancement function

Header File

Functions

esp_gmf_err_t esp_gmf_afe_init(void *config, esp_gmf_obj_handle_t *handle)

Initialize the GMF AFE.

Parameters:
  • config[in] Pointer to the configuration structure

  • handle[out] Pointer to the handle to be created

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_MEMORY_LACK Memory allocation failed

  • ESP_GMF_ERR_INVALID_ARG Invalid argument

esp_gmf_err_t esp_gmf_afe_vcmd_detection_begin(esp_gmf_element_handle_t handle)

Begin voice command detection.

Parameters:

handle[in] Handle to the GMF object

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid argument

  • ESP_GMF_ERR_INVALID_STATE Voice command not enabled

esp_gmf_err_t esp_gmf_afe_vcmd_detection_cancel(esp_gmf_element_handle_t handle)

Cancel voice command detection.

Note

This function clears the state of the voice command detection process. Voice command detection stays enabled, and the user can call esp_gmf_afe_vcmd_detection_begin to start detection again.

Parameters:

handle[in] Handle to the GMF object

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid argument

  • ESP_GMF_ERR_INVALID_STATE Voice command not enabled

esp_gmf_err_t esp_gmf_afe_set_event_cb(esp_gmf_element_handle_t handle, esp_gmf_afe_event_cb_t cb, void *ctx)

Set the event callback for the AFE (Audio Front-End) element.

    This function registers a callback function to handle events generated by the
    AFE element. The callback will be invoked with the specified context whenever
    an event occurs.
Parameters:
  • handle – The handle to the AFE element

  • cb – The callback function to handle AFE events

  • ctx – User-defined context to be passed to the callback function

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid argument

  • ESP_GMF_ERR_INVALID_STATE Config not exist

esp_gmf_err_t esp_gmf_afe_keep_awake(esp_gmf_element_handle_t handle, bool enable)

Enable or disable keep-awake mode.

    When keep-awake mode is enabled, the system remains in the wake state
    and wakeup_end events are not triggered automatically.
    This is useful for scenarios where the system should stay active
    without an automatic timeout.
Parameters:
  • handle – The handle to the AFE element

  • enable – True to enable keep-awake mode, false to disable

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid argument

  • ESP_GMF_ERR_INVALID_STATE Config not exist

  • ESP_GMF_ERR_TIMEOUT Command send timeout

esp_gmf_err_t esp_gmf_afe_trigger_wakeup(esp_gmf_element_handle_t handle)

Manually trigger wakeup state.

    This function allows manual activation of the wakeup state without waiting
    for automatic wakeword detection. It is useful in the following scenarios:

    1. Button-triggered activation: When users press a physical button to
       activate voice interaction, bypassing the need for wakewords.
    2. External event-driven activation: When the system needs to enter
       wakeup state based on external triggers (sensors, timers, network events).

    After calling this function, the AFE will enter wakeup state and begin
    listening for voice commands (if voice command detection is enabled).
    The system will generate ESP_GMF_AFE_EVT_WAKEUP_START event and remain
    active according to the configured wakeup_time duration.
Parameters:

handle[in] Handle to the GMF object

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid argument

  • ESP_GMF_ERR_INVALID_STATE Element not opened

  • ESP_GMF_ERR_TIMEOUT Command send timeout

esp_gmf_err_t esp_gmf_afe_trigger_sleep(esp_gmf_element_handle_t handle)

Manually end the wakeup state and put the AFE back to sleep.

Parameters:

handle[in] Handle to the GMF object

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid argument

  • ESP_GMF_ERR_INVALID_STATE Element not opened

  • ESP_GMF_ERR_TIMEOUT Command send timeout

Structures

struct esp_gmf_afe_wakeup_info_t

Information reported when the wakeup state is detected; event data for ESP_GMF_AFE_EVT_WAKEUP_START.

Public Members

float data_volume

Volume of the input audio, in decibels (dB)

int wake_word_index

Wake word index, starting from 1

int wakenet_model_index

WakeNet model index, starting from 1

struct esp_gmf_afe_vcmd_info_t

Information reported when a voice command is detected; event data for ESP_GMF_AFE_EVT_VCMD_DECTECTED.

Public Members

int phrase_id

Phrase ID

float prob

Probability of the detected command

char str[ESP_GMF_AFE_VCMD_MAX_LEN]

Command string

struct esp_gmf_afe_evt_t

Event structure for GMF AFE.

Public Members

esp_gmf_afe_event_t type

Event type

void *event_data

Event data

size_t data_len

Length of event data

struct esp_gmf_afe_cfg_t

Configuration structure for GMF AFE wrapper.

Public Members

esp_gmf_afe_manager_handle_t afe_manager

AFE Manager handle

uint32_t delay_samples

Number of samples to delay. Note: if the AFE output is used only after the VAD start event is detected, the duration corresponding to this value should be no less than the vad_min_speech_ms in the afe_config_t used to create the afe_manager; otherwise, a small portion of the data at the beginning of the speech may be lost.
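The constraint on delay_samples can be checked with a little arithmetic. The following is a minimal sketch, assuming the 16 kHz sampling rate required by the AFE manager; the helper name min_delay_samples and the macro AFE_SAMPLE_RATE_HZ are hypothetical and not part of the component.

```c
#include <stdint.h>

/* Sampling rate required for data fed into the AFE (16 kHz). */
#define AFE_SAMPLE_RATE_HZ 16000u

/* Hypothetical helper (not part of gmf_ai_audio): the smallest
 * delay_samples whose duration is at least vad_min_speech_ms.
 * The division rounds up so the delay never falls short. */
uint32_t min_delay_samples(uint32_t vad_min_speech_ms)
{
    uint64_t scaled = (uint64_t)vad_min_speech_ms * AFE_SAMPLE_RATE_HZ;
    return (uint32_t)((scaled + 999u) / 1000u);
}
```

For instance, a vad_min_speech_ms of 128 corresponds to a delay_samples of at least 2048.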

void *models

List of models

uint32_t wakeup_time

Unit: ms. How long the wakeup state is kept when VAD is not triggered

uint32_t wakeup_end

Unit: ms. When the silence after the AUDIO_REC_VAD_END state lasts longer than this value, the state is determined to be AUDIO_REC_WAKEUP_END

bool vcmd_detect_en

Enable voice command detection

uint32_t vcmd_timeout

Unit: ms. Timeout for voice command detection

const char *mn_language

Language for the MultiNet model: cn or en

esp_gmf_afe_event_cb_t event_cb

Callback function for AI audio events

void *event_ctx

User context to be passed to the callback function

Macros

ESP_GMF_AFE_VCMD_MAX_LEN
ESP_GMF_AFE_DEFAULT_DELAY_SAMPLES
ESP_GMF_AFE_DEFAULT_WAKEUP_TIME_MS
ESP_GMF_AFE_DEFAULT_WAKEUP_END_MS
ESP_GMF_AFE_DEFAULT_VCMD_TIMEOUT_MS
DEFAULT_GMF_AFE_CFG(__afe_manager, __event_cb, __event_ctx, __models)

Type Definitions

typedef void (*esp_gmf_afe_event_cb_t)(esp_gmf_element_handle_t el, esp_gmf_afe_evt_t *event, void *user_data)

Callback type for GMF AFE events.

Enumerations

enum esp_gmf_afe_event_t

AFE manager event type.

Values:

enumerator ESP_GMF_AFE_EVT_WAKEUP_START

Wakeup start

enumerator ESP_GMF_AFE_EVT_WAKEUP_END

Wakeup end

enumerator ESP_GMF_AFE_EVT_VAD_START

VAD start

enumerator ESP_GMF_AFE_EVT_VAD_END

VAD end

enumerator ESP_GMF_AFE_EVT_VCMD_DECT_TIMEOUT

Voice command detect timeout

enumerator ESP_GMF_AFE_EVT_VCMD_DECTECTED

Values from 0 upward are the IDs of the voice commands detected by MultiNet
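A callback of type esp_gmf_afe_event_cb_t typically dispatches on the event type and casts event_data according to that type. The following is a minimal sketch rather than the component's own code. The declarations at the top are local stand-ins that mirror the shapes documented on this page so that the sketch compiles on its own; in a real project they come from the component header. The numeric enum values and the value of ESP_GMF_AFE_VCMD_MAX_LEN are placeholders, and app_state_t and on_afe_event are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>

/* ---- Local stand-ins mirroring the documented types. ----
 * The numeric values below are placeholders, not the library's values. */
#define ESP_GMF_AFE_VCMD_MAX_LEN 32
typedef void *esp_gmf_element_handle_t;
typedef enum {
    ESP_GMF_AFE_EVT_WAKEUP_START      = -5,
    ESP_GMF_AFE_EVT_WAKEUP_END        = -4,
    ESP_GMF_AFE_EVT_VAD_START         = -3,
    ESP_GMF_AFE_EVT_VAD_END           = -2,
    ESP_GMF_AFE_EVT_VCMD_DECT_TIMEOUT = -1,
    ESP_GMF_AFE_EVT_VCMD_DECTECTED    = 0,
} esp_gmf_afe_event_t;
typedef struct {
    float data_volume;
    int   wake_word_index;
    int   wakenet_model_index;
} esp_gmf_afe_wakeup_info_t;
typedef struct {
    int   phrase_id;
    float prob;
    char  str[ESP_GMF_AFE_VCMD_MAX_LEN];
} esp_gmf_afe_vcmd_info_t;
typedef struct {
    esp_gmf_afe_event_t type;
    void               *event_data;
    size_t              data_len;
} esp_gmf_afe_evt_t;

/* Hypothetical application state updated by the callback. */
typedef struct {
    int awake;          /* 1 between wakeup start and wakeup end */
    int speaking;       /* 1 between VAD start and VAD end */
    int last_phrase_id; /* -1 until a command is detected */
} app_state_t;

/* Matches the documented esp_gmf_afe_event_cb_t signature. */
void on_afe_event(esp_gmf_element_handle_t el, esp_gmf_afe_evt_t *event, void *user_data)
{
    (void)el;
    app_state_t *st = user_data;
    switch (event->type) {
        case ESP_GMF_AFE_EVT_WAKEUP_START:
            st->awake = 1;
            /* event_data is checked before use in case no info is attached. */
            if (event->event_data != NULL && event->data_len >= sizeof(esp_gmf_afe_wakeup_info_t)) {
                const esp_gmf_afe_wakeup_info_t *info = event->event_data;
                printf("wake word %d (model %d), volume %.1f dB\n",
                       info->wake_word_index, info->wakenet_model_index, info->data_volume);
            }
            break;
        case ESP_GMF_AFE_EVT_WAKEUP_END:
            st->awake = 0;
            break;
        case ESP_GMF_AFE_EVT_VAD_START:
            st->speaking = 1;
            break;
        case ESP_GMF_AFE_EVT_VAD_END:
            st->speaking = 0;
            break;
        case ESP_GMF_AFE_EVT_VCMD_DECT_TIMEOUT:
            st->last_phrase_id = -1;
            break;
        default:
            /* Types from 0 upward are treated as detected voice commands. */
            if (event->type >= ESP_GMF_AFE_EVT_VCMD_DECTECTED && event->event_data != NULL
                && event->data_len >= sizeof(esp_gmf_afe_vcmd_info_t)) {
                const esp_gmf_afe_vcmd_info_t *cmd = event->event_data;
                st->last_phrase_id = cmd->phrase_id;
                printf("command %d \"%s\" (prob %.2f)\n", cmd->phrase_id, cmd->str, cmd->prob);
            }
            break;
    }
}
```

Such a callback would be registered either through the __event_cb and __event_ctx arguments of DEFAULT_GMF_AFE_CFG or later with esp_gmf_afe_set_event_cb. Keeping the handler limited to state updates and short logging avoids stalling whichever task invokes it.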

Header File

Functions

esp_gmf_err_t esp_gmf_aec_init(esp_gmf_aec_cfg_t *cfg, esp_gmf_obj_handle_t *out_handle)

Initialize the Espressif AEC element.

Parameters:
  • cfg[in] Pointer to the configuration structure

  • out_handle[out] Pointer to the handle to be created

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_MEMORY_LACK Memory allocation failed

  • ESP_GMF_ERR_INVALID_ARG Invalid argument

Structures

struct esp_gmf_aec_cfg_t

Configuration structure for AEC.

Note

The input format is the same as in the AFE config: M represents a microphone channel, R represents the playback reference channel, and N represents an unknown or unused channel. For example, input_format="MMNR" indicates that the input data consists of four channels: a microphone channel, a second microphone channel, an unused channel, and the playback reference channel.

Public Members

uint8_t filter_len

Length of the filter. The longer the filter, the higher the CPU load. Recommended value: 4 for esp32s3 and esp32p4; 2 for esp32c5.

afe_type_t type

AFE type

afe_mode_t mode

AFE mode

char *input_format

Input format
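Because input_format encodes one channel per character, an application can validate the string and derive the channel counts before filling in the configuration. The following is a minimal sketch; channel_layout_t, parse_input_format, and bytes_per_sample_period are hypothetical helpers that are not part of the component, and the size calculation assumes 16-bit samples.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type (not part of gmf_ai_audio). */
typedef struct {
    int mic;     /* number of 'M' channels */
    int ref;     /* number of 'R' channels */
    int unused;  /* number of 'N' channels */
} channel_layout_t;

/* Count the channel kinds in an input_format string such as "MMNR".
 * Returns 0 on success, or -1 if an argument is NULL, the string is
 * empty, or it contains a character other than 'M', 'R', or 'N'. */
int parse_input_format(const char *fmt, channel_layout_t *out)
{
    if (fmt == NULL || out == NULL || fmt[0] == '\0') {
        return -1;
    }
    channel_layout_t layout = {0, 0, 0};
    for (const char *p = fmt; *p != '\0'; ++p) {
        switch (*p) {
            case 'M': layout.mic++;    break;
            case 'R': layout.ref++;    break;
            case 'N': layout.unused++; break;
            default:  return -1;
        }
    }
    *out = layout;
    return 0;
}

/* Bytes occupied by one 16-bit sample from every channel. */
size_t bytes_per_sample_period(const channel_layout_t *layout)
{
    return (size_t)(layout->mic + layout->ref + layout->unused) * sizeof(int16_t);
}
```

For "MMNR" this yields two microphone channels, one reference channel, one unused channel, and 8 bytes per sample period.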

Header File

Functions

esp_gmf_err_t esp_gmf_wn_init(esp_gmf_wn_cfg_t *config, esp_gmf_element_handle_t *handle)

Initialize the WakeNet element.

Parameters:
  • config[in] Pointer to the configuration structure

  • handle[out] Pointer to the handle to be initialized

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid argument

  • ESP_GMF_ERR_MEMORY_LACK Memory allocation failed

  • ESP_GMF_ERR_FAIL Other failures

esp_gmf_err_t esp_gmf_wn_set_detect_cb(esp_gmf_element_handle_t handle, esp_wn_detect_cb_t detect_cb, void *ctx)

Set the voice trigger detection callback for WakeNet. This function registers a user-defined callback that is invoked when WakeNet detects a wake word.

Parameters:
  • handle[in] Handle to the WakeNet element

  • detect_cb[in] Callback function to be called on wake word detection

  • ctx[in] User-defined context to be passed to the callback

Returns:

  • ESP_GMF_ERR_OK Success

  • ESP_GMF_ERR_INVALID_ARG Invalid argument

Structures

struct esp_gmf_wn_cfg_t

Configuration structure for WakeNet.

Note

The input format is the same as in the AFE config: M represents a microphone channel, R represents the playback reference channel, and N represents an unknown or unused channel. For example, input_format="MMNR" indicates that the input data consists of four channels: a microphone channel, a second microphone channel, an unused channel, and the playback reference channel.

Public Members

srmodel_list_t *models

Model list containing wake word models

det_mode_t det_mode

Detection mode

char *input_format

Input format

esp_wn_detect_cb_t detect_cb

Detection callback function

void *user_ctx

User context to be passed to the callback function

Type Definitions

typedef void (*esp_wn_detect_cb_t)(esp_gmf_element_handle_t handle, int32_t trigger_ch, void *user_ctx)

Callback type for WakeNet detection.

Param handle:

[in] Handle to the WakeNet object

Param trigger_ch:

[in] The microphone channel that triggered the detection

Param user_ctx:

[in] User context passed during initialization
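Since ai_wn runs detection synchronously in the element process, a detection callback is best kept short. The following is a minimal sketch of a callback that only records which channel triggered. The two typedefs at the top are local stand-ins that mirror the documented declarations so that the sketch compiles on its own; wake_stats_t, MAX_MIC_CH, and on_wake_word are hypothetical names, and the range check is there because this page does not state how trigger_ch is numbered.

```c
#include <stddef.h>
#include <stdint.h>

/* ---- Local stand-ins mirroring the documented declarations. ---- */
typedef void *esp_gmf_element_handle_t;
typedef void (*esp_wn_detect_cb_t)(esp_gmf_element_handle_t handle, int32_t trigger_ch, void *user_ctx);

/* Hypothetical upper bound on tracked microphone channels. */
#define MAX_MIC_CH 4

/* Hypothetical per-application statistics passed as user_ctx. */
typedef struct {
    uint32_t hits[MAX_MIC_CH]; /* detections counted per channel */
    int32_t  last_ch;          /* channel of the latest detection */
} wake_stats_t;

/* Matches the documented esp_wn_detect_cb_t signature. */
void on_wake_word(esp_gmf_element_handle_t handle, int32_t trigger_ch, void *user_ctx)
{
    (void)handle;
    wake_stats_t *stats = user_ctx;
    if (stats == NULL) {
        return;
    }
    stats->last_ch = trigger_ch;
    /* Count only channels that fit in the table. */
    if (trigger_ch >= 0 && trigger_ch < MAX_MIC_CH) {
        stats->hits[trigger_ch]++;
    }
}
```

The callback and its context would be supplied through the detect_cb and user_ctx members of esp_gmf_wn_cfg_t, or later with esp_gmf_wn_set_detect_cb.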

Header File

Macros

ESP_GMF_METHOD_AFE_START_VCMD_DET
ESP_GMF_METHOD_AFE_START_VCMD_DET_ARG_EN
