Audio Front-end Framework


Overview

Any voice-enabled product needs to perform well in a noisy environment, and audio front-end (AFE) algorithms play an important role in building a sensitive voice-user interface (VUI). Espressif’s AI Lab has created a set of audio front-end algorithms that can offer this functionality. Customers can use these algorithms with Espressif’s powerful ESP32-S3 series of chips, in order to build high-performance, yet low-cost, products with a voice-user interface.

The AFE framework integrates the following algorithms:

  • AEC (Acoustic Echo Cancellation): Supports processing of up to two mic channels. It effectively removes the echo from the mic input signal, which helps further speech recognition.

  • NS (Noise Suppression): Supports single-channel processing and suppresses non-human noise in single-channel audio; it is especially effective for stationary noise.

  • BSS (Blind Source Separation): Supports dual-channel processing. It separates the target sound source from the remaining interference so that the useful audio signal is extracted and the quality of the subsequent speech processing is ensured.

  • MISO (Multi Input Single Output): Supports dual-channel input and single-channel output. It selects the audio channel with the higher signal-to-noise ratio for output when WakeNet is not enabled in a dual-mic scenario.

  • VAD (Voice Activity Detection): Outputs the voice activity state of the current frame in real time.

  • AGC (Automatic Gain Control): Dynamically adjusts the amplitude of the output audio: it amplifies the output when the input signal is weak and compresses the output when the input signal reaches a certain strength.

  • WakeNet: A wake word engine built upon neural networks and specially designed for low-power embedded MCUs.

Usage Scenarios

This section introduces two typical usage scenarios of the Espressif AFE framework.

Speech Recognition

Workflow

[Figure: speech recognition workflow overview]

Data Flow

[Figure: speech recognition data flow overview]

  1. Use ESP_AFE_SR_HANDLE to create and initialize the AFE. Note that voice_communication_init must be configured as false.

  2. Use feed() to input audio data. The AEC algorithm is performed inside feed() first.

  3. The BSS/NS algorithms are then also performed inside feed().

  4. Use fetch() to obtain the processed single-channel audio data and related information. Note that VAD processing and wake word detection are performed inside fetch(); the specific behavior depends on the configuration of the afe_config_t structure.
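The sketch below shows a minimal initialization for this scenario. It is only an illustration: the header names and the wake model name are assumptions and may differ between ESP-SR versions; the feed/fetch loop itself is shown in Section Programming Procedures.

#include "esp_afe_sr_iface.h"
#include "esp_afe_sr_models.h"

static esp_afe_sr_iface_t *afe_handle = NULL;
static esp_afe_sr_data_t *afe_data = NULL;

void afe_speech_recognition_init(void)
{
    afe_handle = &ESP_AFE_SR_HANDLE;                    // speech recognition handle
    afe_config_t afe_config = AFE_CONFIG_DEFAULT();
    afe_config.wakenet_init = true;                     // enable WakeNet
    afe_config.voice_communication_init = false;        // must be false in this scenario
    afe_config.wakenet_model_name = "wn9_hilexin";      // example wake model name, see Configure AFE
    afe_data = afe_handle->create_from_config(&afe_config);
}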

Voice Communication

Workflow

[Figure: voice communication workflow overview]

Data Flow

[Figure: voice communication data flow overview]

  1. Use ESP_AFE_VC_HANDLE to create and initialize the AFE. Note that voice_communication_init must be configured as true.

  2. Use feed() to input audio data. The AEC algorithm is performed inside feed() first.

  3. The BSS/NS algorithms are then also performed inside feed(). For a dual mic setup, the MISO algorithm is performed additionally.

  4. Use fetch() to obtain the processed single-channel audio data and related information. AGC processing is carried out, and the specific gain depends on the configuration of the afe_config_t structure. For a dual mic setup, NS processing is carried out before AGC.
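Similarly, a minimal initialization sketch for the voice communication scenario (again only an illustration; exact header names may vary between ESP-SR versions):

#include "esp_afe_sr_iface.h"
#include "esp_afe_sr_models.h"

static esp_afe_sr_iface_t *afe_handle = NULL;
static esp_afe_sr_data_t *afe_data = NULL;

void afe_voice_communication_init(void)
{
    afe_handle = &ESP_AFE_VC_HANDLE;                    // voice communication handle
    afe_config_t afe_config = AFE_CONFIG_DEFAULT();
    afe_config.wakenet_init = false;                    // cannot be true together with voice_communication_init
    afe_config.voice_communication_init = true;         // must be true in this scenario
    afe_config.voice_communication_agc_init = true;     // optionally apply AGC to the output
    afe_data = afe_handle->create_from_config(&afe_config);
}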

Note

  1. wakenet_init and voice_communication_init in afe_config_t cannot both be configured to true at the same time.

  2. feed() and fetch() are visible to users, while other AFE internal tasks such as BSS/NS/MISO are not visible to users.

  3. The AEC algorithm is performed in feed().

  4. When aec_init is configured to false, the BSS/NS algorithms are performed in feed().

Select AFE Handle

Espressif AFE supports both single mic and dual mic setups, and allows flexible combinations of algorithms.

  • Single mic
    • The internal task is performed by the NS algorithm

  • Dual mic
    • The internal task is performed by the BSS algorithm

    • An additional internal task is performed by the MISO algorithm in the voice communication scenario (i.e., wakenet_init = false and voice_communication_init = true)

To obtain the AFE handle, use the code below:

  • Speech recognition

    esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;
    
  • Voice communication

    esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;
    

Input Audio Data

Currently, the Espressif AFE framework supports both single mic and dual mic setups. Users can configure the number of channels of the input audio fed through feed() (see esp_afe_sr_iface_op_feed_t()).

To be specific, users can configure the pcm_config in AFE_CONFIG_DEFAULT():

  • total_ch_num : total number of channels

  • mic_num : number of mic channels

  • ref_num : number of REF channels

When configuring, note the following requirements:

  1. total_ch_num = mic_num + ref_num

  2. ref_num = 0 or ref_num = 1 (because AEC currently supports at most one channel of reference data)

The supported configurations are:

total_ch_num=1, mic_num=1, ref_num=0
total_ch_num=2, mic_num=1, ref_num=1
total_ch_num=2, mic_num=2, ref_num=0
total_ch_num=3, mic_num=2, ref_num=1
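For example, a single mic plus one reference channel could be configured as sketched below. The field names are those of AFE_CONFIG_DEFAULT(); DET_MODE_90 as the single-channel wake mode is an assumption and should be checked against the installed ESP-SR version.

afe_config_t afe_config = AFE_CONFIG_DEFAULT();
afe_config.pcm_config.total_ch_num = 2;   // total_ch_num = mic_num + ref_num
afe_config.pcm_config.mic_num = 1;        // one microphone channel
afe_config.pcm_config.ref_num = 1;        // at most one reference channel is supported
afe_config.wakenet_mode = DET_MODE_90;    // single-channel wake mode to match mic_num = 1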

AFE Single Mic

  • Input audio data format: 16 kHz, 16 bit, two channels (one is mic data, the other is REF data). Note that if AEC is not required, there is no need for reference data, so users can configure only one channel of mic data and set ref_num to 0.

  • The input data frame length varies with the algorithm modules configured by the user. Users can use get_feed_chunksize() to get the number of sampling points (the data type of the sampling points is int16).

The input data is arranged as follows:

[Figure: input data of single mic (mic data and REF data channel-interleaved, with the REF channel last)]

AFE Dual Mic

  • Input audio data format: 16 kHz, 16 bit, three channels (two are mic data, the third is REF data). Note that if AEC is not required, there is no need for reference data, so users can configure only two channels of mic data and set ref_num to 0.

  • The input data frame length varies with the algorithm modules configured by the user. Users can use get_feed_chunksize() to obtain the data size required (i.e., get_feed_chunksize() * total_ch_num * sizeof(short)).

The input data is arranged as follows:

[Figure: input data of dual mic (mic 1, mic 2, and REF data channel-interleaved, with the REF channel last)]
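For illustration, the sketch below fills a feed buffer for the dual mic plus reference case (total_ch_num = 3) from three hypothetical per-channel sample arrays; in practice the codec or I2S driver usually delivers the data already interleaved.

#include <stdint.h>

// Interleave two mic channels and one reference channel into the feed buffer.
// Per-frame layout: mic1, mic2, ref, mic1, mic2, ref, ...
// (the reference channel is always the last one, see the feed() description below)
void fill_feed_buffer(int16_t *feed_buff, const int16_t *mic1,
                      const int16_t *mic2, const int16_t *ref, int chunksize)
{
    for (int i = 0; i < chunksize; i++) {
        feed_buff[3 * i + 0] = mic1[i];
        feed_buff[3 * i + 1] = mic2[i];
        feed_buff[3 * i + 2] = ref[i];
    }
}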

Output Audio

The output audio of AFE is single-channel data.

  • In the speech recognition scenario, AFE outputs the single-channel data containing the human voice when WakeNet is enabled.

  • In the voice communication scenario, AFE outputs the single-channel data with the higher signal-to-noise ratio.

Enable Wake Word Engine WakeNet

When performing AFE audio front-end processing, users can choose whether to enable the wake word engine WakeNet so that the chip can be woken up via wake words.

After wake-up, users can disable WakeNet to reduce CPU resource consumption while performing other operations, such as offline or online speech recognition. To do so, call disable_wakenet() to enter bypass mode.

Users can also call enable_wakenet() to enable WakeNet again later whenever needed.

ESP32-S3 allows users to switch among different wake words. After the initialization of AFE, ESP32-S3 allows users to change the wake word by calling set_wakenet(). For example, use set_wakenet(afe_data, "wn9_hilexin") to use “Hi Lexin” as the wake word. For details on how to configure more than one wake word, see Section flash_model.
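For example (assuming afe_handle and afe_data have been created as described in Section Programming Procedures):

// Enter bypass mode after wake-up to free CPU for speech recognition
afe_handle->disable_wakenet(afe_data);

// ... run offline or online speech recognition here ...

// Listen for the wake word again
afe_handle->enable_wakenet(afe_data);

// Switch to another wake word model that has been flashed
afe_handle->set_wakenet(afe_data, "wn9_hilexin");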

Enable Acoustic Echo Cancellation (AEC)

The usage of AEC is similar to that of WakeNet. Users can disable or enable AEC according to requirements.

  • Disable AEC

    afe->disable_aec(afe_data);

  • Enable AEC

    afe->enable_aec(afe_data);

Programming Procedures

Define afe_handle

afe_handle is the function handle through which users call the AFE interfaces. Therefore, the first step is to obtain afe_handle.

  • Speech recognition

    esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;
    
  • Voice communication

    esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;
    

Configure AFE

Get the configuration of AFE:

afe_config_t afe_config = AFE_CONFIG_DEFAULT();

Users can further configure the corresponding parameters in afe_config:

#define AFE_CONFIG_DEFAULT() { \
    .aec_init = true,                           /* Whether to enable AEC */ \
    .se_init = true,                            /* Whether to enable BSS/NS */ \
    .vad_init = true,                           /* Whether to enable VAD (only for speech recognition) */ \
    .wakenet_init = true,                       /* Whether to enable WakeNet */ \
    .voice_communication_init = false,          /* Whether to enable voice communication (cannot be true together with wakenet_init) */ \
    .voice_communication_agc_init = false,      /* Whether to enable AGC for voice communication */ \
    .voice_communication_agc_gain = 15,         /* AGC gain (unit: dB) */ \
    .vad_mode = VAD_MODE_3,                     /* VAD mode (the larger the number, the more aggressive VAD is) */ \
    .wakenet_model_name = NULL,                 /* Wake model name. See details below. */ \
    .wakenet_mode = DET_MODE_2CH_90,            /* Wake mode (corresponds to the wakeup channels; configure it based on the number of mic channels) */ \
    .afe_mode = SR_MODE_LOW_COST,               /* AFE mode (SR_MODE_LOW_COST or SR_MODE_HIGH_PERF) */ \
    .afe_perferred_core = 0,                    /* CPU core on which the internal BSS/NS/MISO algorithms of AFE run */ \
    .afe_perferred_priority = 5,                /* Priority of the internal BSS/NS/MISO tasks */ \
    .afe_ringbuf_size = 50,                     /* Internal ringbuffer size */ \
    .memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM,  /* Memory allocation mode. See details below. */ \
    .agc_mode = AFE_MN_PEAK_AGC_MODE_2,         /* Linear audio amplification level. See details below. */ \
    .pcm_config.total_ch_num = 3,               /* Total number of audio channels */ \
    .pcm_config.mic_num = 2,                    /* Number of microphone channels */ \
    .pcm_config.ref_num = 1,                    /* Number of reference channels */ \
}
  • wakenet_model_name: configures the wake model. The default value in AFE_CONFIG_DEFAULT() is NULL. Note:
    • After selecting the wake model via idf.py menuconfig, configure this parameter to the name (a string) of the selected wake model before calling create_from_config. For more information about wake models, go to Section flash_model.

    • esp_srmodel_filter() can be used to obtain the model name. However, if more than one model is configured via idf.py menuconfig, this function returns one of the configured models at random.

  • afe_mode: configures the AFE mode.

    • SR_MODE_LOW_COST : quantized, which uses fewer resources

    • SR_MODE_HIGH_PERF : unquantized, which uses more resources

    For details, see afe_sr_mode_t .

  • memory_alloc_mode: configures how the memory is allocated
    • AFE_MEMORY_ALLOC_MORE_INTERNAL : allocate most memory from internal RAM

    • AFE_MEMORY_ALLOC_INTERNAL_PSRAM_BALANCE : allocate some memory from internal RAM

    • AFE_MEMORY_ALLOC_MORE_PSRAM : allocate most memory from external PSRAM

  • agc_mode: configures the peak AGC mode. Note that this parameter is only for speech recognition scenarios and is only valid when WakeNet is enabled:
    • AFE_MN_PEAK_AGC_MODE_1 : feed linearly amplified audio signals to MultiNet, peak is -5 dB.

    • AFE_MN_PEAK_AGC_MODE_2 : feed linearly amplified audio signals to MultiNet, peak is -4 dB.

    • AFE_MN_PEAK_AGC_MODE_3 : feed linearly amplified audio signals to MultiNet, peak is -3 dB.

    • AFE_MN_PEAK_NO_AGC : feed original audio signals to MultiNet.

  • pcm_config: configures the audio signals fed through feed():
    • total_ch_num : total number of channels

    • mic_num : number of mic channels

    • ref_num : number of REF channels

    There are some limitations when configuring these parameters. For details, see Section Input Audio Data .
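As an illustration of setting wakenet_model_name at run time, the sketch below uses esp_srmodel_init()/esp_srmodel_filter() from the model management API (see Section flash_model). The partition label "model" and the ESP_WN_PREFIX macro are assumptions based on typical ESP-SR projects and should be verified against the installed version.

#include "model_path.h"   // esp_srmodel_init(), esp_srmodel_filter(), ESP_WN_PREFIX

afe_config_t afe_config = AFE_CONFIG_DEFAULT();

// Load the model list from the "model" partition and pick the first wake word model found
srmodel_list_t *models = esp_srmodel_init("model");
afe_config.wakenet_model_name = esp_srmodel_filter(models, ESP_WN_PREFIX, NULL);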

Create afe_data

Users call the esp_afe_sr_iface_op_create_from_config_t() function (i.e., afe_handle->create_from_config()) to create the AFE data handle based on the parameters configured in the previous steps.

/**
* @brief Function to initialize an AFE_SR instance
*
* @param afe_config        The config of AFE_SR
* @returns Handle to the AFE_SR data
*/
typedef esp_afe_sr_data_t* (*esp_afe_sr_iface_op_create_from_config_t)(afe_config_t *afe_config);
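For example (continuing from the configuration above):

esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(&afe_config);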

Feed Audio Data

After initializing AFE, users need to input audio data into AFE by calling the feed() function for processing. The format of the input audio data can be found in Section Input Audio Data .

/**
* @brief Feed samples of an audio stream to the AFE_SR
*
* @Warning  The input data should be arranged in the format of channel interleaving.
*           The last channel is reference signal if it has reference data.
*
* @param afe   The AFE_SR object to query
*
* @param in    The input microphone signal, only support signed 16-bit @ 16 KHZ. The frame size can be queried by the
*              `get_feed_chunksize`.
* @return      The size of input
*/
typedef int (*esp_afe_sr_iface_op_feed_t)(esp_afe_sr_data_t *afe, const int16_t* in);
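A feed task might look like the sketch below; get_audio_data() stands in for whatever driver call (e.g., an I2S read) supplies the interleaved input and is an assumption here.

#include <stdint.h>
#include <stdlib.h>
#include "esp_afe_sr_iface.h"

extern esp_afe_sr_iface_t *afe_handle;
extern esp_afe_sr_data_t *afe_data;

// Hypothetical audio source that fills `buff` with `samples` interleaved int16 samples
extern void get_audio_data(int16_t *buff, int samples);

void feed_task(void *arg)
{
    int chunksize = afe_handle->get_feed_chunksize(afe_data);
    int ch_num = afe_handle->get_total_channel_num(afe_data);
    int16_t *feed_buff = malloc(chunksize * ch_num * sizeof(int16_t));

    while (1) {
        get_audio_data(feed_buff, chunksize * ch_num);   // channel-interleaved, REF channel last
        afe_handle->feed(afe_data, feed_buff);           // AEC and BSS/NS run inside feed()
    }
}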

Get the number of audio channels

The get_total_channel_num() function provides the total number of channels that need to be fed into the feed() function. Its return value is equal to pcm_config.mic_num + pcm_config.ref_num configured in AFE_CONFIG_DEFAULT().

/**
* @brief Get the total channel number which is configured
*
* @param afe   The AFE_SR object to query
* @return      The amount of total channels
*/
typedef int (*esp_afe_sr_iface_op_get_total_channel_num_t)(esp_afe_sr_data_t *afe);

Fetch Audio Data

Users can get the processed single-channel audio and related information by calling the fetch() function.

The number of data sampling points of fetch() (the data type of the sampling points is int16) can be obtained by get_fetch_chunksize().

/**
* @brief Get the amount of each channel samples per frame that need to be passed to the function
*
* Every speech enhancement AFE_SR processes a certain number of samples at the same time. This function
* can be used to query that amount. Note that the returned amount is in 16-bit samples, not in bytes.
*
* @param afe The AFE_SR object to query
* @return The amount of samples to feed the fetch function
*/
typedef int (*esp_afe_sr_iface_op_get_samp_chunksize_t)(esp_afe_sr_data_t *afe);

The declaration of fetch():

/**
* @brief fetch enhanced samples of an audio stream from the AFE_SR
*
* @Warning  The output is single channel data, no matter how many channels the input is.
*
* @param afe   The AFE_SR object to query
* @return      The result of output, please refer to the definition of `afe_fetch_result_t`. (The frame size of output audio can be queried by the `get_fetch_chunksize`.)
*/
typedef afe_fetch_result_t* (*esp_afe_sr_iface_op_fetch_t)(esp_afe_sr_data_t *afe);

Its return value is a pointer to a structure, which is defined as follows:

/**
* @brief The result of fetch function
*/
typedef struct afe_fetch_result_t
{
    int16_t *data;                          // the audio data
    int data_size;                          // the size of data, in bytes
    int wakeup_state;                       // the value is wakenet_state_t
    int wake_word_index;                    // if a wake word is detected, stores the index of the wake word, starting from 1
    int vad_state;                          // the value is afe_vad_state_t
    int trigger_channel_id;                 // the channel index of the output
    int wake_word_length;                   // the length of the wake word, in number of samples
    int ret_value;                          // the return state of the fetch function
    void *reserved;                         // reserved for future use
} afe_fetch_result_t;
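A fetch task could consume these results as sketched below (checking wakeup_state against the WAKENET_DETECTED value of wakenet_state_t); this is only an outline of the typical usage.

#include "esp_afe_sr_iface.h"

extern esp_afe_sr_iface_t *afe_handle;
extern esp_afe_sr_data_t *afe_data;

void fetch_task(void *arg)
{
    while (1) {
        afe_fetch_result_t *res = afe_handle->fetch(afe_data);
        if (res == NULL) {
            continue;
        }
        if (res->wakeup_state == WAKENET_DETECTED) {
            // Wake word detected: res->data holds the enhanced single-channel audio,
            // res->data_size is its size in bytes.
        }
    }
}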

Resource Occupancy

For the resource occupancy of this model, see Resource Occupancy.