Voice Activity Detection (VAD)
Introduction
The Voice Activity Detection (VAD) module implements in hardware the first-stage algorithm for voice wake-up and other multimedia functions.
It also provides hardware support for low-power voice wake-up solutions.
For LP I2S documentation, see Low Power Inter-IC Sound.
Hardware State Machine
The LP VAD driver provides the structure lp_vad_config_t to configure the LP VAD module:
- lp_vad_config_t::init_frame_num: the number of initial frames that the VAD uses for denoising. This helps reduce the accidental trigger ratio. Note that values that are too large may cause voice activity to be missed.
- lp_vad_config_t::min_energy_thresh: the minimum energy threshold. Voice activity with energy higher than this value is detected.
- lp_vad_config_t::skip_band_energy_thresh: whether to skip the passband energy check. The passband energy check determines whether the proportion of passband energy within the total frequency domain meets the required threshold. Note that, depending on the environment, enabling the passband energy check may reduce the false trigger rate but may also increase the rate of missed detections.
- lp_vad_config_t::speak_activity_thresh: in speak-activity-listening-state, if the number of detected speak activities is higher than this value, the VAD enters speak-activity-detected-state.
- lp_vad_config_t::non_speak_activity_thresh: in speak-activity-detected-state, if the number of detected speak activities is higher than lp_vad_config_t::min_speak_activity_thresh but lower than lp_vad_config_t::max_speak_activity_thresh:
  - if the number of detected non-speak activities is higher than this value, the VAD returns to speak-activity-listening-state.
  - if the number of detected non-speak activities is lower than this value, the VAD stays in speak-activity-detected-state.
- lp_vad_config_t::min_speak_activity_thresh: in speak-activity-detected-state, if the number of detected speak activities is higher than this value but lower than lp_vad_config_t::max_speak_activity_thresh, the next state depends on the value of lp_vad_config_t::non_speak_activity_thresh.
- lp_vad_config_t::max_speak_activity_thresh: in speak-activity-detected-state, if the number of detected speak activities is higher than this value, the VAD returns to speak-activity-listening-state.
The configurations above determine the transitions of the VAD state machine shown below:
                      ┌──────────────────────────────────┐
                      │                                  │
        ┌─────────────┤  speak-activity-listening-state  │ ◄───────────────┐
        │             │                                  │                 │
        │             └──────────────────────────────────┘                 │
        │                         ▲                                        │
        │                         │                                        │
detected speak activity           │ detected speak activity                │ detected speak activity
          >=                      │           >=                           │           >=
'speak_activity_thresh'           │ 'min_speak_activity_thresh'            │ 'max_speak_activity_thresh'
        │                         │                                        │
        │                         │           &&                           │
        │                         │                                        │
        │                         │ detected non-speak activity            │
        │                         │           >=                           │
        │                         │ 'non_speak_activity_thresh'            │
        │                         │                                        │
        │             ┌───────────┴─────────────────────┐                  │
        │             │                                 │                  │
        └───────────► │  speak-activity-detected-state  ├──────────────────┘
                      │                                 │
                      └─┬───────────────────────────────┘
                        │                     ▲
                        │                     │
                        │                     │ detected speak activity
                        │                     │           >=
                        │                     │ 'min_speak_activity_thresh'
                        │                     │
                        │                     │           &&
                        │                     │
                        │                     │ detected non-speak activity
                        │                     │           <
                        └─────────────────────┘ 'non_speak_activity_thresh'
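The transition rules can also be read as a small pure function. The following is a minimal host-side sketch, not the hardware implementation: the names vad_model_state_t, vad_model_thresh_t and vad_model_next are hypothetical, and the struct mirrors only the four threshold fields of lp_vad_config_t. It follows the field descriptions above, uses ">=" comparisons as in the diagram, and assumes that the state is left unchanged when the speak activity count is below min_speak_activity_thresh, a case that is not specified here.

```c
#include <stdbool.h>

/* Hypothetical host-side model; these names are not part of the driver. */
typedef enum {
    VAD_MODEL_LISTENING,  /* speak-activity-listening-state */
    VAD_MODEL_DETECTED,   /* speak-activity-detected-state */
} vad_model_state_t;

/* Mirrors only the four threshold fields of lp_vad_config_t. */
typedef struct {
    int speak_activity_thresh;
    int non_speak_activity_thresh;
    int min_speak_activity_thresh;
    int max_speak_activity_thresh;
} vad_model_thresh_t;

/* Returns the next state for the given speak / non-speak activity counts. */
static vad_model_state_t vad_model_next(vad_model_state_t state,
                                        int speak, int non_speak,
                                        const vad_model_thresh_t *t)
{
    if (state == VAD_MODEL_LISTENING) {
        /* Enough speak activity moves the model to the detected state. */
        return (speak >= t->speak_activity_thresh) ? VAD_MODEL_DETECTED
                                                   : VAD_MODEL_LISTENING;
    }

    /* Detected state: reaching the maximum returns to listening. */
    if (speak >= t->max_speak_activity_thresh) {
        return VAD_MODEL_LISTENING;
    }

    /* Between min and max, the non-speak count decides the next state. */
    if (speak >= t->min_speak_activity_thresh) {
        return (non_speak < t->non_speak_activity_thresh) ? VAD_MODEL_DETECTED
                                                          : VAD_MODEL_LISTENING;
    }

    /* Assumption: speak < min is not specified, so the state is kept. */
    return VAD_MODEL_DETECTED;
}
```

Feeding recorded activity counts through such a model on a host machine can help when choosing threshold values before trying them on the target.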
HP Driver Functional Overview
The VAD HP driver is used to configure the LP VAD to work under the control of the HP core. The VAD can also wake up the HP core when voice activity is detected.
Resource Allocation
lp_vad_init_config_t is the configuration structure needed to create an LP I2S VAD unit handle. Before creating an LP I2S VAD unit handle, you need to create an LP I2S channel handle first. See Low Power Inter-IC Sound.
You can call lp_i2s_vad_new_unit() to create the handle. If the VAD unit is no longer used, you should recycle the allocated resources by calling lp_i2s_vad_del_unit().
vad_unit_handle_t vad_handle = NULL;
lp_vad_init_config_t init_config = {
    .lp_i2s_chan = rx_handle,  // LP I2S channel handle created beforehand
    .vad_config = {
        .init_frame_num = 100,
        .min_energy_thresh = 100,
        .speak_activity_thresh = 10,
        .non_speak_activity_thresh = 30,
        .min_speak_activity_thresh = 3,
        .max_speak_activity_thresh = 100,
    },
};
// Create the VAD unit; the configuration is passed by pointer
ESP_ERROR_CHECK(lp_i2s_vad_new_unit(vad_id, &init_config, &vad_handle));
// Recycle the unit once it is no longer used
ESP_ERROR_CHECK(lp_i2s_vad_del_unit(vad_handle));
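Since the speak activity thresholds are described relative to each other, it can be useful to sanity-check a configuration before passing it to the driver. The sketch below is a hypothetical host-side helper, not part of the driver: vad_cfg_model_t is a local mirror of the lp_vad_config_t fields, and the rules it enforces (non-negative values, min_speak_activity_thresh lower than max_speak_activity_thresh) are assumptions derived from the field descriptions, not checks the driver is known to perform.

```c
#include <stdbool.h>

/* Hypothetical local mirror of the lp_vad_config_t fields. */
typedef struct {
    int init_frame_num;
    int min_energy_thresh;
    bool skip_band_energy_thresh;
    int speak_activity_thresh;
    int non_speak_activity_thresh;
    int min_speak_activity_thresh;
    int max_speak_activity_thresh;
} vad_cfg_model_t;

/* Returns true if the values are mutually consistent (assumed rules). */
static bool vad_cfg_model_is_sane(const vad_cfg_model_t *c)
{
    /* Assumption: frame counts and thresholds are never negative. */
    if (c->init_frame_num < 0 || c->min_energy_thresh < 0) {
        return false;
    }
    if (c->speak_activity_thresh < 0 || c->non_speak_activity_thresh < 0 ||
        c->min_speak_activity_thresh < 0) {
        return false;
    }
    /* The min..max range must not be empty, otherwise
     * non_speak_activity_thresh could never take effect. */
    return c->min_speak_activity_thresh < c->max_speak_activity_thresh;
}
```

With the values from the example above (min 3, max 100) the check passes; swapping the two values makes it fail.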
Enable and Disable the VAD
Before using a VAD unit to detect voice activity, you need to enable it by calling lp_i2s_vad_enable(). This function switches the driver state from init to enable and also enables the VAD hardware. Calling lp_i2s_vad_disable() does the opposite: it puts the driver back into the init state and stops the hardware.
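The init and enable driver states can be pictured with a small host-side model. This is only a sketch under assumptions: the type and function names are hypothetical, and the rule that a call made in the wrong state fails with an invalid-state error is inferred from the ESP_ERR_INVALID_STATE return value documented in the API reference, not taken from the driver source.

```c
/* Hypothetical model of the two driver states; not the driver code. */
typedef enum {
    VAD_DRV_MODEL_INIT,
    VAD_DRV_MODEL_ENABLE,
} vad_drv_model_state_t;

typedef enum {
    VAD_DRV_MODEL_OK = 0,
    VAD_DRV_MODEL_ERR_INVALID_STATE,  /* stands in for ESP_ERR_INVALID_STATE */
} vad_drv_model_err_t;

/* init -> enable; assumed to fail if the unit is already enabled. */
static vad_drv_model_err_t vad_drv_model_enable(vad_drv_model_state_t *s)
{
    if (*s != VAD_DRV_MODEL_INIT) {
        return VAD_DRV_MODEL_ERR_INVALID_STATE;
    }
    *s = VAD_DRV_MODEL_ENABLE;
    return VAD_DRV_MODEL_OK;
}

/* enable -> init; assumed to fail if the unit is not enabled. */
static vad_drv_model_err_t vad_drv_model_disable(vad_drv_model_state_t *s)
{
    if (*s != VAD_DRV_MODEL_ENABLE) {
        return VAD_DRV_MODEL_ERR_INVALID_STATE;
    }
    *s = VAD_DRV_MODEL_INIT;
    return VAD_DRV_MODEL_OK;
}
```

Under this model, enable and disable calls have to be paired, and a unit would be disabled before it is deleted.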
HP Core Wake-up
esp_sleep_enable_vad_wakeup() sets the VAD as a wake-up source for the HP core. For the VAD to keep working during sleep, the system must keep the RTC domain and the XTAL powered. See the code example below:
ESP_ERROR_CHECK(esp_sleep_enable_vad_wakeup());
LP Driver Functional Overview
The VAD LP driver is mainly for LP core wake-up. The VAD is configured under HP core control and can then wake up the LP core when voice activity is detected.
Resource Allocation
lp_core_lp_vad_cfg_t and lp_core_lp_vad_init() are used to initialize the VAD LP driver.
lp_core_lp_vad_deinit() is used to recycle the allocated resources.
Enable and Disable the VAD
lp_core_lp_vad_enable() and lp_core_lp_vad_disable() are used to enable and disable the hardware.
LP Core Wake-up
Set ULP_LP_CORE_WAKEUP_SOURCE_LP_VAD in ulp_lp_core_cfg_t to make the VAD work as the LP core wake-up source.
static void load_and_start_lp_core_firmware(ulp_lp_core_cfg_t* cfg, const uint8_t* firmware_start, const uint8_t* firmware_end)
{
    TEST_ASSERT(ulp_lp_core_load_binary(firmware_start,
                                        (firmware_end - firmware_start)) == ESP_OK);
    TEST_ASSERT(ulp_lp_core_run(cfg) == ESP_OK);
}

ulp_lp_core_cfg_t cfg = {
    .wakeup_source = ULP_LP_CORE_WAKEUP_SOURCE_LP_VAD,
};
load_and_start_lp_core_firmware(&cfg, lp_core_main_vad_bin_start, lp_core_main_vad_bin_end);
API Reference
Header File
This header file can be included with:
#include "driver/lp_i2s_vad.h"
This header file is a part of the API provided by the esp_driver_i2s component. To declare that your component depends on esp_driver_i2s, add the following to your CMakeLists.txt:
REQUIRES esp_driver_i2s
or
PRIV_REQUIRES esp_driver_i2s
Functions
-
esp_err_t lp_i2s_vad_new_unit(lp_vad_t vad_id, const lp_vad_init_config_t *init_config, vad_unit_handle_t *ret_unit)
New LP VAD unit.
- Parameters
vad_id -- [in] VAD id
init_config -- [in] Initial configurations
ret_unit -- [out] Unit handle
- Returns
ESP_OK: On success
ESP_ERR_INVALID_ARG: Invalid argument
ESP_ERR_INVALID_STATE: Driver state is invalid, you shouldn't call this API at this moment
-
esp_err_t lp_i2s_vad_enable(vad_unit_handle_t unit)
Enable LP VAD.
- Parameters
unit -- [in] VAD handle
- Returns
ESP_OK: On success
ESP_ERR_INVALID_ARG: Invalid argument
ESP_ERR_INVALID_STATE: Driver state is invalid, you shouldn't call this API at this moment
-
esp_err_t lp_i2s_vad_disable(vad_unit_handle_t unit)
Disable LP VAD.
- Parameters
unit -- [in] VAD handle
- Returns
ESP_OK: On success
ESP_ERR_INVALID_ARG: Invalid argument
ESP_ERR_INVALID_STATE: Driver state is invalid, you shouldn't call this API at this moment
-
esp_err_t lp_i2s_vad_del_unit(vad_unit_handle_t unit)
Delete LP VAD unit.
- Parameters
unit -- [in] VAD handle
- Returns
ESP_OK: On success
ESP_ERR_INVALID_ARG: Invalid argument
ESP_ERR_INVALID_STATE: Driver state is invalid, you shouldn't call this API at this moment
Structures
-
struct lp_vad_config_t
LP VAD configurations.
Public Members
-
int init_frame_num
Number of initial frames that the VAD uses for denoising. This helps reduce the accidental trigger ratio. Note that values that are too large may cause voice activity to be missed.
-
int min_energy_thresh
Minimum energy threshold, voice activities with energy higher than this value will be detected.
-
bool skip_band_energy_thresh
Whether to skip the passband energy check. The passband energy check determines whether the proportion of passband energy within the total frequency domain meets the required threshold. Note that, depending on the environment, enabling the passband energy check may reduce the false trigger rate but may also increase the rate of missed detections.
-
int speak_activity_thresh
In speak-activity-listening-state, if the number of detected speak activities is higher than this value, the VAD enters speak-activity-detected-state.
-
int non_speak_activity_thresh
In speak-activity-detected-state, if the number of detected speak activities is higher than min_speak_activity_thresh but lower than max_speak_activity_thresh:
if the number of detected non-speak activities is higher than this value, the VAD returns to speak-activity-listening-state;
if the number of detected non-speak activities is lower than this value, the VAD stays in speak-activity-detected-state.
-
int min_speak_activity_thresh
In speak-activity-detected-state, if the number of detected speak activities is higher than this value but lower than max_speak_activity_thresh, the next state depends on the value of non_speak_activity_thresh.
-
int max_speak_activity_thresh
In speak-activity-detected-state, if the number of detected speak activities is higher than this value, the VAD returns to speak-activity-listening-state.
-
struct lp_vad_init_config_t
LP VAD Init Configurations.
Public Members
-
lp_i2s_chan_handle_t lp_i2s_chan
LP I2S channel handle.
-
lp_vad_config_t vad_config
LP VAD config.
Type Definitions
-
typedef uint32_t lp_vad_t
LP VAD peripheral
-
typedef struct vad_unit_ctx_t *vad_unit_handle_t
Type of VAD unit handle.
Header File
components/ulp/lp_core/shared/include/ulp_lp_core_lp_vad_shared.h
This header file can be included with:
#include "ulp_lp_core_lp_vad_shared.h"
This header file is a part of the API provided by the ulp component. To declare that your component depends on ulp, add the following to your CMakeLists.txt:
REQUIRES ulp
or
PRIV_REQUIRES ulp
Functions
-
esp_err_t lp_core_lp_vad_init(lp_vad_t vad_id, const lp_core_lp_vad_cfg_t *init_config)
LP VAD init
- Parameters
vad_id -- [in] VAD ID
init_config -- [in] Initial configurations
- Returns
ESP_OK: On success
ESP_ERR_INVALID_ARG: Invalid argument
ESP_ERR_INVALID_STATE: Driver state is invalid, you shouldn't call this API at this moment
-
esp_err_t lp_core_lp_vad_enable(lp_vad_t vad_id)
Enable LP VAD.
- Parameters
vad_id -- [in] VAD ID
- Returns
ESP_OK: On success
ESP_ERR_INVALID_ARG: Invalid argument
ESP_ERR_INVALID_STATE: Driver state is invalid, you shouldn't call this API at this moment
Type Definitions
-
typedef lp_vad_init_config_t lp_core_lp_vad_cfg_t
LP VAD configurations.