Espressif Speech Wake-up Solution Customization Process
Wake Word Customization Process
Espressif provides the following wake word customization options:
Espressif has already released some wake words for customers’ commercial use, such as “Hi Lexin” or “Nihao Xiaoxin”.
For a complete list, see Table Publicly Available Wake Words Provided by Espressif.
Espressif also plans to provide more wake words that are free for commercial use soon.
Offline wake word customization can also be provided by Espressif:
Training corpus provided by customer
The customer must provide at least 20,000 qualified corpus entries. See detailed requirements in Section Requirements on Corpus.
It usually takes two to three weeks for Espressif to train and optimize the received corpus.
A fee will be charged for training and optimizing the corpus.
Training corpus provided by Espressif
Espressif provides all the corpus required for training.
The time required to collect the corpus needs to be discussed separately. Once the corpus is ready, it usually takes two to three weeks for Espressif to train and optimize it.
A fee will be charged for training and optimizing the corpus. A separate fee will be charged for collecting the corpus.
The actual fee and time for your customization depend on the number of wake words you need and the scale of your mass production. For details, please contact our sales team.
About Espressif wake word engine WakeNet:
Currently, up to 5 wake words are supported by each WakeNet model.
A wake word usually consists of 3 to 6 syllables, such as “Hi Lexin”, “xiaoaitongxue”, “nihaotianmao”.
Multiple WakeNet models can be used together. However, each additional model consumes more resources.
For more details, see Section WakeNet Wake Word Model.
Requirements on Corpus
As mentioned above, customers can provide Espressif with a training corpus that they collected themselves or purchased from a third party. In either case, the corpus must meet the following requirements:
Audio file format
Sample rate: 16 kHz
Encoding: 16-bit signed int
Channel: mono
Format: WAV
Sampling requirement
Number of speakers: more than 500 people, including men and women of all ages and at least 100 children.
Sampling environment: a quiet room (< 40 dB). It is recommended to use a professional audio room.
Recording device: high-fidelity microphone.
How to sample:
At 1 meter away from the microphone: each person speaks the wake word aloud 15 times (5 times fast, 5 times at normal speed, 5 times slowly).
At 3 meters away from the microphone: each person speaks the wake word aloud 15 times (5 times fast, 5 times at normal speed, 5 times slowly).
File name: it is recommended to name each sample according to the gender and age of the speaker, the speaking speed, and the sample number, such as female_age_fast_id.wav. Alternatively, you can provide this information in a separate file.
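The audio file format requirements above can be checked automatically before a corpus is submitted. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` and the wording of its messages are illustrative assumptions, not part of any Espressif tool. The `wave` module reads PCM WAV files only, and 16-bit PCM samples in a WAV file are signed by convention, so checking the sample width covers the encoding requirement.

```python
import wave

# Target values taken from the "Audio file format" requirements.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16-bit samples
EXPECTED_CHANNELS = 1             # mono


def check_wav(path):
    """Return a list of format problems for one file (empty list means OK)."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            width = wav.getsampwidth()
            channels = wav.getnchannels()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        # Raised for files that are not PCM WAV or are truncated.
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
    if width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"sample width is {8 * width} bits, expected 16 bits")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    if frames == 0:
        problems.append("file contains no audio frames")
    return problems
```

Calling `check_wav("female_age_fast_id.wav")` on a conforming file returns an empty list. As a planning aid, and assuming each recording counts as one corpus entry, the sampling procedure above yields 30 recordings per speaker (15 at each distance), so about 667 speakers are needed to reach 20,000 entries.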
Hardware Design and Test
The voice wake-up performance heavily depends on the hardware design and cavity structure. Therefore, please pay special attention to the following requirements:
Hardware Design
Speaker designs: customers can make their own designs by modifying the reference designs (schematic/PCB) provided by Espressif. Espressif can also review customers’ speaker designs to avoid some common design issues.
Cavity structure: the cavity should be designed by acoustic specialists. Espressif does not provide an industrial design (ID) reference. Customers can refer to other mainstream speaker cavity designs on the market, such as Tmall Genie, Xiaodu Smart Speaker, and Google Smart Speaker.
Customers can perform the following tests to verify their hardware designs. It is recommended to perform these tests in a professional audio room. Customers can adapt the tests to their actual testing environment.
Recording test to verify the gain and distortion of mic and codec
Play the sample (90 dB, 0.1 meter away from the mic), and adjust the gain to ensure that the recording is not saturated.
Play a sweep file covering 0 to 20 kHz, and record it at a sampling rate of 16 kHz. The recording should not show obvious frequency aliasing.
Record 100 samples, and feed these samples to a publicly available cloud speech recognition API. A certain recognition rate must be reached.
Playback test to verify the distortion of power amplifier (PA) and speaker
Test PA power @ 1% Total Harmonic Distortion (THD)
Speech algorithms test to verify the AEC, BFM and NS models
Adjust the delays of the reference signals based on the different requirements of different AEC algorithms.
Test the product based on the actual use scenario. For example, play 85DB-90DB Dreamer.wav (a song) and record.
Analyze the processed signals to evaluate the performance of AEC, BFM, NS, etc.
DSP performance test to identify the correct DSP parameters and minimize the nonlinear distortion in the DSP algorithm
Noise Suppression
Acoustic Echo Cancellation
Speech Enhancement
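Two of the checks above can be approximated offline on captured samples: whether a recording is saturated (recording test) and how much harmonic distortion a captured test tone contains (playback test). The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, samples are plain Python lists of 16-bit integers or floats, and the THD estimate assumes a pure test tone of known frequency whose capture spans a whole number of cycles. For real measurements, a windowed FFT or a dedicated audio analyzer is more typical.

```python
import math

INT16_FULL_SCALE = 32767


def clipped_fraction(samples, margin=0):
    """Fraction of 16-bit samples at or beyond full scale (0.0 means none)."""
    if not samples:
        return 0.0
    limit = INT16_FULL_SCALE - margin
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


def tone_amplitude(samples, freq_hz, rate_hz):
    """Amplitude of one frequency component, by correlating with cos/sin."""
    n = len(samples)
    step = 2.0 * math.pi * freq_hz / rate_hz
    re = sum(s * math.cos(step * i) for i, s in enumerate(samples))
    im = sum(s * math.sin(step * i) for i, s in enumerate(samples))
    return 2.0 * math.hypot(re, im) / n


def thd(samples, fundamental_hz, rate_hz, max_harmonic=5):
    """THD = sqrt(sum of squared harmonic amplitudes) / fundamental amplitude."""
    base = tone_amplitude(samples, fundamental_hz, rate_hz)
    if base == 0:
        raise ValueError("no energy at the fundamental frequency")
    harmonics = [
        tone_amplitude(samples, k * fundamental_hz, rate_hz)
        for k in range(2, max_harmonic + 1)
        if k * fundamental_hz < rate_hz / 2   # skip harmonics above Nyquist
    ]
    return math.sqrt(sum(a * a for a in harmonics)) / base
```

A result of `thd(...) <= 0.01` corresponds to the 1% THD point mentioned in the playback test, and a non-zero `clipped_fraction(...)` suggests that the gain should be reduced.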
Customers can also send one or two hardware samples to Espressif and ask us to optimize the product for better wake-up performance.
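For the sweep-based recording test described earlier, a sweep file can be generated with Python's standard library if none is at hand. This is a hedged sketch: the function name, the linear sweep shape, the 48 kHz playback rate, the 10-second duration, and the half-scale peak level are all assumptions chosen for illustration. A playback rate above 40 kHz is needed so that the file can actually contain 20 kHz; when the device records at 16 kHz, content above 8 kHz is what would fold back if the anti-aliasing filtering is inadequate.

```python
import math
import struct
import wave


def write_linear_sweep(path, f_start_hz=0.0, f_end_hz=20000.0,
                       duration_s=10.0, rate_hz=48000, peak=0.5):
    """Write a mono 16-bit WAV whose frequency rises linearly over time."""
    n = round(duration_s * rate_hz)
    k = (f_end_hz - f_start_hz) / duration_s      # sweep rate in Hz per second
    frames = bytearray()
    for i in range(n):
        t = i / rate_hz
        # Phase is the integral of the instantaneous frequency f_start + k*t.
        phase = 2.0 * math.pi * (f_start_hz * t + 0.5 * k * t * t)
        frames += struct.pack("<h", int(peak * 32767 * math.sin(phase)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(bytes(frames))
    return n   # number of samples written
```

For example, `write_linear_sweep("sweep_0_20k.wav")` writes a 10-second sweep that can be played through a reference speaker while the device under test records.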