Multimedia Technology Wiki: Component Description
Note
This document is automatically translated using AI. Please excuse any detailed errors. The official English version is still in progress.
ESP-New-JPEG Component
Note
For basic knowledge about JPEG, please refer to JPEG
Overview:
ESP-New-JPEG is a lightweight JPEG encoding and decoding library launched by Espressif Systems. To improve efficiency, the JPEG encoder and decoder have been deeply optimized to reduce memory consumption and enhance processing performance. For the ESP32-S3 chip that supports SIMD instructions, these instructions are used to further improve processing speed. In addition, rotation, cropping, and scaling functions have been extended, which can be performed simultaneously during the encoding and decoding process, thereby simplifying user operations. For chips with smaller memory, a block mode has been introduced to support processing part of the image content multiple times, effectively reducing memory pressure.
ESP-New-JPEG supports JPEG encoding and decoding of the Baseline Profile. The rotation, cropping, scaling, and block mode functions can only take effect under specific configurations.
JPEG Encoder Features:
The basic features supported by the encoder are as follows:
Supports encoding of any width and height
Supports the following pixel formats: RGB888, RGBA, YCbYCr, YCbY2YCrY2, GRAY
When using the YCbY2YCrY2 format, only YUV420 and Gray subsampling are supported
Supports YUV444, YUV422, YUV420, Gray subsampling
Supports quality setting range: 1-100
The extended features are as follows:
Supports 0°, 90°, 180°, 270° clockwise rotation
Supports dual-task encoding
Supports block mode encoding
Dual-task encoding can be used on dual-core chips to take full advantage of parallel encoding on both cores. One core handles the main encoding task, and the other core handles the entropy encoding part of the work. In most cases, enabling dual-task encoding improves performance by about 1.5 times. You can enable or disable dual-task encoding in menuconfig, and adjust the core and priority of the entropy encoding task.
Block encoding encodes one image block at a time, so the complete image is encoded over multiple passes. With YUV420 subsampling, each block is 16 rows high and as wide as the image; with other subsampling formats, each block is 8 rows high and as wide as the image. Because each pass handles only a small amount of data, the image buffer can be placed in DRAM, which improves encoding speed. The workflow of block encoding is shown in the following figure:
The configuration requirements for extended features are as follows:
JPEG Decoder Features:
The basic features supported by the decoder are as follows:
Supports decoding of any width and height
Supports single-channel and three-channel decoding
Supports the following pixel format outputs: RGB888, RGB565 (big-endian), RGB565 (little-endian), CbYCrY
The extended features are as follows:
Supports scaling (maximum reduction ratio is 1/8)
Supports cropping (cropping from the upper left corner)
Supports 0°, 90°, 180°, 270° clockwise rotation
Supports block mode decoding
When using the scaling and cropping functions, configure the corresponding jpeg_resolution_t parameter. The component can handle width and height separately. For example, to crop only the width and keep the height unchanged, set clipper.height = 0; the output then keeps the height of the original JPEG image.
The process of scaling, cropping, and rotating is sequential, as shown in the diagram below. The decoded JPEG data stream is first scaled, then cropped, and finally rotated and output.
Block decoding decodes one image block at a time, so the entire image is decoded over multiple passes. With YUV420 subsampling, each block is 16 lines high and as wide as the image; for other subsampling formats, each block is 8 lines high and as wide as the image. Because each pass handles only a small amount of data, block decoding suits chips without PSRAM, and placing the output image buffer in DRAM can also improve decoding speed. Block decoding can be seen as the reverse of block encoding.
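The block geometry described above can be worked out ahead of time when sizing a DRAM buffer. The following is a minimal sketch and not part of the ESP-New-JPEG API: the function names and the `bytes_per_pixel` parameter are hypothetical, and rounding the block count up for heights that are not a multiple of the block height is an assumption.

```python
def block_height(subsampling: str) -> int:
    """Rows per block: 16 for YUV420, 8 for every other subsampling format."""
    return 16 if subsampling.upper() == "YUV420" else 8


def block_plan(width: int, height: int, subsampling: str, bytes_per_pixel: int):
    """Return (rows_per_block, block_count, bytes_per_block) for one image."""
    rows = block_height(subsampling)
    # Assumption: a final partial block still needs one more pass.
    count = (height + rows - 1) // rows
    return rows, count, width * rows * bytes_per_pixel


# 320x240 YUV420 JPEG decoded to RGB565 (2 bytes per pixel)
print(block_plan(320, 240, "YUV420", 2))  # (16, 15, 10240)
```

In this example each pass needs a 10240-byte buffer, compared with 153600 bytes for the full 320×240 RGB565 frame.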
The configuration requirements for extended functions are as follows:
When block decoding is enabled, other extended functions cannot be used
The width and height in the configuration parameters of scaling, cropping, and rotating are required to be multiples of 8
When scaling and cropping are enabled at the same time, the size of the crop is required to be smaller than the size after scaling
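The rules above, together with the scale, crop, rotate order, can be checked before configuring the decoder. This is a sketch under stated assumptions, not the library's own validation: `decoded_size` is a hypothetical helper, a value of 0 is taken to mean "keep the size from the previous stage", and a crop equal to the scaled size is accepted.

```python
def decoded_size(src_w, src_h, scale=(0, 0), clip=(0, 0), rotation=0):
    """Return (width, height) after scale -> crop -> rotate, or raise ValueError."""
    if rotation not in (0, 90, 180, 270):
        raise ValueError("rotation must be 0, 90, 180 or 270 degrees")
    for name, (w, h) in (("scale", scale), ("clip", clip)):
        if w % 8 or h % 8:
            raise ValueError(f"{name} width and height must be multiples of 8")

    # Assumption: 0 keeps the size produced by the previous stage.
    scaled_w = scale[0] or src_w
    scaled_h = scale[1] or src_h
    if scaled_w * 8 < src_w or scaled_h * 8 < src_h:
        raise ValueError("maximum reduction ratio is 1/8")

    out_w = clip[0] or scaled_w
    out_h = clip[1] or scaled_h
    # Assumption: a crop exactly as large as the scaled image is allowed.
    if out_w > scaled_w or out_h > scaled_h:
        raise ValueError("crop must fit inside the scaled image")

    # 90 and 270 degree rotations swap width and height.
    if rotation in (90, 270):
        out_w, out_h = out_h, out_w
    return out_w, out_h


# 640x480 source, scaled to 320x240, cropped to 160 wide, rotated 90 degrees
print(decoded_size(640, 480, scale=(320, 240), clip=(160, 0), rotation=90))  # (240, 160)
```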
Usage:
The ESP-New-JPEG component is hosted on GitHub. You can add this component to your project by running the following command in the project directory.
idf.py add-dependency "espressif/esp_new_jpeg"
The test_app folder under the esp_new_jpeg folder contains a runnable test project, which shows the related API call process. Before using the ESP-New-JPEG component, it is recommended to refer to and debug this test project to familiarize yourself with the use of the API.
Performance:
ESP-New-JPEG has deeply optimized the JPEG encoding and decoding architecture:
Optimized data processing flow, improved the reuse efficiency of intermediate data, and reduced memory copy overhead.
Assembly-level optimization for Xtensa architecture chips; significantly improved computing performance on ESP32-S3 chips that support SIMD instructions.
Integrated various image operations such as cropping and rotating into the encoder and decoder to improve overall system efficiency
Please refer to Performance for codec performance test data.
FAQ:
Q: Does ESP-New-JPEG support decoding progressive JPEG?
A: No, ESP-New-JPEG only supports decoding baseline JPEG. You can use the following code to check whether the image is a progressive JPEG. Output 1 indicates progressive JPEG, and output 0 indicates baseline JPEG.
>>> from PIL import Image
>>> Image.open("file_name.jpg").info.get('progressive', 0)
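When Pillow is not available, a similar check can be made by reading the frame header marker directly. In the JPEG standard, SOF0 (0xFFC0) marks a baseline DCT frame and SOF2 (0xFFC2) marks a progressive DCT frame. The sketch below walks the marker segments with the standard library only; the function names are hypothetical and error handling is minimal.

```python
import struct

# Start-of-frame markers are 0xC0..0xCF except DHT (0xC4), JPG (0xC8), DAC (0xCC).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}
# TEM and RST0..RST7 are standalone markers with no length field.
STANDALONE = {0x01, *range(0xD0, 0xD8)}


def first_sof_marker(data: bytes):
    """Return the second byte of the first SOF marker, or None if none is found."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker, not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:            # fill byte before a marker
            pos += 1
            continue
        if marker in SOF_MARKERS:
            return marker
        if marker in (0xDA, 0xD9):    # SOS or EOI reached without a frame header
            return None
        if marker in STANDALONE:
            pos += 2
            continue
        # Every other segment stores a big-endian length that includes itself.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    return None


def is_progressive(data: bytes) -> bool:
    """True when the first frame header is SOF2 (progressive DCT)."""
    return first_sof_marker(data) == 0xC2
```

Passing the bytes of a file opened in binary mode to is_progressive() returns True for a progressive JPEG and False for a baseline one.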
Q: Why does the output image look misaligned?
A: The typical symptom is that some columns from the left or right edge of the image show up on the opposite side. If you are using ESP32-S3, a possible cause is that the output buffer of the decoder or the input buffer of the encoder is not aligned to 16 bytes. Use the jpeg_calloc_align() function to allocate the buffer.
Q: How to preview the raw data of the image, such as viewing RGB888 data?
A: You can use yuvplayer. It supports viewing grayscale, RGB888, RGB565 (little-endian), UYVY, YUYV, YUV420P, etc.
Related Links:
Component Registry: esp_new_jpeg component
GMF-AI-Audio Component
Overview:
GMF-AI-Audio is a voice interaction component developed based on the GMF framework. By encapsulating ESP-SR, it provides a complete interaction logic from voice wake-up to command recognition. The component integrates functions such as Wake Word detection, Voice Activity Detection (VAD), voice command recognition, and Acoustic Echo Cancellation (AEC), enabling efficient and natural voice interaction experiences in smart speakers, smart home devices, etc.
Supported Scenarios:
| Method | Corresponding Scenario |
|---|---|
| Immediately upload voice data after wake-up, stop uploading at the Wakeup End stage | Implement VAD function in the cloud, RTC scenario |
| Wait for VAD to trigger after wake-up and start uploading, stop uploading after VAD ends | Traditional interaction method of smart hardware |
| No wake-up, wait for VAD to trigger and start uploading, stop uploading after VAD ends | New cloud processing logic |
| Immediately upload voice data after pressing the button, stop after releasing | Devices with limited computing power implement voice functions through interaction with the cloud |
| Wait for VAD to trigger after pressing the button and start uploading, stop uploading after VAD ends | Solve the problem of excessive data volume caused by relying solely on VAD |
| Detect command words after wake-up | Default usage logic |
| No wake-up, wait for VAD to trigger and detect command words | Can be applied to some vehicle systems |
| Detect command words after pressing the button | Toys |
| Continuous command word recognition | Home control |
Related Links:
Detailed Documentation: GMF AI Audio Component
Demo Project: GMF AI Audio Example
ESP-H264 Component
Overview:
ESP-H264 is a lightweight H.264 encoder and decoder component developed by Espressif Systems, offering both hardware and software implementations. The hardware encoder is designed specifically for the ESP32-P4 chip, capable of achieving 1080P@30fps. The software encoder is based on openh264, and the decoder is based on tinyH264. Both are optimized for memory and CPU usage, ensuring optimal performance on Espressif chips.
Features:
Encoder Features
Hardware Encoder (ESP32-P4):
Supports Baseline Profile (maximum frame size 36864 macroblocks)
Supports width range [80, 1088] pixels, height range [80, 2048] pixels
Supports quality-priority bitrate control
Supports YUV420 raw data format
Supports dynamic adjustment of bitrate, frame rate, GOP, QP, etc.
Supports single-stream and dual-stream encoders
Supports deblocking filter, ROI, motion vector functions
Supports SPS and PPS encoding
Software Encoder:
Supports Baseline Profile (maximum frame size 36864 macroblocks)
Supports any resolution greater than 16 pixels in width and height
Supports quality-priority bitrate control
Supports YUYV and IYUV raw data formats
Supports dynamic adjustment of bitrate and frame rate
Supports SPS and PPS encoding
Decoder Features
Supports Baseline Profile (maximum frame size 36864 macroblocks)
Supports various widths and heights
Supports Long Term Reference (LTR) frames
Supports Memory Management Control Operations (MMCO)
Supports modification of reference image lists
Supports multiple reference frames specified in Sequence Parameter Set (SPS)
Supports IYUV output format
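To relate the 36864-macroblock limit and the IYUV format to concrete resolutions, the arithmetic can be sketched as follows. H.264 macroblocks are 16×16 pixels, and an IYUV (I420) frame stores a full-size Y plane plus two chroma planes subsampled by two in each direction. The helper names are hypothetical and not part of ESP-H264.

```python
def macroblock_count(width: int, height: int) -> int:
    """Number of 16x16 macroblocks covering the frame (partial blocks round up)."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def i420_frame_bytes(width: int, height: int) -> int:
    """Size of one IYUV (I420) frame: full-size Y plane plus quarter-size U and V."""
    chroma_w, chroma_h = (width + 1) // 2, (height + 1) // 2
    return width * height + 2 * chroma_w * chroma_h


MAX_MACROBLOCKS = 36864  # maximum frame size quoted in the feature lists

for w, h in [(320, 240), (1280, 720), (1920, 1080)]:
    mbs = macroblock_count(w, h)
    print(f"{w}x{h}: {mbs} macroblocks, fits={mbs <= MAX_MACROBLOCKS}, "
          f"I420 frame={i420_frame_bytes(w, h)} bytes")
```

A 1920×1080 frame covers 8160 macroblocks, well under the limit, and a single 1280×720 I420 frame occupies 1382400 bytes, which gives a sense of why memory use scales with resolution.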
Performance:
Encoding Performance: The hardware encoder is recommended on ESP32-P4; ESP32-S3 and other boards use the software encoder
Hardware Encoder (only for ESP32-P4):
Better performance and power consumption, supports up to 1080P@30fps
Supports single-stream/dual-stream encoding
Supports dynamic adjustment of bitrate, frame rate, GOP, QP, etc.
Supports advanced features such as deblocking filter, ROI, motion vector, etc.
Software Encoder (all platforms):
Limited performance and power efficiency, but no resolution limit
Supports YUYV and IYUV formats, richer color formats
Supports all Espressif chip platforms, more board choices
Based on OpenH264 open source project
| Platform | Type | Maximum Resolution | Maximum Performance | Remarks |
|---|---|---|---|---|
| ESP32-S3 | Software Encoder | Any | 320×240@11fps | |
| ESP32-P4 | Hardware Encoder | ≤1080P | 1920×1080@30fps | Hardware Acceleration |
Decoding Performance: The software decoder is recommended on all boards
Software Decoder (all platforms):
Limited performance and power efficiency, but no resolution limit
Supports IYUV output format
Supports advanced features such as long-term reference frames, memory management control, etc.
Based on TinyH264 open source project
| Platform | Type | Maximum Resolution | Maximum Performance |
|---|---|---|---|
| ESP32-S3 | Software Decoder | Any | 320×192@27fps |
| ESP32-P4 | Software Decoder | Any | 1280×720@10fps |
Warning
Memory consumption strongly depends on the resolution and encoding data of the H.264 stream. It is recommended to adjust the memory allocation according to the actual application scenario.
Tip
Using a dual-task decoder can significantly improve decoding performance, especially in high-resolution video processing.
Component Links:
Component Registry: esp_h264 component
Sample Projects: ESP H.264 Sample Projects
Usage Tips: ESP-H264 Usage Tips Document
Related Resources:
ESP-Image-Effects Component
Overview:
ESP-Image-Effects is an image processing engine developed by Espressif Systems, integrating basic functions such as rotation, color space conversion, scaling, and cropping. As one of the core components of Espressif’s audio and video development platform, the ESP-Image-Effects module has deeply restructured the underlying algorithms, combined with efficient memory management and hardware acceleration, achieving high performance, low power consumption, and low memory occupancy. In addition, each image processing function adopts a consistent API architecture design, reducing the learning cost for users and facilitating rapid development. This engine is widely used in the Internet of Things, smart cameras, industrial vision, and other fields.
Features:
Image Color Conversion
Supports any input resolution
Supports bypass mode for the same input/output format
Supports BT.601/BT.709/BT.2020 color space standards
Supports fast color conversion algorithms for specific formats and resolutions
Comprehensive format support matrix:
Color Conversion Format Support
| Input Format | Supported Output Formats |
|---|---|
| RGB/BGR565_LE/BE RGB/BGR888 | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 YUV_PLANAR/PACKET YUYV/UYVY O_UYY_E_VYY/I420 |
| ARGB/BGR888 | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 YUV_PLANAR O_UYY_E_VYY/I420 |
| YUV_PACKET/UYVY/YUYV | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 O_UYY_E_VYY/I420 |
| O_UYY_E_VYY/I420 | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 O_UYY_E_VYY |
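As an illustration of what an RGB-to-YUV conversion under BT.601 involves, the sketch below applies the full-range (JFIF-style) BT.601 coefficients to one pixel. It is not the component's implementation: whether ESP-Image-Effects uses full or limited range, or fixed-point arithmetic, is not stated here, and the function name is hypothetical.

```python
def rgb_to_ycbcr_bt601(r: int, g: int, b: int):
    """Convert one 8-bit RGB pixel to 8-bit YCbCr with full-range BT.601 weights."""

    def clamp(x: float) -> int:
        # Round to nearest and keep the result inside 0..255.
        return max(0, min(255, int(x + 0.5)))

    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return clamp(y), clamp(cb), clamp(cr)


print(rgb_to_ycbcr_bt601(255, 255, 255))  # (255, 128, 128)
print(rgb_to_ycbcr_bt601(255, 0, 0))      # (76, 85, 255)
```

BT.709 and BT.2020 follow the same structure with different luma weights.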
Image Rotation
Supports bypass mode
Supports any input resolution
Supports clockwise rotation at any angle
Supports ESP_IMG_PIXEL_FMT_Y/RGB565/BGR565/RGB888/BGR888/YUV_PACKET formats
Supports fast clockwise rotation algorithms for specific angles, formats, and resolutions
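For right-angle rotations, the pixel mapping is simple enough to show directly. The sketch below rotates a small row-major pixel grid clockwise in 90° steps; it only illustrates the geometry (note that width and height swap for 90° and 270°) and is not how the component implements rotation. The function name is hypothetical.

```python
def rotate_clockwise(rows, angle):
    """Rotate a list-of-rows image clockwise by a multiple of 90 degrees."""
    if angle % 90:
        raise ValueError("this sketch only handles multiples of 90 degrees")
    for _ in range((angle // 90) % 4):
        # Reverse the row order, then transpose: one 90-degree clockwise turn.
        rows = [list(col) for col in zip(*rows[::-1])]
    return rows


image = [[1, 2, 3],
         [4, 5, 6]]                      # 3 wide, 2 high
print(rotate_clockwise(image, 90))       # [[4, 1], [5, 2], [6, 3]]
```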
Image Scaling
Supports bypass mode
Supports any input resolution
Supports up-sampling and down-sampling operations
Supports ESP_IMG_PIXEL_FMT_RGB565/BGR565/RGB888/BGR888/YUV_PACKET formats
Supports various filtering algorithms: optimized down-sampling and bilinear interpolation
Image Cropping
Supports bypass mode
Supports any input resolution
Supports up-sampling and down-sampling operations
Supports flexible area selection
Supports ESP_IMG_PIXEL_FMT_Y/RGB565/BGR565/RGB888/BGR888/YUV_PACKET formats
Performance:
The ESP-Image-Effects component has completed performance testing under 1080P. For specific performance data, please refer to the ESP32-P4 Performance Document. This component uses efficient memory management and hardware acceleration technology to achieve high performance, low power consumption, and low memory occupancy.
Related Links:
Component Registry: esp_image_effects component
Example Project: ESP-Image-Effects Example Project
Component Release Document: Image Processing Release Document
Frequently Asked Questions: ESP-Image-Effects Official Document
ESP-Audio-Effects Component
Overview:
ESP-Audio-Effects is a powerful and flexible audio processing library, designed to provide developers with efficient audio effect processing capabilities. This component is widely used in various smart audio devices, including smart speakers, headphones, audio playback devices, and voice interaction systems.
Features:
Automatic Level Control: Automatically adjusts input gain to stabilize audio volume. Progressive adjustment ensures smooth transition. Dynamic correction of over-amplification to avoid clipping distortion.
Equalizer: Provides fine control over filter type, frequency, gain, and Q factor. Suitable for audio tuning and professional signal shaping.
Fade In/Out: Implements fade in and fade out effects, ensuring smooth transitions between tracks.
Speed and Pitch Processing: Supports real-time speed and pitch modification, achieving more dynamic playback effects.
Mixer: Merges multiple input streams into one output, with start/target weights and transition time configurable for each input.
Data Interleaver: Handles interleaving and de-interleaving of audio data buffers.
Sample Rate Conversion: Performs sample rate conversion between multiples of 4000 and 11025.
Channel Conversion: Remaps audio channel layout using weight array.
Bit Depth Conversion: Supports conversion between U8, S16, S24, and S32 bit depths.
The table below lists the supported sample rates, channel numbers, and sample bit depths for each module. For a detailed introduction, performance data, and examples for each module, follow the README link in the Module column.
| Module | Sample Rate | Channel Number | Sample Bit Depth | Data Layout |
|---|---|---|---|---|
| | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| | 4 to 192 kHz, and integer multiples of 4000 or 11025 | Full range | s16, s24, s32 | Interleaved |
| | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| | 4 to 192 kHz, and integer multiples of 4000 or 11025 | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| | Full range | Full range | u8, s16, s24, s32 | Interleaved and deinterleaved |
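The sample-rate constraint "4 to 192 kHz, and integer multiples of 4000 or 11025" can be expressed as a small check. This is one reading of the constraint (both conditions must hold), and `is_supported_rate` is a hypothetical helper, not a function from ESP-Audio-Effects.

```python
def is_supported_rate(rate_hz: int) -> bool:
    """True for rates within 4 kHz..192 kHz that are multiples of 4000 or 11025."""
    in_range = 4000 <= rate_hz <= 192000
    return in_range and (rate_hz % 4000 == 0 or rate_hz % 11025 == 0)


for rate in (8000, 22050, 44100, 48000, 10000):
    print(rate, is_supported_rate(rate))
```

Under this reading, common rates such as 8000, 22050, 44100, and 48000 Hz pass, while 10000 Hz does not.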
Data Layout:
ESP-Audio-Effects supports both interleaved and deinterleaved audio formats:
Interleaved format: Use the esp_ae_xxx_process() API to process this layout. For example:
L0 R0 L1 R1 L2 R2 ...
Where L and R represent left and right channel samples respectively.
Deinterleaved format: Use the esp_ae_xxx_deintlv_process() API. Each channel is stored in a separate buffer:
L1, L2, L3, ... // Left channel
R1, R2, R3, ... // Right channel
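The relationship between the two layouts can be shown with plain lists. This sketch only illustrates the memory layout; the helper names are hypothetical and the real processing happens in the C APIs named above.

```python
def deinterleave(samples, channels):
    """Split [L0, R0, L1, R1, ...] into one list per channel."""
    return [samples[ch::channels] for ch in range(channels)]


def interleave(planes):
    """Merge per-channel lists back into [L0, R0, L1, R1, ...]."""
    return [sample for frame in zip(*planes) for sample in frame]


stereo = [0, 100, 1, 101, 2, 102]          # L0 R0 L1 R1 L2 R2
left, right = deinterleave(stereo, 2)
print(left, right)                          # [0, 1, 2] [100, 101, 102]
print(interleave([left, right]) == stereo)  # True
```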
API Style:
ESP-Audio-Effects provides a consistent and developer-friendly API:
| Category | Function | Description |
|---|---|---|
| Initialization | | Create an audio effect handle. |
| Interleaved Processing | esp_ae_xxx_process() | Process interleaved audio data. |
| Deinterleaved Processing | esp_ae_xxx_deintlv_process() | Process deinterleaved audio data. |
| Set Parameters | | Set component-specific parameters. |
| Get Parameters | | Get current parameters. |
| Release | | Release resources and destroy the handle. |
Related links:
Component registry: esp_audio_effects component
ESP-Audio-Codec Component
Overview:
ESP-Audio-Codec is an audio encoding and decoding processing module developed by Espressif for SoC platforms. It provides a standardized encoding and decoding interface framework, making it easy for users to flexibly expand and combine different audio formats. This module mainly includes three parts: ESP Audio Encoder, ESP Audio Decoder, and Simple Decoder.
ESP Audio Encoder provides a unified encoder interface and supports registering various encoders (such as AAC, AMR-NB, AMR-WB, ADPCM, G711A, G711U, PCM, OPUS, ALAC, etc.). Users can create one or more encoder instances through the interface to encode multiple streams at the same time. They can also call the API of a specific encoder directly to skip the common interface layer.
ESP Audio Decoder provides a unified decoder interface and supports registering various decoders (such as AAC, MP3, AMR-NB, AMR-WB, ADPCM, G711A, G711U, VORBIS, OPUS, ALAC, etc.). Users can create one or more decoder instances through the interface to decode multiple streams at the same time. They can also call the API of a specific decoder directly to skip the common interface layer. ESP Audio Decoder only processes whole audio frames (i.e., the input data must start and end on frame boundaries).
Simple Decoder aggregates and organizes audio frames through the parser, and then calls ESP Audio Decoder for decoding, simplifying the parsing and positioning of audio frames. Users can input data of any length. The audio containers supported by this decoder include AAC, MP3, WAV, FLAC, AMRNB, AMRWB, M4A, etc.
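To make the parser's role concrete, the sketch below shows how arbitrary-length input can be regrouped into whole frames, using the standard ADTS header layout for AAC (12-bit syncword 0xFFF, 13-bit frame length spread over header bytes 3 to 5). It is an illustration of the idea only: the class and helper names are hypothetical, it is not the Simple Decoder's actual parser, and a production parser would validate more header fields to reject false syncwords inside payload data.

```python
class AdtsFrameSplitter:
    """Collects arbitrary-length chunks and returns complete ADTS frames."""

    HEADER_LEN = 7  # ADTS header without CRC

    def __init__(self):
        self._buf = bytearray()

    def _sync_offset(self) -> int:
        """Offset of the next 12-bit syncword (0xFFF), or -1 if none is buffered."""
        b = self._buf
        for i in range(len(b) - 1):
            if b[i] == 0xFF and (b[i + 1] & 0xF0) == 0xF0:
                return i
        return -1

    def feed(self, chunk: bytes):
        """Append a chunk and return every frame that is now complete."""
        self._buf += chunk
        frames = []
        while True:
            start = self._sync_offset()
            if start < 0:
                # Keep a trailing 0xFF: it may be the first half of a syncword.
                keep = 1 if self._buf.endswith(b"\xff") else 0
                del self._buf[:len(self._buf) - keep]
                break
            del self._buf[:start]
            if len(self._buf) < self.HEADER_LEN:
                break                      # header not complete yet
            b = self._buf
            length = ((b[3] & 0x03) << 11) | (b[4] << 3) | (b[5] >> 5)
            if length < self.HEADER_LEN:
                del self._buf[:1]          # implausible header, resynchronise
                continue
            if len(self._buf) < length:
                break                      # frame not complete yet
            frames.append(bytes(self._buf[:length]))
            del self._buf[:length]
        return frames


def make_adts_frame(payload: bytes) -> bytes:
    """Build a synthetic frame for testing: valid sync and length, dummy other fields."""
    length = 7 + len(payload)
    header = bytes([0xFF, 0xF1, 0x50,
                    0x80 | ((length >> 11) & 0x03),
                    (length >> 3) & 0xFF,
                    ((length & 0x07) << 5) | 0x1F,
                    0xFC])
    return header + payload
```

Feeding the splitter 5 bytes at a time still yields the original frames, which is the property that lets a caller pass data of any length while the frame-level decoder only ever sees whole frames.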
Main Features:
Easy-to-use interface: Provides a user-friendly interface for easy development and integration.
High performance and lightweight: The module is optimized for high performance and low memory usage.
Dual-layer decoder API: ESP Audio Decoder can be used when the input data is aligned to frame boundaries; Simple Decoder can be used for data of any length. The two APIs are similar, making it easy to switch.
Highly customizable: Through the registration interface, users can easily add custom decoders, encoders, or simple decoders, or override the default implementation without modifying the application code.
Functional Features:
ESP Audio Encoder: Provides a unified encoder interface; all encoders can be operated through a unified API (see esp_audio_enc.h). The module supports registering custom encoders or overriding the default implementation through esp_audio_enc_register(), or using esp_audio_enc_register_default() to register all supported encoders at once, and can be managed uniformly through menuconfig. The encoders supported by the module and their detailed parameters are as follows:
AAC:
Supports AAC-LC (Low Complexity) encoding
Sampling rate (Hz): 96000, 88200, 64000, 48000, 44100, 32000, 24000, 22050, 16000, 12000, 11025, 8000
Number of channels: mono, stereo
Bit depth: 16 bits
Fixed bit rate: 12 Kbps ~ 160 Kbps
Option to write ADTS header
AMR:
Supports Narrowband (NB) and Wideband (WB) encoding
AMRNB sampling rate: 8 kHz
AMRWB sampling rate: 16 kHz
Number of channels: mono
Bit depth: 16 bits
AMRNB bit rate (Kbps): 4.75, 5.15, 5.9, 6.7, 7.4, 7.95, 10.2, 12.2
AMRWB bit rate (Kbps): 6.6, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, 23.85
Supports DTX (Discontinuous Transmission)
ADPCM:
Supports all sampling rates
Number of channels: mono, stereo
Bit depth: 16 bits
G711:
Supports A-LAW and U-LAW
Supports all sampling rates
Supports all channel numbers
Bit depth: 16 bits
OPUS:
Sampling rate (Hz): 8000, 12000, 16000, 24000, 48000
Number of channels: mono, stereo
Bit depth: 16 bits
Fixed bit rate: 20 Kbps ~ 510 Kbps
Frame duration (ms): 2.5, 5, 10, 20, 40, 60, 80, 100, 120
Supports VoIP and music mode
Adjustable encoding complexity (0~10)
Supports FEC (Forward Error Correction), DTX (Discontinuous Transmission), VBR (Variable Bit Rate)
ALAC:
Sampling rate (Hz): [1000, 384000]
Number of channels: [1, 8]
Bit depth: 16, 24, 32 bits
SBC:
Sampling rate (Hz): 16000, 32000, 44100, 48000
Channel mode: mono, stereo, joint stereo
Bit depth: 16 bits
SBC mode: standard, mSBC
Block length: 4, 8, 12, 16
Number of subbands: 4, 8
Allocation method: loudness, SNR
bitpool range: 2~250
LC3:
Sampling rate (Hz): 8000, 16000, 24000, 32000, 44100, 48000
Supports all channels
Bit depth: 16, 24, 32 bits
Frame duration (ms): 7.5, 10
nbyte range: 20~400
Supports adding a 2-byte length prefix to each frame
PCM
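The sampling rates, channel counts, bit depths, and frame durations listed above determine how much PCM data one frame represents. The following sketch shows the arithmetic; `pcm_frame_bytes` is a hypothetical helper, and the exact input size each encoder expects should be taken from the encoder's own API rather than from this calculation.

```python
def pcm_frame_bytes(sample_rate: int, channels: int, bits: int, frame_ms: float) -> int:
    """Bytes of interleaved PCM covering one frame of the given duration."""
    samples = sample_rate * frame_ms / 1000
    if samples != int(samples):
        raise ValueError("frame duration does not give a whole number of samples")
    return int(samples) * channels * bits // 8


# OPUS, 48 kHz stereo, 16-bit, 20 ms frame: 960 samples per channel
print(pcm_frame_bytes(48000, 2, 16, 20))   # 3840
# OPUS, 48 kHz mono, 16-bit, 2.5 ms frame: 120 samples
print(pcm_frame_bytes(48000, 1, 16, 2.5))  # 240
```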
ESP Audio Decoder: Provides a unified decoder interface; all decoders can be operated through a unified API (see esp_audio_dec.h). The module supports registering custom decoders or overriding the default implementation through esp_audio_dec_register(), or using esp_audio_dec_register_default() to register all supported decoders at once, and can be managed uniformly through menuconfig. The decoders supported by the module and their detailed parameters are as follows:
AAC:
Supports AAC-LC, AAC-Plus decoding
Configurable whether to enable AAC-Plus decoding, to reduce CPU and memory usage
Sampling rate (Hz): 96000, 88200, 64000, 48000, 44100, 32000, 24000, 22050, 16000, 12000, 11025, 8000
Number of channels: mono, stereo
Bit depth: 16 bits
Supports decoding audio data with or without ADTS headers
AMR:
Supports narrowband (NB) and wideband (WB) decoding
AMRNB sampling rate: 8 kHz
AMRWB sampling rate: 16 kHz
Number of channels: mono
Bit depth: 16 bits
ADPCM:
Supports all sampling rates
Number of channels: mono, stereo
Bit depth: 16 bits
Only supports IMA-ADPCM
G711:
Supports A-LAW and U-LAW
Supports all sampling rates
Supports all channel numbers
Bit depth: 16 bits
OPUS:
Sampling rate (Hz): 8000, 12000, 16000, 24000, 48000
Channel numbers: mono, stereo
Bit depth: 16 bits
Supports self-segmentation packet decoding
ALAC:
Sampling rate (Hz): 8000, 12000, 16000, 24000, 48000
Channel numbers: mono, stereo
Bit depth: 16 bits
FLAC:
Sampling rate (Hz): 96000, 48000, 44100, 32000, 24000, 22050, 16000, 12000, 11025, 8000
Channel numbers: [1, 8]
Bit depth: 16, 24, 32 bits
VORBIS:
Sampling rate (Hz): 48000, 44100, 32000, 24000, 22050, 16000, 12000, 11025, 8000
Channel numbers: mono, stereo
Bit depth: 16 bits
Only supports VORBIS frame decoding, need to remove OGG header
User needs to provide general header information first
SBC:
Sampling rate (Hz): 16000, 32000, 44100, 48000
Channel numbers: mono, stereo
Bit depth: 16 bits
SBC mode: standard, mSBC
Supports packet loss concealment (PLC)
LC3:
Sampling rate (Hz): 8000, 16000, 24000, 32000, 44100, 48000
Supports all channels
Bit depth: 16, 24, 32 bits
Frame duration (ms): 7.5, 10
nbyte range: 20~400
Supports 2-byte length prefix frame data decoding
Supports packet loss concealment (PLC)
MP3
Simple Decoder
The simple decoder can be operated through the API, see esp_audio_simple_dec.h
Supports audio frame search and decoding of some audio containers
Supports general parser, users can add custom parsers according to the rules
Supports custom simple decoders to adapt to new file formats
Supports custom parser and decoder pairing: default parser can be used with custom decoder
Only supports streaming decoding, does not support seek
The supported audio containers and descriptions are as follows:
| Audio Container | Description |
|---|---|
| AAC | Supports AAC-Plus (configurable), parser can input data of any size |
| MP3 | Supports Layer 1, 2, 3, parser can input data of any size |
| AMRNB | Only supports files with AMRNB file header, parser can input data of any size |
| AMRWB | Only supports files with AMRWB file header, parser can input data of any size |
| FLAC | Only supports files with FLAC file header, parser can input data of any size |
| WAV | Supports G711A, G711U, PCM, ADPCM, parser can input data of any size |
| M4A | Supports MP3, AAC, ALAC, and only supports MDAT after MOOV, parser can input data of any size |
| TS | Supports MP3, AAC, with a parser that can input data of any size |
| G711 | Supports G711A, G711U, can input data of any size |
| ADPCM | Only supports IMA-ADPCM, input frames without a parser must be complete audio frames |
| SBC | Supports SBC and MSBC, input frames without a parser must be complete audio frames |
| LC3 | Supports LC3, input frames without a parser must be complete audio frames |
| OPUS | Supports OPUS, input frames without a parser must be complete audio frames |
Performance:
Encoder Performance
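| Encoder | Sampling Rate (Hz) | Channels | Memory (KB) | CPU Usage (%) |
|---|---|---|---|---|
| AAC | 48000 | 2 | 51.4 | 12.9 |
| G711-A | 8000 | 1 | 0.06 | 0.32 |
| G711-U | 8000 | 1 | 0.06 | 0.33 |
| AMR-NB | 8000 | 1 | 3.3 | 17.81 |
| AMR-WB | 16000 | 1 | 5.6 | 37.69 |
| ADPCM | 48000 | 2 | 0.01 | 2.69 |
| OPUS | 48000 | 2 | 29.4 | 24.9 |
| SBC | 48000 | 2 | 1.85 | 9.55 |
| LC3 | 48000 | 2 | 3.67 | 46.57 |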
Encoder CPU usage highly depends on encoding parameters (such as bitrate, complexity, etc.)
AAC encoder test bitrate is 90 kbps
AMR-NB/AMR-WB encoder test bitrates are 12.2 kbps/8.85 kbps respectively
OPUS encoder test bitrate is 90 kbps, complexity is 0
Memory only counts heap usage and does not include the stack. When all encoders are supported, the recommended task stack size is about 40 KB
Decoder Performance
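| Decoder | Sampling Rate (Hz) | Channels | Memory (KB) | CPU Usage (%) |
|---|---|---|---|---|
| AAC | 48000 | 2 | 51.2 | 6.75 |
| G711-A | 8000 | 1 | 0.04 | 0.14 |
| G711-U | 8000 | 1 | 0.04 | 0.13 |
| AMR-NB | 8000 | 1 | 1.8 | 4.23 |
| AMR-WB | 16000 | 1 | 5.4 | 9.5 |
| ADPCM | 48000 | 2 | 0.11 | 2.43 |
| OPUS | 48000 | 2 | 26.6 | 5.86 |
| MP3 | 44100 | 2 | 28 | 8.17 |
| FLAC | 44100 | 2 | 89.4 | 8.0 |
| SBC | 48000 | 2 | 0.21 | 8.14 |
| LC3 | 48000 | 2 | 1.36 | 17.5 |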
MP3 and FLAC decoders are tested with real audio data, others with sine wave PCM encoded data
The test file for the AAC decoder is AAC-LC; AAC-Plus decoding consumes more memory and CPU
Memory only counts heap usage. When all decoders are supported, the recommended task stack size is about 20 KB
Codec Comparison:
The following table compares the features of the Codecs supported by ESP-Audio-Codec:
| Codec | Features | Typical Bitrate Range (kbps) | Applicable Scenarios |
|---|---|---|---|
| AAC (Advanced Audio Coding) | Lossy compression, better sound quality than MP3, more efficient at the same bitrate; widely supported. | 96 – 320 (stereo typically uses 128–256) | Online music, video streaming (YouTube, Apple Music, radio). |
| MP3 | The most popular lossy compression format, excellent compatibility, but slightly less efficient than AAC/Opus. | 128 – 320 (as low as 64 can also be used) | Music download, traditional players, car audio. |
| AMR-NB / AMR-WB | Optimized for voice, clear voice at low bitrates; NB (8 kHz), WB (16 kHz). | AMR-NB: 4.75 – 12.2; AMR-WB: 6.6 – 23.85 | Mobile communication (2G/3G phone calls), VoIP, voice messages. |
| ADPCM | Simple compression, low latency, limited sound quality; not very efficient. | Common 16 – 64 | Early voice storage, embedded devices, simple audio transmission sensitive to latency. |
| G.711 (A-law / μ-law) | Waveform coding, fixed at 64 kbps, sound quality close to telephone level; extremely low latency. | Fixed at 64 | Landline, VoIP (such as SIP), call centers. |
| OPUS | Low latency, high sound quality, supports narrowband to full band, strong adaptability; open source and free. | 6 – 510 (common voice 16–32, music 64–128) | Real-time voice (VoIP, conference), music stream, game voice, WebRTC. |
| Vorbis | Open source lossy compression, good sound quality, better compression rate than MP3; gradually replaced by Opus. | 64 – 320 (commonly used 128–192) | Open source streaming media (OGG container), some games and applications. |
| FLAC | Lossless compression, retains original sound quality, compression rate about 40–60%. | 700 – 1100 (CD quality, depends on content) | High fidelity music storage, music download (Hi-Res music). |
| ALAC | Apple’s lossless compression, similar to FLAC, but limited ecosystem. | 700 – 1100 (similar to FLAC) | Apple Music lossless audio, iTunes, iOS/macOS ecosystem. |
| SBC | Simple, low power consumption, default encoding for Bluetooth A2DP, average sound quality. | 192 – 320 (commonly used 256) | Bluetooth headphones, Bluetooth speakers. |
| LC3 | Inherits from SBC, used for Bluetooth LE Audio; low power consumption, better sound quality than SBC, low latency. | 16 – 160 (commonly 96–128) | Bluetooth LE Audio (TWS earbuds, hearing aids), IoT audio. |
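The bitrates in the table are easier to compare against the uncompressed PCM rate, which follows directly from sample rate, bit depth, and channel count. The sketch below shows that arithmetic; the function name is hypothetical. G.711's fixed 64 kbps corresponds to its usual telephony configuration of 8 kHz mono with 8 bits per encoded sample.

```python
def bitrate_kbps(sample_rate: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bitrate in kbps for a stream with a fixed number of bits per sample."""
    return sample_rate * bits_per_sample * channels / 1000


cd_pcm = bitrate_kbps(44100, 16, 2)   # uncompressed CD audio
g711 = bitrate_kbps(8000, 8)          # G.711 at 8 kHz mono
print(cd_pcm, g711)                   # 1411.2 64.0
print(round(cd_pcm / 128, 1))         # ratio of CD PCM to a 128 kbps lossy stream
```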
SoC Compatibility:
The table below shows the support status of ESP-Audio-Codec on various Espressif chips. “✔” indicates support, “✘” indicates no support.
| Chip | v2.0.0 |
|---|---|
| ESP32 | ✔ |
| ESP32-S2 | ✔ |
| ESP32-C3 | ✔ |
| ESP32-C6 | ✔ |
| ESP32-S3 | ✔ |
| ESP32-P4 | ✔ |
| ESP32-C2 | ✔ |
| ESP32-C5 | ✔ |
| ESP32-H4 | ✔ |
| ESP32-H2 | ✘ |
Usage:
Encoder Example
For detailed usage, please refer to: audio_encoder_test.c
If you need to use a custom encoder, please follow the steps below:
Implement the custom encoder interface, for details, see: struct esp_audio_enc_ops_t
Define a custom audio encoder type in the enumeration esp_audio_type_t. The value must lie between ESP_AUDIO_TYPE_CUSTOMIZED and ESP_AUDIO_TYPE_CUSTOMIZED_MAX; for details, see: enum esp_audio_type_t
If you want to override the default encoder, there is no need to define a custom audio encoder type; you can directly use the existing encoder type
Register the custom encoder, for details, see: esp_audio_enc_register()
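The registration rule in the steps above (a new type must fall inside the custom range, while overriding a default reuses the existing type) can be modeled conceptually as follows. This is a Python model of the idea only, not the C API: the class, the numeric range, and the built-in type ids are all hypothetical placeholders, and treating the range as inclusive is an assumption.

```python
CUSTOM_MIN = 0x1000   # hypothetical stand-in for ESP_AUDIO_TYPE_CUSTOMIZED
CUSTOM_MAX = 0x1FFF   # hypothetical stand-in for ESP_AUDIO_TYPE_CUSTOMIZED_MAX
BUILTIN_TYPES = {"AAC": 1, "OPUS": 2}   # hypothetical built-in type ids


class EncoderRegistry:
    """Maps an audio type id to the operations object that implements it."""

    def __init__(self):
        self._ops = {}

    def register(self, audio_type: int, ops) -> None:
        is_builtin = audio_type in BUILTIN_TYPES.values()
        is_custom = CUSTOM_MIN <= audio_type <= CUSTOM_MAX
        if not (is_builtin or is_custom):
            raise ValueError("type must be built-in or inside the custom range")
        # A later registration for the same type replaces the earlier one.
        self._ops[audio_type] = ops

    def lookup(self, audio_type: int):
        return self._ops[audio_type]


registry = EncoderRegistry()
registry.register(BUILTIN_TYPES["AAC"], "default AAC encoder")
registry.register(BUILTIN_TYPES["AAC"], "my AAC encoder")   # override, same type id
registry.register(CUSTOM_MIN + 1, "my new codec")           # new custom type
print(registry.lookup(BUILTIN_TYPES["AAC"]))                 # my AAC encoder
```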
Decoder Example
For detailed usage, please refer to: audio_decoder_test.c
If you need to use a custom decoder, please follow the steps below:
Implement the custom decoder interface, for details, see: struct esp_audio_dec_ops_t
Define a custom audio decoder type in the enumeration esp_audio_type_t. The value must lie between ESP_AUDIO_TYPE_CUSTOMIZED and ESP_AUDIO_TYPE_CUSTOMIZED_MAX; see: Enumeration esp_audio_type_t
If you want to override the default decoder, there is no need to define a custom audio decoder type; you can directly use the existing decoder type
Register the custom decoder, see: esp_audio_dec_register()
Simple Decoder Usage Example
For detailed usage, please refer to: simple_decoder_test.c
If you need to use a custom simple decoder, please follow the steps below:
Implement the custom simple decoder interface. For the interface layout, see: struct esp_audio_simple_dec_reg_info_t
Define a custom simple decoder type in the enumeration esp_audio_simple_dec_type_t. The value must lie between ESP_AUDIO_SIMPLE_DEC_TYPE_CUSTOM and ESP_AUDIO_SIMPLE_DEC_TYPE_CUSTOM_MAX. For details, see: enum esp_audio_simple_dec_type_t
Note: if you want to override a default simple decoder, you do not need to define a custom simple decoder type; use the existing simple decoder type directly.
Register the custom simple decoder. For details, see: esp_audio_simple_dec_register()
Related Links:
Component Registry: esp_audio_codec component
ESP-Media-Protocols Component
Overview:
Multimedia protocols are a collection of various communication protocols, widely used in scenarios such as streaming media transmission, device control, and device interconnection communication. ESP-Media-Protocols is a multimedia protocol library launched by Espressif, providing support for basic and mainstream multimedia protocols.
| Protocol | Layer | Function |
|---|---|---|
| RTP/RTCP | Transport Layer | Real-time transmission of audio and video streams, with feedback on transmission quality |
| RTSP | Application Layer | As a server, serves streams to clients that pull them; as a client, supports pulling and pushing streams |
| SIP | Application Layer | Session endpoint; supports registering with a SIP server and initiating and receiving sessions |
| RTMP | Application Layer | As a server, serves streams to clients that pull them and accepts pushed streams; as a client, supports pulling and pushing streams |
| MRM | / | Synchronized music playback across multiple devices in a master-slave arrangement |
| UPnP | / | Device interconnection, media and service sharing |
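To make the RTP row of the table above more concrete, here is a minimal sketch of parsing the 12-byte fixed RTP header defined in RFC 3550. It is independent of ESP-Media-Protocols: the type `demo_rtp_header_t` and the function `demo_rtp_parse_header()` are hypothetical names introduced for illustration and are not part of the library's API.

```c
#include <stddef.h>
#include <stdint.h>

/* Fields of the fixed RTP header (RFC 3550, section 5.1). */
typedef struct {
    uint8_t  version;       /* 2 bits, must be 2 */
    uint8_t  padding;       /* 1 bit */
    uint8_t  extension;     /* 1 bit */
    uint8_t  csrc_count;    /* 4 bits */
    uint8_t  marker;        /* 1 bit */
    uint8_t  payload_type;  /* 7 bits */
    uint16_t sequence;      /* increments by one per packet */
    uint32_t timestamp;     /* sampling instant, in media clock units */
    uint32_t ssrc;          /* synchronization source identifier */
} demo_rtp_header_t;

/* Read a big-endian 32-bit value from a byte buffer. */
static uint32_t demo_read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Parse the fixed header. Returns 0 on success, -1 if the buffer is too
 * short or the version field is not 2. CSRC entries and header extensions
 * are not parsed here. */
int demo_rtp_parse_header(const uint8_t *buf, size_t len, demo_rtp_header_t *h)
{
    if (buf == NULL || h == NULL || len < 12) {
        return -1;
    }
    h->version      = (uint8_t)(buf[0] >> 6);
    h->padding      = (uint8_t)((buf[0] >> 5) & 0x01);
    h->extension    = (uint8_t)((buf[0] >> 4) & 0x01);
    h->csrc_count   = (uint8_t)(buf[0] & 0x0F);
    h->marker       = (uint8_t)(buf[1] >> 7);
    h->payload_type = (uint8_t)(buf[1] & 0x7F);
    h->sequence     = (uint16_t)(((uint16_t)buf[2] << 8) | buf[3]);
    h->timestamp    = demo_read_be32(&buf[4]);
    h->ssrc         = demo_read_be32(&buf[8]);
    return (h->version == 2) ? 0 : -1;
}
```

A receiver can use the sequence number to detect loss and reordering and the timestamp to schedule playout; RTCP reports then carry the resulting quality statistics back to the sender.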
How to use:
The ESP-Media-Protocols component is hosted on GitHub. You can add it to your project by running the following command in the project directory:
idf.py add-dependency "jimforr/esp_media_protocols"
Before using the ESP-Media-Protocols component, we recommend reviewing and debugging the following example projects to become familiar with the API and with how each protocol stack is applied.
Performance:
| Protocol | Real-time | Data Stream | Control Stream | Device Discovery | TLS Encryption | Complexity |
|---|---|---|---|---|---|---|
| RTSP | High | Yes | Yes | Manual | No | Medium |
| SIP | High | Yes | Yes | Manual | Yes | Medium |
| RTMP | Medium | Yes | Basic | Manual | Yes | Medium |
| MRM | High | Yes | Yes | Automatic | No | Low |
| UPnP | Low | Yes | Yes | Automatic | No | Medium |
Real-time
Low latency: control or command data, about 20 ms.
Low latency: audio, video, or other media streams, about 300 ms.
Medium latency: RTMP-based live streams, about 2 seconds.
Security
TLS (optional)
MD5 digest authentication (mandatory for SIP)
Scalability
Customizable protocol header and body
Supports subscription and notification; services can be registered
Concurrency
Supports multiple client connections (RTMP)
Compatibility
SIP supports linphone, Asterisk FreePBX, Freeswitch, Kamailio
RTSP supports ffmpeg, vlc, live555, mediamtx
RTMP supports ffmpeg, vlc
UPnP supports NetEase Cloud Music
Media Support
Please refer to README
Memory Consumption Data
Please refer to README
FAQ:
Q: Does ESP-Media-Protocols support all protocols and features?
A: ESP-Media-Protocols currently supports the basic protocols and features that are widely used in embedded applications. Some unsupported protocols, such as SRTP and HLS, are available in other components or repositories. The set of supported protocols will continue to be iterated and expanded, taking customer needs into account, and support for newer, more capable protocols is planned.
Q: Some protocol features overlap. How do I choose between them?
A: Base the choice on the application scenario by analyzing the functional requirements, latency requirements, and network environment. For example, if real-time requirements are high and playback control (pause, fast forward, rewind, seek) is needed, RTSP is usually used; if real-time requirements are high and real-time interaction is needed, SIP can be used to create a session; for a large-scale live broadcast in the browser that demands stability and compatibility but not low latency, RTMP can be considered.
For more related questions, please refer to the Issues section in the following protocol directory: