Multimedia Technology Wiki: Component Description


Note

This document is automatically translated using AI. Please excuse any detailed errors. The official English version is still in progress.

ESP-New-JPEG Component

Note

For basic knowledge about JPEG, please refer to JPEG

Overview:

ESP-New-JPEG is a lightweight JPEG encoding and decoding library launched by Espressif Systems. To improve efficiency, the JPEG encoder and decoder have been deeply optimized to reduce memory consumption and enhance processing performance. For the ESP32-S3 chip that supports SIMD instructions, these instructions are used to further improve processing speed. In addition, rotation, cropping, and scaling functions have been extended, which can be performed simultaneously during the encoding and decoding process, thereby simplifying user operations. For chips with smaller memory, a block mode has been introduced to support processing part of the image content multiple times, effectively reducing memory pressure.

ESP-New-JPEG supports JPEG encoding and decoding of the Baseline Profile. The rotation, cropping, scaling, and block mode functions can only take effect under specific configurations.

JPEG Encoder Features:

The basic features supported by the encoder are as follows:

  • Supports encoding of images of any width and height

  • Supports the following pixel formats: RGB888, RGBA, YCbYCr, YCbY2YCrY2, GRAY

    • When using the YCbY2YCrY2 format, only YUV420 and Gray subsampling are supported

  • Supports YUV444, YUV422, YUV420, Gray subsampling

  • Supports quality setting range: 1-100

The extended features are as follows:

  • Supports 0°, 90°, 180°, 270° clockwise rotation

  • Supports dual-task encoding

  • Supports block mode encoding

Dual-task encoding can be used on dual-core chips to take full advantage of parallel encoding across both cores. One core runs the main encoding task while the other handles the entropy encoding part of the work. In most cases, enabling dual-task encoding improves performance by about 1.5 times. You can enable or disable dual-task encoding through menuconfig, and adjust the core and priority of the entropy encoding task there.

Block encoding encodes one image block at a time, and the complete image is produced after all blocks have been processed. With YUV420 subsampling, each block is 16 rows high and spans the full image width; with other subsampling formats, each block is 8 rows high and spans the full image width. Because each pass handles only a small amount of data, the image buffer can be placed in DRAM, which improves encoding speed. The workflow of block encoding is shown in the following figure:

The configuration requirements for extended features are as follows:

JPEG Decoder Features:

The basic features supported by the decoder are as follows:

  • Supports decoding of any width and height

  • Supports single-channel and three-channel decoding

  • Supports the following pixel format outputs: RGB888, RGB565 (big-endian), RGB565 (little-endian), CbYCrY

The extended features are as follows:

  • Supports scaling (maximum reduction ratio is 1/8)

  • Supports cropping (cropping from the upper left corner)

  • Supports 0°, 90°, 180°, 270° clockwise rotation

  • Supports block mode decoding

When using the scaling and cropping functions, you need to configure the corresponding jpeg_resolution_t parameters. The component can handle width and height separately. For example, to crop only the width and keep the height unchanged, set clipper.height = 0; the output height then remains the height of the original JPEG image.

The process of scaling, cropping, and rotating is sequential, as shown in the diagram below. The decoded JPEG data stream is first scaled, then cropped, and finally rotated and output.
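The ordering described above can be sketched as a small size calculation. This is only an illustration: the `Resolution` class is a hypothetical stand-in for jpeg_resolution_t, and treating a zero field in the scale setting as "unchanged" is an assumption (the text only states this for clipper.height).

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    """Hypothetical stand-in for jpeg_resolution_t (width/height in pixels)."""
    width: int = 0
    height: int = 0


def output_size(src: Resolution, scale: Resolution, clipper: Resolution, rotation: int) -> Resolution:
    """Follow the documented order: scale, then crop, then rotate."""
    if rotation not in (0, 90, 180, 270):
        raise ValueError("rotation must be 0, 90, 180 or 270 degrees")

    # Stage 1: scaling. Assumption: a zero field leaves that dimension alone.
    width = scale.width or src.width
    height = scale.height or src.height

    # Stage 2: cropping from the upper-left corner; 0 keeps the dimension as is.
    width = clipper.width or width
    height = clipper.height or height

    # Stage 3: rotation. 90 and 270 degrees swap width and height.
    if rotation in (90, 270):
        width, height = height, width
    return Resolution(width, height)
```

For instance, cropping a 640×480 image to a width of 320 with clipper.height = 0 and rotating by 90° would give a 480×320 output under these assumptions.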

Block decoding decodes one image block at a time, and the entire image is decoded after all blocks have been processed. With YUV420 subsampling, each block is 16 lines high and spans the full image width; with other subsampling formats, each block is 8 lines high and spans the full image width. Because each pass handles only a small amount of data, block decoding suits chips without PSRAM, and placing the output image buffer in DRAM can also improve decoding speed. Block decoding can be seen as the reverse of block encoding.
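To get a feel for the memory saving, the block geometry described above can be turned into a rough calculation. The helper names below are hypothetical, and rounding the block count up for heights that are not a multiple of the block height is an assumption rather than documented behavior.

```python
import math


def block_height(subsampling: str) -> int:
    """Rows per block: 16 for YUV420, 8 for every other subsampling."""
    return 16 if subsampling.upper() == "YUV420" else 8


def block_plan(width: int, height: int, subsampling: str, bytes_per_pixel: int):
    """Return (number of blocks, bytes needed for one block buffer)."""
    rows = block_height(subsampling)
    # Assumption: a partial last block still needs one full pass.
    blocks = math.ceil(height / rows)
    return blocks, width * rows * bytes_per_pixel
```

Under these assumptions, a 320×240 YUV420 image output as RGB565 (2 bytes per pixel) needs 15 passes with a 10240-byte block buffer, compared with 153600 bytes for the whole frame.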

The configuration requirements for extended functions are as follows:

  • When block decoding is enabled, other extended functions cannot be used

  • The width and height in the configuration parameters of scaling, cropping, and rotating are required to be multiples of 8

  • When scaling and cropping are enabled at the same time, the size of the crop is required to be smaller than the size after scaling
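The three rules above lend themselves to a pre-flight check before a decoder is opened. The following is a minimal sketch with a hypothetical function name and argument layout; it is not part of the component's API, and it reads "smaller" as "not larger in either dimension".

```python
def check_decoder_config(scale=None, clipper=None, rotation=0, block_mode=False):
    """Return a list of rule violations for a decoder configuration.

    scale and clipper are (width, height) tuples, or None when the feature is off.
    """
    problems = []
    extended = scale is not None or clipper is not None or rotation != 0
    if block_mode and extended:
        problems.append("block decoding cannot be combined with other extended functions")
    for name, size in (("scale", scale), ("clipper", clipper)):
        if size is not None and any(v % 8 for v in size):
            problems.append(f"{name} width and height must be multiples of 8")
    if scale is not None and clipper is not None:
        # Assumption: "smaller" is read as "not larger in either dimension".
        if clipper[0] > scale[0] or clipper[1] > scale[1]:
            problems.append("crop size must not exceed the scaled size")
    return problems
```

An empty list means the configuration passes these checks; each entry otherwise names the rule that was broken.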

Usage:

The ESP-New-JPEG component is hosted on Github. You can add this component to your project by entering the following command in the project.

idf.py add-dependency "espressif/esp_new_jpeg"

The test_app folder under the esp_new_jpeg folder contains a runnable test project, which shows the related API call process. Before using the ESP-New-JPEG component, it is recommended to refer to and debug this test project to familiarize yourself with the use of the API.

Performance:

ESP-New-JPEG has deeply optimized the JPEG encoding and decoding architecture:

  • Optimized data processing flow, improved the reuse efficiency of intermediate data, and reduced memory copy overhead.

  • Assembly-level optimization for Xtensa architecture chips; significantly improved computing performance on ESP32-S3 chips that support SIMD instructions.

  • Integrated various image operations such as cropping and rotating into the encoder and decoder to improve overall system efficiency

Please refer to Performance for codec performance test data.

FAQ:

Q: Does ESP-New-JPEG support decoding progressive JPEG?

A: No, ESP-New-JPEG only supports decoding baseline JPEG. You can use the following code to check whether the image is a progressive JPEG. Output 1 indicates progressive JPEG, and output 0 indicates baseline JPEG.

python
>>> from PIL import Image
>>> Image.open("file_name.jpg").info.get('progressive', 0)
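If Pillow is not available, the same check can be approximated with the standard library by looking for the first start-of-frame marker: SOF0 (0xFFC0) marks baseline DCT and SOF2 (0xFFC2) marks progressive DCT. The function name below is hypothetical, and the sketch assumes a well-formed header without standalone markers before the frame header.

```python
import struct

SOF_BASELINE = 0xC0     # SOF0: baseline DCT
SOF_PROGRESSIVE = 0xC2  # SOF2: progressive DCT


def jpeg_frame_type(data: bytes) -> str:
    """Return "baseline", "progressive" or "other" by locating the first SOF marker."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:          # fill byte before a marker
            pos += 1
            continue
        if marker == SOF_BASELINE:
            return "baseline"
        if marker == SOF_PROGRESSIVE:
            return "progressive"
        if marker == 0xDA:          # start of scan reached without SOF0/SOF2
            break
        # Other header segments carry a 2-byte big-endian length that includes itself.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length
    return "other"
```

A typical call would be `jpeg_frame_type(open("file_name.jpg", "rb").read())`.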

Q: Why does the output image look misaligned?

A: The symptom is usually that some columns from the left or right edge of the image appear on the opposite side. If you are using ESP32-S3, a likely cause is that the output buffer of the decoder or the input buffer of the encoder is not aligned to 16 bytes. Please use the jpeg_calloc_align() function to allocate the buffer.

Q: How to preview the raw data of the image, such as viewing RGB888 data?

A: You can use yuvplayer. It supports viewing grayscale, RGB888, RGB565 (little-endian), UYVY, YUYV, YUV420P, etc.
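As an alternative for quick checks, a raw dump can be wrapped in a PPM header so that common image viewers open it directly. The sketch below uses only the standard library; the function names are hypothetical, and the RGB565 helper assumes little-endian pixel order.

```python
def rgb565le_to_rgb888(raw: bytes) -> bytes:
    """Expand little-endian RGB565 pixels to 8 bits per channel."""
    out = bytearray()
    for i in range(0, len(raw) - 1, 2):
        pixel = raw[i] | (raw[i + 1] << 8)
        r, g, b = (pixel >> 11) & 0x1F, (pixel >> 5) & 0x3F, pixel & 0x1F
        # Replicate the high bits into the low bits so full scale maps to 255.
        out += bytes(((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2)))
    return bytes(out)


def rgb888_to_ppm(raw: bytes, width: int, height: int) -> bytes:
    """Wrap a raw RGB888 buffer in a binary PPM (P6) header."""
    if len(raw) != width * height * 3:
        raise ValueError("buffer size does not match width * height * 3")
    return b"P6\n%d %d\n255\n" % (width, height) + raw
```

Writing the returned bytes to a file with a .ppm extension is enough to view the image; the width and height must be supplied by the caller because raw dumps carry no size information.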

Related Links:

GMF-AI-Audio Component

Overview:

GMF-AI-Audio is a voice interaction component developed based on the GMF framework. By encapsulating ESP-SR, it provides a complete interaction logic from voice wake-up to command recognition. The component integrates functions such as Wake Word detection, Voice Activity Detection (VAD), voice command recognition, and Acoustic Echo Cancellation (AEC), enabling efficient and natural voice interaction experiences in smart speakers, smart home devices, etc.

Supported Scenarios:

| Method | Corresponding Scenario |
| --- | --- |
| Immediately upload voice data after wake-up, stop uploading at the Wakeup End stage | Implement VAD function in the cloud, RTC scenario |
| Wait for VAD to trigger after wake-up and start uploading, stop uploading after VAD ends | Traditional interaction method of smart hardware |
| No wake-up, wait for VAD to trigger and start uploading, stop uploading after VAD ends | New cloud processing logic |
| Immediately upload voice data after pressing the button, stop after releasing | Devices with limited computing power implement voice functions through interaction with the cloud |
| Wait for VAD to trigger after pressing the button and start uploading, stop uploading after VAD ends | Solve the problem of excessive data volume caused by relying solely on VAD |
| Detect command words after wake-up | Default usage logic |
| No wake-up, wait for VAD to trigger and detect command words | Can be applied to some vehicle systems |
| Detect command words after pressing the button | Toys |
| Continuous command word recognition | Home control |

Related Links:

ESP-H264 Component

Overview:

ESP-H264 is a lightweight H.264 encoder and decoder component developed by Espressif Systems, offering both hardware and software implementations. The hardware encoder is designed specifically for the ESP32-P4 chip, capable of achieving 1080P@30fps. The software encoder is based on openh264, and the decoder is based on tinyH264. Both are optimized for memory and CPU usage, ensuring optimal performance on Espressif chips.

Features:

  • Encoder Features

    • Hardware Encoder (ESP32-P4):

      • Supports Baseline Profile (maximum frame size 36864 macroblocks)

      • Supports width range [80, 1088] pixels, height range [80, 2048] pixels

      • Supports quality-priority bitrate control

      • Supports YUV420 raw data format

      • Supports dynamic adjustment of bitrate, frame rate, GOP, QP, etc.

      • Supports single-stream and dual-stream encoders

      • Supports deblocking filter, ROI, motion vector functions

      • Supports SPS and PPS encoding

    • Software Encoder:

      • Supports Baseline Profile (maximum frame size 36864 macroblocks)

      • Supports any resolution greater than 16 pixels in width and height

      • Supports quality-priority bitrate control

      • Supports YUYV and IYUV raw data formats

      • Supports dynamic adjustment of bitrate and frame rate

      • Supports SPS and PPS encoding

  • Decoder Features

    • Supports Baseline Profile (maximum frame size 36864 macroblocks)

    • Supports various widths and heights

    • Supports Long Term Reference (LTR) frames

    • Supports Memory Management Control Operations (MMCO)

    • Supports modification of reference image lists

    • Supports multiple reference frames specified in Sequence Parameter Set (SPS)

    • Supports IYUV output format
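The macroblock limit quoted in the feature lists can be related to a resolution with a little arithmetic, since H.264 macroblocks cover 16×16 luma samples. The sketch below is illustrative only; the helper names are hypothetical, and the buffer size is the raw IYUV (I420) frame size, not the component's total memory use.

```python
import math

MAX_MACROBLOCKS = 36864  # frame-size limit quoted in the feature list
MACROBLOCK_SIZE = 16     # H.264 macroblocks cover 16x16 luma samples


def macroblocks_per_frame(width: int, height: int) -> int:
    """Count macroblocks, rounding partial macroblocks up in each direction."""
    return math.ceil(width / MACROBLOCK_SIZE) * math.ceil(height / MACROBLOCK_SIZE)


def iyuv_frame_bytes(width: int, height: int) -> int:
    """IYUV (I420) stores a full-size Y plane plus quarter-size U and V planes."""
    return width * height * 3 // 2
```

A 1920×1080 frame works out to 8160 macroblocks, well below the 36864 limit, and one raw IYUV frame at that size occupies 3110400 bytes.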

Performance:

  • Encoding Performance: the hardware encoder is recommended on ESP32-P4; ESP32-S3 and other boards use the software encoder

    • Hardware Encoder (only for ESP32-P4):

      • Better performance and power consumption, supports up to 1080P@30fps

      • Supports single-stream/dual-stream encoding

      • Supports dynamic adjustment of bitrate, frame rate, GOP, QP, etc.

      • Supports advanced features such as deblocking filter, ROI, motion vector, etc.

    • Software Encoder (all platforms):

      • Limited performance and power consumption, but no resolution limit

      • Supports YUYV and IYUV formats, richer color formats

      • Supports all Espressif chip platforms, more board choices

      • Based on OpenH264 open source project

Encoding Performance Comparison

| Platform | Type | Maximum Resolution | Maximum Performance | Remarks |
| --- | --- | --- | --- | --- |
| ESP32-S3 | Software Encoder | Any | 320×240@11fps | |
| ESP32-P4 | Hardware Encoder | ≤1080P | 1920×1080@30fps | Hardware Acceleration |

  • Decoding Performance: the software decoder is recommended on all boards

    • Software Decoder (all platforms):

      • Limited performance and power consumption, but no resolution limit

      • Supports IYUV output format

      • Supports advanced features such as long-term reference frames, memory management control, etc.

      • Based on TinyH264 open source project

Decoding Performance Comparison

| Platform | Type | Maximum Resolution | Maximum Performance |
| --- | --- | --- | --- |
| ESP32-S3 | Software Decoder | Any | 320×192@27fps |
| ESP32-P4 | Software Decoder | Any | 1280×720@10fps |

Warning

Memory consumption strongly depends on the resolution and encoding data of the H.264 stream. It is recommended to adjust the memory allocation according to the actual application scenario.

Tip

Using a dual-task decoder can significantly improve decoding performance, especially in high-resolution video processing.

Component Links:

Related Resources:

ESP-Image-Effects Component

Overview:

ESP-Image-Effects is an image processing engine developed by Espressif Systems, integrating basic functions such as rotation, color space conversion, scaling, and cropping. As one of the core components of Espressif’s audio and video development platform, the ESP-Image-Effects module has deeply restructured the underlying algorithms, combined with efficient memory management and hardware acceleration, achieving high performance, low power consumption, and low memory occupancy. In addition, each image processing function adopts a consistent API architecture design, reducing the learning cost for users and facilitating rapid development. This engine is widely used in the Internet of Things, smart cameras, industrial vision, and other fields.

Features:

  • Image Color Conversion

    • Supports any input resolution

    • Supports bypass mode for the same input/output format

    • Supports BT.601/BT.709/BT.2020 color space standards

    • Supports fast color conversion algorithms for format and resolution

    • Comprehensive format support matrix:

    Color Conversion Format Support

    | Input Format | Supported Output Formats |
    | --- | --- |
    | RGB/BGR565_LE/BE RGB/BGR888 | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 YUV_PLANAR/PACKET YUYV/UYVY O_UYY_E_VYY/I420 |
    | ARGB/BGR888 | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 YUV_PLANAR O_UYY_E_VYY/I420 |
    | YUV_PACKET/UYVY/YUYV | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 O_UYY_E_VYY/I420 |
    | O_UYY_E_VYY/I420 | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 O_UYY_E_VYY |

  • Image Rotation

    • Supports bypass mode

    • Supports any input resolution

    • Supports clockwise rotation at any angle

    • Supports ESP_IMG_PIXEL_FMT_Y/RGB565/BGR565/RGB888/BGR888/YUV_PACKET formats

    • Supports fast clockwise rotation algorithms for specific angles, formats, and resolutions

  • Image Scaling

    • Supports bypass mode

    • Supports any input resolution

    • Supports up-sampling and down-sampling operations

    • Supports ESP_IMG_PIXEL_FMT_RGB565/BGR565/RGB888/BGR888/YUV_PACKET formats

    • Supports various filtering algorithms: optimized down-sampling and bilinear interpolation

  • Image Cropping

    • Supports bypass mode

    • Supports any input resolution

    • Supports up-sampling and down-sampling operations

    • Supports flexible area selection

    • Supports ESP_IMG_PIXEL_FMT_Y/RGB565/BGR565/RGB888/BGR888/YUV_PACKET formats
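As an illustration of what a BT.601 color conversion involves, the sketch below converts one 8-bit RGB pixel to YCbCr using the full-range BT.601 equations (the variant used by JFIF). It is not the component's implementation: the function name is hypothetical, and the coefficients, value range, and rounding that ESP-Image-Effects actually uses may differ.

```python
def rgb_to_ycbcr_bt601(r: int, g: int, b: int):
    """Full-range BT.601 conversion (the JFIF variant) for one 8-bit pixel."""
    def clamp(x: float) -> int:
        return max(0, min(255, round(x)))

    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return clamp(y), clamp(cb), clamp(cr)
```

Gray pixels map to Cb = Cr = 128, which is a quick sanity check when validating any conversion path against reference data.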

Performance:

The ESP-Image-Effects component has completed performance testing under 1080P. For specific performance data, please refer to the ESP32-P4 Performance Document. This component uses efficient memory management and hardware acceleration technology to achieve high performance, low power consumption, and low memory occupancy.

Related Links:

ESP-Audio-Effects Component

Overview:

ESP-Audio-Effects is a powerful and flexible audio processing library, designed to provide developers with efficient audio effect processing capabilities. This component is widely used in various smart audio devices, including smart speakers, headphones, audio playback devices, and voice interaction systems.

Features:

  • Automatic Level Control: Automatically adjusts input gain to stabilize audio volume. Progressive adjustment ensures smooth transition. Dynamic correction of over-amplification to avoid clipping distortion.

  • Equalizer: Provides fine control over filter type, frequency, gain, and Q factor. Suitable for audio tuning and professional signal shaping.

  • Fade In/Out: Implements fade in and fade out effects, ensuring smooth transitions between tracks.

  • Speed and Pitch Processing: Supports real-time speed and pitch modification, achieving more dynamic playback effects.

  • Mixer: Merges multiple input streams into one output, with start/target weights and transition time configurable for each input.

  • Data Interleaver: Handles interleaving and de-interleaving of audio data buffers.

  • Sample Rate Conversion: Performs sample rate conversion between multiples of 4000 and 11025.

  • Channel Conversion: Remaps audio channel layout using weight array.

  • Bit Depth Conversion: Supports conversion between U8, S16, S24, and S32 bit depths.

The table below lists the supported sample rates, channel numbers, and sample bit depths for each module. If users want to know detailed introduction, performance, examples, and other information about each module, they can click the README link in the Module column.

| Module | Sample Rate | Channel Number | Sample Bit Depth | Data Layout |
| --- | --- | --- | --- | --- |
| Automatic Level Control | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| Equalizer | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| Fade In/Out | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| Sonic Pitch Processing | 4 to 192 kHz, and integer multiples of 4000 or 11025 | Full range | s16, s24, s32 | Interleaved |
| Mixer | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| Data Weaver | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| Sample Rate Conversion | 4 to 192 kHz, and integer multiples of 4000 or 11025 | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| Channel Conversion | Full range | Full range | s16, s24, s32 | Interleaved and deinterleaved |
| Bit Depth Conversion | Full range | Full range | u8, s16, s24, s32 | Interleaved and deinterleaved |
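The sample-rate constraint listed for the pitch and sample-rate-conversion modules can be expressed as a simple predicate. This is a sketch under one reading of the table, namely that a rate must lie between 4000 Hz and 192000 Hz and also be an integer multiple of 4000 or 11025; the function name is hypothetical.

```python
def is_supported_rate(rate_hz: int) -> bool:
    """Check the constraint listed for the pitch and sample-rate-conversion modules."""
    in_range = 4000 <= rate_hz <= 192000
    on_grid = rate_hz % 4000 == 0 or rate_hz % 11025 == 0
    return in_range and on_grid
```

Common rates such as 8000, 22050, 44100, and 48000 Hz pass this check, while a rate such as 47999 Hz does not.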

Data Layout:

ESP-Audio-Effects supports both interleaved and deinterleaved audio formats:

  1. Interleaved format: Use the esp_ae_xxx_process() API to process this layout. For example:

L0 R0 L1 R1 L2 R2 ...

Where L and R represent left and right channel samples respectively.

  2. Deinterleaved format: Use the esp_ae_xxx_deintlv_process() API. Each channel is stored in a separate buffer:

L0, L1, L2, ...  // Left channel
R0, R1, R2, ...  // Right channel
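The relationship between the two layouts can be shown with plain lists. The helpers below are a minimal sketch with hypothetical names; they only illustrate the sample ordering and are unrelated to the component's C API.

```python
def deinterleave(samples, channels):
    """Split an interleaved sequence into one list per channel."""
    return [list(samples[ch::channels]) for ch in range(channels)]


def interleave(buffers):
    """Merge per-channel lists of equal length into one interleaved list."""
    return [sample for frame in zip(*buffers) for sample in frame]
```

Applying `interleave` to the result of `deinterleave` returns the original sequence, which is a convenient round-trip check when moving data between the two API variants.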

API Style:

ESP-Audio-Effects provides a consistent and developer-friendly API:

| Category | Function | Description |
| --- | --- | --- |
| Initialization | esp_ae_xxx_open() | Create an audio effect handle. |
| Interleaved Processing | esp_ae_xxx_process() | Process interleaved audio data. |
| Deinterleaved Processing | esp_ae_xxx_deintlv_process() | Process deinterleaved audio data. |
| Set Parameters | esp_ae_xxx_set_xxx() | Set component-specific parameters. |
| Get Parameters | esp_ae_xxx_get_xxx() | Get current parameters. |
| Release | esp_ae_xxx_close() | Release resources and destroy the handle. |

Related links:

ESP-Audio-Codec Component

Overview:

ESP-Audio-Codec is an audio encoding and decoding processing module developed by Espressif for SoC platforms. It provides a standardized encoding and decoding interface framework, making it easy for users to flexibly expand and combine different audio formats. This module mainly includes three parts: ESP Audio Encoder, ESP Audio Decoder, and Simple Decoder.

  • ESP Audio Encoder provides a unified encoder interface, supporting the registration of various encoders (such as AAC, AMR-NB, AMR-WB, ADPCM, G711A, G711U, PCM, OPUS, ALAC, etc.). Users can create one or more encoder instances based on the interface to achieve multi-channel simultaneous encoding. They can also directly call the API of the specified encoder to reduce the call level.

  • ESP Audio Decoder provides a unified decoder interface, supporting the registration of various decoders (such as AAC, MP3, AMR-NB, AMR-WB, ADPCM, G711A, G711U, VORBIS, OPUS, ALAC, etc.). Users can create one or more decoder instances through the interface to achieve multi-channel simultaneous decoding. They can also directly call the API of the specified decoder to reduce the call level. ESP Audio Decoder only processes complete audio frames (i.e., the input data must be aligned to frame boundaries).

  • Simple Decoder aggregates and organizes audio frames through the parser, and then calls ESP Audio Decoder for decoding, simplifying the parsing and positioning of audio frames. Users can input data of any length. The audio containers supported by this decoder include AAC, MP3, WAV, FLAC, AMRNB, AMRWB, M4A, etc.
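To illustrate the difference between frame-aligned input and input of any length, the sketch below accumulates arbitrary chunks of an AAC ADTS stream and emits only complete frames, which is the kind of work a parser does before handing data to a frame-based decoder. It relies on the standard ADTS header layout (12-bit syncword, 13-bit frame length in bytes 3 to 5); the class name is hypothetical and this is not the Simple Decoder's actual implementation.

```python
class AdtsFramer:
    """Collect arbitrary-sized chunks and emit complete ADTS frames."""

    HEADER_SIZE = 7  # fixed ADTS header without CRC

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, chunk: bytes):
        """Append a chunk and return the frames completed by it."""
        self._buffer += chunk
        frames = []
        while len(self._buffer) >= self.HEADER_SIZE:
            buf = self._buffer
            # The ADTS syncword is 12 set bits at the start of every frame.
            if buf[0] != 0xFF or (buf[1] & 0xF0) != 0xF0:
                del buf[0]          # resynchronise one byte at a time
                continue
            # 13-bit frame length (header included) spread over bytes 3..5.
            length = ((buf[3] & 0x03) << 11) | (buf[4] << 3) | (buf[5] >> 5)
            if length < self.HEADER_SIZE:
                del buf[0]          # implausible header, keep searching
                continue
            if len(buf) < length:
                break               # wait for more data
            frames.append(bytes(buf[:length]))
            del buf[:length]
        return frames
```

Each returned frame starts at a syncword and has exactly the length announced in its header, so it could be passed on to a decoder that requires frame-aligned input.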

Main Features:

  • Easy-to-use interface: Provides a user-friendly interface for easy development and integration.

  • High performance and lightweight: The module is optimized for high performance and low memory usage.

  • Dual-layer decoder API: ESP Audio Decoder can be used when the input data is aligned to frame boundaries; Simple Decoder accepts data of any length. The two APIs are similar, making it easy to switch.

  • Highly customizable: Through the registration interface, users can easily add custom decoders, encoders, or simple decoders, or override the default implementation without modifying the application code.

Functional Features:

  • ESP Audio Encoder: Provides a unified encoder interface, all encoders can be operated through a unified API (see esp_audio_enc.h). The module supports registering custom encoders or overriding the default implementation through esp_audio_enc_register(), or using esp_audio_enc_register_default() to register all supported encoders at once, and can be managed uniformly through menuconfig. The encoders supported by the module and their detailed parameters are as follows:

    • AAC:

      • Supports AAC-LC (Low Complexity) encoding

      • Sampling rate (Hz): 96000, 88200, 64000, 48000, 44100, 32000, 24000, 22050, 16000, 12000, 11025, 8000

      • Number of channels: mono, stereo

      • Bit depth: 16 bits

      • Fixed bit rate: 12 Kbps ~ 160 Kbps

      • Option to write ADTS header

    • AMR:

      • Supports Narrowband (NB) and Wideband (WB) encoding

      • AMRNB sampling rate: 8 kHz

      • AMRWB sampling rate: 16 kHz

      • Number of channels: mono

      • Bit depth: 16 bits

      • AMRNB bit rate (Kbps): 4.75, 5.15, 5.9, 6.7, 7.4, 7.95, 10.2, 12.2

      • AMRWB bit rate (Kbps): 6.6, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, 23.85

      • Supports DTX (Discontinuous Transmission)

    • ADPCM:

      • Supports all sampling rates

      • Number of channels: mono, stereo

      • Bit depth: 16 bits

    • G711:

      • Supports A-LAW and U-LAW

      • Supports all sampling rates

      • Supports all channel numbers

      • Bit depth: 16 bits

    • OPUS:

      • Sampling rate (Hz): 8000, 12000, 16000, 24000, 48000

      • Number of channels: mono, stereo

      • Bit depth: 16 bits

      • Fixed bit rate: 20 Kbps ~ 510 Kbps

      • Frame duration (ms): 2.5, 5, 10, 20, 40, 60, 80, 100, 120

      • Supports VoIP and music mode

      • Adjustable encoding complexity (0~10)

      • Supports FEC (Forward Error Correction), DTX (Discontinuous Transmission), VBR (Variable Bit Rate)

    • ALAC:

      • Sampling rate (Hz): [1000, 384000]

      • Number of channels: [1, 8]

      • Bit depth: 16, 24, 32 bits

    • SBC:

      • Sampling rate (Hz): 16000, 32000, 44100, 48000

      • Channel mode: mono, stereo, joint stereo

      • Bit depth: 16 bits

      • SBC mode: standard, mSBC

      • Block length: 4, 8, 12, 16

      • Number of subbands: 4, 8

      • Allocation method: loudness, SNR

      • bitpool range: 2~250

    • LC3:

      • Sampling rate (Hz): 8000, 16000, 24000, 32000, 44100, 48000

      • Supports all channels

      • Bit depth: 16, 24, 32 bits

      • Frame duration (ms): 7.5, 10

      • nbyte range: 20~400

      • Supports adding a 2-byte length prefix to each frame

    • PCM

  • ESP Audio Decoder: Provides a unified decoder interface, all decoders can be operated through a unified API (see esp_audio_dec.h). The module supports registering custom decoders or overriding the default implementation through esp_audio_dec_register(), or using esp_audio_dec_register_default() to register all supported decoders at once, and can be managed uniformly through menuconfig. The decoders supported by the module and their detailed parameters are as follows:

    • AAC:

      • Supports AAC-LC, AAC-Plus decoding

      • Configurable whether to enable AAC-Plus decoding, to reduce CPU and memory usage

      • Sampling rate (Hz): 96000, 88200, 64000, 48000, 44100, 32000, 24000, 22050, 16000, 12000, 11025, 8000

      • Number of channels: mono, stereo

      • Bit depth: 16 bits

      • Supports decoding audio data with or without ADTS headers

    • AMR:

      • Supports narrowband (NB) and wideband (WB) decoding

      • AMRNB sampling rate: 8 kHz

      • AMRWB sampling rate: 16 kHz

      • Number of channels: mono

      • Bit depth: 16 bits

    • ADPCM:

      • Supports all sampling rates

      • Number of channels: mono, stereo

      • Bit depth: 16 bits

      • Only supports IMA-ADPCM

    • G711:

      • Supports A-LAW and U-LAW

      • Supports all sampling rates

      • Supports all channel numbers

      • Bit depth: 16 bits

    • OPUS:

      • Sampling rate (Hz): 8000, 12000, 16000, 24000, 48000

      • Channel numbers: mono, stereo

      • Bit depth: 16 bits

      • Supports self-segmentation packet decoding

    • ALAC:

      • Sampling rate (Hz): 8000, 12000, 16000, 24000, 48000

      • Channel numbers: mono, stereo

      • Bit depth: 16 bits

    • FLAC:

      • Sampling rate (Hz): 96000, 48000, 44100, 32000, 24000, 22050, 16000, 12000, 11025, 8000

      • Channel numbers: [1, 8]

      • Bit depth: 16, 24, 32 bits

    • VORBIS:

      • Sampling rate (Hz): 48000, 44100, 32000, 24000, 22050, 16000, 12000, 11025, 8000

      • Channel numbers: mono, stereo

      • Bit depth: 16 bits

      • Only supports decoding of VORBIS frames; the OGG header must be removed first

      • User needs to provide general header information first

    • SBC:

      • Sampling rate (Hz): 16000, 32000, 44100, 48000

      • Channel numbers: mono, stereo

      • Bit depth: 16 bits

      • SBC mode: standard, mSBC

      • Supports packet loss concealment (PLC)

    • LC3:

      • Sampling rate (Hz): 8000, 16000, 24000, 32000, 44100, 48000

      • Supports all channels

      • Bit depth: 16, 24, 32 bits

      • Frame duration (ms): 7.5, 10

      • nbyte range: 20~400

      • Supports 2-byte length prefix frame data decoding

      • Supports packet loss concealment (PLC)

    • MP3

  • Simple Decoder

    • The simple decoder can be operated through the API, see esp_audio_simple_dec.h

    • Supports audio frame search and decoding of some audio containers

    • Supports general parser, users can add custom parsers according to the rules

    • Supports custom simple decoders to adapt to new file formats

    • Supports custom parser and decoder pairing: default parser can be used with custom decoder

    • Only supports streaming decoding, does not support seek

    • The supported audio containers and descriptions are as follows:

| Audio Container | Description |
| --- | --- |
| AAC | Supports AAC-Plus (configurable), parser can input data of any size |
| MP3 | Supports Layer 1, 2, 3, parser can input data of any size |
| AMRNB | Only supports files with AMRNB file header, parser can input data of any size |
| AMRWB | Only supports files with AMRWB file header, parser can input data of any size |
| FLAC | Only supports files with FLAC file header, parser can input data of any size |
| WAV | Supports G711A, G711U, PCM, ADPCM, parser can input data of any size |
| M4A | Supports MP3, AAC, ALAC, and only supports MDAT after MOOV, parser can input data of any size |
| TS | Supports MP3, AAC, with a parser that can input data of any size |
| G711 | Supports G711A, G711U, can input data of any size |
| ADPCM | Only supports IMA-ADPCM, input frames without a parser must be complete audio frames |
| SBC | Supports SBC and MSBC, input frames without a parser must be complete audio frames |
| LC3 | Supports LC3, input frames without a parser must be complete audio frames |
| OPUS | Supports OPUS, input frames without a parser must be complete audio frames |

Performance:

  • Encoder Performance

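    | Encoder | Sampling Rate (Hz) | Channels | Memory (KB) | CPU Usage (%) |
    | --- | --- | --- | --- | --- |
    | AAC | 48000 | 2 | 51.4 | 12.9 |
    | G711-A | 8000 | 1 | 0.06 | 0.32 |
    | G711-U | 8000 | 1 | 0.06 | 0.33 |
    | AMR-NB | 8000 | 1 | 3.3 | 17.81 |
    | AMR-WB | 16000 | 1 | 5.6 | 37.69 |
    | ADPCM | 48000 | 2 | 0.01 | 2.69 |
    | OPUS | 48000 | 2 | 29.4 | 24.9 |
    | SBC | 48000 | 2 | 1.85 | 9.55 |
    | LC3 | 48000 | 2 | 3.67 | 46.57 |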

    • Encoder CPU usage highly depends on encoding parameters (such as bitrate, complexity, etc.)

    • AAC encoder test bitrate is 90 kbps

    • AMR-NB/AMR-WB encoder test bitrates are 12.2 kbps/8.85 kbps respectively

    • OPUS encoder test bitrate is 90 kbps, complexity is 0

    • Memory only counts heap usage, not including stack. When supporting all encoders, the recommended task stack size is about 40 K

  • Decoder Performance

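    | Decoder | Sampling Rate (Hz) | Channels | Memory (KB) | CPU Usage (%) |
    | --- | --- | --- | --- | --- |
    | AAC | 48000 | 2 | 51.2 | 6.75 |
    | G711-A | 8000 | 1 | 0.04 | 0.14 |
    | G711-U | 8000 | 1 | 0.04 | 0.13 |
    | AMR-NB | 8000 | 1 | 1.8 | 4.23 |
    | AMR-WB | 16000 | 1 | 5.4 | 9.5 |
    | ADPCM | 48000 | 2 | 0.11 | 2.43 |
    | OPUS | 48000 | 2 | 26.6 | 5.86 |
    | MP3 | 44100 | 2 | 28 | 8.17 |
    | FLAC | 44100 | 2 | 89.4 | 8.0 |
    | SBC | 48000 | 2 | 0.21 | 8.14 |
    | LC3 | 48000 | 2 | 1.36 | 17.5 |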

    • MP3 and FLAC decoders are tested with real audio data, others with sine wave PCM encoded data

    • The test file for the AAC decoder is AAC-LC; AAC-Plus decoding consumes more memory and CPU

    • Memory only counts heap usage. When supporting all decoders, it is recommended that the task stack size is about 20 K

Codec Comparison:

The following table compares the features of the Codecs supported by ESP-Audio-Codec:

Common Audio Codec Feature Comparison

| Codec | Features | Typical Bitrate Range (kbps) | Applicable Scenarios |
|---|---|---|---|
| AAC (Advanced Audio Coding) | Lossy compression, better sound quality than MP3, more efficient at the same bitrate; widely supported. | 96 – 320 (stereo typically uses 128–256) | Online music, video streaming (YouTube, Apple Music, radio). |
| MP3 | The most popular lossy compression format, excellent compatibility, but slightly less efficient than AAC/Opus. | 128 – 320 (as low as 64 can also be used) | Music download, traditional players, car audio. |
| AMR-NB / AMR-WB | Optimized for voice, clear voice at low bitrates; NB (8 kHz), WB (16 kHz). | AMR-NB: 4.75 – 12.2; AMR-WB: 6.6 – 23.85 | Mobile communication (2G/3G phone calls), VoIP, voice messages. |
| ADPCM | Simple compression, low latency, limited sound quality; not very efficient. | 16 – 64 (common) | Early voice storage, embedded devices, simple latency-sensitive audio transmission. |
| G.711 (A-law / μ-law) | Waveform coding, fixed at 64 kbps, sound quality close to telephone level; extremely low latency. | Fixed at 64 | Landline, VoIP (such as SIP), call centers. |
| OPUS | Low latency, high sound quality, supports narrowband to fullband, strong adaptability; open source and free. | 6 – 510 (voice commonly 16–32, music 64–128) | Real-time voice (VoIP, conferencing), music streaming, game voice, WebRTC. |
| Vorbis | Open source lossy compression, good sound quality, better compression rate than MP3; gradually being replaced by Opus. | 64 – 320 (commonly 128–192) | Open source streaming media (OGG container), some games and applications. |
| FLAC | Lossless compression, retains original sound quality, compression rate about 40–60%. | 700 – 1100 (CD quality, depends on content) | High fidelity music storage, music download (Hi-Res music). |
| ALAC | Apple's lossless compression, similar to FLAC, but limited ecosystem. | 700 – 1100 (similar to FLAC) | Apple Music lossless audio, iTunes, iOS/macOS ecosystem. |
| SBC | Simple, low power consumption, default encoding for Bluetooth A2DP, average sound quality. | 192 – 320 (commonly 256) | Bluetooth headphones, Bluetooth speakers. |
| LC3 | Successor to SBC, used for Bluetooth LE Audio; low power consumption, better sound quality than SBC, low latency. | 16 – 160 (commonly 96–128) | Bluetooth LE Audio (TWS earbuds, hearing aids), IoT audio. |
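The bitrates in the table translate directly into buffer and storage sizes, which matters on memory-constrained chips. The following is a minimal sketch of that arithmetic, not part of ESP-Audio-Codec; the function names stream_bytes and pcm_kbps are hypothetical, and 1 kbps is taken as 1000 bit/s.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes produced by a constant-bitrate stream over a given duration.
 * Assumes the audio convention 1 kbps = 1000 bit/s. */
uint64_t stream_bytes(uint32_t kbps, uint32_t seconds)
{
    /* kbps * 1000 is always divisible by 8, so no rounding occurs here. */
    return (uint64_t)kbps * 1000u / 8u * seconds;
}

/* Bitrate of uncompressed PCM in kbps, rounded down. */
uint32_t pcm_kbps(uint32_t sample_rate_hz, uint32_t bits_per_sample,
                  uint32_t channels)
{
    assert(bits_per_sample > 0 && channels > 0);
    return (uint32_t)((uint64_t)sample_rate_hz * bits_per_sample * channels
                      / 1000u);
}
```

For example, pcm_kbps(8000, 8, 1) gives 64, which matches the fixed G.711 rate (8 kHz sampling, 8 bits per sample, mono), and stream_bytes(64, 60) gives 480000 bytes for one minute of G.711 audio.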

SoC Compatibility:

The table below shows the support status of ESP-Audio-Codec on various Espressif chips. “✔” indicates support, “✘” indicates no support.

| Chip | v2.0.0 |
|---|---|
| ESP32 | |
| ESP32-S2 | |
| ESP32-C3 | |
| ESP32-C6 | |
| ESP32-S3 | |
| ESP32-P4 | |
| ESP32-C2 | |
| ESP32-C5 | |
| ESP32-H4 | |
| ESP32-H2 | |

Usage:

  • Encoder Example

    • For detailed usage, please refer to: audio_encoder_test.c

    • To use a custom encoder, follow these steps:

      1. Implement the custom encoder interface. For details, see: struct esp_audio_enc_ops_t

      2. Define a custom audio encoder type in the enumeration esp_audio_type_t. Its value must lie between ESP_AUDIO_TYPE_CUSTOMIZED and ESP_AUDIO_TYPE_CUSTOMIZED_MAX. For details, see: enum esp_audio_type_t

      3. To override a default encoder instead, skip step 2 and use the existing encoder type directly

      4. Register the custom encoder. For details, see: esp_audio_enc_register()

  • Decoder Example

    • For detailed usage, please refer to: audio_decoder_test.c

    • To use a custom decoder, follow these steps:

      1. Implement the custom decoder interface. For details, see: struct esp_audio_dec_ops_t

      2. Define a custom audio decoder type in the enumeration esp_audio_type_t. Its value must lie between ESP_AUDIO_TYPE_CUSTOMIZED and ESP_AUDIO_TYPE_CUSTOMIZED_MAX. For details, see: enum esp_audio_type_t

      3. To override a default decoder instead, skip step 2 and use the existing decoder type directly

      4. Register the custom decoder. For details, see: esp_audio_dec_register()

  • Simple Decoder Usage Example

    • For detailed usage, please refer to: simple_decoder_test.c

    • To use a custom simple decoder, follow these steps:

      1. Implement the custom simple decoder interface. For the interface definition, see: struct esp_audio_simple_dec_reg_info_t

      2. Define a custom simple audio decoder type in the enumeration esp_audio_simple_dec_type_t. Its value must lie between ESP_AUDIO_SIMPLE_DEC_TYPE_CUSTOM and ESP_AUDIO_SIMPLE_DEC_TYPE_CUSTOM_MAX. For details, see: enum esp_audio_simple_dec_type_t

      3. To override a default decoder instead, skip step 2 and use the existing decoder type directly

      4. Register the custom simple decoder. For details, see: esp_audio_simple_dec_register()
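The three procedures above share one pattern: fill in a table of callbacks, pick a type ID (either a new one in the reserved custom range or an existing one to override), and register. The following self-contained sketch illustrates that pattern only. It does not use the ESP-Audio-Codec API: every name in it (demo_type_t, demo_ops_t, demo_register, demo_find, demo_passthrough_ops) is hypothetical, and treating the upper bound of the custom range as exclusive is an assumption of this sketch. For the real signatures, see the headers referenced in the steps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type IDs: one built-in codec, then a reserved custom range. */
typedef enum {
    DEMO_TYPE_UNSUPPORTED    = 0,
    DEMO_TYPE_MP3            = 1,
    DEMO_TYPE_CUSTOMIZED     = 0x10, /* first ID available to custom codecs */
    DEMO_TYPE_CUSTOMIZED_MAX = 0x20, /* end of custom range (exclusive here) */
} demo_type_t;

/* Hypothetical callback table, standing in for a codec "ops" struct. */
typedef struct {
    int  (*open)(void **handle);
    int  (*process)(void *handle, const unsigned char *in, size_t in_len,
                    unsigned char *out, size_t out_cap, size_t *out_len);
    void (*close)(void *handle);
} demo_ops_t;

#define DEMO_MAX_SLOTS 8

typedef struct {
    int         used;
    demo_type_t type;
    demo_ops_t  ops;
} demo_slot_t;

static demo_slot_t s_slots[DEMO_MAX_SLOTS];

/* Register ops for a type. Registering an already-known type overrides it. */
int demo_register(demo_type_t type, const demo_ops_t *ops)
{
    if (ops == NULL || !ops->open || !ops->process || !ops->close) {
        return -1; /* incomplete interface */
    }
    if (type <= DEMO_TYPE_UNSUPPORTED || type >= DEMO_TYPE_CUSTOMIZED_MAX) {
        return -1; /* outside the built-in and custom ranges */
    }
    int free_slot = -1;
    for (int i = 0; i < DEMO_MAX_SLOTS; i++) {
        if (s_slots[i].used && s_slots[i].type == type) {
            s_slots[i].ops = *ops; /* override the existing entry */
            return 0;
        }
        if (!s_slots[i].used && free_slot < 0) {
            free_slot = i;
        }
    }
    if (free_slot < 0) {
        return -2; /* registry full */
    }
    s_slots[free_slot].used = 1;
    s_slots[free_slot].type = type;
    s_slots[free_slot].ops  = *ops;
    return 0;
}

/* Look up the ops registered for a type, or NULL if none. */
const demo_ops_t *demo_find(demo_type_t type)
{
    for (int i = 0; i < DEMO_MAX_SLOTS; i++) {
        if (s_slots[i].used && s_slots[i].type == type) {
            return &s_slots[i].ops;
        }
    }
    return NULL;
}

/* A trivial "codec" that copies input to output unchanged. */
static int pt_open(void **handle) { *handle = NULL; return 0; }
static int pt_process(void *handle, const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_cap, size_t *out_len)
{
    (void)handle;
    if (in_len > out_cap) {
        return -1; /* output buffer too small */
    }
    memcpy(out, in, in_len);
    *out_len = in_len;
    return 0;
}
static void pt_close(void *handle) { (void)handle; }

const demo_ops_t demo_passthrough_ops = { pt_open, pt_process, pt_close };
```

In this sketch, demo_register(DEMO_TYPE_CUSTOMIZED, &demo_passthrough_ops) adds a new codec under a custom ID, while demo_register(DEMO_TYPE_MP3, &demo_passthrough_ops) corresponds to step 3: no new type is defined, and the entry for the existing type is replaced.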

Related Links:

ESP-Media-Protocols Component

Overview:

Multimedia protocols are a collection of various communication protocols, widely used in scenarios such as streaming media transmission, device control, and device interconnection communication. ESP-Media-Protocols is a multimedia protocol library launched by Espressif, providing support for basic and mainstream multimedia protocols.

| Protocol | Layer | Function |
|---|---|---|
| RTP/RTCP | Transport Layer | Real-time transmission of audio and video streams, with quality feedback |
| RTSP | Application Layer | As a server, supports clients pulling streams; as a client, supports pulling and pushing streams |
| SIP | Application Layer | Session endpoint; supports registering with a SIP server and initiating and receiving sessions |
| RTMP | Application Layer | As a server, supports clients pulling streams and receives pushed streams; as a client, supports pulling and pushing streams |
| MRM | / | Multi-device master-slave synchronized music playback |
| UPnP | / | Device interconnection, media and service sharing |

How to use:

The ESP-Media-Protocols component is hosted on GitHub. You can add it to your project by running the following command in the project directory.

idf.py add-dependency "jimforr/esp_media_protocols"

Before using the ESP-Media-Protocols component, we recommend running and debugging the following example projects to become familiar with the API and with how the protocol stack is applied.

Performance:

Comparison of protocol performance

| Protocol | Real-time | Data Stream | Control Stream | Device Discovery | TLS Encryption | Complexity |
|---|---|---|---|---|---|---|
| RTSP | High | Yes | Yes | Manual | No | Medium |
| SIP | High | Yes | Yes | Manual | Yes | Medium |
| RTMP | Medium | Yes | Basic | Manual | Yes | Medium |
| MRM | High | Yes | Yes | Automatic | No | Low |
| UPnP | Low | Yes | Yes | Automatic | No | Medium |

  • Real-time

    • Low latency (control): control or command data, latency about 20 ms.

    • Low latency (media): audio, video, or other media streams, latency about 300 ms.

    • Medium latency: RTMP-based live streaming, latency about 2 seconds.

  • Security

    • TLS (optional)

    • MD5 Digest Authentication (mandatory for SIP)

  • Scalability

    • Customizable protocol header and body

    • Supports subscription and notification; services can be registered

  • Concurrency

    • Supports multiple client connections (RTMP)

  • Compatibility

    • SIP is compatible with Linphone, Asterisk FreePBX, FreeSWITCH, Kamailio

    • RTSP is compatible with FFmpeg, VLC, LIVE555, MediaMTX

    • RTMP is compatible with FFmpeg, VLC

    • UPnP is compatible with NetEase Cloud Music

  • Media Support

  • Memory Consumption Data

FAQ

Q: Does ESP-Media-Protocols support all protocols and features?

A: ESP-Media-Protocols currently supports the basic protocols and features that are widely used in embedded applications. Some unsupported protocols, such as SRTP and HLS, are available in other components or repositories. The supported protocol specifications will continue to be iterated and extended, with customer needs taken into account, and support for additional protocols is planned.

Q: Some protocol features overlap. How should I choose between them?

A: Analyze the functional requirements, latency requirements, and network environment of your application scenario. For example, if the application needs high real-time performance and real-time playback control (pause, fast forward, rewind, seek), RTSP is the usual choice; if it needs high real-time performance and real-time interaction, SIP can be used to create a session; if it is a large-scale live broadcast in the browser that prioritizes stability and compatibility over real-time performance, RTMP can be considered.

For more related questions, please refer to the Issues section in the following protocol directory: