Multimedia Technology Wiki: Component Description

Note

This document is automatically translated using AI. Please excuse any detailed errors. The official English version is still in progress.

ESP-New-JPEG Component

Note

For basic knowledge about JPEG, please refer to JPEG

Overview

ESP-New-JPEG is a lightweight JPEG encoding and decoding library launched by Espressif Systems. To improve efficiency, the JPEG encoder and decoder have been deeply optimized to reduce memory consumption and enhance processing performance. For the ESP32-S3 chip that supports SIMD instructions, these instructions are used to further improve processing speed. In addition, rotation, cropping, and scaling functions have been extended, which can be performed simultaneously during the encoding and decoding process, thereby simplifying user operations. For chips with smaller memory, a block mode has been introduced to support processing part of the image content multiple times, effectively reducing memory pressure.

ESP-New-JPEG supports JPEG encoding and decoding of the Baseline Profile. The rotation, cropping, scaling, and block mode functions can only take effect under specific configurations.

JPEG Encoder Function

The basic features supported by the encoder are as follows:

  • Supports encoding of any width and height

  • Supports the following pixel formats: RGB888, RGB565 (big-endian), RGB565 (little-endian), RGBA, YCbYCr, CbYCrY, YCbY2YCrY2, GRAY

    • When using the YCbY2YCrY2 format, only YUV420 and Gray subsampling are supported

  • Supports YUV444, YUV422, YUV420, Gray subsampling

  • Supports quality setting range: 1-100

The extended features are as follows:

  • Supports 0°, 90°, 180°, 270° clockwise rotation

  • Supports dual-task encoding

  • Supports block mode encoding

Dual-task encoding can be used on dual-core chips to exploit parallel encoding on both cores. One core runs the main encoding task, and the other core handles part of the entropy encoding work. In most cases, enabling dual-task encoding improves performance by about 1.5 times. You can enable or disable dual-task encoding in menuconfig, where you can also set the core and priority of the entropy encoding task.

Block encoding encodes one image block at a time, so the complete image is encoded after multiple passes. With YUV420 subsampling, each block is 16 rows high and as wide as the image; with other subsampling formats, each block is 8 rows high and as wide as the image. Because each pass handles only a small amount of data, the image buffer can be placed in DRAM, which improves the encoding speed. The workflow of block encoding is shown in the following figure:

The configuration requirements for extended features are as follows:

JPEG Decoder Functionality

The basic features supported by the decoder are as follows:

  • Supports decoding of any width and height

  • Supports single-channel and three-channel decoding

  • Supports the following pixel format outputs: RGB888, RGB565 (big-endian), RGB565 (little-endian), CbYCrY
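The two RGB565 variants differ only in the byte order of each 16-bit pixel. The following is a minimal host-side sketch of that difference, assuming the common 5-6-5 bit layout with red in the top five bits. The helper name `pack_rgb565` is hypothetical and is not part of the component.

```python
import struct


def pack_rgb565(r: int, g: int, b: int, big_endian: bool = False) -> bytes:
    """Pack one 8-bit-per-channel pixel into two RGB565 bytes."""
    # Keep the top 5 bits of red, 6 bits of green and 5 bits of blue.
    value = ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)
    # "<H" writes the low byte first (little-endian), ">H" the high byte first.
    return struct.pack(">H" if big_endian else "<H", value)


red_le = pack_rgb565(255, 0, 0)                   # little-endian byte order
red_be = pack_rgb565(255, 0, 0, big_endian=True)  # big-endian byte order
print(red_le.hex(), red_be.hex())
```

If colors look swapped or noisy on a display, comparing the two byte orders in this way is a quick check of which variant the display driver expects.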

The extended features are as follows:

  • Supports scaling (maximum reduction ratio is 1/8)

  • Supports cropping (cropping from the upper left corner)

  • Supports 0°, 90°, 180°, 270° clockwise rotation

  • Supports block mode decoding

The process of scaling, cropping, and rotating is sequential, as shown in the diagram below. The decoded JPEG data stream is first scaled, then cropped, and finally rotated and output.

When using the scaling and cropping functions, configure the corresponding parameters in the jpeg_resolution_t structure. The component supports handling width and height separately. For example, to crop only the width while keeping the height unchanged, set clipper.height = 0; the height of the image then remains the original JPEG image height. The processing flow can be set up with either the detailed or the simplified configuration below.

// Detailed configuration
jpeg_dec_config_t config = DEFAULT_JPEG_DEC_CONFIG();
config.output_type = JPEG_PIXEL_FORMAT_RGB565_LE;
config.scale.width = 320;
config.scale.height = 120;
config.clipper.width = 192;
config.clipper.height = 120;
config.rotate = JPEG_ROTATE_90D;

// Simplified configuration
jpeg_dec_config_t config = DEFAULT_JPEG_DEC_CONFIG();
config.output_type = JPEG_PIXEL_FORMAT_RGB565_LE;
config.scale.width = 0;  // keep width unchanged by setting to 0
config.scale.height = 120;
config.clipper.width = 192;
config.clipper.height = 0;  // keep height unchanged by setting to 0
config.rotate = JPEG_ROTATE_90D;
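To see what output size a configuration like the ones above produces, the scale, clip, rotate order can be sketched on the host. This is only an illustration under stated assumptions: the source image size (320x240) is hypothetical, a 0 is read as "keep this dimension from the previous stage", and the function `output_size` is not part of the component. The checks mirror the constraints this document lists for the extended decoder functions.

```python
def output_size(src, scale=(0, 0), clip=(0, 0), rotate=0):
    """Return (width, height) after scale -> clip -> rotate.

    A 0 in scale or clip keeps that dimension from the previous stage
    (an assumed reading of the configuration, not confirmed API behavior).
    """
    width, height = src
    for name, (w, h) in (("scale", scale), ("clip", clip)):
        if w % 8 or h % 8:
            raise ValueError(f"{name} width and height must be multiples of 8")
        new_w, new_h = (w or width), (h or height)
        if name == "scale" and (new_w * 8 < width or new_h * 8 < height):
            raise ValueError("maximum reduction ratio is 1/8")
        if name == "clip" and (new_w > width or new_h > height):
            raise ValueError("clip size must not exceed the scaled size")
        width, height = new_w, new_h
    if rotate not in (0, 90, 180, 270):
        raise ValueError("rotate must be 0, 90, 180 or 270")
    # A 90 or 270 degree rotation swaps width and height.
    return (height, width) if rotate in (90, 270) else (width, height)


# Hypothetical 320x240 source, using the values from both configurations.
detailed = output_size((320, 240), scale=(320, 120), clip=(192, 120), rotate=90)
simplified = output_size((320, 240), scale=(0, 120), clip=(192, 0), rotate=90)
print(detailed, simplified)
```

Under these assumptions both configurations describe the same pipeline, which is why the simplified form can leave the unchanged dimensions at 0.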

Block decoding decodes only one image block at a time, so the entire image is decoded after multiple passes. With YUV420 subsampling, each block is 16 lines high and as wide as the image; with other subsampling formats, each block is 8 lines high and as wide as the image. Because each pass handles only a small amount of data, block decoding suits chips without PSRAM, and placing the output image buffer in DRAM can also improve the decoding speed. Block decoding can be seen as the reverse process of block encoding.

The typical usage method of block decoding is as follows:

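jpeg_dec_config_t config = DEFAULT_JPEG_DEC_CONFIG();
config.block_enable = true;  // decode one block per jpeg_dec_process() call

// Arguments are omitted in the calls below for brevity.
jpeg_dec_open();          // create the decoder; hd below is its handle
jpeg_dec_parse_header();  // parse the JPEG header before querying sizes

int output_len = 0;     // required output buffer length
int process_count = 0;  // number of jpeg_dec_process() calls for the whole image
jpeg_dec_get_outbuf_len(hd, &output_len);
jpeg_dec_get_process_count(hd, &process_count);

for (int block_cnt = 0; block_cnt < process_count; block_cnt++) {
  jpeg_dec_process();  // decode the next block
}

jpeg_dec_close();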
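For planning memory before running on the device, the block geometry described above can be estimated on the host. This is a rough sketch, assuming the number of passes is the image height divided by the block height (rounded up) and that one output block holds `width * block_height` pixels. The authoritative values come from jpeg_dec_get_outbuf_len() and jpeg_dec_get_process_count(); the function `block_plan` is hypothetical.

```python
import math

# Bytes per output pixel for the formats sized in this sketch.
BYTES_PER_PIXEL = {"RGB888": 3, "RGB565": 2}


def block_plan(width, height, subsampling, pixel_format):
    """Estimate block height, number of blocks and bytes per output block."""
    # YUV420 uses 16-row blocks; the other subsampling formats use 8 rows.
    block_height = 16 if subsampling == "YUV420" else 8
    block_count = math.ceil(height / block_height)
    block_bytes = width * block_height * BYTES_PER_PIXEL[pixel_format]
    return block_height, block_count, block_bytes


print(block_plan(320, 240, "YUV420", "RGB565"))
```

An estimate like this shows why block mode helps on chips without PSRAM: the output buffer holds one block rather than the full frame.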

The configuration requirements for extended functions are as follows:

  • When block decoding is enabled, other extended functions cannot be used

  • The width and height in the configuration parameters of scaling, cropping, and rotating are required to be multiples of 8

  • When scaling and cropping are enabled at the same time, the size of the crop is required to be smaller than the size after scaling

Performance

ESP-New-JPEG has deeply optimized the JPEG encoding and decoding architecture:

  • Optimized the data processing flow, improving the reuse of intermediate data and reducing memory copy overhead.

  • Applied assembly-level optimization for Xtensa architecture chips, with a significant gain in computational performance on ESP32-S3 chips that support SIMD instructions.

  • Integrated image operations such as cropping and rotating into the codec to improve overall system efficiency.

For codec performance test data, please refer to Performance.

Usage Method

The ESP-New-JPEG component is hosted on GitHub. You can add this component to your project by running the following command in your project directory.

idf.py add-dependency "espressif/esp_new_jpeg"

The test_app folder under the esp_new_jpeg directory contains a runnable test project, which demonstrates the related API call process. Before using the ESP-New-JPEG component, it is recommended to refer to and debug this test project to familiarize yourself with the API usage.

FAQ

Q: Does ESP-New-JPEG support decoding progressive JPEG?

A: No, ESP-New-JPEG only supports decoding baseline JPEG. You can use the following code to check whether the image is a progressive JPEG. Output 1 indicates progressive JPEG, and output 0 indicates baseline JPEG.

python
>>> from PIL import Image
>>> Image.open("file_name.jpg").info.get('progressive', 0)
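If Pillow is not installed, the same check can be sketched with the standard library by looking at the first start-of-frame (SOF) marker in the file: 0xC0 marks baseline DCT and 0xC2 marks progressive DCT in the JPEG standard. The helper names below are hypothetical, and the parser is deliberately minimal (it assumes a well-formed header with no standalone markers before the SOF segment).

```python
import struct

# Start-of-frame markers that use progressive DCT (ITU-T T.81).
PROGRESSIVE_SOF = {0xC2, 0xC6, 0xCA, 0xCE}
# Markers in the 0xC0-0xCF range that are not start-of-frame markers.
NON_SOF = {0xC4, 0xC8, 0xCC}


def first_sof_marker(data: bytes):
    """Return the first start-of-frame marker code in a JPEG stream, or None."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker, not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker in (0xD9, 0xDA):  # end of image or start of scan: stop
            return None
        if 0xC0 <= marker <= 0xCF and marker not in NON_SOF:
            return marker
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length  # skip the marker and its payload
    return None


def is_progressive(data: bytes) -> bool:
    """True if the first frame header uses a progressive DCT mode."""
    return first_sof_marker(data) in PROGRESSIVE_SOF


# Usage on a real file (not run here):
#   with open("file_name.jpg", "rb") as f:
#       print(is_progressive(f.read()))
```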

Q: Why does the output image look misaligned?

A: The typical symptom is that some columns from the left or right edge of the image appear on the opposite side. If you are using ESP32-S3, a likely cause is that the output buffer of the decoder or the input buffer of the encoder is not aligned to 16 bytes. Please use the jpeg_calloc_align() function to allocate the buffer.
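When debugging, it can help to log the buffer address on the device and check it by hand. The arithmetic behind a 16-byte alignment check is sketched below; the addresses are made up for illustration, and the helper names are hypothetical.

```python
def is_aligned(address: int, alignment: int = 16) -> bool:
    """True if the address is a multiple of the alignment."""
    return address % alignment == 0


def align_up(value: int, alignment: int = 16) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


logged_address = 0x3FC90004  # hypothetical address printed by the firmware
print(is_aligned(logged_address), hex(align_up(logged_address)))
```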

Q: How to preview the raw data of the image, such as viewing RGB888 data?

A: You can use yuvplayer. It supports viewing grayscale, RGB888, RGB565 (little-endian), UYVY, YUYV, YUV420P, etc.

Q: Why is ESP_NEW_JPEG decoding slower on ESP32-P4?

A: ESP_NEW_JPEG has not yet been optimized for ESP32-P4. However, ESP32-P4 is equipped with a hardware JPEG encoding and decoding module, and its hardware decoding performance is superior to software decoding. It is recommended to use the hardware JPEG module on ESP32-P4 to achieve better decoding performance. You can refer to JPEG Image Encoder and Decoder - ESP32-P4 for more information.

Q: Will ESP_NEW_JPEG be integrated with the hardware encoder/decoder into a single component, similar to the H264 component?

A: No plans at the moment.

Q: How to estimate decoding speed?

A: You can estimate the decoding speed for a given resolution from measured benchmark data. Decoding time grows roughly in proportion to the number of pixels, so the frame rate scales with the inverse of the pixel-count ratio. For example, if the image to be tested is 480x512 and the measured decoding speed at 640x480 is 13.24 fps, the estimated decoding speed at 480x512 is 13.24 * (640/480) * (480/512) = 16.55 fps.

Refer to Performance for the measured data.

Q: What is the memory consumption of ESP_NEW_JPEG?

A: Currently, only the memory consumption of the decoder has been measured.

  • When scaling is not enabled, the memory consumption is constant, about 10 KB. Most of this fixed memory is allocated when open() is called, and all memory is released when close() is called.

  • When scaling is enabled, memory consumption grows with the image width.

Q: How to understand the concept of stream processing in ESP_NEW_JPEG?

A: The basic usage of the ESP_NEW_JPEG decoding interface is: open() > parse_header() > process() > close()

If all images share the same parameters, opening and closing the decoder for every image wastes resources. The streaming example therefore opens the decoder once, loops over parse_header() > process() for each image, and closes the decoder after the last one.

Related Links:

GMF-AI-Audio Component

Overview

GMF-AI-Audio is a voice interaction component developed based on the GMF framework. By encapsulating ESP-SR, it provides a complete interaction logic from voice wake-up to command recognition. The component integrates functions such as Wake Word detection, Voice Activity Detection (VAD), voice command recognition, and Acoustic Echo Cancellation (AEC), enabling efficient and natural voice interaction experiences in smart speakers, smart home devices, etc.

Support Scenarios

| Method | Corresponding Scenario |
|---|---|
| Immediately upload voice data after wake-up, stop uploading at the Wakeup End stage | Implement VAD function in the cloud, RTC scenario |
| Wait for VAD to trigger after wake-up and start uploading, stop uploading after VAD ends | Traditional interaction method of smart hardware |
| No wake-up, wait for VAD to trigger and start uploading, stop uploading after VAD ends | New cloud processing logic |
| Immediately upload voice data after pressing the button, stop after releasing | Devices with limited computing power implement voice functions through interaction with the cloud |
| Wait for VAD to trigger after pressing the button and start uploading, stop uploading after VAD ends | Solve the problem of excessive data volume caused by relying solely on VAD |
| Detect command words after wake-up | Default usage logic |
| No wake-up, wait for VAD to trigger and detect command words | Can be applied to some vehicle systems |
| Detect command words after pressing the button | Toys |
| Continuous command word recognition | Home control |

ESP-H264 Component

Overview

ESP-H264 is a lightweight H.264 encoder and decoder component developed by Espressif Systems, offering both hardware and software implementations. The hardware encoder is designed specifically for the ESP32-P4 chip, capable of achieving 1080P@30fps. The software encoder is based on openh264, and the decoder is based on tinyH264. Both are optimized for memory and CPU usage, ensuring optimal performance on Espressif chips.

Function

Encoder Function

  • Hardware Encoder (ESP32-P4):

    • Supports Baseline Profile (maximum frame size 36864 macroblocks)

    • Supports width range [80, 1088] pixels, height range [80, 2048] pixels.

    • Supports quality-priority bitrate control

    • Supports RGB888, BGR565_BE, VUY, UYVY, YUV420(O_UYY_E_VYY) raw data formats

    • Supports dynamic adjustment of parameters such as bitrate, framerate, GOP, QP, etc.

    • Supports single-stream and dual-stream encoders

    • Supports deblocking filter, ROI, and motion vector functions

    • Supports SPS and PPS encoding

  • Software Encoder:

    • Supports Baseline Profile (maximum frame size 36864 macroblocks)

    • Supports any resolution with width and height greater than 16 pixels

    • Supports quality-priority bitrate control

    • Supports YUYV and IYUV raw data formats

    • Supports dynamic adjustment of bitrate and framerate

    • Supports SPS and PPS encoding

Decoder Function

  • Supports Baseline Profile (maximum frame size 36864 macroblocks)

  • Supports various widths and heights

  • Supports Long-Term Reference (LTR) frames

  • Supports Memory Management Control Operation (MMCO)

  • Supports modification of reference image list

  • Supports multiple reference frames specified in the Sequence Parameter Set (SPS)

  • Supports IYUV output format
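The macroblock limit quoted in the lists above can be related to a resolution with a little arithmetic, since an H.264 macroblock covers 16x16 luma samples. The sketch below only checks that one limit; it does not model the separate width and height ranges of the hardware encoder, and the function names are hypothetical.

```python
MAX_MACROBLOCKS = 36864  # frame size limit quoted in the feature lists
MB_SIZE = 16             # an H.264 macroblock covers 16x16 luma samples


def macroblocks(width: int, height: int) -> int:
    """Number of macroblocks in a frame, rounding partial blocks up."""
    cols = (width + MB_SIZE - 1) // MB_SIZE
    rows = (height + MB_SIZE - 1) // MB_SIZE
    return cols * rows


def fits_macroblock_limit(width: int, height: int) -> bool:
    """True if the frame stays within the quoted macroblock limit."""
    return macroblocks(width, height) <= MAX_MACROBLOCKS


print(macroblocks(1920, 1080), fits_macroblock_limit(1920, 1080))
```

Note that 1080 is not a multiple of 16, so a 1920x1080 frame is counted as 68 macroblock rows.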

Performance

Encoding Performance: It is recommended to use a hardware encoder for ESP32-P4, while ESP32-S3 and other boards should use a software encoder.

  • Hardware Encoder (ESP32-P4 only):

    • Better performance and power consumption, supporting up to 1080P@30fps

    • Supports single-stream/dual-stream encoding

    • Supports dynamic adjustment of parameters such as bitrate, framerate, GOP, QP, etc.

    • Supports advanced features such as deblocking filter, ROI, motion vector, etc.

  • Software Encoder (All Platforms):

    • Limited performance, but no resolution limit

    • Supports YUYV and IYUV formats, offering richer color formats

    • Supports all Espressif chip platforms, with more board options available

    • Based on the OpenH264 open source project

Encoding Performance Comparison

| Platform | Type | Maximum Resolution | Maximum Performance | Remarks |
|---|---|---|---|---|
| ESP32-S3 | Software Encoder | Any | 320×240@11fps | |
| ESP32-P4 | Hardware Encoder | ≤1080P | 1920×1080@30fps | Hardware Acceleration |

Decoding Performance: It is recommended to use software decoders for all boards.

  • Software Decoder (All Platforms):

    • Limited performance, but no resolution limit

    • Supports IYUV output format

    • Supports advanced features such as long-term reference frames, memory management control, etc.

    • Based on the TinyH264 open source project

Decoding Performance Comparison

| Platform | Type | Maximum Resolution | Maximum Performance |
|---|---|---|---|
| ESP32-S3 | Software Decoder | Any | 320×192@27fps |
| ESP32-P4 | Software Decoder | Any | 1280×720@10fps |

Warning

Memory consumption strongly depends on the resolution and encoding data of the H.264 stream. It is recommended to adjust the memory allocation according to the actual application scenario.

Tip

Using a dual-task decoder can significantly improve decoding performance, especially in high-resolution video processing.

ESP-Image-Effects Component

Overview

ESP-Image-Effects is an image processing engine developed by Espressif Systems, integrating basic functions such as rotation, color space conversion, scaling, and cropping. As one of the core components of Espressif’s audio and video development platform, the ESP-Image-Effects module has deeply restructured the underlying algorithms, combined with efficient memory management and hardware acceleration, achieving high performance, low power consumption, and low memory occupancy. In addition, each image processing function adopts a consistent API architecture design, reducing the learning cost for users and facilitating rapid development. This engine is widely used in the Internet of Things, smart cameras, industrial vision, and other fields.

Function

Image Color Conversion

  • Supports any input resolution

  • Supports bypass mode with the same input/output format

  • Supports BT.601/BT.709/BT.2020 color space standards

  • Supports fast color conversion algorithms for specific formats and resolutions

  • Comprehensive Format Support Matrix:

Supported Color Conversion Formats

| Input Format | Supported Output Formats |
|---|---|
| RGB/BGR565_LE/BE RGB/BGR888 | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 YUV_PLANAR/PACKET YUYV/UYVY O_UYY_E_VYY/I420 |
| ARGB/BGR888 | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 YUV_PLANAR O_UYY_E_VYY/I420 |
| YUV_PACKET/UYVY/YUYV | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 O_UYY_E_VYY/I420 |
| O_UYY_E_VYY/I420 | RGB565_LE/BGR/RGB565_LE/BE RGB/BGR888 O_UYY_E_VYY |
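As a reference point for what an RGB to YUV conversion computes, here is a minimal per-pixel sketch using the full-range BT.601 coefficients in the form used by JFIF. It is an illustration only: the component may use limited-range output, fixed-point arithmetic, or the BT.709/BT.2020 coefficients depending on configuration, and the function name is hypothetical.

```python
def rgb_to_ycbcr_bt601(r: int, g: int, b: int):
    """Convert one 8-bit RGB pixel to full-range BT.601 YCbCr (JFIF form)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clamp(value: float) -> int:
        # Round to the nearest integer and keep the result in 0..255.
        return max(0, min(255, int(round(value))))

    return clamp(y), clamp(cb), clamp(cr)


print(rgb_to_ycbcr_bt601(255, 255, 255), rgb_to_ycbcr_bt601(0, 0, 0))
```

White and black both map to neutral chroma (Cb = Cr = 128), which is a quick sanity check when validating converted frames.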

Image Rotation

  • Supports bypass mode

  • Supports any input resolution

  • Supports clockwise rotation by any angle

  • Supports ESP_IMG_PIXEL_FMT_Y/RGB565/BGR565/RGB888/BGR888/YUV_PACKET formats

  • Supports fast clockwise rotation algorithms for specific angles, formats, and resolutions.

Image Scaling

  • Supports bypass mode

  • Supports any input resolution

  • Supports up-sampling and down-sampling operations

  • Supports ESP_IMG_PIXEL_FMT_RGB565/BGR565/RGB888/BGR888/YUV_PACKET formats

  • Supports various filtering algorithms: optimized downsampling and bilinear interpolation

Image Cropping

  • Supports bypass mode

  • Supports any input resolution

  • Supports up-sampling and down-sampling operations

  • Supports flexible region selection

  • Supports ESP_IMG_PIXEL_FMT_Y/RGB565/BGR565/RGB888/BGR888/YUV_PACKET formats

Performance

The ESP-Image-Effects component has completed performance testing under 1080P. For specific performance data, please refer to the ESP32-P4 Performance Document. This component uses efficient memory management and hardware acceleration technology to achieve high performance, low power consumption, and low memory occupancy.

Related Links

ESP-Audio-Effects Component

Overview

ESP-Audio-Effects is a powerful and flexible audio processing library, designed to provide developers with efficient audio effect processing capabilities. This component is widely used in various smart audio devices, including smart speakers, headphones, audio playback devices, and voice interaction systems.

Function

  • Automatic Level Control: Automatically adjusts input gain to stabilize audio volume. Progressive adjustment ensures smooth transition. Dynamic correction of over-amplification to avoid clipping distortion.

  • Equalizer: Provides fine control over filter type, frequency, gain, and Q factor. Suitable for audio tuning and professional signal shaping.

  • Fade In/Out: Implements fade in and fade out effects, ensuring smooth transitions between tracks.

  • Speed and Pitch Processing: Supports real-time speed and pitch modification, achieving more dynamic playback effects.

  • Mixer: Merges multiple input streams into one output, with start/target weights and transition time configurable for each input.

  • Data Interleaver: Handles interleaving and de-interleaving of audio data buffers.

  • Sample Rate Conversion: Performs conversion between sample rates that are multiples of 4000 Hz or 11025 Hz.

  • Channel Conversion: Remaps audio channel layout using weight array.

  • Bit Depth Conversion: Supports conversion between U8, S16, S24, and S32 bit depths.

  • Dynamic Range Control: Adjusts the dynamic range of the audio signal based on different playback environments and devices. The dynamic range represents the difference between the quietest and loudest parts of the audio signal.

  • Multi-band Dynamic Range Compression: The audio signal is divided into multiple frequency ranges through a bandpass filter, and dynamic range processing is performed independently for each range.

The sampling rate, number of channels, and bit depth supported by each module can be referred to in the README.
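For the sample rate conversion entry in the list above, a quick host-side check can tell whether a rate belongs to the two supported families and what the conversion ratio is. This is a sketch of the arithmetic only; the exact set of supported rates is listed in the README, and the function names are hypothetical.

```python
from fractions import Fraction


def in_supported_family(rate_hz: int) -> bool:
    """True if the rate is a multiple of 4000 Hz or of 11025 Hz."""
    return rate_hz > 0 and (rate_hz % 4000 == 0 or rate_hz % 11025 == 0)


def resample_ratio(src_hz: int, dst_hz: int) -> Fraction:
    """Output samples produced per input sample, as a reduced fraction."""
    if not (in_supported_family(src_hz) and in_supported_family(dst_hz)):
        raise ValueError("rate is not a multiple of 4000 or 11025")
    return Fraction(dst_hz, src_hz)


print(resample_ratio(44100, 48000))
```

The reduced fraction shows how many output samples correspond to how many input samples, which is useful when sizing output buffers.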

Data Layout

ESP-Audio-Effects supports both interleaved and deinterleaved audio formats:

  1. Interleaved format: Use the esp_ae_xxx_process() API to process this layout. For example:

L0 R0 L1 R1 L2 R2 ...

Where L and R represent left and right channel samples respectively.

  2. Deinterleaved format: Use the esp_ae_xxx_deintlv_process() API. Each channel is stored in a separate buffer:

L0, L1, L2, ...  // Left channel
R0, R1, R2, ...  // Right channel
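The relationship between the two layouts can be sketched in a few lines of host-side code. The functions below are illustrative helpers, not part of ESP-Audio-Effects, and they work on plain lists of sample values.

```python
def deinterleave(samples, channels=2):
    """Split [L0, R0, L1, R1, ...] into one list per channel."""
    return [samples[ch::channels] for ch in range(channels)]


def interleave(buffers):
    """Merge per-channel lists back into a single interleaved list."""
    return [sample for frame in zip(*buffers) for sample in frame]


stereo = [10, -10, 20, -20, 30, -30]  # L0 R0 L1 R1 L2 R2
left, right = deinterleave(stereo)
print(left, right, interleave([left, right]) == stereo)
```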

API Style

ESP-Audio-Effects provides a consistent and developer-friendly API:

| Category | Function | Description |
|---|---|---|
| Initialization | esp_ae_xxx_open() | Create an audio effect handle. |
| Interleaved Processing | esp_ae_xxx_process() | Process interleaved audio data. |
| Deinterleaved Processing | esp_ae_xxx_deintlv_process() | Process deinterleaved audio data. |
| Set Parameters | esp_ae_xxx_set_xxx() | Set component-specific parameters. |
| Get Parameters | esp_ae_xxx_get_xxx() | Get current parameters. |
| Release | esp_ae_xxx_close() | Release resources and destroy the handle. |

Related Links

ESP-Audio-Codec Component

For an introduction and performance specifications of the ESP-Audio-Codec component, please refer to ESP-Audio-Codec.

Codec Comparison

The following table compares the features of the Codecs supported by ESP-Audio-Codec:

Common Audio Codec Feature Comparison

| Codec | Features | Typical Bitrate Range (kbps) | Applicable Scenarios |
|---|---|---|---|
| AAC (Advanced Audio Coding) | Lossy compression, better sound quality than MP3, more efficient at the same bitrate; widely supported. | 96 – 320 (stereo typically uses 128–256) | Online music, video streaming (YouTube, Apple Music, radio). |
| MP3 | The most popular lossy compression format, excellent compatibility, but slightly less efficient than AAC/Opus. | 128 – 320 (as low as 64 can also be used) | Music download, traditional players, car audio. |
| AMR-NB / AMR-WB | Optimized for voice, clear voice at low bitrates; NB (8kHz), WB (16kHz). | AMR-NB: 4.75 – 12.2; AMR-WB: 6.6 – 23.85 | Mobile communication (2G/3G phone calls), VoIP, voice messages. |
| ADPCM | Simple compression, low latency, limited sound quality; not very efficient. | Common 16 – 64 | Early voice storage, embedded devices, simple audio transmission sensitive to latency. |
| G.711 (A-law / μ-law) | Waveform coding, fixed at 64 kbps, sound quality close to telephone level; extremely low latency. | Fixed at 64 | Landline, VoIP (such as SIP), call centers. |
| OPUS | Low latency, high sound quality, supports narrowband to full band, strong adaptability; open source and free. | 6 – 510 (common voice 16–32, music 64–128) | Real-time voice (VoIP, conference), music stream, game voice, WebRTC. |
| Vorbis | Open source lossy compression, good sound quality, better compression rate than MP3; gradually replaced by Opus. | 64 – 320 (commonly used 128–192) | Open source streaming media (OGG container), some games and applications. |
| FLAC | Lossless compression, retains original sound quality, compression rate about 40–60%. | 700 – 1100 (CD quality, depends on content) | High fidelity music storage, music download (Hi-Res music). |
| ALAC | Apple’s lossless compression, similar to FLAC, but limited ecosystem. | 700 – 1100 (similar to FLAC) | Apple Music lossless audio, iTunes, iOS/macOS ecosystem. |
| SBC | Simple, low power consumption, default encoding for Bluetooth A2DP, average sound quality. | 192 – 320 (commonly used 256) | Bluetooth headphones, Bluetooth speakers. |
| LC3 | Inherits from SBC, used for Bluetooth LE Audio; low power consumption, better sound quality than SBC, low latency. | 16 – 160 (commonly 96–128) | Bluetooth LE Audio (TWS earbuds, hearing aids), IoT audio. |
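The bitrates in the table translate directly into storage and bandwidth needs. The sketch below shows the arithmetic, assuming a constant bitrate and ignoring container overhead; the function names are hypothetical. It also shows where the fixed 64 kbps of G.711 comes from (8000 samples per second at 8 bits per sample).

```python
def stream_bytes(bitrate_kbps: float, seconds: float) -> int:
    """Approximate payload size of a constant-bitrate stream, in bytes."""
    return int(bitrate_kbps * 1000 * seconds / 8)


def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bitrate of uncompressed or fixed-width sample data, in kbps."""
    return sample_rate_hz * bits_per_sample * channels / 1000


print(stream_bytes(128, 60))          # one minute at 128 kbps
print(pcm_bitrate_kbps(8000, 8))      # G.711: 8 kHz, 8 bits per sample
print(pcm_bitrate_kbps(44100, 16, 2)) # CD-quality stereo PCM, for comparison
```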

Usage Method

Encoder Usage Example

For detailed usage, please refer to audio_encoder_test.c.

If you need to use a custom encoder, follow the steps below:

  1. Implement the custom encoder interface. For the interface details, see struct esp_audio_enc_ops_t.

  2. Define a custom audio encoder type in the enumeration esp_audio_type_t. The value must lie between ESP_AUDIO_TYPE_CUSTOMIZED and ESP_AUDIO_TYPE_CUSTOMIZED_MAX. For details, see the enumeration esp_audio_type_t.

  3. If you want to override a default encoder, you do not need to define a new audio encoder type; use the existing encoder type directly.

  4. Register the custom encoder with esp_audio_enc_register().

Decoder Usage Example

For detailed usage, please refer to audio_decoder_test.c.

If you need to use a custom decoder, follow the steps below:

  1. Implement the custom decoder interface. For the interface details, see struct esp_audio_dec_ops_t.

  2. Define a custom audio decoder type in the enumeration esp_audio_type_t. The value must lie between ESP_AUDIO_TYPE_CUSTOMIZED and ESP_AUDIO_TYPE_CUSTOMIZED_MAX. For details, see the enumeration esp_audio_type_t.

  3. If you want to override a default decoder, you do not need to define a new audio decoder type; use the existing decoder type directly.

  4. Register the custom decoder with esp_audio_dec_register().

Simple Decoder Usage Example

For detailed usage, please refer to simple_decoder_test.c.

If you need to use a custom simple decoder, follow the steps below:

  1. Implement the custom simple decoder interface. For the interface details, see struct esp_audio_simple_dec_reg_info_t.

  2. Define a custom simple audio decoder type in the enumeration esp_audio_simple_dec_type_t. The value must lie between ESP_AUDIO_SIMPLE_DEC_TYPE_CUSTOM and ESP_AUDIO_SIMPLE_DEC_TYPE_CUSTOM_MAX. For details, see the enumeration esp_audio_simple_dec_type_t.

  3. If you want to override a default decoder, you do not need to define a new audio decoder type; use the existing decoder type directly.

  4. Register the custom simple decoder with esp_audio_simple_dec_register().

Related Links

ESP-Media-Protocols Component

Overview

Multimedia protocols are a collection of communication protocols widely used in scenarios such as streaming media transmission, device control, and device interconnection. Typical applications are as follows:

  • Most security systems have network cameras with built-in RTSP servers, which compress the captured video and provide video streams over RTP. This allows access and streaming by monitoring platforms, NVRs, VLC players, etc.

  • GB28181 (full name: GB/T 28181-2016) is a national standard issued by the Chinese Ministry of Public Security. It defines the technical requirements for information transmission, exchange, and control of public safety video surveillance networking systems. It uses the SIP protocol for device registration, heartbeat, call, and other signaling controls, uses SDP to describe media session information, and uses RTP and RTCP for real-time transmission and control of media data.

  • VoIP, video conferencing, and visual intercom systems use the SIP protocol for call setup and for voice and video communication.

  • Broadcasting systems and live streaming platforms: after the device captures the media stream, it pushes the stream to the server over RTMP, and multiple client devices pull and play it from the server over RTMP.

ESP-Media-Protocols is a multimedia protocol library launched by Espressif Systems, providing support for basic and mainstream multimedia protocols.

| Protocol | Layer | Function | Common Application Scenarios |
|---|---|---|---|
| RTP/RTCP | Transport Layer | Real-time transmission of audio and video streams, providing quality information | Low-latency transmission of media data from network cameras, real-time calls/conferences, RTCP provides transmission quality statistics |
| RTSP | Application Layer | Supports being streamed as a server, supports streaming and pushing as a client | Low-latency unidirectional transmission and playback control of network camera’s media data |
| SIP | Application Layer | Session terminal, supports registration to SIP server, supports initiating and receiving sessions | Low-latency bidirectional transmission of media data between intercoms and telephone terminals, realized through session management for intercom and conference functions. |
| RTMP | Application Layer | Supports being streamed and receiving pushes as a server, supports streaming and pushing as a client | Live streaming and backend distribution (device streaming to live server/platform), live access |
| MRM | / | Multi-device master-slave synchronized music playback | Synchronized multi-room audio playback (smart speakers, synchronized multi-device home theater) |
| UPnP | / | Device interconnection, media and service sharing | Device discovery and media sharing within the home (Mobile/PC discovers TV/NAS and casts or plays screen) |

Performance Data

Comparison of protocol performance

| Protocol | Real-time | Data Stream | Control Stream | Device Discovery | TLS Encryption | Complexity |
|---|---|---|---|---|---|---|
| RTSP | High | Yes | Yes | Manual | No | Medium |
| SIP | High | Yes | Yes | Manual | Yes | Medium |
| RTMP | Medium | Yes | Basic | Manual | Yes | Medium |
| MRM | High | Yes | Yes | Automatic | No | Low |
| UPnP | Low | Yes | Yes | Automatic | No | Medium |

  • Real-time

    • Low latency: Data for control or command transmission, latency about 20 ms.

    • Low latency: Audio, video or other media stream transmission, latency about 300 ms.

    • Medium latency: Live stream based on RTMP, latency about 2 seconds.

  • Security

    • TLS (optional)

    • MD5 Digest Authentication (SIP mandatory)

  • Scalability

    • Customizable protocol header and body

    • Supports subscription and notification, can register services

  • Concurrency

    • Supports multiple client connections (RTMP)

  • Compatibility

    • SIP supports linphone, Asterisk FreePBX, Freeswitch, Kamailio

    • RTSP supports ffmpeg, vlc, live555, mediamtx

    • RTMP supports ffmpeg, vlc

    • UPnP supports NetEase Cloud Music

  • Media Support

  • Memory Consumption Data

You can easily identify the protocol to use through the following flowchart:

Usage Method

The ESP-Media-Protocols component is hosted on GitHub. You can add this component to your project by running the following command in your project directory.

idf.py add-dependency "espressif/esp_media_protocols^0.5.1"

Before using the ESP-Media-Protocols component, it is recommended to refer to and debug the following example projects first, in order to familiarize yourself with the usage of the API and the specific application of the protocol stack.

FAQ

Q: Does ESP-Media-Protocols support all protocols and features?

A: ESP-Media-Protocols currently supports the basic protocols and features widely used in the embedded field. Some unsupported protocols, such as SRTP and HLS, are available in other components or repositories. The supported protocol specifications will continue to be iterated and expanded, and extensions will be considered according to customer needs. Support for additional protocols is planned.

Q: Some protocol features overlap. How do I choose between them?

A: Analyze the functional requirements, latency requirements, and network environment of your application scenario. For example, if the real-time requirement is high and real-time playback control (pause, fast forward, rewind, seek) is needed, RTSP is usually used; if the real-time requirement is high and real-time interaction is needed, SIP can be used to create a session; for large-scale live broadcast in the browser, with high requirements for stability and compatibility but no strict real-time requirement, RTMP can be considered.
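The guidance in this answer can be written down as a small decision helper. This is only a restatement of the three examples given here, with hypothetical parameter names; any combination the answer does not cover returns None and needs case-by-case analysis.

```python
def suggest_protocol(low_latency: bool,
                     playback_control: bool = False,
                     two_way_interaction: bool = False,
                     large_scale_live: bool = False):
    """Map the requirements named in the answer to a protocol name, or None."""
    if low_latency and two_way_interaction:
        return "SIP"   # real-time interaction through a session
    if low_latency and playback_control:
        return "RTSP"  # pause, fast forward, rewind, seek
    if large_scale_live and not low_latency:
        return "RTMP"  # stability and compatibility over latency
    return None        # not covered by the guidance above


print(suggest_protocol(low_latency=True, playback_control=True))
```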

For more related questions, please refer to the Issues section in the following protocol directory:

ESP-MIDI Component

Overview

ESP-MIDI is a MIDI (Musical Instrument Digital Interface) software library launched by Espressif Systems, providing efficient MIDI file parsing and real-time audio synthesis capabilities. ESP-MIDI supports SoundFont sound libraries and custom audio libraries, capable of outputting high-fidelity, distortion-free audio effects. Combining the characteristics of small MIDI file size and rich sound library resources, ESP-MIDI achieves a balance of excellent sound quality and high performance, providing developers with a comprehensive MIDI processing solution.

The related information is as follows:

FAQ

Q: How to load SoundFont files?

A: ESP-MIDI supports full SoundFont 2 (SF2) file parsing and playback. You can load SF2 files through the tone library loading interface, or use a user-defined sound library.

Q: Does ESP-MIDI support real-time MIDI input?

A: Yes, ESP-MIDI supports real-time MIDI event encoding and decoding, including various MIDI event types such as note on/off, program change, control change, pitch bend, channel pressure, and more.
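For reference, the wire format of two of the event types named above is defined by the MIDI 1.0 specification and can be sketched independently of any library. The helper names below are hypothetical and do not represent the ESP-MIDI API.

```python
def _check(channel: int, *data: int) -> None:
    """Validate a channel number and 7-bit data bytes."""
    if not 0 <= channel <= 15:
        raise ValueError("channel must be 0-15")
    if any(not 0 <= value <= 127 for value in data):
        raise ValueError("data bytes must be 0-127")


def note_on(channel: int, note: int, velocity: int) -> bytes:
    """Note On: status byte 0x90 plus the channel, then note and velocity."""
    _check(channel, note, velocity)
    return bytes([0x90 | channel, note, velocity])


def pitch_bend(channel: int, value: int = 8192) -> bytes:
    """Pitch Bend: status 0xE0 plus channel, 14-bit value sent LSB first."""
    if not 0 <= value <= 16383:
        raise ValueError("pitch bend value must be 0-16383")
    _check(channel)
    return bytes([0xE0 | channel, value & 0x7F, value >> 7])


print(note_on(0, 60, 100).hex(), pitch_bend(0).hex())
```

A value of 8192 is the centre position for pitch bend, which is why it is used as the default here.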

Q: How to control the playback speed?

A: ESP-MIDI supports setting and changing the BPM (Beats Per Minute) and speed, including dynamic adjustments. You can set and modify the playback speed through the API.
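The relationship between BPM and playback timing follows from the Standard MIDI File format, where tempo is stored as microseconds per quarter note and event times are counted in ticks. The sketch below shows that arithmetic; the function names are hypothetical, and the ESP-MIDI API for setting BPM is not shown here.

```python
def tempo_microseconds(bpm: float) -> int:
    """Microseconds per quarter note, as stored in a MIDI Set Tempo event."""
    return round(60_000_000 / bpm)


def tick_seconds(bpm: float, ticks_per_quarter: int) -> float:
    """Duration of one tick in seconds at the given tempo and resolution."""
    return 60.0 / (bpm * ticks_per_quarter)


# One quarter note (480 ticks) at 120 BPM.
print(tempo_microseconds(120), tick_seconds(120, 480) * 480)
```

Doubling the BPM halves the duration of every tick, which is how a tempo change speeds up playback without altering the note data.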

Q: How is audio output integrated?

A: ESP-MIDI provides a callback-based audio output interface, which can be flexibly integrated with different audio backends, such as ESP-ADF, ESP-Audio-Codec, etc.