AI Agent Solution
Note
This document is automatically translated using AI. Please excuse any detailed errors. The official English version is still in progress.
Overview
The AI Agents application implements audio and video interaction on the ESP32 platform. It is built on the ESP-GMF architecture and integrates device-side AI Agent development, giving developers a complete audio and video interaction solution.
Application Architecture
The AI Agents application consists of two core modules:
Audio-processor module
Handles audio data processing and includes the following components:
Playback
Supports local audio file playback
Supports network audio playback
Supports decoding of various audio formats
Can be used as a source of background music or prompt sounds
Feeder (streaming playback)
Plays real-time streaming audio data from sources such as WebSocket, HTTP streams, or memory buffers
Commonly used for text-to-speech (TTS) output, real-time voice delivery, and online audio playback
Can be combined with Mixer for mixed audio output
Recorder
Captures audio input
Supports 3A algorithm processing: echo cancellation (AEC), noise suppression (ANS), and gain control (AGC)
Supports encoded output (PCM, AMR, OPUS, WAV, etc.)
Can be used for intelligent voice interaction, voice upload, etc.
Mixer (mixing)
Mixes the Playback and Feeder streams into a single audio output
Can be extended with additional input channels
Suitable for background music + real-time voice, prompt sound overlay, etc.
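The mixing step described above can be sketched in plain C. This is only an illustration of the idea (summing a background stream and a real-time voice stream with per-stream gain and clipping protection), not the ESP-GMF mixer API. The names `saturate_s16` and `mix_pcm_s16`, the 16-bit PCM sample format, and the Q8 fixed-point gain convention are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate value to the signed 16-bit PCM range. */
static int16_t saturate_s16(int32_t v)
{
    if (v > INT16_MAX) {
        return INT16_MAX;
    }
    if (v < INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)v;
}

/*
 * Hypothetical helper (not an ESP-GMF function): mix two 16-bit PCM
 * buffers sample by sample. Gains are Q8 fixed-point (256 == 1.0), so
 * background music can be attenuated while voice stays at full level.
 */
static void mix_pcm_s16(const int16_t *playback, int32_t playback_gain_q8,
                        const int16_t *feeder, int32_t feeder_gain_q8,
                        int16_t *out, size_t samples)
{
    for (size_t i = 0; i < samples; i++) {
        /* Accumulate in 32 bits so the sum cannot wrap before clamping. */
        int32_t acc = (playback[i] * playback_gain_q8 +
                       feeder[i] * feeder_gain_q8) / 256;
        out[i] = saturate_s16(acc);
    }
}
```

For example, calling `mix_pcm_s16(music, 128, voice, 256, out, n)` would play the music at half level under full-level voice. Saturating instead of letting the sum wrap avoids the harsh artifacts that integer overflow produces when both inputs are loud.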
Video-processor module
Handles video data processing and includes the following functions:
Video capture
Video encoding and decoding
Video rendering
Features
The table below lists the mainstream AI platforms that the AI Agents application supports and the features available on each platform:
| Platform | Voice Call | Voice Interaction | Visual Processing | Audio and Video Dialogue | Example Link |
|---|---|---|---|---|---|
| Volcano RTC | ✓ | ✓ | ✓ | ✓ | |
| COZE | ✓ | | | | |
| BRTC | ✓ | ✓ | ✓ | ✓ | |
| Tencent Cloud RTC | ✓ | | | | |
| Tongyi | ✓ | ✓ | ✓ | ✓ | To be released |
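
Application code sometimes needs to check at run time whether a chosen platform offers a given feature. The sketch below encodes the table above as a bitmask lookup. It is not part of the AI Agents application: the `FEAT_*` flags, `platform_caps_t`, and `platform_supports` are hypothetical names introduced for illustration. The COZE and Tencent Cloud RTC entries assume that the single check mark in each of those rows belongs to the first feature column, as the table lists it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature flags, one bit per feature column of the table. */
enum {
    FEAT_VOICE_CALL        = 1 << 0,
    FEAT_VOICE_INTERACTION = 1 << 1,
    FEAT_VISUAL_PROCESSING = 1 << 2,
    FEAT_AV_DIALOGUE       = 1 << 3,
    FEAT_ALL = FEAT_VOICE_CALL | FEAT_VOICE_INTERACTION |
               FEAT_VISUAL_PROCESSING | FEAT_AV_DIALOGUE
};

typedef struct {
    const char *name;   /* platform name as written in the table */
    uint8_t features;   /* bitwise OR of FEAT_* flags */
} platform_caps_t;

/* Rows copied from the table; single-check rows are assumed to mean
 * the Voice Call column. */
static const platform_caps_t k_platforms[] = {
    { "Volcano RTC",       FEAT_ALL },
    { "COZE",              FEAT_VOICE_CALL },
    { "BRTC",              FEAT_ALL },
    { "Tencent Cloud RTC", FEAT_VOICE_CALL },
    { "Tongyi",            FEAT_ALL },
};

/* Return true only if the named platform has every requested feature bit.
 * Unknown platform names are reported as unsupported. */
static bool platform_supports(const char *name, uint8_t features)
{
    for (size_t i = 0; i < sizeof(k_platforms) / sizeof(k_platforms[0]); i++) {
        if (strcmp(k_platforms[i].name, name) == 0) {
            return (k_platforms[i].features & features) == features;
        }
    }
    return false;
}
```

A caller could then guard optional code paths with a check such as `platform_supports("BRTC", FEAT_VISUAL_PROCESSING)` before starting video capture.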