Introduction to LLM Solution
Note
This document is automatically translated using AI. Please excuse any detailed errors. The official English version is still in progress.
LLM Solution Overview
- Solution Features:
The rise of large models such as ChatGPT has driven a global AI boom: cloud platforms are becoming more intelligent, and AI technology keeps spreading into more industries. Espressif has built a solid technical foundation on its open, shared, ecosystem-based intelligent hardware platform.
Espressif provides reference solutions for voice and vision large-model applications, covering the low-cost ESP32-C3, the flagship ESP32-S3, the high-performance ESP32-P4, and the dual-band ESP32-C5.
Espressif is also working on private deployment of large language models to help customers bring products to market faster.
Solution Overview Diagram:

Voice Solution Details
Positioning | C Series (ESP32-C3/C5) | S Series (ESP32-S3) |
---|---|---|
Entry Level | ESP-Hi: 0.96” 160×80 LCD, no codec required | |
Cost-Effective | ESP-Spot: no screen, 5 GHz band | ESP-Gogo & ESP-Eyemoji |
High Performance | | EchoEar: 1.85” 360×360 LCD, dual-microphone array |
Vision Solution Details
Positioning | S Series (ESP32-S3) | P Series (ESP32-P4) |
---|---|---|
Cost-Effective | ESP-SparkBot: DVP VGA/720p camera, 1.54” 240×240 LCD | ESP-Brookesia: USB camera, 720p touch screen |
Chip Feature Comparison
Features | ESP32-S3 | ESP32-C3 | ESP32-C5 |
---|---|---|---|
Recommended Scenarios | Multi-modal interaction; voice and vision terminals with display | Cost-effective, lightweight edge applications | Scenarios that benefit from 5 GHz Wi-Fi: interference resistance, network compatibility |
Voice | Multiple microphones, supports acoustic echo cancellation (AEC) | Single microphone, basic audio capture | Single microphone, basic audio capture |
Display | Multiple display options and complex UI interaction | Limited IO resources, restricted display options | Slightly more IO resources than ESP32-C3, restricted display options |
Camera | Supports SPI / DVP / USB cameras | SPI camera only | SPI camera only |
Touch | Built-in touch sensor | No touch sensor, requires an external touch chip | No touch sensor, requires an external touch chip |
Memory | Up to 512 KB SRAM + PSRAM support | 400 KB SRAM | Up to 384 KB SRAM + PSRAM support |
Local AI Capability | Vector computing instructions + neural network accelerator | Supports lightweight models | Supports lightweight models |
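The comparison above can also be read as a simple filter over chip capabilities. The following is a minimal sketch, not an Espressif tool: the `ChipProfile` class and `candidate_chips` function are hypothetical names, and the values are transcribed only from this table (plus the dual-band note for ESP32-C5), so they are not a substitute for the datasheets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipProfile:
    """One column of the comparison table (hypothetical helper type)."""
    name: str
    cameras: frozenset      # camera interfaces listed in the table
    multi_mic: bool         # multiple microphones with AEC
    builtin_touch: bool     # built-in touch sensor
    psram: bool             # PSRAM support listed
    wifi_5ghz: bool         # 5 GHz band available


# Transcribed from the table above, ordered from lowest-cost to flagship.
CHIPS = [
    ChipProfile("ESP32-C3", frozenset({"SPI"}), False, False, False, False),
    ChipProfile("ESP32-C5", frozenset({"SPI"}), False, False, True, True),
    ChipProfile("ESP32-S3", frozenset({"SPI", "DVP", "USB"}), True, True, True, False),
]


def candidate_chips(camera=None, multi_mic=False, builtin_touch=False,
                    psram=False, wifi_5ghz=False):
    """Return the names of chips whose table column meets every requested feature."""
    result = []
    for chip in CHIPS:
        if camera is not None and camera not in chip.cameras:
            continue
        if multi_mic and not chip.multi_mic:
            continue
        if builtin_touch and not chip.builtin_touch:
            continue
        if psram and not chip.psram:
            continue
        if wifi_5ghz and not chip.wifi_5ghz:
            continue
        result.append(chip.name)
    return result
```

For example, `candidate_chips(camera="DVP")` leaves only ESP32-S3, while `candidate_chips(wifi_5ghz=True)` leaves only ESP32-C5. Asking for both a multi-microphone array and 5 GHz returns an empty list, which reflects the trade-off the table describes.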
Category |
Solution Details |
---|---|
Cloud Platform Integration |
RainMaker & Matter:
- Provides rapid cloud platform access capability
|
Cloud Server |
MCP & Multi-platform Access & Offline Deployment:
- Supports multiple cloud services and private deployment solutions
|
Software Framework |
ESP-Brookesia:
- Provides complete software development framework support
|
Application Solutions |
Audio & Image Solutions:
- Provides multiple chip and detailed scenario choices
AI Empowerment:
- ESP-PDD: MIC/SPK/LCD small module for rapid upgrade and evaluation of existing products
- AT Commands: Simple and quick integration
|
- Process Architecture Introduction:
On the edge side, ESP chips mainly handle data collection, preliminary processing, and transmission. Because of their limited processing performance, LLM-related processing still relies on cloud servers. The overall system architecture is as follows:
Edge Tasks |
Cloud Tasks |
---|---|
- Data Collection and Preliminary Processing: Real-time collection and preliminary processing of voice and image data through Espressif chips;
- Local AI Model Inference: Deploy lightweight models for offline or low-latency processing;
- Real-time Transmission: Utilize full-duplex RTC protocol to ensure timely data transmission to the cloud.
|
- Large Model Training and Deep Analysis: Perform complex computation on the collected data to provide intelligent decision support;
- Algorithm Updates and Optimization: Use cloud computing resources for model iteration and feed the results back to the edge in real time;
- Remote Updates and Maintenance: Keep the system continuously updated and maintained through OTA and other methods.
|
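The division of labor above can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: a crude energy threshold stands in for the edge-side preliminary processing (a real device would use a wake-word or VAD model), and `send_to_cloud` is a hypothetical callback standing in for the RTC uplink. None of the names below come from an Espressif SDK.

```python
from typing import Callable, Iterable, List

Frame = List[int]  # one frame of 16-bit PCM samples


def frame_energy(frame: Frame) -> float:
    """Mean squared amplitude, used here as a crude voice-activity measure."""
    if not frame:
        return 0.0
    return sum(sample * sample for sample in frame) / len(frame)


def run_edge_pipeline(frames: Iterable[Frame],
                      send_to_cloud: Callable[[Frame], None],
                      energy_threshold: float = 1.0e4) -> int:
    """Collect frames, drop near-silent ones on the edge, forward the rest.

    Returns the number of frames forwarded to the cloud.
    """
    sent = 0
    for frame in frames:
        # Preliminary processing on the edge: skip frames below the threshold.
        if frame_energy(frame) < energy_threshold:
            continue
        # Transmission: hand the frame to the (hypothetical) uplink.
        send_to_cloud(frame)
        sent += 1
    return sent


# Usage: two silent frames and two loud frames; only the loud ones are sent.
uplink: List[Frame] = []
silence = [0] * 160
speech = [1000, -1000] * 80
frames_sent = run_edge_pipeline([silence, speech, silence, speech], uplink.append)
```

The point of the sketch is the split itself: cheap filtering stays on the device, and only the data worth analyzing reaches the cloud, where the large model runs.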
- Private Deployment Value:
Accelerates testing and improves stability
Combined with embedded devices, delivers a low-latency, high-privacy smart home experience
Downstream enterprises can quickly integrate intelligent interaction capabilities by purchasing a complete deployment solution, which lowers technical barriers and shortens product implementation cycles
Private Deployment Architecture:

Common LLM Application Scenarios
Application Scenario |
Product Form |
Recommended Solution |
---|---|---|
Plush/Desktop Pet Toys |
- With Screen: Supports expression/interaction display
- Without Screen: Pure voice and action interaction
|
- ESP-Eyemoji: Cute expressions, voice interaction
- ESP-Spot: Lightweight without screen, supports gestures
|
Smart Speaker |
- Single Mic: Near-field interaction
- Dual Mic: Sound source localization
|
- ESP-Hi: Ultra-low cost single mic solution
- EchoEar: Dual microphone array, far-field wake-up
|
Automotive Applications |
- Single Screen: Driving information, cute eye expression display
- Dual Screen: Driving information, dual eye expression display
|
- EchoEar: Single-screen voice interaction
- ESP-Gogo & ESP-Eyemoji: Dual screen voice interaction
|
AI Empowerment for Existing Products |
SPK/MIC/LCD Small Module |
- ESP-PDD: Rapid AI product formation/evaluation
- AT Commands: Simple and quick integration
|
ESP-Spot
ESP-Spot: ESP-Spot is a core module for AI voice and motion interaction based on ESP32-S3 / ESP32-C5, focusing on voice interaction, AI perception, and intelligent control. Besides offline voice wake-up and AI dialogue, it can detect touches on a doll through the ESP32-S3’s built-in touch/proximity sensing peripherals. A built-in accelerometer recognizes postures and motions, enabling richer interactions.
Related Links:
Related Videos: https://www.bilibili.com/video/BV1ekRAYVEZ1/?share_source=copy_web&vd_source=819d03f30389b111c508986ee27e0332
Features:
Screenless solution focused on voice and motion interaction
Low cost, with no display
Can be extended with a panel into the dual-screen ESP-Gogo and ESP-Eyemoji
Adapted to both ESP32-S3 and ESP32-C5; the ESP32-C5 version can showcase 5 GHz Wi-Fi
ESP-SparkBot
ESP-SparkBot: ESP-SparkBot is based on ESP32-S3 and integrates voice interaction, image recognition, and multimedia entertainment. It can turn into a remote-control car, run local AI features, and support large model dialogue, real-time video transmission, and HD video projection. Powerful performance, endless fun!
Related Links:
ESP-Hi
ESP-Hi: ESP-Hi is a highly integrated AI voice solution based on ESP32-C3. It uses the ESP32-C3’s built-in ADC for microphone input and I2S PDM directly for audio output, which keeps the board-level BOM cost low.
Description:
High Integration: The ESP32-C3’s built-in ADC captures the microphone signal and I2S PDM drives the audio output directly, so no external codec chip is needed and the board-level BOM cost stays low.
Low Resource Usage: Audio input and output use only 4 IO pins and very little CPU and memory, leaving ample resources for application development.
Multiple Interaction Methods: Equipped with a screen and LED indicators; supports buttons, shaking, and voice wake-up.
Related Links:
Code Repository: Updating
Related Videos: https://www.bilibili.com/video/BV1BHJtz6E2S/
Features:
Currently the AI voice solution with the lowest board-level BOM cost
Lightweight wake-word model for ESP32-C3 with offline wake-up support
ESP-P4 Phone
ESP-P4 Phone: A handheld device solution with a screen, based on ESP32-P4. Combined with the Phone UI of ESP-Brookesia, it delivers an Android-like experience.
Hardware:
Main Control: ESP32-P4
Wi-Fi: ESP32-C6
LCD & Touch: 720P MIPI-DSI ILI9881 & GT911
Audio: ES8311
Type-C: USB2.0
Related Links:
Code Repository: https://gitlab.espressif.cn:6688/ae_group/tools/esp-dev-tools/-/tree/feat/embedded_world_demo?ref_type=heads
Related Videos: To be added
Features:
720P high-resolution touch screen
“ESP32-C6 + ESP-Hosted” Wi-Fi solution
Android-like interface with apps for common functions such as network configuration
EchoEar
EchoEar: A customized version of the AI speaker for DouBao. It is a rechargeable device with a round screen that supports dual-microphone sound source localization. Paired with a turntable, it can rotate to face different directions, and it also has touch and battery functions.
Hardware:
ESP32-S3
Related Links:
Related Videos: Updating
LLM Hardware Solution Summary
Audio Input Solution Comparison Table
Solution No. | Solution Type | Resource Usage | Cost | Effects and Recommended Application Scenarios |
---|---|---|---|---|
1 | Digital Microphone (MSM261S4030H0R, etc.) | 1 I2S (3 pins) | High | |
2 | Dedicated Audio ADC + Analog Microphone (ES7210) | 1 I2S + 1 I2C (5 pins) | Medium-High | |
3 | CODEC + Analog Microphone (ES8311, etc.) | 1 I2S + 1 I2C (5 pins) | Medium | |
4 | Internal ADC + Op-amp + Analog Microphone | 1 Internal ADC (1 pin) | Lowest | |
Audio Output Solution Comparison Table
Solution No. | Solution Type | Resource Usage | Cost | Effects and Recommended Application Scenarios |
---|---|---|---|---|
1 | I2S Digital Amplifier (MAX98357A, etc.) | 1 I2S + 1 PA control (4 pins) | Medium | |
2 | CODEC + Analog Amplifier (ES8311 + NS4150) | 1 I2S + 1 I2C + 1 PA control (6 pins) | Low | |
3 | I2S PDM + Analog Amplifier (NS4150) | 2 I2S pins + 1 PA control (3 pins) | Lowest | |
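The pin counts in the audio input and output tables can be combined into a rough GPIO budget for a given pairing. The sketch below is illustrative only: the dictionary keys and the `pin_budget` helper are hypothetical names, the numbers are copied from the two tables, and the plain sum is an upper bound because it ignores any I2S or I2C lines that an input and output solution might share.

```python
# Pin counts copied from the audio input comparison table.
AUDIO_INPUT_PINS = {
    "digital_mic": 3,    # 1 I2S
    "audio_adc": 5,      # ES7210: 1 I2S + 1 I2C
    "codec": 5,          # ES8311: 1 I2S + 1 I2C
    "internal_adc": 1,   # internal ADC + op-amp + analog microphone
}

# Pin counts copied from the audio output comparison table.
AUDIO_OUTPUT_PINS = {
    "i2s_amp": 4,        # MAX98357A: 1 I2S + 1 PA control
    "codec_amp": 6,      # ES8311 + NS4150: 1 I2S + 1 I2C + 1 PA control
    "pdm_amp": 3,        # I2S PDM + NS4150: 2 I2S pins + 1 PA control
}


def pin_budget(input_solution: str, output_solution: str) -> int:
    """Upper bound on GPIOs for one pairing: plain sum, no bus sharing assumed."""
    return AUDIO_INPUT_PINS[input_solution] + AUDIO_OUTPUT_PINS[output_solution]


def fewest_pins_pairing():
    """Return the (input, output) pairing with the smallest pin budget."""
    pairings = [(i, o) for i in AUDIO_INPUT_PINS for o in AUDIO_OUTPUT_PINS]
    return min(pairings, key=lambda pair: pin_budget(*pair))


lowest = fewest_pins_pairing()
lowest_pins = pin_budget(*lowest)
```

The smallest budget is the internal ADC input paired with the I2S PDM output, at 1 + 3 = 4 pins, which agrees with the 4 IO pins quoted for ESP-Hi earlier in this document.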
The audio input and output solutions above can be freely combined. For customers still in the design phase, however, only the following combinations are recommended when cost and performance are taken into account:
Voice Solution Recommendation Comparison Table
Category | Type Description | Solution Features | Recommended Hardware | Reference Development Board |
---|---|---|---|---|
Optimal Cost | Analog Microphone + Op-amp + Internal ADC Audio Capture + I2S PDM Output | | ESP32-C3 / ESP32-C5 | ESP-Hi |
Balanced Choice | Single Microphone + External Codec Chip (e.g., ES8311) | | ESP32-S3 / ESP32-P4 / ESP32-C5 | |
Best Performance | Multiple Microphones + External Codec + External Audio ADC Chip (e.g., ES8311 + ES7210) | | ESP32-S3 / ESP32-P4 | |
AI Vision Hardware Solution Comparison Table
Interface Type | Camera Performance | Supported Chips | Reference Development Board | Features |
---|---|---|---|---|
SPI | Low Resolution (max 320×240) | ESP32-S3 / ESP32-C5 / ESP32-C3 | ESP32-S3-EYE | |
DVP (Parallel) | Low to Medium Resolution (e.g., VGA, 720p) | ESP32-S3 / ESP32-P4 | ESP32-S3-EYE | |
USB | Low to High Resolution (depends on USB camera) | ESP32-S3 / ESP32-P4 | ESP32-S3-USB-OTG | |
MIPI CSI | High Resolution (e.g., 1080p, 4K) | ESP32-P4 | ESP32-P4-Function-EV-Board | |
In summary, ESP32-C3 and ESP32-C5 support only SPI cameras and are currently adapted to the BD3901 camera. For ESP32-S3 designs that need higher image clarity, a DVP camera is the first recommendation; for ESP32-P4, a MIPI camera is the first recommendation.
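The recommendation in the summary can be written as a small lookup. This is a sketch, not an official selection tool: `recommend_camera` is a hypothetical helper, the first-choice interface per chip follows the summary sentence, and the only hard limit it checks is the 320×240 maximum listed for SPI cameras, since the other rows give only example resolutions.

```python
SPI_MAX_RESOLUTION = (320, 240)  # maximum listed for SPI cameras in the table


def recommend_camera(chip: str, width: int, height: int) -> str:
    """Return the first-choice camera interface for a chip and frame size.

    Raises ValueError for an unknown chip, or when an SPI-only chip is asked
    for a frame larger than the SPI maximum from the table.
    """
    name = chip.strip().upper()
    if name in ("ESP32-C3", "ESP32-C5"):
        # These chips support SPI cameras only.
        max_w, max_h = SPI_MAX_RESOLUTION
        if width > max_w or height > max_h:
            raise ValueError(
                f"{name} supports SPI cameras only (max {max_w}x{max_h})"
            )
        return "SPI"
    if name == "ESP32-S3":
        return "DVP"
    if name == "ESP32-P4":
        return "MIPI CSI"
    raise ValueError(f"no camera recommendation for {chip!r}")
```

For example, `recommend_camera("ESP32-S3", 640, 480)` returns `"DVP"`, while requesting VGA on ESP32-C3 raises an error, which signals that the design needs either a lower resolution or a different chip.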
Edge AI Information Sharing
Voice Capability Enhancement
- Wake word support related repository references:
TTS supports low-cost custom wake words (expected to be released by end of May)
Vision Capability Enhancement
Supports local YOLO model operation
Can implement basic object detection functionality
Large Model Integration Examples