Introduction to LLM Solution

Note

This document is automatically translated using AI. Please excuse any detailed errors. The official English version is still in progress.

LLM Solution Overview

Solution Features:

The rise of large models such as ChatGPT has driven a global AI boom: cloud platforms are becoming more intelligent, and AI technology continues to spread across industries. Espressif has built a solid technical foundation on its open, shared, ecosystem-based intelligent hardware platform.
Espressif provides reference solutions for voice and vision large models, covering the low-cost ESP32-C3, the flagship ESP32-S3, the high-performance ESP32-P4, and the dual-band ESP32-C5.
Espressif is also working on private deployment of large language models to help customers bring products to market faster.

Solution Overview Diagram:

Voice Solution Details

Voice Solution Product Matrix
Positioning	C Series (ESP32-C3/C5)	S Series (ESP32-S3)
Entry Level	ESP-Hi - 0.96” 160*80 LCD - No Codec Required	
Cost-Effective	ESP-Spot - No Screen - 5 GHz Band	ESP-Gogo & ESP-Eyemoji - 1.85” 360*360 & 0.71” 160*160 LCD - Dual HD Screens
High Performance		EchoEar - 1.85” 360*360 LCD - Dual Microphone Array

Vision Solution Details

Vision Solution Product Matrix
Positioning	S Series (ESP32-S3)	P Series (ESP32-P4)
Cost-Effective	ESP-SparkBot - DVP VGA/720p Camera - 1.54” 240*240 LCD	ESP-Brookesia - USB Camera - 720P Touch Screen

Chip Feature Comparison

ESP32-S3 vs ESP32-C3 vs ESP32-C5 LLM Application Selection Comparison
Features	ESP32-S3	ESP32-C3	ESP32-C5
Recommended Scenarios	Multi-modal interaction, voice and vision terminals with display	Cost-effective lightweight edge applications	Scenarios that benefit from 5 GHz Wi-Fi: interference resistance, network compatibility
Voice	Multiple microphones, supports acoustic echo cancellation (AEC)	Single microphone solution, basic audio capture	Single microphone solution, basic audio capture
Display	Multiple display solutions and complex interface interaction	Limited IO resources, restricted display solutions	Slightly more IO resources than C3, restricted display solutions
Camera	Supports SPI / DVP / USB cameras	Only supports SPI camera	Only supports SPI camera
Touch	Built-in Touch sensor	No Touch, requires external touch chip	No Touch, requires external touch chip
Memory	Up to 512KB SRAM + PSRAM support	400KB SRAM	Up to 384KB SRAM + PSRAM support
Local AI Capability	Vector computing instructions + neural network accelerator	Supports lightweight models	Supports lightweight models
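
The comparison above can be used as a quick first-pass filter during chip selection. The following is a minimal sketch, not an Espressif tool: `ChipProfile`, its field names, and `candidates` are hypothetical, and the values are only a simplified reading of the table rows (camera interfaces, built-in touch, PSRAM support, multiple microphones, 5 GHz Wi-Fi).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipProfile:
    """Simplified, hypothetical encoding of one column of the comparison table."""
    name: str
    camera_interfaces: frozenset
    builtin_touch: bool
    psram: bool
    multi_mic: bool
    wifi_5ghz: bool


# Values transcribed from the table above; anything not listed there is omitted.
CHIPS = [
    ChipProfile("ESP32-S3", frozenset({"SPI", "DVP", "USB"}), True, True, True, False),
    ChipProfile("ESP32-C3", frozenset({"SPI"}), False, False, False, False),
    ChipProfile("ESP32-C5", frozenset({"SPI"}), False, True, False, True),
]


def candidates(camera=None, need_touch=False, need_psram=False,
               need_multi_mic=False, need_5ghz=False):
    """Return the names of chips that satisfy every requested requirement."""
    result = []
    for chip in CHIPS:
        if camera is not None and camera not in chip.camera_interfaces:
            continue
        if need_touch and not chip.builtin_touch:
            continue
        if need_psram and not chip.psram:
            continue
        if need_multi_mic and not chip.multi_mic:
            continue
        if need_5ghz and not chip.wifi_5ghz:
            continue
        result.append(chip.name)
    return result


if __name__ == "__main__":
    print(candidates(camera="DVP"))      # chips listed with DVP camera support
    print(candidates(need_5ghz=True))    # chips listed with 5 GHz Wi-Fi
    print(candidates(need_psram=True))   # chips listed with PSRAM support
```

A filter like this only narrows the list. Cost, IO count, and the reference designs described below still decide the final choice.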

LLM Solution Product Matrix Overview
Category	Solution Details
Cloud Platform Integration	RainMaker & Matter: - Provides rapid access to cloud platforms
Cloud Server	MCP & Multi-platform Access & Offline Deployment: - Supports multiple cloud services and private deployment solutions
Software Framework	ESP-Brookesia: - Provides complete software development framework support
Application Solutions	Audio & Image Solutions: - Provides a choice of chips and detailed scenarios; AI Empowerment: - ESP-PDD: MIC/SPK/LCD small module for rapid upgrade and evaluation of existing products - AT Commands: Simple and quick integration

Process Architecture Introduction: As the edge side, ESP chips mainly handle data collection, preliminary processing, and transmission. Because of their limited processing performance, LLM-related processing still relies on cloud servers. The overall system architecture is as follows:

Edge and Cloud Task Division
Edge Tasks	Cloud Tasks
- Data Collection and Preliminary Processing: Collect and pre-process voice and image data in real time on Espressif chips; - Local AI Model Inference: Deploy lightweight models for offline or low-latency processing; - Real-time Transmission: Use a full-duplex RTC protocol to deliver data to the cloud promptly.	- Large Model Training and Deep Analysis: Perform complex computation on the collected data to support intelligent decisions; - Algorithm Updates and Optimization: Use cloud computing resources for model iteration, with real-time feedback to the edge; - Remote Updates and Maintenance: Keep the system continuously updated and maintained through OTA and other methods.
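
To make the edge-side "collect, pre-process, transmit" steps concrete, the sketch below estimates the raw audio data rate and splits a captured buffer into fixed-size frames for streaming. It is an illustration only: the 16 kHz, 16-bit, mono format and the 20 ms frame length are assumptions, not values stated in this document, and the function names are hypothetical.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw (uncompressed) PCM data rate in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def frame_size_bytes(bytes_per_second: int, frame_ms: int) -> int:
    """Number of bytes in one frame of the given duration."""
    return bytes_per_second * frame_ms // 1000


def split_into_frames(pcm: bytes, frame_bytes: int) -> list:
    """Split a PCM buffer into whole frames, dropping any trailing partial frame."""
    usable = len(pcm) - len(pcm) % frame_bytes
    return [pcm[i:i + frame_bytes] for i in range(0, usable, frame_bytes)]


if __name__ == "__main__":
    # Assumed capture format: 16 kHz, 16-bit, mono, sent in 20 ms frames.
    rate = pcm_bytes_per_second(16000, 16, 1)
    frame = frame_size_bytes(rate, 20)
    one_second = bytes(rate)  # one second of silence as stand-in capture data
    frames = split_into_frames(one_second, frame)
    print(rate, frame, len(frames))  # prints: 32000 640 50
```

Under these assumptions, one microphone channel produces 32,000 bytes of raw PCM per second, which is the amount the edge device has to pre-process or encode before sending it to the cloud.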

Private Deployment Value:

Accelerates testing and improves stability
Combined with embedded devices, delivers a low-latency, high-privacy smart home experience
Downstream enterprises can quickly integrate intelligent interaction capabilities by purchasing a complete deployment solution, which lowers technical barriers and shortens product implementation cycles

Private Deployment Architecture:

Common LLM Application Scenarios

Common LLM Application Scenario Classification
Application Scenario	Product Form	Recommended Solution
Plush/Desktop Pet Toys	- With Screen: Supports expression/interaction display - Without Screen: Pure voice and action interaction	- ESP-Eyemoji: Cute expressions, voice interaction - ESP-Spot: Lightweight without screen, supports gestures
Smart Speaker	- Single Mic: Near-field interaction - Dual Mic: Sound source localization	- ESP-Hi: Ultra-low cost single mic solution - EchoEar: Dual microphone array, far-field wake-up
Automotive Applications	- Single Screen: Driving information, cute eye expression display - Dual Screen: Driving information, dual eye expression display	- EchoEar: Single-screen voice interaction - ESP-Gogo & ESP-Eyemoji: Dual screen voice interaction
AI Empowerment for Existing Products	SPK/MIC/LCD Small Module	- ESP-PDD: Rapid AI product formation/evaluation - AT Commands: Simple and quick integration

ESP-Spot

ESP-Spot: ESP-Spot is an AI voice and motion interaction core module based on ESP32-S3 / ESP32-C5, focusing on voice interaction, AI perception, and intelligent control. In addition to offline voice wake-up and AI dialogue, it can sense touches on a doll through the ESP32-S3’s built-in touch/proximity sensing peripherals. A built-in accelerometer recognizes postures and motions, enabling richer interactions.

Related Links:

Features:

Screenless solution focused on voice and motion interaction
Low cost, with no display
Can be expanded with a panel into the dual-screen ESP-Gogo and ESP-Eyemoji
Supports both ESP32-S3 and ESP32-C5; the ESP32-C5 version can use the 5 GHz band

ESP-SparkBot

ESP-SparkBot: ESP-SparkBot is based on ESP32-S3 and integrates voice interaction, image recognition, and multimedia entertainment. It can transform into a remote-controlled car, run local AI, and support large model dialogue, real-time video transmission, and HD video projection. Powerful performance, endless fun!

Related Links:

ESP-Hi

ESP-Hi: ESP-Hi is a highly integrated AI voice solution based on ESP32-C3. It uses the ESP32-C3’s built-in ADC for microphone capture and I2S PDM directly for audio output, achieving a low board-level material cost.

Description:

High Integration: Uses the ESP32-C3’s built-in ADC for microphone capture and I2S PDM directly for audio output, eliminating the need for an external codec chip and lowering board-level material cost.
Low Resource Usage: Audio input and output use only 4 IO pins and very little CPU and memory, leaving sufficient resources for application development.
Multiple Interaction Methods: Equipped with a screen and LED indicators; supports buttons, shaking, and voice wake-up.

Related Links:

Code Repository: Updating
Related Videos: https://www.bilibili.com/video/BV1BHJtz6E2S/

Features:

Currently the AI voice solution with the lowest board-level material cost
Lightweight wake word model for ESP32-C3, supporting offline wake-up

ESP-P4 Phone

ESP-P4 Phone: A handheld device solution with a screen, based on ESP32-P4. Combined with the Phone UI of ESP-Brookesia, it provides an Android-like system experience.

Hardware:

Main Control: ESP32-P4
Wi-Fi: ESP32-C6
LCD & Touch: 720P MIPI-DSI ILI9881 & GT911
Audio: ES8311
Type-C: USB2.0

Related Links:

Code Repository: https://gitlab.espressif.cn:6688/ae_group/tools/esp-dev-tools/-/tree/feat/embedded_world_demo?ref_type=heads
Related Videos: To be added

Features:

720P high-resolution touch screen
“ESP32-C6 + ESP-Hosted” Wi-Fi solution
Android-like system experience, with apps for common functions (such as network configuration)

EchoEar

EchoEar: EchoEar (Meow Companion) is an AI development kit created by Espressif in collaboration with the Volcano Engine Coze large model team. It is suited to voice interaction products that need large model capabilities, such as toys, smart speakers, and smart control centers. The device is equipped with the ESP32-S3-WROOM-1 module, a 1.85-inch round QSPI touch screen, and a dual microphone array, and supports offline voice wake-up and sound source localization algorithms. Combined with the large model capabilities provided by Volcano Engine, EchoEar can achieve full-duplex voice interaction, multimodal recognition, and agent control, giving developers a solid foundation for building a complete on-device AI application experience.

Hardware:

ESP32-S3

Related Links:

Code Repository: https://github.com/espressif/esp-brookesia/tree/master/products/speaker
Related Video: https://www.bilibili.com/video/BV17AMwzBEqG/?spm_id_from=333.1387.homepage.video_card.click
JLCPCB: https://oshwhub.com/esp-college/echoear

LLM Hardware Solution Summary

Audio Input Solution Comparison Table

Solution No.	Solution Type	Resource Usage	Cost	Effects and Recommended Application Scenarios
1	Digital Microphone (MSM261S4030H0R etc.)	1 I2S (3 pins)	High	Simple wiring; cannot achieve echo cancellation; suitable for DIY / board area limited scenarios
2	Dedicated Audio ADC + Analog Microphone (ES7210)	1 I2S + 1 I2C (5 pins)	Medium-High	Recommended for multiple microphone scenarios
3	CODEC + Analog Microphone (ES8311 etc.)	1 I2S + 1 I2C (5 pins)	Medium	Low cost; good effect; recommended
4	Internal ADC + Op-amp + Analog Microphone	1 Internal ADC (1 pin)	Lowest	Lowest cost audio input solution; meets basic audio input requirements

Audio Output Solution Comparison Table

Solution No.	Solution Type	Resource Usage	Cost	Effects and Recommended Application Scenarios
1	I2S Digital Amplifier (MAX98357A etc.)	1 I2S + 1 PA control (4 pins)	Medium	Good effect but no volume control; suitable for products only requiring audio output
2	CODEC + Analog Amplifier (ES8311 + NS4150)	1 I2S + 1 I2C + 1 PA control (6 pins)	Low	Optimal cost and effect; recommended for AI audio
3	I2S PDM + Analog Amplifier (NS4150)	2 I2S pins + 1 PA control (3 pins)	Lowest	Lowest cost but uses CPU resources; suitable for cost-sensitive scenarios

The audio solutions above can be freely combined. For customers still in the design phase, however, only the following combinations are recommended on cost and performance grounds:

Voice Solution Recommendation Comparison Table

Category	Type Description	Solution Features	Recommended Hardware	Reference Development Board
Optimal Cost	Analog Mic + Op-amp + Internal ADC Audio Capture + I2S PDM Output	Single microphone input, mono playback; implements basic audio capture and playback; minimum IO usage; suitable for close-range dialogue and low-cost applications	ESP32-C3 / ESP32-C5	ESP-Hi
Balanced Choice	Single Microphone + External Codec Chip (e.g., ES8311)	Supports echo cancellation; good single microphone input effect; balanced performance and cost	ESP32-S3 / ESP32-P4 / ESP32-C5	ESP32-P4-Function-EV-Board
Best Performance	Multiple Microphones + External Codec + External Audio ADC Chip (e.g., ES8311 + ES7210)	Supports far-field voice wake-up; supports echo cancellation and noise suppression; suitable for high-performance voice applications	ESP32-S3 / ESP32-P4	ESP32-S3-Korvo-2
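
The pin counts in the audio input and output tables can be tallied for any combination. The sketch below is a rough aid, assuming the per-solution pin counts exactly as listed in those tables; the dictionary keys and `pin_upper_bound` are hypothetical names. It adds the two counts without modelling bus sharing, so for combinations that reuse one I2S or I2C bus for both directions the result is only an upper bound.

```python
# Pin counts transcribed from the audio input comparison table.
AUDIO_INPUT_PINS = {
    "digital_mic": 3,              # 1 I2S
    "audio_adc_analog_mic": 5,     # 1 I2S + 1 I2C (ES7210)
    "codec_analog_mic": 5,         # 1 I2S + 1 I2C (ES8311 etc.)
    "internal_adc_analog_mic": 1,  # 1 internal ADC channel
}

# Pin counts transcribed from the audio output comparison table.
AUDIO_OUTPUT_PINS = {
    "i2s_digital_amp": 4,     # 1 I2S + 1 PA control
    "codec_analog_amp": 6,    # 1 I2S + 1 I2C + 1 PA control
    "i2s_pdm_analog_amp": 3,  # 2 I2S pins + 1 PA control
}


def pin_upper_bound(input_solution: str, output_solution: str) -> int:
    """Sum the listed pin counts; shared buses may reduce the real total."""
    return AUDIO_INPUT_PINS[input_solution] + AUDIO_OUTPUT_PINS[output_solution]


if __name__ == "__main__":
    # Internal ADC input + I2S PDM output, the "Optimal Cost" combination.
    print(pin_upper_bound("internal_adc_analog_mic", "i2s_pdm_analog_amp"))  # prints: 4
    # Codec on both sides: upper bound only, since the buses can be shared.
    print(pin_upper_bound("codec_analog_mic", "codec_analog_amp"))           # prints: 11
```

The first result matches the 4 IO pins quoted for audio in the ESP-Hi description.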

AI Vision Hardware Solution Comparison Table

Interface Type	Camera Performance	Supported Chips	Reference Development Board	Features
SPI	Low Resolution Image (max 320×240)	ESP32-S3 / ESP32-C5 / ESP32-C3	ESP32-S3-EYE	Simple interface, suitable for beginners; lowest cost; limited frame rate and resolution; mainly used for image recognition and simple detection
DVP (Parallel)	Low to Medium Resolution (e.g., VGA, 720p)	ESP32-S3 / ESP32-P4	ESP32-S3-EYE	Older but mature interface; can support basic video streaming; high resource usage (requires many GPIOs); medium imaging quality, suitable for medium visual tasks
USB	Low to High Resolution (depends on USB camera)	ESP32-S3 / ESP32-P4	ESP32-S3-USB-OTG	Plug and play with UVC cameras; supports high resolution and high frame rate; complex software support (requires USB Host capability); suitable for applications requiring high image quality
MIPI CSI	High Resolution Image (e.g., 1080p, 4K)	ESP32-P4	ESP32-P4-Function-EV-Board	High bandwidth, low power consumption; high hardware requirements (requires MIPI PHY); supports HD, low-latency image capture; suitable for edge AI, visual analysis, and recognition scenarios

In summary: ESP32-C3 and ESP32-C5 support only SPI cameras and are currently adapted to the BD3901 camera. For ESP32-S3 designs that need higher image clarity, a DVP camera is the first recommendation; for ESP32-P4, a MIPI camera is the first recommendation.
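
The resolution limits above tie directly to memory. As a back-of-the-envelope check, the sketch below compares the size of one raw frame with the on-chip SRAM figures from the chip comparison table. It assumes an uncompressed RGB565 frame (2 bytes per pixel), which is an assumption rather than something stated in this document, and the names are hypothetical. The comparison is optimistic, because SRAM is shared with the application, network stack, and audio buffers.

```python
# On-chip SRAM sizes from the "Memory" row of the chip comparison table.
SRAM_BYTES = {
    "ESP32-S3": 512 * 1024,
    "ESP32-C3": 400 * 1024,
    "ESP32-C5": 384 * 1024,
}

# Resolutions mentioned in the vision tables.
RESOLUTIONS = {
    "QVGA": (320, 240),
    "VGA": (640, 480),
    "720p": (1280, 720),
}


def raw_frame_bytes(width: int, height: int, bytes_per_pixel: int = 2) -> int:
    """Size of one uncompressed frame; 2 bytes per pixel assumes RGB565."""
    return width * height * bytes_per_pixel


def fits_in_sram(chip: str, resolution: str, bytes_per_pixel: int = 2) -> bool:
    """True if a single raw frame is no larger than the chip's total SRAM."""
    width, height = RESOLUTIONS[resolution]
    return raw_frame_bytes(width, height, bytes_per_pixel) <= SRAM_BYTES[chip]


if __name__ == "__main__":
    for name, (w, h) in RESOLUTIONS.items():
        print(name, raw_frame_bytes(w, h))
    # prints:
    # QVGA 153600
    # VGA 614400
    # 720p 1843200
    print(fits_in_sram("ESP32-C3", "QVGA"))  # prints: True
    print(fits_in_sram("ESP32-S3", "VGA"))   # prints: False
```

Under this assumption, a single raw VGA frame already exceeds the 512 KB of SRAM on ESP32-S3, so higher resolutions depend on PSRAM or on compressed formats from the camera.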
