Introduction to LLM Solution

Note

This document is automatically translated using AI. Please excuse any detailed errors. The official English version is still in progress.

LLM Solution Overview

Solution Features:
  • The rise of large models such as ChatGPT has driven a global AI boom: cloud platforms are becoming more intelligent, and AI technology continues to spread across industries. Espressif has built a solid technical foundation on its open, shared, ecosystem-oriented smart hardware platform.

  • Espressif provides reference solutions for voice and vision large models, covering the low-cost ESP32-C3, the flagship ESP32-S3, the high-performance ESP32-P4, and the dual-band ESP32-C5.

  • Espressif is also working on private deployment of large language models to help customers bring products to market faster.

Solution Overview Diagram:

LLM Solution Overview Diagram

Voice Solution Details

Voice Solution Product Matrix

| Positioning | C Series (ESP32-C3/C5) | S Series (ESP32-S3) |
|---|---|---|
| Entry Level | ESP-Hi: 0.96” 160×80 LCD, no codec required | - |
| Cost-Effective | ESP-Spot: no screen, 5 GHz band | ESP-Gogo & ESP-Eyemoji: 1.85” 360×360 and 0.71” 160×160 LCD, dual HD screens |
| High Performance | - | EchoEar: 1.85” 360×360 LCD, dual microphone array |

Vision Solution Details

Vision Solution Product Matrix

| Positioning | S Series (ESP32-S3) | P Series (ESP32-P4) |
|---|---|---|
| Cost-Effective | ESP-SparkBot: DVP VGA/720p camera, 1.54” 240×240 LCD | ESP-Brookesia: USB camera, 720p touch screen |

Chip Feature Comparison

ESP32-S3 vs ESP32-C3 vs ESP32-C5 LLM Application Selection Comparison

| Features | ESP32-S3 | ESP32-C3 | ESP32-C5 |
|---|---|---|---|
| Recommended Scenarios | Multi-modal interaction; voice and vision terminals with a display | Cost-effective, lightweight edge applications | Scenarios that benefit from 5 GHz Wi-Fi: interference resistance, network compatibility |
| Voice | Multiple microphones; supports acoustic echo cancellation (AEC) | Single-microphone solution, basic audio capture | Single-microphone solution, basic audio capture |
| Display | Multiple display solutions and complex UI interaction | Limited IO resources, restricted display options | Slightly more IO resources than ESP32-C3, restricted display options |
| Camera | SPI / DVP / USB cameras | SPI camera only | SPI camera only |
| Touch | Built-in touch sensor | No built-in touch; requires an external touch chip | No built-in touch; requires an external touch chip |
| Memory | Up to 512 KB SRAM + PSRAM support | 400 KB SRAM | Up to 384 KB SRAM + PSRAM support |
| Local AI Capability | Vector computing instructions + neural network accelerator | Supports lightweight models | Supports lightweight models |
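
The comparison above can be read as a filter over product requirements. The following Python sketch encodes a few of its rows (camera interfaces, built-in touch, PSRAM, multi-microphone AEC, 5 GHz Wi-Fi) and returns the chips that satisfy a given set of needs. The `ChipProfile` type, the `candidate_chips` helper, and all field names are illustrative assumptions for this example and are not part of any Espressif tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipProfile:
    """One column of the comparison, reduced to yes/no capabilities."""
    name: str
    cameras: frozenset      # camera interfaces listed for the chip
    builtin_touch: bool     # built-in touch sensor
    psram: bool             # PSRAM support listed in the Memory row
    multi_mic_aec: bool     # multiple microphones with AEC
    wifi_5ghz: bool         # 5 GHz band available


# Values transcribed from the comparison; ordering (low cost first) is an assumption.
CHIPS = [
    ChipProfile("ESP32-C3", frozenset({"SPI"}), False, False, False, False),
    ChipProfile("ESP32-C5", frozenset({"SPI"}), False, True, False, True),
    ChipProfile("ESP32-S3", frozenset({"SPI", "DVP", "USB"}), True, True, True, False),
]


def candidate_chips(camera=None, touch=False, psram=False, aec=False, wifi_5ghz=False):
    """Return the names of chips that meet every requested capability."""
    matches = []
    for chip in CHIPS:
        if camera is not None and camera not in chip.cameras:
            continue
        if touch and not chip.builtin_touch:
            continue
        if psram and not chip.psram:
            continue
        if aec and not chip.multi_mic_aec:
            continue
        if wifi_5ghz and not chip.wifi_5ghz:
            continue
        matches.append(chip.name)
    return matches
```

For example, `candidate_chips(camera="DVP")` leaves only ESP32-S3, while `candidate_chips(touch=True, wifi_5ghz=True)` returns an empty list, which signals that an external touch chip would be needed on ESP32-C5.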

LLM Solution Product Matrix Overview

| Category | Solution Details |
|---|---|
| Cloud Platform Integration | RainMaker & Matter: rapid access to cloud platforms |
| Cloud Server | MCP, multi-platform access, and offline deployment: supports multiple cloud services and private deployment solutions |
| Software Framework | ESP-Brookesia: complete software development framework support |
| Application Solutions | Audio & image solutions: a choice of multiple chips and detailed scenarios. AI empowerment: ESP-PDD (MIC/SPK/LCD small module for rapid upgrade and evaluation of existing products); AT commands (simple and quick integration) |

Process Architecture Introduction:

On the edge side, ESP chips mainly handle data collection, preliminary processing, and transmission. Because of their limited processing performance, LLM-related processing still relies on cloud servers. The overall system architecture is as follows:

Edge and Cloud Task Division

Edge Tasks:

  • Data Collection and Preliminary Processing: Espressif chips collect and pre-process voice and image data in real time.

  • Local AI Model Inference: Lightweight models are deployed for offline or low-latency processing.

  • Real-time Transmission: A full-duplex RTC protocol ensures that data reaches the cloud promptly.

Cloud Tasks:

  • Large Model Training and Deep Analysis: Complex computation is performed on the collected data to provide intelligent decision support.

  • Algorithm Updates and Optimization: Cloud computing resources are used for model iteration, with results fed back to the edge in real time.

  • Remote Updates and Maintenance: The system is continuously updated and maintained through OTA and other methods.
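
As an illustration of the "preliminary processing" step on the edge, the sketch below splits captured PCM audio into frames and keeps only frames whose level exceeds a threshold, so that silence is not sent to the cloud. It is a simplified energy gate written in Python for readability, not the algorithm used in Espressif's audio front-end. The frame length (320 samples, i.e. 20 ms at an assumed 16 kHz sample rate), the threshold value, and both function names are assumptions chosen for the example.

```python
import math


def frame_rms(samples):
    """Root-mean-square level of one frame of signed 16-bit PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_frames_for_upload(pcm, frame_len=320, rms_threshold=500.0):
    """Split PCM into fixed-size frames and keep those at or above the threshold.

    frame_len and rms_threshold are illustrative values, not tuned settings.
    A trailing partial frame is ignored.
    """
    kept = []
    for start in range(0, len(pcm) - frame_len + 1, frame_len):
        frame = pcm[start:start + frame_len]
        if frame_rms(frame) >= rms_threshold:
            kept.append(frame)
    return kept
```

A real product would normally replace the threshold test with a proper voice activity detector, but the division of work stays the same: the edge decides what is worth transmitting, and the cloud handles the large model.
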
Private Deployment Value:

  • Speeds up testing and improves stability

  • Combined with embedded devices, enables a low-latency, high-privacy smart home experience

  • Downstream enterprises can quickly integrate intelligent interaction capabilities by purchasing a complete deployment solution, which lowers technical barriers and shortens time to market

Private Deployment Architecture:

Private Deployment Architecture Diagram

Common LLM Application Scenarios

Common LLM Application Scenario Classification

| Application Scenario | Product Form | Recommended Solution |
|---|---|---|
| Plush/Desktop Pet Toys | With screen: expression and interaction display. Without screen: voice and motion interaction only | ESP-Eyemoji: cute expressions, voice interaction. ESP-Spot: lightweight, screenless, supports gestures |
| Smart Speaker | Single mic: near-field interaction. Dual mic: sound source localization | ESP-Hi: ultra-low-cost single-mic solution. EchoEar: dual microphone array, far-field wake-up |
| Automotive Applications | Single screen: driving information, cute eye expression display. Dual screen: driving information, dual-eye expression display | EchoEar: single-screen voice interaction. ESP-Gogo & ESP-Eyemoji: dual-screen voice interaction |
| AI Empowerment for Existing Products | SPK/MIC/LCD small module | ESP-PDD: rapid AI product prototyping and evaluation. AT commands: simple and quick integration |

ESP-Spot

  • ESP-Spot: an AI voice and motion interaction core module based on ESP32-S3 / ESP32-C5, focused on voice interaction, AI perception, and intelligent control. In addition to offline voice wake-up and AI dialogue, it can sense touch on a doll through the ESP32-S3’s built-in touch/proximity sensing peripherals. A built-in accelerometer recognizes postures and motions, enabling richer interaction.

Related Links:

Features:

  • Screenless solution, focusing on voice and action interaction

  • Low cost, no display screen

  • The panel can be expanded into the dual-screen ESP-Gogo and ESP-Eyemoji

  • Adapted to both ESP32-S3 and ESP32-C5; the ESP32-C5 version can use the 5 GHz band

ESP-SparkBot

  • ESP-SparkBot: based on ESP32-S3, it integrates voice interaction, image recognition, and multimedia entertainment. It can turn into a remote control car, run local AI, and support large model dialogue, real-time video transmission, and HD video projection. Powerful performance, endless fun!

Related Links:

ESP-Hi

  • ESP-Hi: a highly integrated AI voice solution based on ESP32-C3. It uses the ESP32-C3’s built-in ADC for microphone capture and drives audio output directly through I2S PDM, which keeps the board-level BOM cost low.

Description:

  • High Integration: The ESP32-C3’s built-in ADC captures the microphone signal and I2S PDM drives the audio output directly, so no external codec chip is needed and the board-level BOM cost stays low.

  • Low Resource Usage: Audio input and output use only 4 IO pins and very little CPU and memory, leaving ample resources for application development.

  • Multiple Interaction Methods: Equipped with a screen and LED indicators; supports buttons, shake detection, and voice wake-up.

Related Links:

Features:

  • Currently the AI voice solution with the lowest board-level BOM cost

  • Lightweight wake word model for ESP32-C3, supporting offline wake-up

ESP-P4 Phone

  • ESP-P4 Phone: a handheld device solution with a screen, based on ESP32-P4. Combined with the Phone UI of ESP-Brookesia, it provides an Android-like system experience.

Hardware:

  • Main Control: ESP32-P4

  • Wi-Fi: ESP32-C6

  • LCD & Touch: 720P MIPI-DSI ILI9881 & GT911

  • Audio: ES8311

  • Type-C: USB2.0

Related Links:

Features:

  • 720P high-resolution touch screen

  • “ESP32-C6 + ESP-Hosted” Wi-Fi solution

  • Android-like system experience, with apps for common functions such as network configuration

EchoEar

  • EchoEar: EchoEar (Meow Companion) is a smart AI development kit created by Espressif in collaboration with the Volcano Engine Coze large model team. It is suited to voice interaction products that need large model capabilities, such as toys, smart speakers, and smart control centers. The device is equipped with the ESP32-S3-WROOM-1 module, a 1.85-inch QSPI round touch screen, and a dual microphone array, and it supports offline voice wake-up and sound source localization algorithms. Combined with the large model capabilities provided by Volcano Engine, Meow Companion can achieve full-duplex voice interaction, multimodal recognition, and agent control, giving developers a solid foundation for building a complete on-device AI application experience.

Hardware:

  • ESP32-S3

Related Links:

LLM Hardware Solution Summary

Audio Input Solution Comparison Table

| Solution No. | Solution Type | Resource Usage | Cost | Effects and Recommended Application Scenarios |
|---|---|---|---|---|
| 1 | Digital microphone (MSM261S4030H0R, etc.) | 1 I2S (3 pins) | High | Simple wiring; cannot achieve echo cancellation; suitable for DIY or designs with limited board area |
| 2 | Dedicated audio ADC + analog microphone (ES7210) | 1 I2S + 1 I2C (5 pins) | Medium-High | Recommended for multi-microphone scenarios |
| 3 | Codec + analog microphone (ES8311, etc.) | 1 I2S + 1 I2C (5 pins) | Medium | Low cost; good effect; recommended |
| 4 | Internal ADC + op-amp + analog microphone | 1 internal ADC (1 pin) | Lowest | Lowest-cost audio input solution; meets basic audio input requirements |

Audio Output Solution Comparison Table

| Solution No. | Solution Type | Resource Usage | Cost | Effects and Recommended Application Scenarios |
|---|---|---|---|---|
| 1 | I2S digital amplifier (MAX98357A, etc.) | 1 I2S + 1 PA control (4 pins) | Medium | Good effect but no volume control; suitable for products that only require audio output |
| 2 | Codec + analog amplifier (ES8311 + NS4150) | 1 I2S + 1 I2C + 1 PA control (6 pins) | Low | Optimal balance of cost and effect; recommended for AI audio |
| 3 | I2S PDM + analog amplifier (NS4150) | 2 I2S pins + 1 PA control (3 pins) | Lowest | Lowest cost but uses CPU resources; suitable for cost-sensitive scenarios |

The audio input and output solutions above can be combined freely. For customers who are still in the design phase, however, only the following combinations are recommended on cost and performance grounds:

Voice Solution Recommendation Comparison Table

| Category | Type Description | Solution Features | Recommended Hardware | Reference Development Board |
|---|---|---|---|---|
| Optimal Cost | Analog mic + op-amp + internal ADC for audio capture, I2S PDM output | Single microphone input, mono playback; basic audio capture and playback; minimum IO usage; suitable for close-range dialogue and low-cost applications | ESP32-C3 / ESP32-C5 | ESP-Hi |
| Balanced Choice | Single microphone + external codec chip (e.g., ES8311) | Supports echo cancellation; good single-microphone input quality; balanced performance and cost | ESP32-S3 / ESP32-P4 / ESP32-C5 | ESP32-P4-Function-EV-Board |
| Best Performance | Multiple microphones + external codec + external audio ADC chip (e.g., ES8311 + ES7210) | Supports far-field voice wake-up; supports echo cancellation and noise suppression; suitable for high-performance voice applications | ESP32-S3 / ESP32-P4 | ESP32-S3-Korvo-2 |
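
To make the IO cost of a combination concrete, the sketch below adds up the pin counts quoted in the audio input and output comparison tables. It is a rough upper bound written in Python for illustration only: the dictionary keys and the `pin_budget` helper are hypothetical names, and the sum deliberately ignores bus sharing.

```python
# Pin counts as quoted in the audio input comparison table.
INPUT_PINS = {
    "digital_mic": 3,    # 1 I2S
    "audio_adc": 5,      # 1 I2S + 1 I2C (e.g., ES7210)
    "codec": 5,          # 1 I2S + 1 I2C (e.g., ES8311)
    "internal_adc": 1,   # 1 internal ADC channel
}

# Pin counts as quoted in the audio output comparison table.
OUTPUT_PINS = {
    "i2s_amp": 4,        # 1 I2S + 1 PA control
    "codec_amp": 6,      # 1 I2S + 1 I2C + 1 PA control
    "pdm_amp": 3,        # 2 I2S pins + 1 PA control
}


def pin_budget(audio_in, audio_out):
    """Upper bound on GPIO usage: input pins plus output pins, no bus sharing."""
    return INPUT_PINS[audio_in] + OUTPUT_PINS[audio_out]
```

For the Optimal Cost row, `pin_budget("internal_adc", "pdm_amp")` gives 4, which matches the "only 4 IO" figure quoted for ESP-Hi. When one codec handles both capture and playback, the I2S and I2C buses are shared, so the real pin count is lower than this simple sum.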

AI Vision Hardware Solution Comparison Table

| Interface Type | Camera Performance | Supported Chips | Reference Development Board | Features |
|---|---|---|---|---|
| SPI | Low resolution (max 320×240) | ESP32-S3 / ESP32-C5 / ESP32-C3 | ESP32-S3-EYE | Simple interface, suitable for beginners; lowest cost; limited frame rate and resolution; mainly used for image recognition and simple detection |
| DVP (parallel) | Low to medium resolution (e.g., VGA, 720p) | ESP32-S3 / ESP32-P4 | ESP32-S3-EYE | Older but mature interface; supports basic video streaming; high resource usage (requires many GPIOs); medium image quality, suitable for moderately demanding vision tasks |
| USB | Low to high resolution (depends on the USB camera) | ESP32-S3 / ESP32-P4 | ESP32-S3-USB-OTG | Plug and play with UVC cameras; supports high resolution and high frame rate; complex software support (requires USB Host capability); suitable for applications requiring high image quality |
| MIPI CSI | High resolution (e.g., 1080p, 4K) | ESP32-P4 | ESP32-P4-Function-EV-Board | High bandwidth, low power consumption; high hardware requirements (requires MIPI PHY); supports HD, low-latency image capture; suitable for edge AI, visual analysis, and recognition scenarios |

In summary: ESP32-C3 and ESP32-C5 support only SPI cameras, and the BD3901 camera is currently adapted. For ESP32-S3 designs that need higher image clarity, a DVP camera is the first recommendation; for ESP32-P4, a MIPI camera is the first recommendation.
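
The selection rule in this summary can be sketched as a small lookup. The Python example below walks each chip's interfaces in recommendation order and returns the first one whose indicative resolution covers the request. The resolution limits are taken loosely from the "e.g." values in the table and are assumptions, as are the function and variable names. ESP32-S3 is listed without MIPI CSI here because the chip has no MIPI CSI peripheral.

```python
# Interfaces per chip, first recommendation first (USB as a fallback where supported).
INTERFACES_BY_CHIP = {
    "ESP32-C3": ("SPI",),
    "ESP32-C5": ("SPI",),
    "ESP32-S3": ("DVP", "USB"),
    "ESP32-P4": ("MIPI CSI", "USB"),
}

# Indicative maximum resolution per interface; None means "depends on the camera".
MAX_RESOLUTION = {
    "SPI": (320, 240),
    "DVP": (1280, 720),
    "MIPI CSI": (3840, 2160),
    "USB": None,
}


def recommend_camera(chip, width, height):
    """Return the first suitable camera interface for the chip, or None."""
    for interface in INTERFACES_BY_CHIP[chip]:
        limit = MAX_RESOLUTION[interface]
        if limit is None or (width <= limit[0] and height <= limit[1]):
            return interface
    return None
```

With these assumed limits, a VGA request on ESP32-S3 resolves to DVP, a 1080p request on ESP32-S3 falls back to USB, and a VGA request on ESP32-C5 returns `None`, which indicates that the requirement exceeds what an SPI camera offers.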

AI Perception Solution

Tactile Perception

Adding tactile interaction to AI products can significantly improve the user experience. In categories such as plush toys and desktop companions, capacitive touch sensors are recommended for implementing “touch perception”.

  • Experience Advantages: The sensor can be hidden in the structure so that it is unnoticeable (“zero presence”), and it triggers without forceful pressing, which makes interaction more natural. It suits children and other light-touch scenarios.

  • Chip Support: ESP32-S3 / ESP32-P4 have built-in capacitive touch sensing peripherals that can drive touch electrodes directly, without a dedicated touch IC. ESP32-C3 / ESP32-C5 have no built-in touch peripheral; if tactile interaction is required, use an external capacitive touch chip or a button solution.

  • Reference Design: You can refer to the hardware and software implementation of ESP-Spot. The code repository provides examples of touch detection and AI linkage.

Posture Sensing

With an IMU (Inertial Measurement Unit), the device gains “posture awareness”, which can be used for wake-from-sleep and richer interaction.

  • Typical Applications: Detect actions such as picking up, putting down, tilting, and shaking to implement “wake on pick-up”. Combined with algorithms, the IMU can also recognize gestures such as nodding, head shaking, and flipping, which adds interaction options to screenless or small-screen products.

  • Selection Suggestions: A commonly used 6-axis IMU (such as the LSM6DS series) can meet most scenarios; if a magnetometer is needed for absolute orientation, a 9-axis IMU can be chosen.

  • Reference Design: Both ESP-Spot and ESP-SparkBot integrate IMU, which can be used as a reference for sleep wake-up and simple posture interaction.
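
A minimal sketch of how accelerometer readings might be turned into "pick-up" and "shake" events is shown below. It is written in Python for clarity and is not taken from the ESP-Spot or ESP-SparkBot code. It assumes readings in units of g, compares the magnitude of each sample with 1 g, and uses arbitrary thresholds (`wake_delta`, `shake_delta`, `shake_count`) that would need tuning on real hardware.

```python
import math

GRAVITY_G = 1.0  # magnitude of a resting accelerometer reading, in g


def magnitude(sample):
    """Euclidean length of one (x, y, z) acceleration reading."""
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)


def detect_motion(samples, wake_delta=0.15, shake_delta=0.8, shake_count=3):
    """Classify a window of readings as 'still', 'pick_up', or 'shake'.

    All thresholds are illustrative assumptions, not tuned values.
    """
    deviations = [abs(magnitude(s) - GRAVITY_G) for s in samples]
    strong = sum(1 for d in deviations if d >= shake_delta)
    if strong >= shake_count:
        return "shake"      # several large deviations within one window
    if any(d >= wake_delta for d in deviations):
        return "pick_up"    # a small but clear deviation from rest
    return "still"
```

On a device, the "pick_up" result would typically be what wakes the system from sleep, while "shake" could be mapped to an interaction such as switching modes.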

Screen and UI

The number of screens and resolution should be chosen based on the product form and UI complexity. You can refer to the driver solutions and interface descriptions in LCD Application Solution.

  • Resolution and Performance: The higher the resolution and the refresh rate, the greater the CPU, bus, and memory usage. In applications with wake words, voice recognition, or local AI, sufficient CPU margin must be reserved for audio and AI tasks so that screen refreshes do not cause lag or delay wake-up.

  • Typical Configuration: Entry-level products commonly use 0.96”~1.54”, 160×160 or 240×240; products with emoticons or complex UI can choose 1.85”, 360×360 or higher resolution. For specifics, please refer to the resource usage description in the document and adjust according to actual measurements.

Note

The resource usage given in the LCD Application Solution is based on a “screen-only” scenario. If the actual product runs wake-up words, voice front-end, TTS or visual algorithms at the same time, sufficient CPU and memory margin should be reserved.
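
The memory side of this trade-off is simple arithmetic: one full frame buffer needs width × height × bytes per pixel, and refreshing it at a given frame rate multiplies that by the frames per second. The Python sketch below shows the calculation. The RGB565 format (2 bytes per pixel) and the helper names are assumptions for illustration.

```python
def framebuffer_bytes(width, height, bytes_per_pixel=2):
    """Size of one full frame buffer; 2 bytes per pixel assumes RGB565."""
    return width * height * bytes_per_pixel


def refresh_bandwidth(width, height, fps, bytes_per_pixel=2):
    """Bytes per second moved to the display when every frame is fully redrawn."""
    return framebuffer_bytes(width, height, bytes_per_pixel) * fps
```

Under these assumptions, a 240×240 screen needs 115,200 bytes per frame, and a 360×360 screen needs 259,200 bytes, which is roughly half of the 512 KB SRAM listed for ESP32-S3 in the chip comparison. Redrawing the 360×360 screen at an assumed 30 frames per second moves about 7.8 MB per second, which is why larger screens are usually paired with PSRAM or partial buffers.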

Get the AI product moving

If you need to give your AI product the ability to perform actions (such as turning its head, waving its hand, moving its mechanical arm, etc.), you can choose the type of motor based on the requirements of force, precision, and cost.

  • Servo: Simple angle control, unified interface, suitable for head rotation and simple limb movements. See the reference design at ESP-Hi.

  • DC Motor: Suitable for continuous rotation scenarios such as wheeled movement and fans. It must be used with a driver, and an encoder can be added for speed or position control. See the reference design at ESP-SparkBot.

  • Stepper Motor: With a small step angle and high positioning accuracy, it suits applications that require precise angles, such as pan-tilt units and pointing mechanisms. See the reference design at EchoEar.

  • Robotic Arm: A robotic arm solution that combines servo motors and stepper motors. See ESP32P4-Robotic-Arm.

The aforementioned reference designs all include hardware connections and software driver examples, which can be trimmed and expanded after selecting the product form.
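
For the servo case, angle control usually comes down to mapping an angle to a PWM pulse width. The Python sketch below shows that mapping and the conversion to a timer duty value. The 0.5 ms to 2.5 ms pulse range, the 50 Hz frequency, the 180° travel, and the 14-bit resolution are common hobby-servo assumptions rather than values from the reference designs, so check the servo datasheet before using them.

```python
def servo_pulse_us(angle_deg, min_us=500, max_us=2500, max_angle=180):
    """Map an angle to a pulse width in microseconds, clamping to the valid range."""
    angle = max(0.0, min(float(max_angle), float(angle_deg)))
    return min_us + (max_us - min_us) * angle / max_angle


def pwm_duty(pulse_us, freq_hz=50, resolution_bits=14):
    """Convert a pulse width to a duty value for a timer with the given resolution."""
    period_us = 1_000_000 / freq_hz
    full_scale = 1 << resolution_bits   # duty value that corresponds to 100 %
    return round(pulse_us / period_us * full_scale)
```

With these defaults, 90° maps to a 1500 µs pulse and a duty value of 1229 out of 16384. The same two steps apply regardless of which PWM peripheral generates the signal.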

Introduction to Wake Word

ESP32-C3 / ESP32-C5 and ESP32-S3 / ESP32-P4 all currently support local wake words. The series differ in model size and front-end algorithms, so the choice is a trade-off between performance and cost.

  • ESP32-C3 / ESP32-C5: These chips run a smaller wake word model, and because of their limited compute, the audio front-end (AFE, such as noise reduction and VAD) is trimmed. Overall, the wake-up rate and noise resistance are weaker than on ESP32-S3 / ESP32-P4. They suit cost-sensitive applications that do not need high wake-up sensitivity.

  • ESP32-S3 / ESP32-P4: These chips use a larger wake word model and keep a more complete AFE pipeline, which gives better wake-up performance. The wake word model can also be accelerated by the extended AI instruction set, which reduces CPU usage and makes it easier to run alongside tasks such as voice recognition and TTS.

For reference designs, see ESP-Hi and EchoEar, which cover entry-level and high-performance voice interaction scenarios respectively.

Edge AI Information Sharing

  1. Voice Capability Enhancement

    • Wake word support related repository references:
    • TTS supports low-cost custom wake words (expected to be released by end of May)

  2. Vision Capability Enhancement

    • Supports local YOLO model operation

    • Can implement basic object detection functionality

  3. Large Model Integration Examples