Introduction to LLM Solution

Note

This document was translated automatically using AI and may contain minor errors. The official English version is still in progress.

LLM Solution Overview

Solution Features:
  • The rise of large models such as ChatGPT has driven a global AI boom: cloud platforms keep becoming more intelligent, and AI technology continues to spread across industries. Espressif has built a solid technical foundation for this with its open, shared, ecosystem-based smart hardware platform.

  • Espressif provides reference solutions for voice and vision large-model applications, covering the low-cost ESP32-C3, the flagship ESP32-S3, the high-performance ESP32-P4, and the dual-band ESP32-C5.

  • Espressif is also working on private deployment of large language models to help customers bring products to market faster.

Solution Overview Diagram:

[Figure: LLM Solution Overview Diagram]

Voice Solution Details

Voice Solution Product Matrix

| Positioning | C Series (ESP32-C3/C5) | S Series (ESP32-S3) |
| --- | --- | --- |
| Entry Level | ESP-Hi: 0.96" 160×80 LCD, no codec required | - |
| Cost-Effective | ESP-Spot: no screen, 5 GHz band | ESP-Gogo & ESP-Eyemoji: 1.85" 360×360 and 0.71" 160×160 LCD, dual HD screens |
| High Performance | - | EchoEar: 1.85" 360×360 LCD, dual-microphone array |

Vision Solution Details

Vision Solution Product Matrix

| Positioning | S Series (ESP32-S3) | P Series (ESP32-P4) |
| --- | --- | --- |
| Cost-Effective | ESP-SparkBot: DVP VGA/720p camera, 1.54" 240×240 LCD | ESP-Brookesia: USB camera, 720p touch screen |

Chip Feature Comparison

ESP32-S3 vs ESP32-C3 vs ESP32-C5 LLM Application Selection Comparison

| Features | ESP32-S3 | ESP32-C3 | ESP32-C5 |
| --- | --- | --- | --- |
| Recommended Scenarios | Multi-modal interaction; voice and vision terminals with a display | Cost-effective, lightweight edge applications | Scenarios that benefit from 5 GHz Wi-Fi: interference resistance, network compatibility |
| Voice | Multiple microphones; supports acoustic echo cancellation (AEC) | Single microphone, basic audio capture | Single microphone, basic audio capture |
| Display | Multiple display options and complex UI interaction | Limited IO resources, restricted display options | Slightly more IO resources than the ESP32-C3, restricted display options |
| Camera | Supports SPI / DVP / USB cameras | SPI camera only | SPI camera only |
| Touch | Built-in touch sensor | No built-in touch; requires an external touch chip | No built-in touch; requires an external touch chip |
| Memory | Up to 512 KB SRAM + PSRAM support | 400 KB SRAM | Up to 384 KB SRAM + PSRAM support |
| Local AI Capability | Vector computing instructions + neural network accelerator | Supports lightweight models | Supports lightweight models |
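
As a rough illustration, the comparison above can be encoded as a small lookup that filters chips by project requirements. This is a minimal Python sketch, not an Espressif tool: the `ChipProfile` type, its field names, and the `candidates` helper are hypothetical, the values are transcribed from the comparison, and the `wifi_5ghz` flag is an assumption based on the 5 GHz scenario being listed only for the ESP32-C5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipProfile:
    """Hypothetical summary of one chip from the comparison."""
    name: str
    cameras: frozenset    # camera interfaces listed for the chip
    multi_mic_aec: bool   # multiple microphones with echo cancellation
    builtin_touch: bool   # built-in touch sensor
    psram: bool           # PSRAM support listed under Memory
    wifi_5ghz: bool       # assumption: only the ESP32-C5 entry mentions 5 GHz


CHIPS = (
    ChipProfile("ESP32-S3", frozenset({"SPI", "DVP", "USB"}), True, True, True, False),
    ChipProfile("ESP32-C3", frozenset({"SPI"}), False, False, False, False),
    ChipProfile("ESP32-C5", frozenset({"SPI"}), False, False, True, True),
)


def candidates(camera=None, need_aec=False, need_touch=False,
               need_psram=False, need_5ghz=False):
    """Return the names of chips that satisfy every stated requirement."""
    matches = []
    for chip in CHIPS:
        if camera is not None and camera not in chip.cameras:
            continue
        if need_aec and not chip.multi_mic_aec:
            continue
        if need_touch and not chip.builtin_touch:
            continue
        if need_psram and not chip.psram:
            continue
        if need_5ghz and not chip.wifi_5ghz:
            continue
        matches.append(chip.name)
    return matches


if __name__ == "__main__":
    print(candidates(camera="DVP", need_aec=True))
    print(candidates(need_5ghz=True))
```

A filter like this only narrows the list; the final choice still depends on cost targets and the reference designs described in the rest of this document.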

LLM Solution Product Matrix Overview

| Category | Solution Details |
| --- | --- |
| Cloud Platform Integration | RainMaker & Matter: provides rapid access to cloud platforms |
| Cloud Server | MCP, multi-platform access, and offline deployment: supports multiple cloud services and private deployment solutions |
| Software Framework | ESP-Brookesia: provides a complete software development framework |
| Application Solutions | Audio & image solutions: offers a choice of chips for specific scenarios. AI empowerment: ESP-PDD, a small MIC/SPK/LCD module for quickly upgrading and evaluating existing products; AT Commands for simple and quick integration |

Process Architecture Introduction:

ESP chips act as the edge side and mainly handle data collection, preliminary processing, and transmission. Because of their limited processing performance, LLM-related processing still relies on cloud servers. The overall system architecture is as follows:

Edge and Cloud Task Division

Edge Tasks:
  • Data Collection and Preliminary Processing: Collect and pre-process voice and image data in real time on Espressif chips.
  • Local AI Model Inference: Deploy lightweight models for offline or low-latency processing.
  • Real-time Transmission: Use a full-duplex RTC protocol to deliver data to the cloud promptly.

Cloud Tasks:
  • Large Model Training and Deep Analysis: Run complex computation on the collected data to support intelligent decisions.
  • Algorithm Updates and Optimization: Use cloud computing resources to iterate on models and feed the results back to the edge in real time.
  • Remote Updates and Maintenance: Keep the system updated and maintained through OTA and other methods.

Private Deployment Value:
  • Accelerates testing and improves stability

  • Combined with embedded devices, enables a low-latency, high-privacy smart home experience

  • Downstream enterprises can quickly integrate intelligent interaction capabilities by purchasing a complete deployment solution, which lowers technical barriers and shortens time to market

Private Deployment Architecture:

[Figure: Private Deployment Architecture Diagram]

Common LLM Application Scenarios

Common LLM Application Scenario Classification

| Application Scenario | Product Form | Recommended Solution |
| --- | --- | --- |
| Plush/Desktop Pet Toys | With screen: supports expression/interaction display; without screen: pure voice and action interaction | ESP-Eyemoji: cute expressions, voice interaction; ESP-Spot: lightweight, screenless, supports gestures |
| Smart Speaker | Single mic: near-field interaction; dual mic: sound source localization | ESP-Hi: ultra-low-cost single-mic solution; EchoEar: dual-microphone array, far-field wake-up |
| Automotive Applications | Single screen: driving information, cute eye expression display; dual screen: driving information, dual eye expression display | EchoEar: single-screen voice interaction; ESP-Gogo & ESP-Eyemoji: dual-screen voice interaction |
| AI Empowerment for Existing Products | SPK/MIC/LCD small module | ESP-PDD: rapid AI product prototyping and evaluation; AT Commands: simple and quick integration |

ESP-Spot

  • ESP-Spot: An AI voice and motion interaction core module based on the ESP32-S3 / ESP32-C5, focused on voice interaction, AI perception, and intelligent control. In addition to offline voice wake-up and AI dialogue, it can sense touches on a doll through the ESP32-S3's built-in touch/proximity sensing peripherals. A built-in accelerometer recognizes postures and motions, enabling richer interactions.

Features:

  • Screenless solution focused on voice and action interaction

  • Low cost, since no display is required

  • Can be expanded with a panel into the dual-screen ESP-Gogo and ESP-Eyemoji

  • Adapted for both the ESP32-S3 and the ESP32-C5; the ESP32-C5 version can showcase 5 GHz Wi-Fi

ESP-SparkBot

  • ESP-SparkBot: Based on the ESP32-S3, it integrates voice interaction, image recognition, and multimedia entertainment. It can transform into a remote-controlled car, run local AI, and support large-model dialogue, real-time video transmission, and HD video projection. Powerful performance, endless fun!

ESP-Hi

  • ESP-Hi: A highly integrated AI voice solution based on the ESP32-C3. It uses the ESP32-C3's built-in ADC for microphone capture and I2S PDM directly for audio output, which keeps the board-level bill-of-materials (BOM) cost low.

Description:

  • High Integration: The ESP32-C3's built-in ADC captures the microphone signal and I2S PDM drives the audio output directly, so no external codec chip is needed and the board-level BOM cost stays low.

  • Low Resource Usage: Audio input and output use only 4 IO pins and very little CPU and memory, leaving ample resources for application development.

  • Multiple Interaction Methods: Equipped with a screen and LED indicators; supports buttons, shake detection, and voice wake-up.

Features:

  • Currently the AI voice solution with the lowest board-level BOM cost

  • Lightweight wake-word model for the ESP32-C3, supporting offline wake-up

ESP-P4 Phone

  • ESP-P4 Phone: A handheld device solution with a screen, based on the ESP32-P4. Combined with ESP-Brookesia's Phone UI, it delivers an Android-like system experience.

Hardware:

  • Main Control: ESP32-P4

  • Wi-Fi: ESP32-C6

  • LCD & Touch: 720P MIPI-DSI ILI9881 & GT911

  • Audio: ES8311

  • Type-C: USB2.0

Features:

  • 720P high-resolution touch screen

  • “ESP32-C6 + ESP-Hosted” Wi-Fi solution

  • Android-like system experience, with apps for common functions (such as network configuration)

EchoEar

  • EchoEar: A customized AI speaker for DouBao. It is a rechargeable device with a round screen that supports dual-microphone sound source localization. Paired with a turntable, it can rotate to face different directions, and it also provides touch and battery functions.

Hardware:

  • ESP32-S3

Related Links:

  • Related Videos: Updating

LLM Hardware Solution Summary

Audio Input Solution Comparison Table

| Solution No. | Solution Type | Resource Usage | Cost | Effects and Recommended Application Scenarios |
| --- | --- | --- | --- | --- |
| 1 | Digital microphone (MSM261S4030H0R, etc.) | 1 I2S (3 pins) | High | Simple wiring; cannot achieve echo cancellation; suitable for DIY or board-area-limited scenarios |
| 2 | Dedicated audio ADC + analog microphone (ES7210) | 1 I2S + 1 I2C (5 pins) | Medium-High | Recommended for multi-microphone scenarios |
| 3 | Codec + analog microphone (ES8311, etc.) | 1 I2S + 1 I2C (5 pins) | Medium | Low cost; good effect; recommended |
| 4 | Internal ADC + op-amp + analog microphone | 1 internal ADC (1 pin) | Lowest | Lowest-cost audio input solution; meets basic audio input requirements |

Audio Output Solution Comparison Table

| Solution No. | Solution Type | Resource Usage | Cost | Effects and Recommended Application Scenarios |
| --- | --- | --- | --- | --- |
| 1 | I2S digital amplifier (MAX98357A, etc.) | 1 I2S + 1 PA control (4 pins) | Medium | Good effect but no volume control; suitable for products that only require audio output |
| 2 | Codec + analog amplifier (ES8311 + NS4150) | 1 I2S + 1 I2C + 1 PA control (6 pins) | Low | Optimal cost and effect; recommended for AI audio |
| 3 | I2S PDM + analog amplifier (NS4150) | 2 I2S pins + 1 PA control (3 pins) | Lowest | Lowest cost but uses CPU resources; suitable for cost-sensitive scenarios |

The audio input and output solutions above can be combined freely. For customers still in the design phase, however, only the following combinations are recommended once cost and performance are taken into account:

Voice Solution Recommendation Comparison Table

| Category | Type Description | Solution Features | Recommended Hardware | Reference Development Board |
| --- | --- | --- | --- | --- |
| Optimal Cost | Analog microphone + op-amp into the internal ADC for audio capture + I2S PDM output | Single microphone input, mono playback; implements basic audio capture and playback; minimum IO usage; suitable for close-range dialogue and low-cost applications | ESP32-C3 / ESP32-C5 | ESP-Hi |
| Balanced Choice | Single microphone + external codec chip (e.g., ES8311) | Supports echo cancellation; good single-microphone input quality; balanced performance and cost | ESP32-S3 / ESP32-P4 / ESP32-C5 | ESP32-P4-Function-EV-Board |
| Best Performance | Multiple microphones + external codec + external audio ADC chip (e.g., ES8311 + ES7210) | Supports far-field voice wake-up; supports echo cancellation and noise suppression; suitable for high-performance voice applications | ESP32-S3 / ESP32-P4 | ESP32-S3-Korvo-2 |
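
To compare combinations by IO budget, the pin counts from the audio input and audio output comparison tables can simply be added. The following is a minimal Python sketch under that assumption; the dictionary keys and the `pin_budget` and `ranked_combinations` helpers are hypothetical names, and the sum should be read as an upper bound, since options that share a bus (for example, one codec handling both capture and playback) may need fewer pins.

```python
from itertools import product

# Pin counts transcribed from the audio input comparison table.
AUDIO_INPUT_PINS = {
    "digital microphone": 3,
    "audio ADC + analog microphone": 5,
    "codec + analog microphone": 5,
    "internal ADC + op-amp + analog microphone": 1,
}

# Pin counts transcribed from the audio output comparison table.
AUDIO_OUTPUT_PINS = {
    "I2S digital amplifier": 4,
    "codec + analog amplifier": 6,
    "I2S PDM + analog amplifier": 3,
}


def pin_budget(audio_in, audio_out):
    """Add the pins of one input and one output option (upper bound)."""
    return AUDIO_INPUT_PINS[audio_in] + AUDIO_OUTPUT_PINS[audio_out]


def ranked_combinations():
    """Return every input/output pairing as (pins, input, output), fewest pins first."""
    pairs = product(AUDIO_INPUT_PINS, AUDIO_OUTPUT_PINS)
    return sorted((pin_budget(i, o), i, o) for i, o in pairs)


if __name__ == "__main__":
    for pins, audio_in, audio_out in ranked_combinations():
        print(f"{pins:2d} pins: {audio_in} + {audio_out}")
```

The cheapest pairing, internal ADC input with I2S PDM output, comes to 4 pins, which agrees with the 4 IO pins quoted for ESP-Hi.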

AI Vision Hardware Solution Comparison Table

| Interface Type | Camera Performance | Supported Chips | Reference Development Board | Features |
| --- | --- | --- | --- | --- |
| SPI | Low-resolution images (max 320×240) | ESP32-S3 / ESP32-C5 / ESP32-C3 | ESP32-S3-EYE | Simple interface, suitable for beginners; lowest cost; limited frame rate and resolution; mainly used for image recognition and simple detection |
| DVP (Parallel) | Low to medium resolution (e.g., VGA, 720p) | ESP32-S3 / ESP32-P4 | ESP32-S3-EYE | Older but mature interface; can support basic video streaming; high resource usage (requires many GPIOs); medium imaging quality, suitable for moderately demanding vision tasks |
| USB | Low to high resolution (depends on the USB camera) | ESP32-S3 / ESP32-P4 | ESP32-S3-USB-OTG | Plug and play with UVC cameras; supports high resolution and high frame rate; complex software support (requires USB Host capability); suitable for applications requiring high image quality |
| MIPI CSI | High-resolution images (e.g., 1080p, 4K) | ESP32-P4 | ESP32-P4-Function-EV-Board | High bandwidth, low power consumption; high hardware requirements (requires a MIPI PHY); supports HD, low-latency image capture; suitable for edge AI, visual analysis, and recognition scenarios |

In summary: the ESP32-C3 and ESP32-C5 support only SPI cameras, and the BD3901 camera is currently adapted. For the ESP32-S3, a DVP camera is the first choice when image clarity matters; for the ESP32-P4, a MIPI camera is the first choice.
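
The summary above can be written down as a small lookup. This is only an illustrative Python sketch: the `recommend_camera` helper and its `needs_clarity` parameter are hypothetical, and falling back to SPI on the ESP32-S3 when clarity is not a concern is an assumption drawn from SPI being listed as the lowest-cost interface.

```python
# First-choice camera interface per chip, following the summary above.
FIRST_CHOICE = {
    "ESP32-C3": "SPI",
    "ESP32-C5": "SPI",
    "ESP32-S3": "DVP",
    "ESP32-P4": "MIPI CSI",
}


def recommend_camera(chip, needs_clarity=True):
    """Return the suggested camera interface for a chip name such as 'ESP32-S3'."""
    name = chip.strip().upper()
    if name not in FIRST_CHOICE:
        raise ValueError(f"no camera recommendation for {chip!r}")
    # Assumption: without a clarity requirement, the lowest-cost SPI
    # interface is sufficient on the ESP32-S3.
    if name == "ESP32-S3" and not needs_clarity:
        return "SPI"
    return FIRST_CHOICE[name]


if __name__ == "__main__":
    for chip in FIRST_CHOICE:
        print(chip, "->", recommend_camera(chip))
```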

Edge AI Information Sharing

  1. Voice Capability Enhancement

    • Repository references for wake word support:
    • Low-cost custom wake words via TTS (expected to be released by the end of May)

  2. Vision Capability Enhancement

    • Supports local YOLO model operation

    • Can implement basic object detection functionality

  3. Large Model Integration Examples