Introduction to LLM Solution

Note

This document was automatically translated using AI and may contain errors in the details. The official English version is still in progress.

LLM Solution Overview

Solution Features:
  • The rise of large models such as ChatGPT has driven a global AI boom: cloud platforms keep getting smarter, and AI technology continues to spread into various industries. Espressif has built a solid technical foundation with its open, shared, ecosystem-based intelligent hardware platform.

  • Espressif provides reference solutions for voice and vision large models, with corresponding solutions for the low-cost ESP32-C3, flagship ESP32-S3, high-performance ESP32-P4, and dual-band ESP32-C5.

  • Espressif is also working on private deployment of large language models to help customers bring products to market faster.

Solution Overview Diagram:

LLM Solution Overview Diagram

Voice Solution Details

Voice Solution Product Matrix

| Positioning | C Series (ESP32-C3/C5) | S Series (ESP32-S3) |
|---|---|---|
| Entry Level | ESP-Hi: 0.96” 160*80 LCD, no codec required | - |
| Cost-Effective | ESP-Spot: no screen, 5 GHz band | ESP-Gogo & ESP-Eyemoji: 1.85” 360*360 & 0.71” 160*160 LCD, dual HD screens |
| High Performance | - | EchoEar: 1.85” 360*360 LCD, dual microphone array |

Vision Solution Details

Vision Solution Product Matrix

| Positioning | S Series (ESP32-S3) | P Series (ESP32-P4) |
|---|---|---|
| Cost-Effective | ESP-SparkBot: DVP VGA/720p camera, 1.54” 240*240 LCD | ESP-Brookesia: USB camera, 720p touch screen |

Chip Feature Comparison

ESP32-S3 vs ESP32-C3 vs ESP32-C5 LLM Application Selection Comparison

| Feature | ESP32-S3 | ESP32-C3 | ESP32-C5 |
|---|---|---|---|
| Recommended Scenarios | Multi-modal interaction; voice and vision terminals with display | Cost-effective, lightweight edge applications | Scenarios that benefit from 5 GHz Wi-Fi: interference resistance, network compatibility |
| Voice | Multiple microphones, supports acoustic echo cancellation (AEC) | Single-microphone solution, basic audio capture | Single-microphone solution, basic audio capture |
| Display | Multiple display solutions and complex interface interaction | Limited IO resources, restricted display solutions | Slightly more IO resources than C3, restricted display solutions |
| Camera | Supports SPI / DVP / USB cameras | SPI camera only | SPI camera only |
| Touch | Built-in touch sensor | No touch sensor, requires external touch chip | No touch sensor, requires external touch chip |
| Memory | Up to 512 KB SRAM + PSRAM support | 400 KB SRAM | Up to 384 KB SRAM + PSRAM support |
| Local AI Capability | Vector computing instructions + neural network accelerator | Supports lightweight models | Supports lightweight models |
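
The chip comparison above can be read as a simple filter over requirements. The following is a minimal sketch of that idea, not an official selection tool: the capability flags are transcribed from this comparison only (not from datasheets), and the `Chip` class and `candidate_chips` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Capabilities as listed in the comparison above (illustrative only)."""
    name: str
    camera_interfaces: frozenset
    multi_mic_aec: bool
    builtin_touch: bool
    psram: bool
    wifi_5ghz: bool


CHIPS = [
    Chip("ESP32-S3", frozenset({"SPI", "DVP", "USB"}), True, True, True, False),
    Chip("ESP32-C3", frozenset({"SPI"}), False, False, False, False),
    Chip("ESP32-C5", frozenset({"SPI"}), False, False, True, True),
]


def candidate_chips(camera=None, multi_mic_aec=False, builtin_touch=False,
                    psram=False, wifi_5ghz=False):
    """Return the names of chips that satisfy every requested requirement."""
    result = []
    for chip in CHIPS:
        if camera is not None and camera not in chip.camera_interfaces:
            continue
        if multi_mic_aec and not chip.multi_mic_aec:
            continue
        if builtin_touch and not chip.builtin_touch:
            continue
        if psram and not chip.psram:
            continue
        if wifi_5ghz and not chip.wifi_5ghz:
            continue
        result.append(chip.name)
    return result


# Example: a DVP camera narrows the choice to a single chip in this table.
print(candidate_chips(camera="DVP"))
```

An empty result (for example, multi-microphone AEC together with 5 GHz Wi-Fi) signals that no single chip in this comparison covers the combination and the requirements need to be revisited.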

LLM Solution Product Matrix Overview

| Category | Solution Details |
|---|---|
| Cloud Platform Integration | RainMaker & Matter: rapid access to cloud platforms |
| Cloud Server | MCP, multi-platform access, and offline deployment: supports multiple cloud services and private deployment solutions |
| Software Framework | ESP-Brookesia: complete software development framework support |
| Application Solutions | Audio & image solutions: multiple chips and detailed scenario choices. AI empowerment: ESP-PDD (MIC/SPK/LCD small module for rapid upgrade and evaluation of existing products); AT commands (simple and quick integration) |

Process Architecture Introduction:

On the edge side, ESP chips mainly handle data collection, preliminary processing, and transmission. Because of their limited processing performance, LLM-related processing still relies on cloud servers. The overall system architecture is as follows:

Edge and Cloud Task Division

Edge Tasks:
- Data Collection and Preliminary Processing: Real-time collection and preliminary processing of voice and image data on Espressif chips.
- Local AI Model Inference: Deploy lightweight models for offline or low-latency processing.
- Real-time Transmission: Use a full-duplex RTC protocol to ensure timely data transmission to the cloud.

Cloud Tasks:
- Large Model Training and Deep Analysis: Perform complex calculations on collected data to provide intelligent decision support.
- Algorithm Updates and Optimization: Use cloud computing resources for model iteration, with real-time feedback to the edge.
- Remote Updates and Maintenance: Implement continuous system updates and maintenance through OTA and other methods.
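
To make the edge-side "collect, pre-process, transmit" step concrete, the sketch below splits captured PCM audio into fixed-duration frames, which is a common first step before handing audio to a real-time transport. The audio format (16 kHz, 16-bit, mono) and the 20 ms frame length are assumptions chosen for illustration; this document does not specify them, and the actual RTC stack and codec are out of scope here. The function names are hypothetical.

```python
def pcm_bitrate_bps(sample_rate_hz=16000, bits_per_sample=16, channels=1):
    """Raw (uncompressed) PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def split_into_frames(pcm, sample_rate_hz=16000, bits_per_sample=16,
                      channels=1, frame_ms=20):
    """Split raw PCM bytes into whole frames of frame_ms each.

    Returns (frames, remainder); the remainder holds trailing bytes that do
    not yet fill a frame and should be prepended to the next capture buffer.
    """
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    bytes_per_frame = samples_per_frame * (bits_per_sample // 8) * channels
    frames = []
    for offset in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame):
        frames.append(pcm[offset:offset + bytes_per_frame])
    remainder = pcm[len(frames) * bytes_per_frame:]
    return frames, remainder


# Example: one second of silence at the assumed format.
one_second = bytes(32000)
frames, rest = split_into_frames(one_second)
print(len(frames), len(frames[0]), len(rest))
```

Under these assumptions each frame is 640 bytes and the raw uplink is 256 kbit/s before any compression, which is one reason edge devices usually encode audio before sending it to the cloud.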

Private Deployment Value:
  • Accelerates testing and improves stability

  • Combined with embedded devices, enables a low-latency, high-privacy smart home experience

  • Downstream enterprises can quickly integrate intelligent interaction capabilities by purchasing a complete deployment solution, which lowers technical barriers and shortens the time to market

Private Deployment Architecture:

Private Deployment Architecture Diagram

Common LLM Application Scenarios

Common LLM Application Scenario Classification

| Application Scenario | Product Form | Recommended Solution |
|---|---|---|
| Plush/Desktop Pet Toys | With screen: supports expression/interaction display. Without screen: pure voice and action interaction | ESP-Eyemoji: cute expressions, voice interaction. ESP-Spot: lightweight without screen, supports gestures |
| Smart Speaker | Single mic: near-field interaction. Dual mic: sound source localization | ESP-Hi: ultra-low-cost single-mic solution. EchoEar: dual microphone array, far-field wake-up |
| Automotive Applications | Single screen: driving information, cute eye expression display. Dual screen: driving information, dual eye expression display | EchoEar: single-screen voice interaction. ESP-Gogo & ESP-Eyemoji: dual-screen voice interaction |
| AI Empowerment for Existing Products | SPK/MIC/LCD small module | ESP-PDD: rapid AI product formation/evaluation. AT commands: simple and quick integration |

ESP-Spot

  • ESP-Spot: An AI voice and motion interaction core module based on ESP32-S3 / ESP32-C5, focusing on voice interaction, AI perception, and intelligent control. Besides offline voice wake-up and AI dialogue, it can sense touch on a doll through the ESP32-S3’s built-in touch/proximity sensing peripherals. A built-in accelerometer recognizes postures and motions, enabling richer interactions.

Features:

  • Screenless solution, focusing on voice and action interaction

  • Low cost, no display screen

  • Can be expanded with a panel into the dual-screen ESP-Gogo and ESP-Eyemoji

  • Adapted to both ESP32-S3 and ESP32-C5; the C5 version can use the 5 GHz band

ESP-SparkBot

  • ESP-SparkBot: Based on ESP32-S3, it integrates voice interaction, image recognition, and multimedia entertainment. It can transform into a remote control car, run local AI, and support large model dialogue, real-time video transmission, and HD video projection. Powerful performance, endless fun!

ESP-Hi

  • ESP-Hi: A highly integrated AI voice solution based on ESP32-C3. It uses the ESP32-C3’s built-in ADC for microphone capture and drives audio output directly from I2S PDM, which keeps the board-level material cost low.

Description:

  • High Integration: The ESP32-C3’s built-in ADC captures the microphone signal and I2S PDM drives the audio output directly, so no external codec chip is needed and the board-level material cost stays low.

  • Low Resource Usage: Audio input and output use only 4 IO pins and very little CPU and memory, leaving sufficient resources for application development.

  • Multiple Interaction Methods: Equipped with a screen and LED indicators; supports buttons, shaking, and voice wake-up.

Features:

  • Currently the AI voice solution with the lowest board-level material cost

  • Lightweight wake word model for ESP32-C3, supports offline wake-up

ESP-P4 Phone

  • ESP-P4 Phone: A handheld device solution with a screen, based on ESP32-P4. Combined with the Phone UI of ESP-Brookesia, it provides an Android-like system experience.

Hardware:

  • Main Control: ESP32-P4

  • Wi-Fi: ESP32-C6

  • LCD & Touch: 720P MIPI-DSI ILI9881 & GT911

  • Audio: ES8311

  • Type-C: USB2.0

Features:

  • 720P high-resolution touch screen

  • “ESP32-C6 + ESP-Hosted” Wi-Fi solution

  • Android-like system experience, with apps for common functions (such as network configuration)

EchoEar

  • EchoEar: EchoEar Meow Companion is a smart AI development kit created by Espressif in collaboration with the Volcano Engine Coze large model team. It is suitable for voice-interactive products that need large model capabilities, such as toys, smart speakers, and smart control centers. The device is equipped with the ESP32-S3-WROOM-1 module, a 1.85-inch QSPI round touch screen, and a dual microphone array, and supports offline voice wake-up and sound source localization algorithms. Combined with the large model capabilities provided by Volcano Engine, Meow Companion can achieve full-duplex voice interaction, multimodal recognition, and agent control, giving developers a solid foundation for building a complete on-device AI application experience.

Hardware:

  • ESP32-S3

LLM Hardware Solution Summary

Audio Input Solution Comparison Table

| Solution No. | Solution Type | Resource Usage | Cost | Effects and Recommended Application Scenarios |
|---|---|---|---|---|
| 1 | Digital Microphone (MSM261S4030H0R etc.) | 1 I2S (3 pins) | High | Simple wiring; cannot achieve echo cancellation; suitable for DIY or scenarios with limited board area |
| 2 | Dedicated Audio ADC + Analog Microphone (ES7210) | 1 I2S + 1 I2C (5 pins) | Medium-High | Recommended for multiple microphone scenarios |
| 3 | CODEC + Analog Microphone (ES8311 etc.) | 1 I2S + 1 I2C (5 pins) | Medium | Low cost; good effect; recommended for use |
| 4 | Internal ADC + Op-amp + Analog Microphone | 1 Internal ADC (1 pin) | Lowest | Lowest cost audio input solution; meets basic audio input requirements |

Audio Output Solution Comparison Table

| Solution No. | Solution Type | Resource Usage | Cost | Effects and Recommended Application Scenarios |
|---|---|---|---|---|
| 1 | I2S Digital Amplifier (MAX98357A etc.) | 1 I2S + 1 PA control (4 pins) | Medium | Good effect but no volume control; suitable for products only requiring audio output |
| 2 | CODEC + Analog Amplifier (ES8311 + NS4150) | 1 I2S + 1 I2C + 1 PA control (6 pins) | Low | Optimal cost and effect; recommended for AI audio |
| 3 | I2S PDM + Analog Amplifier (NS4150) | 2 I2S pins + 1 PA control (3 pins) | Lowest | Lowest cost but uses CPU resources; suitable for cost-sensitive scenarios |
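
The pin counts in the two audio tables can be added up to estimate the IO budget of a given input/output pairing. The sketch below does this as a naive sum; the dictionary keys and the `audio_pin_budget` helper are hypothetical names, and the result should be treated as an upper bound, since an input and an output that go through the same codec typically share I2S clock and I2C lines.

```python
# Pin counts transcribed from the audio input/output tables above.
AUDIO_INPUT_PINS = {
    "digital_mic": 3,    # 1 I2S (3 pins)
    "audio_adc": 5,      # 1 I2S + 1 I2C (5 pins), e.g. ES7210
    "codec": 5,          # 1 I2S + 1 I2C (5 pins), e.g. ES8311
    "internal_adc": 1,   # 1 internal ADC (1 pin)
}

AUDIO_OUTPUT_PINS = {
    "i2s_amp": 4,        # 1 I2S + 1 PA control (4 pins)
    "codec_amp": 6,      # 1 I2S + 1 I2C + 1 PA control (6 pins)
    "pdm_amp": 3,        # 2 I2S pins + 1 PA control (3 pins)
}


def audio_pin_budget(input_solution, output_solution):
    """Naive upper bound on GPIOs used by an audio input/output pairing.

    Assumption: no pins are shared between the two directions, which
    overestimates designs where one codec handles both.
    """
    return AUDIO_INPUT_PINS[input_solution] + AUDIO_OUTPUT_PINS[output_solution]


# Example: internal ADC capture plus I2S PDM playback.
print(audio_pin_budget("internal_adc", "pdm_amp"))
```

The internal ADC plus I2S PDM pairing comes out at 4 pins, which is consistent with the "only 4 IO" figure quoted for ESP-Hi earlier in this document.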

The audio input and output solutions above can be combined freely. For customers who are still in the design phase, however, only the following combinations are recommended once cost and performance are taken into account:

Voice Solution Recommendation Comparison Table

| Category | Type Description | Solution Features | Recommended Hardware | Reference Development Board |
|---|---|---|---|---|
| Optimal Cost | Analog mic + op-amp + internal ADC audio capture + I2S PDM output | Single microphone input, mono playback; implements basic audio capture and playback; minimum IO usage; suitable for close-range dialogue and low-cost applications | ESP32-C3 / ESP32-C5 | ESP-Hi |
| Balanced Choice | Single microphone + external codec chip (e.g., ES8311) | Supports echo cancellation; good single microphone input effect; balanced performance and cost | ESP32-S3 / ESP32-P4 / ESP32-C5 | ESP32-P4-Function-EV-Board |
| Best Performance | Multiple microphones + external codec + external audio ADC chip (e.g., ES8311 + ES7210) | Supports far-field voice wake-up; supports echo cancellation and noise suppression; suitable for high-performance voice applications | ESP32-S3 / ESP32-P4 | ESP32-S3-Korvo-2 |
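
The three recommended tiers can be reduced to two questions: does the product need far-field pickup, and does it need echo cancellation? The sketch below encodes that reading of the recommendation table. The selection rule is an illustrative assumption rather than an official guideline, and `recommend_voice_tier` is a hypothetical helper name.

```python
# Tier data transcribed from the recommendation table above.
VOICE_TIERS = {
    "Optimal Cost": {
        "audio_path": "analog mic + op-amp + internal ADC, I2S PDM output",
        "chips": ["ESP32-C3", "ESP32-C5"],
        "board": "ESP-Hi",
    },
    "Balanced Choice": {
        "audio_path": "single mic + external codec (e.g., ES8311)",
        "chips": ["ESP32-S3", "ESP32-P4", "ESP32-C5"],
        "board": "ESP32-P4-Function-EV-Board",
    },
    "Best Performance": {
        "audio_path": "multiple mics + codec + audio ADC (e.g., ES8311 + ES7210)",
        "chips": ["ESP32-S3", "ESP32-P4"],
        "board": "ESP32-S3-Korvo-2",
    },
}


def recommend_voice_tier(needs_far_field=False, needs_echo_cancellation=False):
    """Pick a tier name and its details from two yes/no requirements.

    Assumed rule: far-field wake-up implies the multi-microphone tier;
    echo cancellation alone implies the single-codec tier; otherwise the
    lowest-cost tier is enough.
    """
    if needs_far_field:
        name = "Best Performance"
    elif needs_echo_cancellation:
        name = "Balanced Choice"
    else:
        name = "Optimal Cost"
    return name, VOICE_TIERS[name]


# Example: a close-range toy with no barge-in requirement.
tier, details = recommend_voice_tier()
print(tier, details["board"])
```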

AI Vision Hardware Solution Comparison Table

| Interface Type | Camera Performance | Supported Chips | Reference Development Board | Features |
|---|---|---|---|---|
| SPI | Low resolution (max 320×240) | ESP32-S3 / ESP32-C5 / ESP32-C3 | ESP32-S3-EYE | Simple interface, suitable for beginners; lowest cost; limited frame rate and resolution; mainly used for image recognition and simple detection |
| DVP (Parallel) | Low to medium resolution (e.g., VGA, 720p) | ESP32-S3 / ESP32-P4 | ESP32-S3-EYE | Older but mature interface; can support basic video streaming; high resource usage (requires many GPIOs); medium imaging quality, suitable for medium-complexity visual tasks |
| USB | Low to high resolution (depends on the USB camera) | ESP32-S3 / ESP32-P4 | ESP32-S3-USB-OTG | Plug and play with UVC cameras; supports high resolution and high frame rate; complex software support (requires USB Host capability); suitable for applications requiring high image quality |
| MIPI CSI | High resolution (e.g., 1080p, 4K) | ESP32-P4 | ESP32-P4-Function-EV-Board | High bandwidth, low power consumption; high hardware requirements (requires MIPI PHY); supports HD, low-latency image capture; suitable for edge AI, visual analysis, and recognition scenarios |

In summary: ESP32-C3 and ESP32-C5 support only SPI cameras, and are currently adapted to the BD3901 camera. On ESP32-S3, a DVP camera is the first choice when image clarity matters; on ESP32-P4, a MIPI camera is the first choice.
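
One reason resolution tracks the chip class is frame buffer size. As a rough sketch, the following computes the size of one uncompressed frame and compares it with the 512 KB SRAM figure quoted for ESP32-S3 earlier. The RGB565 format (2 bytes per pixel) is an assumption made for illustration, the helper names are hypothetical, and the check is optimistic because it ignores memory used by the application itself.

```python
def frame_bytes(width, height, bytes_per_pixel=2):
    """Size of one uncompressed frame; 2 bytes/pixel assumes RGB565."""
    return width * height * bytes_per_pixel


def fits_in_internal_sram(width, height, sram_bytes=512 * 1024,
                          bytes_per_pixel=2):
    """Optimistic check: does a single raw frame fit in internal SRAM?

    The default 512 KB is the ESP32-S3 figure from the chip comparison;
    real firmware has far less than this available for frame buffers.
    """
    return frame_bytes(width, height, bytes_per_pixel) <= sram_bytes


# Resolutions named in the camera table above.
for label, (w, h) in {"320x240": (320, 240),
                      "VGA": (640, 480),
                      "720p": (1280, 720)}.items():
    print(label, frame_bytes(w, h), fits_in_internal_sram(w, h))
```

Under these assumptions only the 320×240 frame (150 KiB) fits in internal SRAM; VGA (600 KiB) and 720p (1800 KiB) already require PSRAM or a compressed format such as JPEG.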

Edge AI Information Sharing

  1. Voice Capability Enhancement

    • Wake word support related repository references:
    • TTS supports low-cost custom wake words (expected to be released by end of May)

  2. Vision Capability Enhancement

    • Supports local YOLO model operation

    • Can implement basic object detection functionality

  3. Large Model Integration Examples