Xiaozhi AI Chatbot

Supported Chips

ESP32-S3

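Xiaozhi is a bidirectional streaming conversation component that connects to the xiaozhi.me service. It supports real-time voice and text interaction with AI agents built on large language models such as Qwen and DeepSeek.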

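The component suits use cases such as voice assistants and voice question-answering systems. Its low-latency, lightweight design fits applications that run on embedded devices such as the ESP32.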

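Features

  • Bidirectional streaming: real-time voice and text interaction with AI agents

  • Multiple communication protocols: WebSocket and MQTT+UDP

  • Audio codec support: OPUS, G.711, and PCM audio formats

  • MCP integration: device-side MCP for device control (speaker, LED, servo, GPIO, etc.)

  • Multilingual support: Chinese and English

  • Offline wake word: provides an API for reporting a detected wake word (e.g. esp_xiaozhi_chat_send_wake_word); integration with ESP-SR is implemented by the application layer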

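Architecture

Xiaozhi uses a streaming ASR (automatic speech recognition) + LLM (large language model) + TTS (text-to-speech) architecture for voice interaction:

  1. Audio input: capture audio from the microphone

  2. ASR: convert speech to text in real time

  3. LLM: process the text and generate a response

  4. TTS: convert the text response to speech

  5. Audio output: play the audio through the speaker

The component integrates with MCP (Model Context Protocol) to provide device control.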

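Examples

  1. Xiaozhi application example: ai/xiaozhi_chat. A complete voice assistant application that demonstrates:

     • Voice interaction with AI agents

     • Device control over the MCP protocol

     • Multilingual support

     • Display support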

API Reference

Header File

Functions

esp_err_t esp_xiaozhi_chat_init(esp_xiaozhi_chat_config_t *config, esp_xiaozhi_chat_handle_t *chat_hd)

Create a chat module instance.

The current implementation supports only one chat instance at a time.

Parameters
  • config[in] Pointer to the chat configuration structure

  • chat_hd[out] Pointer to the chat handle

Returns

  • ESP_OK On success

  • ESP_ERR_NO_MEM Out of memory

  • ESP_ERR_INVALID_ARG Invalid arguments

  • ESP_ERR_INVALID_STATE Another chat instance is already active

esp_err_t esp_xiaozhi_chat_deinit(esp_xiaozhi_chat_handle_t chat_hd)

Deinitialize the chat module.

This function releases chat-owned resources. If the chat session is still running, it will stop runtime resources first. The MCP engine is destroyed only when config.owns_mcp_engine was set to true during init.

Parameters
  • chat_hd[in] Handle to the chat instance

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

esp_err_t esp_xiaozhi_chat_start(esp_xiaozhi_chat_handle_t chat_hd)

Start the chat session.

Parameters
  • chat_hd[in] Handle to the chat instance

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

  • ESP_ERR_NOT_FOUND Required transport configuration is missing

  • Other Error from transport start (MQTT or WebSocket)

esp_err_t esp_xiaozhi_chat_stop(esp_xiaozhi_chat_handle_t chat_hd)

Stop the chat session.

Stops the active chat runtime, including audio channel and MCP manager resources, but does not destroy the configured MCP engine.

Parameters
  • chat_hd[in] Handle to the chat instance

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

esp_err_t esp_xiaozhi_chat_open_audio_channel(esp_xiaozhi_chat_handle_t chat_hd, const esp_xiaozhi_chat_audio_t *audio, char *message, size_t message_len)

Open the audio channel.

Parameters
  • chat_hd[in] Handle to the chat instance

  • audio[in] Optional audio params for the generated hello (format, sample_rate, channels, frame_duration). Used only when message is NULL. NULL or zero fields mean defaults: “opus”, 16000, 1, 60. Non-zero values must be within valid protocol ranges.

  • message[in] Optional message to send when opening the channel. If NULL, a default hello message will be generated

  • message_len[in] Length of the message buffer. If 0 with message NULL, a default hello is generated; if message is non-NULL, message_len must be > 0 (otherwise returns ESP_ERR_INVALID_ARG)

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle, invalid audio params, or invalid message/message_len combination

  • ESP_ERR_NO_MEM Failed to allocate hello message buffer

  • ESP_ERR_INVALID_SIZE Hello message buffer too small

  • Other Error from get_hello_message, transport_send_text, or audio_open
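The default and range rules for the audio argument can be sketched on a host machine as follows. This is a minimal sketch, not the component's implementation: hello_params_t and resolve_hello_params are hypothetical names introduced here, and the ranges are the ones stated in the esp_xiaozhi_chat_audio_t description (8000-48000 Hz, 1-2 channels, 10-120 ms).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the four hello-related fields of
 * esp_xiaozhi_chat_audio_t; this is not the component's own type. */
typedef struct {
    const char *format;   /* NULL means "opus" */
    int sample_rate;      /* 0 means 16000 (Hz) */
    int channels;         /* 0 means 1 */
    int frame_duration;   /* 0 means 60 (ms) */
} hello_params_t;

/* Fill in the documented defaults, then check the documented ranges.
 * `in` may be NULL, which means "all defaults".
 * Returns false if any non-zero field is out of range; `out` is then untouched. */
static bool resolve_hello_params(const hello_params_t *in, hello_params_t *out)
{
    hello_params_t p = { NULL, 0, 0, 0 };
    if (in != NULL) {
        p = *in;
    }

    if (p.format == NULL)      p.format = "opus";
    if (p.sample_rate == 0)    p.sample_rate = 16000;
    if (p.channels == 0)       p.channels = 1;
    if (p.frame_duration == 0) p.frame_duration = 60;

    if (p.sample_rate < 8000 || p.sample_rate > 48000)   return false;
    if (p.channels < 1 || p.channels > 2)                return false;
    if (p.frame_duration < 10 || p.frame_duration > 120) return false;

    *out = p;
    return true;
}
```

Checking parameters this way before calling esp_xiaozhi_chat_open_audio_channel lets an application report a configuration mistake early instead of decoding an ESP_ERR_INVALID_ARG at run time.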

esp_err_t esp_xiaozhi_chat_close_audio_channel(esp_xiaozhi_chat_handle_t chat_hd)

Close the audio channel.

Parameters
  • chat_hd[in] Handle to the chat instance

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

esp_err_t esp_xiaozhi_chat_send_audio_data(esp_xiaozhi_chat_handle_t chat_hd, const char *data, size_t data_len)

Send audio data to the chat session.

Parameters
  • chat_hd[in] Handle to the chat instance

  • data[in] Pointer to the audio data buffer

  • data_len[in] Length of the audio data in bytes

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle or data, or data_len is 0

  • ESP_ERR_INVALID_STATE Transport not ready for binary (e.g. audio channel not open)

  • Other Error from transport_send_binary

esp_err_t esp_xiaozhi_chat_send_wake_word(esp_xiaozhi_chat_handle_t chat_hd, const char *wake_word)

Notify the server that a wake word was detected.

Parameters
  • chat_hd[in] Handle to the chat instance

  • wake_word[in] Pointer to the wake word (non-empty string)

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle or wake_word

  • ESP_ERR_INVALID_STATE No session (open audio channel first)

  • ESP_ERR_NO_MEM Failed to create JSON

  • Other Error from transport_send_text

esp_err_t esp_xiaozhi_chat_send_start_listening(esp_xiaozhi_chat_handle_t chat_hd, int mode)

Send a start-listening message.

Parameters
  • chat_hd[in] Handle to the chat instance

  • mode[in] Listening mode (see esp_xiaozhi_chat_listening_mode_t)

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

  • ESP_ERR_INVALID_STATE No session (open audio channel first)

  • ESP_ERR_NO_MEM Failed to create JSON

  • Other Error from transport_send_text

esp_err_t esp_xiaozhi_chat_send_stop_listening(esp_xiaozhi_chat_handle_t chat_hd)

Send a stop-listening message.

Parameters
  • chat_hd[in] Handle to the chat instance

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

  • ESP_ERR_INVALID_STATE No session (open audio channel first)

  • ESP_ERR_NO_MEM Failed to create JSON

  • Other Error from transport_send_text

esp_err_t esp_xiaozhi_chat_send_abort_speaking(esp_xiaozhi_chat_handle_t chat_hd, esp_xiaozhi_chat_abort_speaking_reason_t reason)

Send an abort-speaking message.

Parameters
  • chat_hd[in] Handle to the chat instance

  • reason[in] Reason for aborting speaking

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

  • ESP_ERR_INVALID_STATE No session (open audio channel first)

  • ESP_ERR_NO_MEM Failed to create JSON

  • Other Error from transport_send_text

Structures

struct esp_xiaozhi_chat_audio_t

Audio packet for Xiaozhi chat; also used to pass audio parameters to esp_xiaozhi_chat_open_audio_channel().

  • Packet use: set sample_rate, frame_duration, timestamp, payload, and payload_size; format and channels are ignored.

  • open_audio_channel use: set format (NULL = “opus”), sample_rate (0 = 16000, otherwise 8000-48000), channels (0 = 1, otherwise 1-2), and frame_duration (0 = 60, otherwise 10-120); payload, timestamp, and payload_size are ignored.

Public Members

const char *format

Audio format for hello, e.g. “opus”, “pcm”. NULL means “opus”. Ignored in packet use.

int sample_rate

Sample rate (Hz). For hello: 0 means 16000

int channels

Channel count for hello. 0 means 1. Ignored in packet use.

int frame_duration

Frame duration (ms). For hello: 0 means 60

uint32_t timestamp

Timestamp (packet use only)

uint8_t *payload

Payload (packet use only)

size_t payload_size

Payload size (packet use only)
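The relationship between sample_rate and frame_duration can be made concrete with a little arithmetic. The sketch below is illustrative only: frame_samples and pcm16_frame_bytes are hypothetical helpers, and the 16-bit sample width is an assumption, since this reference does not state a bit depth.

```c
#include <stddef.h>

/* Samples per channel in one frame:
 * sample_rate (Hz) * frame_duration (ms) / 1000.
 * Both inputs are assumed to be positive. */
static size_t frame_samples(int sample_rate, int frame_duration_ms)
{
    return (size_t)sample_rate * (size_t)frame_duration_ms / 1000u;
}

/* Bytes in one frame of raw PCM, ASSUMING 16-bit (2-byte) samples.
 * The bit depth is an assumption made for this example. */
static size_t pcm16_frame_bytes(int sample_rate, int frame_duration_ms, int channels)
{
    return frame_samples(sample_rate, frame_duration_ms) * (size_t)channels * 2u;
}
```

With the default hello parameters (16000 Hz, 60 ms, 1 channel), frame_samples gives 960 samples per frame, and a 16-bit PCM frame would occupy 1920 bytes. An encoded OPUS payload is normally much smaller than this, so payload_size should always be taken from the packet rather than computed.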

struct esp_xiaozhi_chat_tts_state_t

TTS state payload for ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE. Pointers are valid only during the event callback.

Public Members

esp_xiaozhi_chat_tts_state_kind_t state

TTS state kind (start / stop / sentence_start)

const char *text

Non-NULL only when state is SENTENCE_START

struct esp_xiaozhi_chat_error_info_t

Error info for ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR (protocol layer only). Pointers are valid only during the event callback.

Public Members

esp_err_t code

Error code

const char *source

Hint about where the error originated, e.g. “transport”, “hello_timeout”, “udp”

struct esp_xiaozhi_chat_text_data_t

Text data structure for chat messages.

Public Members

esp_xiaozhi_chat_text_role_t role

Role of the message (user or assistant)

const char *text

Text content of the message

struct esp_xiaozhi_chat_config_t

Configuration structure for initializing a Xiaozhi chat session.

Public Members

esp_xiaozhi_chat_audio_type_t audio_type

Type of audio input/output to use

esp_xiaozhi_chat_audio_callback_t audio_callback

Callback function for handling audio data

esp_xiaozhi_chat_event_callback_t event_callback

Callback function for handling Xiaozhi events

void *audio_callback_ctx

Context pointer passed to the audio callback

void *event_callback_ctx

Context pointer passed to the event callback

esp_mcp_t *mcp_engine

MCP engine instance provided by the caller

bool owns_mcp_engine

Whether chat takes ownership of mcp_engine and destroys it in deinit

bool has_mqtt_config

True if the server provides MQTT config (from get_info). When both MQTT and WebSocket are supported, MQTT is preferred

bool has_websocket_config

True if server provides WebSocket config (from get_info)
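The transport preference described by has_mqtt_config and has_websocket_config can be sketched as a small decision function. This is a sketch of the documented rule only: transport_choice_t and choose_transport are hypothetical names, and the component's internal selection logic may differ in detail.

```c
#include <stdbool.h>

/* Hypothetical enum for illustration; this reference does not document such a type. */
typedef enum {
    TRANSPORT_NONE,       /* neither config is available */
    TRANSPORT_MQTT,       /* MQTT, with UDP for audio */
    TRANSPORT_WEBSOCKET
} transport_choice_t;

/* Apply the documented preference: MQTT wins when both configs are present. */
static transport_choice_t choose_transport(bool has_mqtt_config, bool has_websocket_config)
{
    if (has_mqtt_config) {
        return TRANSPORT_MQTT;
    }
    if (has_websocket_config) {
        return TRANSPORT_WEBSOCKET;
    }
    return TRANSPORT_NONE;
}
```

The TRANSPORT_NONE case plausibly corresponds to esp_xiaozhi_chat_start returning ESP_ERR_NOT_FOUND, though this reference does not spell out that mapping. An application would typically copy both flags from the esp_xiaozhi_chat_info_t filled by esp_xiaozhi_chat_get_info.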

Macros

ESP_XIAOZHI_CHAT_EVENT_CONNECTED

Event bits for ESP event system (app may register for these). These are the only event bits exposed to the app; do not add internal sync flags here.

ESP_XIAOZHI_CHAT_EVENT_DISCONNECTED

ESP_XIAOZHI_CHAT_EVENT_AUDIO_CHANNEL_OPENED

ESP_XIAOZHI_CHAT_EVENT_AUDIO_CHANNEL_CLOSED

ESP_XIAOZHI_CHAT_EVENT_AUDIO_DATA_INCOMING

ESP_XIAOZHI_CHAT_EVENT_SERVER_GOODBYE

ESP_XIAOZHI_CHAT_DEFAULT_CONFIG()

Default configuration initializer for esp_xiaozhi_chat_config_t.

Type Definitions

typedef uint32_t esp_xiaozhi_chat_handle_t

Handle for a Xiaozhi chat session.

typedef void (*esp_xiaozhi_chat_audio_callback_t)(const uint8_t *data, int len, void *ctx)

Callback for receiving audio data during chat.

The data buffer is owned by the chat module and is only valid for the duration of this callback. Implementations must consume or copy the data before returning and must not store the pointer for asynchronous use.

Param data

Pointer to the audio data buffer, valid only during this callback

Param len

Length of the audio data in bytes

Param ctx

User-defined context passed to the callback
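Because the data pointer is valid only inside the callback, an implementation has to copy the bytes into storage it owns. The following is a minimal sketch of that pattern: app_audio_buf_t, APP_AUDIO_BUF_SIZE, and app_audio_callback are hypothetical names, and a real application would more likely hand the copy to a queue or ring buffer drained by a playback task.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical application-owned buffer; the size is an arbitrary choice. */
#define APP_AUDIO_BUF_SIZE 4096

typedef struct {
    uint8_t data[APP_AUDIO_BUF_SIZE];
    size_t  used;     /* bytes currently stored */
    size_t  dropped;  /* bytes discarded because the buffer was full */
} app_audio_buf_t;

/* Matches the documented esp_xiaozhi_chat_audio_callback_t signature.
 * Copies the bytes before returning and never keeps the data pointer. */
static void app_audio_callback(const uint8_t *data, int len, void *ctx)
{
    app_audio_buf_t *buf = (app_audio_buf_t *)ctx;
    if (buf == NULL || data == NULL || len <= 0) {
        return;
    }

    size_t n = (size_t)len;
    size_t room = APP_AUDIO_BUF_SIZE - buf->used;
    if (n > room) {
        /* Keep what fits and count the rest as dropped. */
        buf->dropped += n - room;
        n = room;
    }
    memcpy(buf->data + buf->used, data, n);
    buf->used += n;
}
```

To use it, the application would set audio_callback to app_audio_callback and audio_callback_ctx to the address of an app_audio_buf_t in esp_xiaozhi_chat_config_t before calling esp_xiaozhi_chat_init. The sketch is not thread-safe; if another task reads the buffer, access needs a lock or a thread-safe queue.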

typedef void (*esp_xiaozhi_chat_event_callback_t)(esp_xiaozhi_chat_event_t event, void *event_data, void *ctx)

Callback for receiving chat events.

Param event

Chat event type

Param event_data

Optional output data associated with the event

Param ctx

User-defined context passed to the callback

Enumerations

enum esp_xiaozhi_chat_tts_state_kind_t

TTS state kind for protocol-layer notification (app decides device state)

Values:

enumerator ESP_XIAOZHI_CHAT_TTS_STATE_START

TTS playback started

enumerator ESP_XIAOZHI_CHAT_TTS_STATE_STOP

TTS playback stopped

enumerator ESP_XIAOZHI_CHAT_TTS_STATE_SENTENCE_START

TTS sentence started; text is valid

enum esp_xiaozhi_chat_event_t

Events that can occur during a Xiaozhi chat session (minimal protocol API)

The component only reports protocol facts; the app handles the state machine, UI, and system commands.

Values:

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STARTED

Emitted on TTS start; prefer CHAT_TTS_STATE for new code

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STOPPED

Emitted on TTS stop; prefer CHAT_TTS_STATE for new code

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR

event_data = esp_xiaozhi_chat_error_info_t *

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_TEXT

event_data = esp_xiaozhi_chat_text_data_t * (STT/TTS sentence)

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_EMOJI

event_data = const char * (LLM emotion)

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE

event_data = esp_xiaozhi_chat_tts_state_t * (protocol TTS state)

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SYSTEM_CMD

event_data = const char * (e.g. “reboot”); app decides whether to execute
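An event callback typically switches on the event and casts event_data to the type listed for that enumerator. The sketch below shows the pattern for two events and copies strings into application storage before returning. It is a host-side illustration under stated assumptions: the type declarations at the top are stand-ins copied from this reference so the code compiles without the component header (their numeric values are not documented here), and app_state_t, copy_str, and app_event_callback are hypothetical names.

```c
#include <stdio.h>

/* ---- Host-only stand-ins copied from this reference. In a real project,
 * remove them and include the component header instead. ---- */
typedef enum {
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STARTED,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STOPPED,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_TEXT,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_EMOJI,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SYSTEM_CMD
} esp_xiaozhi_chat_event_t;

typedef enum {
    ESP_XIAOZHI_CHAT_TEXT_ROLE_USER,
    ESP_XIAOZHI_CHAT_TEXT_ROLE_ASSISTANT
} esp_xiaozhi_chat_text_role_t;

typedef struct {
    esp_xiaozhi_chat_text_role_t role;
    const char *text;
} esp_xiaozhi_chat_text_data_t;
/* ---- End of stand-ins ---- */

/* Hypothetical application state that outlives the callback. */
typedef struct {
    char last_text[128];
    char last_cmd[32];
    int  last_role;
} app_state_t;

/* Bounded copy that always NUL-terminates; NULL is treated as "". */
static void copy_str(char *dst, size_t dst_size, const char *src)
{
    snprintf(dst, dst_size, "%s", src != NULL ? src : "");
}

static void app_event_callback(esp_xiaozhi_chat_event_t event, void *event_data, void *ctx)
{
    app_state_t *app = (app_state_t *)ctx;
    if (app == NULL) {
        return;
    }

    switch (event) {
    case ESP_XIAOZHI_CHAT_EVENT_CHAT_TEXT: {
        const esp_xiaozhi_chat_text_data_t *t = event_data;
        if (t != NULL) {
            app->last_role = (int)t->role;
            copy_str(app->last_text, sizeof(app->last_text), t->text);
        }
        break;
    }
    case ESP_XIAOZHI_CHAT_EVENT_CHAT_SYSTEM_CMD:
        /* Record only; the app decides later whether to execute it. */
        copy_str(app->last_cmd, sizeof(app->last_cmd), (const char *)event_data);
        break;
    default:
        break;
    }
}
```

Copying is the cautious choice here: the reference states that pointers in the TTS state and error payloads are valid only during the callback, and it does not state a longer lifetime for the other payloads.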

enum esp_xiaozhi_chat_audio_type_t

Supported audio formats for Xiaozhi chat.

Values:

enumerator ESP_XIAOZHI_CHAT_AUDIO_TYPE_OPUS

OPUS compressed audio format

enum esp_xiaozhi_chat_device_state_t

Device state for Xiaozhi chat.

Values:

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_UNKNOWN

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_STARTING

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_WIFI_CONFIGURING

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_IDLE

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_CONNECTING

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_LISTENING

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_SPEAKING

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_UPGRADING

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_ACTIVATING

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_AUDIO_TESTING

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_FATAL_ERROR

enum esp_xiaozhi_chat_listening_mode_t

Listening mode for Xiaozhi chat.

Values:

enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_REALTIME

enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_AUTO

enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_MANUAL

enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_AUTO_STOP

enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_MANUAL_STOP

enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_UNKNOWN

enum esp_xiaozhi_chat_abort_speaking_reason_t

Reasons for aborting speaking.

Values:

enumerator ESP_XIAOZHI_CHAT_ABORT_SPEAKING_REASON_WAKE_WORD_DETECTED

enumerator ESP_XIAOZHI_CHAT_ABORT_SPEAKING_REASON_STOP_LISTENING

enumerator ESP_XIAOZHI_CHAT_ABORT_SPEAKING_REASON_UNKNOWN

enum esp_xiaozhi_chat_text_role_t

Text role enumeration for chat messages.

Values:

enumerator ESP_XIAOZHI_CHAT_TEXT_ROLE_USER

User message role

enumerator ESP_XIAOZHI_CHAT_TEXT_ROLE_ASSISTANT

Assistant message role

Header File

Functions

esp_err_t esp_xiaozhi_chat_get_info(esp_xiaozhi_chat_info_t *info)

Get Xiaozhi Chat Information from the HTTP server.

The function posts board information to the configured service endpoint, parses the response, updates the output structure, and persists MQTT/WebSocket settings to NVS when present in the server response.

Parameters
  • info[inout] Pointer to the information structure

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid info pointer

  • ESP_ERR_NO_MEM Failed to allocate working buffers or HTTP resources

  • ESP_ERR_INVALID_RESPONSE Server response is malformed or missing a valid body

  • Other Error from board info collection, HTTP client, JSON parsing, or keystore persistence

esp_err_t esp_xiaozhi_chat_free_info(esp_xiaozhi_chat_info_t *info)

Free Xiaozhi Chat Information.

Releases dynamically allocated string fields owned by info. This function does not zero the full structure or reset the boolean flags.

Parameters
  • info[inout] Pointer to the information structure

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid info pointer

Structures

struct esp_xiaozhi_chat_info_t

Information for Xiaozhi chat.

Public Members

char *current_version

Current version of the firmware

char *firmware_version

Firmware version

char *firmware_url

Firmware URL

char *serial_number

Serial number

char *activation_code

Activation code

char *activation_challenge

Activation challenge

char *activation_message

Activation message

int activation_timeout_ms

Activation timeout in milliseconds

bool has_serial_number

Has serial number

bool has_new_version

Has new version

bool has_activation_code

Has activation code

bool has_activation_challenge

Has activation challenge

bool has_mqtt_config

Has MQTT config

bool has_websocket_config

Has WebSocket config

bool has_server_time

Has server time