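Xiaozhi AI Chatbot
Supported chips |
ESP32-S3 |
Xiaozhi is a bidirectional streaming conversation component that connects to the xiaozhi.me service. It supports real-time voice/text interaction with AI agents backed by large language models such as Qwen and DeepSeek.
The component is well suited to use cases such as voice assistants and intelligent voice Q&A systems. Its low-latency, lightweight design makes it a good fit for applications running on embedded devices such as the ESP32.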
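Features
Bidirectional streaming: real-time voice and text interaction with AI agents
Multiple communication protocols: supports WebSocket and MQTT+UDP
Audio codec support: OPUS, G.711, and PCM audio formats
MCP integration: device-side MCP for device control (speaker, LED, servo, GPIO, etc.)
Multi-language support: Chinese and English
Offline wake word: provides an API for reporting the wake word (for example esp_xiaozhi_chat_send_wake_word); integration with ESP-SR is implemented by the application layer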
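Architecture
Xiaozhi uses a streaming ASR (automatic speech recognition) + LLM (large language model) + TTS (text-to-speech) architecture for voice interaction:
Audio input: captures audio from the microphone
ASR: converts speech to text in real time
LLM: processes the text and generates a response
TTS: converts the text response to speech
Audio output: plays the audio through the speaker
The component integrates with MCP (Model Context Protocol) to provide device control.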
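Examples
Xiaozhi application example: ai/xiaozhi_chat. A complete voice assistant application that demonstrates:
- Voice interaction with AI agents
- Device control over the MCP protocol
- Multi-language support
- Display support
API Reference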
Header File
Functions
-
esp_err_t esp_xiaozhi_chat_init(esp_xiaozhi_chat_config_t *config, esp_xiaozhi_chat_handle_t *chat_hd)
Create an instance of the chat module.
The current implementation supports only one chat instance at a time.
- Parameters
config – [in] Pointer to the chat configuration structure
chat_hd – [out] Pointer to the chat handle
- Returns
ESP_OK On success
ESP_ERR_NO_MEM Out of memory
ESP_ERR_INVALID_ARG Invalid arguments
ESP_ERR_INVALID_STATE Another chat instance is already active
-
esp_err_t esp_xiaozhi_chat_deinit(esp_xiaozhi_chat_handle_t chat_hd)
Deinitialize the chat module.
This function releases chat-owned resources. If the chat session is still running, it will stop runtime resources first. The MCP engine is destroyed only when config.owns_mcp_engine was set to true during init.
- Parameters
chat_hd – [in] Handle to the chat instance
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
-
esp_err_t esp_xiaozhi_chat_start(esp_xiaozhi_chat_handle_t chat_hd)
Start the chat session.
- Parameters
chat_hd – [in] Handle to the chat instance
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
ESP_ERR_NOT_FOUND Required transport configuration is missing
Other Error from transport start (MQTT or WebSocket)
-
esp_err_t esp_xiaozhi_chat_stop(esp_xiaozhi_chat_handle_t chat_hd)
Stop the chat session.
Stops the active chat runtime, including audio channel and MCP manager resources, but does not destroy the configured MCP engine.
- Parameters
chat_hd – [in] Handle to the chat instance
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
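The init, start, stop, and deinit calls described above pair up naturally. The following is a minimal sketch of one way an application might wrap them so that a failed start does not leave the single permitted instance allocated. Everything in the "host-side stand-ins" section is an assumption made so the sketch compiles without ESP-IDF (including the placeholder config struct and the stub bodies); on a device the real headers replace it. The helper names app_chat_bring_up and app_chat_tear_down, and the use of 0 as "no handle", are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* ---- Host-side stand-ins (assumptions). On a device, include the real
 * ESP-IDF and component headers and delete this whole section. ---- */
typedef int esp_err_t;
#define ESP_OK   0
#define ESP_FAIL (-1)
typedef uint32_t esp_xiaozhi_chat_handle_t;
typedef struct { int placeholder; } esp_xiaozhi_chat_config_t;

int stub_instances;            /* 1 while a stub instance exists */
esp_err_t stub_start_result;   /* value returned by the stub start(); 0 is ESP_OK */

esp_err_t esp_xiaozhi_chat_init(esp_xiaozhi_chat_config_t *config,
                                esp_xiaozhi_chat_handle_t *chat_hd)
{
    if (config == NULL || chat_hd == NULL || stub_instances != 0) {
        return ESP_FAIL;
    }
    stub_instances = 1;
    *chat_hd = 1;
    return ESP_OK;
}

esp_err_t esp_xiaozhi_chat_start(esp_xiaozhi_chat_handle_t chat_hd)
{
    (void)chat_hd;
    return stub_start_result;
}

esp_err_t esp_xiaozhi_chat_stop(esp_xiaozhi_chat_handle_t chat_hd)
{
    (void)chat_hd;
    return ESP_OK;
}

esp_err_t esp_xiaozhi_chat_deinit(esp_xiaozhi_chat_handle_t chat_hd)
{
    (void)chat_hd;
    stub_instances = 0;
    return ESP_OK;
}
/* ---- End of stand-ins ---- */

/* Create and start a session; if start fails, release the instance again. */
esp_err_t app_chat_bring_up(esp_xiaozhi_chat_config_t *config,
                            esp_xiaozhi_chat_handle_t *out)
{
    esp_err_t err = esp_xiaozhi_chat_init(config, out);
    if (err != ESP_OK) {
        return err;
    }
    err = esp_xiaozhi_chat_start(*out);
    if (err != ESP_OK) {
        esp_xiaozhi_chat_deinit(*out);   /* only one instance is supported */
        *out = 0;                        /* hypothetical "no handle" marker */
    }
    return err;
}

/* Tear down in reverse order of bring-up. */
void app_chat_tear_down(esp_xiaozhi_chat_handle_t chat)
{
    esp_xiaozhi_chat_stop(chat);
    esp_xiaozhi_chat_deinit(chat);
}
```

Since deinit is documented to stop a running session first, the explicit stop in the tear-down helper is a readability choice rather than a requirement.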
-
esp_err_t esp_xiaozhi_chat_open_audio_channel(esp_xiaozhi_chat_handle_t chat_hd, const esp_xiaozhi_chat_audio_t *audio, char *message, size_t message_len)
Open audio channel.
- Parameters
chat_hd – [in] Handle to the chat instance
audio – [in] Optional audio params for the generated hello (format, sample_rate, channels, frame_duration). Used only when message is NULL. NULL or zero fields mean defaults: “opus”, 16000, 1, 60. Non-zero values must be within valid protocol ranges.
message – [in] Optional message to send when opening the channel. If NULL, a default hello message will be generated
message_len – [in] Length of the message buffer. If 0 with message NULL, a default hello is generated; if message is non-NULL, message_len must be > 0 (otherwise returns ESP_ERR_INVALID_ARG)
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle, invalid audio params, or invalid message/message_len combination
ESP_ERR_NO_MEM Failed to allocate hello message buffer
ESP_ERR_INVALID_SIZE Hello message buffer too small
Other Error from get_hello_message, transport_send_text, or audio_open
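The default and range rules for the audio parameter can be pre-checked on the application side before calling esp_xiaozhi_chat_open_audio_channel. The sketch below is one possible way to do that. The hello_params_t type and hello_params_resolve function are hypothetical app-side helpers, not part of the component, and the component's own validation may differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical app-side mirror of the hello-related fields. */
typedef struct {
    const char *format;   /* NULL means "opus" */
    int sample_rate;      /* 0 means 16000, otherwise 8000-48000 */
    int channels;         /* 0 means 1, otherwise 1-2 */
    int frame_duration;   /* 0 means 60 ms, otherwise 10-120 */
} hello_params_t;

/* Fill in the documented defaults, then check the documented ranges.
 * Returns true and writes *out when the resolved values are acceptable;
 * returns false and leaves *out untouched otherwise. */
bool hello_params_resolve(const hello_params_t *in, hello_params_t *out)
{
    hello_params_t p = { NULL, 0, 0, 0 };
    if (in != NULL) {
        p = *in;
    }
    if (p.format == NULL)      p.format = "opus";
    if (p.sample_rate == 0)    p.sample_rate = 16000;
    if (p.channels == 0)       p.channels = 1;
    if (p.frame_duration == 0) p.frame_duration = 60;

    if (p.sample_rate < 8000 || p.sample_rate > 48000)   return false;
    if (p.channels < 1 || p.channels > 2)                return false;
    if (p.frame_duration < 10 || p.frame_duration > 120) return false;

    *out = p;
    return true;
}
```

Passing NULL resolves to the full set of defaults, which mirrors the documented behavior of a NULL audio argument.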
-
esp_err_t esp_xiaozhi_chat_close_audio_channel(esp_xiaozhi_chat_handle_t chat_hd)
Close audio channel.
- Parameters
chat_hd – [in] Handle to the chat instance
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
-
esp_err_t esp_xiaozhi_chat_send_audio_data(esp_xiaozhi_chat_handle_t chat_hd, const char *data, size_t data_len)
Send audio data to the chat session.
- Parameters
chat_hd – [in] Handle to the chat instance
data – [in] Pointer to the audio data buffer
data_len – [in] Length of the audio data in bytes
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle, data, or data_len is 0
ESP_ERR_INVALID_STATE Transport not ready for binary (e.g. audio channel not open)
Other Error from transport_send_binary
-
esp_err_t esp_xiaozhi_chat_send_wake_word(esp_xiaozhi_chat_handle_t chat_hd, const char *wake_word)
Send wake word detected.
- Parameters
chat_hd – [in] Handle to the chat instance
wake_word – [in] Pointer to the wake word (non-empty string)
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle or wake_word
ESP_ERR_INVALID_STATE No session (open audio channel first)
ESP_ERR_NO_MEM Failed to create JSON
Other Error from transport_send_text
-
esp_err_t esp_xiaozhi_chat_send_start_listening(esp_xiaozhi_chat_handle_t chat_hd, int mode)
Send start listening.
- Parameters
chat_hd – [in] Handle to the chat instance
mode – [in] Listening mode
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
ESP_ERR_INVALID_STATE No session (open audio channel first)
ESP_ERR_NO_MEM Failed to create JSON
Other Error from transport_send_text
-
esp_err_t esp_xiaozhi_chat_send_stop_listening(esp_xiaozhi_chat_handle_t chat_hd)
Send stop listening.
- Parameters
chat_hd – [in] Handle to the chat instance
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
ESP_ERR_INVALID_STATE No session (open audio channel first)
ESP_ERR_NO_MEM Failed to create JSON
Other Error from transport_send_text
-
esp_err_t esp_xiaozhi_chat_send_abort_speaking(esp_xiaozhi_chat_handle_t chat_hd, esp_xiaozhi_chat_abort_speaking_reason_t reason)
Send abort speaking.
- Parameters
chat_hd – [in] Handle to the chat instance
reason – [in] Reason for aborting speaking
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
ESP_ERR_INVALID_STATE No session (open audio channel first)
ESP_ERR_NO_MEM Failed to create JSON
Other Error from transport_send_text
Structures
-
struct esp_xiaozhi_chat_audio_t
Audio packet for Xiaozhi chat; also used as audio params when passed to esp_xiaozhi_chat_open_audio_channel(). For packet use: set sample_rate, frame_duration, timestamp, payload, payload_size (format/channels ignored). For open_audio_channel use: set format (NULL = “opus”), sample_rate (0 = 16000, otherwise 8000-48000), channels (0 = 1, otherwise 1-2), frame_duration (0 = 60, otherwise 10-120); payload/timestamp/payload_size ignored.
Public Members
-
const char *format
Audio format for hello, e.g. “opus”, “pcm”. NULL means “opus”. Packet use: ignore
-
int sample_rate
Sample rate (Hz). For hello: 0 means 16000
-
int channels
Channel count for hello. 0 means 1. Packet use: ignore
-
int frame_duration
Frame duration (ms). For hello: 0 means 60
-
uint32_t timestamp
Timestamp (packet use only)
-
uint8_t *payload
Payload (packet use only)
-
size_t payload_size
Payload size (packet use only)
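As a worked example of the sample_rate, frame_duration, and channels fields, the sketch below computes how much raw PCM one frame covers before encoding. It assumes 16-bit samples, which this reference does not state, and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* PCM samples per channel in one frame: rate (Hz) * duration (ms) / 1000. */
size_t pcm_samples_per_frame(int sample_rate, int frame_duration_ms)
{
    return (size_t)sample_rate * (size_t)frame_duration_ms / 1000u;
}

/* Bytes in one frame of 16-bit PCM across all channels (assumed sample width). */
size_t pcm16_frame_bytes(int sample_rate, int frame_duration_ms, int channels)
{
    return pcm_samples_per_frame(sample_rate, frame_duration_ms)
           * (size_t)channels * sizeof(int16_t);
}
```

With the documented defaults (16000 Hz, 60 ms, 1 channel) this gives 960 samples per frame, or 1920 bytes of 16-bit mono PCM. An OPUS payload produced from such a frame would normally be much smaller.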
-
struct esp_xiaozhi_chat_tts_state_t
TTS state payload for ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE. Pointers are valid only during the event callback.
Public Members
-
esp_xiaozhi_chat_tts_state_kind_t state
TTS state kind (start / stop / sentence_start)
-
const char *text
Non-NULL only when state is SENTENCE_START
-
struct esp_xiaozhi_chat_error_info_t
Error info for ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR (protocol layer only). Pointers are valid only during the event callback.
-
struct esp_xiaozhi_chat_text_data_t
Text data structure for chat messages.
Public Members
-
esp_xiaozhi_chat_text_role_t role
Role of the message (user or assistant)
-
const char *text
Text content of the message
-
struct esp_xiaozhi_chat_config_t
Configuration structure for initializing a Xiaozhi chat session.
Public Members
-
esp_xiaozhi_chat_audio_type_t audio_type
Type of audio input/output to use
-
esp_xiaozhi_chat_audio_callback_t audio_callback
Callback function for handling audio data
-
esp_xiaozhi_chat_event_callback_t event_callback
Callback function for handling Xiaozhi events
-
void *audio_callback_ctx
Context pointer passed to the audio callback
-
void *event_callback_ctx
Context pointer passed to the event callback
-
esp_mcp_t *mcp_engine
MCP engine instance provided by the caller
-
bool owns_mcp_engine
Whether chat takes ownership of mcp_engine and destroys it in deinit
-
bool has_mqtt_config
True if server provides MQTT config (from get_info). When both MQTT and WebSocket supported, prefer MQTT
-
bool has_websocket_config
True if server provides WebSocket config (from get_info)
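The has_mqtt_config and has_websocket_config flags imply a simple preference order: MQTT when both are available, WebSocket otherwise. The sketch below expresses that rule as an application-side helper. The app_transport_t type and app_pick_transport function are hypothetical and only illustrate the documented preference; how the component selects its transport internally is not shown in this reference.

```c
#include <stdbool.h>

/* Hypothetical app-side view of the available transports. */
typedef enum {
    APP_TRANSPORT_NONE,        /* no usable configuration */
    APP_TRANSPORT_MQTT,        /* MQTT+UDP */
    APP_TRANSPORT_WEBSOCKET
} app_transport_t;

/* Prefer MQTT when both configurations are present. */
app_transport_t app_pick_transport(bool has_mqtt_config, bool has_websocket_config)
{
    if (has_mqtt_config) {
        return APP_TRANSPORT_MQTT;
    }
    if (has_websocket_config) {
        return APP_TRANSPORT_WEBSOCKET;
    }
    return APP_TRANSPORT_NONE;
}
```

The APP_TRANSPORT_NONE case corresponds to the situation in which esp_xiaozhi_chat_start is documented to return ESP_ERR_NOT_FOUND because the required transport configuration is missing.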
Macros
-
ESP_XIAOZHI_CHAT_EVENT_CONNECTED
Event bits for ESP event system (app may register for these). These are the only event bits exposed to the app; do not add internal sync flags here.
-
ESP_XIAOZHI_CHAT_EVENT_DISCONNECTED
-
ESP_XIAOZHI_CHAT_EVENT_AUDIO_CHANNEL_OPENED
-
ESP_XIAOZHI_CHAT_EVENT_AUDIO_CHANNEL_CLOSED
-
ESP_XIAOZHI_CHAT_EVENT_AUDIO_DATA_INCOMING
-
ESP_XIAOZHI_CHAT_EVENT_SERVER_GOODBYE
-
ESP_XIAOZHI_CHAT_DEFAULT_CONFIG()
Default configuration initializer for esp_xiaozhi_chat_config_t.
Type Definitions
-
typedef uint32_t esp_xiaozhi_chat_handle_t
Handle for a Xiaozhi chat session.
-
typedef void (*esp_xiaozhi_chat_audio_callback_t)(const uint8_t *data, int len, void *ctx)
Callback for receiving audio data during chat.
The data buffer is owned by the chat module and is only valid for the duration of this callback. Implementations must consume or copy the data before returning and must not store the pointer for asynchronous use.
- Param data
Pointer to the audio data buffer, valid only during this callback
- Param len
Length of the audio data in bytes
- Param ctx
User-defined context passed to the callback
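Because the data pointer is valid only during the callback, an implementation has to copy the bytes into storage it owns. The following is a minimal sketch of such a callback using the documented signature. The rx_slot_t type, the RX_SLOT_BYTES limit, and the on_audio name are hypothetical, and the 1024-byte bound is an arbitrary choice for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_SLOT_BYTES 1024   /* hypothetical upper bound on one incoming packet */

/* Hypothetical app-owned storage that outlives the callback. */
typedef struct {
    uint8_t  data[RX_SLOT_BYTES];
    size_t   len;
    unsigned dropped;        /* packets rejected because they did not fit */
} rx_slot_t;

/* Same shape as esp_xiaozhi_chat_audio_callback_t. */
void on_audio(const uint8_t *data, int len, void *ctx)
{
    rx_slot_t *slot = (rx_slot_t *)ctx;
    if (slot == NULL || data == NULL || len <= 0) {
        return;
    }
    if ((size_t)len > sizeof(slot->data)) {
        slot->dropped++;     /* too large: count it, never keep the pointer */
        return;
    }
    memcpy(slot->data, data, (size_t)len);   /* copy before returning */
    slot->len = (size_t)len;
}
```

In an application, on_audio would be assigned to audio_callback and a pointer to the slot to audio_callback_ctx in the configuration structure. A real design would more likely hand the copy to a queue or ring buffer consumed by a playback task rather than a single slot.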
-
typedef void (*esp_xiaozhi_chat_event_callback_t)(esp_xiaozhi_chat_event_t event, void *event_data, void *ctx)
Callback for receiving chat events.
- Param event
Chat event type
- Param event_data
Optional output data associated with the event
- Param ctx
User-defined context passed to the callback
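An event callback typically switches on the event type and casts event_data to the payload documented for that event. The sketch below handles two events and copies any strings, since payload pointers are documented as valid only during the callback. The type declarations at the top are stand-ins so the sketch compiles on a host: the names follow this reference, but the enumerator values are assumptions, and a real project would include the component header instead. The app_ui_t type and the on_chat_event and copy_str names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* ---- Stand-in declarations (assumed values); use the real header on a device. ---- */
typedef enum {
    ESP_XIAOZHI_CHAT_TTS_STATE_START,
    ESP_XIAOZHI_CHAT_TTS_STATE_STOP,
    ESP_XIAOZHI_CHAT_TTS_STATE_SENTENCE_START
} esp_xiaozhi_chat_tts_state_kind_t;

typedef struct {
    esp_xiaozhi_chat_tts_state_kind_t state;
    const char *text;   /* non-NULL only for SENTENCE_START */
} esp_xiaozhi_chat_tts_state_t;

typedef enum {
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STARTED,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STOPPED,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_TEXT,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_EMOJI,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SYSTEM_CMD
} esp_xiaozhi_chat_event_t;
/* ---- End of stand-ins ---- */

/* Hypothetical app state that keeps its own copies of the strings. */
typedef struct {
    char sentence[128];
    char emotion[32];
} app_ui_t;

/* Bounded copy that always terminates; a NULL source becomes "". */
static void copy_str(char *dst, size_t dst_size, const char *src)
{
    snprintf(dst, dst_size, "%s", src != NULL ? src : "");
}

/* Same shape as esp_xiaozhi_chat_event_callback_t. */
void on_chat_event(esp_xiaozhi_chat_event_t event, void *event_data, void *ctx)
{
    app_ui_t *ui = (app_ui_t *)ctx;
    if (ui == NULL) {
        return;
    }
    switch (event) {
    case ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE: {
        const esp_xiaozhi_chat_tts_state_t *tts =
            (const esp_xiaozhi_chat_tts_state_t *)event_data;
        if (tts != NULL && tts->state == ESP_XIAOZHI_CHAT_TTS_STATE_SENTENCE_START) {
            copy_str(ui->sentence, sizeof(ui->sentence), tts->text);
        }
        break;
    }
    case ESP_XIAOZHI_CHAT_EVENT_CHAT_EMOJI:
        copy_str(ui->emotion, sizeof(ui->emotion), (const char *)event_data);
        break;
    default:
        break;   /* other events are ignored in this sketch */
    }
}
```

Copying into fixed-size buffers truncates long sentences; an application that needs the full text would size the buffers accordingly or allocate per message.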
Enumerations
-
enum esp_xiaozhi_chat_tts_state_kind_t
TTS state kind for protocol-layer notification (app decides device state)
Values:
-
enumerator ESP_XIAOZHI_CHAT_TTS_STATE_START
TTS playback started
-
enumerator ESP_XIAOZHI_CHAT_TTS_STATE_STOP
TTS playback stopped
-
enumerator ESP_XIAOZHI_CHAT_TTS_STATE_SENTENCE_START
TTS sentence started; text is valid
-
enum esp_xiaozhi_chat_event_t
Events that can occur during a Xiaozhi chat session (minimal protocol API)
Component only reports protocol facts; app handles state machine, UI, and system commands.
Values:
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STARTED
Emitted on TTS start; prefer CHAT_TTS_STATE for new code
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STOPPED
Emitted on TTS stop; prefer CHAT_TTS_STATE for new code
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR
event_data = esp_xiaozhi_chat_error_info_t *
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_TEXT
event_data = esp_xiaozhi_chat_text_data_t * (STT/TTS sentence)
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_EMOJI
event_data = const char * (LLM emotion)
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE
event_data = esp_xiaozhi_chat_tts_state_t * (protocol TTS state)
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SYSTEM_CMD
event_data = const char * (e.g. “reboot”); app decides whether to execute
-
enum esp_xiaozhi_chat_audio_type_t
Supported audio formats for Xiaozhi chat.
Values:
-
enumerator ESP_XIAOZHI_CHAT_AUDIO_TYPE_OPUS
OPUS compressed audio format
-
enum esp_xiaozhi_chat_device_state_t
Device state for Xiaozhi chat.
Values:
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_UNKNOWN
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_STARTING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_WIFI_CONFIGURING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_IDLE
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_CONNECTING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_LISTENING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_SPEAKING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_UPGRADING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_ACTIVATING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_AUDIO_TESTING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_FATAL_ERROR
-
enum esp_xiaozhi_chat_listening_mode_t
Listening mode for Xiaozhi chat.
Values:
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_REALTIME
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_AUTO
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_MANUAL
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_AUTO_STOP
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_MANUAL_STOP
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_UNKNOWN
Header File
Functions
-
esp_err_t esp_xiaozhi_chat_get_info(esp_xiaozhi_chat_info_t *info)
Get Xiaozhi Chat Information from the HTTP server.
The function posts board information to the configured service endpoint, parses the response, updates the output structure, and persists MQTT/WebSocket settings to NVS when present in the server response.
- Parameters
info – [inout] Pointer to the information structure
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid info pointer
ESP_ERR_NO_MEM Failed to allocate working buffers or HTTP resources
ESP_ERR_INVALID_RESPONSE Server response is malformed or missing a valid body
Other Error from board info collection, HTTP client, JSON parsing, or keystore persistence
-
esp_err_t esp_xiaozhi_chat_free_info(esp_xiaozhi_chat_info_t *info)
Free Xiaozhi Chat Information.
Releases dynamically allocated string fields owned by info. This function does not zero the full structure or reset the boolean flags.
- Parameters
info – [inout] Pointer to the information structure
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid info pointer
Structures
-
struct esp_xiaozhi_chat_info_t
Information for Xiaozhi chat.
Public Members
-
char *current_version
Current version of the firmware
-
char *firmware_version
Firmware version
-
char *firmware_url
Firmware URL
-
char *serial_number
Serial number
-
char *activation_code
Activation code
-
char *activation_challenge
Activation challenge
-
char *activation_message
Activation message
-
int activation_timeout_ms
Activation timeout in milliseconds
-
bool has_serial_number
Has serial number
-
bool has_new_version
Has new version
-
bool has_activation_code
Has activation code
-
bool has_activation_challenge
Has activation challenge
-
bool has_mqtt_config
Has MQTT config
-
bool has_websocket_config
Has WebSocket config
-
bool has_server_time
Has server time