Xiaozhi AI Chatbot
Supported chips: ESP32-S3
Xiaozhi is a bidirectional streaming dialogue component that connects to the xiaozhi.me service. It supports real-time voice/text interaction with AI agents using large language models like Qwen and DeepSeek.
This component is ideal for use cases such as voice assistants and intelligent voice Q&A systems. It features low latency and a lightweight design, making it suitable for applications running on embedded devices such as the ESP32.
Features
Bidirectional Streaming: Real-time voice and text interaction with AI agents
Multiple Communication Protocols: Supports WebSocket and MQTT+UDP protocols
Audio Codec Support: OPUS, G.711, and PCM audio formats
MCP Integration: Device-side MCP for device control (speaker, LED, servo, GPIO, etc.)
Multi-language Support: Chinese and English
Offline Wake Word: Provides an API for reporting a detected wake word (e.g. esp_xiaozhi_chat_send_wake_word); wake word detection itself (ESP-SR integration) is left to the application
Architecture
Xiaozhi uses a streaming ASR (Automatic Speech Recognition) + LLM (Large Language Model) + TTS (Text-to-Speech) architecture for voice interaction:
Audio Input: Captures audio from microphone
ASR: Converts speech to text in real-time
LLM: Processes text and generates responses
TTS: Converts text responses to speech
Audio Output: Plays audio through speaker
The component integrates with the MCP (Model Context Protocol) to enable device control capabilities.
Examples
Xiaozhi App Example: ai/xiaozhi_chat. A complete voice assistant application demonstrating:
- Voice interaction with AI agents
- Device control via MCP protocol
- Multi-language support
- Display support
API Reference
Header File
Functions
-
esp_err_t esp_xiaozhi_chat_init(esp_xiaozhi_chat_config_t *config, esp_xiaozhi_chat_handle_t *chat_hd)
Create a chat module instance.
The current implementation supports only one chat instance at a time.
- Parameters
config – [in] Pointer to the chat configuration structure
chat_hd – [out] Pointer to the chat handle
- Returns
ESP_OK On success
ESP_ERR_NO_MEM Out of memory
ESP_ERR_INVALID_ARG Invalid arguments
ESP_ERR_INVALID_STATE Another chat instance is already active
-
esp_err_t esp_xiaozhi_chat_deinit(esp_xiaozhi_chat_handle_t chat_hd)
Deinitialize the chat module.
This function releases chat-owned resources. If the chat session is still running, it will stop runtime resources first. The MCP engine is destroyed only when config.owns_mcp_engine was set to true during init.
- Parameters
chat_hd – [in] Handle to the chat instance
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
-
esp_err_t esp_xiaozhi_chat_start(esp_xiaozhi_chat_handle_t chat_hd)
Start the chat session.
- Parameters
chat_hd – [in] Handle to the chat instance
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
ESP_ERR_NOT_FOUND Required transport configuration is missing
Other Error from transport start (MQTT or WebSocket)
-
esp_err_t esp_xiaozhi_chat_stop(esp_xiaozhi_chat_handle_t chat_hd)
Stop the chat session.
Stops the active chat runtime, including audio channel and MCP manager resources, but does not destroy the configured MCP engine.
- Parameters
chat_hd – [in] Handle to the chat instance
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
-
esp_err_t esp_xiaozhi_chat_open_audio_channel(esp_xiaozhi_chat_handle_t chat_hd, const esp_xiaozhi_chat_audio_t *audio, char *message, size_t message_len)
Open audio channel.
- Parameters
chat_hd – [in] Handle to the chat instance
audio – [in] Optional audio params for the generated hello (format, sample_rate, channels, frame_duration). Used only when message is NULL. NULL or zero fields mean defaults: “opus”, 16000, 1, 60. Non-zero values must be within valid protocol ranges.
message – [in] Optional message to send when opening the channel. If NULL, a default hello message will be generated
message_len – [in] Length of the message buffer. If 0 with message NULL, a default hello is generated; if message is non-NULL, message_len must be > 0 (otherwise returns ESP_ERR_INVALID_ARG)
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle, invalid audio params, or invalid message/message_len combination
ESP_ERR_NO_MEM Failed to allocate hello message buffer
ESP_ERR_INVALID_SIZE Hello message buffer too small
Other Error from get_hello_message, transport_send_text, or audio_open
-
esp_err_t esp_xiaozhi_chat_close_audio_channel(esp_xiaozhi_chat_handle_t chat_hd)
Close audio channel.
- Parameters
chat_hd – [in] Handle to the chat instance
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
-
esp_err_t esp_xiaozhi_chat_send_audio_data(esp_xiaozhi_chat_handle_t chat_hd, const char *data, size_t data_len)
Send audio data to the chat session.
- Parameters
chat_hd – [in] Handle to the chat instance
data – [in] Pointer to the audio data buffer
data_len – [in] Length of the audio data in bytes
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle, data, or data_len is 0
ESP_ERR_INVALID_STATE Transport not ready for binary (e.g. audio channel not open)
Other Error from transport_send_binary
-
esp_err_t esp_xiaozhi_chat_send_wake_word(esp_xiaozhi_chat_handle_t chat_hd, const char *wake_word)
Notify the server that a wake word was detected.
- Parameters
chat_hd – [in] Handle to the chat instance
wake_word – [in] Pointer to the wake word (non-empty string)
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle or wake_word
ESP_ERR_INVALID_STATE No session (open audio channel first)
ESP_ERR_NO_MEM Failed to create JSON
Other Error from transport_send_text
-
esp_err_t esp_xiaozhi_chat_send_start_listening(esp_xiaozhi_chat_handle_t chat_hd, int mode)
Send start listening.
- Parameters
chat_hd – [in] Handle to the chat instance
mode – [in] Listening mode
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
ESP_ERR_INVALID_STATE No session (open audio channel first)
ESP_ERR_NO_MEM Failed to create JSON
Other Error from transport_send_text
-
esp_err_t esp_xiaozhi_chat_send_stop_listening(esp_xiaozhi_chat_handle_t chat_hd)
Send stop listening.
- Parameters
chat_hd – [in] Handle to the chat instance
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
ESP_ERR_INVALID_STATE No session (open audio channel first)
ESP_ERR_NO_MEM Failed to create JSON
Other Error from transport_send_text
-
esp_err_t esp_xiaozhi_chat_send_abort_speaking(esp_xiaozhi_chat_handle_t chat_hd, esp_xiaozhi_chat_abort_speaking_reason_t reason)
Send abort speaking.
- Parameters
chat_hd – [in] Handle to the chat instance
reason – [in] Reason for aborting speaking
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid handle
ESP_ERR_INVALID_STATE No session (open audio channel first)
ESP_ERR_NO_MEM Failed to create JSON
Other Error from transport_send_text
Structures
-
struct esp_xiaozhi_chat_audio_t
Audio packet for Xiaozhi chat; also used as audio params when passed to esp_xiaozhi_chat_open_audio_channel(). For packet use: set sample_rate, frame_duration, timestamp, payload, payload_size (format/channels ignored). For open_audio_channel use: set format (NULL = “opus”), sample_rate (0 = 16000, otherwise 8000-48000), channels (0 = 1, otherwise 1-2), frame_duration (0 = 60, otherwise 10-120); payload/timestamp/payload_size ignored.
Public Members
-
const char *format
Audio format for hello, e.g. “opus”, “pcm”. NULL means “opus”. Ignored for packet use
-
int sample_rate
Sample rate (Hz). For hello: 0 means 16000
-
int channels
Channel count for hello. 0 means 1. Ignored for packet use
-
int frame_duration
Frame duration (ms). For hello: 0 means 60
-
uint32_t timestamp
Timestamp (packet use only)
-
uint8_t *payload
Payload (packet use only)
-
size_t payload_size
Payload size (packet use only)
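The defaulting and range rules described above for open_audio_channel use can be sketched as a standalone helper. This is only an illustration of the documented rules, not the component's implementation: hello_params_t and resolve_hello_params are hypothetical names, and the struct mirrors just the four hello-related fields.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in mirroring only the hello-related fields of
 * esp_xiaozhi_chat_audio_t; it is not the component's own type. */
typedef struct {
    const char *format;   /* NULL means "opus" */
    int sample_rate;      /* 0 means 16000, otherwise 8000-48000 (Hz) */
    int channels;         /* 0 means 1, otherwise 1-2 */
    int frame_duration;   /* 0 means 60, otherwise 10-120 (ms) */
} hello_params_t;

/* Apply the documented defaults to `in` (which may be NULL) and store the
 * result in `out`. Returns false if any non-zero field is out of range. */
static bool resolve_hello_params(const hello_params_t *in, hello_params_t *out)
{
    hello_params_t p = { NULL, 0, 0, 0 };
    if (in != NULL) {
        p = *in;
    }

    /* NULL or zero fields fall back to the documented defaults. */
    if (p.format == NULL)      { p.format = "opus"; }
    if (p.sample_rate == 0)    { p.sample_rate = 16000; }
    if (p.channels == 0)       { p.channels = 1; }
    if (p.frame_duration == 0) { p.frame_duration = 60; }

    /* Non-zero values must lie within the documented ranges. */
    if (p.sample_rate < 8000 || p.sample_rate > 48000) { return false; }
    if (p.channels < 1 || p.channels > 2)              { return false; }
    if (p.frame_duration < 10 || p.frame_duration > 120) { return false; }

    *out = p;
    return true;
}
```

Passing NULL for `in` yields the all-default parameters (“opus”, 16000, 1, 60), which matches what the reference describes for a NULL audio argument.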
-
struct esp_xiaozhi_chat_tts_state_t
TTS state payload for ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE. Pointers are valid only during the event callback.
Public Members
-
esp_xiaozhi_chat_tts_state_kind_t state
TTS state kind (start / stop / sentence_start)
-
const char *text
Non-NULL only when state is SENTENCE_START
-
struct esp_xiaozhi_chat_error_info_t
Error info for ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR (protocol layer only). Pointers are valid only during the event callback.
-
struct esp_xiaozhi_chat_text_data_t
Text data structure for chat messages.
Public Members
-
esp_xiaozhi_chat_text_role_t role
Role of the message (user or assistant)
-
const char *text
Text content of the message
-
struct esp_xiaozhi_chat_config_t
Configuration structure for initializing a Xiaozhi chat session.
Public Members
-
esp_xiaozhi_chat_audio_type_t audio_type
Type of audio input/output to use
-
esp_xiaozhi_chat_audio_callback_t audio_callback
Callback function for handling audio data
-
esp_xiaozhi_chat_event_callback_t event_callback
Callback function for handling Xiaozhi events
-
void *audio_callback_ctx
Context pointer passed to the audio callback
-
void *event_callback_ctx
Context pointer passed to the event callback
-
esp_mcp_t *mcp_engine
MCP engine instance provided by the caller
-
bool owns_mcp_engine
Whether chat takes ownership of mcp_engine and destroys it in deinit
-
bool has_mqtt_config
True if the server provides MQTT config (from get_info). When both MQTT and WebSocket are supported, MQTT is preferred
-
bool has_websocket_config
True if server provides WebSocket config (from get_info)
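The transport preference stated for has_mqtt_config and has_websocket_config can be written as a small decision function. This is a sketch of the documented rule only; transport_choice_t and choose_transport are hypothetical names that the component does not expose, and how the component selects its transport internally is not shown in this reference.

```c
#include <stdbool.h>

/* Hypothetical enum used only for this illustration. */
typedef enum {
    TRANSPORT_NONE,       /* neither config present */
    TRANSPORT_MQTT,       /* MQTT+UDP */
    TRANSPORT_WEBSOCKET,  /* WebSocket */
} transport_choice_t;

/* Prefer MQTT when both configs are available, fall back to WebSocket,
 * and report TRANSPORT_NONE when no transport configuration exists. */
static transport_choice_t choose_transport(bool has_mqtt_config,
                                           bool has_websocket_config)
{
    if (has_mqtt_config) {
        return TRANSPORT_MQTT;
    }
    if (has_websocket_config) {
        return TRANSPORT_WEBSOCKET;
    }
    return TRANSPORT_NONE;
}
```

The TRANSPORT_NONE case is the situation in which esp_xiaozhi_chat_start is documented to return ESP_ERR_NOT_FOUND, so an application may want to check both flags before starting a session.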
Macros
-
ESP_XIAOZHI_CHAT_EVENT_CONNECTED
Event bits for ESP event system (app may register for these). These are the only event bits exposed to the app; do not add internal sync flags here.
-
ESP_XIAOZHI_CHAT_EVENT_DISCONNECTED
-
ESP_XIAOZHI_CHAT_EVENT_AUDIO_CHANNEL_OPENED
-
ESP_XIAOZHI_CHAT_EVENT_AUDIO_CHANNEL_CLOSED
-
ESP_XIAOZHI_CHAT_EVENT_AUDIO_DATA_INCOMING
-
ESP_XIAOZHI_CHAT_EVENT_SERVER_GOODBYE
-
ESP_XIAOZHI_CHAT_DEFAULT_CONFIG()
Default configuration initializer for esp_xiaozhi_chat_config_t.
Type Definitions
-
typedef uint32_t esp_xiaozhi_chat_handle_t
Handle for a Xiaozhi chat session.
-
typedef void (*esp_xiaozhi_chat_audio_callback_t)(const uint8_t *data, int len, void *ctx)
Callback for receiving audio data during chat.
The data buffer is owned by the chat module and is only valid for the duration of this callback. Implementations must consume or copy the data before returning and must not store the pointer for asynchronous use.
- Param data
Pointer to the audio data buffer, valid only during this callback
- Param len
Length of the audio data in bytes
- Param ctx
User-defined context passed to the callback
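Because the data pointer is valid only during the callback, a handler typically copies each frame into storage it owns. The following is a minimal sketch of that pattern using a small fixed-size queue passed through ctx. The names frame_queue_t, on_audio, frame_queue_pop, and the slot sizes are hypothetical, and the queue is not thread-safe; real firmware would more likely hand the copy to an RTOS queue or ring buffer.

```c
#include <stdint.h>
#include <string.h>

#define FRAME_SLOT_SIZE  1024  /* hypothetical maximum bytes per frame */
#define FRAME_SLOT_COUNT 4     /* hypothetical queue depth */

/* Application-owned storage for copied frames (illustrative only). */
typedef struct {
    uint8_t data[FRAME_SLOT_COUNT][FRAME_SLOT_SIZE];
    int len[FRAME_SLOT_COUNT];
    int head;   /* index of the oldest stored frame */
    int count;  /* number of stored frames */
} frame_queue_t;

/* Has the documented esp_xiaozhi_chat_audio_callback_t signature.
 * Copies the frame before returning and never keeps `data` itself. */
static void on_audio(const uint8_t *data, int len, void *ctx)
{
    frame_queue_t *q = (frame_queue_t *)ctx;
    if (q == NULL || data == NULL || len <= 0 || len > FRAME_SLOT_SIZE) {
        return;  /* drop frames that cannot be stored */
    }
    if (q->count == FRAME_SLOT_COUNT) {
        return;  /* queue full: drop the newest frame */
    }
    int slot = (q->head + q->count) % FRAME_SLOT_COUNT;
    memcpy(q->data[slot], data, (size_t)len);
    q->len[slot] = len;
    q->count++;
}

/* Copy the oldest frame into `out`; returns its length, or 0 if the queue
 * is empty or `out_cap` is too small. */
static int frame_queue_pop(frame_queue_t *q, uint8_t *out, int out_cap)
{
    if (q->count == 0 || q->len[q->head] > out_cap) {
        return 0;
    }
    int n = q->len[q->head];
    memcpy(out, q->data[q->head], (size_t)n);
    q->head = (q->head + 1) % FRAME_SLOT_COUNT;
    q->count--;
    return n;
}
```

In an application, on_audio would be assigned to audio_callback and a pointer to the queue to audio_callback_ctx in esp_xiaozhi_chat_config_t, with a separate playback task draining the queue.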
-
typedef void (*esp_xiaozhi_chat_event_callback_t)(esp_xiaozhi_chat_event_t event, void *event_data, void *ctx)
Callback for receiving chat events.
- Param event
Chat event type
- Param event_data
Optional output data associated with the event
- Param ctx
User-defined context passed to the callback
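An event handler usually switches on the event type and casts event_data to the payload type documented for that event. The sketch below handles only the TTS state event and copies the sentence text, since payload pointers are valid only during the callback. To keep the example compilable on its own, the enum and struct declarations are reproduced from this reference; in an application they come from the component's header, and their numeric values here are assumed to follow declaration order. app_state_t and on_chat_event are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>

/* Declarations reproduced from this reference for a standalone build. */
typedef enum {
    ESP_XIAOZHI_CHAT_TTS_STATE_START,
    ESP_XIAOZHI_CHAT_TTS_STATE_STOP,
    ESP_XIAOZHI_CHAT_TTS_STATE_SENTENCE_START,
} esp_xiaozhi_chat_tts_state_kind_t;

typedef struct {
    esp_xiaozhi_chat_tts_state_kind_t state;
    const char *text;  /* non-NULL only for SENTENCE_START */
} esp_xiaozhi_chat_tts_state_t;

typedef enum {
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STARTED,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STOPPED,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_TEXT,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_EMOJI,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SYSTEM_CMD,
} esp_xiaozhi_chat_event_t;

/* Hypothetical application state reached through the ctx pointer. */
typedef struct {
    int speaking;        /* 1 while TTS playback is active */
    char sentence[128];  /* private copy of the current TTS sentence */
} app_state_t;

/* Has the documented esp_xiaozhi_chat_event_callback_t signature. */
static void on_chat_event(esp_xiaozhi_chat_event_t event, void *event_data, void *ctx)
{
    app_state_t *app = (app_state_t *)ctx;
    if (app == NULL) {
        return;
    }
    switch (event) {
    case ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE: {
        const esp_xiaozhi_chat_tts_state_t *tts =
            (const esp_xiaozhi_chat_tts_state_t *)event_data;
        if (tts == NULL) {
            break;
        }
        if (tts->state == ESP_XIAOZHI_CHAT_TTS_STATE_START) {
            app->speaking = 1;
        } else if (tts->state == ESP_XIAOZHI_CHAT_TTS_STATE_STOP) {
            app->speaking = 0;
        } else if (tts->text != NULL) {
            /* SENTENCE_START: copy the text, truncating if necessary. */
            snprintf(app->sentence, sizeof(app->sentence), "%s", tts->text);
        }
        break;
    }
    default:
        /* Other events are ignored in this sketch. */
        break;
    }
}
```

The handler would be registered through event_callback and event_callback_ctx in esp_xiaozhi_chat_config_t. Keeping the switch small and copying payloads immediately leaves the state machine and UI decisions in application code, as the reference intends.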
Enumerations
-
enum esp_xiaozhi_chat_tts_state_kind_t
TTS state kind for protocol-layer notification (app decides device state)
Values:
-
enumerator ESP_XIAOZHI_CHAT_TTS_STATE_START
TTS playback started
-
enumerator ESP_XIAOZHI_CHAT_TTS_STATE_STOP
TTS playback stopped
-
enumerator ESP_XIAOZHI_CHAT_TTS_STATE_SENTENCE_START
TTS sentence started; text is valid
-
enum esp_xiaozhi_chat_event_t
Events that can occur during a Xiaozhi chat session (minimal protocol API)
Component only reports protocol facts; app handles state machine, UI, and system commands.
Values:
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STARTED
Emitted on TTS start; prefer CHAT_TTS_STATE for new code
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STOPPED
Emitted on TTS stop; prefer CHAT_TTS_STATE for new code
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR
event_data = esp_xiaozhi_chat_error_info_t *
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_TEXT
event_data = esp_xiaozhi_chat_text_data_t * (STT/TTS sentence)
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_EMOJI
event_data = const char * (LLM emotion)
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE
event_data = esp_xiaozhi_chat_tts_state_t * (protocol TTS state)
-
enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SYSTEM_CMD
event_data = const char * (e.g. “reboot”); app decides whether to execute
-
enum esp_xiaozhi_chat_audio_type_t
Supported audio formats for Xiaozhi chat.
Values:
-
enumerator ESP_XIAOZHI_CHAT_AUDIO_TYPE_OPUS
OPUS compressed audio format
-
enum esp_xiaozhi_chat_device_state_t
Device state for Xiaozhi chat.
Values:
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_UNKNOWN
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_STARTING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_WIFI_CONFIGURING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_IDLE
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_CONNECTING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_LISTENING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_SPEAKING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_UPGRADING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_ACTIVATING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_AUDIO_TESTING
-
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_FATAL_ERROR
-
enum esp_xiaozhi_chat_listening_mode_t
Listening mode for Xiaozhi chat.
Values:
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_REALTIME
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_AUTO
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_MANUAL
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_AUTO_STOP
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_MANUAL_STOP
-
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_UNKNOWN
Header File
Functions
-
esp_err_t esp_xiaozhi_chat_get_info(esp_xiaozhi_chat_info_t *info)
Get Xiaozhi Chat Information from the HTTP server.
The function posts board information to the configured service endpoint, parses the response, updates the output structure, and persists MQTT/WebSocket settings to NVS when present in the server response.
- Parameters
info – [inout] Pointer to the information structure
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid info pointer
ESP_ERR_NO_MEM Failed to allocate working buffers or HTTP resources
ESP_ERR_INVALID_RESPONSE Server response is malformed or missing a valid body
Other Error from board info collection, HTTP client, JSON parsing, or keystore persistence
-
esp_err_t esp_xiaozhi_chat_free_info(esp_xiaozhi_chat_info_t *info)
Free Xiaozhi Chat Information.
Releases dynamically allocated string fields owned by info. This function does not zero the full structure or reset the boolean flags.
- Parameters
info – [inout] Pointer to the information structure
- Returns
ESP_OK On success
ESP_ERR_INVALID_ARG Invalid info pointer
Structures
-
struct esp_xiaozhi_chat_info_t
Information for Xiaozhi chat.
Public Members
-
char *current_version
Current version of the firmware
-
char *firmware_version
Firmware version
-
char *firmware_url
Firmware URL
-
char *serial_number
Serial number
-
char *activation_code
Activation code
-
char *activation_challenge
Activation challenge
-
char *activation_message
Activation message
-
int activation_timeout_ms
Activation timeout in milliseconds
-
bool has_serial_number
Has serial number
-
bool has_new_version
Has new version
-
bool has_activation_code
Has activation code
-
bool has_activation_challenge
Has activation challenge
-
bool has_mqtt_config
Has MQTT config
-
bool has_websocket_config
Has WebSocket config
-
bool has_server_time
Has server time