Xiaozhi AI Chatbot

Supported chips

ESP32-S3

Xiaozhi is a bidirectional streaming dialogue component that connects to the xiaozhi.me service. It supports real-time voice and text interaction with AI agents backed by large language models such as Qwen and DeepSeek.

This component is ideal for use cases such as voice assistants and intelligent voice Q&A systems. It features low latency and a lightweight design, making it suitable for applications running on embedded devices such as the ESP32.

Features

  • Bidirectional Streaming: Real-time voice and text interaction with AI agents

  • Multiple Communication Protocols: Supports WebSocket and MQTT+UDP protocols

  • Audio Codec Support: OPUS, G.711, and PCM audio formats

  • MCP Integration: Device-side MCP for device control (speaker, LED, servo, GPIO, etc.)

  • Multi-language Support: Chinese and English

  • Offline Wake Word: Provides an API for reporting a detected wake word (e.g. esp_xiaozhi_chat_send_wake_word); wake word detection itself (ESP-SR integration) is left to the application

Architecture

Xiaozhi uses a streaming ASR (Automatic Speech Recognition) + LLM (Large Language Model) + TTS (Text-to-Speech) architecture for voice interaction:

  1. Audio Input: Captures audio from microphone

  2. ASR: Converts speech to text in real-time

  3. LLM: Processes text and generates responses

  4. TTS: Converts text responses to speech

  5. Audio Output: Plays audio through speaker

The component integrates with the MCP (Model Context Protocol) to enable device control capabilities.

Examples

  1. Xiaozhi App Example: ai/xiaozhi_chat. A complete voice assistant application demonstrating:

       • Voice interaction with AI agents

       • Device control via MCP

       • Multi-language support

       • Display support

API Reference

Header File

Functions

esp_err_t esp_xiaozhi_chat_init(esp_xiaozhi_chat_config_t *config, esp_xiaozhi_chat_handle_t *chat_hd)

Create an instance of the chat module.

The current implementation supports only one chat instance at a time.

Parameters
  • config[in] Pointer to the chat configuration structure

  • chat_hd[out] Pointer to the chat handle

Returns

  • ESP_OK On success

  • ESP_ERR_NO_MEM Out of memory

  • ESP_ERR_INVALID_ARG Invalid arguments

  • ESP_ERR_INVALID_STATE Another chat instance is already active

esp_err_t esp_xiaozhi_chat_deinit(esp_xiaozhi_chat_handle_t chat_hd)

Deinitialize the chat module.

This function releases chat-owned resources. If the chat session is still running, its runtime resources are stopped first. The MCP engine is destroyed only if config.owns_mcp_engine was set to true during init.

Parameters

chat_hd[in] Handle to the chat instance

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

esp_err_t esp_xiaozhi_chat_start(esp_xiaozhi_chat_handle_t chat_hd)

Start the chat session.

Parameters

chat_hd[in] Handle to the chat instance

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

  • ESP_ERR_NOT_FOUND Required transport configuration is missing

  • Other Error from transport start (MQTT or WebSocket)

esp_err_t esp_xiaozhi_chat_stop(esp_xiaozhi_chat_handle_t chat_hd)

Stop the chat session.

Stops the active chat runtime, including audio channel and MCP manager resources, but does not destroy the configured MCP engine.

Parameters

chat_hd[in] Handle to the chat instance

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

esp_err_t esp_xiaozhi_chat_open_audio_channel(esp_xiaozhi_chat_handle_t chat_hd, const esp_xiaozhi_chat_audio_t *audio, char *message, size_t message_len)

Open audio channel.

Parameters
  • chat_hd[in] Handle to the chat instance

  • audio[in] Optional audio params for the generated hello (format, sample_rate, channels, frame_duration). Used only when message is NULL. NULL or zero fields mean defaults: “opus”, 16000, 1, 60. Non-zero values must be within valid protocol ranges.

  • message[in] Optional message to send when opening the channel. If NULL, a default hello message will be generated

  • message_len[in] Length of the message buffer. If message is NULL and message_len is 0, a default hello is generated. If message is non-NULL, message_len must be greater than 0; otherwise ESP_ERR_INVALID_ARG is returned

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle, invalid audio params, or invalid message/message_len combination

  • ESP_ERR_NO_MEM Failed to allocate hello message buffer

  • ESP_ERR_INVALID_SIZE Hello message buffer too small

  • Other Error from get_hello_message, transport_send_text, or audio_open
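The default and range rules for the audio parameter can be checked on the application side before calling esp_xiaozhi_chat_open_audio_channel(). The following is a minimal host-side sketch, not the component's implementation: hello_params_t, resolve_hello_params, and message_args_valid are hypothetical names, and the numeric ranges are the ones given in the esp_xiaozhi_chat_audio_t description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the hello-related fields of
 * esp_xiaozhi_chat_audio_t, defined locally so the sketch builds off-target. */
typedef struct {
    const char *format;   /* NULL means "opus" */
    int sample_rate;      /* 0 means 16000, otherwise 8000-48000 (Hz) */
    int channels;         /* 0 means 1, otherwise 1-2 */
    int frame_duration;   /* 0 means 60, otherwise 10-120 (ms) */
} hello_params_t;

/* Fill in the documented defaults, then range-check the result.
 * Returns false for inputs the real call is expected to reject with
 * ESP_ERR_INVALID_ARG. A NULL input means "use all defaults". */
bool resolve_hello_params(const hello_params_t *in, hello_params_t *out)
{
    assert(out != NULL);
    hello_params_t p = { NULL, 0, 0, 0 };
    if (in != NULL) {
        p = *in;
    }
    if (p.format == NULL) {
        p.format = "opus";
    }
    if (p.sample_rate == 0) {
        p.sample_rate = 16000;
    }
    if (p.channels == 0) {
        p.channels = 1;
    }
    if (p.frame_duration == 0) {
        p.frame_duration = 60;
    }
    if (p.sample_rate < 8000 || p.sample_rate > 48000) {
        return false;
    }
    if (p.channels < 1 || p.channels > 2) {
        return false;
    }
    if (p.frame_duration < 10 || p.frame_duration > 120) {
        return false;
    }
    *out = p;
    return true;
}

/* Documented message rule: a non-NULL message needs a non-zero length. */
bool message_args_valid(const char *message, size_t message_len)
{
    return message == NULL || message_len > 0;
}
```

Running such a check first lets an application report a configuration mistake with its own error message instead of decoding ESP_ERR_INVALID_ARG after the fact.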

esp_err_t esp_xiaozhi_chat_close_audio_channel(esp_xiaozhi_chat_handle_t chat_hd)

Close audio channel.

Parameters

chat_hd[in] Handle to the chat instance

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

esp_err_t esp_xiaozhi_chat_send_audio_data(esp_xiaozhi_chat_handle_t chat_hd, const char *data, size_t data_len)

Send audio data to the chat session.

Parameters
  • chat_hd[in] Handle to the chat instance

  • data[in] Pointer to the audio data buffer

  • data_len[in] Length of the audio data in bytes

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle, data, or data_len is 0

  • ESP_ERR_INVALID_STATE Transport not ready for binary (e.g. audio channel not open)

  • Other Error from transport_send_binary

esp_err_t esp_xiaozhi_chat_send_wake_word(esp_xiaozhi_chat_handle_t chat_hd, const char *wake_word)

Notify the server that a wake word was detected.

Parameters
  • chat_hd[in] Handle to the chat instance

  • wake_word[in] Pointer to the wake word (non-empty string)

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle or wake_word

  • ESP_ERR_INVALID_STATE No session (open audio channel first)

  • ESP_ERR_NO_MEM Failed to create JSON

  • Other Error from transport_send_text

esp_err_t esp_xiaozhi_chat_send_start_listening(esp_xiaozhi_chat_handle_t chat_hd, int mode)

Send start listening.

Parameters
  • chat_hd[in] Handle to the chat instance

  • mode[in] Listening mode

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

  • ESP_ERR_INVALID_STATE No session (open audio channel first)

  • ESP_ERR_NO_MEM Failed to create JSON

  • Other Error from transport_send_text

esp_err_t esp_xiaozhi_chat_send_stop_listening(esp_xiaozhi_chat_handle_t chat_hd)

Send stop listening.

Parameters

chat_hd[in] Handle to the chat instance

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

  • ESP_ERR_INVALID_STATE No session (open audio channel first)

  • ESP_ERR_NO_MEM Failed to create JSON

  • Other Error from transport_send_text

esp_err_t esp_xiaozhi_chat_send_abort_speaking(esp_xiaozhi_chat_handle_t chat_hd, esp_xiaozhi_chat_abort_speaking_reason_t reason)

Send abort speaking.

Parameters
  • chat_hd[in] Handle to the chat instance

  • reason[in] Reason for aborting speaking

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid handle

  • ESP_ERR_INVALID_STATE No session (open audio channel first)

  • ESP_ERR_NO_MEM Failed to create JSON

  • Other Error from transport_send_text

Structures

struct esp_xiaozhi_chat_audio_t

Audio packet for Xiaozhi chat; also used as audio params when passed to esp_xiaozhi_chat_open_audio_channel().

  • Packet use: set sample_rate, frame_duration, timestamp, payload, and payload_size; format and channels are ignored.

  • open_audio_channel use: set format (NULL = “opus”), sample_rate (0 = 16000, otherwise 8000-48000), channels (0 = 1, otherwise 1-2), and frame_duration (0 = 60, otherwise 10-120); payload, timestamp, and payload_size are ignored.

Public Members

const char *format

Audio format for hello, e.g. “opus”, “pcm”. NULL means “opus”. Ignored for packet use

int sample_rate

Sample rate (Hz). For hello: 0 means 16000

int channels

Channel count for hello. 0 means 1. Ignored for packet use

int frame_duration

Frame duration (ms). For hello: 0 means 60

uint32_t timestamp

Timestamp (packet use only)

uint8_t *payload

Payload (packet use only)

size_t payload_size

Payload size (packet use only)
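The sample_rate and frame_duration fields together determine how many samples one frame covers: samples per channel = sample_rate × frame_duration / 1000. With the defaults (16000 Hz, 60 ms) that is 960 samples per channel. The sketch below works this out in code; frame_samples and frame_pcm_bytes are hypothetical helper names, and the 16-bit sample width is an assumption about the application's capture buffer before encoding, not something the structure specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Samples per channel in one frame:
 * sample_rate (Hz) * frame_duration (ms) / 1000. */
size_t frame_samples(int sample_rate, int frame_duration_ms)
{
    assert(sample_rate > 0 && frame_duration_ms > 0);
    return (size_t)sample_rate * (size_t)frame_duration_ms / 1000u;
}

/* Bytes of raw PCM needed to hold one frame, assuming 16-bit samples
 * (assumption: adjust sizeof(int16_t) if the capture path differs). */
size_t frame_pcm_bytes(int sample_rate, int frame_duration_ms, int channels)
{
    assert(channels > 0);
    return frame_samples(sample_rate, frame_duration_ms)
           * (size_t)channels * sizeof(int16_t);
}
```

For the default parameters, frame_samples(16000, 60) gives 960 and frame_pcm_bytes(16000, 60, 1) gives 1920, which is a convenient size for the capture buffer that feeds the encoder.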

struct esp_xiaozhi_chat_tts_state_t

TTS state payload for ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE. Pointers are valid only during the event callback.

Public Members

esp_xiaozhi_chat_tts_state_kind_t state

TTS state kind (start / stop / sentence_start)

const char *text

Non-NULL only when state is SENTENCE_START

struct esp_xiaozhi_chat_error_info_t

Error info for ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR (protocol layer only). Pointers are valid only during the event callback.

Public Members

esp_err_t code

Error code

const char *source

Hint identifying the error source, e.g. “transport”, “hello_timeout”, “udp”

struct esp_xiaozhi_chat_text_data_t

Text data structure for chat messages.

Public Members

esp_xiaozhi_chat_text_role_t role

Role of the message (user or assistant)

const char *text

Text content of the message

struct esp_xiaozhi_chat_config_t

Configuration structure for initializing a Xiaozhi chat session.

Public Members

esp_xiaozhi_chat_audio_type_t audio_type

Type of audio input/output to use

esp_xiaozhi_chat_audio_callback_t audio_callback

Callback function for handling audio data

esp_xiaozhi_chat_event_callback_t event_callback

Callback function for handling Xiaozhi events

void *audio_callback_ctx

Context pointer passed to the audio callback

void *event_callback_ctx

Context pointer passed to the event callback

esp_mcp_t *mcp_engine

MCP engine instance provided by the caller

bool owns_mcp_engine

Whether chat takes ownership of mcp_engine and destroys it in deinit

bool has_mqtt_config

True if the server provides an MQTT config (from get_info). When both MQTT and WebSocket are supported, MQTT is preferred

bool has_websocket_config

True if server provides WebSocket config (from get_info)

Macros

ESP_XIAOZHI_CHAT_EVENT_CONNECTED

Event bits for ESP event system (app may register for these). These are the only event bits exposed to the app; do not add internal sync flags here.

ESP_XIAOZHI_CHAT_EVENT_DISCONNECTED
ESP_XIAOZHI_CHAT_EVENT_AUDIO_CHANNEL_OPENED
ESP_XIAOZHI_CHAT_EVENT_AUDIO_CHANNEL_CLOSED
ESP_XIAOZHI_CHAT_EVENT_AUDIO_DATA_INCOMING
ESP_XIAOZHI_CHAT_EVENT_SERVER_GOODBYE
ESP_XIAOZHI_CHAT_DEFAULT_CONFIG()

Default configuration initializer for esp_xiaozhi_chat_config_t.

Type Definitions

typedef uint32_t esp_xiaozhi_chat_handle_t

Handle for a Xiaozhi chat session.

typedef void (*esp_xiaozhi_chat_audio_callback_t)(const uint8_t *data, int len, void *ctx)

Callback for receiving audio data during chat.

The data buffer is owned by the chat module and is only valid for the duration of this callback. Implementations must consume or copy the data before returning and must not store the pointer for asynchronous use.

Param data

Pointer to the audio data buffer, valid only during this callback

Param len

Length of the audio data in bytes

Param ctx

User-defined context passed to the callback
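Because the buffer is only valid inside the callback, a typical implementation copies each packet into application-owned storage before returning. The sketch below shows one way to do that with a small fixed-size queue. It matches the callback signature above, but audio_queue_t, on_audio, and the slot sizes are hypothetical. It is also not thread-safe; on a device, a FreeRTOS queue or ring buffer would normally replace the plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AUDIO_SLOT_BYTES 512  /* hypothetical maximum packet size */
#define AUDIO_SLOT_COUNT 4    /* hypothetical queue depth */

/* Application-owned storage for copied packets. */
typedef struct {
    uint8_t data[AUDIO_SLOT_COUNT][AUDIO_SLOT_BYTES];
    int     len[AUDIO_SLOT_COUNT];
    size_t  head;     /* next slot to write */
    size_t  count;    /* slots currently filled */
    size_t  dropped;  /* packets that were rejected or did not fit */
} audio_queue_t;

/* Same signature as esp_xiaozhi_chat_audio_callback_t. The packet is
 * copied before returning; the data pointer itself is never stored. */
void on_audio(const uint8_t *data, int len, void *ctx)
{
    audio_queue_t *q = (audio_queue_t *)ctx;
    assert(q != NULL);
    if (data == NULL || len <= 0 || len > AUDIO_SLOT_BYTES
        || q->count == AUDIO_SLOT_COUNT) {
        q->dropped++;
        return;
    }
    memcpy(q->data[q->head], data, (size_t)len);
    q->len[q->head] = len;
    q->head = (q->head + 1) % AUDIO_SLOT_COUNT;
    q->count++;
}
```

The queue would be passed as audio_callback_ctx in the chat configuration, and a separate playback task would drain it.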

typedef void (*esp_xiaozhi_chat_event_callback_t)(esp_xiaozhi_chat_event_t event, void *event_data, void *ctx)

Callback for receiving chat events.

Param event

Chat event type

Param event_data

Optional output data associated with the event

Param ctx

User-defined context passed to the callback

Enumerations

enum esp_xiaozhi_chat_tts_state_kind_t

TTS state kind for protocol-layer notification (app decides device state)

Values:

enumerator ESP_XIAOZHI_CHAT_TTS_STATE_START

TTS playback started

enumerator ESP_XIAOZHI_CHAT_TTS_STATE_STOP

TTS playback stopped

enumerator ESP_XIAOZHI_CHAT_TTS_STATE_SENTENCE_START

TTS sentence started; text is valid

enum esp_xiaozhi_chat_event_t

Events that can occur during a Xiaozhi chat session (minimal protocol API)

Component only reports protocol facts; app handles state machine, UI, and system commands.

Values:

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STARTED

Emitted on TTS start; prefer CHAT_TTS_STATE for new code

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STOPPED

Emitted on TTS stop; prefer CHAT_TTS_STATE for new code

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR

event_data = esp_xiaozhi_chat_error_info_t *

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_TEXT

event_data = esp_xiaozhi_chat_text_data_t * (STT/TTS sentence)

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_EMOJI

event_data = const char * (LLM emotion)

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE

event_data = esp_xiaozhi_chat_tts_state_t * (protocol TTS state)

enumerator ESP_XIAOZHI_CHAT_EVENT_CHAT_SYSTEM_CMD

event_data = const char * (e.g. “reboot”); app decides whether to execute
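An event callback usually switches on the event type and casts event_data to the type listed for that enumerator. The sketch below handles two of the events and copies any strings it wants to keep, which is the cautious choice given that several payloads are documented as valid only during the callback. The type definitions at the top are host-side stand-ins that mirror the documented names so the sketch compiles off-target, and their numeric values are not authoritative. On a device they would come from the component's header. app_state_t, copy_str, and on_chat_event are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* ---- Host-side stand-ins (assumptions, values not authoritative) ---- */
typedef enum {
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STARTED,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SPEECH_STOPPED,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_ERROR,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_TEXT,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_EMOJI,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_TTS_STATE,
    ESP_XIAOZHI_CHAT_EVENT_CHAT_SYSTEM_CMD
} esp_xiaozhi_chat_event_t;

typedef enum {
    ESP_XIAOZHI_CHAT_TEXT_ROLE_USER,
    ESP_XIAOZHI_CHAT_TEXT_ROLE_ASSISTANT
} esp_xiaozhi_chat_text_role_t;

typedef struct {
    esp_xiaozhi_chat_text_role_t role;
    const char *text;
} esp_xiaozhi_chat_text_data_t;

/* ---- Hypothetical application state ---- */
typedef struct {
    esp_xiaozhi_chat_text_role_t last_role;
    char last_text[128];
    char pending_cmd[32];  /* stored only; the app decides whether to run it */
} app_state_t;

/* Copy a possibly-NULL string into a fixed buffer, truncating if needed. */
static void copy_str(char *dst, size_t cap, const char *src)
{
    snprintf(dst, cap, "%s", src != NULL ? src : "");
}

/* Same signature as esp_xiaozhi_chat_event_callback_t. */
void on_chat_event(esp_xiaozhi_chat_event_t event, void *event_data, void *ctx)
{
    app_state_t *app = (app_state_t *)ctx;
    assert(app != NULL);
    switch (event) {
    case ESP_XIAOZHI_CHAT_EVENT_CHAT_TEXT: {
        const esp_xiaozhi_chat_text_data_t *t = event_data;
        if (t != NULL) {
            app->last_role = t->role;
            copy_str(app->last_text, sizeof(app->last_text), t->text);
        }
        break;
    }
    case ESP_XIAOZHI_CHAT_EVENT_CHAT_SYSTEM_CMD:
        copy_str(app->pending_cmd, sizeof(app->pending_cmd),
                 (const char *)event_data);
        break;
    default:
        /* Other events are ignored in this sketch. */
        break;
    }
}
```

Keeping the callback limited to copying data and deferring real work to another task avoids blocking the component's own processing.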

enum esp_xiaozhi_chat_audio_type_t

Supported audio formats for Xiaozhi chat.

Values:

enumerator ESP_XIAOZHI_CHAT_AUDIO_TYPE_OPUS

OPUS compressed audio format

enum esp_xiaozhi_chat_device_state_t

Device state for Xiaozhi chat.

Values:

enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_UNKNOWN
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_STARTING
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_WIFI_CONFIGURING
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_IDLE
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_CONNECTING
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_LISTENING
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_SPEAKING
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_UPGRADING
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_ACTIVATING
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_AUDIO_TESTING
enumerator ESP_XIAOZHI_CHAT_DEVICE_STATE_FATAL_ERROR
enum esp_xiaozhi_chat_listening_mode_t

Listening mode for Xiaozhi chat.

Values:

enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_REALTIME
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_AUTO
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_MANUAL
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_AUTO_STOP
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_MANUAL_STOP
enumerator ESP_XIAOZHI_CHAT_LISTENING_MODE_UNKNOWN
enum esp_xiaozhi_chat_abort_speaking_reason_t

Reasons for aborting speaking.

Values:

enumerator ESP_XIAOZHI_CHAT_ABORT_SPEAKING_REASON_WAKE_WORD_DETECTED
enumerator ESP_XIAOZHI_CHAT_ABORT_SPEAKING_REASON_STOP_LISTENING
enumerator ESP_XIAOZHI_CHAT_ABORT_SPEAKING_REASON_UNKNOWN
enum esp_xiaozhi_chat_text_role_t

Text role enumeration for chat messages.

Values:

enumerator ESP_XIAOZHI_CHAT_TEXT_ROLE_USER

User message role

enumerator ESP_XIAOZHI_CHAT_TEXT_ROLE_ASSISTANT

Assistant message role

Header File

Functions

esp_err_t esp_xiaozhi_chat_get_info(esp_xiaozhi_chat_info_t *info)

Get Xiaozhi Chat Information from the HTTP server.

The function posts board information to the configured service endpoint, parses the response, updates the output structure, and persists MQTT/WebSocket settings to NVS when present in the server response.

Parameters

info[inout] Pointer to the information structure

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid info pointer

  • ESP_ERR_NO_MEM Failed to allocate working buffers or HTTP resources

  • ESP_ERR_INVALID_RESPONSE Server response is malformed or missing a valid body

  • Other Error from board info collection, HTTP client, JSON parsing, or keystore persistence

esp_err_t esp_xiaozhi_chat_free_info(esp_xiaozhi_chat_info_t *info)

Free Xiaozhi Chat Information.

Releases dynamically allocated string fields owned by info. This function does not zero the full structure or reset the boolean flags.

Parameters

info[inout] Pointer to the information structure

Returns

  • ESP_OK On success

  • ESP_ERR_INVALID_ARG Invalid info pointer

Structures

struct esp_xiaozhi_chat_info_t

Information for Xiaozhi chat.

Public Members

char *current_version

Current version of the firmware

char *firmware_version

Firmware version

char *firmware_url

Firmware URL

char *serial_number

Serial number

char *activation_code

Activation code

char *activation_challenge

Activation challenge

char *activation_message

Activation message

int activation_timeout_ms

Activation timeout in milliseconds

bool has_serial_number

Has serial number

bool has_new_version

Has new version

bool has_activation_code

Has activation code

bool has_activation_challenge

Has activation challenge

bool has_mqtt_config

Has MQTT config

bool has_websocket_config

Has WebSocket config

bool has_server_time

Has server time