Quick start
Pick a provider
OpenAI and ElevenLabs are the most reliable hosted options. Microsoft and
Local CLI work without an API key. See the provider matrix
for the full list.
Set the API key
Export the env var for your provider (for example
OPENAI_API_KEY or
ELEVENLABS_API_KEY). Microsoft and Local CLI need no key.
Auto-TTS is off by default. When
messages.tts.provider is unset,
OpenClaw picks the first configured provider in registry auto-select order.
Supported providers
| Provider | Auth | Notes |
|---|---|---|
| Azure Speech | AZURE_SPEECH_KEY + AZURE_SPEECH_REGION (also AZURE_SPEECH_API_KEY, SPEECH_KEY, SPEECH_REGION) | Native Ogg/Opus voice-note output and telephony. |
| DeepInfra | DEEPINFRA_API_KEY | OpenAI-compatible TTS. Defaults to hexgrad/Kokoro-82M. |
| ElevenLabs | ELEVENLABS_API_KEY or XI_API_KEY | Voice cloning, multilingual, deterministic via seed. |
| Google Gemini | GEMINI_API_KEY or GOOGLE_API_KEY | Gemini API TTS; persona-aware via promptTemplate: "audio-profile-v1". |
| Gradium | GRADIUM_API_KEY | Voice-note and telephony output. |
| Inworld | INWORLD_API_KEY | Streaming TTS API. Native Opus voice-note and PCM telephony. |
| Local CLI | none | Runs a configured local TTS command. |
| Microsoft | none | Public Edge neural TTS via node-edge-tts. Best-effort, no SLA. |
| MiniMax | MINIMAX_API_KEY (or Token Plan: MINIMAX_OAUTH_TOKEN, MINIMAX_CODE_PLAN_KEY, MINIMAX_CODING_API_KEY) | T2A v2 API. Defaults to speech-2.8-hd. |
| OpenAI | OPENAI_API_KEY | Also used for auto-summary; supports persona instructions. |
| OpenRouter | OPENROUTER_API_KEY (can reuse models.providers.openrouter.apiKey) | Default model hexgrad/kokoro-82m. |
| Volcengine | VOLCENGINE_TTS_API_KEY or BYTEPLUS_SEED_SPEECH_API_KEY (legacy AppID/token: VOLCENGINE_TTS_APPID/_TOKEN) | BytePlus Seed Speech HTTP API. |
| Vydra | VYDRA_API_KEY | Shared image, video, and speech provider. |
| xAI | XAI_API_KEY | xAI batch TTS. Native Opus voice-note is not supported. |
| Xiaomi MiMo | XIAOMI_API_KEY | MiMo TTS through Xiaomi chat completions. |
Auto-summary uses summaryModel (or
agents.defaults.model.primary), so that provider must also be authenticated
if you keep summaries enabled.
Configuration
TTS config lives under messages.tts in ~/.openclaw/openclaw.json. Pick a
preset and adapt the provider block:
- Azure Speech
- ElevenLabs
- Google Gemini
- Gradium
- Inworld
- Local CLI
- Microsoft (no key)
- MiniMax
- OpenAI + ElevenLabs
- OpenRouter
- Volcengine
- xAI
- Xiaomi MiMo
Per-agent voice overrides
Use agents.list[].tts when one agent should speak with a different provider,
voice, model, persona, or auto-TTS mode. The agent block deep-merges over
messages.tts, so provider credentials can stay in the global provider config:
To pin a persona for one agent, set agents.list[].tts.persona alongside provider
config; it overrides the global messages.tts.persona for that agent only.
Precedence order for automatic replies, /tts audio, /tts status, and the
tts agent tool:
- messages.tts
- active agents.list[].tts
- channel override, when the channel supports channels.<channel>.tts
- account override, when the channel passes channels.<channel>.accounts.<id>.tts
- local /tts preferences for this host
- inline [[tts:...]] directives when model overrides are enabled
Channel and account overrides take the same fields as messages.tts and
deep-merge over the earlier layers, so shared provider credentials can stay in
messages.tts while a channel or bot account changes only voice, model, persona,
or auto mode.
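The layering described above can be sketched as a plain deep merge. This is a minimal illustration, not OpenClaw's implementation: the helper names (deepMerge, resolveTtsConfig) and the apiKey and voice fields inside the provider block are assumptions, and arrays are assumed to be replaced rather than merged.

```typescript
// Hypothetical sketch: later layers win, nested objects merge key by key.
type Layer = { [key: string]: unknown };

function isPlainObject(value: unknown): value is Layer {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function deepMerge(base: Layer, override: Layer): Layer {
  const out: Layer = { ...base };
  for (const [key, value] of Object.entries(override)) {
    if (value === undefined) continue; // an unset key never erases an earlier layer
    const existing = out[key];
    out[key] =
      isPlainObject(existing) && isPlainObject(value)
        ? deepMerge(existing, value)
        : value; // scalars and arrays are replaced wholesale (assumption)
  }
  return out;
}

// Layers are passed from lowest to highest precedence; missing layers are skipped.
function resolveTtsConfig(layers: Array<Layer | undefined>): Layer {
  let result: Layer = {};
  for (const layer of layers) {
    if (layer !== undefined) result = deepMerge(result, layer);
  }
  return result;
}

const effective = resolveTtsConfig([
  // messages.tts (field names inside the provider block are illustrative)
  { provider: "openai", providers: { openai: { apiKey: "sk-placeholder", voice: "alloy" } } },
  // agents.list[].tts changes only the voice
  { providers: { openai: { voice: "cedar" } } },
  // no channel override for this reply
  undefined,
]);
console.log(JSON.stringify(effective));
```

In this sketch the agent layer changes only the voice, while the credential from the global layer survives the merge.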
Personas
A persona is a stable spoken identity that can be applied deterministically across providers. It can prefer one provider, define provider-neutral prompt intent, and carry provider-specific bindings for voices, models, prompt templates, seeds, and voice settings.
Minimal persona
Full persona (provider-neutral prompt)
Persona resolution
The active persona is selected deterministically:
- /tts persona <id> local preference, if set.
- messages.tts.persona, if set.
- No persona.
The provider is then selected in this order:
- Direct overrides (CLI, gateway, Talk, allowed TTS directives).
- /tts provider <id> local preference.
- Active persona’s provider.
- messages.tts.provider.
- Registry auto-select.
Provider settings for each attempt are layered in this order:
- messages.tts.providers.<id>
- messages.tts.personas.<persona>.providers.<id>
- Trusted request overrides
- Allowed model-emitted TTS directive overrides
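The persona and provider selection order can be read as "first value that is set wins". The sketch below illustrates that reading only; the interface, function names, and the isConfigured callback are hypothetical and do not reflect OpenClaw's internal API.

```typescript
// Hypothetical inputs, named after the sources listed in the text.
interface ProviderInputs {
  directOverride?: string;     // CLI, gateway, Talk, or an allowed TTS directive
  localPreference?: string;    // written by /tts provider <id>
  personaProvider?: string;    // provider preferred by the active persona
  configuredProvider?: string; // messages.tts.provider
  autoSelectOrder: string[];   // registry auto-select order
  isConfigured: (id: string) => boolean;
}

// Persona: local preference first, then messages.tts.persona, else none.
function resolvePersona(
  localPreference: string | undefined,
  configuredPersona: string | undefined,
): string | undefined {
  const id = localPreference ?? configuredPersona;
  return id === undefined ? undefined : id.toLowerCase(); // ids are normalized to lowercase
}

// Provider: the first explicit choice wins; otherwise the first configured
// provider in auto-select order.
function resolveProvider(inputs: ProviderInputs): string | undefined {
  const explicit = [
    inputs.directOverride,
    inputs.localPreference,
    inputs.personaProvider,
    inputs.configuredProvider,
  ].find((id) => id !== undefined && id !== "");
  if (explicit !== undefined) return explicit;
  return inputs.autoSelectOrder.find((id) => inputs.isConfigured(id));
}
```

For example, a persona that prefers ElevenLabs takes priority over messages.tts.provider, but a /tts provider preference on the host takes priority over both.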
How providers use persona prompts
Persona prompt fields (profile, scene, sampleContext, style, accent,
pacing, constraints) are provider-neutral. Each provider decides how
to use them:
Google Gemini
Wraps persona prompt fields in a Gemini TTS prompt structure only when
the effective Google provider config sets
promptTemplate: "audio-profile-v1"
or personaPrompt. The older audioProfile and speakerName fields are
still prepended as Google-specific prompt text. Inline audio tags such as
[whispers] or [laughs] inside a [[tts:text]] block are preserved
inside the Gemini transcript; OpenClaw does not generate these tags.
OpenAI
Maps persona prompt fields to the request
instructions field only when
no explicit OpenAI instructions is configured. Explicit instructions
always wins.
Other providers
Use only the provider-specific persona bindings under
personas.<id>.providers.<provider>. Persona prompt fields are ignored
unless the provider implements its own persona-prompt mapping.
Fallback policy
fallbackPolicy controls behavior when a persona has no binding for the
attempted provider:
| Policy | Behavior |
|---|---|
| preserve-persona | Default. Provider-neutral prompt fields stay available; the provider may use them or ignore them. |
| provider-defaults | Persona is omitted from prompt preparation for that attempt; the provider uses its neutral defaults while fallback to other providers continues. |
| fail | Skip that provider attempt with reasonCode: "not_configured" and personaBinding: "missing". Fallback providers are still tried. |
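As a rough illustration of the table above, the sketch below decides what a single provider attempt receives when the persona has no binding for that provider. The Persona and AttemptPlan shapes and the planAttempt name are assumptions made for this example; only the policy names, reasonCode, and personaBinding values come from the text.

```typescript
type FallbackPolicy = "preserve-persona" | "provider-defaults" | "fail";

// Hypothetical persona shape for this example.
interface Persona {
  prompt?: Record<string, string>;                     // provider-neutral prompt fields
  providers?: Record<string, Record<string, unknown>>; // provider-specific bindings
  fallbackPolicy?: FallbackPolicy;
}

type AttemptPlan =
  | {
      action: "attempt";
      binding: Record<string, unknown> | undefined;
      prompt: Record<string, string> | undefined;
    }
  | { action: "skip"; reasonCode: "not_configured"; personaBinding: "missing" };

function planAttempt(persona: Persona, providerId: string): AttemptPlan {
  const binding = persona.providers?.[providerId];
  if (binding !== undefined) {
    // A binding exists, so the policy does not apply.
    return { action: "attempt", binding, prompt: persona.prompt };
  }
  const policy: FallbackPolicy = persona.fallbackPolicy ?? "preserve-persona";
  switch (policy) {
    case "preserve-persona":
      // Prompt fields stay available; the provider may use or ignore them.
      return { action: "attempt", binding: undefined, prompt: persona.prompt };
    case "provider-defaults":
      // Persona is left out of prompt preparation for this attempt.
      return { action: "attempt", binding: undefined, prompt: undefined };
    case "fail":
      // This attempt is skipped; the caller still tries fallback providers.
      return { action: "skip", reasonCode: "not_configured", personaBinding: "missing" };
  }
}
```

A caller would loop over the provider fallback chain and move to the next provider whenever the plan is a skip.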
Model-driven directives
By default, the assistant can emit [[tts:...]] directives to override
voice, model, or speed for a single reply, plus an optional
[[tts:text]]...[[/tts:text]] block for expressive cues that should appear in
audio only:
When messages.tts.auto is "tagged", directives are required to trigger
audio. Streaming block delivery strips directives from visible text before the
channel sees them, even when split across adjacent blocks.
provider=... is ignored unless modelOverrides.allowProvider: true. When a
reply declares provider=..., the other keys in that directive are parsed
only by that provider; unsupported keys are stripped and reported as TTS
directive warnings.
Available directive keys:
- provider (registered provider id; requires allowProvider: true)
- voice / voiceName / voice_name / google_voice / voiceId
- model / google_model
- stability, similarityBoost, style, speed, useSpeakerBoost
- vol / volume (MiniMax volume, 0–10)
- pitch (MiniMax integer pitch, −12 to 12; fractional values are truncated)
- emotion (Volcengine emotion tag)
- applyTextNormalization (auto|on|off)
- languageCode (ISO 639-1)
- seed
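To make the directive handling concrete, here is a small parsing sketch. It assumes a directive body is a whitespace-separated list of key=value pairs, which the text implies (provider=...) but does not spell out; the function name and return shape are hypothetical. It does not model per-provider key validation or the warnings mentioned above.

```typescript
interface ParsedDirectives {
  visibleText: string;               // reply text with directives stripped
  overrides: Record<string, string>; // key=value pairs from [[tts:...]]
  spokenText: string | undefined;    // content of [[tts:text]]...[[/tts:text]]
}

function parseTtsDirectives(reply: string, allowProvider = false): ParsedDirectives {
  const overrides: Record<string, string> = {};
  const spoken: string[] = [];

  // Pull out the optional audio-only block first.
  let text = reply.replace(
    /\[\[tts:text\]\]([\s\S]*?)\[\[\/tts:text\]\]/g,
    (_match, body: string) => {
      spoken.push(body.trim());
      return "";
    },
  );

  // Then collect key=value pairs from the remaining [[tts:...]] directives.
  text = text.replace(/\[\[tts:([^\]]*)\]\]/g, (_match, body: string) => {
    for (const pair of body.trim().split(/\s+/)) {
      const eq = pair.indexOf("=");
      if (eq <= 0) continue; // not a key=value pair
      const key = pair.slice(0, eq);
      if (key === "provider" && !allowProvider) continue; // ignored unless allowed
      overrides[key] = pair.slice(eq + 1);
    }
    return "";
  });

  return {
    visibleText: text.replace(/[ \t]{2,}/g, " ").trim(),
    overrides,
    spokenText: spoken.length > 0 ? spoken.join(" ") : undefined,
  };
}
```

Calling parseTtsDirectives(reply, true) corresponds to modelOverrides.allowProvider: true in this sketch.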
Slash commands
Single command /tts. On Discord, OpenClaw also registers /voice because
/tts is a built-in Discord command; text /tts ... still works.
Commands require an authorized sender (allowlist/owner rules apply) and either
commands.text or native command registration must be enabled.
- /tts on writes the local TTS preference to always; /tts off writes it to off.
- /tts chat on|off|default writes a session-scoped auto-TTS override for the current chat.
- /tts persona <id> writes the local persona preference; /tts persona off clears it.
- /tts latest reads the latest assistant reply from the current session transcript and sends it as audio once. It stores only a hash of that reply on the session entry to suppress duplicate voice sends.
- /tts audio generates a one-off audio reply (does not toggle TTS on).
- limit and summary are stored in local prefs, not the main config.
- /tts status includes fallback diagnostics for the latest attempt: Fallback: <primary> -> <used>, Attempts: ..., and per-attempt detail (provider:outcome(reasonCode) latency).
- /status shows the active TTS mode plus configured provider, model, voice, and sanitized custom endpoint metadata when TTS is enabled.
Per-user preferences
Slash commands write local overrides to prefsPath. The default is
~/.openclaw/settings/tts.json; override with the OPENCLAW_TTS_PREFS env var
or messages.tts.prefsPath.
| Stored field | Effect |
|---|---|
| auto | Local auto-TTS override (always, off, …) |
| provider | Local primary provider override |
| persona | Local persona override |
| maxLength | Summary threshold (default 1500 chars) |
| summarize | Summary toggle (default true) |
These stored values override the effective config from messages.tts plus the active
agents.list[].tts block for that host.
Output formats (fixed)
TTS voice delivery is channel-capability driven. Channel plugins advertise whether voice-style TTS should ask providers for a native voice-note target or
keep normal audio-file synthesis and only mark compatible output for voice
delivery.
- Voice-note capable channels: voice-note replies prefer Opus (opus_48000_64 from ElevenLabs, opus from OpenAI).
  - 48kHz / 64kbps is a good voice message tradeoff.
- Feishu / WhatsApp: when a voice-note reply is produced as MP3/WebM/WAV/M4A or another likely audio file, the channel plugin transcodes it to 48kHz Ogg/Opus with ffmpeg before sending the native voice message. WhatsApp sends the result through the Baileys audio payload with ptt: true and audio/ogg; codecs=opus. If conversion fails, Feishu receives the original file as an attachment; WhatsApp send fails rather than posting an incompatible PTT payload.
- BlueBubbles: keeps provider synthesis on the normal audio-file path; MP3 and CAF outputs are marked for iMessage voice memo delivery.
- Other channels: MP3 (mp3_44100_128 from ElevenLabs, mp3 from OpenAI).
  - 44.1kHz / 128kbps is the default balance for speech clarity.
- MiniMax: MP3 (speech-2.8-hd model, 32kHz sample rate) for normal audio attachments. For channel-advertised voice-note targets, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with ffmpeg before delivery when the channel advertises transcoding.
- Xiaomi MiMo: MP3 by default, or WAV when configured. For channel-advertised voice-note targets, OpenClaw transcodes Xiaomi output to 48kHz Opus with ffmpeg before delivery when the channel advertises transcoding.
- Local CLI: uses the configured outputFormat. Voice-note targets are converted to Ogg/Opus and telephony output is converted to raw 16 kHz mono PCM with ffmpeg.
- Google Gemini: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments, transcodes it to 48kHz Opus for voice-note targets, and returns PCM directly for Talk/telephony.
- Gradium: WAV for audio attachments, Opus for voice-note targets, and ulaw_8000 at 8 kHz for telephony.
- Inworld: MP3 for normal audio attachments, native OGG_OPUS for voice-note targets, and raw PCM at 22050 Hz for Talk/telephony.
- xAI: MP3 by default; responseFormat may be mp3, wav, pcm, mulaw, or alaw. OpenClaw uses xAI’s batch REST TTS endpoint and returns a complete audio attachment; xAI’s streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
- Microsoft: uses microsoft.outputFormat (default audio-24khz-48kbitrate-mono-mp3).
  - The bundled transport accepts an outputFormat, but not all formats are available from the service.
  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
  - Telegram sendVoice accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages.
  - If the configured Microsoft output format fails, OpenClaw retries with MP3.
Auto-TTS behavior
When messages.tts.auto is enabled, OpenClaw:
- Skips TTS if the reply already contains media or a MEDIA: directive.
- Skips very short replies (under 10 chars).
- Summarizes long replies when summaries are enabled, using summaryModel (or agents.defaults.model.primary).
- Attaches the generated audio to the reply.
- In mode: "final", still sends audio-only TTS for streamed final replies after the text stream completes; the generated media goes through the same channel media normalization as normal reply attachments.
If the reply exceeds maxLength and summary is off (or no API key for the
summary model), audio is skipped and the normal text reply is sent.
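The rules above amount to a small decision function. The sketch below is one way to express them; the type and field names (including summaryModelAuthed) are hypothetical, and whether the 10-character check counts whitespace is not stated in the text, so the raw length is used here.

```typescript
type AutoTtsDecision = "skip" | "speak" | "summarize-then-speak";

interface AutoTtsInput {
  text: string;
  hasMedia: boolean;           // reply already carries media or a MEDIA: directive
  maxLength: number;           // summary threshold (1500 chars by default)
  summarize: boolean;          // summary toggle
  summaryModelAuthed: boolean; // hypothetical flag: the summary model has an API key
}

const MIN_REPLY_CHARS = 10;

function decideAutoTts(input: AutoTtsInput): AutoTtsDecision {
  if (input.hasMedia) return "skip";
  if (input.text.length < MIN_REPLY_CHARS) return "skip";
  if (input.text.length <= input.maxLength) return "speak";
  // Over the threshold: speak a summary if possible, otherwise send text only.
  return input.summarize && input.summaryModelAuthed ? "summarize-then-speak" : "skip";
}
```

A "skip" result means the normal text reply is sent without audio.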
Output formats by channel
| Target | Format |
|---|---|
| Feishu / Matrix / Telegram / WhatsApp | Voice-note replies prefer Opus (opus_48000_64 from ElevenLabs, opus from OpenAI). 48 kHz / 64 kbps balances clarity and size. |
| Other channels | MP3 (mp3_44100_128 from ElevenLabs, mp3 from OpenAI). 44.1 kHz / 128 kbps default for speech. |
| Talk / telephony | Provider-native PCM (Inworld 22050 Hz, Google 24 kHz), or ulaw_8000 from Gradium for telephony. |
- Feishu / WhatsApp transcoding: When a voice-note reply lands as MP3/WebM/WAV/M4A, the channel plugin transcodes to 48 kHz Ogg/Opus with ffmpeg. WhatsApp sends through Baileys with ptt: true and audio/ogg; codecs=opus. If conversion fails: Feishu falls back to attaching the original file; WhatsApp send fails rather than posting an incompatible PTT payload.
- MiniMax / Xiaomi MiMo: Default MP3 (32 kHz for MiniMax speech-2.8-hd); transcoded to 48 kHz Opus for voice-note targets via ffmpeg.
- Local CLI: Uses configured outputFormat. Voice-note targets are converted to Ogg/Opus and telephony output to raw 16 kHz mono PCM.
- Google Gemini: Returns raw 24 kHz PCM. OpenClaw wraps as WAV for attachments, transcodes to 48 kHz Opus for voice-note targets, returns PCM directly for Talk/telephony.
- Inworld: MP3 attachments, native OGG_OPUS voice-note, raw PCM 22050 Hz for Talk/telephony.
- xAI: MP3 by default; responseFormat may be mp3|wav|pcm|mulaw|alaw. Uses xAI’s batch REST endpoint; streaming WebSocket TTS is not used. Native Opus voice-note format is not supported.
- Microsoft: Uses microsoft.outputFormat (default audio-24khz-48kbitrate-mono-mp3). Telegram sendVoice accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages. If the configured Microsoft format fails, OpenClaw retries with MP3.
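For the two providers whose format ids are listed in the table, the mapping can be written as a lookup. This sketch covers only ElevenLabs and OpenAI and only the voice-note and audio-file targets; the Target type and pickFormat name are made up for the example, and Talk/telephony is left out.

```typescript
type Target = "voice-note" | "audio-file";
type Provider = "elevenlabs" | "openai";

// Format ids taken from the table above.
const FORMATS: Record<Provider, Record<Target, string>> = {
  elevenlabs: { "voice-note": "opus_48000_64", "audio-file": "mp3_44100_128" },
  openai: { "voice-note": "opus", "audio-file": "mp3" },
};

function pickFormat(provider: Provider, target: Target): string {
  return FORMATS[provider][target];
}

console.log(pickFormat("elevenlabs", "voice-note"));
console.log(pickFormat("openai", "audio-file"));
```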
Field reference
Top-level messages.tts.*
- auto: Auto-TTS mode. inbound only sends audio after an inbound voice message; tagged only sends audio when the reply includes [[tts:...]] directives or a [[tts:text]] block.
- Legacy toggle. openclaw doctor --fix migrates this to auto.
- mode: "all" includes tool/block replies in addition to final replies.
- provider: Speech provider id. When unset, OpenClaw uses the first configured provider in registry auto-select order. Legacy provider: "edge" is rewritten to "microsoft" by openclaw doctor --fix.
- persona: Active persona id from personas. Normalized to lowercase.
- personas.<id>: Stable spoken identity. Fields: label, description, provider, fallbackPolicy, prompt, providers.<provider>. See Personas.
- summaryModel: Cheap model for auto-summary; defaults to agents.defaults.model.primary. Accepts provider/model or a configured model alias.
- modelOverrides: Allow the model to emit TTS directives. enabled defaults to true; allowProvider defaults to false.
- providers.<id>: Provider-owned settings keyed by speech provider id. Legacy direct blocks (messages.tts.openai, .elevenlabs, .microsoft, .edge) are rewritten by openclaw doctor --fix; commit only messages.tts.providers.<id>.
- Hard cap for TTS input characters. /tts audio fails if exceeded.
- Request timeout in milliseconds.
- prefsPath: Override the local prefs JSON path (provider/limit/summary). Default ~/.openclaw/settings/tts.json.
Azure Speech
- Env: AZURE_SPEECH_KEY, AZURE_SPEECH_API_KEY, or SPEECH_KEY.
- Azure Speech region (e.g. eastus). Env: AZURE_SPEECH_REGION or SPEECH_REGION.
- Optional Azure Speech endpoint override (alias baseUrl).
- Azure voice ShortName. Default en-US-JennyNeural.
- SSML language code. Default en-US.
- Azure X-Microsoft-OutputFormat for standard audio. Default audio-24khz-48kbitrate-mono-mp3.
- Azure X-Microsoft-OutputFormat for voice-note output. Default ogg-24khz-16bit-mono-opus.
ElevenLabs
- API key. Falls back to ELEVENLABS_API_KEY or XI_API_KEY.
- Model id (e.g. eleven_multilingual_v2, eleven_v3).
- ElevenLabs voice id.
- Voice settings: stability, similarityBoost, style (each 0..1), useSpeakerBoost (true|false), speed (0.5..2.0, 1.0 = normal).
- Text normalization mode.
- Language code: 2-letter ISO 639-1 (e.g. en, de).
- Seed: integer 0..4294967295 for best-effort determinism.
- Override ElevenLabs API base URL.
Google Gemini
- Falls back to GEMINI_API_KEY / GOOGLE_API_KEY. If omitted, TTS can reuse models.providers.google.apiKey before env fallback.
- Gemini TTS model. Default gemini-3.1-flash-tts-preview.
- Gemini prebuilt voice name. Default Kore. Alias: voice.
- Natural-language style prompt prepended before spoken text.
- Optional speaker label prepended before spoken text when your prompt uses a named speaker.
- Set to audio-profile-v1 to wrap active persona prompt fields in a deterministic Gemini TTS prompt structure.
- Google-specific extra persona prompt text appended to the template’s Director’s Notes.
- Only https://generativelanguage.googleapis.com is accepted.
Gradium
Inworld
Local CLI (tts-local-cli)
- Local executable or command string for CLI TTS.
- Command arguments. Supports {{Text}}, {{OutputPath}}, {{OutputDir}}, {{OutputBase}} placeholders.
- Expected CLI output format. Default mp3 for audio attachments.
- Command timeout in milliseconds. Default 120000.
- Optional command working directory.
- Optional environment overrides for the command.
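The placeholder substitution for command arguments might look like the sketch below. The text names the four placeholders but does not define them, so the meanings used here are assumptions: OutputDir is taken to be the directory part of the output path and OutputBase the file name without its extension. The expandArgs helper is hypothetical.

```typescript
// Hypothetical helper: expand {{...}} placeholders in each argument.
function expandArgs(args: string[], text: string, outputPath: string): string[] {
  const slash = outputPath.lastIndexOf("/");
  const dir = slash >= 0 ? outputPath.slice(0, slash) : ".";
  const file = slash >= 0 ? outputPath.slice(slash + 1) : outputPath;
  const dot = file.lastIndexOf(".");
  const base = dot > 0 ? file.slice(0, dot) : file; // assumed meaning of OutputBase

  const values = new Map<string, string>([
    ["Text", text],
    ["OutputPath", outputPath],
    ["OutputDir", dir],
    ["OutputBase", base],
  ]);

  // Unknown placeholders are left untouched rather than replaced with "".
  return args.map((arg) =>
    arg.replace(/\{\{(\w+)\}\}/g, (match, name: string) => values.get(name) ?? match),
  );
}

console.log(expandArgs(["--text", "{{Text}}", "--out", "{{OutputPath}}"], "Hello", "/tmp/tts/reply.mp3"));
```

Keeping arguments as an array and substituting per element means the spoken text is passed as a single argument with no shell quoting involved.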
Microsoft (no API key)
- Allow Microsoft speech usage.
- Microsoft neural voice name (e.g. en-US-MichelleNeural).
- Language code (e.g. en-US).
- Microsoft output format. Default audio-24khz-48kbitrate-mono-mp3. Not all formats are supported by the bundled Edge-backed transport.
- Percent strings (e.g. +10%, -5%).
- Write JSON subtitles alongside the audio file.
- Proxy URL for Microsoft speech requests.
- Request timeout override (ms).
- Legacy alias. Run openclaw doctor --fix to rewrite persisted config to providers.microsoft.
MiniMax
- API key. Falls back to MINIMAX_API_KEY. Token Plan auth via MINIMAX_OAUTH_TOKEN, MINIMAX_CODE_PLAN_KEY, or MINIMAX_CODING_API_KEY.
- API host. Default https://api.minimax.io. Env: MINIMAX_API_HOST.
- Model. Default speech-2.8-hd. Env: MINIMAX_TTS_MODEL.
- Voice id. Default English_expressive_narrator. Env: MINIMAX_TTS_VOICE_ID.
- Speed. 0.5..2.0. Default 1.0.
- Volume. (0, 10]. Default 1.0.
- Pitch. Integer -12..12. Default 0. Fractional values are truncated before the request.
OpenAI
- Falls back to OPENAI_API_KEY.
- OpenAI TTS model id (e.g. gpt-4o-mini-tts).
- Voice name (e.g. alloy, cedar).
- Explicit OpenAI instructions field. When set, persona prompt fields are not auto-mapped.
- Override the OpenAI TTS endpoint. Resolution order: config → OPENAI_TTS_BASE_URL → https://api.openai.com/v1. Non-default values are treated as OpenAI-compatible TTS endpoints, so custom model and voice names are accepted.
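The endpoint resolution order for OpenAI (config, then OPENAI_TTS_BASE_URL, then the default) can be sketched as follows. The function name and return shape are hypothetical, the environment is passed in as a plain object to keep the example self-contained, and the trailing-slash trimming is an assumption rather than documented behavior.

```typescript
const DEFAULT_OPENAI_TTS_BASE_URL = "https://api.openai.com/v1";

function resolveOpenAiTtsBaseUrl(
  configBaseUrl: string | undefined,
  env: Record<string, string | undefined>,
): { baseUrl: string; isCustomEndpoint: boolean } {
  // First non-empty candidate wins: config, then env, then the default.
  const raw =
    [configBaseUrl, env["OPENAI_TTS_BASE_URL"]].find(
      (value) => value !== undefined && value.trim() !== "",
    ) ?? DEFAULT_OPENAI_TTS_BASE_URL;
  const baseUrl = raw.trim().replace(/\/+$/, ""); // assumption: ignore trailing slashes
  // Non-default endpoints are treated as OpenAI-compatible, so custom
  // model and voice names would be accepted for them.
  return { baseUrl, isCustomEndpoint: baseUrl !== DEFAULT_OPENAI_TTS_BASE_URL };
}
```

In a Node process the second argument would typically be process.env.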
OpenRouter
- API key. Env: OPENROUTER_API_KEY. Can reuse models.providers.openrouter.apiKey.
- Base URL. Default https://openrouter.ai/api/v1. Legacy https://openrouter.ai/v1 is normalized.
- Model. Default hexgrad/kokoro-82m. Alias: modelId.
- Voice. Default af_alloy. Alias: voiceId.
- Audio format. Default mp3.
- Provider-native speed override.
Volcengine (BytePlus Seed Speech)
- API key. Env: VOLCENGINE_TTS_API_KEY or BYTEPLUS_SEED_SPEECH_API_KEY.
- Resource id. Default seed-tts-1.0. Env: VOLCENGINE_TTS_RESOURCE_ID. Use seed-tts-2.0 when your project has TTS 2.0 entitlement.
- App key header. Default aGjiRDfUWi. Env: VOLCENGINE_TTS_APP_KEY.
- Override the Seed Speech TTS HTTP endpoint. Env: VOLCENGINE_TTS_BASE_URL.
- Voice type. Default en_female_anna_mars_bigtts. Env: VOLCENGINE_TTS_VOICE.
- Provider-native speed ratio.
- Provider-native emotion tag.
- Legacy Volcengine Speech Console fields. Env: VOLCENGINE_TTS_APPID, VOLCENGINE_TTS_TOKEN, VOLCENGINE_TTS_CLUSTER (default volcano_tts).
xAI
- API key. Env: XAI_API_KEY.
- Base URL. Default https://api.x.ai/v1. Env: XAI_BASE_URL.
- Voice. Default eve. Live voices: ara, eve, leo, rex, sal, una.
- Language. BCP-47 language code or auto. Default en.
- Audio format. Default mp3.
- Provider-native speed override.
Xiaomi MiMo
- API key. Env: XIAOMI_API_KEY.
- Base URL. Default https://api.xiaomimimo.com/v1. Env: XIAOMI_BASE_URL.
- Model. Default mimo-v2.5-tts. Env: XIAOMI_TTS_MODEL. Also supports mimo-v2-tts.
- Voice. Default mimo_default. Env: XIAOMI_TTS_VOICE.
- Audio format. Default mp3. Env: XIAOMI_TTS_FORMAT.
- Optional natural-language style instruction sent as the user message; not spoken.
Agent tool
The tts tool converts text to speech and returns an audio attachment for
reply delivery. On Feishu, Matrix, Telegram, and WhatsApp, the audio is
delivered as a voice message rather than a file attachment. Feishu and
WhatsApp can transcode non-Opus TTS output on this path when ffmpeg is
available.
WhatsApp sends audio through Baileys as a PTT voice note (audio with
ptt: true) and sends visible text separately from PTT audio because
clients do not consistently render captions on voice notes.
The tool accepts optional channel and timeoutMs fields; timeoutMs is a
per-call provider request timeout in milliseconds.
Gateway RPC
| Method | Purpose |
|---|---|
| tts.status | Read current TTS state and last attempt. |
| tts.enable | Set local auto preference to always. |
| tts.disable | Set local auto preference to off. |
| tts.convert | One-off text → audio. |
| tts.setProvider | Set local provider preference. |
| tts.setPersona | Set local persona preference. |
| tts.providers | List configured providers and status. |
Service links
- OpenAI text-to-speech guide
- OpenAI Audio API reference
- Azure Speech REST text-to-speech
- Azure Speech provider
- ElevenLabs Text to Speech
- ElevenLabs Authentication
- Gradium
- Inworld TTS API
- MiniMax T2A v2 API
- Volcengine TTS HTTP API
- Xiaomi MiMo speech synthesis
- node-edge-tts
- Microsoft Speech output formats
- xAI text to speech