Nodes and media

Media understanding

OpenClaw can summarize inbound media (image/audio/video) before the reply pipeline runs. It auto-detects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.

Vendor-specific media behavior is registered by vendor plugins, while OpenClaw core owns the shared tools.media config, fallback order, and reply-pipeline integration.

Goals

Optional: pre-digest inbound media into short text for faster routing + better command parsing.
Preserve original media delivery to the model (always).
Support provider APIs and CLI fallbacks.
Allow multiple models with ordered fallback (error/size/timeout).

High-level behavior

Collect attachments

Collect inbound attachments (MediaPaths, MediaUrls, MediaTypes).

Select per-capability

For each enabled capability (image/audio/video), select attachments per policy (default: first).

Choose model

Choose the first eligible model entry (size + capability + auth).

Fallback on failure

If a model fails or the media is too large, fall back to the next entry.

Apply success block

On success:

Body becomes [Image], [Audio], or [Video] block.
Audio sets {{Transcript}}; command parsing uses caption text when present, otherwise the transcript.
Captions are preserved as User text: inside the block.

If understanding fails or is disabled, the reply flow continues with the original body + attachments.

Config overview

tools.media supports shared models plus per-capability overrides:

Top-level keys

tools.media.models: shared model list (use capabilities to gate).
tools.media.image / tools.media.audio / tools.media.video:
- defaults (prompt, maxChars, maxBytes, timeoutSeconds, language)
- provider overrides (baseUrl, headers, providerOptions)
- Deepgram audio options via tools.media.audio.providerOptions.deepgram
- audio transcript echo controls (echoTranscript, default false; echoFormat)
- optional per-capability models list (preferred before shared models)
- attachments policy (mode, maxAttachments, prefer)
- scope (optional gating by channel/chatType/session key)
tools.media.concurrency: max concurrent capability runs (default 2).

json5

{  tools: {    media: {      models: [        /* shared list */      ],      image: {        /* optional overrides */      },      audio: {        /* optional overrides */        echoTranscript: true,        echoFormat: '📝 "{transcript}"',      },      video: {        /* optional overrides */      },    },  },}

Model entries

Each models[] entry can be provider or CLI:

Provider entry

json5

{  type: "provider", // default if omitted  provider: "openai",  model: "gpt-5.5",  prompt: "Describe the image in <= 500 chars.",  maxChars: 500,  maxBytes: 10485760,  timeoutSeconds: 60,  capabilities: ["image"], // optional, used for multi-modal entries  profile: "vision-profile",  preferredProfile: "vision-fallback",}

CLI entry

json5

{  type: "cli",  command: "gemini",  args: [    "-m",    "gemini-3-flash",    "--allowed-tools",    "read_file",    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",  ],  maxChars: 500,  maxBytes: 52428800,  timeoutSeconds: 120,  capabilities: ["video", "image"],}

CLI templates can also use:

{{MediaDir}} (directory containing the media file)
{{OutputDir}} (scratch dir created for this run)
{{OutputBase}} (scratch file base path, no extension)

Provider credentials (`apiKey`)

Provider media understanding uses the same provider auth resolution as normal model calls: auth profiles, environment variables, then models.providers.<providerId>.apiKey.

tools.media.*.models[] entries do not accept an inline apiKey field. The provider value in a media model entry, such as openai or moonshot, must have credentials available through one of the standard provider auth sources.

Minimal example:

json5

{  models: {    providers: {      openai: { apiKey: "&lt;OPENAI_API_KEY&gt;" },      moonshot: { apiKey: "&lt;MOONSHOT_API_KEY&gt;" },    },  },}

For the full provider auth reference, including profiles, environment variables, and custom base URLs, see Tools and custom providers.

Defaults and limits

Recommended defaults:

maxChars: 500 for image/video (short, command-friendly)
maxChars: unset for audio (full transcript unless you set a limit)
maxBytes:
- image: 10MB
- audio: 20MB
- video: 50MB

Rules

If media exceeds maxBytes, that model is skipped and the next model is tried.
Audio files smaller than 1024 bytes are treated as empty/corrupt and skipped before provider/CLI transcription; inbound reply context receives a deterministic placeholder transcript so the agent knows the note was too small.
If the model returns more than maxChars, output is trimmed.
prompt defaults to simple "Describe the {media}." plus the maxChars guidance (image/video only).
If the active primary image model already supports vision natively, OpenClaw skips the [Image] summary block and passes the original image into the model instead.
If a Gateway/WebChat primary model is text-only, image attachments are preserved as offloaded media://inbound/* refs so the image/PDF tools or configured image model can still inspect them instead of losing the attachment.
Explicit openclaw infer image describe --model <provider/model> requests are different: they run that image-capable provider/model directly, including Ollama refs such as ollama/qwen2.5vl:7b.
If <capability>.enabled: true but no models are configured, OpenClaw tries the active reply model when its provider supports the capability.

Auto-detect media understanding (default)

If tools.media.<capability>.enabled is not set to false and you haven't configured models, OpenClaw auto-detects in this order and stops at the first working option:

Active reply model

Active reply model when its provider supports the capability.

agents.defaults.imageModel

agents.defaults.imageModel primary/fallback refs (image only). Prefer provider/model refs. Bare refs are qualified from configured image-capable provider model entries only when the match is unique.

Local CLIs (audio only)

Local CLIs (if installed):

sherpa-onnx-offline (requires SHERPA_ONNX_MODEL_DIR with encoder/decoder/joiner/tokens)
whisper-cli (whisper-cpp; uses WHISPER_CPP_MODEL or the bundled tiny model)
whisper (Python CLI; downloads models automatically)

Gemini CLI

gemini using read_many_files.

Provider auth

Configured models.providers.* entries that support the capability are tried before the bundled fallback order.
Image-only config providers with an image-capable model auto-register for media understanding even when they are not a bundled vendor plugin.
Ollama image understanding is available when selected explicitly, for example through agents.defaults.imageModel or openclaw infer image describe --model ollama/<vision-model>.

Bundled fallback order:

Audio: OpenAI → Groq → xAI → Deepgram → OpenRouter → Google → SenseAudio → ElevenLabs → Mistral
Image: OpenAI → Anthropic → Google → MiniMax → MiniMax Portal → Z.AI
Video: Google → Qwen → Moonshot

To disable auto-detection, set:

json5

{  tools: {    media: {      audio: {        enabled: false,      },    },  },}

Proxy environment support (provider models)

When provider-based audio and video media understanding is enabled, OpenClaw honors standard outbound proxy environment variables for provider HTTP calls:

HTTPS_PROXY
HTTP_PROXY
ALL_PROXY
https_proxy
http_proxy
all_proxy

If no proxy env vars are set, media understanding uses direct egress. If the proxy value is malformed, OpenClaw logs a warning and falls back to direct fetch.

Capabilities (optional)

If you set capabilities, the entry only runs for those media types. For shared lists, OpenClaw can infer defaults:

openai, anthropic, minimax: image
minimax-portal: image
moonshot: image + video
openrouter: image + audio
google (Gemini API): image + audio + video
qwen: image + video
mistral: audio
zai: image
groq: audio
xai: audio
deepgram: audio
Any models.providers.<id>.models[] catalog with an image-capable model: image

For CLI entries, set capabilities explicitly to avoid surprising matches. If you omit capabilities, the entry is eligible for the list it appears in.

Provider support matrix (OpenClaw integrations)

Capability	Provider integration	Notes
Image	OpenAI, OpenAI Codex OAuth, Codex app-server, OpenRouter, Anthropic, Google, MiniMax, Moonshot, Qwen, Z.AI, config providers	Vendor plugins register image support; `openai/` can use API-key or Codex OAuth routing; `codex/` uses a bounded Codex app-server turn; MiniMax and MiniMax OAuth both use `MiniMax-VL-01`; image-capable config providers auto-register.
Audio	OpenAI, Groq, xAI, Deepgram, OpenRouter, Google, SenseAudio, ElevenLabs, Mistral	Provider transcription (Whisper/Groq/xAI/Deepgram/OpenRouter STT/Gemini/SenseAudio/Scribe/Voxtral).
Video	Google, Qwen, Moonshot	Provider video understanding via vendor plugins; Qwen video understanding uses the Standard DashScope endpoints.

Model selection guidance

Prefer the strongest latest-generation model available for each media capability when quality and safety matter.
For tool-enabled agents handling untrusted inputs, avoid older/weaker media models.
Keep at least one fallback per capability for availability (quality model + faster/cheaper model).
CLI fallbacks (whisper-cli, whisper, gemini) are useful when provider APIs are unavailable.
parakeet-mlx note: with --output-dir, OpenClaw reads <output-dir>/<media-basename>.txt when output format is txt (or unspecified); non-txt formats fall back to stdout.

Attachment policy

Per-capability attachments controls which attachments are processed:

mode"first" | "all"default: first

Whether to process the first selected attachment or all of them.

maxAttachmentsnumberdefault: 1

Cap the number processed.

prefer"first" | "last" | "path" | "url"

Selection preference among candidate attachments.

When mode: "all", outputs are labeled [Image 1/2], [Audio 2/2], etc.

File-attachment extraction behavior

Extracted file text is wrapped as untrusted external content before it is appended to the media prompt.
The injected block uses explicit boundary markers like <<<EXTERNAL_UNTRUSTED_CONTENT id="...">>> / <<<END_EXTERNAL_UNTRUSTED_CONTENT id="...">>> and includes a Source: External metadata line.
This attachment-extraction path intentionally omits the long SECURITY NOTICE: banner to avoid bloating the media prompt; the boundary markers and metadata still remain.
If a file has no extractable text, OpenClaw injects [No extractable text].
If a PDF falls back to rendered page images in this path, the media prompt keeps the placeholder [PDF content rendered to images; images not forwarded to model] because this attachment-extraction step forwards text blocks, not the rendered PDF images.

Config examples

Shared models + overrides

json5

{  tools: {    media: {      models: [        { provider: "openai", model: "gpt-5.5", capabilities: ["image"] },        {          provider: "google",          model: "gemini-3-flash-preview",          capabilities: ["image", "audio", "video"],        },        {          type: "cli",          command: "gemini",          args: [            "-m",            "gemini-3-flash",            "--allowed-tools",            "read_file",            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",          ],          capabilities: ["image", "video"],        },      ],      audio: {        attachments: { mode: "all", maxAttachments: 2 },      },      video: {        maxChars: 500,      },    },  },}

Audio + video only

json5

{  tools: {    media: {      audio: {        enabled: true,        models: [          { provider: "openai", model: "gpt-4o-mini-transcribe" },          {            type: "cli",            command: "whisper",            args: ["--model", "base", "{{MediaPath}}"],          },        ],      },      video: {        enabled: true,        maxChars: 500,        models: [          { provider: "google", model: "gemini-3-flash-preview" },          {            type: "cli",            command: "gemini",            args: [              "-m",              "gemini-3-flash",              "--allowed-tools",              "read_file",              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",            ],          },        ],      },    },  },}

Image-only

json5

{  tools: {    media: {      image: {        enabled: true,        maxBytes: 10485760,        maxChars: 500,        models: [          { provider: "openai", model: "gpt-5.5" },          { provider: "anthropic", model: "claude-opus-4-6" },          {            type: "cli",            command: "gemini",            args: [              "-m",              "gemini-3-flash",              "--allowed-tools",              "read_file",              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",            ],          },        ],      },    },  },}

Multi-modal single entry

json5

{  tools: {    media: {      image: {        models: [          {            provider: "google",            model: "gemini-3.1-pro-preview",            capabilities: ["image", "video", "audio"],          },        ],      },      audio: {        models: [          {            provider: "google",            model: "gemini-3.1-pro-preview",            capabilities: ["image", "video", "audio"],          },        ],      },      video: {        models: [          {            provider: "google",            model: "gemini-3.1-pro-preview",            capabilities: ["image", "video", "audio"],          },        ],      },    },  },}

Status output

When media understanding runs, /status includes a short summary line:

Code

📎 Media: image ok (openai/gpt-5.4) · audio skipped (maxBytes)

This shows per-capability outcomes and the chosen provider/model when applicable.

Notes

Understanding is best-effort. Errors do not block replies.
Attachments are still passed to models even when understanding is disabled.
Use scope to limit where understanding runs (e.g. only DMs).

Was this useful?

Media understanding

Goals

High-level behavior

Collect attachments

Select per-capability

Choose model

Fallback on failure

Apply success block

Config overview

Model entries

Provider entry

CLI entry

Provider credentials (`apiKey`)

Defaults and limits

Auto-detect media understanding (default)

Active reply model

agents.defaults.imageModel

Local CLIs (audio only)

Gemini CLI

Provider auth

Proxy environment support (provider models)

Capabilities (optional)

Provider support matrix (OpenClaw integrations)

Model selection guidance

Attachment policy

Config examples

Shared models + overrides

Audio + video only

Image-only

Multi-modal single entry

Status output

Notes

On this page

Molty

Goals

High-level behavior

Collect attachments

Select per-capability

Choose model

Fallback on failure

Apply success block

Config overview

Model entries

Provider entry

CLI entry

Provider credentials (apiKey)

Defaults and limits

Auto-detect media understanding (default)

Active reply model

agents.defaults.imageModel

Local CLIs (audio only)

Gemini CLI

Provider auth

Proxy environment support (provider models)

Capabilities (optional)

Provider support matrix (OpenClaw integrations)

Model selection guidance

Attachment policy

Config examples

Shared models + overrides

Audio + video only

Image-only

Multi-modal single entry

Status output

Notes

Related

On this page

Provider credentials (`apiKey`)