---
summary: "Inbound image/audio/video understanding (optional) with provider + CLI fallbacks"
read_when:
- Designing or refactoring media understanding
- Tuning inbound audio/video/image preprocessing
title: "Media Understanding"
---

# Media Understanding (Inbound)

OpenClaw can **summarize inbound media** (image/audio/video) before the reply pipeline runs. It auto-detects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.

## Goals

- Optional: pre-digest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support **provider APIs** and **CLI fallbacks**.
- Allow multiple models with ordered fallback (error/size/timeout).

## High-level behavior

1. Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
2. For each enabled capability (image/audio/video), select attachments per policy (default: **first**).
3. Choose the first eligible model entry (size + capability + auth).
4. If a model fails or the media is too large, **fall back to the next entry**.
5. On success:
   - `Body` becomes an `[Image]`, `[Audio]`, or `[Video]` block.
   - Audio sets `{{Transcript}}`; command parsing uses caption text when present, otherwise the transcript.
   - Captions are preserved as `User text:` inside the block.

If understanding fails or is disabled, **the reply flow continues** with the original body + attachments.
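
As an illustration, a successful image run might rewrite the body along these lines (the description text is hypothetical model output; only the `[Image]` / `User text:` framing comes from the pipeline):

```
[Image] Screenshot of a terminal showing a failed build with a missing-module error.
User text: why is this failing?
```
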
## Config overview

`tools.media` supports **shared models** plus per-capability overrides:

- `tools.media.models`: shared model list (use `capabilities` to gate).
- `tools.media.image` / `tools.media.audio` / `tools.media.video`:
  - defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
  - provider overrides (`baseUrl`, `headers`, `providerOptions`)
    - Deepgram audio options via `tools.media.audio.providerOptions.deepgram`
  - optional **per-capability `models` list** (preferred before shared models)
  - `attachments` policy (`mode`, `maxAttachments`, `prefer`)
  - `scope` (optional gating by channel/chatType/session key)
- `tools.media.concurrency`: max concurrent capability runs (default **2**).

```json5
{
  tools: {
    media: {
      models: [
        /* shared list */
      ],
      image: {
        /* optional overrides */
      },
      audio: {
        /* optional overrides */
      },
      video: {
        /* optional overrides */
      },
    },
  },
}
```
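
For example, Deepgram options can be nested under the audio override. The keys shown (`smart_format`, `language`) are standard Deepgram transcription parameters, but treat this as a sketch; it assumes `providerOptions.deepgram` is passed through to the API unchanged:

```json5
{
  tools: {
    media: {
      audio: {
        providerOptions: {
          deepgram: {
            // Standard Deepgram parameters (assumed pass-through):
            smart_format: true,
            language: "en",
          },
        },
      },
    },
  },
}
```
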
### Model entries

Each `models[]` entry can be **provider** or **CLI**:

```json5
{
  type: "provider", // default if omitted
  provider: "openai",
  model: "gpt-5.2",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760,
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, used for multi-modal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback",
}
```
```json5
{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
  ],
  maxChars: 500,
  maxBytes: 52428800,
  timeoutSeconds: 120,
  capabilities: ["video", "image"],
}
```

CLI templates can also use:

- `{{MediaDir}}` (directory containing the media file)
- `{{OutputDir}}` (scratch dir created for this run)
- `{{OutputBase}}` (scratch file base path, no extension)
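
For example, a hypothetical audio CLI entry could write its transcript into the scratch directory; the `--output_dir`/`--output_format` flags belong to the Python `whisper` CLI, while the placeholder wiring is illustrative:

```json5
{
  type: "cli",
  command: "whisper",
  // Writes a .txt transcript into the per-run scratch directory:
  args: ["--model", "base", "--output_format", "txt", "--output_dir", "{{OutputDir}}", "{{MediaPath}}"],
  capabilities: ["audio"],
}
```
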
## Defaults and limits

Recommended defaults:

- `maxChars`: **500** for image/video (short, command-friendly)
- `maxChars`: **unset** for audio (full transcript unless you set a limit)
- `maxBytes`:
  - image: **10MB**
  - audio: **20MB**
  - video: **50MB**

Rules:

- If media exceeds `maxBytes`, that model is skipped and the **next model is tried**.
- If the model returns more than `maxChars`, output is trimmed.
- `prompt` defaults to a simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).
- If `<capability>.enabled: true` but no models are configured, OpenClaw tries the **active reply model** when its provider supports the capability.

### Auto-detect media understanding (default)

If `tools.media.<capability>.enabled` is **not** set to `false` and you haven’t configured models, OpenClaw auto-detects in this order and **stops at the first working option**:

1. **Local CLIs** (audio only; if installed)
   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   - `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   - `whisper` (Python CLI; downloads models automatically)
2. **Gemini CLI** (`gemini`) using `read_many_files`
3. **Provider keys**
   - Audio: OpenAI → Groq → Deepgram → Google
   - Image: OpenAI → Anthropic → Google → MiniMax
   - Video: Google

To disable auto-detection, set:

```json5
{
  tools: {
    media: {
      audio: {
        enabled: false,
      },
    },
  },
}
```

Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.

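
For instance, a minimal sketch of an explicit CLI model with a full command path (the paths are illustrative; `-m`/`-f` are whisper-cpp's model/input flags):

```json5
{
  tools: {
    media: {
      audio: {
        models: [
          {
            type: "cli",
            // Full path avoids PATH lookup issues (path is illustrative):
            command: "/opt/homebrew/bin/whisper-cli",
            args: ["-m", "/path/to/ggml-tiny.bin", "-f", "{{MediaPath}}"],
            capabilities: ["audio"],
          },
        ],
      },
    },
  },
}
```
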
## Capabilities (optional)

If you set `capabilities`, the entry only runs for those media types. For shared lists, OpenClaw can infer defaults:

- `openai`, `anthropic`, `minimax`: **image**
- `google` (Gemini API): **image + audio + video**
- `groq`: **audio**
- `deepgram`: **audio**

For CLI entries, **set `capabilities` explicitly** to avoid surprising matches. If you omit `capabilities`, the entry is eligible for the list it appears in.

## Provider support matrix (OpenClaw integrations)

| Capability | Provider integration                             | Notes                                                     |
| ---------- | ------------------------------------------------ | --------------------------------------------------------- |
| Image      | OpenAI / Anthropic / Google / others via `pi-ai` | Any image-capable model in the registry works.            |
| Audio      | OpenAI, Groq, Deepgram, Google, Mistral          | Provider transcription (Whisper/Deepgram/Gemini/Voxtral). |
| Video      | Google (Gemini API)                              | Provider video understanding.                             |
## Recommended providers

**Image**

- Prefer your active model if it supports images.
- Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-6`, `google/gemini-3-pro-preview`.

**Audio**

- `openai/gpt-4o-mini-transcribe`, `groq/whisper-large-v3-turbo`, `deepgram/nova-3`, or `mistral/voxtral-mini-latest`.
- CLI fallback: `whisper-cli` (whisper-cpp) or `whisper`.
- Deepgram setup: [Deepgram (audio transcription)](/providers/deepgram).

**Video**

- `google/gemini-3-flash-preview` (fast), `google/gemini-3-pro-preview` (richer).
- CLI fallback: `gemini` CLI (supports `read_file` on video/audio).

## Attachment policy

Per-capability `attachments` controls which attachments are processed:

- `mode`: `first` (default) or `all`
- `maxAttachments`: cap the number processed (default **1**)
- `prefer`: `first`, `last`, `path`, `url`

When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
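
For example, to caption up to three images per message and prefer URL attachments (a sketch using the documented keys; the values are illustrative):

```json5
{
  tools: {
    media: {
      image: {
        attachments: { mode: "all", maxAttachments: 3, prefer: "url" },
      },
    },
  },
}
```
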

## Config examples

### 1) Shared models list + overrides

```json5
{
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.2", capabilities: ["image"] },
        {
          provider: "google",
          model: "gemini-3-flash-preview",
          capabilities: ["image", "audio", "video"],
        },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m",
            "gemini-3-flash",
            "--allowed-tools",
            "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
          ],
          capabilities: ["image", "video"],
        },
      ],
      audio: {
        attachments: { mode: "all", maxAttachments: 2 },
      },
      video: {
        maxChars: 500,
      },
    },
  },
}
```
### 2) Audio + Video only (image off)

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
          },
        ],
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [
          { provider: "google", model: "gemini-3-flash-preview" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
            ],
          },
        ],
      },
    },
  },
}
```

### 3) Optional image understanding

```json5
{
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10485760,
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.2" },
          { provider: "anthropic", model: "claude-opus-4-6" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
            ],
          },
        ],
      },
    },
  },
}
```

### 4) Multi-modal single entry (explicit capabilities)

```json5
{
  tools: {
    media: {
      image: {
        models: [
          {
            provider: "google",
            model: "gemini-3-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
      audio: {
        models: [
          {
            provider: "google",
            model: "gemini-3-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
      video: {
        models: [
          {
            provider: "google",
            model: "gemini-3-pro-preview",
            capabilities: ["image", "video", "audio"],
          },
        ],
      },
    },
  },
}
```

## Status output

When media understanding runs, `/status` includes a short summary line:

```
📎 Media: image ok (openai/gpt-5.2) · audio skipped (maxBytes)
```

This shows per-capability outcomes and the chosen provider/model when applicable.

## Notes

- Understanding is **best-effort**. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use `scope` to limit where understanding runs (e.g. only DMs).
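
For example, a `scope` sketch limiting audio understanding to direct messages; the exact field names are assumptions based on the channel/chatType/session-key gating mentioned above:

```json5
{
  tools: {
    media: {
      audio: {
        // Hypothetical field name; check your config reference for the real schema.
        scope: { chatType: ["dm"] },
      },
    },
  },
}
```
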
## Related docs

- [Configuration](/gateway/configuration)
- [Image & Media Support](/nodes/images)