---
summary: "How inbound audio/voice notes are downloaded, transcribed, and injected into replies"
read_when:
  - Changing audio transcription or media handling
title: "Audio and Voice Notes"
---

# Audio / Voice Notes
## What works
- **Media understanding (audio)**: If audio understanding is enabled (or auto-detected), OpenClaw:
1. Locates the first audio attachment (local path or URL) and downloads it if needed.
2. Enforces `maxBytes` before sending to each model entry.
3. Runs the first eligible model entry in order (provider or CLI).
4. If it fails or skips (size/timeout), it tries the next entry.
5. On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.
## Auto-detection (default)
If you **don’t configure models** and `tools.media.audio.enabled` is **not** set to `false`,
OpenClaw auto-detects in this order and stops at the first working option:
1. **Local CLIs** (if installed)
   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   - `whisper` (Python CLI; downloads models automatically)
2. **Gemini CLI** (`gemini`) using `read_many_files`
3. **Provider keys** (OpenAI → Groq → Deepgram → Google)
To disable auto-detection, set `tools.media.audio.enabled: false`.
To customize, set `tools.media.audio.models`.
Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
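If auto-detection misses your binary, an explicit CLI entry with a full command path sidesteps `PATH` lookup entirely. A sketch using `whisper-cli` (the paths are placeholders for your install):

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          {
            type: "cli",
            // Placeholder paths — point these at your actual binary and model.
            command: "/opt/homebrew/bin/whisper-cli",
            args: ["-m", "/path/to/ggml-tiny.bin", "-f", "{{MediaPath}}"],
          },
        ],
      },
    },
  },
}
```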
## Config examples
### Provider + CLI fallback (OpenAI + Whisper CLI)
```json5
{
tools: {
media: {
audio: {
enabled: true,
maxBytes: 20971520,
models: [
{ provider: "openai", model: "gpt-4o-mini-transcribe" },
{
type: "cli",
command: "whisper",
args: ["--model", "base", "{{MediaPath}}"],
timeoutSeconds: 45,
},
],
},
},
},
}
```
### Provider-only with scope gating
```json5
{
tools: {
media: {
audio: {
enabled: true,
scope: {
default: "allow",
rules: [{ action: "deny", match: { chatType: "group" } }],
},
models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
},
},
},
}
```
### Provider-only (Deepgram)
```json5
{
tools: {
media: {
audio: {
enabled: true,
models: [{ provider: "deepgram", model: "nova-3" }],
},
},
},
}
```
## Notes & limits
- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
- Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used.
- Deepgram setup details: [Deepgram (audio transcription)](/providers/deepgram).
- Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
- Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
- Default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
- OpenAI auto default is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy.
- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`).
- Transcript is available to templates as `{{Transcript}}`.
- CLI stdout is capped (5MB); keep CLI output concise.
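Putting the limits above together, an illustrative config that transcribes up to three voice notes per message and trims each transcript (the `maxChars` and `maxAttachments` values are arbitrary examples):

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxChars: 4000,
        attachments: { mode: "all", maxAttachments: 3 },
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```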
## Mention Detection in Groups
When `requireMention: true` is set for a group chat, OpenClaw transcribes audio **before** checking for mentions, so a voice note whose mention is spoken rather than typed can still be processed.

**How it works:**

1. If a voice message has no text body and the group requires mentions, OpenClaw performs a "preflight" transcription.
2. The transcript is checked for mention patterns (e.g., `@BotName`, emoji triggers).
3. If a mention is found, the message proceeds through the full reply pipeline; the transcript lets voice notes pass the mention gate.
**Fallback behavior:**
- If transcription fails during preflight (timeout, API error, etc.), the message is processed based on text-only mention detection.
- This ensures that mixed messages (text + audio) are never incorrectly dropped.

**Example:** A user sends a voice note saying "Hey @Claude, what's the weather?" in a Telegram group with `requireMention: true`. The voice note is transcribed, the mention is detected, and the agent replies.
## Gotchas
- Scope rules are evaluated first-match-wins. `chatType` is normalized to `direct`, `group`, or `room`.
- Ensure your CLI exits 0 and prints plain text; massage JSON output into plain text (e.g., via `jq -r .text`).
- Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.
- Preflight transcription only processes the **first** audio attachment for mention detection. Additional audio is processed during the main media understanding phase.
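
As a sketch of the JSON-output gotcha above, a hypothetical model entry that shells out and extracts plain text with `jq` (`my-asr` is a placeholder for any CLI that prints `{"text": "..."}` to stdout):

```json5
{
  type: "cli",
  command: "sh",
  // Placeholder CLI: pipe its JSON stdout through jq so OpenClaw
  // receives plain transcript text on stdout.
  args: ["-c", "my-asr '{{MediaPath}}' | jq -r .text"],
  timeoutSeconds: 60,
}
```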