Files
openclaw/docs/help/testing.md

415 lines
21 KiB
Markdown
Raw Normal View History

2026-01-10 01:15:42 +00:00
---
summary: "Testing kit: unit/e2e/live suites, Docker runners, and what each test covers"
read_when:
- Running tests locally or in CI
- Adding regressions for model/provider bugs
- Debugging gateway + agent behavior
title: "Testing"
2026-01-10 01:15:42 +00:00
---
# Testing
2026-01-30 03:15:10 +01:00
OpenClaw has three Vitest suites (unit/integration, e2e, live) and a small set of Docker runners.
2026-01-10 19:53:24 +00:00
This doc is a “how we test” guide:
2026-01-31 21:13:13 +09:00
- What each suite covers (and what it deliberately does _not_ cover)
2026-01-10 19:53:24 +00:00
- Which commands to run for common workflows (local, pre-push, debugging)
- How live tests discover credentials and select models/providers
- How to add regressions for real-world model/provider issues
2026-01-10 01:15:42 +00:00
## Quick start
2026-01-10 19:53:24 +00:00
Most days:
2026-01-31 21:13:13 +09:00
- Full gate (expected before push): `pnpm build && pnpm check && pnpm test`
2026-01-10 19:53:24 +00:00
When you touch tests or want extra confidence:
2026-01-31 21:13:13 +09:00
2026-01-10 01:15:42 +00:00
- Coverage gate: `pnpm test:coverage`
- E2E suite: `pnpm test:e2e`
2026-01-10 19:53:24 +00:00
When debugging real providers/models (requires real creds):
2026-01-31 21:13:13 +09:00
- Live suite (models + gateway tool/image probes): `pnpm test:live`
2026-01-10 19:53:24 +00:00
Tip: when you only need one failing case, prefer narrowing live tests via the allowlist env vars described below.
2026-01-10 01:15:42 +00:00
## Test suites (what runs where)
2026-01-10 19:53:24 +00:00
Think of the suites as “increasing realism” (and increasing flakiness/cost):
2026-01-10 01:15:42 +00:00
### Unit / integration (default)
- Command: `pnpm test`
- Config: `scripts/test-parallel.mjs` (runs `vitest.unit.config.ts`, `vitest.extensions.config.ts`, `vitest.gateway.config.ts`)
- Files: `src/**/*.test.ts`, `extensions/**/*.test.ts`
2026-01-10 19:53:24 +00:00
- Scope:
- Pure unit tests
- In-process integration tests (gateway auth, routing, tooling, parsing, config)
- Deterministic regressions for known bugs
- Expectations:
- Runs in CI
- No real keys required
- Should be fast and stable
- Pool note:
- OpenClaw uses Vitest `vmForks` on Node 22/23 for faster unit shards.
- On Node 24+, OpenClaw automatically falls back to regular `forks` to avoid Node VM linking errors (`ERR_VM_MODULE_LINK_FAILURE` / `module is already linked`).
- Override manually with `OPENCLAW_TEST_VM_FORKS=0` (force `forks`) or `OPENCLAW_TEST_VM_FORKS=1` (force `vmForks`).
2026-01-10 01:15:42 +00:00
### E2E (gateway smoke)
- Command: `pnpm test:e2e`
- Config: `vitest.e2e.config.ts`
- Files: `src/**/*.e2e.test.ts`
2026-02-13 14:57:09 +00:00
- Runtime defaults:
- Uses Vitest `vmForks` for faster file startup.
- Uses adaptive workers (CI: 2-4, local: 4-8).
- Runs in silent mode by default to reduce console I/O overhead.
- Useful overrides:
- `OPENCLAW_E2E_WORKERS=<n>` to force worker count (capped at 16).
- `OPENCLAW_E2E_VERBOSE=1` to re-enable verbose console output.
2026-01-10 19:53:24 +00:00
- Scope:
- Multi-instance gateway end-to-end behavior
- WebSocket/HTTP surfaces, node pairing, and heavier networking
- Expectations:
- Runs in CI (when enabled in the pipeline)
- No real keys required
- More moving parts than unit tests (can be slower)
2026-01-10 01:15:42 +00:00
### Live (real providers + real models)
- Command: `pnpm test:live`
- Config: `vitest.live.config.ts`
- Files: `src/**/*.live.test.ts`
2026-01-30 03:15:10 +01:00
- Default: **enabled** by `pnpm test:live` (sets `OPENCLAW_LIVE_TEST=1`)
2026-01-10 19:53:24 +00:00
- Scope:
2026-01-31 21:13:13 +09:00
- “Does this provider/model actually work _today_ with real creds?”
2026-01-10 19:53:24 +00:00
- Catch provider format changes, tool-calling quirks, auth issues, and rate limit behavior
- Expectations:
- Not CI-stable by design (real networks, real provider policies, quotas, outages)
- Costs money / uses rate limits
- Prefer running narrowed subsets instead of “everything”
- Live runs will source `~/.profile` to pick up missing API keys
- API key rotation (provider-specific): set `*_API_KEYS` with comma/semicolon format or `*_API_KEY_1`, `*_API_KEY_2` (for example `OPENAI_API_KEYS`, `ANTHROPIC_API_KEYS`, `GEMINI_API_KEYS`) or per-live override via `OPENCLAW_LIVE_*_KEY`; tests retry on rate limit responses.
2026-01-10 19:53:24 +00:00
## Which suite should I run?
Use this decision table:
2026-01-31 21:13:13 +09:00
2026-01-10 19:53:24 +00:00
- Editing logic/tests: run `pnpm test` (and `pnpm test:coverage` if you changed a lot)
- Touching gateway networking / WS protocol / pairing: add `pnpm test:e2e`
- Debugging “my bot is down” / provider-specific failures / tool calling: run a narrowed `pnpm test:live`
2026-01-10 01:15:42 +00:00
## Live: Android node capability sweep
- Test: `src/gateway/android-node.capabilities.live.test.ts`
- Script: `pnpm android:test:integration`
- Goal: invoke **every command currently advertised** by a connected Android node and assert command contract behavior.
- Scope:
- Preconditioned/manual setup (the suite does not install/run/pair the app).
- Command-by-command gateway `node.invoke` validation for the selected Android node.
- Required pre-setup:
- Android app already connected + paired to the gateway.
- App kept in foreground.
- Permissions/capture consent granted for capabilities you expect to pass.
- Optional target overrides:
- `OPENCLAW_ANDROID_NODE_ID` or `OPENCLAW_ANDROID_NODE_NAME`.
- `OPENCLAW_ANDROID_GATEWAY_URL` / `OPENCLAW_ANDROID_GATEWAY_TOKEN` / `OPENCLAW_ANDROID_GATEWAY_PASSWORD`.
- Full Android setup details: [Android App](/platforms/android)
2026-01-10 01:15:42 +00:00
## Live: model smoke (profile keys)
2026-01-10 19:53:24 +00:00
Live tests are split into two layers so we can isolate failures:
2026-01-31 21:13:13 +09:00
2026-01-10 19:53:24 +00:00
- “Direct model” tells us the provider/model can answer at all with the given key.
- “Gateway smoke” tells us the full gateway+agent pipeline works for that model (sessions, history, tools, sandbox policy, etc.).
### Layer 1: Direct model completion (no gateway)
- Test: `src/agents/models.profiles.live.test.ts`
- Goal:
- Enumerate discovered models
- Use `getApiKeyForModel` to select models you have creds for
- Run a small completion per model (and targeted regressions where needed)
- How to enable:
2026-01-30 03:15:10 +01:00
- `pnpm test:live` (or `OPENCLAW_LIVE_TEST=1` if invoking Vitest directly)
- Set `OPENCLAW_LIVE_MODELS=modern` (or `all`, alias for modern) to actually run this suite; otherwise it skips to keep `pnpm test:live` focused on gateway smoke
2026-01-10 19:53:24 +00:00
- How to select models:
- `OPENCLAW_LIVE_MODELS=modern` to run the modern allowlist (Opus/Sonnet/Haiku 4.5, GPT-5.x + Codex, Gemini 3, GLM 4.7, MiniMax M2.5, Grok 4)
2026-01-30 03:15:10 +01:00
- `OPENCLAW_LIVE_MODELS=all` is an alias for the modern allowlist
- or `OPENCLAW_LIVE_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-6,..."` (comma allowlist)
- How to select providers:
2026-01-30 03:15:10 +01:00
- `OPENCLAW_LIVE_PROVIDERS="google,google-antigravity,google-gemini-cli"` (comma allowlist)
2026-01-10 19:53:24 +00:00
- Where keys come from:
- By default: profile store and env fallbacks
2026-01-30 03:15:10 +01:00
- Set `OPENCLAW_LIVE_REQUIRE_PROFILE_KEYS=1` to enforce **profile store** only
2026-01-10 19:53:24 +00:00
- Why this exists:
- Separates “provider API is broken / key is invalid” from “gateway agent pipeline is broken”
- Contains small, isolated regressions (example: OpenAI Responses/Codex Responses reasoning replay + tool-call flows)
2026-01-30 03:15:10 +01:00
### Layer 2: Gateway + dev agent smoke (what “@openclaw” actually does)
2026-01-10 19:53:24 +00:00
- Test: `src/gateway/gateway-models.profiles.live.test.ts`
- Goal:
- Spin up an in-process gateway
- Create/patch a `agent:dev:*` session (model override per run)
- Iterate models-with-keys and assert:
- “meaningful” response (no tools)
- a real tool invocation works (read probe)
- optional extra tool probes (exec+read probe)
2026-01-10 19:53:24 +00:00
- OpenAI regression paths (tool-call-only → follow-up) keep working
2026-01-11 04:46:30 +00:00
- Probe details (so you can explain failures quickly):
- `read` probe: the test writes a nonce file in the workspace and asks the agent to `read` it and echo the nonce back.
- `exec+read` probe: the test asks the agent to `exec`-write a nonce into a temp file, then `read` it back.
2026-01-11 04:46:30 +00:00
- image probe: the test attaches a generated PNG (cat + randomized code) and expects the model to return `cat <CODE>`.
- Implementation reference: `src/gateway/gateway-models.profiles.live.test.ts` and `src/gateway/live-image-probe.ts`.
2026-01-10 19:53:24 +00:00
- How to enable:
2026-01-30 03:15:10 +01:00
- `pnpm test:live` (or `OPENCLAW_LIVE_TEST=1` if invoking Vitest directly)
2026-01-10 19:53:24 +00:00
- How to select models:
- Default: modern allowlist (Opus/Sonnet/Haiku 4.5, GPT-5.x + Codex, Gemini 3, GLM 4.7, MiniMax M2.5, Grok 4)
2026-01-30 03:15:10 +01:00
- `OPENCLAW_LIVE_GATEWAY_MODELS=all` is an alias for the modern allowlist
- Or set `OPENCLAW_LIVE_GATEWAY_MODELS="provider/model"` (or comma list) to narrow
- How to select providers (avoid “OpenRouter everything”):
2026-01-30 03:15:10 +01:00
- `OPENCLAW_LIVE_GATEWAY_PROVIDERS="google,google-antigravity,google-gemini-cli,openai,anthropic,zai,minimax"` (comma allowlist)
- Tool + image probes are always on in this live test:
- `read` probe + `exec+read` probe (tool stress)
- image probe runs when the model advertises image input support
- Flow (high level):
- Test generates a tiny PNG with “CAT” + random code (`src/gateway/live-image-probe.ts`)
- Sends it via `agent` `attachments: [{ mimeType: "image/png", content: "<base64>" }]`
- Gateway parses attachments into `images[]` (`src/gateway/server-methods/agent.ts` + `src/gateway/chat-attachments.ts`)
- Embedded agent forwards a multimodal user message to the model
- Assertion: reply contains `cat` + the code (OCR tolerance: minor mistakes allowed)
2026-01-10 19:53:24 +00:00
2026-01-11 02:24:35 +00:00
Tip: to see what you can test on your machine (and the exact `provider/model` ids), run:
```bash
2026-01-30 03:15:10 +01:00
openclaw models list
openclaw models list --json
2026-01-11 02:24:35 +00:00
```
2026-01-10 21:40:59 +00:00
## Live: Anthropic setup-token smoke
- Test: `src/agents/anthropic.setup-token.live.test.ts`
- Goal: verify Claude Code CLI setup-token (or a pasted setup-token profile) can complete an Anthropic prompt.
2026-01-10 21:40:59 +00:00
- Enable:
2026-01-30 03:15:10 +01:00
- `pnpm test:live` (or `OPENCLAW_LIVE_TEST=1` if invoking Vitest directly)
- `OPENCLAW_LIVE_SETUP_TOKEN=1`
2026-01-10 21:40:59 +00:00
- Token sources (pick one):
2026-01-30 03:15:10 +01:00
- Profile: `OPENCLAW_LIVE_SETUP_TOKEN_PROFILE=anthropic:setup-token-test`
- Raw token: `OPENCLAW_LIVE_SETUP_TOKEN_VALUE=sk-ant-oat01-...`
2026-01-10 21:40:59 +00:00
- Model override (optional):
- `OPENCLAW_LIVE_SETUP_TOKEN_MODEL=anthropic/claude-opus-4-6`
2026-01-10 21:40:59 +00:00
Setup example:
```bash
2026-01-30 03:15:10 +01:00
openclaw models auth paste-token --provider anthropic --profile-id anthropic:setup-token-test
OPENCLAW_LIVE_SETUP_TOKEN=1 OPENCLAW_LIVE_SETUP_TOKEN_PROFILE=anthropic:setup-token-test pnpm test:live src/agents/anthropic.setup-token.live.test.ts
2026-01-10 21:40:59 +00:00
```
## Live: CLI backend smoke (Claude Code CLI or other local CLIs)
2026-01-10 23:31:25 +00:00
- Test: `src/gateway/gateway-cli-backend.live.test.ts`
- Goal: validate the Gateway + agent pipeline using a local CLI backend, without touching your default config.
- Enable:
2026-01-30 03:15:10 +01:00
- `pnpm test:live` (or `OPENCLAW_LIVE_TEST=1` if invoking Vitest directly)
- `OPENCLAW_LIVE_CLI_BACKEND=1`
2026-01-10 23:31:25 +00:00
- Defaults:
- Model: `claude-cli/claude-sonnet-4-6`
2026-01-10 23:31:25 +00:00
- Command: `claude`
- Args: `["-p","--output-format","json","--permission-mode","bypassPermissions"]`
2026-01-10 23:31:25 +00:00
- Overrides (optional):
- `OPENCLAW_LIVE_CLI_BACKEND_MODEL="claude-cli/claude-opus-4-6"`
- `OPENCLAW_LIVE_CLI_BACKEND_MODEL="codex-cli/gpt-5.4"`
2026-01-30 03:15:10 +01:00
- `OPENCLAW_LIVE_CLI_BACKEND_COMMAND="/full/path/to/claude"`
- `OPENCLAW_LIVE_CLI_BACKEND_ARGS='["-p","--output-format","json","--permission-mode","bypassPermissions"]'`
- `OPENCLAW_LIVE_CLI_BACKEND_CLEAR_ENV='["ANTHROPIC_API_KEY","ANTHROPIC_API_KEY_OLD"]'`
- `OPENCLAW_LIVE_CLI_BACKEND_IMAGE_PROBE=1` to send a real image attachment (paths are injected into the prompt).
- `OPENCLAW_LIVE_CLI_BACKEND_IMAGE_ARG="--image"` to pass image file paths as CLI args instead of prompt injection.
- `OPENCLAW_LIVE_CLI_BACKEND_IMAGE_MODE="repeat"` (or `"list"`) to control how image args are passed when `IMAGE_ARG` is set.
- `OPENCLAW_LIVE_CLI_BACKEND_RESUME_PROBE=1` to send a second turn and validate resume flow.
- `OPENCLAW_LIVE_CLI_BACKEND_DISABLE_MCP_CONFIG=0` to keep Claude Code CLI MCP config enabled (default disables MCP config with a temporary empty file).
2026-01-10 23:31:25 +00:00
Example:
```bash
2026-01-30 03:15:10 +01:00
OPENCLAW_LIVE_CLI_BACKEND=1 \
OPENCLAW_LIVE_CLI_BACKEND_MODEL="claude-cli/claude-sonnet-4-6" \
2026-01-10 23:31:25 +00:00
pnpm test:live src/gateway/gateway-cli-backend.live.test.ts
```
2026-01-10 19:53:24 +00:00
### Recommended live recipes
Narrow, explicit allowlists are fastest and least flaky:
- Single model, direct (no gateway):
2026-01-30 03:15:10 +01:00
- `OPENCLAW_LIVE_MODELS="openai/gpt-5.2" pnpm test:live src/agents/models.profiles.live.test.ts`
2026-01-10 19:53:24 +00:00
- Single model, gateway smoke:
2026-01-30 03:15:10 +01:00
- `OPENCLAW_LIVE_GATEWAY_MODELS="openai/gpt-5.2" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`
2026-01-10 19:53:24 +00:00
- Tool calling across several providers:
- `OPENCLAW_LIVE_GATEWAY_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-6,google/gemini-3-flash-preview,zai/glm-4.7,minimax/minimax-m2.5" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`
2026-01-10 01:15:42 +00:00
2026-01-10 20:55:47 +00:00
- Google focus (Gemini API key + Antigravity):
2026-01-30 03:15:10 +01:00
- Gemini (API key): `OPENCLAW_LIVE_GATEWAY_MODELS="google/gemini-3-flash-preview" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`
- Antigravity (OAuth): `OPENCLAW_LIVE_GATEWAY_MODELS="google-antigravity/claude-opus-4-6-thinking,google-antigravity/gemini-3-pro-high" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`
2026-01-10 20:55:47 +00:00
2026-01-10 21:40:59 +00:00
Notes:
2026-01-31 21:13:13 +09:00
2026-01-10 21:40:59 +00:00
- `google/...` uses the Gemini API (API key).
- `google-antigravity/...` uses the Antigravity OAuth bridge (Cloud Code Assist-style agent endpoint).
- `google-gemini-cli/...` uses the local Gemini CLI on your machine (separate auth + tooling quirks).
2026-01-11 04:46:30 +00:00
- Gemini API vs Gemini CLI:
2026-01-30 03:15:10 +01:00
- API: OpenClaw calls Googles hosted Gemini API over HTTP (API key / profile auth); this is what most users mean by “Gemini”.
- CLI: OpenClaw shells out to a local `gemini` binary; it has its own auth and can behave differently (streaming/tool support/version skew).
2026-01-10 21:40:59 +00:00
## Live: model matrix (what we cover)
There is no fixed “CI model list” (live is opt-in), but these are the **recommended** models to cover regularly on a dev machine with keys.
2026-01-10 21:40:59 +00:00
### Modern smoke set (tool calling + image)
This is the “common models” run we expect to keep working:
2026-01-31 21:13:13 +09:00
2026-01-10 21:40:59 +00:00
- OpenAI (non-Codex): `openai/gpt-5.2` (optional: `openai/gpt-5.1`)
- OpenAI Codex: `openai-codex/gpt-5.4`
- Anthropic: `anthropic/claude-opus-4-6` (or `anthropic/claude-sonnet-4-5`)
2026-03-08 02:32:49 +00:00
- Google (Gemini API): `google/gemini-3.1-pro-preview` and `google/gemini-3-flash-preview` (avoid older Gemini 2.x models)
- Google (Antigravity): `google-antigravity/claude-opus-4-6-thinking` and `google-antigravity/gemini-3-flash`
2026-01-10 21:40:59 +00:00
- Z.AI (GLM): `zai/glm-4.7`
- MiniMax: `minimax/minimax-m2.5`
2026-01-10 21:40:59 +00:00
Run gateway smoke with tools + image:
2026-03-08 02:32:49 +00:00
`OPENCLAW_LIVE_GATEWAY_MODELS="openai/gpt-5.2,openai-codex/gpt-5.4,anthropic/claude-opus-4-6,google/gemini-3.1-pro-preview,google/gemini-3-flash-preview,google-antigravity/claude-opus-4-6-thinking,google-antigravity/gemini-3-flash,zai/glm-4.7,minimax/minimax-m2.5" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`
2026-01-10 21:40:59 +00:00
### Baseline: tool calling (Read + optional Exec)
Pick at least one per provider family:
2026-01-31 21:13:13 +09:00
- OpenAI: `openai/gpt-5.2` (or `openai/gpt-5-mini`)
- Anthropic: `anthropic/claude-opus-4-6` (or `anthropic/claude-sonnet-4-5`)
2026-03-08 02:32:49 +00:00
- Google: `google/gemini-3-flash-preview` (or `google/gemini-3.1-pro-preview`)
- Z.AI (GLM): `zai/glm-4.7`
- MiniMax: `minimax/minimax-m2.5`
Optional additional coverage (nice to have):
2026-01-31 21:13:13 +09:00
- xAI: `xai/grok-4` (or latest available)
- Mistral: `mistral/`… (pick one “tools” capable model you have enabled)
- Cerebras: `cerebras/`… (if you have access)
- LM Studio: `lmstudio/`… (local; tool calling depends on API mode)
### Vision: image send (attachment → multimodal message)
2026-01-30 03:15:10 +01:00
Include at least one image-capable model in `OPENCLAW_LIVE_GATEWAY_MODELS` (Claude/Gemini/OpenAI vision-capable variants, etc.) to exercise the image probe.
### Aggregators / alternate gateways
If you have keys enabled, we also support testing via:
2026-01-31 21:13:13 +09:00
2026-01-30 03:15:10 +01:00
- OpenRouter: `openrouter/...` (hundreds of models; use `openclaw models scan` to find tool+image capable candidates)
- OpenCode: `opencode/...` for Zen and `opencode-go/...` for Go (auth via `OPENCODE_API_KEY` / `OPENCODE_ZEN_API_KEY`)
2026-01-11 04:46:30 +00:00
More providers you can include in the live matrix (if you have creds/config):
2026-01-31 21:13:13 +09:00
- Built-in: `openai`, `openai-codex`, `anthropic`, `google`, `google-vertex`, `google-antigravity`, `google-gemini-cli`, `zai`, `openrouter`, `opencode`, `opencode-go`, `xai`, `groq`, `cerebras`, `mistral`, `github-copilot`
2026-01-11 04:46:30 +00:00
- Via `models.providers` (custom endpoints): `minimax` (cloud/API), plus any OpenAI/Anthropic-compatible proxy (LM Studio, vLLM, LiteLLM, etc.)
Tip: dont try to hardcode “all models” in docs. The authoritative list is whatever `discoverModels(...)` returns on your machine + whatever keys are available.
2026-01-10 01:15:42 +00:00
## Credentials (never commit)
2026-01-10 19:53:24 +00:00
Live tests discover credentials the same way the CLI does. Practical implications:
2026-01-31 21:13:13 +09:00
2026-01-10 19:53:24 +00:00
- If the CLI works, live tests should find the same keys.
2026-01-30 03:15:10 +01:00
- If a live test says “no creds”, debug the same way youd debug `openclaw models list` / model selection.
2026-01-10 01:15:42 +00:00
2026-01-30 03:15:10 +01:00
- Profile store: `~/.openclaw/credentials/` (preferred; what “profile keys” means in the tests)
- Config: `~/.openclaw/openclaw.json` (or `OPENCLAW_CONFIG_PATH`)
2026-01-10 01:15:42 +00:00
If you want to rely on env keys (e.g. exported in your `~/.profile`), run local tests after `source ~/.profile`, or use the Docker runners below (they can mount `~/.profile` into the container).
## Deepgram live (audio transcription)
- Test: `src/media-understanding/providers/deepgram/audio.live.test.ts`
- Enable: `DEEPGRAM_API_KEY=... DEEPGRAM_LIVE_TEST=1 pnpm test:live src/media-understanding/providers/deepgram/audio.live.test.ts`
## BytePlus coding plan live
- Test: `src/agents/byteplus.live.test.ts`
- Enable: `BYTEPLUS_API_KEY=... BYTEPLUS_LIVE_TEST=1 pnpm test:live src/agents/byteplus.live.test.ts`
- Optional model override: `BYTEPLUS_CODING_MODEL=ark-code-latest`
2026-01-10 01:15:42 +00:00
## Docker runners (optional “works in Linux” checks)
These run `pnpm test:live` inside the repo Docker image, mounting your local config dir and workspace (and sourcing `~/.profile` if mounted):
2026-01-10 01:15:42 +00:00
- Direct models: `pnpm test:docker:live-models` (script: `scripts/test-live-models-docker.sh`)
- Gateway + dev agent: `pnpm test:docker:live-gateway` (script: `scripts/test-live-gateway-models-docker.sh`)
- Onboarding wizard (TTY, full scaffolding): `pnpm test:docker:onboard` (script: `scripts/e2e/onboard-docker.sh`)
- Gateway networking (two containers, WS auth + health): `pnpm test:docker:gateway-network` (script: `scripts/e2e/gateway-network-docker.sh`)
2026-01-11 12:21:45 +00:00
- Plugins (custom extension load + registry smoke): `pnpm test:docker:plugins` (script: `scripts/e2e/plugins-docker.sh`)
2026-01-10 01:15:42 +00:00
The live-model Docker runners also bind-mount the current checkout read-only and
stage it into a temporary workdir inside the container. This keeps the runtime
image slim while still running Vitest against your exact local source/config.
feat: ACP thread-bound agents (#23580) * docs: add ACP thread-bound agents plan doc * docs: expand ACP implementation specification * feat(acp): route ACP sessions through core dispatch and lifecycle cleanup * feat(acp): add /acp commands and Discord spawn gate * ACP: add acpx runtime plugin backend * fix(subagents): defer transient lifecycle errors before announce * Agents: harden ACP sessions_spawn and tighten spawn guidance * Agents: require explicit ACP target for runtime spawns * docs: expand ACP control-plane implementation plan * ACP: harden metadata seeding and spawn guidance * ACP: centralize runtime control-plane manager and fail-closed dispatch * ACP: harden runtime manager and unify spawn helpers * Commands: route ACP sessions through ACP runtime in agent command * ACP: require persisted metadata for runtime spawns * Sessions: preserve ACP metadata when updating entries * Plugins: harden ACP backend registry across loaders * ACPX: make availability probe compatible with adapters * E2E: add manual Discord ACP plain-language smoke script * ACPX: preserve streamed spacing across Discord delivery * Docs: add ACP Discord streaming strategy * ACP: harden Discord stream buffering for thread replies * ACP: reuse shared block reply pipeline for projector * ACP: unify streaming config and adopt coalesceIdleMs * Docs: add temporary ACP production hardening plan * Docs: trim temporary ACP hardening plan goals * Docs: gate ACP thread controls by backend capabilities * ACP: add capability-gated runtime controls and /acp operator commands * Docs: remove temporary ACP hardening plan * ACP: fix spawn target validation and close cache cleanup * ACP: harden runtime dispatch and recovery paths * ACP: split ACP command/runtime internals and centralize policy * ACP: harden runtime lifecycle, validation, and observability * ACP: surface runtime and backend session IDs in thread bindings * docs: add temp plan for binding-service migration * ACP: migrate thread binding flows to SessionBindingService * ACP: address review feedback and preserve prompt wording * ACPX plugin: pin runtime dependency and prefer bundled CLI * Discord: complete binding-service migration cleanup and restore ACP plan * Docs: add standalone ACP agents guide * ACP: route harness intents to thread-bound ACP sessions * ACP: fix spawn thread routing and queue-owner stall * ACP: harden startup reconciliation and command bypass handling * ACP: fix dispatch bypass type narrowing * ACP: align runtime metadata to agentSessionId * ACP: normalize session identifier handling and labels * ACP: mark thread banner session ids provisional until first reply * ACP: stabilize session identity mapping and startup reconciliation * ACP: add resolved session-id notices and cwd in thread intros * Discord: prefix thread meta notices consistently * Discord: unify ACP/thread meta notices with gear prefix * Discord: split thread persona naming from meta formatting * Extensions: bump acpx plugin dependency to 0.1.9 * Agents: gate ACP prompt guidance behind acp.enabled * Docs: remove temp experiment plan docs * Docs: scope streaming plan to holy grail refactor * Docs: refactor ACP agents guide for human-first flow * Docs/Skill: add ACP feature-flag guidance and direct acpx telephone-game flow * Docs/Skill: add OpenCode and Pi to ACP harness lists * Docs/Skill: align ACP harness list with current acpx registry * Dev/Test: move ACP plain-language smoke script and mark as keep * Docs/Skill: reorder ACP harness lists with Pi first * ACP: split control-plane manager into core/types/utils modules * Docs: refresh ACP thread-bound agents plan * ACP: extract dispatch lane and split manager domains * ACP: centralize binding context and remove reverse deps * Infra: unify system message formatting * ACP: centralize error boundaries and session id rendering * ACP: enforce init concurrency cap and strict meta clear * Tests: fix ACP dispatch binding mock typing * Tests: fix Discord thread-binding mock drift and ACP request id * ACP: gate slash bypass and persist cleared overrides * ACPX: await pre-abort cancel before runTurn return * Extension: pin acpx runtime dependency to 0.1.11 * Docs: add pinned acpx install strategy for ACP extension * Extensions/acpx: enforce strict local pinned startup * Extensions/acpx: tighten acp-router install guidance * ACPX: retry runtime test temp-dir cleanup * Extensions/acpx: require proactive ACPX repair for thread spawns * Extensions/acpx: require restart offer after acpx reinstall * extensions/acpx: remove workspace protocol devDependency * extensions/acpx: bump pinned acpx to 0.1.13 * extensions/acpx: sync lockfile after dependency bump * ACPX: make runtime spawn Windows-safe * fix: align doctor-config-flow repair tests with default-account migration (#23580) (thanks @osolmaz)
2026-02-26 11:00:09 +01:00
Manual ACP plain-language thread smoke (not CI):
- `bun scripts/dev/discord-acp-plain-language-smoke.ts --channel <discord-channel-id> ...`
- Keep this script for regression/debug workflows. It may be needed again for ACP thread routing validation, so do not delete it.
2026-01-10 01:15:42 +00:00
Useful env vars:
2026-01-30 03:15:10 +01:00
- `OPENCLAW_CONFIG_DIR=...` (default: `~/.openclaw`) mounted to `/home/node/.openclaw`
- `OPENCLAW_WORKSPACE_DIR=...` (default: `~/.openclaw/workspace`) mounted to `/home/node/.openclaw/workspace`
- `OPENCLAW_PROFILE_FILE=...` (default: `~/.profile`) mounted to `/home/node/.profile` and sourced before running tests
- `OPENCLAW_LIVE_GATEWAY_MODELS=...` / `OPENCLAW_LIVE_MODELS=...` to narrow the run
- `OPENCLAW_LIVE_REQUIRE_PROFILE_KEYS=1` to ensure creds come from the profile store (not env)
2026-01-10 01:15:42 +00:00
## Docs sanity
Run docs checks after doc edits: `pnpm docs:list`.
## Offline regression (CI-safe)
2026-01-10 19:53:24 +00:00
These are “real pipeline” regressions without real providers:
2026-01-31 21:13:13 +09:00
- Gateway tool calling (mock OpenAI, real gateway + agent loop): `src/gateway/gateway.test.ts` (case: "runs a mock OpenAI tool call end-to-end via gateway agent loop")
- Gateway wizard (WS `wizard.start`/`wizard.next`, writes config + auth enforced): `src/gateway/gateway.test.ts` (case: "runs wizard over ws and writes auth token config")
2026-01-10 19:53:24 +00:00
2026-01-19 12:23:13 -08:00
## Agent reliability evals (skills)
We already have a few CI-safe tests that behave like “agent reliability evals”:
2026-01-31 21:13:13 +09:00
- Mock tool-calling through the real gateway + agent loop (`src/gateway/gateway.test.ts`).
- End-to-end wizard flows that validate session wiring and config effects (`src/gateway/gateway.test.ts`).
2026-01-19 12:23:13 -08:00
Whats still missing for skills (see [Skills](/tools/skills)):
2026-01-31 21:13:13 +09:00
2026-01-19 12:23:13 -08:00
- **Decisioning:** when skills are listed in the prompt, does the agent pick the right skill (or avoid irrelevant ones)?
- **Compliance:** does the agent read `SKILL.md` before use and follow required steps/args?
- **Workflow contracts:** multi-turn scenarios that assert tool order, session history carryover, and sandbox boundaries.
Future evals should stay deterministic first:
2026-01-31 21:13:13 +09:00
2026-01-19 12:23:13 -08:00
- A scenario runner using mock providers to assert tool calls + order, skill file reads, and session wiring.
- A small suite of skill-focused scenarios (use vs avoid, gating, prompt injection).
- Optional live evals (opt-in, env-gated) only after the CI-safe suite is in place.
2026-01-10 19:53:24 +00:00
## Adding regressions (guidance)
When you fix a provider/model issue discovered in live:
2026-01-31 21:13:13 +09:00
2026-01-10 19:53:24 +00:00
- Add a CI-safe regression if possible (mock/stub provider, or capture the exact request-shape transformation)
- If its inherently live-only (rate limits, auth policies), keep the live test narrow and opt-in via env vars
- Prefer targeting the smallest layer that catches the bug:
- provider request conversion/replay bug → direct models test
- gateway session/history/tool pipeline bug → gateway live smoke or CI-safe gateway mock test
- SecretRef traversal guardrail:
- `src/secrets/exec-secret-ref-id-parity.test.ts` derives one sampled target per SecretRef class from registry metadata (`listSecretTargetRegistryEntries()`), then asserts traversal-segment exec ids are rejected.
- If you add a new `includeInPlan` SecretRef target family in `src/secrets/target-registry-data.ts`, update `classifyTargetClass` in that test. The test intentionally fails on unclassified target ids so new classes cannot be skipped silently.