- `extensions/qa-channel`: synthetic message channel with DM, channel, thread, reaction, edit, and delete surfaces.
- `extensions/qa-lab`: debugger UI and QA bus for observing the transcript, injecting inbound messages, and exporting a Markdown report.
- `qa/`: repo-backed seed assets for the kickoff task and baseline QA scenarios.
- Left: Gateway dashboard (Control UI) with the agent.
- Right: QA Lab, showing the Slack-ish transcript and scenario plan.
`qa:lab:up:fast` keeps the Docker services on a prebuilt image and bind-mounts
`extensions/qa-lab/web/dist` into the qa-lab container. `qa:lab:watch`
rebuilds that bundle on change, and the browser auto-reloads when the QA Lab
asset hash changes.
For a transport-real Matrix smoke lane, run:
The lane uses the Matrix transport in place of `qa-channel` in the child
config. It writes the structured report artifacts and a combined
stdout/stderr log into the selected Matrix QA output directory. To capture
the outer `scripts/run-node.mjs` build/launcher output too, set
`OPENCLAW_RUN_NODE_OUTPUT_LOG=<path>` to a repo-local log file.
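As a minimal sketch (the lane invocation itself is elided above, and the artifact path here is only an example, not a required location), the launcher log capture looks like:

```shell
# Illustrative only: point the launcher log at a repo-local file
# before starting the Matrix lane. The path is an example.
export OPENCLAW_RUN_NODE_OUTPUT_LOG=".artifacts/qa-e2e/matrix/run-node.log"
mkdir -p "$(dirname "$OPENCLAW_RUN_NODE_OUTPUT_LOG")"
```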
For a transport-real Telegram smoke lane, run:
The lane requires `OPENCLAW_QA_TELEGRAM_GROUP_ID`,
`OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN`, and
`OPENCLAW_QA_TELEGRAM_SUT_BOT_TOKEN`, plus two distinct bots in the same
private group. The SUT bot must have a Telegram username, and bot-to-bot
observation works best when both bots have Bot-to-Bot Communication Mode
enabled in @BotFather.
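A hedged sketch of staging those credentials (the values below are placeholders, not real IDs or tokens):

```shell
# Placeholder values only: substitute real IDs and tokens from @BotFather.
export OPENCLAW_QA_TELEGRAM_GROUP_ID="-1001234567890"         # private group shared by both bots
export OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN="111111:driver"  # bot that sends driver messages
export OPENCLAW_QA_TELEGRAM_SUT_BOT_TOKEN="222222:sut"        # bot under test; must have a username
```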
The command exits non-zero when any scenario fails. Use `--allow-failures` when
you want artifacts without a failing exit code.
The Telegram report and summary include per-reply RTT from the driver message
send request to the observed SUT reply, starting with the canary.
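The per-reply RTT in the report reduces to simple timestamp arithmetic; a minimal sketch with made-up epoch-millisecond values:

```shell
# Hypothetical timestamps in epoch milliseconds.
send_ms=1700000000000     # driver message send request
reply_ms=1700000001250    # observed SUT reply
rtt_ms=$((reply_ms - send_ms))
echo "canary RTT: ${rtt_ms}ms"
```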
For a transport-real Discord smoke lane, run:
The lane requires `OPENCLAW_QA_DISCORD_GUILD_ID`, `OPENCLAW_QA_DISCORD_CHANNEL_ID`,
`OPENCLAW_QA_DISCORD_DRIVER_BOT_TOKEN`, `OPENCLAW_QA_DISCORD_SUT_BOT_TOKEN`,
and `OPENCLAW_QA_DISCORD_SUT_APPLICATION_ID` when using env credentials.
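A small pre-flight check, assuming bash, that the env credentials above are all set before starting the lane (the variable names come from the text; the check itself is illustrative):

```shell
# Demo placeholder so the check finds at least one credential set.
export OPENCLAW_QA_DISCORD_GUILD_ID="000000000000000000"

# Collect any required Discord lane variables that are still unset.
missing=()
for v in OPENCLAW_QA_DISCORD_GUILD_ID OPENCLAW_QA_DISCORD_CHANNEL_ID \
         OPENCLAW_QA_DISCORD_DRIVER_BOT_TOKEN OPENCLAW_QA_DISCORD_SUT_BOT_TOKEN \
         OPENCLAW_QA_DISCORD_SUT_APPLICATION_ID; do
  [ -n "${!v:-}" ] || missing+=("$v")
done
echo "missing ${#missing[@]} of 5 required vars: ${missing[*]}"
```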
The lane verifies channel mention handling and checks that the SUT bot has
registered the native `/help` command with Discord.
The command exits non-zero when any scenario fails. Use `--allow-failures` when
you want artifacts without a failing exit code.
Live transport lanes now share one smaller contract instead of each inventing
their own scenario list shape:
`qa-channel` remains the broad synthetic product-behavior suite and is not part
of the live transport coverage matrix.
| Lane | Canary | Mention gating | Allowlist block | Top-level reply | Restart resume | Thread follow-up | Thread isolation | Reaction observation | Help command | Native command registration |
|---|---|---|---|---|---|---|---|---|---|---|
| Matrix | x | x | x | x | x | x | x | x | | |
| Telegram | x | x | x | | | | | | | |
| Discord | x | x | x | | | | | | | x |
This keeps `qa-channel` as the broad product-behavior suite while Matrix,
Telegram, and future live transports share one explicit transport-contract
checklist.
For a disposable Linux VM lane without bringing Docker into the QA path, run:
The lane runs the `qa` suite inside the guest, then copies the normal QA report and
summary back into `.artifacts/qa-e2e/...` on the host.
It reuses the same scenario-selection behavior as the `qa` suite on the host.
Host and Multipass suite runs execute multiple selected scenarios in parallel
with isolated gateway workers by default. `qa-channel` defaults to concurrency
4, capped by the selected scenario count. Use `--concurrency <count>` to tune
the worker count, or `--concurrency 1` for serial execution.
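Assuming the capping rule is a plain minimum of the default and the selected scenario count, the effective worker count can be sketched as:

```shell
# Hedged sketch of the capping rule: concurrency defaults to 4,
# capped by the number of selected scenarios.
default_concurrency=4
selected_scenarios=2
effective=$(( selected_scenarios < default_concurrency ? selected_scenarios : default_concurrency ))
echo "workers: $effective"
```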
The command exits non-zero when any scenario fails. Use `--allow-failures` when
you want artifacts without a failing exit code.
Live runs forward the supported QA auth inputs that are practical for the
guest: env-based provider keys, the QA live provider config path, and
`CODEX_HOME` when present. Keep `--output-dir` under the repo root so the guest
can write back through the mounted workspace.
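A hedged check that an output directory stays under the repo root so the guest can write back through the mounted workspace (both paths below are illustrative):

```shell
repo_root="/home/user/openclaw"               # illustrative repo root
output_dir="$repo_root/.artifacts/qa-e2e"     # illustrative --output-dir value
case "$output_dir" in
  "$repo_root"/*) inside_repo=true ;;         # guest can write back via the mount
  *)              inside_repo=false ;;        # would land outside the shared workspace
esac
echo "inside repo: $inside_repo"
```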
Repo-backed seeds
Seed assets live in `qa/`:
- `qa/scenarios/index.md`
- `qa/scenarios/<theme>/*.md`
`qa-lab` should stay a generic markdown runner. Each scenario markdown file is
the source of truth for one test run and should define:
- scenario metadata
- optional category, capability, lane, and risk metadata
- docs and code refs
- optional plugin requirements
- optional gateway config patch
- the executable
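A hypothetical scenario file under those headings; the frontmatter keys below are assumptions for illustration, not the real schema:

```shell
# Hypothetical scenario file; frontmatter keys mirror the list above,
# not a confirmed schema.
cat > /tmp/qa-scenario-example.md <<'EOF'
---
id: dm-basic-chat            # scenario metadata: stable ID, kept when files move
category: chat               # optional category/capability/lane/risk metadata
docsRefs: [docs/chat.md]     # docs and code refs
pluginRequirements: []       # optional plugin requirements
gatewayConfigPatch: {}       # optional gateway config patch
---
Send a DM greeting and expect a reply within the canary budget.
EOF
```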
qa-flow
`qa-flow` is allowed to stay generic
and cross-cutting. For example, markdown scenarios can combine transport-side
helpers with browser-side helpers that drive the embedded Control UI through the
Gateway `browser.request` seam without adding a special-case runner.
Scenario files should be grouped by product capability rather than source tree
folder. Keep scenario IDs stable when files move; use `docsRefs` and `codeRefs`
for implementation traceability.
The baseline list should stay broad enough to cover:
- DM and channel chat
- thread behavior
- message action lifecycle
- cron callbacks
- memory recall
- model switching
- subagent handoff
- repo-reading and docs-reading
- one small build task such as Lobster Invaders
Provider mock lanes
The `qa` suite has two local provider mock lanes:
- `mock-openai` is the scenario-aware OpenClaw mock. It remains the default deterministic mock lane for repo-backed QA and parity gates.
- `aimock` starts an AIMock-backed provider server for experimental protocol, fixture, record/replay, and chaos coverage. It is additive and does not replace the `mock-openai` scenario dispatcher.
Provider implementations live under `extensions/qa-lab/src/providers/`.
Each provider owns its defaults, local server startup, gateway model config,
auth-profile staging needs, and live/mock capability flags. Shared suite and
gateway code should route through the provider registry instead of branching on
provider names.
Transport adapters
`qa-lab` owns a generic transport seam for markdown QA scenarios.
`qa-channel` is the first adapter on that seam, but the design target is wider:
future real or synthetic channels should plug into the same suite runner
instead of adding a transport-specific QA runner.
At the architecture level, the split is:
- `qa-lab` owns generic scenario execution, worker concurrency, artifact writing, and reporting.
- the transport adapter owns gateway config, readiness, inbound and outbound observation, transport actions, and normalized transport state.
- markdown scenario files under `qa/scenarios/` define the test run; `qa-lab` provides the reusable runtime surface that executes them.
Reporting
`qa-lab` exports a Markdown protocol report from the observed bus timeline.
The report should answer:
- What worked
- What failed
- What stayed blocked
- What follow-up scenarios are worth adding
The character eval seeds each candidate with `SOUL.md`, then runs ordinary user
turns such as chat, workspace help, and small file tasks. The candidate model
should not be told that it is being evaluated. The command preserves each full
transcript, records basic run stats, then asks the judge models, in fast mode
with `xhigh` reasoning where supported, to rank the runs by naturalness, vibe,
and humor.
Use `--blind-judge-models` when comparing providers: the judge prompt still gets
every transcript and run status, but candidate refs are replaced with neutral
labels such as `candidate-01`; the report maps rankings back to real refs after
parsing.
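The neutral-label substitution comes down to an index mapping kept in both directions; a bash sketch (the label format is from the text, the mapping code is illustrative, and the refs are examples from this document):

```shell
# Map real candidate refs to neutral labels like candidate-01, keeping the
# reverse mapping so rankings can be translated back after parsing.
refs=("openai/gpt-5.4" "anthropic/claude-opus-4-6")
declare -A label_to_ref
for i in "${!refs[@]}"; do
  label=$(printf 'candidate-%02d' $((i + 1)))
  label_to_ref[$label]="${refs[$i]}"
done
echo "candidate-01 -> ${label_to_ref[candidate-01]}"
```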
Candidate runs default to `high` thinking, with `medium` for GPT-5.4 and `xhigh`
for older OpenAI eval refs that support it. Override a specific candidate inline with
`--model provider/model,thinking=<level>`. `--thinking <level>` still sets a
global fallback, and the older `--model-thinking <provider/model=level>` form is
kept for compatibility.
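Splitting the inline `provider/model,thinking=<level>` form is a one-comma parse; a minimal bash sketch (not the real parser):

```shell
# Illustrative parse of the inline spec syntax described above.
spec="openai/gpt-5.4,thinking=medium"
model="${spec%%,*}"            # everything before the first comma
thinking="${spec#*thinking=}"  # value after "thinking="
echo "$model -> thinking=$thinking"
```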
OpenAI candidate refs default to fast mode so priority processing is used where
the provider supports it. Add `,fast`, `,no-fast`, or `,fast=false` inline when a
single candidate or judge needs an override. Pass `--fast` only when you want to
force fast mode on for every candidate model. Candidate and judge durations are
recorded in the report for benchmark analysis, but judge prompts explicitly say
not to rank by speed.
Candidate and judge model runs both default to concurrency 16. Lower
`--concurrency` or `--judge-concurrency` when provider limits or local gateway
pressure make a run too noisy.
When no candidate `--model` is passed, the character eval defaults to
`openai/gpt-5.4`, `openai/gpt-5.2`, `openai/gpt-5`, `anthropic/claude-opus-4-6`,
`anthropic/claude-sonnet-4-6`, `zai/glm-5.1`, `moonshot/kimi-k2.5`, and
`google/gemini-3.1-pro-preview`.
When no `--judge-model` is passed, the judges default to
`openai/gpt-5.4,thinking=xhigh,fast` and
`anthropic/claude-opus-4-6,thinking=high`.