OpenClaw has three Vitest suites (unit/integration, e2e, live) and a small set of Docker runners. This doc is a “how we test” guide:
- What each suite covers (and what it deliberately does not cover).
- Which commands to run for common workflows (local, pre-push, debugging).
- How live tests discover credentials and select models/providers.
- How to add regressions for real-world model/provider issues.
The QA stack (qa-lab, qa-channel, live transport lanes) is documented separately:
- QA overview: architecture, command surface, scenario authoring.
- Matrix QA: reference for `pnpm openclaw qa matrix`.
- QA channel: the synthetic transport plugin used by repo-backed scenarios.
qa invocations and points back at the references above.
Quick start
Most days:
- Full gate (expected before push): `pnpm build && pnpm check && pnpm check:test-types && pnpm test`
- Faster local full-suite run on a roomy machine: `pnpm test:max`
- Direct Vitest watch loop: `pnpm test:watch`
- Direct file targeting now routes extension/channel paths too: `pnpm test extensions/discord/src/monitor/message-handler.preflight.test.ts`
- Prefer targeted runs first when you are iterating on a single failure.
- Docker-backed QA site: `pnpm qa:lab:up`
- Linux VM-backed QA lane: `pnpm openclaw qa suite --runner multipass --scenario channel-chat-baseline`
- Coverage gate: `pnpm test:coverage`
- E2E suite: `pnpm test:e2e`
- Live suite (models + gateway tool/image probes): `pnpm test:live`
- Target one live file quietly: `pnpm test:live -- src/agents/models.profiles.live.test.ts`
- Docker live model sweep: `pnpm test:docker:live-models`
  - Each selected model now runs a text turn plus a small file-read-style probe. Models whose metadata advertises `image` input also run a tiny image turn. Disable the extra probes with `OPENCLAW_LIVE_MODEL_FILE_PROBE=0` or `OPENCLAW_LIVE_MODEL_IMAGE_PROBE=0` when isolating provider failures.
  - CI coverage: daily `OpenClaw Scheduled Live And E2E Checks` and manual `OpenClaw Release Checks` both call the reusable live/E2E workflow with `include_live_suites: true`, which includes separate Docker live model matrix jobs sharded by provider.
  - For focused CI reruns, dispatch `OpenClaw Live And E2E Checks (Reusable)` with `include_live_suites: true` and `live_models_only: true`.
  - Add new high-signal provider secrets to `scripts/ci-hydrate-live-auth.sh` plus `.github/workflows/openclaw-live-and-e2e-checks-reusable.yml` and its scheduled/release callers.
- Native Codex bound-chat smoke: `pnpm test:docker:live-codex-bind`
  - Runs a Docker live lane against the Codex app-server path, binds a synthetic Slack DM with `/codex bind`, exercises `/codex fast` and `/codex permissions`, then verifies a plain reply and an image attachment route through the native plugin binding instead of ACP.
- Codex app-server harness smoke: `pnpm test:docker:live-codex-harness`
  - Runs gateway agent turns through the plugin-owned Codex app-server harness, verifies `/codex status` and `/codex models`, and by default exercises image, cron MCP, sub-agent, and Guardian probes. Disable the sub-agent probe with `OPENCLAW_LIVE_CODEX_HARNESS_SUBAGENT_PROBE=0` when isolating other Codex app-server failures. For a focused sub-agent check, disable the other probes: `OPENCLAW_LIVE_CODEX_HARNESS_IMAGE_PROBE=0 OPENCLAW_LIVE_CODEX_HARNESS_MCP_PROBE=0 OPENCLAW_LIVE_CODEX_HARNESS_GUARDIAN_PROBE=0 OPENCLAW_LIVE_CODEX_HARNESS_SUBAGENT_PROBE=1 pnpm test:docker:live-codex-harness`. This exits after the sub-agent probe unless `OPENCLAW_LIVE_CODEX_HARNESS_SUBAGENT_ONLY=0` is set.
- Crestodian rescue command smoke: `pnpm test:live:crestodian-rescue-channel`
  - Opt-in belt-and-suspenders check for the message-channel rescue command surface. It exercises `/crestodian status`, queues a persistent model change, replies `/crestodian yes`, and verifies the audit/config write path.
- Crestodian planner Docker smoke: `pnpm test:docker:crestodian-planner`
  - Runs Crestodian in a configless container with a fake Claude CLI on `PATH` and verifies the fuzzy planner fallback translates into an audited typed config write.
- Crestodian first-run Docker smoke: `pnpm test:docker:crestodian-first-run`
  - Starts from an empty OpenClaw state dir, routes bare `openclaw` to Crestodian, applies setup/model/agent/Discord plugin + SecretRef writes, validates config, and verifies audit entries. The same Ring 0 setup path is also covered in QA Lab by `pnpm openclaw qa suite --scenario crestodian-ring-zero-setup`.
- Moonshot/Kimi cost smoke: with `MOONSHOT_API_KEY` set, run `openclaw models list --provider moonshot --json`, then run an isolated `openclaw agent --local --session-id live-kimi-cost --message 'Reply exactly: KIMI_LIVE_OK' --thinking off --json` against `moonshot/kimi-k2.6`. Verify the JSON reports Moonshot/K2.6 and the assistant transcript stores normalized `usage.cost`.
QA-specific runners
These commands sit beside the main test suites when you need QA-lab realism:
CI runs QA Lab in dedicated workflows. `Parity gate` runs on matching PRs and from manual dispatch with mock providers. `QA-Lab - All Lanes` runs nightly on `main` and from manual dispatch with the mock parity gate, live Matrix lane, Convex-managed live Telegram lane, and Convex-managed live Discord lane as parallel jobs. Scheduled QA and release checks pass Matrix `--profile fast` explicitly, while the Matrix CLI and manual workflow input default remain `all`; manual dispatch can shard `all` into `transport`, `media`, `e2ee-smoke`, `e2ee-deep`, and `e2ee-cli` jobs. `OpenClaw Release Checks` runs parity plus the fast Matrix and Telegram lanes before release approval.
- `pnpm openclaw qa suite`
  - Runs repo-backed QA scenarios directly on the host.
  - Runs multiple selected scenarios in parallel by default with isolated gateway workers. `qa-channel` defaults to concurrency 4 (bounded by the selected scenario count). Use `--concurrency <count>` to tune the worker count, or `--concurrency 1` for the older serial lane.
  - Exits non-zero when any scenario fails. Use `--allow-failures` when you want artifacts without a failing exit code.
  - Supports provider modes `live-frontier`, `mock-openai`, and `aimock`. `aimock` starts a local AIMock-backed provider server for experimental fixture and protocol-mock coverage without replacing the scenario-aware `mock-openai` lane.
- `pnpm test:gateway:cpu-scenarios`
  - Runs the gateway startup bench plus a small mock QA Lab scenario pack (`channel-chat-baseline`, `memory-failure-fallback`, `gateway-restart-inflight-run`) and writes a combined CPU observation summary under `.artifacts/gateway-cpu-scenarios/`.
  - Flags only sustained hot CPU observations by default (`--cpu-core-warn` plus `--hot-wall-warn-ms`), so short startup bursts are recorded as metrics without looking like the minutes-long gateway peg regression.
  - Uses built `dist` artifacts; run a build first when the checkout does not already have fresh runtime output.
- `pnpm openclaw qa suite --runner multipass`
  - Runs the same QA suite inside a disposable Multipass Linux VM.
  - Keeps the same scenario-selection behavior as `qa suite` on the host.
  - Reuses the same provider/model selection flags as `qa suite`.
  - Live runs forward the supported QA auth inputs that are practical for the guest: env-based provider keys, the QA live provider config path, and `CODEX_HOME` when present.
  - Output dirs must stay under the repo root so the guest can write back through the mounted workspace.
  - Writes the normal QA report + summary plus Multipass logs under `.artifacts/qa-e2e/...`.
- `pnpm qa:lab:up`
  - Starts the Docker-backed QA site for operator-style QA work.
- `pnpm test:docker:npm-onboard-channel-agent`
  - Builds an npm tarball from the current checkout, installs it globally in Docker, runs non-interactive OpenAI API-key onboarding, configures Telegram by default, verifies enabling the plugin installs runtime dependencies on demand, runs doctor, and runs one local agent turn against a mocked OpenAI endpoint.
  - Use `OPENCLAW_NPM_ONBOARD_CHANNEL=discord` to run the same packaged-install lane with Discord.
- `pnpm test:docker:session-runtime-context`
  - Runs a deterministic built-app Docker smoke for embedded runtime context transcripts. It verifies hidden OpenClaw runtime context is persisted as a non-display custom message instead of leaking into the visible user turn, then seeds an affected broken session JSONL and verifies `openclaw doctor --fix` rewrites it to the active branch with a backup.
- `pnpm test:docker:npm-telegram-live`
  - Installs an OpenClaw package candidate in Docker, runs installed-package onboarding, configures Telegram through the installed CLI, then reuses the live Telegram QA lane with that installed package as the SUT Gateway.
  - Defaults to `OPENCLAW_NPM_TELEGRAM_PACKAGE_SPEC=openclaw@beta`; set `OPENCLAW_NPM_TELEGRAM_PACKAGE_TGZ=/path/to/openclaw-current.tgz` or `OPENCLAW_CURRENT_PACKAGE_TGZ` to test a resolved local tarball instead of installing from the registry.
  - Uses the same Telegram env credentials or Convex credential source as `pnpm openclaw qa telegram`. For CI/release automation, set `OPENCLAW_NPM_TELEGRAM_CREDENTIAL_SOURCE=convex` plus `OPENCLAW_QA_CONVEX_SITE_URL` and the role secret. If `OPENCLAW_QA_CONVEX_SITE_URL` and a Convex role secret are present in CI, the Docker wrapper selects Convex automatically.
  - `OPENCLAW_NPM_TELEGRAM_CREDENTIAL_ROLE=ci|maintainer` overrides the shared `OPENCLAW_QA_CREDENTIAL_ROLE` for this lane only.
  - GitHub Actions exposes this lane as the manual maintainer workflow `NPM Telegram Beta E2E`. It does not run on merge. The workflow uses the `qa-live-shared` environment and Convex CI credential leases.
- GitHub Actions also exposes `Package Acceptance` for side-run product proof against one candidate package. It accepts a trusted ref, published npm spec, HTTPS tarball URL plus SHA-256, or tarball artifact from another run, uploads the normalized `openclaw-current.tgz` as `package-under-test`, then runs the existing Docker E2E scheduler with smoke, package, product, full, or custom lane profiles. Set `telegram_mode=mock-openai` or `live-frontier` to run the Telegram QA workflow against the same `package-under-test` artifact.
  - Latest beta product proof:
  - Exact tarball URL proof requires a digest:
  - Artifact proof downloads a tarball artifact from another Actions run:
- `pnpm test:docker:bundled-channel-deps`
  - Packs and installs the current OpenClaw build in Docker, starts the Gateway with OpenAI configured, then enables bundled channel/plugins via config edits.
  - Verifies setup discovery leaves unconfigured plugin runtime dependencies absent, the first configured Gateway or doctor run installs each bundled plugin’s runtime dependencies on demand, and a second restart does not reinstall dependencies that were already activated.
  - Also installs a known older npm baseline, enables Telegram before running `openclaw update --tag <candidate>`, and verifies the candidate’s post-update doctor repairs bundled channel runtime dependencies without a harness-side postinstall repair.
- `pnpm test:parallels:npm-update`
  - Runs the native packaged-install update smoke across Parallels guests. Each selected platform first installs the requested baseline package, then runs the installed `openclaw update` command in the same guest and verifies the installed version, update status, gateway readiness, and one local agent turn.
  - Use `--platform macos`, `--platform windows`, or `--platform linux` while iterating on one guest. Use `--json` for the summary artifact path and per-lane status.
  - The OpenAI lane uses `openai/gpt-5.5` for the live agent-turn proof by default. Pass `--model <provider/model>` or set `OPENCLAW_PARALLELS_OPENAI_MODEL` when deliberately validating another OpenAI model.
  - Wrap long local runs in a host timeout so Parallels transport stalls cannot consume the rest of the testing window:
  - The script writes nested lane logs under `/tmp/openclaw-parallels-npm-update.*`. Inspect `windows-update.log`, `macos-update.log`, or `linux-update.log` before assuming the outer wrapper is hung.
  - Windows update can spend 10 to 15 minutes in post-update doctor/runtime dependency repair on a cold guest; that is still healthy when the nested npm debug log is advancing.
  - Do not run this aggregate wrapper in parallel with individual Parallels macOS, Windows, or Linux smoke lanes. They share VM state and can collide on snapshot restore, package serving, or guest gateway state.
  - The post-update proof runs the normal bundled plugin surface because capability facades such as speech, image generation, and media understanding are loaded through bundled runtime APIs even when the agent turn itself only checks a simple text response.
- `pnpm openclaw qa aimock`
  - Starts only the local AIMock provider server for direct protocol smoke testing.
- `pnpm openclaw qa matrix`
  - Runs the Matrix live QA lane against a disposable Docker-backed Tuwunel homeserver. Source-checkout only: packaged installs do not ship `qa-lab`.
  - Full CLI, profile/scenario catalog, env vars, and artifact layout: Matrix QA.
- `pnpm openclaw qa telegram`
  - Runs the Telegram live QA lane against a real private group using the driver and SUT bot tokens from env.
  - Requires `OPENCLAW_QA_TELEGRAM_GROUP_ID`, `OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN`, and `OPENCLAW_QA_TELEGRAM_SUT_BOT_TOKEN`. The group id must be the numeric Telegram chat id.
  - Supports `--credential-source convex` for shared pooled credentials. Use env mode by default, or set `OPENCLAW_QA_CREDENTIAL_SOURCE=convex` to opt into pooled leases.
  - Exits non-zero when any scenario fails. Use `--allow-failures` when you want artifacts without a failing exit code.
  - Requires two distinct bots in the same private group, with the SUT bot exposing a Telegram username.
  - For stable bot-to-bot observation, enable Bot-to-Bot Communication Mode in `@BotFather` for both bots and ensure the driver bot can observe group bot traffic.
  - Writes a Telegram QA report, summary, and observed-messages artifact under `.artifacts/qa-e2e/...`. Replying scenarios include RTT from driver send request to observed SUT reply.
qa-channel is the broad synthetic suite and is not part of that matrix.
Shared Telegram credentials via Convex (v1)
When `--credential-source convex` (or `OPENCLAW_QA_CREDENTIAL_SOURCE=convex`) is enabled for `openclaw qa telegram`, QA lab acquires an exclusive lease from a Convex-backed pool, heartbeats that lease while the lane is running, and releases the lease on shutdown.
Reference Convex project scaffold:
qa/convex-credential-broker/
- `OPENCLAW_QA_CONVEX_SITE_URL` (for example `https://your-deployment.convex.site`)
- One secret for the selected role:
  - `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER` for `maintainer`
  - `OPENCLAW_QA_CONVEX_SECRET_CI` for `ci`
- Credential role selection:
  - CLI: `--credential-role maintainer|ci`
  - Env default: `OPENCLAW_QA_CREDENTIAL_ROLE` (defaults to `ci` in CI, `maintainer` otherwise)
- `OPENCLAW_QA_CREDENTIAL_LEASE_TTL_MS` (default `1200000`)
- `OPENCLAW_QA_CREDENTIAL_HEARTBEAT_INTERVAL_MS` (default `30000`)
- `OPENCLAW_QA_CREDENTIAL_ACQUIRE_TIMEOUT_MS` (default `90000`)
- `OPENCLAW_QA_CREDENTIAL_HTTP_TIMEOUT_MS` (default `15000`)
- `OPENCLAW_QA_CONVEX_ENDPOINT_PREFIX` (default `/qa-credentials/v1`)
- `OPENCLAW_QA_CREDENTIAL_OWNER_ID` (optional trace id)
- `OPENCLAW_QA_ALLOW_INSECURE_HTTP=1` allows loopback `http://` Convex URLs for local-only development.
`OPENCLAW_QA_CONVEX_SITE_URL` should use `https://` in normal operation.
Maintainer admin commands (pool add/remove/list) require `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER` specifically.
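The environment variables and defaults above can be read as a small settings resolver. The following is a minimal sketch, not the QA lab implementation: `resolveBrokerSettings`, `BrokerSettings`, and the `isCi` argument are hypothetical names, and the loopback check (only `localhost` and `127.0.0.1`) is an assumption about what "loopback" covers.

```typescript
// Hypothetical helper: names and shape are illustrative, not OpenClaw's code.
type Env = Record<string, string | undefined>;

interface BrokerSettings {
  siteUrl: string;
  role: "maintainer" | "ci";
  leaseTtlMs: number;
  heartbeatIntervalMs: number;
  acquireTimeoutMs: number;
  httpTimeoutMs: number;
  endpointPrefix: string;
}

// Parse a positive integer env value, falling back to the documented default.
function intOr(value: string | undefined, fallback: number): number {
  const parsed = Number(value);
  return isFinite(parsed) && Math.floor(parsed) === parsed && parsed > 0 ? parsed : fallback;
}

function resolveBrokerSettings(env: Env, isCi: boolean): BrokerSettings {
  const siteUrl = env["OPENCLAW_QA_CONVEX_SITE_URL"] ?? "";
  // Assumption: "loopback" means localhost or 127.0.0.1 only.
  const loopbackHttp = /^http:\/\/(localhost|127\.0\.0\.1)(:\d+)?(\/|$)/.test(siteUrl);
  const insecureAllowed = env["OPENCLAW_QA_ALLOW_INSECURE_HTTP"] === "1" && loopbackHttp;
  if (!/^https:\/\//.test(siteUrl) && !insecureAllowed) {
    throw new Error("OPENCLAW_QA_CONVEX_SITE_URL must use https://");
  }
  const roleEnv = env["OPENCLAW_QA_CREDENTIAL_ROLE"];
  // Documented default: ci in CI, maintainer otherwise.
  const role: BrokerSettings["role"] =
    roleEnv === "maintainer" || roleEnv === "ci" ? roleEnv : isCi ? "ci" : "maintainer";
  return {
    siteUrl,
    role,
    leaseTtlMs: intOr(env["OPENCLAW_QA_CREDENTIAL_LEASE_TTL_MS"], 1200000),
    heartbeatIntervalMs: intOr(env["OPENCLAW_QA_CREDENTIAL_HEARTBEAT_INTERVAL_MS"], 30000),
    acquireTimeoutMs: intOr(env["OPENCLAW_QA_CREDENTIAL_ACQUIRE_TIMEOUT_MS"], 90000),
    httpTimeoutMs: intOr(env["OPENCLAW_QA_CREDENTIAL_HTTP_TIMEOUT_MS"], 15000),
    endpointPrefix: env["OPENCLAW_QA_CONVEX_ENDPOINT_PREFIX"] ?? "/qa-credentials/v1",
  };
}
```

Passing the environment in as a plain object keeps the sketch free of Node-specific globals and makes the defaults easy to check in isolation.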
CLI helpers for maintainers:
Run `doctor` before live runs to check the Convex site URL, broker secrets, endpoint prefix, HTTP timeout, and admin/list reachability without printing secret values. Use `--json` for machine-readable output in scripts and CI utilities.
Default endpoint contract (`OPENCLAW_QA_CONVEX_SITE_URL` + `/qa-credentials/v1`):
- `POST /acquire`
  - Request: `{ kind, ownerId, actorRole, leaseTtlMs, heartbeatIntervalMs }`
  - Success: `{ status: "ok", credentialId, leaseToken, payload, leaseTtlMs?, heartbeatIntervalMs? }`
  - Exhausted/retryable: `{ status: "error", code: "POOL_EXHAUSTED" | "NO_CREDENTIAL_AVAILABLE", ... }`
- `POST /heartbeat`
  - Request: `{ kind, ownerId, actorRole, credentialId, leaseToken, leaseTtlMs }`
  - Success: `{ status: "ok" }` (or empty `2xx`)
- `POST /release`
  - Request: `{ kind, ownerId, actorRole, credentialId, leaseToken }`
  - Success: `{ status: "ok" }` (or empty `2xx`)
- `POST /admin/add` (maintainer secret only)
  - Request: `{ kind, actorId, payload, note?, status? }`
  - Success: `{ status: "ok", credential }`
- `POST /admin/remove` (maintainer secret only)
  - Request: `{ credentialId, actorId }`
  - Success: `{ status: "ok", changed, credential }`
  - Active lease guard: `{ status: "error", code: "LEASE_ACTIVE", ... }`
- `POST /admin/list` (maintainer secret only)
  - Request: `{ kind?, status?, includePayload?, limit? }`
  - Success: `{ status: "ok", credentials, count }`
- `{ groupId: string, driverToken: string, sutToken: string }`
- `groupId` must be a numeric Telegram chat id string.
- `admin/add` validates this shape for `kind: "telegram"` and rejects malformed payloads.
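The acquire response and Telegram payload shapes above can be sketched as TypeScript types with two small helpers. This is an illustration under stated assumptions, not the broker's code: the type and function names are hypothetical, field types the contract leaves unspecified (such as `credentialId`) are assumed to be strings, treating every non-retryable error code as a hard failure is an assumption, and "numeric chat id string" is assumed to mean an optional minus sign followed by digits.

```typescript
// Shapes transcribed from the endpoint contract; the TypeScript names are
// illustrative and unspecified field types are assumed to be strings.
interface AcquireSuccess {
  status: "ok";
  credentialId: string;
  leaseToken: string;
  payload: unknown;
  leaseTtlMs?: number;
  heartbeatIntervalMs?: number;
}

interface BrokerError {
  status: "error";
  code: string;
}

type AcquireResponse = AcquireSuccess | BrokerError;
type AcquireOutcome = "leased" | "retry" | "fail";

// The contract marks POOL_EXHAUSTED and NO_CREDENTIAL_AVAILABLE as retryable.
// Assumption: any other error code is treated as a hard failure.
function classifyAcquire(response: AcquireResponse): AcquireOutcome {
  if (response.status === "ok") return "leased";
  const retryable = response.code === "POOL_EXHAUSTED" || response.code === "NO_CREDENTIAL_AVAILABLE";
  return retryable ? "retry" : "fail";
}

interface TelegramPayload {
  groupId: string;
  driverToken: string;
  sutToken: string;
}

// Assumption: a numeric chat id string is an optional "-" followed by digits.
function isTelegramPayload(value: unknown): value is TelegramPayload {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const groupId = record["groupId"];
  const driverToken = record["driverToken"];
  const sutToken = record["sutToken"];
  return (
    typeof groupId === "string" &&
    /^-?\d+$/.test(groupId) &&
    typeof driverToken === "string" &&
    driverToken.length > 0 &&
    typeof sutToken === "string" &&
    sutToken.length > 0
  );
}
```

A caller could loop on `classifyAcquire` until the acquire timeout elapses, and run `isTelegramPayload` on the leased `payload` before starting the lane.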
Adding a channel to QA
The architecture and scenario-helper names for new channel adapters live in QA overview → Adding a channel. The minimum bar: implement the transport runner on the shared `qa-lab` host seam, declare `qaRunners` in the plugin manifest, mount as `openclaw qa <runner>`, and author scenarios under `qa/scenarios/`.
Test suites (what runs where)
Think of the suites as “increasing realism” (and increasing flakiness/cost):
Unit / integration (default)
- Command: `pnpm test`
- Config: untargeted runs use the `vitest.full-*.config.ts` shard set and may expand multi-project shards into per-project configs for parallel scheduling
- Files: core/unit inventories under `src/**/*.test.ts`, `packages/**/*.test.ts`, and `test/**/*.test.ts`; UI unit tests run in the dedicated `unit-ui` shard
- Scope:
  - Pure unit tests
  - In-process integration tests (gateway auth, routing, tooling, parsing, config)
  - Deterministic regressions for known bugs
- Expectations:
  - Runs in CI
  - No real keys required
  - Should be fast and stable
Projects, shards, and scoped lanes
- Untargeted `pnpm test` runs twelve smaller shard configs (`core-unit-fast`, `core-unit-src`, `core-unit-security`, `core-unit-ui`, `core-unit-support`, `core-support-boundary`, `core-contracts`, `core-bundled`, `core-runtime`, `agentic`, `auto-reply`, `extensions`) instead of one giant native root-project process. This cuts peak RSS on loaded machines and avoids auto-reply/extension work starving unrelated suites.
- `pnpm test --watch` still uses the native root `vitest.config.ts` project graph, because a multi-shard watch loop is not practical.
- `pnpm test`, `pnpm test:watch`, and `pnpm test:perf:imports` route explicit file/directory targets through scoped lanes first, so `pnpm test extensions/discord/src/monitor/message-handler.preflight.test.ts` avoids paying the full root project startup tax.
- `pnpm test:changed` expands changed git paths into cheap scoped lanes by default: direct test edits, sibling `*.test.ts` files, explicit source mappings, and local import-graph dependents. Config/setup/package edits do not broad-run tests unless you explicitly use `OPENCLAW_TEST_CHANGED_BROAD=1 pnpm test:changed`.
- `pnpm check:changed` is the normal smart local check gate for narrow work. It classifies the diff into core, core tests, extensions, extension tests, apps, docs, release metadata, live Docker tooling, and tooling, then runs the matching typecheck, lint, and guard commands. It does not run Vitest tests; call `pnpm test:changed` or explicit `pnpm test <target>` for test proof. Release metadata-only version bumps run targeted version/config/root-dependency checks, with a guard that rejects package changes outside the top-level version field.
- Live Docker ACP harness edits run focused checks: shell syntax for the live Docker auth scripts and a live Docker scheduler dry-run. `package.json` changes are included only when the diff is limited to `scripts["test:docker:live-*"]`; dependency, export, version, and other package-surface edits still use the broader guards.
- Import-light unit tests from agents, commands, plugins, auto-reply helpers, `plugin-sdk`, and similar pure utility areas route through the `unit-fast` lane, which skips `test/setup-openclaw-runtime.ts`; stateful/runtime-heavy files stay on the existing lanes.
- Selected `plugin-sdk` and `commands` helper source files also map changed-mode runs to explicit sibling tests in those light lanes, so helper edits avoid rerunning the full heavy suite for that directory.
- `auto-reply` has dedicated buckets for top-level core helpers, top-level `reply.*` integration tests, and the `src/auto-reply/reply/**` subtree. CI further splits the reply subtree into agent-runner, dispatch, and commands/state-routing shards so one import-heavy bucket does not own the full Node tail.
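Two of the `pnpm test:changed` routing rules described above (direct test edits and sibling `*.test.ts` files) can be sketched as follows. This is a simplified illustration, not the repo's routing script: `scopedTargetsFor` and `planChanged` are hypothetical names, explicit source mappings and import-graph dependents are not modeled, and the `broad` flag stands in for `OPENCLAW_TEST_CHANGED_BROAD=1`.

```typescript
// Hypothetical sketch of two routing rules; the real routing also uses
// explicit source mappings and import-graph dependents.
function scopedTargetsFor(changedPath: string): string[] {
  if (/\.test\.ts$/.test(changedPath)) {
    return [changedPath]; // direct test edit: run that file
  }
  if (/\.ts$/.test(changedPath)) {
    // Sibling test candidate; a real router would check that the file exists.
    return [changedPath.replace(/\.ts$/, ".test.ts")];
  }
  return []; // config/setup/package edits: no scoped target
}

function planChanged(changedPaths: string[], broad: boolean): string[] | "broad" {
  const targets: string[] = [];
  let sawUnscoped = false;
  for (const path of changedPaths) {
    const scoped = scopedTargetsFor(path);
    if (scoped.length === 0) sawUnscoped = true;
    for (const target of scoped) {
      if (targets.indexOf(target) === -1) targets.push(target);
    }
  }
  // Unscoped edits only widen the run when broad mode was requested.
  return broad && sawUnscoped ? "broad" : targets;
}
```

For example, `planChanged(["src/a/foo.ts", "package.json"], false)` yields only the sibling test candidate, while the same diff with `broad` set to `true` falls back to a broad run.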
Embedded runner coverage
- When you change message-tool discovery inputs or compaction runtime context, keep both levels of coverage.
- Add focused helper regressions for pure routing and normalization boundaries.
- Keep the embedded runner integration suites healthy: `src/agents/pi-embedded-runner/compact.hooks.test.ts`, `src/agents/pi-embedded-runner/run.overflow-compaction.test.ts`, and `src/agents/pi-embedded-runner/run.overflow-compaction.loop.test.ts`.
- Those suites verify that scoped ids and compaction behavior still flow through the real `run.ts` / `compact.ts` paths; helper-only tests are not a sufficient substitute for those integration paths.
Vitest pool and isolation defaults
- Base Vitest config defaults to `threads`.
- The shared Vitest config fixes `isolate: false` and uses the non-isolated runner across the root projects, e2e, and live configs.
- The root UI lane keeps its `jsdom` setup and optimizer, but runs on the shared non-isolated runner too.
- Each `pnpm test` shard inherits the same `threads` + `isolate: false` defaults from the shared Vitest config.
- `scripts/run-vitest.mjs` adds `--no-maglev` for Vitest child Node processes by default to reduce V8 compile churn during big local runs. Set `OPENCLAW_VITEST_ENABLE_MAGLEV=1` to compare against stock V8 behavior.
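As a point of reference, the `threads` + `isolate: false` combination described above maps onto two standard Vitest options. The fragment below is a minimal sketch of such a config, not a copy of the repo's shared config, which carries many more settings.

```typescript
// Illustrative fragment only; the repo's shared Vitest config has more settings.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    pool: "threads", // run test files in worker threads
    isolate: false, // do not create a fresh isolated environment per test file
  },
});
```

With isolation disabled, module state can be shared between test files in the same worker, which is why stateful tests need to clean up after themselves.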
Fast local iteration
- `pnpm changed:lanes` shows which architectural lanes a diff triggers.
- The pre-commit hook is formatting-only. It restages formatted files and does not run lint, typecheck, or tests.
- Run `pnpm check:changed` explicitly before handoff or push when you need the smart local check gate.
- `pnpm test:changed` routes through cheap scoped lanes by default. Use `OPENCLAW_TEST_CHANGED_BROAD=1 pnpm test:changed` only when the agent decides a harness, config, package, or contract edit really needs broader Vitest coverage.
- `pnpm test:max` and `pnpm test:changed:max` keep the same routing behavior, just with a higher worker cap.
- Local worker auto-scaling is intentionally conservative and backs off when the host load average is already high, so multiple concurrent Vitest runs do less damage by default.
- The base Vitest config marks the projects/config files as `forceRerunTriggers` so changed-mode reruns stay correct when test wiring changes.
- The config keeps `OPENCLAW_VITEST_FS_MODULE_CACHE` enabled on supported hosts; set `OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=/abs/path` if you want one explicit cache location for direct profiling.
Perf debugging
- `pnpm test:perf:imports` enables Vitest import-duration reporting plus import-breakdown output.
- `pnpm test:perf:imports:changed` scopes the same profiling view to files changed since `origin/main`.
- Shard timing data is written to `.artifacts/vitest-shard-timings.json`. Whole-config runs use the config path as the key; include-pattern CI shards append the shard name so filtered shards can be tracked separately.
- When one hot test still spends most of its time in startup imports, keep heavy dependencies behind a narrow local `*.runtime.ts` seam and mock that seam directly instead of deep-importing runtime helpers just to pass them through `vi.mock(...)`.
- `pnpm test:perf:changed:bench -- --ref <git-ref>` compares routed `test:changed` against the native root-project path for that committed diff and prints wall time plus macOS max RSS.
- `pnpm test:perf:changed:bench -- --worktree` benchmarks the current dirty tree by routing the changed file list through `scripts/test-projects.mjs` and the root Vitest config.
- `pnpm test:perf:profile:main` writes a main-thread CPU profile for Vitest/Vite startup and transform overhead.
- `pnpm test:perf:profile:runner` writes runner CPU+heap profiles for the unit suite with file parallelism disabled.
Stability (gateway)
- Command: `pnpm test:stability:gateway`
- Config: `vitest.gateway.config.ts`, forced to one worker
- Scope:
  - Starts a real loopback Gateway with diagnostics enabled by default
  - Drives synthetic gateway message, memory, and large-payload churn through the diagnostic event path
  - Queries `diagnostics.stability` over the Gateway WS RPC
  - Covers diagnostic stability bundle persistence helpers
  - Asserts the recorder remains bounded, synthetic RSS samples stay under the pressure budget, and per-session queue depths drain back to zero
- Expectations:
  - CI-safe and keyless
  - Narrow lane for stability-regression follow-up, not a substitute for the full Gateway suite
E2E (gateway smoke)
- Command: `pnpm test:e2e`
- Config: `vitest.e2e.config.ts`
- Files: `src/**/*.e2e.test.ts`, `test/**/*.e2e.test.ts`, and bundled-plugin E2E tests under `extensions/`
- Runtime defaults:
  - Uses Vitest `threads` with `isolate: false`, matching the rest of the repo.
  - Uses adaptive workers (CI: up to 2, local: 1 by default).
  - Runs in silent mode by default to reduce console I/O overhead.
- Useful overrides:
  - `OPENCLAW_E2E_WORKERS=<n>` to force worker count (capped at 16).
  - `OPENCLAW_E2E_VERBOSE=1` to re-enable verbose console output.
- Scope:
  - Multi-instance gateway end-to-end behavior
  - WebSocket/HTTP surfaces, node pairing, and heavier networking
- Expectations:
  - Runs in CI (when enabled in the pipeline)
  - No real keys required
  - More moving parts than unit tests (can be slower)
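The E2E worker defaults and the `OPENCLAW_E2E_WORKERS` override described above amount to a small rule. A minimal sketch, assuming a hypothetical `resolveE2eWorkers` helper and simplifying the adaptive "up to 2" CI behavior to a fixed 2:

```typescript
// Hypothetical helper; not the logic in vitest.e2e.config.ts.
type Env = Record<string, string | undefined>;

function resolveE2eWorkers(env: Env, isCi: boolean): number {
  const forced = Number(env["OPENCLAW_E2E_WORKERS"]);
  if (isFinite(forced) && Math.floor(forced) === forced && forced > 0) {
    return Math.min(forced, 16); // explicit override, capped at 16
  }
  // Assumption: the adaptive CI count ("up to 2") is simplified to exactly 2.
  return isCi ? 2 : 1;
}
```

So `OPENCLAW_E2E_WORKERS=64` would still run with 16 workers, and an unset or non-numeric value falls back to the CI/local default.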
E2E: OpenShell backend smoke
- Command: `pnpm test:e2e:openshell`
- File: `extensions/openshell/src/backend.e2e.test.ts`
- Scope:
  - Starts an isolated OpenShell gateway on the host via Docker
  - Creates a sandbox from a temporary local Dockerfile
  - Exercises OpenClaw’s OpenShell backend over real `sandbox ssh-config` + SSH exec
  - Verifies remote-canonical filesystem behavior through the sandbox fs bridge
- Expectations:
  - Opt-in only; not part of the default `pnpm test:e2e` run
  - Requires a local `openshell` CLI plus a working Docker daemon
  - Uses isolated `HOME` / `XDG_CONFIG_HOME`, then destroys the test gateway and sandbox
- Useful overrides:
  - `OPENCLAW_E2E_OPENSHELL=1` to enable the test when running the broader e2e suite manually
  - `OPENCLAW_E2E_OPENSHELL_COMMAND=/path/to/openshell` to point at a non-default CLI binary or wrapper script
Live (real providers + real models)
- Command: `pnpm test:live`
- Config: `vitest.live.config.ts`
- Files: `src/**/*.live.test.ts`, `test/**/*.live.test.ts`, and bundled-plugin live tests under `extensions/`
- Default: enabled by `pnpm test:live` (sets `OPENCLAW_LIVE_TEST=1`)
- Scope:
  - “Does this provider/model actually work today with real creds?”
  - Catch provider format changes, tool-calling quirks, auth issues, and rate limit behavior
- Expectations:
  - Not CI-stable by design (real networks, real provider policies, quotas, outages)
  - Costs money / uses rate limits
  - Prefer running narrowed subsets instead of “everything”
- Live runs source `~/.profile` to pick up missing API keys.
- By default, live runs still isolate `HOME` and copy config/auth material into a temp test home so unit fixtures cannot mutate your real `~/.openclaw`.
- Set `OPENCLAW_LIVE_USE_REAL_HOME=1` only when you intentionally need live tests to use your real home directory.
- `pnpm test:live` now defaults to a quieter mode: it keeps `[live] ...` progress output, but suppresses the extra `~/.profile` notice and mutes gateway bootstrap logs/Bonjour chatter. Set `OPENCLAW_LIVE_TEST_QUIET=0` if you want the full startup logs back.
- API key rotation (provider-specific): set `*_API_KEYS` with comma/semicolon format or `*_API_KEY_1`, `*_API_KEY_2` (for example `OPENAI_API_KEYS`, `ANTHROPIC_API_KEYS`, `GEMINI_API_KEYS`) or per-live override via `OPENCLAW_LIVE_*_KEY`; tests retry on rate limit responses.
- Progress/heartbeat output:
  - Live suites now emit progress lines to stderr so long provider calls are visibly active even when Vitest console capture is quiet.
  - `vitest.live.config.ts` disables Vitest console interception so provider/gateway progress lines stream immediately during live runs.
  - Tune direct-model heartbeats with `OPENCLAW_LIVE_HEARTBEAT_MS`.
  - Tune gateway/probe heartbeats with `OPENCLAW_LIVE_GATEWAY_HEARTBEAT_MS`.
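The API key rotation inputs listed above can be collected into one ordered key list. The sketch below is illustrative only: `collectApiKeys` and `keyForAttempt` are hypothetical names, the precedence (override first, then the `*_API_KEYS` list, then numbered keys) is an assumption, and so is reading `OPENCLAW_LIVE_*_KEY` as `OPENCLAW_LIVE_<PROVIDER>_KEY`.

```typescript
// Hypothetical helpers; precedence and variable naming are assumptions.
type Env = Record<string, string | undefined>;

function collectApiKeys(env: Env, provider: string): string[] {
  // Assumption: a per-live override replaces all rotated keys.
  const override = env["OPENCLAW_LIVE_" + provider + "_KEY"];
  if (override) return [override];

  const keys: string[] = [];
  // *_API_KEYS accepts comma- or semicolon-separated values.
  const list = env[provider + "_API_KEYS"];
  if (list) {
    for (const part of list.split(/[,;]/)) {
      const key = part.trim();
      if (key) keys.push(key);
    }
  }
  // *_API_KEY_1, *_API_KEY_2, ... until the first gap.
  for (let i = 1; ; i++) {
    const numbered = env[provider + "_API_KEY_" + i];
    if (!numbered) break;
    keys.push(numbered);
  }
  // Drop duplicates while keeping first-seen order.
  return keys.filter((key, index) => keys.indexOf(key) === index);
}

// Pick the key for a retry attempt by cycling through the list.
function keyForAttempt(keys: string[], attempt: number): string | undefined {
  return keys.length === 0 ? undefined : keys[attempt % keys.length];
}
```

A test that hits a rate limit response could then retry with `keyForAttempt(keys, attempt + 1)` to move on to the next key.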
Which suite should I run?
Use this decision table:
- Editing logic/tests: run `pnpm test` (and `pnpm test:coverage` if you changed a lot)
- Touching gateway networking / WS protocol / pairing: add `pnpm test:e2e`
- Debugging “my bot is down” / provider-specific failures / tool calling: run a narrowed `pnpm test:live`
Live (network-touching) tests
For the live model matrix, CLI backend smokes, ACP smokes, Codex app-server harness, and all media-provider live tests (Deepgram, BytePlus, ComfyUI, image, music, video, media harness), plus credential handling for live runs, see Testing - live suites.
Docker runners (optional “works in Linux” checks)
These Docker runners split into two buckets:
- Live-model runners: `test:docker:live-models` and `test:docker:live-gateway` run only their matching profile-key live file inside the repo Docker image (`src/agents/models.profiles.live.test.ts` and `src/gateway/gateway-models.profiles.live.test.ts`), mounting your local config dir and workspace (and sourcing `~/.profile` if mounted). The matching local entrypoints are `test:live:models-profiles` and `test:live:gateway-profiles`.
- Docker live runners default to a smaller smoke cap so a full Docker sweep stays practical: `test:docker:live-models` defaults to `OPENCLAW_LIVE_MAX_MODELS=12`, and `test:docker:live-gateway` defaults to `OPENCLAW_LIVE_GATEWAY_SMOKE=1`, `OPENCLAW_LIVE_GATEWAY_MAX_MODELS=8`, `OPENCLAW_LIVE_GATEWAY_STEP_TIMEOUT_MS=45000`, and `OPENCLAW_LIVE_GATEWAY_MODEL_TIMEOUT_MS=90000`. Override those env vars when you explicitly want the larger exhaustive scan.
- `test:docker:all` builds the live Docker image once via `test:docker:live-build`, packs OpenClaw once as an npm tarball through `scripts/package-openclaw-for-docker.mjs`, then builds/reuses two `scripts/e2e/Dockerfile` images. The bare image is only the Node/Git runner for install/update/plugin-dependency lanes; those lanes mount the prebuilt tarball. The functional image installs the same tarball into `/app` for built-app functionality lanes. Docker lane definitions live in `scripts/lib/docker-e2e-scenarios.mjs`; planner logic lives in `scripts/lib/docker-e2e-plan.mjs`; `scripts/test-docker-all.mjs` executes the selected plan. The aggregate uses a weighted local scheduler: `OPENCLAW_DOCKER_ALL_PARALLELISM` controls process slots, while resource caps keep heavy live, npm-install, and multi-service lanes from all starting at once. If a single lane is heavier than the active caps, the scheduler can still start it when the pool is empty and then keeps it running alone until capacity is available again. Defaults are 10 slots, `OPENCLAW_DOCKER_ALL_LIVE_LIMIT=9`, `OPENCLAW_DOCKER_ALL_NPM_LIMIT=10`, and `OPENCLAW_DOCKER_ALL_SERVICE_LIMIT=7`; tune `OPENCLAW_DOCKER_ALL_WEIGHT_LIMIT` or `OPENCLAW_DOCKER_ALL_DOCKER_LIMIT` only when the Docker host has more headroom. The runner performs a Docker preflight by default, removes stale OpenClaw E2E containers, prints status every 30 seconds, stores successful lane timings in `.artifacts/docker-tests/lane-timings.json`, and uses those timings to start longer lanes first on later runs. Use `OPENCLAW_DOCKER_ALL_DRY_RUN=1` to print the weighted lane manifest without building or running Docker, or `node scripts/test-docker-all.mjs --plan-json` to print the CI plan for selected lanes, package/image needs, and credentials.
- `Package Acceptance` is the GitHub-native package gate for “does this installable tarball work as a product?” It resolves one candidate package from `source=npm`, `source=ref`, `source=url`, or `source=artifact`, uploads it as `package-under-test`, then runs the reusable Docker E2E lanes against that exact tarball instead of repacking the selected ref. `workflow_ref` selects the trusted workflow/harness scripts, while `package_ref` selects the source commit/branch/tag to pack when `source=ref`; this lets current acceptance logic validate older trusted commits. Profiles are ordered by breadth: `smoke` is quick install/channel/agent plus gateway/config, `package` is the package/update/plugin contract and the default native replacement for most Parallels package/update coverage, `product` adds MCP channels, cron/subagent cleanup, OpenAI web search, and OpenWebUI, and `full` runs the release-path Docker chunks with OpenWebUI. Release validation runs a custom package delta (`bundled-channel-deps-compat plugins-offline`) plus Telegram package QA because the release-path Docker chunks already cover the overlapping package/update/plugin lanes. Targeted GitHub Docker rerun commands generated from artifacts include prior package artifact and prepared image inputs when available, so failed lanes can avoid rebuilding the package and images.
- Build and release checks run `scripts/check-cli-bootstrap-imports.mjs` after tsdown. The guard walks the static built graph from `dist/entry.js` and `dist/cli/run-main.js` and fails if pre-dispatch startup imports package dependencies such as Commander, prompt UI, undici, or logging before command dispatch; it also keeps the bundled gateway run chunk under budget and rejects static imports of known cold gateway paths. Packaged CLI smoke also covers root help, onboard help, doctor help, status, config schema, and a model-list command.
2026.4.25(2026.4.25-beta.*included). Through that cutoff, the harness tolerates only shipped-package metadata gaps: omitted private QA inventory entries, missinggateway install --wrapper, missing patch files in the tarball-derived git fixture, missing persistedupdate.channel, legacy plugin install-record locations, missing marketplace install-record persistence, and config metadata migration duringplugins update. For packages after2026.4.25, those paths are strict failures. - Container smoke runners:
test:docker:openwebui,test:docker:onboard,test:docker:npm-onboard-channel-agent,test:docker:update-channel-switch,test:docker:session-runtime-context,test:docker:agents-delete-shared-workspace,test:docker:gateway-network,test:docker:browser-cdp-snapshot,test:docker:mcp-channels,test:docker:pi-bundle-mcp-tools,test:docker:cron-mcp-cleanup,test:docker:plugins,test:docker:plugin-update, andtest:docker:config-reloadboot one or more real containers and verify higher-level integration paths.
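The weighted scheduling described for test:docker:all above (process slots plus per-resource caps, with an oversized lane admitted only into an empty pool) can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/test-docker-all.mjs: the Lane and Limits shapes, the canStart and orderByTiming names, and the choice to count resource usage by lane weight are all assumptions made for the example.

```typescript
// Hypothetical lane shape; the real lane definitions live in
// scripts/lib/docker-e2e-scenarios.mjs and may look different.
type Resource = "live" | "npm" | "service";

interface Lane {
  name: string;
  weight: number; // process slots the lane occupies (assumed unit)
  resources: Resource[]; // resource classes the lane counts against
}

interface Limits {
  slots: number;
  caps: Record<Resource, number>;
}

// Decide whether `candidate` may start next to the lanes already running.
function canStart(candidate: Lane, running: Lane[], limits: Limits): boolean {
  // An empty pool always admits one lane, even one heavier than the caps,
  // so an oversized lane cannot stall the whole run.
  if (running.length === 0) return true;

  const usedSlots = running.reduce((sum, lane) => sum + lane.weight, 0);
  if (usedSlots + candidate.weight > limits.slots) return false;

  for (const resource of candidate.resources) {
    const used = running
      .filter((lane) => lane.resources.includes(resource))
      .reduce((sum, lane) => sum + lane.weight, 0);
    if (used + candidate.weight > limits.caps[resource]) return false;
  }
  return true;
}

// Start longer lanes first when timings from earlier runs are known;
// lanes without a recorded timing sort last.
function orderByTiming(lanes: Lane[], timingsMs: Record<string, number>): Lane[] {
  return [...lanes].sort(
    (a, b) => (timingsMs[b.name] ?? 0) - (timingsMs[a.name] ?? 0),
  );
}
```

With the documented defaults (10 slots, live cap 9), a lane heavier than the slot count starts only when nothing else is running, and while it runs every other candidate is refused, which matches the "running alone" behavior described above.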
- Direct models:
pnpm test:docker:live-models (script: scripts/test-live-models-docker.sh)
- ACP bind smoke:
pnpm test:docker:live-acp-bind (script: scripts/test-live-acp-bind-docker.sh; covers Claude, Codex, and Gemini by default, with strict Droid/OpenCode coverage via pnpm test:docker:live-acp-bind:droid and pnpm test:docker:live-acp-bind:opencode)
- CLI backend smoke:
pnpm test:docker:live-cli-backend (script: scripts/test-live-cli-backend-docker.sh)
- Codex app-server harness smoke:
pnpm test:docker:live-codex-harness (script: scripts/test-live-codex-harness-docker.sh)
- Gateway + dev agent:
pnpm test:docker:live-gateway (script: scripts/test-live-gateway-models-docker.sh)
- Observability smoke:
pnpm qa:otel:smoke is a private QA source-checkout lane. It is intentionally not part of package Docker release lanes because the npm tarball omits QA Lab.
- Open WebUI live smoke:
pnpm test:docker:openwebui (script: scripts/e2e/openwebui-docker.sh)
- Onboarding wizard (TTY, full scaffolding):
pnpm test:docker:onboard (script: scripts/e2e/onboard-docker.sh)
- Npm tarball onboarding/channel/agent smoke:
pnpm test:docker:npm-onboard-channel-agent installs the packed OpenClaw tarball globally in Docker, configures OpenAI via env-ref onboarding plus Telegram by default, verifies doctor repairs activated plugin runtime deps, and runs one mocked OpenAI agent turn. Reuse a prebuilt tarball with OPENCLAW_CURRENT_PACKAGE_TGZ=/path/to/openclaw-*.tgz, skip the host rebuild with OPENCLAW_NPM_ONBOARD_HOST_BUILD=0, or switch channel with OPENCLAW_NPM_ONBOARD_CHANNEL=discord.
- Update channel switch smoke:
pnpm test:docker:update-channel-switch installs the packed OpenClaw tarball globally in Docker, switches from package stable to git dev, verifies the persisted channel and plugin post-update work, then switches back to package stable and checks update status.
- Session runtime context smoke:
pnpm test:docker:session-runtime-context verifies hidden runtime context transcript persistence plus doctor repair of affected duplicated prompt-rewrite branches.
- Bun global install smoke:
bash scripts/e2e/bun-global-install-smoke.sh packs the current tree, installs it with bun install -g in an isolated home, and verifies openclaw infer image providers --json returns bundled image providers instead of hanging. Reuse a prebuilt tarball with OPENCLAW_BUN_GLOBAL_SMOKE_PACKAGE_TGZ=/path/to/openclaw-*.tgz, skip the host build with OPENCLAW_BUN_GLOBAL_SMOKE_HOST_BUILD=0, or copy dist/ from a built Docker image with OPENCLAW_BUN_GLOBAL_SMOKE_DIST_IMAGE=openclaw-dockerfile-smoke:local.
- Installer Docker smoke:
bash scripts/test-install-sh-docker.sh shares one npm cache across its root, update, and direct-npm containers. Update smoke defaults to npm latest as the stable baseline before upgrading to the candidate tarball. Override with OPENCLAW_INSTALL_SMOKE_UPDATE_BASELINE=2026.4.22 locally, or with the Install Smoke workflow’s update_baseline_version input on GitHub. Non-root installer checks keep an isolated npm cache so root-owned cache entries do not mask user-local install behavior. Set OPENCLAW_INSTALL_SMOKE_NPM_CACHE_DIR=/path/to/cache to reuse the root/update/direct-npm cache across local reruns.
- Install Smoke CI skips the duplicate direct-npm global update with OPENCLAW_INSTALL_SMOKE_SKIP_NPM_GLOBAL=1; run the script locally without that env when direct npm install -g coverage is needed.
- Agents delete shared workspace CLI smoke:
pnpm test:docker:agents-delete-shared-workspace (script: scripts/e2e/agents-delete-shared-workspace-docker.sh) builds the root Dockerfile image by default, seeds two agents with one workspace in an isolated container home, runs agents delete --json, and verifies valid JSON plus retained workspace behavior. Reuse the install-smoke image with OPENCLAW_AGENTS_DELETE_SHARED_WORKSPACE_E2E_IMAGE=openclaw-dockerfile-smoke:local OPENCLAW_AGENTS_DELETE_SHARED_WORKSPACE_E2E_SKIP_BUILD=1.
- Gateway networking (two containers, WS auth + health):
pnpm test:docker:gateway-network (script: scripts/e2e/gateway-network-docker.sh)
- Browser CDP snapshot smoke:
pnpm test:docker:browser-cdp-snapshot (script: scripts/e2e/browser-cdp-snapshot-docker.sh) builds the source E2E image plus a Chromium layer, starts Chromium with raw CDP, runs browser doctor --deep, and verifies CDP role snapshots cover link URLs, cursor-promoted clickables, iframe refs, and frame metadata.
- OpenAI Responses web_search minimal reasoning regression:
pnpm test:docker:openai-web-search-minimal (script: scripts/e2e/openai-web-search-minimal-docker.sh) runs a mocked OpenAI server through Gateway, verifies web_search raises reasoning.effort from minimal to low, then forces the provider schema reject and checks the raw detail appears in Gateway logs.
- MCP channel bridge (seeded Gateway + stdio bridge + raw Claude notification-frame smoke):
pnpm test:docker:mcp-channels (script: scripts/e2e/mcp-channels-docker.sh)
- Pi bundle MCP tools (real stdio MCP server + embedded Pi profile allow/deny smoke):
pnpm test:docker:pi-bundle-mcp-tools (script: scripts/e2e/pi-bundle-mcp-tools-docker.sh)
- Cron/subagent MCP cleanup (real Gateway + stdio MCP child teardown after isolated cron and one-shot subagent runs):
pnpm test:docker:cron-mcp-cleanup (script: scripts/e2e/cron-mcp-cleanup-docker.sh)
- Plugins (install smoke, ClawHub kitchen-sink install/uninstall, marketplace updates, and Claude-bundle enable/inspect):
pnpm test:docker:plugins (script: scripts/e2e/plugins-docker.sh). Set OPENCLAW_PLUGINS_E2E_CLAWHUB=0 to skip the ClawHub block, or override the default kitchen-sink package/runtime pair with OPENCLAW_PLUGINS_E2E_CLAWHUB_SPEC and OPENCLAW_PLUGINS_E2E_CLAWHUB_ID. Without OPENCLAW_CLAWHUB_URL/CLAWHUB_URL, the test uses a hermetic local ClawHub fixture server.
- Plugin update unchanged smoke:
pnpm test:docker:plugin-update (script: scripts/e2e/plugin-update-unchanged-docker.sh)
- Config reload metadata smoke:
pnpm test:docker:config-reload (script: scripts/e2e/config-reload-source-docker.sh)
- Bundled plugin runtime deps:
pnpm test:docker:bundled-channel-deps builds a small Docker runner image by default, builds and packs OpenClaw once on the host, then mounts that tarball into each Linux install scenario. Reuse the image with OPENCLAW_SKIP_DOCKER_BUILD=1, skip the host rebuild after a fresh local build with OPENCLAW_BUNDLED_CHANNEL_HOST_BUILD=0, or point at an existing tarball with OPENCLAW_CURRENT_PACKAGE_TGZ=/path/to/openclaw-*.tgz. The full Docker aggregate and release-path bundled-channel chunks pre-pack this tarball once, then shard bundled channel checks into independent lanes, including separate update lanes for Telegram, Discord, Slack, Feishu, memory-lancedb, and ACPX. Release chunks split channel smokes, update targets, and setup/runtime contracts into bundled-channels-core, bundled-channels-update-a, bundled-channels-update-b, and bundled-channels-contracts; the aggregate bundled-channels chunk remains available for manual reruns. The release workflow also splits provider installer chunks and bundled plugin install/uninstall chunks; legacy package-update, plugins-runtime, and plugins-integrations chunks remain aggregate aliases for manual reruns. Use OPENCLAW_BUNDLED_CHANNELS=telegram,slack to narrow the channel matrix when running the bundled lane directly, or OPENCLAW_BUNDLED_CHANNEL_UPDATE_TARGETS=telegram,acpx to narrow the update scenario. The lane also verifies that channels.<id>.enabled=false and plugins.entries.<id>.enabled=false suppress doctor/runtime-dependency repair.
- Narrow bundled plugin runtime deps while iterating by disabling unrelated scenarios, for example:
OPENCLAW_BUNDLED_CHANNEL_SCENARIOS=0 OPENCLAW_BUNDLED_CHANNEL_UPDATE_SCENARIO=0 OPENCLAW_BUNDLED_CHANNEL_ROOT_OWNED_SCENARIO=0 OPENCLAW_BUNDLED_CHANNEL_SETUP_ENTRY_SCENARIO=0 pnpm test:docker:bundled-channel-deps.
OPENCLAW_GATEWAY_NETWORK_E2E_IMAGE still win when set. When OPENCLAW_SKIP_DOCKER_BUILD=1 points at a remote shared image, the scripts pull it if it is not already local. The QR and installer Docker tests keep their own Dockerfiles because they validate package/install behavior rather than the shared built-app runtime.
The live-model Docker runners also bind-mount the current checkout read-only and
stage it into a temporary workdir inside the container. This keeps the runtime
image slim while still running Vitest against your exact local source/config.
The staging step skips large local-only caches and app build outputs such as
.pnpm-store, .worktrees, __openclaw_vitest__, and app-local .build or
Gradle output directories so Docker live runs do not spend minutes copying
machine-specific artifacts.
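The staging filter described above can be pictured as a simple path predicate. The sketch below is only an illustration: the function name shouldStage and the segment-matching rule are assumptions, and the Gradle output directories are left out because their names are not given here.

```typescript
// Directory names taken from the paragraph above. Matching on any path
// segment is an assumption about how the skip rule is applied.
const SKIPPED_DIR_NAMES = new Set<string>([
  ".pnpm-store",
  ".worktrees",
  "__openclaw_vitest__",
  ".build",
]);

// Return true when a checkout-relative path should be copied into the
// temporary workdir inside the container.
function shouldStage(relativePath: string): boolean {
  const segments = relativePath.split("/").filter((segment) => segment.length > 0);
  return !segments.some((segment) => SKIPPED_DIR_NAMES.has(segment));
}
```

Under this sketch a source file such as src/agents/models.profiles.live.test.ts is staged, while anything beneath .pnpm-store or an app-local .build directory is skipped.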
They also set OPENCLAW_SKIP_CHANNELS=1 so gateway live probes do not start
real Telegram/Discord/etc. channel workers inside the container.
test:docker:live-models still runs pnpm test:live, so pass through
OPENCLAW_LIVE_GATEWAY_* as well when you need to narrow or exclude gateway
live coverage from that Docker lane.
test:docker:openwebui is a higher-level compatibility smoke: it starts an
OpenClaw gateway container with the OpenAI-compatible HTTP endpoints enabled,
starts a pinned Open WebUI container against that gateway, signs in through
Open WebUI, verifies /api/models exposes openclaw/default, then sends a
real chat request through Open WebUI’s /api/chat/completions proxy.
The first run can be noticeably slower because Docker may need to pull the
Open WebUI image and Open WebUI may need to finish its own cold-start setup.
This lane expects a usable live model key, and OPENCLAW_PROFILE_FILE
(~/.profile by default) is the primary way to provide it in Dockerized runs.
Successful runs print a small JSON payload like { "ok": true, "model": "openclaw/default", ... }.
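If you script around this lane, the success payload can be checked with a small parser. This is a hedged sketch: only the ok and model fields appear in the example payload above, and the parseSmokeResult name and its error messages are hypothetical.

```typescript
// Only `ok` and `model` are shown in the documented payload; any other
// fields are ignored here because their names are not documented.
interface SmokeResult {
  ok: true;
  model: string;
}

// Parse one line of runner output and fail loudly if it is not a success payload.
function parseSmokeResult(stdoutLine: string): SmokeResult {
  const parsed: unknown = JSON.parse(stdoutLine);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("expected a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  if (record.ok !== true) {
    throw new Error("smoke did not report ok: true");
  }
  if (typeof record.model !== "string") {
    throw new Error("payload is missing a model id");
  }
  return { ok: true, model: record.model };
}
```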
test:docker:mcp-channels is intentionally deterministic and does not need a
real Telegram, Discord, or iMessage account. It boots a seeded Gateway
container, starts a second container that spawns openclaw mcp serve, then
verifies routed conversation discovery, transcript reads, attachment metadata,
live event queue behavior, outbound send routing, and Claude-style channel +
permission notifications over the real stdio MCP bridge. The notification check
inspects the raw stdio MCP frames directly so the smoke validates what the
bridge actually emits, not just what a specific client SDK happens to surface.
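Inspecting raw frames rather than SDK callbacks can be sketched as below. The stdio MCP transport carries newline-delimited JSON-RPC 2.0 messages, and a notification is a message with a method but no id. The helper names and the method strings used in the usage note are placeholders, not the methods the bridge really emits.

```typescript
// Minimal JSON-RPC 2.0 frame shape; only the fields needed to tell
// notifications apart from requests and responses.
interface JsonRpcFrame {
  jsonrpc: "2.0";
  id?: number | string;
  method?: string;
  params?: unknown;
}

// Split captured stdout into frames: one JSON document per non-empty line.
function parseFrames(raw: string): JsonRpcFrame[] {
  return raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as JsonRpcFrame);
}

// Collect the method names of notifications (frames with a method and no id).
function notificationMethods(frames: JsonRpcFrame[]): string[] {
  return frames
    .filter(
      (frame): frame is JsonRpcFrame & { method: string } =>
        typeof frame.method === "string" && frame.id === undefined,
    )
    .map((frame) => frame.method);
}
```

A smoke built this way would capture the bridge's stdout, call parseFrames, and assert that notificationMethods contains the expected channel and permission notification methods, whatever their real names are.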
test:docker:pi-bundle-mcp-tools is deterministic and does not need a live
model key. It builds the repo Docker image, starts a real stdio MCP probe server
inside the container, materializes that server through the embedded Pi bundle
MCP runtime, executes the tool, then verifies that the coding and messaging profiles keep
bundle-mcp tools while the minimal profile and tools.deny: ["bundle-mcp"] filter them.
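The allow/deny expectation in that smoke boils down to a small truth table, sketched here. The ToolPolicy shape and the keepsBundleMcpTools helper are hypothetical; only the profile names and the tools.deny: ["bundle-mcp"] entry come from the description above.

```typescript
// Profile names from the paragraph above; the policy shape is an assumption.
type ToolProfile = "minimal" | "coding" | "messaging";

interface ToolPolicy {
  profile: ToolProfile;
  deny?: string[];
}

// Expected outcome of the smoke: bundle-mcp tools survive unless the
// profile is minimal or the deny list names the bundle-mcp group.
function keepsBundleMcpTools(policy: ToolPolicy): boolean {
  if (policy.deny !== undefined && policy.deny.includes("bundle-mcp")) {
    return false;
  }
  return policy.profile !== "minimal";
}
```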
test:docker:cron-mcp-cleanup is deterministic and does not need a live model
key. It starts a seeded Gateway with a real stdio MCP probe server, runs an
isolated cron turn and a /subagents spawn one-shot child turn, then verifies
the MCP child process exits after each run.
Manual ACP plain-language thread smoke (not CI):
bun scripts/dev/discord-acp-plain-language-smoke.ts --channel <discord-channel-id> ...
- Keep this script for regression/debug workflows. It may be needed again for ACP thread routing validation, so do not delete it.
Useful env vars:
- OPENCLAW_CONFIG_DIR=... (default: ~/.openclaw) mounted to /home/node/.openclaw
- OPENCLAW_WORKSPACE_DIR=... (default: ~/.openclaw/workspace) mounted to /home/node/.openclaw/workspace
- OPENCLAW_PROFILE_FILE=... (default: ~/.profile) mounted to /home/node/.profile and sourced before running tests
- OPENCLAW_DOCKER_PROFILE_ENV_ONLY=1 to verify only env vars sourced from OPENCLAW_PROFILE_FILE, using temporary config/workspace dirs and no external CLI auth mounts
- OPENCLAW_DOCKER_CLI_TOOLS_DIR=... (default: ~/.cache/openclaw/docker-cli-tools) mounted to /home/node/.npm-global for cached CLI installs inside Docker
- External CLI auth dirs/files under $HOME are mounted read-only under /host-auth..., then copied into /home/node/... before tests start
  - Default dirs: .minimax
  - Default files: ~/.codex/auth.json, ~/.codex/config.toml, .claude.json, ~/.claude/.credentials.json, ~/.claude/settings.json, ~/.claude/settings.local.json
  - Narrowed provider runs mount only the needed dirs/files inferred from OPENCLAW_LIVE_PROVIDERS / OPENCLAW_LIVE_GATEWAY_PROVIDERS
  - Override manually with OPENCLAW_DOCKER_AUTH_DIRS=all, OPENCLAW_DOCKER_AUTH_DIRS=none, or a comma list like OPENCLAW_DOCKER_AUTH_DIRS=.claude,.codex
- OPENCLAW_LIVE_GATEWAY_MODELS=... / OPENCLAW_LIVE_MODELS=... to narrow the run
- OPENCLAW_LIVE_GATEWAY_PROVIDERS=... / OPENCLAW_LIVE_PROVIDERS=... to filter providers in-container
- OPENCLAW_SKIP_DOCKER_BUILD=1 to reuse an existing openclaw:local-live image for reruns that do not need a rebuild
- OPENCLAW_LIVE_REQUIRE_PROFILE_KEYS=1 to ensure creds come from the profile store (not env)
- OPENCLAW_OPENWEBUI_MODEL=... to choose the model exposed by the gateway for the Open WebUI smoke
- OPENCLAW_OPENWEBUI_PROMPT=... to override the nonce-check prompt used by the Open WebUI smoke
- OPENWEBUI_IMAGE=... to override the pinned Open WebUI image tag
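The mount defaults and the OPENCLAW_DOCKER_AUTH_DIRS override listed above can be summarized in a short sketch. The helper names resolveMounts and parseAuthDirs are hypothetical, and the real runner scripts are shell, so treat this only as a compact restatement of the documented defaults.

```typescript
interface Mount {
  hostPath: string;
  containerPath: string;
}

type Env = Record<string, string | undefined>;

// Apply the documented defaults when an override env var is not set.
function resolveMounts(env: Env, home: string): Mount[] {
  return [
    { hostPath: env.OPENCLAW_CONFIG_DIR ?? `${home}/.openclaw`, containerPath: "/home/node/.openclaw" },
    { hostPath: env.OPENCLAW_WORKSPACE_DIR ?? `${home}/.openclaw/workspace`, containerPath: "/home/node/.openclaw/workspace" },
    { hostPath: env.OPENCLAW_PROFILE_FILE ?? `${home}/.profile`, containerPath: "/home/node/.profile" },
    { hostPath: env.OPENCLAW_DOCKER_CLI_TOOLS_DIR ?? `${home}/.cache/openclaw/docker-cli-tools`, containerPath: "/home/node/.npm-global" },
  ];
}

type AuthDirs =
  | { mode: "default" }
  | { mode: "all" }
  | { mode: "none" }
  | { mode: "list"; dirs: string[] };

// Interpret OPENCLAW_DOCKER_AUTH_DIRS: unset, "all", "none", or a comma list.
function parseAuthDirs(value: string | undefined): AuthDirs {
  const trimmed = (value ?? "").trim();
  if (trimmed === "") return { mode: "default" };
  if (trimmed === "all") return { mode: "all" };
  if (trimmed === "none") return { mode: "none" };
  const dirs = trimmed
    .split(",")
    .map((dir) => dir.trim())
    .filter((dir) => dir.length > 0);
  return { mode: "list", dirs };
}
```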
Docs sanity
Run docs checks after doc edits: pnpm check:docs.
Run full Mintlify anchor validation when you need in-page heading checks too: pnpm docs:check-links:anchors.
Offline regression (CI-safe)
These are “real pipeline” regressions without real providers:
- Gateway tool calling (mock OpenAI, real gateway + agent loop): src/gateway/gateway.test.ts (case: “runs a mock OpenAI tool call end-to-end via gateway agent loop”)
- Gateway wizard (WS wizard.start/wizard.next, writes config + auth enforced): src/gateway/gateway.test.ts (case: “runs wizard over ws and writes auth token config”)
Agent reliability evals (skills)
We already have a few CI-safe tests that behave like “agent reliability evals”:
- Mock tool-calling through the real gateway + agent loop (src/gateway/gateway.test.ts).
- End-to-end wizard flows that validate session wiring and config effects (src/gateway/gateway.test.ts).
- Decisioning: when skills are listed in the prompt, does the agent pick the right skill (or avoid irrelevant ones)?
- Compliance: does the agent read SKILL.md before use and follow required steps/args?
- Workflow contracts: multi-turn scenarios that assert tool order, session history carryover, and sandbox boundaries.
- A scenario runner using mock providers to assert tool calls + order, skill file reads, and session wiring.
- A small suite of skill-focused scenarios (use vs avoid, gating, prompt injection).
- Optional live evals (opt-in, env-gated) only after the CI-safe suite is in place.
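As a rough illustration of what a mock-provider scenario runner could assert, the sketch below checks tool-call order and the "read SKILL.md before use" rule over a recorded call list. Nothing here exists in the repo: the ToolCall shape, the "read" tool name, and both helper names are assumptions for the example.

```typescript
// Hypothetical record of one tool call captured from a mock-provider run.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// True when `expected` appears in `calls` as an in-order subsequence.
function calledInOrder(calls: ToolCall[], expected: string[]): boolean {
  let next = 0;
  for (const call of calls) {
    if (next < expected.length && call.tool === expected[next]) {
      next += 1;
    }
  }
  return next === expected.length;
}

// Compliance check: the skill file was read, the skill tool was used,
// and the read happened first.
function readSkillBeforeUse(calls: ToolCall[], skillPath: string, skillTool: string): boolean {
  const readIndex = calls.findIndex(
    (call) => call.tool === "read" && call.args.path === skillPath,
  );
  const useIndex = calls.findIndex((call) => call.tool === skillTool);
  return readIndex !== -1 && useIndex !== -1 && readIndex < useIndex;
}
```

Because both helpers work on a plain recorded list, they stay deterministic and CI-safe regardless of which mock provider produced the calls.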
Contract tests (plugin and channel shape)
Contract tests verify that every registered plugin and channel conforms to its interface contract. They iterate over all discovered plugins and run a suite of shape and behavior assertions. The default pnpm test unit lane intentionally
skips these shared seam and smoke files; run the contract commands explicitly
when you touch shared channel or provider surfaces.
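The "iterate over every plugin and run the same assertions" idea can be sketched as follows. This is not the repo's contract harness: the ChannelPluginLike shape is a guess based on the id, name, and capabilities fields that the plugin contract is described as covering, and both function names are hypothetical.

```typescript
// Hypothetical minimal plugin shape. The type of `capabilities` is an assumption.
interface ChannelPluginLike {
  id: string;
  name: string;
  capabilities: Record<string, unknown>;
}

// Return a list of violations; an empty list means the shape holds.
function checkPluginShape(candidate: unknown): string[] {
  if (typeof candidate !== "object" || candidate === null) {
    return ["plugin must be an object"];
  }
  const plugin = candidate as Partial<ChannelPluginLike>;
  const problems: string[] = [];
  if (typeof plugin.id !== "string" || plugin.id.length === 0) {
    problems.push("id must be a non-empty string");
  }
  if (typeof plugin.name !== "string" || plugin.name.length === 0) {
    problems.push("name must be a non-empty string");
  }
  if (typeof plugin.capabilities !== "object" || plugin.capabilities === null) {
    problems.push("capabilities must be an object");
  }
  return problems;
}

// Run the same shape assertions over every discovered plugin and collect
// failures keyed by the plugin's position in the list.
function runShapeContract(plugins: unknown[]): Map<number, string[]> {
  const failures = new Map<number, string[]>();
  plugins.forEach((plugin, index) => {
    const problems = checkPluginShape(plugin);
    if (problems.length > 0) {
      failures.set(index, problems);
    }
  });
  return failures;
}
```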
Commands
- All contracts: pnpm test:contracts
- Channel contracts only: pnpm test:contracts:channels
- Provider contracts only: pnpm test:contracts:plugins
Channel contracts
Located in src/channels/plugins/contracts/*.contract.test.ts:
- plugin - Basic plugin shape (id, name, capabilities)
- setup - Setup wizard contract
- session-binding - Session binding behavior
- outbound-payload - Message payload structure
- inbound - Inbound message handling
- actions - Channel action handlers
- threading - Thread ID handling
- directory - Directory/roster API
- group-policy - Group policy enforcement
Provider status contracts
Located in src/plugins/contracts/*.contract.test.ts.
- status - Channel status probes
- registry - Plugin registry shape
Provider contracts
Located in src/plugins/contracts/*.contract.test.ts:
- auth - Auth flow contract
- auth-choice - Auth choice/selection
- catalog - Model catalog API
- discovery - Plugin discovery
- loader - Plugin loading
- runtime - Provider runtime
- shape - Plugin shape/interface
- wizard - Setup wizard
When to run
- After changing plugin-sdk exports or subpaths
- After adding or modifying a channel or provider plugin
- After refactoring plugin registration or discovery
Adding regressions (guidance)
When you fix a provider/model issue discovered in live:
- Add a CI-safe regression if possible (mock/stub provider, or capture the exact request-shape transformation)
- If it’s inherently live-only (rate limits, auth policies), keep the live test narrow and opt-in via env vars
- Prefer targeting the smallest layer that catches the bug:
- provider request conversion/replay bug → direct models test
- gateway session/history/tool pipeline bug → gateway live smoke or CI-safe gateway mock test
- SecretRef traversal guardrail:
src/secrets/exec-secret-ref-id-parity.test.ts derives one sampled target per SecretRef class from registry metadata (listSecretTargetRegistryEntries()), then asserts traversal-segment exec ids are rejected.
- If you add a new includeInPlan SecretRef target family in src/secrets/target-registry-data.ts, update classifyTargetClass in that test. The test intentionally fails on unclassified target ids so new classes cannot be skipped silently.
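The shape of that guardrail (one sample per class, hard failure on anything unclassified, traversal segments rejected) can be sketched as below. This is an illustration under assumptions: the RegistryEntry shape, sampleOnePerClass, and hasTraversalSegment are hypothetical stand-ins, not the helpers used in src/secrets/exec-secret-ref-id-parity.test.ts.

```typescript
// Hypothetical registry entry; the real entries come from
// listSecretTargetRegistryEntries() and may carry more fields.
interface RegistryEntry {
  id: string;
}

// Pick the first entry seen for each class. An unclassified id throws, so a
// new target family cannot be skipped silently.
function sampleOnePerClass(
  entries: RegistryEntry[],
  classify: (id: string) => string | undefined,
): Map<string, RegistryEntry> {
  const samples = new Map<string, RegistryEntry>();
  for (const entry of entries) {
    const targetClass = classify(entry.id);
    if (targetClass === undefined) {
      throw new Error(`unclassified target id: ${entry.id}`);
    }
    if (!samples.has(targetClass)) {
      samples.set(targetClass, entry);
    }
  }
  return samples;
}

// An exec id is treated as unsafe when any path segment is "." or "..".
function hasTraversalSegment(execId: string): boolean {
  return execId.split(/[\\/]/).some((segment) => segment === "." || segment === "..");
}
```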