Mantis is the OpenClaw end-to-end verification system for bugs that need a real runtime, a real transport, and visible proof. It runs a scenario against a known bad ref, captures evidence, runs the same scenario against a candidate ref, and publishes the comparison as artifacts that a maintainer can inspect from a PR or from a local command. Mantis starts with Discord because Discord gives us a high-value first lane: real bot auth, real guild channels, reactions, threads, native commands, and a browser UI where humans can visually confirm what the transport showed.
Goals
- Reproduce a bug from a GitHub issue or PR with the same transport shape users see.
- Capture a before artifact on the baseline ref before applying the fix.
- Capture an after artifact on the candidate ref after applying the fix.
- Use a deterministic oracle whenever possible, such as a Discord REST reaction read or channel transcript check.
- Capture screenshots when the bug has a visible UI surface.
- Run locally from an agent-controlled CLI and remotely from GitHub.
- Preserve enough machine state for VNC rescue when login, browser automation, or provider auth gets stuck.
- Post concise status to an operator Discord channel when the run is blocked, needs manual VNC help, or finishes.
Non-goals
- Mantis is not a replacement for unit tests. A Mantis run should usually become a smaller regression test after the fix is understood.
- Mantis is not the normal fast CI gate. It is slower, uses live credentials, and is reserved for bugs where the live environment matters.
- Mantis should not require a human for normal operation. Manual VNC is a rescue path, not the happy path.
- Mantis does not store raw secrets in artifacts, logs, screenshots, Markdown reports, or PR comments.
Ownership
Mantis lives in the OpenClaw QA stack.
- OpenClaw owns the scenario runtime, transport adapters, evidence schema, and local CLI under pnpm openclaw qa mantis.
- QA Lab owns the live transport harness pieces, browser capture helpers, and artifact writers.
- Crabbox owns warmed Linux machines when a remote VM is needed.
- GitHub Actions owns the remote workflow entrypoint and artifact retention.
- ClawSweeper owns GitHub comment routing: parsing maintainer commands, dispatching the workflow, and posting the final PR comment.
- OpenClaw agents drive Mantis through Codex when a scenario needs agentic setup, debugging, or stuck-state reporting.
Command shape
The first local command verifies the Discord bot, guild, channel, message send, reaction send, and artifact path:
The before/after run uses --allow-failures, then writes baseline/, candidate/, comparison.json,
and mantis-report.md. For the first Discord scenario, a successful verification
means baseline status is fail and candidate status is pass.
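The pass/fail rule above can be sketched as a small check. This is a minimal illustration rather than Mantis code: LaneResult, is_verified, and describe are hypothetical names, and the real comparison.json layout is not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneResult:
    """Outcome of one lane (baseline or candidate); status is "pass" or "fail"."""
    ref: str
    status: str


def is_verified(baseline: LaneResult, candidate: LaneResult) -> bool:
    """A fix counts as verified only when baseline fails and candidate passes."""
    return baseline.status == "fail" and candidate.status == "pass"


def describe(baseline: LaneResult, candidate: LaneResult) -> str:
    """Return a one-line verdict suitable for a report."""
    if is_verified(baseline, candidate):
        return f"verified: {baseline.ref} fails, {candidate.ref} passes"
    if baseline.status == "pass":
        return f"not reproduced: {baseline.ref} passed, so the comparison proves nothing"
    return f"not fixed: {candidate.ref} still fails"
```

A baseline that passes is treated as "not reproduced" rather than as a success, because a before/after comparison only means something when the bug shows up on the baseline ref.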
The second Discord before/after probe targets thread attachments:
The probe uses the message.thread-reply action with a repo-local
filePath, then polls the thread for the SUT reply and attachment filename. The
baseline screenshot shows the reply with no attachment; the candidate screenshot
shows the expected mantis-thread-report.md attachment.
The first VM/browser primitive is the desktop smoke:
Use --provider, --crabbox-bin, or
OPENCLAW_MANTIS_CRABBOX_PROVIDER when running against another Crabbox fleet.
Useful desktop smoke flags:
- --lease-id <cbx_...> or OPENCLAW_MANTIS_CRABBOX_LEASE_ID reuses a warmed desktop.
- --browser-url <url> changes the page opened in the visible browser.
- --html-file <path> renders a repo-local HTML artifact in the visible browser. Mantis uses this to capture the generated Discord status-reaction timeline through a real Crabbox desktop.
- --browser-profile-dir <remote-path> reuses a remote Chrome user-data-dir so a persistent Mantis desktop can stay logged in between runs. Use this for the long-lived Discord Web viewer profile.
- --browser-profile-archive-env <name> restores a base64 .tgz Chrome user-data-dir archive from the named environment variable before launching the browser. Use this for logged-in witnesses such as Discord Web. The default env var is OPENCLAW_MANTIS_BROWSER_PROFILE_TGZ_B64.
- --video-duration <seconds> controls the MP4 capture length. Use a longer duration for slow logged-in web apps that need time to settle.
- --keep-lease or OPENCLAW_MANTIS_KEEP_VM=1 keeps a newly created passing lease open for VNC inspection. Failed runs keep the lease by default when one was created so an operator can reconnect.
- --class, --idle-timeout, and --ttl tune machine size and lease lifetime.
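Restoring a base64 .tgz profile archive from an environment variable could look roughly like the sketch below. It is an assumption-laden illustration, not the Mantis implementation: restore_profile_archive is a hypothetical helper, and the sketch deliberately rejects symlinks and other special members, which a real Chrome profile may contain.

```python
import base64
import io
import os
import tarfile
from pathlib import Path


def restore_profile_archive(env_name: str, dest_dir: Path) -> int:
    """Decode a base64 .tgz from the environment and unpack it into dest_dir.

    Returns the number of extracted members, or 0 when the variable is unset.
    """
    encoded = os.environ.get(env_name, "").strip()
    if not encoded:
        return 0
    raw = base64.b64decode(encoded)
    dest_root = dest_dir.resolve()
    dest_root.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as archive:
        members = archive.getmembers()
        for member in members:
            target = (dest_root / member.name).resolve()
            # Refuse entries that would land outside the profile directory.
            if not target.is_relative_to(dest_root):
                raise ValueError(f"unsafe archive member: {member.name}")
            # Keep the sketch simple: only plain files and directories.
            if not (member.isfile() or member.isdir()):
                raise ValueError(f"unsupported archive member: {member.name}")
        archive.extractall(dest_root, members=members)
    return len(members)
```

Returning 0 for an unset variable lets a caller treat the archive as optional and fall back to a persistent profile directory instead.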
thread-reply, and checks the attachment through Discord
REST. When OPENCLAW_QA_DISCORD_CAPTURE_UI_METADATA=1 is set, the scenario also
writes a Discord Web URL artifact. When OPENCLAW_QA_DISCORD_KEEP_THREADS=1 is
set, it leaves that thread available long enough for a logged-in browser to open
and record it.
The GitHub workflow opens the candidate thread URL in Discord Web, captures a
screenshot, records an MP4, and generates a trimmed GIF preview when Crabbox
media tooling is available. Prefer a persistent viewer profile path configured
through MANTIS_DISCORD_VIEWER_CHROME_PROFILE_DIR, because full Chrome profile
archives can outgrow GitHub’s secret-size limit. For small/bootstrap profiles,
the workflow can also restore a base64 .tgz archive from
MANTIS_DISCORD_VIEWER_CHROME_PROFILE_TGZ_B64. If neither profile source is
configured, the workflow still publishes the deterministic baseline/candidate
attachment screenshots and logs a notice that the logged-in Discord Web witness
was skipped.
The first full desktop transport primitive is the Slack desktop smoke:
The command leases a Crabbox Linux desktop VM, runs pnpm openclaw qa slack inside that VM, opens Slack Web in the VNC
browser, captures the visible desktop, and copies both the Slack QA artifacts and
the VNC screenshot back to the local output directory. This is the first Mantis
shape where the SUT OpenClaw gateway and the browser both live inside the same
Linux desktop VM.
With --gateway-setup, the command prepares a persistent disposable OpenClaw
home at $HOME/.openclaw-mantis/slack-openclaw, patches Slack Socket Mode
configuration for the selected channel, starts openclaw gateway run on port
38973, and keeps Chrome running in the VNC session. This is the “leave me a
Linux desktop with Slack and a claw running” mode; the bot-to-bot Slack QA lane
remains the default when --gateway-setup is omitted.
Required inputs for --credential-source env:
- OPENCLAW_QA_SLACK_CHANNEL_ID
- OPENCLAW_QA_SLACK_DRIVER_BOT_TOKEN
- OPENCLAW_QA_SLACK_SUT_BOT_TOKEN
- OPENCLAW_QA_SLACK_SUT_APP_TOKEN
- OPENCLAW_LIVE_OPENAI_KEY for the remote model lane. If only OPENAI_API_KEY is set locally, Mantis maps it to OPENCLAW_LIVE_OPENAI_KEY before invoking Crabbox so Crabbox’s OPENCLAW_* env forwarding can carry it into the VM.
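The OPENAI_API_KEY mapping can be pictured with a short sketch. The function name forwarded_env is hypothetical, and filtering on the OPENCLAW_ prefix only stands in for Crabbox's forwarding behavior, which is assumed here rather than quoted from Crabbox.

```python
from typing import Mapping


def forwarded_env(local_env: Mapping[str, str]) -> dict[str, str]:
    """Return the OPENCLAW_* variables to forward, filling in the live OpenAI key."""
    env = {k: v for k, v in local_env.items() if k.startswith("OPENCLAW_")}
    # Map the conventional local key only when the live key is not already set.
    if not env.get("OPENCLAW_LIVE_OPENAI_KEY") and local_env.get("OPENAI_API_KEY"):
        env["OPENCLAW_LIVE_OPENAI_KEY"] = local_env["OPENAI_API_KEY"]
    return env
```

An explicitly set OPENCLAW_LIVE_OPENAI_KEY wins over the local OPENAI_API_KEY, which matches the "if only OPENAI_API_KEY is set" wording.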
With --gateway-setup --credential-source convex, Mantis leases the Slack SUT
credential from the shared pool before creating the VM and forwards the leased
channel id, Socket Mode app token, and bot token as the OPENCLAW_MANTIS_SLACK_*
runtime env inside the desktop. That keeps GitHub workflows thin: they only need
the Convex broker secret, not raw Slack bot or app tokens.
Useful Slack desktop flags:
- --lease-id <cbx_...> reruns against a machine where an operator already logged in to Slack Web through VNC.
- --gateway-setup starts a persistent OpenClaw Slack gateway in the VM instead of only running the bot-to-bot QA lane.
- --keep-lease keeps the gateway VM open for VNC inspection after success; --no-keep-lease stops it after collecting artifacts.
- --slack-url <url> opens a specific Slack Web URL. Without it, Mantis derives https://app.slack.com/client/<team>/<channel> from Slack auth.test when the SUT bot token is available.
- --slack-channel-id <id> controls the Slack channel allowlist used by gateway setup.
- OPENCLAW_MANTIS_SLACK_BROWSER_PROFILE_DIR controls the persistent Chrome profile inside the VM. The default is $HOME/.config/openclaw-mantis/slack-chrome-profile, so a manual Slack Web login survives reruns on the same lease.
- --credential-source convex --credential-role ci uses the shared credential pool instead of direct Slack env tokens.
- --provider-mode, --model, --alt-model, and --fast pass through to the Slack live lane.
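The Slack Web URL derivation can be sketched as follows. This only illustrates the URL shape given above: slack_web_url is a hypothetical helper, and it assumes the team id has already been read from a Slack auth.test response rather than calling the API itself.

```python
from typing import Optional
from urllib.parse import quote


def slack_web_url(team_id: str, channel_id: str, override: Optional[str] = None) -> str:
    """Prefer an explicit URL; otherwise build the client URL from team and channel ids."""
    if override:
        return override
    if not team_id or not channel_id:
        raise ValueError("team_id and channel_id are required without an explicit URL")
    return f"https://app.slack.com/client/{quote(team_id, safe='')}/{quote(channel_id, safe='')}"
```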
The GitHub smoke workflow is Mantis Discord Smoke. The before and after GitHub
workflow for the first real scenario is Mantis Discord Status Reactions. It
accepts:
- baseline_ref: the ref expected to reproduce queued-only behavior.
- candidate_ref: the ref expected to show queued -> thinking -> done.
The workflow runs discord-status-reactions-tool-only against each worktree, and
uploads baseline/, candidate/, comparison.json, and mantis-report.md as
Actions artifacts. It also renders each lane’s timeline HTML in a Crabbox
desktop browser and publishes those VNC screenshots beside the deterministic
timeline PNGs in the PR comment. The same PR comment embeds lightweight
motion-trimmed GIF previews generated by crabbox media preview, links to the
matching motion-trimmed MP4 clips, and keeps the full desktop MP4 files for deep
inspection. Screenshots stay inline for quick review. The workflow builds the
Crabbox CLI from
openclaw/crabbox main so it can use the current desktop/browser lease flags
before the next Crabbox binary release is cut.
Mantis Scenario is the generic manual entrypoint. It takes a scenario_id,
candidate_ref, optional baseline_ref, and optional pr_number, then
dispatches the scenario-owned workflow. The wrapper is intentionally thin:
scenario workflows still own their transport setup, credentials, VM class,
expected oracle, and artifact manifest.
Mantis Slack Desktop Smoke is the first Slack VM workflow. It checks out the
trusted candidate ref in a separate worktree, leases a Crabbox Linux desktop,
runs pnpm openclaw qa mantis slack-desktop-smoke --gateway-setup against that
candidate, opens Slack Web in the VNC browser, records the desktop, generates a
motion-trimmed preview with crabbox media preview, uploads the full artifact
directory, and optionally posts the inline evidence comment on the target PR.
It defaults to AWS for the desktop lease and exposes a manual provider input so
operators can switch to Hetzner when AWS capacity is slow or unavailable. Use
this lane when you want “a Linux desktop with Slack and a claw running” instead
of only a bot-to-bot Slack transcript.
Mantis Telegram Live wraps the existing Telegram live QA lane in the same PR
evidence pipeline. It checks out the trusted candidate ref in a separate
worktree, runs pnpm openclaw qa telegram --credential-source convex --credential-role ci, writes a mantis-evidence.json manifest from the
Telegram QA summary and observed-message artifact, renders the redacted
transcript HTML through a Crabbox desktop browser, generates a motion-trimmed GIF
with crabbox media preview, and posts the inline PR evidence comment when a PR
number is available. This lane is transcript-visual rather than logged-in
Telegram Web proof: the Telegram Bot API gives stable live message evidence, but
Telegram Web login state is not required for normal Mantis automation.
Mantis Telegram Desktop Proof is the agentic native Telegram Desktop
before/after wrapper. A maintainer can trigger it from a PR comment with
@Mantis telegram desktop proof, from the Actions UI with freeform
instructions, or through the generic Mantis Scenario dispatcher. The workflow
hands the PR, baseline ref, candidate ref, and maintainer instructions to Codex.
The agent reads the PR, decides what Telegram-visible behavior proves the
change, runs the real-user Crabbox Telegram Desktop proof lane for baseline and
candidate, iterates until the native GIFs are useful, writes paired
motionPreview artifacts into mantis-evidence.json, uploads the bundle, and
posts a 2-column PR evidence table when a PR number is available.
For human-in-the-loop Telegram desktop setup, use the scenario builder:
The builder starts openclaw gateway run
on port 38974, posts a driver-bot readiness message to the leased private
group, then captures a screenshot and MP4 from the visible VNC desktop. A bot
token never logs Telegram Desktop in; it only configures OpenClaw. The desktop
viewer is a separate Telegram user session restored from
--telegram-profile-archive-env <name> or created manually through VNC and kept
alive with --keep-lease.
Useful Telegram desktop builder flags:
- --lease-id <cbx_...> reruns against a VM where an operator already logged in to Telegram Desktop.
- --telegram-profile-archive-env <name> reads a base64 .tgz Telegram Desktop profile archive from that env var and restores it before launch.
- --telegram-profile-dir <remote-path> controls the remote Telegram Desktop profile directory. The default is $HOME/.local/share/TelegramDesktop.
- --no-gateway-setup installs and opens Telegram Desktop without configuring OpenClaw.
- --credential-source convex --credential-role ci uses the shared credential broker instead of direct Telegram env tokens.
Each scenario writes mantis-evidence.json next to its report.
This schema is the handoff between scenario code and GitHub comments:
path values are relative to the manifest directory. targetPath
values are relative paths under the qa-artifacts branch publish directory.
The publisher rejects path traversal and skips entries marked
"required": false when optional previews or videos are unavailable.
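The path rules above can be sketched in a few lines. This is an illustrative Python version under stated assumptions, not the publisher itself: safe_relative and resolve_artifacts are hypothetical names, the entry keys path, targetPath, and required come from the text, and symlink handling is left out.

```python
from pathlib import Path, PurePosixPath


def safe_relative(value: str) -> PurePosixPath:
    """Reject empty paths, absolute paths, and any '..' segment."""
    candidate = PurePosixPath(value)
    if not value or candidate.is_absolute() or ".." in candidate.parts:
        raise ValueError(f"unsafe path: {value!r}")
    return candidate


def resolve_artifacts(manifest_dir: Path, artifacts: list[dict]) -> list[tuple[Path, PurePosixPath]]:
    """Pair each source file with its publish target, skipping missing optional entries."""
    resolved = []
    for entry in artifacts:
        source = manifest_dir / safe_relative(entry["path"])
        target = safe_relative(entry["targetPath"])
        if not source.is_file():
            if entry.get("required", True):
                raise FileNotFoundError(f"required artifact missing: {entry['path']}")
            continue  # optional preview or video was not produced
        resolved.append((source, target))
    return resolved
```

Treating a missing required key as required=True is a conservative assumption: an entry has to opt out explicitly before the publisher is allowed to skip it.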
Supported artifact kinds:
- timeline: deterministic scenario screenshot, usually before/after.
- desktopScreenshot: VNC/browser desktop screenshot.
- motionPreview: inline animated GIF generated from the desktop recording.
- motionClip: motion-trimmed MP4 that removes static lead-in and tail.
- fullVideo: full MP4 recording for deep inspection.
- metadata: JSON/log sidecar.
- report: Markdown report.
The publisher is scripts/mantis/publish-pr-evidence.mjs. Workflows
call it with the manifest, target PR, qa-artifacts target root, comment marker,
Actions artifact URL, run URL, and request source. It copies declared artifacts
to the qa-artifacts branch, builds a summary-first PR comment with inline
images/previews and linked videos, then updates the existing marker comment or
creates one.
You can also trigger the status-reactions run directly from a PR comment:
telegram-status-command. Maintainers can override candidate=...,
provider=aws|hetzner, and lease=<cbx_...> when they need a specific ref or a
pre-warmed Crabbox desktop.
ClawSweeper command examples:
Run lifecycle
- Acquire credentials.
- Allocate or reuse a VM.
- Prepare the desktop/browser profile when the scenario needs UI evidence.
- Prepare a clean checkout for the baseline ref.
- Install dependencies and build only what the scenario needs.
- Start a child OpenClaw Gateway with an isolated state directory.
- Configure the live transport, provider, model, and browser profile.
- Run the scenario and capture baseline evidence.
- Stop the gateway and preserve logs.
- Prepare the candidate ref in the same VM.
- Run the same scenario and capture candidate evidence.
- Compare the oracle results and visual evidence.
- Write Markdown, JSON, logs, screenshots, and optional trace artifacts.
- Upload GitHub Actions artifacts.
- Post a concise PR or Discord status message.
Report outcomes with distinct labels:
- Bug reproduced: baseline failed in the expected way.
- Harness failure: environment setup, credentials, Discord API, browser, or provider failed before the bug oracle was meaningful.
Discord MVP
The first scenario should target Discord status reactions in guild channels where the source reply delivery mode is message_tool_only.
Why it is a good Mantis seed:
- It is visible in Discord as reactions on the triggering message.
- It has a strong REST oracle through Discord message reaction state.
- It exercises a real OpenClaw Gateway, Discord bot auth, message dispatch, source reply delivery mode, status reaction state, and model turn lifecycle.
- It is narrow enough to keep the first implementation honest.
messages.statusReactions.enabled is explicitly
true.
The executable first slice is the opt-in Discord live QA scenario:
The scenario configures visibleReplies: "message_tool", ackReaction: "👀", and explicit status reactions. The oracle
polls the real Discord triggering message and expects the observed sequence
👀 -> 🤔 -> 👍. Artifacts include discord-qa-reaction-timelines.json,
discord-status-reactions-tool-only-timeline.html, and
discord-status-reactions-tool-only-timeline.png.
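The sequence oracle can be sketched as an in-order match over polled reaction samples. This is a simplified illustration, not the scenario's actual oracle: collapse and sequence_observed are hypothetical names, the samples are assumed to be the emoji seen on the triggering message at each poll, and matching as an ordered subsequence (rather than an exact timeline) is an assumption.

```python
from typing import Iterable, Sequence

EXPECTED = ("👀", "🤔", "👍")


def collapse(samples: Iterable[str]) -> list[str]:
    """Drop consecutive duplicates so repeated polls of one state count once."""
    timeline: list[str] = []
    for emoji in samples:
        if not timeline or timeline[-1] != emoji:
            timeline.append(emoji)
    return timeline


def sequence_observed(samples: Iterable[str], expected: Sequence[str] = EXPECTED) -> bool:
    """True when the expected states appear in order within the polled samples."""
    remaining = iter(collapse(samples))
    # Each expected state consumes the shared iterator, which enforces ordering.
    return all(any(seen == want for seen in remaining) for want in expected)
```

With this check, a queued-only baseline such as ["👀", "👀", "👀"] fails, while a candidate that polls ["👀", "🤔", "🤔", "👍"] passes.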
Existing QA pieces
Mantis should build on the existing private QA stack instead of starting from zero:
- pnpm openclaw qa discord already runs a live Discord lane with driver and SUT bots.
- The live transport runner already writes reports and observed-message artifacts under .artifacts/qa-e2e/.
- Convex credential leases already provide exclusive access to shared live transport credentials.
- The browser control service already supports screenshots, snapshots, headless managed profiles, and remote CDP profiles.
- QA Lab already has a debugger UI and bus for transport-shaped testing.
Evidence model
Every run writes a stable artifact directory:
mantis-summary.json should be the machine-readable source of truth. The
Markdown report is for PR comments and human review.
The summary must include:
- refs and SHAs tested
- transport and scenario id
- machine provider and machine id or lease id
- credential source without secret values
- baseline result
- candidate result
- whether the bug reproduced on baseline
- whether the candidate fixed it
- artifact paths
- sanitized setup or cleanup issues
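A completeness check for the summary could look like the sketch below. The key names are hypothetical, chosen one per bullet in the list above; the actual mantis-summary.json schema is not shown in this document.

```python
import json

# Hypothetical key names, one per required item in the list above.
REQUIRED_KEYS = (
    "refs", "transport", "scenarioId", "machine", "credentialSource",
    "baseline", "candidate", "bugReproduced", "candidateFixed",
    "artifacts", "issues",
)


def missing_summary_keys(summary: dict) -> list[str]:
    """Return required keys that are absent from a parsed summary, in declaration order."""
    return [key for key in REQUIRED_KEYS if key not in summary]


def load_summary(text: str) -> dict:
    """Parse summary JSON and fail loudly when a required field is missing."""
    summary = json.loads(text)
    missing = missing_summary_keys(summary)
    if missing:
        raise ValueError(f"summary is missing: {', '.join(missing)}")
    return summary
```

Failing on a missing field, instead of defaulting it, keeps the JSON usable as the source of truth for the Markdown report and PR comment.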
Browser and VNC
The browser lane has two modes:
- Headless automation: default for CI. Chrome runs with CDP enabled, and Playwright or OpenClaw browser control captures screenshots.
- VNC rescue: enabled on the same VM when login, MFA, Discord anti-automation, or visual debugging needs a human.
The operator notification for a blocked run should include:
- run id
- scenario id
- machine provider
- artifact directory
- VNC or noVNC connection instructions if available
- short blocker text
Machines
Mantis should prefer AWS through Crabbox for the first remote implementation. Crabbox gives us warmed machines, lease tracking, hydration, logs, results, and cleanup. If AWS capacity is too slow or unavailable, add a Hetzner provider behind the same machine interface.
Minimum VM requirements:
- Linux with a desktop-capable Chrome or Chromium install
- CDP access for browser automation
- VNC or noVNC for rescue
- Node 22 and pnpm
- OpenClaw checkout and dependency cache
- Playwright Chromium browser cache when Playwright is used
- enough CPU and memory for one OpenClaw Gateway, one browser, and one model run
- outbound access to Discord, GitHub, model providers, and the credential broker
Secrets
Secrets live in GitHub organization or repository secrets for remote runs, and in a local operator-controlled secret file for local runs.
Recommended secret names:
- OPENCLAW_QA_DISCORD_MANTIS_BOT_TOKEN
- OPENCLAW_QA_DISCORD_DRIVER_BOT_TOKEN
- OPENCLAW_QA_DISCORD_SUT_BOT_TOKEN
- OPENCLAW_QA_DISCORD_GUILD_ID
- OPENCLAW_QA_DISCORD_CHANNEL_ID
- OPENCLAW_QA_DISCORD_NOTIFY_CHANNEL_ID
- OPENCLAW_QA_REDACT_PUBLIC_METADATA=1 for public GitHub artifact uploads
- OPENCLAW_QA_CONVEX_SITE_URL
- OPENCLAW_QA_CONVEX_SECRET_CI
- OPENCLAW_QA_MANTIS_CRABBOX_COORDINATOR
- OPENCLAW_QA_MANTIS_CRABBOX_COORDINATOR_TOKEN
The OPENCLAW_QA_MANTIS_CRABBOX_* secrets map to the CRABBOX_COORDINATOR and CRABBOX_COORDINATOR_TOKEN environment variables
that the Crabbox CLI expects. The plain CRABBOX_* GitHub secret names remain
accepted as a compatibility fallback.
The Mantis runner must never print:
- Discord bot tokens
- provider API keys
- browser cookies
- auth profile contents
- VNC passwords
- raw credential payloads
Public GitHub artifact uploads set OPENCLAW_QA_REDACT_PUBLIC_METADATA=1 for this reason.
If a token is accidentally pasted into an issue, PR, chat, or log, rotate it
after the new secret has been stored.
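One simple way to keep known secret values out of logs is to scrub them before anything is written. The sketch below is a minimal illustration with a hypothetical redact helper; it only catches exact matches of values the runner already knows, so it complements rather than replaces careful logging.

```python
def redact(text: str, secrets: list[str], placeholder: str = "[REDACTED]") -> str:
    """Replace every known secret value in text with a placeholder."""
    # Longest first, so a secret that contains another one is replaced whole.
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        text = text.replace(secret, placeholder)
    return text
```

A runner could pass each log line and report body through a function like this, with the secret list built from the credential values it loaded at startup.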
GitHub artifacts and PR comments
Mantis workflows should upload the full evidence bundle as a short-lived Actions artifact. When the workflow is run for a bug report or fix PR, it should also publish the redacted PNG screenshots to the qa-artifacts branch and upsert a
comment on that bug or fix PR with inline before/after screenshots. Do not post
the primary proof only on a generic QA automation PR. Raw logs, observed
messages, and other bulky evidence stay in the Actions artifact.
Production workflows should post those comments with the Mantis GitHub App, not
with github-actions[bot]. Store the app id and private key as
MANTIS_GITHUB_APP_ID and MANTIS_GITHUB_APP_PRIVATE_KEY GitHub Actions
secrets. The workflow uses a hidden marker as the upsert key, updates that
comment when the token can edit it, and creates a new Mantis-owned comment when
an older bot-owned marker cannot be edited.
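The marker-based upsert decision can be sketched as a pure function over the existing comments. This is a simplified model, not the workflow's code: Comment and plan_upsert are hypothetical names, and "editable" is approximated as "authored by the app's own login", which is an assumption about how the token's permissions behave.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Comment:
    id: int
    author: str
    body: str


def plan_upsert(comments: list[Comment], marker: str, app_login: str) -> tuple[str, Optional[int]]:
    """Return ("update", id) for an editable marker comment, else ("create", None)."""
    for comment in comments:
        # Assumption: only a comment written by this app can be edited with its token.
        if marker in comment.body and comment.author == app_login:
            return ("update", comment.id)
    return ("create", None)
```

Keeping the decision separate from the API calls makes the two branches easy to test: an older marker comment owned by another bot leads to "create", while a Mantis-owned one leads to "update".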
The PR comment should be short and visual:
Private deployment notes
A private deployment may already have a Mantis Discord application. Reuse that application instead of creating another app when it has the right bot permissions and can be safely rotated. Set the initial operator notification channel through secrets or deployment configuration. It can point at an existing maintainer or operations channel first, then move to a dedicated Mantis channel once one exists. Do not put guild ids, channel ids, bot tokens, browser cookies, or VNC passwords in this document. Store them in GitHub secrets, the credential broker, or the operator’s local secret store.
Adding a scenario
A Mantis scenario should declare:
- id and title
- transport
- required credentials
- baseline ref policy
- candidate ref policy
- OpenClaw config patch
- setup steps
- stimulus
- expected baseline oracle
- expected candidate oracle
- visual capture targets
- timeout budget
- cleanup steps
Prefer deterministic oracles where possible:
- Discord reaction state for reaction bugs
- Discord message references for threading bugs
- Slack thread ts and reaction API state for Slack bugs
- email message ids and headers for email bugs
- browser screenshots when UI is the only reliable observable
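A scenario declaration could be modeled as a small immutable record. The sketch below is hypothetical throughout: the Scenario class, its field names, the default timeout, and the rule that the two oracles must differ are all assumptions made for illustration, mirroring the declared items listed above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Scenario:
    """Hypothetical declaration mirroring what a Mantis scenario should declare."""
    id: str
    title: str
    transport: str
    required_credentials: tuple[str, ...]
    baseline_ref_policy: str
    candidate_ref_policy: str
    config_patch: dict = field(default_factory=dict)
    setup_steps: tuple[str, ...] = ()
    stimulus: str = ""
    expected_baseline: str = ""
    expected_candidate: str = ""
    capture_targets: tuple[str, ...] = ()
    timeout_seconds: int = 300
    cleanup_steps: tuple[str, ...] = ()

    def problems(self) -> list[str]:
        """Return declaration errors; an empty list means the declaration looks usable."""
        found = []
        if not self.id or not self.title:
            found.append("id and title are required")
        if self.timeout_seconds <= 0:
            found.append("timeout budget must be positive")
        # Assumption: a before/after scenario needs two different expectations.
        if self.expected_baseline == self.expected_candidate:
            found.append("baseline and candidate oracles must differ")
        return found
```

Checking the declaration before any VM is leased keeps obvious mistakes, such as identical baseline and candidate expectations, from consuming live credentials.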
Provider expansion
After Discord, the same runner can add:
- Slack: reactions, threads, app mentions, modals, file uploads.
- Email: Gmail auth and message threading using gog where connectors are not enough.
- WhatsApp: QR login, re-identification, message delivery, media, reactions.
- Telegram: group mention gating, commands, reactions where available.
- Matrix: encrypted rooms, thread or reply relations, restart resume.
Open questions
- Which Discord bot should be the driver, and which should be the SUT, when the existing Mantis bot is reused?
- Should the observer browser login use a human Discord account, a test account, or only bot-readable REST evidence for the first phase?
- How long should GitHub retain Mantis artifacts for PRs?
- When should ClawSweeper automatically recommend Mantis instead of waiting for a maintainer command?
- Should screenshots be redacted or cropped before upload for public PRs?