feat(whatsapp): expand live QA coverage#90480

Merged
mcaxtr merged 4 commits into main from feat/whatsapp-qa-scenarios
Jun 8, 2026

mcaxtr commented Jun 4, 2026

Member

Summary

Expands the WhatsApp live QA lane from a small smoke set into a broader regression lane for the extension's transport, Gateway, and WhatsApp Web integration boundaries.

This PR also expands the WhatsApp QA driver/send surface and mock-provider support so the lane can exercise structured WhatsApp capabilities deterministically instead of relying on private test-only hooks or frontier-model wording.

What Changed

  • Expands the WhatsApp QA driver and active send API support for:
    • media captions and audio observations
    • quoted reply metadata
    • reactions
    • polls
    • contact cards
    • locations
    • stickers
    • stale-observation filtering with observedAfter
  • Adds QA Gateway/live-lane helpers for:
    • send
    • poll
    • message.action
    • workspace media fixtures
    • Gateway DM sends routed to the driver peer
    • observed-message recording and post-send assertions
    • redacted timeout diagnostics for unmatched WhatsApp observations
  • Adds deterministic mock-openai support for WhatsApp-specific scripted responses, including audio-preflight markers and long final/chunked response checks.
  • Updates WhatsApp inbound extraction and listener test harness coverage for structured live events used by the expanded lane.
  • Aligns shared live-transport baseline coverage so whatsapp-group-allowlist-block is the WhatsApp allowlist-block standard scenario.
  • Documents the expanded WhatsApp QA lane, scenario catalog, environment variables, and output artifacts in docs/concepts/qa-e2e-automation.md.
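
The stale-observation filtering mentioned above can be pictured with a minimal sketch. The `ObservedMessage` shape, the `observedAt` field, and the `filterObservedAfter` helper are assumptions made for illustration; only the `observedAfter` name comes from this PR, and the real driver types may differ.

```typescript
// Hypothetical shape of an observed message; the real driver type may differ.
interface ObservedMessage {
  id: string;
  text: string;
  observedAt: number; // epoch milliseconds when the driver saw the message
}

// Keep only observations recorded strictly after the cutoff, so messages left
// over from an earlier scenario cannot satisfy a later assertion.
function filterObservedAfter(
  messages: readonly ObservedMessage[],
  observedAfter: number,
): ObservedMessage[] {
  return messages.filter((message) => message.observedAt > observedAfter);
}

const observed: ObservedMessage[] = [
  { id: "m1", text: "stale reply", observedAt: 1000 },
  { id: "m2", text: "fresh reply", observedAt: 2500 },
];

// Assumed cutoff: the moment the scenario issued its send.
const sendStartedAt = 2000;
const fresh = filterObservedAfter(observed, sendStartedAt);
console.log(fresh.map((message) => message.id));
```

Whether the real comparison is strict or inclusive is not stated in the PR; the strict `>` here is a design choice of the sketch.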

WhatsApp QA Coverage

The WhatsApp QA catalog now covers 35 scenarios total.

Scenario families covered in this branch include:

  • Basic DM/group smoke and gating:
    • canary
    • pairing gate
    • mention gating
    • top-level reply shape
    • restart/resume
  • Native command UX:
    • /help
    • /status
    • /commands
    • /tools compact
    • /whoami
    • /context list
    • /new
  • Usage/final-output behavior:
    • tool-only usage footer
    • long streamed final message accounting
    • long quoted reply chunking
  • Reply/quote behavior:
    • reply-to mode quotes the triggering message
    • fresh Gateway sends do not reuse prior quote context
  • Inbound media/structured messages:
    • image caption
    • audio preflight transcript
    • document caption
    • poll
    • location
    • group audio gating
  • Outbound send/action behavior:
    • image
    • document
    • audio/voice
    • multi-media
    • document filename preservation
    • native poll
    • message.action react
    • message.action upload-file
  • Access control:
    • DM open
    • DM disabled
    • group open
    • group disabled
    • group allowlist block
  • Native approvals:
    • exec allow-once
    • exec deny
    • exec reaction approval
    • plugin allow-once
  • Status reactions:
    • observable WhatsApp status reactions

Contact-card and sticker send support are part of the expanded QA driver/send API surface, but they are covered by focused driver/send API tests rather than current live scenario IDs.

Default Lane Behavior

  • live-frontier remains intentionally small at 8 default scenarios for fast live smoke coverage.
  • mock-openai runs 29 deterministic WhatsApp scenarios by default.
  • Approval scenarios and a few heavier/blocking checks remain explicit-only unless selected by scenario ID.

The mocked-provider scenarios still run through the real WhatsApp transport; only the model/provider response is mocked so the lane can assert exact markers, chunk counts, and structured behavior without frontier-model variance.
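
As a rough illustration of per-provider defaults with explicit-only scenarios, the sketch below filters a tiny catalog by provider mode. The `ScenarioDefinition` shape and `selectScenarios` helper are hypothetical, the mode assignments for the first two entries are illustrative rather than taken from the real catalog, and the `standardId` part of the real selection is left out because the PR text does not spell it out.

```typescript
type ProviderMode = "live-frontier" | "mock-openai";

// Hypothetical catalog entry; the real catalog type may carry more fields.
interface ScenarioDefinition {
  id: string;
  defaultProviderModes?: readonly ProviderMode[];
}

const exampleCatalog: ScenarioDefinition[] = [
  // Illustrative assignments, not copied from the real catalog.
  { id: "whatsapp-canary", defaultProviderModes: ["live-frontier", "mock-openai"] },
  { id: "whatsapp-outbound-poll", defaultProviderModes: ["mock-openai"] },
  // Approval scenarios are explicit-only, so no default modes are listed.
  { id: "whatsapp-approval-exec-native" },
];

// Explicit scenario IDs win; otherwise fall back to the defaults for the mode.
function selectScenarios(
  catalog: readonly ScenarioDefinition[],
  mode: ProviderMode,
  explicitIds: readonly string[] = [],
): ScenarioDefinition[] {
  if (explicitIds.length > 0) {
    const wanted = new Set(explicitIds);
    return catalog.filter((scenario) => wanted.has(scenario.id));
  }
  return catalog.filter(
    (scenario) => scenario.defaultProviderModes?.includes(mode) ?? false,
  );
}

console.log(selectScenarios(exampleCatalog, "live-frontier").map((s) => s.id));
console.log(selectScenarios(exampleCatalog, "mock-openai").map((s) => s.id));
console.log(
  selectScenarios(exampleCatalog, "mock-openai", ["whatsapp-approval-exec-native"]).map((s) => s.id),
);
```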

Verification

WhatsApp QA proof on local rewritten HEAD:

  • Branch/head tested: feat/whatsapp-qa-scenarios @ 8b17055
  • Base: origin/main @ 6da3b1f
  • Command: OPENCLAW_QA_ALLOW_INSECURE_HTTP=1 OPENCLAW_QA_CONVEX_SITE_URL=http://127.0.0.1:<local-broker> OPENCLAW_QA_CONVEX_SECRET_MAINTAINER=<redacted> OPENCLAW_QA_WHATSAPP_GROUP_JID=<configured/redacted> pnpm openclaw qa whatsapp --credential-source convex --credential-role maintainer --provider-mode mock-openai --sut-account work --output-dir .artifacts/qa-e2e/whatsapp-full-current-8b17055ef7-20260605-213904 --scenario whatsapp-canary --scenario whatsapp-pairing-block --scenario whatsapp-mention-gating --scenario whatsapp-top-level-reply-shape --scenario whatsapp-restart-resume --scenario whatsapp-help-command --scenario whatsapp-status-command --scenario whatsapp-commands-command --scenario whatsapp-tools-compact-command --scenario whatsapp-whoami-command --scenario whatsapp-context-command --scenario whatsapp-tool-only-usage-footer --scenario whatsapp-reply-to-message --scenario whatsapp-reply-context-isolation --scenario whatsapp-inbound-image-caption --scenario whatsapp-audio-preflight --scenario whatsapp-outbound-media-matrix --scenario whatsapp-outbound-document-preserves-filename --scenario whatsapp-outbound-poll --scenario whatsapp-message-actions --scenario whatsapp-inbound-structured-messages --scenario whatsapp-group-audio-gating --scenario whatsapp-access-control-dm-open --scenario whatsapp-access-control-dm-disabled --scenario whatsapp-access-control-group-open --scenario whatsapp-access-control-group-disabled --scenario whatsapp-reply-delivery-shape --scenario whatsapp-stream-final-message-accounting --scenario whatsapp-native-new-command --scenario whatsapp-approval-exec-deny-native --scenario whatsapp-status-reactions --scenario whatsapp-group-allowlist-block --scenario whatsapp-approval-exec-native --scenario whatsapp-approval-exec-reaction-native --scenario whatsapp-approval-plugin-native
  • Provider mode: mock-openai
  • Driver account: default
  • SUT account: work
  • QA group: <configured/redacted>
  • Result: 35 passed / 0 failed / 0 skipped, discovered 35 scenarios
  • Artifacts:
    • .artifacts/qa-e2e/whatsapp-full-current-8b17055ef7-20260605-213904/whatsapp-qa-report.md
    • .artifacts/qa-e2e/whatsapp-full-current-8b17055ef7-20260605-213904/whatsapp-qa-summary.json
    • .artifacts/qa-e2e/whatsapp-full-current-8b17055ef7-20260605-213904/whatsapp-qa-observed-messages.json
    • .artifacts/qa-e2e/whatsapp-full-current-8b17055ef7-20260605-213904/gateway-debug/ (not created; preserved only on scenario failure)
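
Because the verification command repeats `--scenario` once per scenario ID, it can be easier to assemble the argument list programmatically. The sketch below is one way to do that. The `WhatsAppQaRunOptions` interface and `buildWhatsAppQaArgs` function are hypothetical names, the output directory is a placeholder, and only three of the 35 scenario IDs are shown. The flags themselves are the ones that appear in the command above.

```typescript
// Options mirror flags that appear in the verification command; the
// interface and function names here are hypothetical.
interface WhatsAppQaRunOptions {
  credentialSource: string;
  credentialRole: string;
  providerMode: "live-frontier" | "mock-openai";
  sutAccount: string;
  outputDir: string;
  scenarioIds: readonly string[];
}

// Build the arguments for `pnpm openclaw qa whatsapp`, repeating
// `--scenario` once per selected scenario ID.
function buildWhatsAppQaArgs(options: WhatsAppQaRunOptions): string[] {
  const args = [
    "openclaw", "qa", "whatsapp",
    "--credential-source", options.credentialSource,
    "--credential-role", options.credentialRole,
    "--provider-mode", options.providerMode,
    "--sut-account", options.sutAccount,
    "--output-dir", options.outputDir,
  ];
  for (const id of options.scenarioIds) {
    args.push("--scenario", id);
  }
  return args;
}

const qaArgs = buildWhatsAppQaArgs({
  credentialSource: "convex",
  credentialRole: "maintainer",
  providerMode: "mock-openai",
  sutAccount: "work",
  outputDir: ".artifacts/qa-e2e/example-run", // placeholder directory name
  scenarioIds: ["whatsapp-canary", "whatsapp-pairing-block", "whatsapp-mention-gating"],
});

console.log(["pnpm", ...qaArgs].join(" "));
```

The environment variables from the original command (for example `OPENCLAW_QA_WHATSAPP_GROUP_JID`) would still need to be set by the caller; the sketch only covers the flags.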

Focused local checks:

  • pnpm test extensions/qa-lab/src/providers/mock-openai/server.test.ts extensions/qa-lab/src/live-transports/whatsapp/whatsapp-live.runtime.test.ts extensions/whatsapp/src/qa-driver.runtime.test.ts extensions/whatsapp/src/inbound/send-api.test.ts
  • pnpm test extensions/qa-lab/src/live-transports/shared/live-gateway.runtime.test.ts
  • pnpm tsgo:extensions
  • pnpm lint:extensions
  • git diff --check origin/main..HEAD

openclaw-barnacle Bot added labels (Jun 4, 2026): docs (Improvements or additions to documentation); channel: whatsapp-web (Channel integration: whatsapp-web); extensions: qa-lab; size: XL; maintainer (Maintainer-authored PR)

clawsweeper Bot commented Jun 4, 2026

Contributor

Codex review: needs maintainer review before merge. Reviewed June 7, 2026, 10:32 PM ET / 02:32 UTC.

Summary
This PR expands the WhatsApp live QA lane, QA driver/send surface, mock-openai scripted responses, tests, and QA docs to cover 35 WhatsApp scenarios.

PR surface: Source +2443, Tests +1814, Docs +32. Total +4289 across 18 files.

Reproducibility: not applicable, as this is a feature PR. Current main lacks the new scenario IDs, while the PR body provides a real WhatsApp mock-openai lane result of 35 passed and focused local checks.

Review metrics: 1 noteworthy metric.

  • WhatsApp QA defaults: live-frontier 8 default scenarios; mock-openai 29 default scenarios; 35 total catalog. This is the maintainer-visible behavior change that can affect routine WhatsApp QA runtime and operator expectations before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
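
The "weaker of proof and patch quality" rule amounts to taking a minimum over an ordered scale. A minimal sketch follows, assuming the rank order given in the "What the crustacean ranks mean" section of this review. The `overallRank` helper is hypothetical and leaves out the "off-meta tidepool" case, which means the rating does not apply.

```typescript
// Ranks ordered from weakest to strongest, following the rank glossary in
// this review. "off-meta tidepool" (rating not applicable) is left out.
const RANKS = [
  "unranked krab",
  "silver shellfish",
  "gold shrimp",
  "platinum hermit",
  "diamond lobster",
  "challenger crab",
] as const;

type Rank = (typeof RANKS)[number];

// Overall readiness is the weaker of the two inputs.
function overallRank(proof: Rank, patchQuality: Rank): Rank {
  return RANKS.indexOf(proof) <= RANKS.indexOf(patchQuality) ? proof : patchQuality;
}

// Matches this review: both inputs are platinum hermit, so overall is too.
console.log(overallRank("platinum hermit", "platinum hermit"));
// Missing proof caps an otherwise strong patch.
console.log(overallRank("unranked krab", "diamond lobster"));
```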

Rank-up moves:

  • Let the relevant head checks finish before landing.
  • Confirm maintainers accept the mock-openai WhatsApp default lane expanding to 29 scenarios.

Risk before merge

  • [P1] The mock-openai WhatsApp default lane grows to 29 scenarios and still uses the real WhatsApp transport, so existing maintainer/operator QA runs may become longer or more sensitive to account and group-state stability.
  • [P1] The PR touches WhatsApp send and observed-message classification paths used by live channel behavior; no concrete bug surfaced, but missed Baileys shape drift could affect message-delivery proof quality.
  • [P1] The branch is maintainer-labeled and size XL, so landing should wait for owner review of the broad QA/default-lane tradeoff rather than treating green unit checks as the whole decision.

Maintainer options:

  1. Land after QA owner signoff (recommended)
    Maintain the expanded defaults and live-driver surface if WhatsApp/QA owners accept the longer deterministic lane and current head validation is satisfactory.
  2. Trim defaults before merge
    If the default mock-openai runtime is too broad, keep the new scenarios selectable but move heavier cases to explicit scenario IDs before landing.
  3. Pause for live-account stability
    If the WhatsApp credential pool or group state is not stable enough for this broader lane, hold the PR until the operator runbook and account lease behavior are ready.

Next step before merge

  • [P2] Protected maintainer PR with broad live QA/default-lane changes; the remaining action is owner review and validation, not a narrow automated repair.

Security
Cleared: No concrete security or supply-chain regression was found; the diff does not change dependencies, lockfiles, workflows, permissions, or secret handling beyond redacted QA artifact behavior.

Review details

Best possible solution:

Keep this PR as the canonical WhatsApp QA expansion, then land it only after maintainers accept the broader default lane and the relevant head checks/proof are satisfactory.

Do we have a high-confidence way to reproduce the issue?

Not applicable as a feature PR. Current main lacks the new scenario IDs, while the PR body provides a real WhatsApp mock-openai lane result of 35 passed and focused local checks.

Is this the best way to solve the issue?

Yes, the direction fits the owner boundary: QA Lab imports WhatsApp through @openclaw/whatsapp/api.js and the production listener contract stays narrow. The remaining decision is whether the broader default lane cost is acceptable.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4780546c124d.

Label changes

Label changes:

  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external contributor proof gate is not applicable because this is a maintainer/MEMBER PR, though the body includes live WhatsApp lane output and focused checks.
  • remove rating: 🌊 off-meta tidepool: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.

Label justifications:

  • P2: This is a normal-priority WhatsApp QA and extension improvement with limited product blast radius but meaningful review scope.
  • merge-risk: 🚨 automation: The PR changes the live WhatsApp QA lane, mock-provider scripting, scenario defaults, and artifacts that maintainer automation may rely on.
  • merge-risk: 🚨 compatibility: The PR changes default WhatsApp QA lane behavior for operators, especially the broader mock-openai default scenario set.
  • merge-risk: 🚨 message-delivery: The PR touches WhatsApp send API helpers and observed-message classification used to prove transport delivery.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Not applicable: The external contributor proof gate is not applicable because this is a maintainer/MEMBER PR, though the body includes live WhatsApp lane output and focused checks.
Evidence reviewed

PR surface:

Source +2443, Tests +1814, Docs +32. Total +4289 across 18 files.

PR surface stats:

Area       Files  Added  Removed  Net
Source        11   2645      202  +2443
Tests          6   1828       14  +1814
Docs           1     62       30  +32
Config         0      0        0  0
Generated      0      0        0  0
Other          0      0        0  0
Total         18   4535      246  +4289

What I checked:

  • Root policy read: Read the full root AGENTS.md and applied its maintainer/protected-label, extension-boundary, docs, and merge-risk guidance. (AGENTS.md:1, 4780546c124d)
  • Scoped policy read: Read the scoped extensions and docs AGENTS.md files; the PR stays within the extension public-boundary model and uses root-relative Mintlify docs links. (extensions/AGENTS.md:1, 4780546c124d)
  • Current main lacks the central scenario expansion: Current main has no matches for the new scenario IDs such as whatsapp-outbound-media-matrix, whatsapp-inbound-structured-messages, whatsapp-stream-final-message-accounting, or whatsapp-group-allowlist-block. (4780546c124d)
  • PR branch adds the expanded catalog: The PR head defines the expanded WhatsApp scenario union and WHATSAPP_QA_SCENARIOS catalog, including the new structured inbound, outbound media, stream-final, and allowlist scenarios. (extensions/qa-lab/src/live-transports/whatsapp/whatsapp-live.runtime.ts:55, 25e5c0bef7a8)
  • Default lane behavior changed: The PR selects default scenarios from standardId plus defaultProviderModes, making mock-openai the broader deterministic default while keeping live-frontier smaller. (extensions/qa-lab/src/live-transports/whatsapp/whatsapp-live.runtime.ts:1488, 25e5c0bef7a8)
  • WhatsApp send surface inspected: The PR adds contact, location, and sticker helpers through createWebSendApi while keeping ActiveWebListener narrowed to the existing sendMessage/sendPoll/sendReaction/sendComposingTo runtime surface. (extensions/whatsapp/src/inbound/send-api.ts:202, 25e5c0bef7a8)

Likely related people:

  • Vincent Koc: Current main blame and -S history for the WhatsApp QA lane, QA driver, send API, and mock provider all route through the recent baseline/import commits that currently own these files. (role: recent area contributor; confidence: medium; commits: 0f855ea71acc, 2e08f0f4221f; files: extensions/qa-lab/src/live-transports/whatsapp/whatsapp-live.runtime.ts, extensions/whatsapp/src/qa-driver.runtime.ts, extensions/whatsapp/src/inbound/send-api.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

mcaxtr force-pushed the feat/whatsapp-qa-scenarios branch from 2acaf5d to d86971d (June 4, 2026 23:41)
clawsweeper Bot changed labels (Jun 4, 2026):
  • added: rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.); status: ⏳ waiting on author (ClawSweeper has contributor-facing work open and is waiting for author action.); P2 (Normal backlog priority with limited blast radius.); merge-risk: 🚨 automation (🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation.)
mcaxtr force-pushed the feat/whatsapp-qa-scenarios branch from d86971d to 65f008d (June 4, 2026 23:55)
clawsweeper Bot changed labels (Jun 5, 2026):
  • added: rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.); status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
  • removed: rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.); status: ⏳ waiting on author (ClawSweeper has contributor-facing work open and is waiting for author action.)
mcaxtr force-pushed the feat/whatsapp-qa-scenarios branch 2 times, most recently from 0c3e0d4 to 373b051 (June 5, 2026 01:00)
clawsweeper Bot changed labels (Jun 5, 2026):
  • added: rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.); status: ⏳ waiting on author (ClawSweeper has contributor-facing work open and is waiting for author action.); rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.); status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
  • removed: rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.); status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.); rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.); status: ⏳ waiting on author (ClawSweeper has contributor-facing work open and is waiting for author action.)
mcaxtr force-pushed the feat/whatsapp-qa-scenarios branch from cff84f5 to acc68bb (June 5, 2026 05:10)
clawsweeper Bot changed labels (Jun 5, 2026):
  • added: rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.); proof: sufficient (ClawSweeper judged the real behavior proof convincing.)
  • removed: rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
clawsweeper Bot changed labels (Jun 6, 2026):
  • added: status: ⏳ waiting on author (ClawSweeper has contributor-facing work open and is waiting for author action.); mantis: telegram-visible-proof (Mantis should capture Telegram visible proof.); merge-risk: 🚨 message-delivery (🚨 May drop, duplicate, misroute, suppress, or wrongly target messages.)
  • removed: status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
clawsweeper Bot temporarily deployed to qa-live-shared (June 6, 2026 01:54; inactive)
mcaxtr force-pushed the feat/whatsapp-qa-scenarios branch from aedbe8a to 0157cd7 (June 6, 2026 02:01)
clawsweeper Bot changed labels (Jun 6, 2026):
  • added: rating: 🧂 unranked krab (Not merge-ready due to missing proof or serious correctness/safety concerns.)
  • removed: rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.); mantis: telegram-visible-proof (Mantis should capture Telegram visible proof.)
mcaxtr force-pushed the feat/whatsapp-qa-scenarios branch from 0157cd7 to 6ab719b (June 6, 2026 03:58)
clawsweeper Bot changed labels (Jun 6, 2026):
  • added: rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.); status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
  • removed: rating: 🧂 unranked krab (Not merge-ready due to missing proof or serious correctness/safety concerns.); status: ⏳ waiting on author (ClawSweeper has contributor-facing work open and is waiting for author action.)
mcaxtr force-pushed the feat/whatsapp-qa-scenarios branch 5 times, most recently from cc3945d to 4e1ec22 (June 8, 2026 00:23)
clawsweeper Bot changed labels (Jun 8, 2026):
  • added: rating: 🌊 off-meta tidepool (PR readiness rating does not apply to this item.)
  • removed: proof: sufficient (ClawSweeper judged the real behavior proof convincing.); rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)

mcaxtr commented Jun 8, 2026

Member Author

@clawsweeper re-review

clawsweeper Bot commented Jun 8, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

Labels

  • channel: whatsapp-web (Channel integration: whatsapp-web)
  • docs (Improvements or additions to documentation)
  • extensions: qa-lab
  • maintainer (Maintainer-authored PR)
  • merge-risk: 🚨 automation (🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation.)
  • merge-risk: 🚨 compatibility (🚨 May break existing users, config, migrations, defaults, or upgrade paths.)
  • merge-risk: 🚨 message-delivery (🚨 May drop, duplicate, misroute, suppress, or wrongly target messages.)
  • P2 (Normal backlog priority with limited blast radius.)
  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • size: XL
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
