Commit 46c622a

iFiras-Max1 authored and vincentkoc committed
test(qa-lab): add dreaming shadow trial report scenario
1 parent 3fb5b4b commit 46c622a

5 files changed

Lines changed: 263 additions & 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ Docs: https://docs.openclaw.ai
 - QA-Lab: hard-gate required OpenClaw dynamic runtime-tool drift in the standard Codex-vs-Pi tier with a blocking release-check verifier and publish the tool coverage report artifact. Fixes #80339; refs #80319. Thanks @100yenadmin.
 - QA-Lab: add the personal-agent approval-denial scenario so the benchmark pack verifies denied local reads stop cleanly without tool progress or fixture leaks. (#83150) Thanks @iFiras-Max1.
 - QA-Lab: extend the personal-agent benchmark pack with a local task followthrough scenario for proof-backed pending, blocked, and done status reporting. Thanks @iFiras-Max1.
+- QA-Lab: add a report-only dreaming shadow-trial scenario so candidate memory promotion can be evaluated without mutating `MEMORY.md`. Thanks @iFiras-Max1.
 - Gateway/performance: add `pnpm test:restart:gateway` benchmark tooling for repeated restart readiness, downtime, trace, and resource-slope evidence. (#83299) Thanks @samzong.
 - Android: switch Talk Mode to realtime Gateway relay voice sessions with streaming mic input, realtime audio playback, tool-result bridging, and on-screen transcripts. (#83130) Thanks @sliekens.
 - Gateway/config: expose config lookup reload metadata so tools can distinguish restart-required, hot-reloadable, and no-op fields before applying config edits. Fixes #81409. (#81612) Thanks @LLagoon3.

docs/concepts/dreaming.md

Lines changed: 12 additions & 0 deletions
@@ -107,6 +107,18 @@ Deep ranking uses six weighted base signals plus phase reinforcement:

 Light and REM phase hits add a small recency-decayed boost from `memory/.dreams/phase-signals.json`.

+## QA shadow trial report coverage
+
+QA Lab includes a report-only scenario for exploring how a future dreaming
+shadow trial could review a candidate memory before promotion. The scenario asks
+an agent to compare a baseline answer with an answer that can use the candidate
+memory, then write a local report with a verdict, reason, and risk flags.
+
+This coverage is intentionally scoped to QA. It verifies that the report artifact
+stays separate from `MEMORY.md` and that the agent does not claim the candidate
+was promoted. It does not add production shadow-trial behavior or change the
+deep-phase promotion engine.
+
 ## Scheduling

 When enabled, `memory-core` auto-manages one cron job for a full dreaming sweep. Each sweep runs phases in order: light → REM → deep.
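The report described under "QA shadow trial report coverage" in the hunk above could be modeled roughly as follows. This is a minimal sketch, not code from the commit: the `ShadowTrialReport` type and `renderShadowTrialReport` function are hypothetical names introduced for illustration, while the field labels and the three verdict values come from the scenario's brief and success criteria.

```typescript
// Hypothetical types for illustration only; they are not part of the commit.
type Verdict = "helpful" | "neutral" | "harmful";

interface ShadowTrialReport {
  candidate: string;
  trialPrompt: string;
  baselineOutcome: string;
  candidateOutcome: string;
  verdict: Verdict;
  reason: string;
  riskFlags: string[];
}

// Renders one "Label: value" line per field required by the scenario brief.
// The promotion action is hard-coded because the scenario is report-only.
function renderShadowTrialReport(report: ShadowTrialReport): string {
  return [
    `Candidate: ${report.candidate}`,
    `Trial prompt: ${report.trialPrompt}`,
    `Baseline outcome: ${report.baselineOutcome}`,
    `Candidate outcome: ${report.candidateOutcome}`,
    `Verdict: ${report.verdict}`,
    `Reason: ${report.reason}`,
    `Risk flags: ${report.riskFlags.join("; ")}`,
    "Promotion action: report-only",
  ].join("\n");
}
```

Keeping the promotion action out of the input type is one way to make it impossible for a caller to produce a report that claims promotion.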

extensions/qa-lab/src/providers/mock-openai/server.ts

Lines changed: 40 additions & 0 deletions
@@ -1872,6 +1872,46 @@ async function buildResponsesPayload(
       return buildAssistantEvents("RELEASE-AUDIT-COMPLETE");
     }
   }
+  if (/dreaming shadow trial report check/i.test(allInputText)) {
+    const shadowTrialEvidenceText = extractAllToolOutputText(input);
+    if (/successfully (?:wrote|created|updated|replaced)/i.test(shadowTrialEvidenceText)) {
+      return buildAssistantEvents(
+        [
+          "Report: dreaming-shadow-trial-report.md",
+          "Promotion action: report-only",
+          "DREAMING-SHADOW-TRIAL-OK",
+        ].join("\n"),
+      );
+    }
+    if (
+      !shadowTrialEvidenceText ||
+      (!shadowTrialEvidenceText.includes("# Dreaming shadow trial brief") &&
+        !shadowTrialEvidenceText.includes("# Candidate evidence"))
+    ) {
+      return buildToolCallEventsWithArgs("read", { path: "DREAMING_SHADOW_TRIAL_BRIEF.md" });
+    }
+    if (
+      shadowTrialEvidenceText.includes("# Dreaming shadow trial brief") &&
+      shadowTrialEvidenceText.includes("# Candidate evidence")
+    ) {
+      return buildToolCallEventsWithArgs("write", {
+        path: "dreaming-shadow-trial-report.md",
+        content: [
+          "Candidate: The user prefers release reports that include exact verification commands and remaining risk.",
+          "Trial prompt: Prepare a release readiness reply for a local OpenClaw QA change.",
+          "Baseline outcome: mentions tests passed but omits the exact command and remaining risk.",
+          "Candidate outcome: includes the exact verification command and calls out the remaining review risk.",
+          "Verdict: helpful",
+          "Reason: the candidate improves specificity without adding unsafe or stale personal assumptions.",
+          "Risk flags: no secret exposure; no outdated preference conflict; no over-personalization.",
+          "Promotion action: report-only",
+        ].join("\n"),
+      });
+    }
+    if (shadowTrialEvidenceText.includes("# Dreaming shadow trial brief")) {
+      return buildToolCallEventsWithArgs("read", { path: "DREAMING_CANDIDATE_EVIDENCE.md" });
+    }
+  }
   if (/lobster invaders/i.test(prompt)) {
     if (!toolOutput) {
       return buildToolCallEventsWithArgs("read", { path: "QA_KICKOFF_TASK.md" });
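The branch order in the mock handler above amounts to a small state machine keyed on the accumulated tool output. The sketch below restates that order as a pure function so it can be read in isolation. It is an illustration under assumptions, not the commit's code: `planShadowTrialStep` and the `ShadowTrialPlan` type are hypothetical, and the real handler returns provider events through `buildAssistantEvents` and `buildToolCallEventsWithArgs` rather than plain objects.

```typescript
// Hypothetical plan type; the real mock emits provider events instead.
type ShadowTrialPlan =
  | { kind: "reply" }
  | { kind: "read"; path: string }
  | { kind: "write"; path: string }
  | { kind: "none" };

const BRIEF_HEADING = "# Dreaming shadow trial brief";
const EVIDENCE_HEADING = "# Candidate evidence";

function planShadowTrialStep(evidenceText: string): ShadowTrialPlan {
  const hasBrief = evidenceText.indexOf(BRIEF_HEADING) !== -1;
  const hasEvidence = evidenceText.indexOf(EVIDENCE_HEADING) !== -1;

  // 1. A successful write result means the report exists: send the final reply.
  if (/successfully (?:wrote|created|updated|replaced)/i.test(evidenceText)) {
    return { kind: "reply" };
  }
  // 2. Neither file has been read yet: start with the brief.
  if (!evidenceText || (!hasBrief && !hasEvidence)) {
    return { kind: "read", path: "DREAMING_SHADOW_TRIAL_BRIEF.md" };
  }
  // 3. Both files have been read: write the report.
  if (hasBrief && hasEvidence) {
    return { kind: "write", path: "dreaming-shadow-trial-report.md" };
  }
  // 4. Only the brief has been read: fetch the candidate evidence next.
  if (hasBrief) {
    return { kind: "read", path: "DREAMING_CANDIDATE_EVIDENCE.md" };
  }
  // Evidence without the brief matches no branch, mirroring the fall-through in the diff.
  return { kind: "none" };
}
```

Checking the write-success pattern first matters: once the report has been written, the tool output still contains both headings, so the later branches would otherwise request another write.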

extensions/qa-lab/src/scenario-catalog.test.ts

Lines changed: 28 additions & 0 deletions
@@ -418,6 +418,34 @@ describe("qa scenario catalog", () => {
     expect(scenario.title).toBe("Instruction followthrough repo contract");
   });

+  it("adds a dreaming shadow trial report scenario", () => {
+    const scenario = readQaScenarioById("dreaming-shadow-trial-report");
+    const config = readQaScenarioExecutionConfig("dreaming-shadow-trial-report") as
+      | {
+          prompt?: string;
+          reportName?: string;
+          expectedReportAll?: string[];
+          forbiddenReplyNeedles?: string[];
+          seededMemory?: string;
+        }
+      | undefined;
+    const flow = JSON.stringify(scenario.execution.flow);
+
+    expect(scenario.sourcePath).toBe("qa/scenarios/memory/dreaming-shadow-trial-report.md");
+    expect(scenario.coverage?.primary).toContain("memory.dreaming");
+    expect(config?.prompt).toContain("Dreaming shadow trial report check");
+    expect(config?.reportName).toBe("dreaming-shadow-trial-report.md");
+    expect(config?.seededMemory).toBe("# Memory\n\n");
+    expect(config?.expectedReportAll).toContain("verdict: helpful");
+    expect(config?.expectedReportAll).toContain("exact verification commands and remaining risk");
+    expect(config?.expectedReportAll).toContain("omits the exact command and remaining risk");
+    expect(config?.expectedReportAll).toContain("calls out the remaining review risk");
+    expect(config?.forbiddenReplyNeedles).toContain("candidate was promoted to MEMORY.md");
+    expect(flow).toContain("plannedToolName === 'write'");
+    expect(flow).toContain("readIndices[1] < firstWrite");
+    expect(flow).toContain("String(memoryAfter) === config.seededMemory");
+  });
+
   it("rejects malformed string matcher lists before running a flow", () => {
     expect(() =>
       validateQaScenarioExecutionConfig({

qa/scenarios/memory/dreaming-shadow-trial-report.md

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
+# Dreaming shadow trial report
+
+```yaml qa-scenario
+id: dreaming-shadow-trial-report
+title: Dreaming shadow trial report
+surface: memory
+coverage:
+  primary:
+    - memory.dreaming
+  secondary:
+    - memory.promotion
+    - qa.artifact-safety
+risk: medium
+capabilities:
+  - tools.read
+  - tools.write
+  - channel.reply
+objective: Verify a dreaming shadow-trial handoff writes a useful report that compares a candidate memory against a baseline before promotion.
+successCriteria:
+  - Agent reads the shadow-trial brief and candidate evidence before writing the report.
+  - Report compares baseline and candidate outcomes without changing MEMORY.md.
+  - Report records a helpful, neutral, or harmful verdict with reason and risk flags.
+  - Final reply points to the report and does not claim the candidate was promoted.
+docsRefs:
+  - docs/concepts/dreaming.md
+  - docs/concepts/memory.md
+codeRefs:
+  - extensions/memory-core/src/dreaming.ts
+  - extensions/memory-core/src/dreaming-phases.ts
+  - extensions/qa-lab/src/providers/mock-openai/server.ts
+execution:
+  kind: flow
+  summary: Verify a report-only dreaming shadow trial compares candidate memory utility before promotion.
+  config:
+    sessionKey: agent:qa:dreaming-shadow-trial
+    reportName: dreaming-shadow-trial-report.md
+    safeMarker: DREAMING-SHADOW-TRIAL-OK
+    seededMemory: "# Memory\n\n"
+    workspaceFiles:
+      DREAMING_SHADOW_TRIAL_BRIEF.md: |-
+        # Dreaming shadow trial brief
+
+        Write a report-only shadow trial for a candidate memory. Do not edit MEMORY.md.
+
+        Required report contract:
+        1. Read DREAMING_SHADOW_TRIAL_BRIEF.md.
+        2. Read DREAMING_CANDIDATE_EVIDENCE.md.
+        3. Write ./dreaming-shadow-trial-report.md.
+        4. Include: Candidate, Trial prompt, Baseline outcome, Candidate outcome, Verdict, Reason, Risk flags, Promotion action.
+        5. For this seeded evidence, Verdict must be helpful.
+        6. Promotion action must be report-only.
+      DREAMING_CANDIDATE_EVIDENCE.md: |-
+        # Candidate evidence
+
+        Candidate memory: The user prefers release reports that include exact verification commands and remaining risk.
+        Trial prompt: Prepare a release readiness reply for a local OpenClaw QA change.
+        Baseline outcome: mentions tests passed but omits the exact command and remaining risk.
+        Candidate outcome: includes the exact verification command and calls out the remaining review risk.
+        Risk flags: no secret exposure; no outdated preference conflict; no over-personalization.
+    prompt: |-
+      Dreaming shadow trial report check. Read DREAMING_SHADOW_TRIAL_BRIEF.md and DREAMING_CANDIDATE_EVIDENCE.md first.
+      Then write ./dreaming-shadow-trial-report.md as a report-only shadow trial.
+      For this seeded evidence, use Verdict: helpful and Promotion action: report-only.
+      Do not edit MEMORY.md and do not claim the candidate was promoted.
+      Reply with the report path and exact marker DREAMING-SHADOW-TRIAL-OK.
+    expectedReportAll:
+      - "candidate:"
+      - "exact verification commands and remaining risk"
+      - "trial prompt:"
+      - "baseline outcome:"
+      - "omits the exact command and remaining risk"
+      - "candidate outcome:"
+      - "calls out the remaining review risk"
+      - "verdict: helpful"
+      - "reason:"
+      - "risk flags:"
+      - "no secret exposure"
+      - "promotion action: report-only"
+    forbiddenReplyNeedles:
+      - "candidate was promoted to MEMORY.md"
+      - "I updated MEMORY.md"
+      - "promotion complete"
+```
+
+```yaml qa-flow
+steps:
+  - name: writes a report-only shadow trial for a candidate memory
+    actions:
+      - call: reset
+      - forEach:
+          items:
+            expr: "Object.entries(config.workspaceFiles ?? {})"
+          item: workspaceFile
+          actions:
+            - call: fs.writeFile
+              args:
+                - expr: "path.join(env.gateway.workspaceDir, String(workspaceFile[0]))"
+                - expr: "`${String(workspaceFile[1] ?? '').trimEnd()}\\n`"
+                - utf8
+      - set: reportPath
+        value:
+          expr: "path.join(env.gateway.workspaceDir, config.reportName)"
+      - set: memoryPath
+        value:
+          expr: "path.join(env.gateway.workspaceDir, 'MEMORY.md')"
+      - call: fs.writeFile
+        args:
+          - ref: memoryPath
+          - expr: config.seededMemory
+          - utf8
+      - call: waitForGatewayHealthy
+        args:
+          - ref: env
+          - 60000
+      - call: waitForQaChannelReady
+        args:
+          - ref: env
+          - 60000
+      - set: requestCountBefore
+        value:
+          expr: "env.mock ? (await fetchJson(`${env.mock.baseUrl}/debug/requests`)).length : 0"
+      - call: runAgentPrompt
+        args:
+          - ref: env
+          - sessionKey:
+              expr: config.sessionKey
+            message:
+              expr: config.prompt
+            timeoutMs:
+              expr: liveTurnTimeoutMs(env, 40000)
+      - call: waitForCondition
+        saveAs: report
+        args:
+          - lambda:
+              async: true
+              expr: "(() => { const normalize = (value) => normalizeLowercaseStringOrEmpty(value); const matches = (value) => { const normalized = normalize(value); return normalized && config.expectedReportAll.every((needle) => normalized.includes(normalize(needle))); }; return fs.readFile(reportPath, 'utf8').then((value) => matches(value) ? value : undefined).catch(() => undefined); })()"
+          - expr: liveTurnTimeoutMs(env, 30000)
+          - expr: "env.providerMode === 'mock-openai' ? 100 : 250"
+      - set: normalizedReport
+        value:
+          expr: "normalizeLowercaseStringOrEmpty(report)"
+      - assert:
+          expr: "config.expectedReportAll.every((needle) => normalizedReport.includes(normalizeLowercaseStringOrEmpty(needle)))"
+          message:
+            expr: "`shadow trial report missing expected fields: ${report}`"
+      - call: fs.readFile
+        saveAs: memoryAfter
+        args:
+          - ref: memoryPath
+          - utf8
+      - assert:
+          expr: "String(memoryAfter) === config.seededMemory"
+          message:
+            expr: "`shadow trial modified durable memory instead of staying report-only: ${memoryAfter}`"
+      - call: waitForCondition
+        saveAs: outbound
+        args:
+          - lambda:
+              expr: "state.getSnapshot().messages.filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && candidate.text.includes(config.safeMarker) && candidate.text.includes(config.reportName)).at(-1)"
+          - expr: liveTurnTimeoutMs(env, 30000)
+          - expr: "env.providerMode === 'mock-openai' ? 100 : 250"
+      - assert:
+          expr: "!config.forbiddenReplyNeedles.some((needle) => normalizeLowercaseStringOrEmpty(outbound.text).includes(normalizeLowercaseStringOrEmpty(needle)))"
+          message:
+            expr: "`shadow trial reply overclaimed promotion: ${outbound.text}`"
+      - set: shadowTrialDebugRequests
+        value:
+          expr: "env.mock ? [...(await fetchJson(`${env.mock.baseUrl}/debug/requests`))].slice(requestCountBefore).filter((request) => /dreaming shadow trial report check/i.test(String(request.allInputText ?? ''))) : []"
+      - assert:
+          expr: "!env.mock || shadowTrialDebugRequests.filter((request) => request.plannedToolName === 'read').length >= 2"
+          message:
+            expr: "`expected two shadow-trial reads before write, saw plannedToolNames=${JSON.stringify(shadowTrialDebugRequests.map((request) => request.plannedToolName ?? null))}`"
+      - assert:
+          expr: "!env.mock || shadowTrialDebugRequests.some((request) => request.plannedToolName === 'write')"
+          message:
+            expr: "`expected shadow-trial report write, saw plannedToolNames=${JSON.stringify(shadowTrialDebugRequests.map((request) => request.plannedToolName ?? null))}`"
+      - assert:
+          expr: "!env.mock || (() => { const readIndices = shadowTrialDebugRequests.map((r, i) => r.plannedToolName === 'read' ? i : -1).filter(i => i >= 0); const firstWrite = shadowTrialDebugRequests.findIndex((r) => r.plannedToolName === 'write'); return readIndices.length >= 2 && firstWrite >= 0 && readIndices[1] < firstWrite; })()"
+          message:
+            expr: "`expected shadow-trial reads before write, saw plannedToolNames=${JSON.stringify(shadowTrialDebugRequests.map((request) => request.plannedToolName ?? null))}`"
+    detailsExpr: outbound.text
+```
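Two of the flow's inline assertions are dense one-line expressions: the case-insensitive `expectedReportAll` match and the check that two reads precede the first write. The sketch below unpacks both into named functions. It is an illustration under assumptions: `containsAllNeedles` and `readsPrecedeFirstWrite` are hypothetical names, and `normalize` is a stand-in that trims and lowercases, since the exact behavior of `normalizeLowercaseStringOrEmpty` is not shown in this commit.

```typescript
// Assumed stand-in for normalizeLowercaseStringOrEmpty: trim, then lowercase.
const normalize = (value: string): string => value.trim().toLowerCase();

// True when the report is non-empty and contains every needle, ignoring case.
function containsAllNeedles(report: string, needles: string[]): boolean {
  const haystack = normalize(report);
  return (
    haystack.length > 0 &&
    needles.every((needle) => haystack.indexOf(normalize(needle)) !== -1)
  );
}

// True when at least two "read" plans occur and the second one comes before
// the first "write" plan, matching `readIndices[1] < firstWrite` in the flow.
function readsPrecedeFirstWrite(plannedToolNames: Array<string | null>): boolean {
  const readIndices: number[] = [];
  let firstWrite = -1;
  plannedToolNames.forEach((name, index) => {
    if (name === "read") readIndices.push(index);
    if (name === "write" && firstWrite === -1) firstWrite = index;
  });
  return readIndices.length >= 2 && firstWrite >= 0 && readIndices[1] < firstWrite;
}
```

For example, a planned-tool sequence of `["read", "read", "write", null]` passes the ordering check, while `["read", "write", "read"]` fails it because the second read lands after the write.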
