Problem
The eval suite runs memory formation and recall cases sequentially without waiting for the async memory curation pipeline to finish processing. Since #410 moved memory formation to a background session observer, memories written in one eval case may not be available for recall in the next, because the checkpoint queue hasn't drained yet.
This was always latent (curation was always async via checkpoints), but the session observer architecture makes it more pronounced because formation happens entirely in the background rather than inline.
Observed
With a fresh database (#412 testing), memory recall evals fail because the curation pipeline hasn't processed any checkpoints yet. Even with an existing database, timing-dependent eval failures are possible if formation and recall cases run back-to-back.
Proposed fix
- Split eval phases: run all memory-write evals first, then all recall evals
- Add a drain wait: between phases, poll pending checkpoint count (netclaw status or equivalent) and wait for the queue to reach zero before proceeding
- Configurable timeout: NETCLAW_EVAL_DRAIN_TIMEOUT (default 60s). Fail loudly if checkpoints don't drain in time rather than silently running recall against stale data
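The three steps above could be sketched roughly as follows. This is a minimal illustration, not the eval suite's actual code: `wait_for_drain`, `run_eval_phases`, and the `get_pending` callable are hypothetical names, and `get_pending` stands in for however the pending checkpoint count is obtained (parsing netclaw status output or an equivalent query). It also assumes NETCLAW_EVAL_DRAIN_TIMEOUT holds a plain number of seconds.

```python
import os
import time
from typing import Callable, Iterable, Optional


def drain_timeout_seconds(default: float = 60.0) -> float:
    # Assumption: the variable holds a plain number of seconds, e.g. "60".
    raw = os.environ.get("NETCLAW_EVAL_DRAIN_TIMEOUT")
    return float(raw) if raw else default


def wait_for_drain(
    get_pending: Callable[[], int],
    timeout: float,
    poll_interval: float = 1.0,
) -> None:
    """Poll the pending checkpoint count until it reaches zero.

    get_pending is a hypothetical hook; a real one might wrap
    `netclaw status` or an equivalent query.
    """
    deadline = time.monotonic() + timeout
    while True:
        pending = get_pending()
        if pending == 0:
            return
        if time.monotonic() >= deadline:
            # Fail loudly instead of running recall against stale data.
            raise TimeoutError(
                f"{pending} checkpoint(s) still pending after {timeout}s"
            )
        time.sleep(poll_interval)


def run_eval_phases(
    write_cases: Iterable[Callable[[], None]],
    recall_cases: Iterable[Callable[[], None]],
    get_pending: Callable[[], int],
    timeout: Optional[float] = None,
    poll_interval: float = 1.0,
) -> None:
    # Phase 1: run every memory-write eval.
    for case in write_cases:
        case()
    # Drain wait between phases.
    if timeout is None:
        timeout = drain_timeout_seconds()
    wait_for_drain(get_pending, timeout, poll_interval)
    # Phase 2: run recall evals only once the queue is empty.
    for case in recall_cases:
        case()
```

Passing the pending-count lookup in as a callable keeps the wait logic independent of how the count is actually read, which also makes the timeout path easy to exercise in a unit test with a stub.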
Related