Skip to content

Serialize TaskStore writes#7

Merged
tjb-tech merged 1 commit intoHKUDS:mainfrom
fancyboi999:fancy/taskstore-races
Mar 18, 2026
Merged

Serialize TaskStore writes#7
tjb-tech merged 1 commit intoHKUDS:mainfrom
fancyboi999:fancy/taskstore-races

Conversation

@fancyboi999
Copy link
Copy Markdown
Contributor

@fancyboi999 fancyboi999 commented Mar 18, 2026

Summary

  • serialize TaskStore mutations with a team-scoped file lock
  • re-read task state inside the lock before applying updates, so concurrent claim attempts cannot both succeed on stale snapshots
  • use unique temp files for task saves to avoid shared .tmp filename collisions during contention
  • add a regression test that races two worker processes against the same task and asserts only one can claim it

Root Cause

TaskStore.update() previously performed an unlocked read-modify-write cycle:

  • process A and process B could both read the same unlocked task
  • both would pass _acquire_lock() against their stale in-memory copy
  • both would then race while saving, so the later writer could overwrite the earlier lock state
  • under contention, both writers also targeted the same task-<id>.tmp path, which could produce FileNotFoundError while still leaving the loser's state on disk

In short, the implementation had atomic file replacement but not atomic state transitions.

What Changed

  • added tasks/<team>/.tasks.lock and wrapped create, update, and release_stale_locks in an exclusive flock
  • changed update() to reload the task only after the write lock is held
  • kept dependent resolution inside the same mutation boundary
  • replaced the shared task-<id>.tmp save path with unique temp files followed by atomic replace

Validation

  • python -m pytest -q
  • python -m compileall clawteam tests
  • ruff check clawteam/team/tasks.py tests/test_task_store_locking.py
  • manual repro before fix: two forked worker processes racing on the same task could produce one success plus one crash on the shared .tmp path, while the final locked_by still reflected the losing writer
  • manual repro after fix: one worker succeeds, the other gets TaskLockError, and the final task owner remains stable

@tjb-tech tjb-tech merged commit d51e632 into HKUDS:main Mar 18, 2026
7 checks passed
a24ibrah pushed a commit to a24ibrah/ClawTeam that referenced this pull request Mar 30, 2026
juntaochi added a commit to novix-science/ClawTeam-gstack that referenced this pull request Apr 20, 2026
…e validation

RED phase for Plan 02-08 cycle-detector + envelope-enforcement invariants:

tests/test_cycle_detector.py (7 tests):
- no-cycle on single message (negative case)
- 3 A->B + 3 B->A round-trips in last-20 window trip cycle (D-20/D-21)
- window bounded to 20 — older round-trips don't count
- topic-hash priority chain: dedupe_key -> request_id -> sha1(content[:128]) (D-19)
- suppressedTopics persist on disk; re-entry deduped as cycle_suppressed_existing
- Pitfall HKUDS#3: progressSignal entry breaks streak (skipif Plan 02-09 not merged)
- Pitfall HKUDS#7: sync emit — handler observes suppressedTopics on disk when fired

tests/test_transport_envelope.py (6 tests):
- _pre_deliver_hooks parses JSON bytes -> TeamMessage
- BC pass-through: no envelope fields -> no validation (Pitfall HKUDS#8)
- Partial envelope -> MalformedEnvelopeError
- Full valid envelope -> accepted
- 8 consecutive malformed -> DriftRegression (D-09)
- Valid envelope resets per-agent counter

RED confirmed: cycle_detector 4 failed + 2 passed + 1 skipped;
transport_envelope ImportError on _pre_deliver_hooks / _reset_drift_counters.
juntaochi added a commit to novix-science/ClawTeam-gstack that referenced this pull request Apr 20, 2026
…n Transport.deliver

GREEN phase — implements the two Phase 2 hook sites that break A<->B deadlock
loops and enforce the envelope protocol scope-limited for BC.

DefaultRoutingPolicy (clawteam/team/routing_policy.py):
  - decide() prefixed with cycle detection BEFORE existing throttle (D-18);
    reuses the existing recentEvents 50-entry bounded window (specifics
    lesson HKUDS#1, no parallel tracker).
  - _topic_hash(envelope): priority chain dedupe_key -> payload.request_id ->
    sha1(content[:128])[:16] (D-19).
  - _detect_cycle(state, envelope, topic_hash): scans last 20 recentEvents for
    >=3 A->B AND >=3 B->A with matching topic_hash (D-20); returns None when any
    entry carries progressSignal: True (Pitfall HKUDS#3 mitigation).
  - Cycle hit persists topic_hash onto routes[route_key].suppressedTopics, saves
    state BEFORE sync emit (Pitfall HKUDS#7), then emits CycleDetected with pair,
    topic_hash, route_keys, window_size=20.
  - Subsequent hits for same route+topic short-circuit with
    reason='cycle_suppressed_existing' (no re-emit, D-21).
  - _append_event gains topic_hash kwarg; pending/aggregated/cycle_suppressed
    entries now carry topicHash so future decide() calls can correlate.

Transport (clawteam/transport/base.py):
  - _pre_deliver_hooks(data) -> TeamMessage: parses JSON bytes, short-circuits
    when msg has none-of-three envelope fields (Pitfall HKUDS#8 BC pass-through),
    otherwise requires all three via TurnEnvelope.model_validate.
  - Per-agent module-level consecutive-malformed counter (D-09); threshold = 8
    consecutive strikes emits DriftRegression (sync). Any valid envelope resets
    the counter for that agent. MalformedEnvelope emitted on every strike.
  - _reset_drift_counters(): test + SprintConductor-teardown helper (Plan 02-11
    contract).
  - Transport ABC unchanged (public interface preserved).

FileTransport.deliver (clawteam/transport/file.py):
  - Invokes _pre_deliver_hooks at the top; MalformedEnvelopeError re-raised to
    caller so MailboxManager can route to dead-letter / surface to the agent.
  - p2p.py falls back to FileTransport.deliver so its delivery path inherits
    the hook automatically; no direct change needed there for v1.

Tests: 698 passed, 1 skipped (Plan 02-09 gate); ruff clean.
Regression matrix (12 templates) and test_runtime_routing.py both still green.
juntaochi added a commit to novix-science/ClawTeam-gstack that referenced this pull request Apr 20, 2026
Adds plan-level SUMMARY covering Task 1 (RED) + Task 2 (GREEN) for the
cycle detector in DefaultRoutingPolicy.decide() and the envelope-validation
pre-deliver hook in Transport.deliver.

Key outcomes recorded:
- D-18..D-21 cycle detection (tail-20 window, 3+ round-trip threshold)
- D-19 topic-hash priority chain (dedupe_key -> request_id -> sha1)
- Pitfall HKUDS#3 progressSignal short-circuit (populator lands in Plan 02-09)
- Pitfall HKUDS#7 sync emit after _save_state
- D-08/D-09 envelope enforcement at Transport.deliver() with Pitfall HKUDS#8 BC
- 8-strike DriftRegression + per-agent counter reset on valid envelope

Forward contracts for Plan 02-09 (progressSignal populator) and Plan 02-11
(SprintConductor _reset_drift_counters teardown) documented.
juntaochi added a commit to novix-science/ClawTeam-gstack that referenced this pull request Apr 21, 2026
…n tests

- EXISTING_TEMPLATES_PHASE3 canonical list + len==6 guard (T-09-02)
- gstack_plugin_loaded fixture: fresh GstackSprintPlugin per test, minimal
  fake ctx, no EvidenceSchemaRegistry touch (avoids test-ordering collision
  with tests/test_gstack_plugin.py)
- test_existing_template_spawn_unaffected_by_gstack_plugin[6 templates] —
  asserts TeamConfig.template != 'gstack', member roster != 11 gstack roles,
  leader_role != 'ceo' for each of software-dev/hedge-fund/code-review/
  harness-default/research-paper/strategy-room (Pitfall 14 + QUALITY-14 +
  success-criterion HKUDS#2)
- test_reflect_phase_does_not_trigger_phase6_pending_write_for_non_gstack
  [6 templates] — T-07-01 HIGH severity 2nd-layer mitigation anchor;
  runtime assertion that emitting a Reflect event for non-gstack teams
  never writes _phase6_pending/<sprint>-retro.json
- test_all_six_existing_templates_parse_with_gstack_loaded — aggregate
  assertion that each existing template's phases != gstack phases AND
  leader_role != 'ceo' (success-criterion HKUDS#7)

Full-suite delta: +13 new passing tests (958 -> 971 passed). 3 pre-existing
failures unchanged (documented in deferred-items.md).
juntaochi added a commit to novix-science/ClawTeam-gstack that referenced this pull request Apr 27, 2026
User request: "用tmux hook 把prompt注入进每一轮对话 包括身份和需要用的
skill 或者用tmux hook去做一些别的事情 总之tmux hook会是一个很好的技术".

Concept: agents drift over long sessions. CEO forgets to delegate (user
just hit this bug), engineer forgets which `clawteam` subcommands are
available, shipper forgets that `/ship` lives in the framework. A
periodic re-injection of "you are X; here's what you should use"
keeps them anchored.

Implementation:

1. `clawteam nudge <pane_id>` CLI command (new clawteam/nudge.py):
   - Looks up pane_id → session → team → agent via the persisted
     tmux_pane_map.json
   - Debounces (60s cooldown per pane) so repeated silence-hook firings
     don't spam the claude TUI
   - Reads the latest sprint state (current_phase + goal) for context
   - Builds a role-specific reminder text from _ROLE_NUDGES table
     (each role gets 2-3 lines: what you ARE + what commands you CAN
     run now). Appends "Goal: X · Phase: Y" when available.
   - Injects via `tmux send-keys -l <text>` + `send-keys Enter` so
     claude processes it as a user turn.
   - Silent-fail on any error (a noisy hook drowns the user's tmux).

2. tmux hook wiring (tmux_backend.py::_configure_nudge_hook):
   - `monitor-silence = 45s` on each window
   - `alert-silence` hook fires `clawteam nudge #{pane_id}` in the
     background (`run-shell -b`)
   - Opt-out via `CLAWTEAM_NUDGE_SECONDS=0` env var
   - Called from both `TmuxBackend.tile_panes()` (happy path) and
     `TmuxBackend.enable_mouse()` (the --windows non-tiled path) so
     every team gets nudge wiring regardless of layout choice.

3. pane_map normalization (tmux_backend.py::tile_panes):
   - Post-layout readback of pane_title returned claude's dynamic
     status strings like "⠐ CEO" / "✢ ENG-MGR" (spinner glyphs prefix).
   - Normalize by matching uppercase title content against the
     pre-merge window roster (roster_upper dict). Any title containing
     a known role (e.g. "⠐ CEO" contains "CEO") maps to the canonical
     lowercase role name.
   - pane_map.json now reliably stores {"0": "ceo", "1": "pm", ...}
     regardless of what claude's currently showing.

Real-env verified:
  clawteam go --no-attach --no-window "test"
  cat ~/.clawteam/teams/<team>/tmux_pane_map.json  → clean lowercase roster
  clawteam status                                    → Panes: 0=ceo 1=pm ...
  # wait for claude to boot
  clawteam nudge --force <pane_id>                   → injects role reminder
  tmux capture-pane                                  → sees "[nudge] Reminder:
                                                      you DELEGATE, never
                                                      implement. Decompose..."

Future uses of this hook infrastructure (brainstormed but not impl):
  - HKUDS#3 pane-died → log death reason, optional respawn
  - HKUDS#5 alert-activity → append pane output to activity.jsonl timeline
  - HKUDS#6 session-created → broadcast phase+goal to all panes
  - HKUDS#7 send-keys filter → intercept dangerous commands (rm -rf, force push)
  - HKUDS#10 pipe-pane → validate agent envelope JSON against schema

Tests: 30 solo/spawn_cli still green (nudge unit tests not added yet
— next commit candidate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants