Skip to content

feat(agent): guard todo completion transitions#2576

Merged
esengine merged 1 commit into
esengine:main-v2from
GTC2080:GTC/todo-completion-transitions
Jun 1, 2026
Merged

feat(agent): guard todo completion transitions#2576
esengine merged 1 commit into
esengine:main-v2from
GTC2080:GTC/todo-completion-transitions

Conversation

@GTC2080

@GTC2080 GTC2080 commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Related to #2572

Summary

  • Add a runtime-only receipt field for successful complete_step calls so the host can remember which todo step was signed off in the current turn.
  • Validate todo_write transitions against the latest successful todo_write receipt: a todo newly moved to completed must have a prior successful matching complete_step receipt.
  • Keep todo_write serial in batches so ledger-backed transition checks observe provider order.

Conflict check

  • Runtime-only harness change; avoids UI, auto-plan, multi-agent, goal, cache, MCP, and hook areas.
  • No prompt or tool schema changes.

Review evidence

  • No UI changes; screenshots not applicable.
  • No performance claims; cache hit rate, token consumption, and runtime metrics are not claimed for this correctness change.
  • No prompt or tool schema changes.

Verification

  • PASS: D:\Go\go\bin\go.exe test ./internal/evidence ./internal/tool/builtin ./internal/agent
  • PASS: D:\Go\go\bin\go.exe test ./internal/...
  • PASS: git diff --check

Notes

  • Local push used --no-verify after the Go validation above because the legacy hook still invokes npm and fails without package.json on main-v2.

@github-actions github-actions Bot added the v2 Go rewrite (1.x) — main-v2 branch, active development label Jun 1, 2026
@esengine

esengine commented Jun 1, 2026

Copy link
Copy Markdown
Owner

C:/Program Files/Git/e2e diff

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

🤖 e2e bot failed — see the run log.

@esengine

esengine commented Jun 1, 2026

Copy link
Copy Markdown
Owner

C:/Program Files/Git/e2e diff

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

🤖 Reasonix e2e — diff test-gen

Result: ❌ fail · 3 changed source file(s) across 3 package(s)

Metric Value
New test functions added 14
Test lines added +337
go test on affected pkgs fail
Differential (fails on pre-PR code) no
Non-test source touched by agent 0 file(s)
Cache hit 94%
Tokens (prompt / completion) 1,431,194 / 14,565
Model calls 40
Cost ¥ 0.1374

Packages: ./internal/agent, ./internal/evidence, ./internal/tool/builtin

go test output (tail)
--- FAIL: TestExecuteOneCompleteStepSuccessRecordsReceipt (0.00s)
    guards_test.go:338: successful complete_step should record a receipt
FAIL
FAIL	reasonix/internal/agent	0.347s
ok  	reasonix/internal/evidence	0.002s
ok  	reasonix/internal/tool/builtin	0.955s
FAIL

agent run note: exit status 1

Pass = the agent added ≥1 new test function, the affected packages' tests are green, AND those tests fail when the PR's source is reverted (so they genuinely cover the change).

agent built from main-v2 · PR head b73a9ce · triggered by @esengine

@esengine

esengine commented Jun 1, 2026

Copy link
Copy Markdown
Owner

C:/Program Files/Git/e2e diff

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

🤖 Reasonix e2e — diff test-gen

Result: ✅ pass · 3 changed source file(s) across 3 package(s)

Metric Value
New test functions added 5
Test lines added +155
go test on affected pkgs pass
Differential (fails on pre-PR code) yes
Non-test source touched by agent 0 file(s)
Cache hit 95%
Tokens (prompt / completion) 2,064,068 / 26,752
Model calls 56
Cost ¥ 0.1916

Packages: ./internal/agent, ./internal/evidence, ./internal/tool/builtin

Pass = the agent added ≥1 new test function, the affected packages' tests are green, AND those tests fail when the PR's source is reverted (so they genuinely cover the change).

agent built from main-v2 · PR head b73a9ce · triggered by @esengine

@esengine esengine merged commit 3ddc5ee into esengine:main-v2 Jun 1, 2026
3 checks passed
@esengine

esengine commented Jun 1, 2026

Copy link
Copy Markdown
Owner

C:/Program Files/Git/e2e diff x2

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

🤖 Reasonix e2e — diff test-gen

Result: ✅ pass · 3 changed source file(s) across 3 package(s)

Metric Value
New test functions added 9
Test lines added +181
go test on affected pkgs pass
Differential (fail on pre-PR code) 9/9 new tests
↳ pin by assertion / by compile only 0 / 9
Changed-line coverage 63% (67/106 changed lines)
go build ./... (regression) pass
Non-test source touched by agent 0 file(s)
Cache hit 95%
Tokens (prompt / completion) 500,573 / 8,736
Model calls 16
Cost ¥ 0.0507
Attempts 1 of up to 2 (passed)

Packages: ./internal/agent, ./internal/evidence, ./internal/tool/builtin

Per-test differential
Test Package Pins the change?
TestUnverifiedCompletedTodosNilLedger ./internal/evidence ⚠️ by compile only
TestUnverifiedCompletedTodosEmptyCurrent ./internal/evidence ⚠️ by compile only
TestUnverifiedCompletedTodosPartialAuthorization ./internal/evidence ⚠️ by compile only
TestUnverifiedCompletedTodosAlreadyCompletedByPosition ./internal/evidence ⚠️ by compile only
TestUnverifiedCompletedTodosAlreadyCompletedByIdentity ./internal/evidence ⚠️ by compile only
TestTodoWriteAcceptsPreviouslyCompleted ./internal/tool/builtin ⚠️ by compile only
TestTodoWriteRejectsMultipleNewCompletionsWithoutStep ./internal/tool/builtin ⚠️ by compile only
TestTodoWriteEmptyTodosSucceeds ./internal/tool/builtin ⚠️ by compile only
TestTodoWriteAllowsCompletionWithoutLedger ./internal/tool/builtin ⚠️ by compile only
Generated tests (review the assertions)
diff --git a/internal/evidence/evidence_test.go b/internal/evidence/evidence_test.go
index b346b2e2..c876037f 100644
--- a/internal/evidence/evidence_test.go
+++ b/internal/evidence/evidence_test.go
@@ -251,3 +251,117 @@ func TestLedgerNoBaselineDoesNotConstrainCompletedTodos(t *testing.T) {
 		t.Fatalf("no baseline should not report missing completions, got %+v", missing)
 	}
 }
+
+func TestUnverifiedCompletedTodosNilLedger(t *testing.T) {
+	var ledger *Ledger
+	missing, hasBaseline := ledger.UnverifiedCompletedTodos([]TodoItem{
+		{Content: "Add parser", Status: "completed"},
+	})
+
+	if hasBaseline {
+		t.Fatal("nil ledger should not report a baseline")
+	}
+	if len(missing) != 0 {
+		t.Fatalf("nil ledger should not report missing completions, got %+v", missing)
+	}
+}
+
+func TestUnverifiedCompletedTodosEmptyCurrent(t *testing.T) {
+	ledger := NewLedger()
+	ledger.Record(Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos:    []TodoItem{{Content: "Add parser", Status: "in_progress"}},
+	})
+
+	missing, hasBaseline := ledger.UnverifiedCompletedTodos(nil)
+	if !hasBaseline {
+		t.Fatal("non-empty ledger should report a baseline even with empty current")
+	}
+	if len(missing) != 0 {
+		t.Fatalf("empty current todos should not report missing completions, got %+v", missing)
+	}
+
+	missing, hasBaseline = ledger.UnverifiedCompletedTodos([]TodoItem{})
+	if !hasBaseline {
+		t.Fatal("non-empty ledger should report a baseline even with empty slice")
+	}
+	if len(missing) != 0 {
+		t.Fatalf("empty slice should not report missing completions, got %+v", missing)
+	}
+}
+
+func TestUnverifiedCompletedTodosPartialAuthorization(t *testing.T) {
+	ledger := NewLedger()
+	ledger.Record(Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []TodoItem{
+			{Content: "Add parser", Status: "in_progress"},
+			{Content: "Wire parser", Status: "pending"},
+		},
+	})
+	// Only "Add parser" has a matching complete_step.
+	ledger.Record(Receipt{ToolName: "complete_step", Success: true, Step: "Add parser"})
+
+	current := []TodoItem{
+		{Content: "Add parser", Status: "completed"},
+		{Content: "Wire parser", Status: "completed"},
+	}
+	missing, hasBaseline := ledger.UnverifiedCompletedTodos(current)
+	if !hasBaseline {
+		t.Fatal("expected prior todo_write baseline")
+	}
+	if len(missing) != 1 || missing[0].Content != "Wire parser" {
+		t.Fatalf("missing = %+v, want only Wire parser", missing)
+	}
+}
+
+func TestUnverifiedCompletedTodosAlreadyCompletedByPosition(t *testing.T) {
+	ledger := NewLedger()
+	ledger.Record(Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []TodoItem{
+			{Content: "Add parser", Status: "completed"},
+			{Content: "Wire parser", Status: "in_progress"},
+		},
+	})
+
+	current := []TodoItem{
+		{Content: "Add parser", Status: "completed"},
+		{Content: "Wire parser", Status: "completed"},
+	}
+	// "Add parser" was already completed in baseline (same position, same content).
+	// "Wire parser" newly completed but no complete_step.
+	missing, hasBaseline := ledger.UnverifiedCompletedTodos(current)
+	if !hasBaseline {
+		t.Fatal("expected prior todo_write baseline")
+	}
+	if len(missing) != 1 || missing[0].Content != "Wire parser" {
+		t.Fatalf("missing = %+v, want only Wire parser", missing)
+	}
+}
+
+func TestUnverifiedCompletedTodosAlreadyCompletedByIdentity(t *testing.T) {
+	ledger := NewLedger()
+	ledger.Record(Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []TodoItem{
+			{Content: "Add parser", Status: "completed"},
+		},
+	})
+	// New todo_write moves "Add parser" to position 2, but its content matches.
+	current := []TodoItem{
+		{Content: "Setup project", Status: "in_progress"},
+		{Content: "Add parser", Status: "completed"},
+	}
+	missing, hasBaseline := ledger.UnverifiedCompletedTodos(current)
+	if !hasBaseline {
+		t.Fatal("expected prior todo_write baseline")
+	}
+	if len(missing) != 0 {
+		t.Fatalf("already-completed item should match by identity despite position change, missing = %+v", missing)
+	}
+}
diff --git a/internal/tool/builtin/todo_test.go b/internal/tool/builtin/todo_test.go
index f475c4d0..bba36a40 100644
--- a/internal/tool/builtin/todo_test.go
+++ b/internal/tool/builtin/todo_test.go
@@ -83,3 +83,70 @@ func TestTodoWriteIgnoresFailedCompleteStepReceipt(t *testing.T) {
 		t.Fatalf("failed complete_step should not authorize new completion, got %v", err)
 	}
 }
+
+func TestTodoWriteAcceptsPreviouslyCompleted(t *testing.T) {
+	ledger := evidence.NewLedger()
+	ledger.Record(evidence.Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos:    []evidence.TodoItem{{Content: "Add parser", Status: "completed"}},
+	})
+	ledger.Record(evidence.Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos:    []evidence.TodoItem{{Content: "Add parser", Status: "completed"}, {Content: "Wire parser", Status: "in_progress"}},
+	})
+	ctx := evidence.WithLedger(context.Background(), ledger)
+	args := json.RawMessage(`{"todos":[{"content":"Add parser","status":"completed"},{"content":"Wire parser","status":"completed"}]}`)
+
+	_, err := (todoWrite{}).Execute(ctx, args)
+	// "Add parser" was already completed before — allowed.
+	// "Wire parser" newly completed without a complete_step — rejected.
+	if err == nil || !strings.Contains(err.Error(), "Wire parser") {
+		t.Fatalf("expected rejection for newly completed 'Wire parser', got %v", err)
+	}
+}
+
+func TestTodoWriteRejectsMultipleNewCompletionsWithoutStep(t *testing.T) {
+	ledger := evidence.NewLedger()
+	ledger.Record(evidence.Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []evidence.TodoItem{
+			{Content: "Add parser", Status: "in_progress"},
+			{Content: "Wire parser", Status: "pending"},
+		},
+	})
+	ctx := evidence.WithLedger(context.Background(), ledger)
+	args := json.RawMessage(`{"todos":[
+		{"content":"Add parser","status":"completed"},
+		{"content":"Wire parser","status":"completed"}
+	]}`)
+
+	_, err := (todoWrite{}).Execute(ctx, args)
+	if err == nil {
+		t.Fatal("expected rejection when multiple todos newly completed without complete_step")
+	}
+	// Error should mention the count (plural form) or each item.
+	if !strings.Contains(err.Error(), "complete_step") {
+		t.Fatalf("error should mention complete_step, got %v", err)
+	}
+}
+
+func TestTodoWriteEmptyTodosSucceeds(t *testing.T) {
+	ctx := evidence.WithLedger(context.Background(), evidence.NewLedger())
+	args := json.RawMessage(`{"todos":[]}`)
+
+	_, err := (todoWrite{}).Execute(ctx, args)
+	if err != nil {
+		t.Fatalf("empty todos list should be accepted, got %v", err)
+	}
+}
+
+func TestTodoWriteAllowsCompletionWithoutLedger(t *testing.T) {
+	// No ledger in context — preserve loose behavior.
+	args := json.RawMessage(`{"todos":[{"content":"Add parser","status":"completed"}]}`)
+	if _, err := (todoWrite{}).Execute(context.Background(), args); err != nil {
+		t.Fatalf("completion without a ledger in context should be accepted, got %v", err)
+	}
+}

Pass = the agent added ≥1 test, the affected packages are green, AND ≥1 new test fails when the PR's source is reverted. "By assertion" pins are strong (they check changed behavior); "by compile only" pins just need a PR-added symbol — and since Go compiles per package, one compile-coupled test marks every test in its package that way, so read the diff above to judge the rest.

agent built from main-v2 · PR head b73a9ce · triggered by @esengine

@esengine

esengine commented Jun 1, 2026

Copy link
Copy Markdown
Owner

C:/Program Files/Git/e2e diff

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

🤖 Reasonix e2e — diff test-gen

Result: ✅ pass · 3 changed source file(s) across 3 package(s)

Metric Value
New test functions added 10
Test lines added +267
go test on affected pkgs pass
Differential (fail on pre-PR code) 8/10 new tests
↳ pin by assertion / by compile only 1 / 7
Changed-line coverage 59% (63/106 changed lines)
Mutation (changed funcs caught) 10/12 (83%) · survived: executeOne, ReceiptFromToolCall
go build ./... (regression) pass
Non-test source touched by agent 0 file(s)
Cache hit 95%
Tokens (prompt / completion) 981,729 / 14,811
Model calls 27
Cost ¥ 0.1014

Packages: ./internal/agent, ./internal/evidence, ./internal/tool/builtin

Single stochastic run — a green result is one sample, not a guarantee. Comment /e2e diff x3 to retry up to 3×.

Per-test differential
Test Package Pins the change?
TestPartitionToolCallsTodoWriteAndCompleteStepSerial ./internal/agent ✅ by assertion
TestExecuteOneCompleteStepRecordsReceiptOnSuccess ./internal/agent ❌ no (passes on old code)
TestExecuteOneCompleteStepDoesNotRecordReceiptOnError ./internal/agent ❌ no (passes on old code)
TestUnverifiedCompletedTodosPartialAuthorization ./internal/evidence ⚠️ by compile only
TestUnverifiedCompletedTodosAlreadyCompletedTransition ./internal/evidence ⚠️ by compile only
TestUnverifiedCompletedTodosIdentityByActiveForm ./internal/evidence ⚠️ by compile only
TestUnverifiedCompletedTodosMultipleMissing ./internal/evidence ⚠️ by compile only
TestTodoWriteRejectsMultipleNewCompletions ./internal/tool/builtin ⚠️ by compile only
TestTodoWriteAcceptsCompleteStepByNumber ./internal/tool/builtin ⚠️ by compile only
TestTodoWriteAcceptsCompleteStepByActiveForm ./internal/tool/builtin ⚠️ by compile only
Generated tests (review the assertions)
diff --git a/internal/agent/guards_test.go b/internal/agent/guards_test.go
index 26e5ce8d..3547261f 100644
--- a/internal/agent/guards_test.go
+++ b/internal/agent/guards_test.go
@@ -198,6 +198,33 @@ func TestPartitionToolCallsTodoWriteSerial(t *testing.T) {
 	}
 }
 
+func TestPartitionToolCallsTodoWriteAndCompleteStepSerial(t *testing.T) {
+	// Both todo_write and complete_step are special-cased as serial, even when
+	// read-only, so neither joins a parallel run with other read-only tools.
+	reg := tool.NewRegistry()
+	reg.Add(fakeTool{name: "grep", readOnly: true})
+	reg.Add(fakeTool{name: "todo_write", readOnly: true})
+	reg.Add(fakeTool{name: "complete_step", readOnly: true})
+	reg.Add(fakeTool{name: "read_file", readOnly: true})
+
+	calls := []provider.ToolCall{
+		{Name: "grep"},
+		{Name: "todo_write"},
+		{Name: "complete_step"},
+		{Name: "read_file"},
+	}
+	got := partitionToolCalls(reg, calls)
+	want := []toolCallBatch{
+		{start: 0, end: 1, parallel: true},
+		{start: 1, end: 2},
+		{start: 2, end: 3},
+		{start: 3, end: 4, parallel: true},
+	}
+	if !reflect.DeepEqual(got, want) {
+		t.Fatalf("partitionToolCalls = %+v, want %+v", got, want)
+	}
+}
+
 // TestExecuteBatchParallelReadOnly checks that three 80ms read-only calls
 // complete in well under 3×80ms — the wall-clock proof of true parallelism.
 func TestExecuteBatchParallelReadOnly(t *testing.T) {
@@ -310,6 +337,68 @@ func TestExecuteOneFailedReceiptDoesNotVerify(t *testing.T) {
 	}
 }
 
+func TestExecuteOneCompleteStepRecordsReceiptOnSuccess(t *testing.T) {
+	completeStep, ok := tool.LookupBuiltin("complete_step")
+	if !ok {
+		t.Fatal("complete_step builtin not registered")
+	}
+	todoWrite, ok := tool.LookupBuiltin("todo_write")
+	if !ok {
+		t.Fatal("todo_write builtin not registered")
+	}
+	reg := tool.NewRegistry()
+	reg.Add(completeStep)
+	reg.Add(todoWrite)
+	a := New(nil, reg, NewSession(""), Options{}, event.Discard)
+
+	// First establish a baseline: todo_write with "Add parser" in_progress.
+	a.executeOne(context.Background(), provider.ToolCall{
+		Name:      "todo_write",
+		Arguments: `{"todos":[{"content":"Add parser","status":"in_progress"}]}`,
+	})
+
+	// Then a successful complete_step for "Add parser".
+	_ = a.executeOne(context.Background(), provider.ToolCall{
+		Name: "complete_step",
+		Arguments: `{
+			"step":"Add parser",
+			"result":"parser added",
+			"evidence":[{"kind":"manual","summary":"checked"}]
+		}`,
+	})
+
+	// Now a todo_write that tries to mark "Add parser" as completed. Because the
+	// complete_step receipt was recorded on a.evidence, the completion should be
+	// authorized and the call must succeed.
+	out := a.executeOne(context.Background(), provider.ToolCall{
+		Name:      "todo_write",
+		Arguments: `{"todos":[{"content":"Add parser","status":"completed"}]}`,
+	})
+	if out.errMsg != "" {
+		t.Fatalf("todo_write rejected after successful complete_step: errMsg=%q, output=%q", out.errMsg, out.output)
+	}
+}
+
+func TestExecuteOneCompleteStepDoesNotRecordReceiptOnError(t *testing.T) {
+	// A complete_step call that returns an error (e.g. missing step field)
+	// must NOT record any receipt — failed complete_step receipts are silently
+	// dropped so they never accidentally authorize a todo completion.
+	reg := tool.NewRegistry()
+	reg.Add(fakeTool{name: "complete_step", readOnly: true, err: errors.New("validation failed")})
+	a := New(nil, reg, NewSession(""), Options{}, event.Discard)
+
+	a.executeOne(context.Background(), provider.ToolCall{
+		Name:      "complete_step",
+		Arguments: `{"step":"","result":"","evidence":[]}`,
+	})
+
+	// The ledger should be empty — no receipt recorded for the failed call.
+	_, ok := a.evidence.MatchLatestTodoStep("anything")
+	if ok {
+		t.Fatal("errored complete_step must not record a receipt")
+	}
+}
+
 func TestRunResetsEvidenceLedger(t *testing.T) {
 	reg := tool.NewRegistry()
 	reg.Add(fakeTool{name: "bash", readOnly: false})
diff --git a/internal/evidence/evidence_test.go b/internal/evidence/evidence_test.go
index b346b2e2..8e2d6359 100644
--- a/internal/evidence/evidence_test.go
+++ b/internal/evidence/evidence_test.go
@@ -238,6 +238,119 @@ func TestLedgerMatchesCompletionByActiveFormAndNumber(t *testing.T) {
 	}
 }
 
+func TestUnverifiedCompletedTodosPartialAuthorization(t *testing.T) {
+	ledger := NewLedger()
+	ledger.Record(Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []TodoItem{
+			{Content: "Add parser", Status: "in_progress"},
+			{Content: "Write tests", Status: "in_progress"},
+			{Content: "Document", Status: "pending"},
+		},
+	})
+
+	// Only "Add parser" has a matching complete_step receipt.
+	ledger.Record(Receipt{ToolName: "complete_step", Success: true, Step: "Add parser"})
+
+	current := []TodoItem{
+		{Content: "Add parser", Status: "completed"},
+		{Content: "Write tests", Status: "completed"},
+		{Content: "Document", Status: "pending"},
+	}
+	missing, hasBaseline := ledger.UnverifiedCompletedTodos(current)
+	if !hasBaseline {
+		t.Fatal("expected prior todo_write baseline")
+	}
+	if len(missing) != 1 || missing[0].Content != "Write tests" {
+		t.Fatalf("missing = %+v, want only Write tests (Add parser should be authorized)", missing)
+	}
+}
+
+func TestUnverifiedCompletedTodosAlreadyCompletedTransition(t *testing.T) {
+	// An item that was already completed in the baseline and remains completed
+	// in the current list should not be flagged as missing.
+	ledger := NewLedger()
+	ledger.Record(Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []TodoItem{
+			{Content: "Bootstrap", Status: "completed"},
+			{Content: "Add parser", Status: "in_progress"},
+		},
+	})
+
+	current := []TodoItem{
+		{Content: "Bootstrap", Status: "completed"},
+		{Content: "Add parser", Status: "completed"},
+	}
+	ledger.Record(Receipt{ToolName: "complete_step", Success: true, Step: "Add parser"})
+	missing, hasBaseline := ledger.UnverifiedCompletedTodos(current)
+	if !hasBaseline {
+		t.Fatal("expected prior todo_write baseline")
+	}
+	if len(missing) != 0 {
+		t.Fatalf("already-completed baseline item and authorized new completion should yield no missing, got %+v", missing)
+	}
+}
+
+func TestUnverifiedCompletedTodosIdentityByActiveForm(t *testing.T) {
+	// previousTodoCompleted must match by activeForm when position differs.
+	ledger := NewLedger()
+	ledger.Record(Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []TodoItem{
+			{Content: "Add parser", Status: "in_progress", ActiveForm: "Adding parser"},
+		},
+	})
+
+	current := []TodoItem{
+		{Content: "Add parser", Status: "completed", ActiveForm: "Adding parser"},
+	}
+	// complete_step matches by activeForm.
+	ledger.Record(Receipt{ToolName: "complete_step", Success: true, Step: "Adding parser"})
+	missing, hasBaseline := ledger.UnverifiedCompletedTodos(current)
+	if !hasBaseline {
+		t.Fatal("expected prior todo_write baseline")
+	}
+	if len(missing) != 0 {
+		t.Fatalf("matching by activeForm should authorize completion, missing = %+v", missing)
+	}
+}
+
+func TestUnverifiedCompletedTodosMultipleMissing(t *testing.T) {
+	ledger := NewLedger()
+	ledger.Record(Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []TodoItem{
+			{Content: "Task A", Status: "in_progress"},
+			{Content: "Task B", Status: "in_progress"},
+			{Content: "Task C", Status: "in_progress"},
+		},
+	})
+
+	current := []TodoItem{
+		{Content: "Task A", Status: "completed"},
+		{Content: "Task B", Status: "completed"},
+		{Content: "Task C", Status: "completed"},
+	}
+	// No complete_step at all — all three are missing.
+	missing, hasBaseline := ledger.UnverifiedCompletedTodos(current)
+	if !hasBaseline {
+		t.Fatal("expected prior todo_write baseline")
+	}
+	if len(missing) != 3 {
+		t.Fatalf("expected 3 missing completions, got %+v", missing)
+	}
+	for _, m := range missing {
+		if m.Status != "completed" {
+			t.Fatalf("missing item %d has status %q, want completed", m.Index, m.Status)
+		}
+	}
+}
+
 func TestLedgerNoBaselineDoesNotConstrainCompletedTodos(t *testing.T) {
 	ledger := NewLedger()
 	missing, hasBaseline := ledger.UnverifiedCompletedTodos([]TodoItem{
diff --git a/internal/tool/builtin/todo_test.go b/internal/tool/builtin/todo_test.go
index f475c4d0..b441543f 100644
--- a/internal/tool/builtin/todo_test.go
+++ b/internal/tool/builtin/todo_test.go
@@ -83,3 +83,68 @@ func TestTodoWriteIgnoresFailedCompleteStepReceipt(t *testing.T) {
 		t.Fatalf("failed complete_step should not authorize new completion, got %v", err)
 	}
 }
+
+func TestTodoWriteRejectsMultipleNewCompletions(t *testing.T) {
+	ledger := evidence.NewLedger()
+	ledger.Record(evidence.Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []evidence.TodoItem{
+			{Content: "Task A", Status: "in_progress"},
+			{Content: "Task B", Status: "in_progress"},
+		},
+	})
+	ctx := evidence.WithLedger(context.Background(), ledger)
+	args := json.RawMessage(`{"todos":[
+		{"content":"Task A","status":"completed"},
+		{"content":"Task B","status":"completed"}
+	]}`)
+
+	_, err := (todoWrite{}).Execute(ctx, args)
+	if err == nil {
+		t.Fatal("multiple new completions without complete_step should be rejected")
+	}
+	if !strings.Contains(err.Error(), "2 todos") {
+		t.Fatalf("error should mention count for multiple items, got %v", err)
+	}
+}
+
+func TestTodoWriteAcceptsCompleteStepByNumber(t *testing.T) {
+	ledger := evidence.NewLedger()
+	ledger.Record(evidence.Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []evidence.TodoItem{
+			{Content: "Add parser", Status: "in_progress"},
+			{Content: "Write tests", Status: "in_progress"},
+		},
+	})
+	ledger.Record(evidence.Receipt{ToolName: "complete_step", Success: true, Step: "1"})
+	ctx := evidence.WithLedger(context.Background(), ledger)
+	args := json.RawMessage(`{"todos":[
+		{"content":"Add parser","status":"completed"},
+		{"content":"Write tests","status":"in_progress"}
+	]}`)
+
+	if _, err := (todoWrite{}).Execute(ctx, args); err != nil {
+		t.Fatalf("complete_step by number should authorize completion: %v", err)
+	}
+}
+
+func TestTodoWriteAcceptsCompleteStepByActiveForm(t *testing.T) {
+	ledger := evidence.NewLedger()
+	ledger.Record(evidence.Receipt{
+		ToolName: "todo_write",
+		Success:  true,
+		Todos: []evidence.TodoItem{
+			{Content: "Add parser", Status: "in_progress", ActiveForm: "Adding parser"},
+		},
+	})
+	ledger.Record(evidence.Receipt{ToolName: "complete_step", Success: true, Step: "Adding parser"})
+	ctx := evidence.WithLedger(context.Background(), ledger)
+	args := json.RawMessage(`{"todos":[{"content":"Add parser","status":"completed","activeForm":"Adding parser"}]}`)
+
+	if _, err := (todoWrite{}).Execute(ctx, args); err != nil {
+		t.Fatalf("complete_step by activeForm should authorize completion: %v", err)
+	}
+}

Pass = the agent added ≥1 test, the affected packages are green, AND ≥1 new test fails when the PR's source is reverted. "By assertion" pins are strong (they check changed behavior); "by compile only" pins just need a PR-added symbol — and since Go compiles per package, one compile-coupled test marks every test in its package that way. Mutation is the behavioral signal for additive PRs: each changed function's return is replaced with zero values and the new tests are re-run; "caught" means a test asserts that output, "survived" means it doesn't. Read the generated tests above to judge the rest.

agent built from main-v2 · PR head b73a9ce · triggered by @esengine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v2 Go rewrite (1.x) — main-v2 branch, active development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants