Skip to content

Execution plan#1

Merged
nayname merged 23 commits intomasterfrom
execution_governance
Mar 3, 2026
Merged

Execution plan#1
nayname merged 23 commits intomasterfrom
execution_governance

Conversation

@nayname
Copy link
Owner

@nayname nayname commented Feb 9, 2026

Summary

Introduces an explicit Execution Plan–driven execution engine in the governance layer.
The planner now produces a schema-defined execution-plan.json artifact, and execution is routed through a new engine.py, supporting both deterministic execution and LLM-guided (agent) execution. Middleware responsibilities are split between evaluation and execution.

This makes execution explicit, inspectable, and mode-selectable, without changing existing validation logic.


Motivation

Previously, execution semantics were implicit:

  • The planner produced internal structures rather than a stable artifact
  • Execution was tightly coupled to LLM tool calling

This change introduces a plan-first execution boundary:

Nothing executes unless it exists in an explicit execution plan.


What Changed

Execution Plan Schema

Introduced a schema-defined execution-plan.json that represents the complete execution setup

The schema captures:

  • execution intent
  • ordered operations
  • execution constraints
  • execution mode (engine vs agent)

This artifact is now the canonical contract between planning and execution

Planner → Execution Plan Artifact

  • Planner now produces a schema-conformant JSON instance
  • The plan is the single source of truth for execution intent
  • Validation logic is left unchanged and continues to operate as-is

New Governance Engine

New file:

  • src/governance/engine.py

Responsibilities:

  • Execute a plan according to its declared execution mode
  • Act as the single execution authority
  • Support both:
    • Engine mode — tools executed outside the LLM (execution layer)
    • Agent mode — plan injected into LLM context (no tool execution)

Middleware Refactor (Evaluate vs Execute)

Middleware responsibilities are now explicitly split:

  • evaluate(...)
    • Planning
    • Validation
    • Execution mode selection
    • No side effects
  • execute(...)
    • Plan execution via engine.py
    • Tool invocation or agent-context generation
    • All side effects live here

This removes execution logic from evaluation and prevents accidental execution during planning or validation.


Execution Modes

Execution mode is selected from plan constraints:

"constraints": {
"executionMode": "engine" | "agent"
}

Engine Mode (default)

  • Tools executed deterministically outside the LLM
  • Governed by the execution engine
  • Suitable for auditable, replayable execution

Agent Mode

  • Plan is presented to the LLM as structured context
  • No tools are executed by the engine
  • Suitable for guided reasoning or interactive flows

What Did NOT Change

  • Existing policy enforcement semantics remain intact
  • No breaking changes to existing governance rules
  • No changes to tool implementations

This is a purely additive and structural change.


Architecture

New Flow

evaluate()
→ Planner (execution-plan.json)
→ Validator (unchanged)
→ Execution decision

execute()
→ Engine
→ Tools (engine mode)
→ LLM context injection (agent mode)


Why This Matters

  • Makes execution explicit and inspectable
  • Enables plan persistence, replay, and audit
  • Supports multiple execution strategies without branching logic in middleware
  • Aligns governance with a compiler/runtime model rather than agent heuristics

Breaking Changes

None.

This change is additive and preserves existing behavior by default (engine mode).

@yi-john-huang
Copy link

The following issues are detected both Claude Code and OpenAI Codex and reviewed by me as well.

Critical Issues (Blocks Merge)

1. Missing file: engine_adapter.py — ImportError on startup

middleware.py:21 imports:

from src.governance.engine_adapter import adapt_execution_plan

This file does not exist in the PR or the fork. The proxy will crash at import time.

2. await inside non-async function — SyntaxError

middleware.py:247 (the new execute() method):

def execute(self, evaluation, session, tool_executor):
    ...
    state = await engine.execute(...)  # SyntaxError: 'await' outside async function

The method is defined with def, not async def.

3. Breaking constructor: PlanGenerator.__init__ now requires llm

planner.py:77:

def __init__(self, patterns_path: str, llm) -> None:

But GovernanceMiddleware.__init__ (line 87) still calls:

self._planner = PlanGenerator(patterns_path)  # TypeError: missing 'llm'

This breaks the existing code path. Every test and production instantiation of PlanGenerator will crash.

4. GovernanceEnforcer() called without required args

middleware.py:243:

engine = ExecutionEngine(
    enforcer=GovernanceEnforcer(),  # Missing db_path and secret!
    ...
)

GovernanceEnforcer.__init__ requires (db_path: str, secret: str). This is a TypeError.

5. self.plan doesn't exist on GovernanceMiddleware

middleware.py:226:

exec_plan = self.plan

The middleware has no plan attribute. The evaluate/enforce cycle works through PlanStore, not instance state. This is an AttributeError.

6. Two generate() methods with incompatible signatures

The PR adds a new generate() to PlanGenerator:

def generate(self, *, user_message, context, session_id, ttl_minutes):

This shadows the existing generate():

def generate(self, intent, request_body, session_id):

The middleware calls the OLD signature. The new method would take over and crash with TypeError.

7. state=None passed to AgentContextInjector.generate_context()

middleware.py:231:

agent_context = injector.generate_context(plan=exec_plan, state=None)

But generate_context() accesses state.current_sequence and state.step_results[...]. Passing NoneAttributeError.


Major Issues

8. Placeholder prompt: EXECUTION_PLAN_PROMPT = "\n...\n"

planner.py:31-33:

EXECUTION_PLAN_PROMPT = """
...
"""

The new generate() method sends this literal "..." to the LLM. It can never produce a valid plan. This is dead code.

9. Deprecated datetime.utcnow()

planner.py:118:

now = datetime.utcnow()  # DeprecationWarning in Python 3.12+

The project already uses datetime.now(UTC) in engine.py. Inconsistent and deprecated.

10. Silent exception swallowing in engine

engine.py:520-522:

except Exception as e:
    state.status = StepStatus.FAILED
    # Log error but don't crash

The comment says "log error" but no logging happens. The exception is silently dropped. This makes debugging execution failures impossible and violates the project's error handling rules (.claude/rules/error-handling.md).

11. Zero tests — 1,094 lines of untested code

The project has 520+ Python tests with coverage fail_under=90. This PR adds zero tests for:

  • ExecutionEngine (the core new component)
  • AgentContextInjector
  • New Pydantic models (8 new model classes)
  • New generate() method on PlanGenerator
  • New execute() method on middleware

12. Attribute mismatch: planId vs plan_id

middleware.py:237,254: References exec_plan.planId (camelCase), but all Pydantic models use plan_id (snake_case). The JSON schema uses camelCase, but the Python models don't define aliases. This is an AttributeError.


Minor / Style Issues

13. Uses Dict, List from typing instead of builtins

planner.py:16:

from typing import Any, List, Dict

Python 3.12+ project — should use dict[str, Any] and list[...] directly (as done everywhere else in the codebase).

14. Formatting inconsistency in imports

middleware.py:25:

    ToolCall, ExecutionContext,

Multiple symbols on one line with trailing comma — doesn't match existing import style.

15. Trailing newline removed from middleware.py

The diff shows \ No newline at end of file — removed the trailing newline from the existing file.

16. Schema file never loaded or referenced by code

config/execution-plan.json is a JSON Schema but no code loads, validates against, or references it. It's a documentation artifact, not enforced infrastructure.

17. EnhancedExecutionPlan.conditionals and recovery_paths are commented out

models.py (new):

# conditionals: list[ConditionalBranch] = Field(default_factory=list)
# recovery_paths: list[RecoveryPath] = Field(default_factory=list)

But engine.py accesses plan.conditionals and plan.recovery_paths — these will be AttributeError.


Architectural Feedback

The design idea — separating evaluate (side-effect-free) from execute (side-effects) with an explicit plan artifact — is sound and aligns well with the existing governance architecture. However:

  1. The existing codebase already separates evaluation from enforcement. evaluate() produces plans, enforce() validates tool calls at runtime. The PR's execute() introduces a third layer that overlaps with enforce() without clear boundaries.

  2. The EnhancedExecutionPlan wraps ExecutionPlan but doesn't integrate. There's no code path that creates an EnhancedExecutionPlan from the existing planner output. The bridge (engine_adapter.py) is missing.

  3. Two execution paradigms conflict. The existing system is a "validate-then-proxy" model — the proxy evaluates, gets a token, then forwards individual tool calls through enforcement. The new engine wants to drive execution itself. These aren't reconciled.

  4. The JSON schema and Pydantic models are divergent. The schema defines one contract (camelCase, operations, allow/deny rules), the models define another (snake_case, PlannedAction, risk scores). There's no serialization bridge.


Recommendation

Before this can be considered for merge:

  1. Fix all 7 critical bugs (items 1-7) — the code doesn't run
  2. Add comprehensive tests — at minimum for ExecutionEngine, AgentContextInjector, and the new models
  3. Reconcile the two generate() methods — either extend the existing one or create a separate entry point that doesn't shadow it
  4. Write engine_adapter.py — or remove the import
  5. Fill in EXECUTION_PLAN_PROMPT — or remove the LLM-based generate path
  6. Decide how execute() relates to enforce() — document the boundary
  7. Add logging in the exception handler (engine.py:520)
  8. Fix the async/sync boundary — make execute() properly async

@nayname
Copy link
Owner Author

nayname commented Feb 16, 2026

@yi-john-huang thanks for the thoughtful detailed review and happy Lunar New Year! PR was intentionally a sketch to validate direction before diving into implementation. I’ll go through your comments carefully and think about a concrete plan for addressing them.

@nayname
Copy link
Owner Author

nayname commented Feb 18, 2026

Hi @yi-john-huang ,

I've addressed your feedback and wanted to summarize the changes and our architectural direction.

Summary

Focused on making the plan a complete artifact. In this iteration:

  1. Enhanced plan creation — adapted from basic plan using LLM + JSON schema
  2. Separated execution — moved out of middleware.py into src/execution/ (stub for now)

The enhanced plan is embedded in the current workflow — it's created but not consumed. Tests added.

Next steps: Focus on plan creation — embedding user info (paths, configs, authorization, etc.). Plan structure will evolve. May drop JSON schema eventually, but it's good enough for now.

Critical Issues (1-7) — All Fixed

# Issue Fix
1 Missing engine_adapter.py implemented as planner.enhance()
2 await in non-async Executor.execute() is async def
3 PlanGenerator.__init__ requires llm llm passed to enhance(), not __init__
4 GovernanceEnforcer() missing args Dependency injection in Executor
5 self.plan doesn't exist Removed
6 Two generate() methods Separate generate() and enhance()
7 state=None crash initialize_state() called before use

Major/Minor Issues — Fixed

Architectural Feedback — Addressed

"execute() overlaps with enforce()" → Separated into layers:

  • src/governance/ — evaluate, validate, enforce (existing)
  • src/execution/ — engine, executor, context injection (new, stub)

"EnhancedExecutionPlan doesn't integrate"planner.enhance() creates it from basic plan + LLM output

"Two execution paradigms conflict" → not dealing with that, execution layer is a stub for now. Focus is on plan creation.

"JSON schema and Pydantic divergent" → Intentionally keeping _parse_* bridge methods for now. Schema is still evolving; adding Pydantic aliases would create maintenance burden. Will revisit when schema stabilizes.

@nayname
Copy link
Owner Author

nayname commented Feb 19, 2026

I mirrored this branch upstream here: yi-john-huang#14 (same code), just in case.
Discussion and review still live in this PR - I’ll align any upstream follow-ups based on feedback here.

@yi-john-huang
Copy link

What's Good

  • Clean evaluate/execute split — evaluation is side-effect-free, execution owns all side effects. Strong boundary.
  • Schema-first design — defining the execution plan as a JSON Schema artifact is the right approach for audit/replay.
  • ToolExecutorAdapter ABC — clean abstraction for tool execution, easy to test and extend.
  • Recovery path modeling — retry/skip/fail-fast strategy is well-structured.
  • Test volume — good coverage across happy paths and error paths.

Critical (Must Fix)

1. EnhancedExecutionPlan references RecoveryPath / ConditionalBranch before they're defined
src/governance/models.py

EnhancedExecutionPlan has fields typed as list[RecoveryPath] and list[ConditionalBranch], but both classes are defined after EnhancedExecutionPlan in the file. Even with from __future__ import annotations, Pydantic v2 evaluates Field(default_factory=...) at class-definition time — it will raise NameError or PydanticUserError on import.

Fix: Move RecoveryPath and ConditionalBranch above EnhancedExecutionPlan, or call EnhancedExecutionPlan.model_rebuild() at module end after all classes are defined.


2. create_enhanced_plan() return value is discarded and its exceptions crash evaluate()
src/governance/middleware.py

Two problems in one call site:

if self._enhancement_enabled:
    self.create_enhanced_plan(plan, effective_session_id, user_id, token)
  • The EnhancedExecutionPlan is computed and immediately garbage-collected. EvaluationResult still returns the base plan's plan_id and token. Enhancement is a no-op at runtime.
  • create_enhanced_plan() has no error handling. Any LLM failure (network error, missing API key, bad JSON) propagates up and crashes the entire evaluate() call — blocking all governance evaluation for what's supposed to be an optional enhancement.

Fix: Wrap the call in try/except and log on failure. Separately, decide how the enhanced plan reaches the caller (persist to store, add an enhanced_plan_id field to EvaluationResult, etc.).


3. PlanGenerator instantiated with CWD-relative schema path
src/governance/middleware.py

self._planner = PlanGenerator(patterns_path)

PlanGenerator.__init__ now has a schema_path parameter defaulting to "config/execution-plan.json" — a relative path. In any environment where CWD isn't the project root (pytest tmp_path, production deployment), self._schema will be None, and enhance() will raise RuntimeError: Schema not found whenever enhancement is enabled.

Fix: Pass schema path explicitly through middleware settings, or resolve it to an absolute path at construction time.


4. Missing src/llm/__init__.py

src/llm/client.py exists, but there's no __init__.py. The import from src.llm.client import LLMClient in middleware.py will fail with ModuleNotFoundError at startup.


High (Should Fix)

5. Off-by-one in retry loop
src/execution/engine.py

With max_retries=3, while retry_count < max_retries runs 3 total attempts — not 3 retries after the first. A user setting max_retries=1 (the minimum, per ge=1) gets exactly one attempt with no retry at all.

Fix: Rename to max_attempts, or change semantics so max_retries=3 means 3 retries (4 total attempts).


6. _should_skip() conditional logic is broken
src/execution/engine.py

def _should_skip(self, sequence, plan):
    for cond in plan.conditionals:
        if sequence in cond.if_true or sequence in cond.if_false:
            for result in plan.state.step_results:
                if result.status == StepStatus.FAILED:
                    if sequence in cond.if_true:
                        return True
    return False
  • if_false branch is never evaluated — steps in if_false are never skipped
  • Any prior step failure triggers skip, not the specific step the conditional references
  • cond.condition (e.g., "step_0_success") is parsed but never evaluated

This makes the conditional system logically incorrect for any plan with more than 2 steps.


7. Full ToolCall.arguments sent unsanitized to external LLM
src/governance/planner.py

enhance() serializes the full ExecutionPlan — including all ToolCall.arguments — and sends it to the Anthropic API. Those arguments may contain passwords, tokens, file paths, or PII extracted from the user's request. This violates the project's security rules (OWASP A03/A09).

Fix: Scrub sensitive fields from ToolCall.arguments before including them in the prompt. Log a security audit event when calling the external API.


8. Unused imports in executor.py will fail ruff
src/execution/executor.py

Session and GovernanceDecision are imported but never used. The project runs ruff — this will fail CI.


9. LLMClient needs hardening
src/llm/client.py

  • Anthropic() raises AuthenticationError at construction if ANTHROPIC_API_KEY is unset — no helpful error message
  • No timeout — LLM calls can hang indefinitely
  • response.content[0] can IndexError on empty responses
  • Model is hardcoded (claude-sonnet-4-20250514) — should come from config

Medium

  • Duration log always shows "incomplete"engine.py: _calc_total_duration() is called before plan.state.completed_at is set. Move the assignment up.
  • session_id: str | None passed to initialize_state()middleware.py: ExecutionContext.session_id requires str, not None. Will raise ValidationError when no session exists.
  • isort violationmodels.py: from datetime import UTC, datetime appears after third-party imports. Ruff will flag this.
  • Two divergent schema filesexecution-plan-big.json (draft-2020-12) and execution-plan.json (draft-07) define different schemas for the same concept. Only the latter is used. Remove execution-plan-big.json or document its purpose.
  • test_execute_raises_when_state_not_initialized — asserts AttributeError from an uninitialized state, which is testing a bug rather than a contract. Should assert a meaningful ExecutionError instead.

@nayname
Copy link
Owner Author

nayname commented Feb 21, 2026

Hi @yi-john-huang,

Thanks for the thorough review. I've addressed all the issues.

General Notes:

_should_skip() stubbed: The conditional logic was broken and half-implemented. Rather than ship buggy branching, stubbed it to return False. Will implement properly when we define conditional semantics.

execution-plan-big.json kept: Serves as a landmark for the full schema vision. The smaller execution-plan.json is used in practice.

Enhanced plan not persisted: Currently created but not saved. The goal is to eventually transfer meaningful context to the user and store a useful token — this requires refinement of what the plan actually contains first.

Next steps: Plan refinement — using full schema and embedding user info (paths, configs, authorization, etc.). That's when the stubbed conditional logic comes into play.

Here's the summary:

Critical (1-4) — All Fixed

# Issue Fix
1 RecoveryPath/ConditionalBranch defined after use Moved classes before EnhancedExecutionPlan
2 create_enhanced_plan() return discarded, no error handling Added try/except, logging, returns `EnhancedExecutionPlan`, still not saved
3 Schema path CWD-relative Resolved to absolute path from patterns_path
4 Missing src/llm/__init__.py Created

High (5-9) — All Fixed

# Issue Fix
5 Off-by-one in retry loop max_attempts = 1 + max_retries (1 initial + N retries)
6 _should_skip() broken Stubbed with return False
7 Sensitive args sent to LLM Added _sanitize_for_llm() + security audit logging
8 Unused imports in executor.py Removed GovernanceDecision, Session
9 LLMClient needs hardening Added timeout, env var config, error handling

Medium — Fixed

Issue Fix
Duration log "incomplete" completed_at set before _calc_total_duration()
session_id: None crashes ExecutionContext initialize_state() raises ValueError if None
isort violation Fixed import order
Two schema files execution-plan-big.json stays as reference/landmark for full schema vision
Test asserts AttributeError Now asserts ExecutionError

Tests updated to reflect all changes.

General notes / clarifications

Conditional logic (_should_skip)
The previous implementation was broken and half-implemented. Rather than ship incorrect branching semantics, I’ve stubbed it to return False for now. Proper conditional execution will be implemented once we finalize the conditional expression semantics and the richer plan schema.

execution-plan-big.json
This is intentionally kept as a landmark for the full schema vision. The smaller execution-plan.json is the one used in practice today.

Enhanced plan persistence
Enhanced plans are now created safely (non-blocking) but are not yet persisted. The intention is to eventually store a refined enhanced plan and issue a meaningful token once the schema and user-context embedding (paths, configs, auth, etc.) are finalized.

Next steps
Refining the execution plan schema and enriching it with real user/runtime context — this is where proper conditional logic will come back into play.

Summary of fixes
Critical (1–4) — All fixed
1	RecoveryPath / ConditionalBranch defined after use	Classes moved before EnhancedExecutionPlan
2	create_enhanced_plan() discarded + no error handling	Wrapped in try/except, logged failures, returns EnhancedExecutionPlan (non-blocking)
3	Schema path CWD-relative	Resolved to absolute path derived from patterns_path
4	Missing src/llm/__init__.py	Added
High (5–9) — All fixed
5	Retry off-by-one	max_attempts = 1 + max_retries
6	_should_skip() broken	Stubbed to return False until semantics are defined
7	Sensitive args sent to LLM	Added _sanitize_for_llm() + security/audit logging
8	Unused imports in executor.py	Removed
9	LLMClient hardening	Timeout, env-configurable model, error handling, dependency declared
Medium — Fixed
Issue	Fix
Duration log always “incomplete”	completed_at set before _calc_total_duration()
session_id=None crashes ExecutionContext	initialize_state() validates and errors early
isort violation	Import order fixed
Two schema files	execution-plan-big.json kept intentionally as reference
Test asserting AttributeError	Updated to assert ExecutionError
Tests	Updated to reflect all changes
Moves the Anthropic import behind LLMClient initialization so optional
LLM enhancement no longer affects import-time behavior or core governance
paths.

This change ensures:
- Governance evaluation and execution remain functional when LLM
  enhancement is disabled or misconfigured
- Missing dependencies or API keys fail locally and explicitly, not at
  module import time
- Clear error messages guide correct installation and configuration

Adds targeted tests covering:
- Missing anthropic dependency
- Missing ANTHROPIC_API_KEY
- Successful lazy initialization when enhancement is enabled

This reduces blast radius for optional integrations and tightens the
boundary between core governance logic and external LLM providers.
Tighten failure handling around LLM-based plan enhancement.

Enhancement remains best-effort and non-authoritative: failures are
caught locally and never affect core governance evaluation or execution.

Improvements include:
- Richer warning logs with plan_id, error type, and full traceback to
  aid debugging and post-mortem analysis
- Removal of a redundant ExecutionError definition to avoid duplicate
  error types
- Integration test coverage asserting that enhancement failures are
  logged with sufficient diagnostic detail

This reduces silent failure modes while preserving strict isolation
between optional LLM enhancement and core execution paths.
…lidation

Tighten security and determinism around LLM-based plan enhancement.

This change introduces an allowlist-based sanitization strategy for all
data sent to external LLMs. Only explicitly permitted structural fields
and safe argument keys are allowed through; everything else is redacted.
This replaces the previous sensitive-key denylist with a stricter,
governance-appropriate posture.

Additional improvements:
- Validate LLM output against the execution-plan JSON schema before
  constructing an EnhancedExecutionPlan
- Make schema path resolution deterministic and independent of CWD
- Move LLM dependencies behind an optional dependency group
- Expand tests to cover allowlist sanitization, schema validation, and
  schema path resolution behavior

The goal is to reduce blast radius, prevent credential or PII leakage,
and ensure LLM enhancement remains auditable, deterministic, and
non-authoritative.
Moved jsonschema to core dependencies (it's small, reasonable)
Remove try/except ImportError in complete() - anthropic is guaranteed
available since __init__ succeeded. Delete corresponding test that
simulated module disappearing mid-process (impossible scenario).
- Rename agentContext to agent_context (snake_case consistency)
- Add EXPERIMENTAL docstring to engine.py, injected_context.py
- Clarify ValidationError handling comment in middleware.py
- Update tests to use snake_case key
Move execution-plan-big.json to docs/reference/ and rename to
execution-plan-reference.json. This schema is not used in production -
it documents the full vision for plan structure to guide future work.
Add an explicit note explaining why EnhancedExecutionPlan remains mutable
in this iteration.

The plan primitive is still being refined, and execution state is embedded
temporarily to avoid prematurely locking lifecycle and ownership boundaries.
Once the plan schema stabilizes, execution state will be extracted into a
separate mutable container owned by the execution engine and the plan model
will be frozen.
Add debug logging when skipping malformed recovery_paths and
conditionals. Update tests to verify logging occurs.
Add @requires_llm marker to tests that need real Anthropic client.
Tests auto-skip with clear reason when key is unavailable.
@nayname nayname force-pushed the execution_governance branch from c454d22 to ad4ae41 Compare February 27, 2026 12:12
Guard asyncio.create_task for sync callers like evaluate_governance().
@nayname nayname merged commit 1f35f57 into master Mar 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants