Skip to content

Agent Run Implementation#504

Merged
onuratakan merged 15 commits intomasterfrom
agent-run-clean
Jan 17, 2026
Merged

Agent Run Implementation#504
onuratakan merged 15 commits intomasterfrom
agent-run-clean

Conversation

@DoganK01
Copy link
Copy Markdown
Member

@DoganK01 DoganK01 commented Jan 5, 2026

  • Agent Run Implementation
  • Memory modes added for UEL. This solves conflicts between placeholder history and model history
  • Make context managers a regular class as we use step-based architecture in Agent flow
  • HITL (Human-in-the-loop) logics now managed by RunRequirement class and their usage got changed
  • Refactored streaming logic usage and removed unnecessary, confusing stream methods in Agent class
  • debug_level 2
  • AgentSession class for storing agent related informations
  • Parallel tool call support for external tool call logic
  • For one tool at a time calls when external tool call enabled, we provide callable as arg so that we can handle it in the background
  • Thinking part step added
  • metadata attribute in Agent class. This enables user to put any metadata into the prompt
  • Dynamic search tool when KnowledgeBase used as tool (RAG as tool)
  • tool name prefix for MCP tools. MCPs might have same tool name. In this way, you can use different function tools with same name for different purposes.

@claude
Copy link
Copy Markdown

claude bot commented Jan 10, 2026

Pull Request Review: Agent Run Implementation (#504)

This is a substantial architectural refactoring with 38,936 additions and 13,895 deletions across 100 files. The changes introduce a unified pipeline-based execution model with improved HITL support. Below is my detailed review.


🎯 Overall Assessment

Strengths:

  • Well-architected pipeline system with clear separation of concerns
  • Comprehensive HITL (Human-in-the-Loop) support through external tool execution
  • Strong serialization/deserialization infrastructure
  • Clear state management with AgentRunOutput as single source of truth

Concerns:

  • Critical: Insufficient test coverage for such a massive refactoring
  • High Risk: Many breaking changes with limited migration documentation
  • Security: Cloudpickle usage needs security review
  • Documentation: Missing comprehensive documentation for new architecture

📋 Detailed Findings

1. Code Quality & Best Practices

Strengths:

  • Clean Architecture: Pipeline pattern with clear step separation (src/upsonic/agent/pipeline/)
  • Type Safety: Extensive use of TYPE_CHECKING guards to avoid circular imports (e.g., agent.py:20-64)
  • Consistent Error Handling: Try-except-finally pattern in all pipeline steps (e.g., steps.py:51-111)
  • Proper Async Support: Dual async/sync methods throughout (e.g., context_manager.py)
  • Good Naming: Clear, descriptive names for classes and methods

⚠️ Issues:

1. Lazy Import Anti-Pattern

# src/upsonic/agent/agent.py:20-43
if TYPE_CHECKING:
    from upsonic.models import Model
else:
    Model = "Model"  # String fallback

While this avoids circular imports, it breaks IDE autocomplete and type checking at runtime. Consider restructuring imports or using from __future__ import annotations.

2. Complex Context Management
The AgentRunOutput class has grown very large (636 lines) with many responsibilities:

  • Message tracking
  • Tool execution
  • HITL requirements
  • Status management
  • Serialization
  • Step results
  • Event streaming

Recommendation: Consider breaking this into smaller, focused classes following Single Responsibility Principle.

3. Inconsistent Error Handling

# src/upsonic/agent/pipeline/steps.py:100-108
except Exception as e:
    step_result = StepResult(
        name=self.name,
        step_number=step_number,
        status=StepStatus.ERROR,
        message=str(e)[:500],  # Truncates error messages
        execution_time=time.time() - start_time,
    )
    raise

Truncating error messages to 500 chars could hide critical debugging information. Consider logging full errors separately.


2. Potential Bugs & Issues

🐛 Critical Issues:

1. Cloudpickle Security Risk

# src/upsonic/run/agent/output.py:625-628
def serialize(self) -> bytes:
    """Serialize to bytes for storage."""
    import cloudpickle
    return cloudpickle.dumps(self.to_dict())

Security Concern: Cloudpickle can execute arbitrary code during deserialization. If AgentRunOutput objects are stored in databases accessible by untrusted users, this is a remote code execution vulnerability.

Recommendation:

  • Use JSON serialization for untrusted data
  • If pickle is necessary, implement signature verification
  • Document security implications clearly
  • Consider using msgpack or Protocol Buffers

2. Message Confusion

# src/upsonic/run/agent/output.py:106-109
# chat_history: Full conversation history (historical + current) for LLM execution FOR THE SESSION, ALL RUNS
chat_history: List["ModelMessage"] = field(default_factory=list)
# messages: Only NEW messages from THIS run
messages: Optional[List["ModelMessage"]] = None

Having both chat_history and messages is confusing and error-prone. The comment helps, but the dual structure invites bugs where developers use the wrong one.

3. Property Logic Error

# src/upsonic/run/requirements.py:46-54
@property
def needs_confirmation(self) -> bool:
    if self.confirmation is not None:
        return False
    if not self.tool_execution:
        return False
    if self.tool_execution.confirmed is True:  # ← Bug here
        return True
    return self.tool_execution.requires_confirmation or False

Bug: Line 52 returns True when already confirmed, which contradicts the property name. Should be False.

4. Race Condition in RunRequirement

# src/upsonic/run/requirements.py:67-74
@property
def needs_external_execution(self) -> bool:
    if not self.tool_execution:
        return False
    if self.tool_execution.result is not None:
        return False
    return self.tool_execution.external_execution_required or False

No locking mechanism for concurrent access to tool_execution.result. Could cause race conditions in parallel tool execution.

⚠️ Medium Issues:

5. Memory Leak Potential

# src/upsonic/session/agent.py:92-98
def flatten_messages_from_runs_all(self) -> List[ModelMessage]:
    all_msgs = []
    if self.runs:
        for run_data in self.runs.values():
            if run_data.output and run_data.output.messages:
                all_msgs.extend(run_data.output.messages)
    return all_msgs

No limit on the number of runs or messages. Long-running sessions could accumulate unbounded data. Consider:

  • Max runs limit with LRU eviction
  • Message pagination
  • Automatic archiving

6. Silent Failures

# src/upsonic/session/agent.py:136-143
for run_id, run_data in runs_data.items():
    if isinstance(run_data, RunData):
        runs[run_id] = run_data
    elif isinstance(run_data, dict):
        try:
            runs[run_id] = RunData.from_dict(run_data)
        except Exception as e:
            log_warning(f"Failed to deserialize run {run_id}: {e}")  # Silent failure

Failed deserializations are only logged, not raised. Could lead to data loss without user awareness.


3. Performance Considerations

🚀 Optimizations Needed:

1. Inefficient Message Extraction

# src/upsonic/run/agent/output.py:308-312
def new_messages(self) -> List["ModelMessage"]:
    if not self._run_boundaries or not self.messages:
        return (self.messages or []).copy()  # ← Copies entire list
    last_start = self._run_boundaries[-1]
    return self.messages[last_start:].copy()  # ← Another copy

Creates copies of potentially large message lists on every call. Consider:

  • Returning views/slices without copying
  • Caching results
  • Using generators for iteration

2. Repeated Serialization

# Multiple locations serialize the same objects repeatedly
def to_dict(self) -> Dict[str, Any]:
    return {
        "chat_history": [self._serialize_message(m) for m in self.chat_history],
        "messages": self._serialize_list(self.messages),
        # ... many more
    }

No caching of serialized objects. For large sessions with frequent saves, this could be expensive.

3. Pipeline Step Overhead
The new pipeline has 19 steps (direct) vs 14 steps (streaming). Each step has:

  • Try-except-finally overhead
  • Event emission
  • StepResult creation
  • Statistics updates

For simple tasks, this could add significant latency. Consider:

  • Step fusion for common paths
  • Conditional step execution
  • Lazy evaluation

4. Security Concerns

🔒 Critical Security Issues:

1. Arbitrary Code Execution via Cloudpickle (mentioned above)
Severity: HIGH
CVSS: ~8.1 (High)

2. No Input Validation on External Tool Results

# src/upsonic/run/requirements.py:114-118
def set_external_execution_result(self, result: str) -> None:
    """Set the result from external execution."""
    if not self.tool_execution:
        raise ValueError("No tool execution to set result for")
    self.tool_execution.result = result  # ← No validation

External tool results are accepted without validation. Malicious results could contain:

  • Code injection payloads
  • XSS vectors
  • SQL injection attempts (if results are used in queries)

Recommendation: Add content validation and sanitization.

3. Missing Authentication/Authorization

# src/upsonic/session/agent.py
@dataclass
class AgentSession:
    session_id: str
    user_id: Optional[str] = None  # ← Optional user_id

User ID is optional, and there's no evidence of authorization checks. Any code with access to session_id can manipulate sessions.

4. Metadata Injection

# src/upsonic/agent/agent.py:332-333
# Agent metadata (injected into prompts)
self.metadata = metadata or {}

User-controlled metadata is injected into prompts without sanitization. Could enable prompt injection attacks.


5. Test Coverage

Critical Gap:

Only 1 test file modified in a PR changing 100 files:

tests/doc_examples/agent/adding_tools/adding_tools_1.py

This is insufficient for such a massive refactoring. Required test coverage:

Unit Tests Needed:

  • ✗ Pipeline step execution (all 19 steps)
  • ✗ AgentRunOutput serialization/deserialization
  • ✗ RunRequirement state transitions
  • ✗ AgentSession HITL flows
  • ✗ Message tracking and boundaries
  • ✗ Error recovery (durable execution)
  • ✗ Cancellation handling
  • ✗ Streaming vs non-streaming paths

Integration Tests Needed:

  • ✗ Full pipeline execution (happy path)
  • ✗ External tool pause/resume
  • ✗ Multi-run sessions
  • ✗ Memory modes (update vs replace)
  • ✗ Policy enforcement in pipeline
  • ✗ Culture manager integration

Edge Cases to Test:

  • ✗ Run boundaries with empty messages
  • ✗ Concurrent tool execution
  • ✗ Session serialization with large histories
  • ✗ Pipeline interruption/resumption
  • ✗ Malformed external tool results

Recommendation: Add comprehensive test suite before merging. Consider test-driven development for future features.


6. Breaking Changes

This PR introduces MAJOR BREAKING CHANGES that will affect all existing users:

🔴 Breaking Changes:

  1. AgentRunOutput is now mandatory

    • Old: Functions returned simple strings or objects
    • New: All execution returns AgentRunOutput
    • Migration: Wrap existing code with .output access
  2. RunRequirement scope change

    • Old: Handled confirmations, input, and external execution
    • New: ONLY handles external tool execution
    • Migration: Update HITL code to use AgentRunOutput.status directly
  3. Streaming no longer supports HITL

    • astream() explicitly disallows external tool pauses (line 2740-2741)
    • Migration: Use do_async() for HITL workflows
  4. Context manager lifecycle changed

    • Old: manage_context() primary path
    • New: aprepare() → use → afinalize()
    • Migration: Update custom context managers
  5. Memory mode changes

    • New user_memory_mode: Literal['update', 'replace'] parameter
    • Migration: Explicitly set memory mode in configurations
  6. Message API changes

    • Separation of chat_history vs messages
    • Migration: Update code accessing agent messages

Missing: No migration guide or deprecation warnings provided.


7. Documentation Gaps

📝 Documentation Needed:

Architecture Documentation:

  • Pipeline architecture diagram
  • Step execution flow
  • State machine diagrams for RunStatus
  • Message flow and boundaries
  • HITL workflow examples

API Documentation:

  • AgentRunOutput API reference
  • Pipeline step creation guide
  • RunRequirement usage guide
  • Session management patterns

Migration Guides:

  • v1 to v2 migration guide
  • Breaking changes checklist
  • Code examples for all breaking changes
  • Deprecation timeline

Developer Guides:

  • Creating custom pipeline steps
  • Implementing new context managers
  • Error recovery patterns
  • Testing pipeline components

🎯 Recommendations

High Priority (Must Address Before Merge):

  1. Add Comprehensive Test Suite

    • Target: 80%+ coverage for new code
    • Include unit, integration, and e2e tests
    • Add performance benchmarks
  2. Fix Security Issues

    • Replace cloudpickle with safe serialization for untrusted data
    • Add input validation on external tool results
    • Implement authorization checks
    • Sanitize metadata before prompt injection
  3. Fix Logic Bugs

    • Correct needs_confirmation property logic
    • Add thread safety for concurrent tool execution
    • Handle edge cases in message boundary tracking
  4. Add Migration Documentation

    • Write detailed migration guide
    • Provide code examples for all breaking changes
    • Add deprecation warnings to old APIs
    • Create compatibility shims if possible

Medium Priority (Should Address Soon):

  1. Refactor Large Classes

    • Break down AgentRunOutput into focused components
    • Extract message management to separate class
    • Move serialization logic to dedicated module
  2. Add Resource Limits

    • Implement max runs per session
    • Add message count limits
    • Create auto-archiving for old runs
  3. Improve Performance

    • Cache serialization results
    • Use message views instead of copies
    • Optimize pipeline for simple tasks
  4. Documentation

    • Architecture overview
    • API reference for new components
    • Developer guide for extensions

Low Priority (Nice to Have):

  1. Error Messages

    • Don't truncate errors to 500 chars
    • Add context to error messages
    • Include debugging hints
  2. Code Cleanup

    • Reduce TYPE_CHECKING string fallbacks
    • Consolidate message handling
    • Add docstring examples

📊 Risk Assessment

Risk Category Level Justification
Test Coverage 🔴 CRITICAL Only 1 test file for 100 files changed
Security 🔴 CRITICAL Cloudpickle RCE, no input validation
Breaking Changes 🟠 HIGH Major API changes without migration guide
Code Quality 🟡 MEDIUM Good architecture, but some anti-patterns
Performance 🟡 MEDIUM Some inefficiencies, but not critical
Documentation 🟠 HIGH Insufficient for scope of changes

Overall Risk: 🔴 HIGH


Approval Criteria

Before this PR can be safely merged:

  • Add comprehensive test suite (80%+ coverage)
  • Fix security vulnerabilities (cloudpickle, input validation)
  • Fix logic bugs (needs_confirmation, race conditions)
  • Write migration guide with examples
  • Add architecture documentation
  • Performance benchmark showing no regression
  • Security audit completed
  • Breaking changes documented in CHANGELOG

💡 Positive Notes

Despite the concerns, this is a well-architected refactoring:

  • ✅ Clean pipeline pattern is a significant improvement
  • ✅ Strong separation of concerns
  • ✅ Comprehensive HITL support is well-designed
  • ✅ Consistent error handling patterns
  • ✅ Good use of dataclasses and type hints
  • ✅ Thoughtful serialization infrastructure

With proper testing, security fixes, and documentation, this will be a solid foundation for the framework.


📌 Summary

This is an ambitious and well-architected refactoring that introduces important features like unified pipeline execution and robust HITL support. However, the lack of test coverage, security vulnerabilities, and insufficient documentation make it not ready for merge in its current state.

Recommendation: REQUEST CHANGES

The core architecture is sound, but the risk is too high without adequate testing and security measures. Please address the high-priority items before merging.

Happy to discuss any of these points in detail! 🚀

@claude
Copy link
Copy Markdown

claude bot commented Jan 17, 2026

Pull Request Review: Agent Run Implementation (#504)

This is a substantial architectural refactoring with 80,397 additions and 18,923 deletions across the codebase. Here's my comprehensive review:

🎯 Summary

This PR introduces a major architectural overhaul of the agent execution system, focusing on:

  • New AgentRun architecture with AgentRunInput and AgentRunOutput
  • Improved HITL (Human-in-the-Loop) management via RunRequirement
  • Refactored storage and memory system
  • Enhanced streaming and event handling
  • Parallel tool call support

✅ Strengths

1. Clear Separation of Concerns

  • The new run/ module provides excellent separation between input, output, events, and requirements
  • AgentRunInput and AgentRunOutput as dataclasses provide type safety and clear contracts
  • Single source of truth pattern for run state in AgentRunOutput

2. Improved HITL Architecture

  • RunRequirement class (src/upsonic/run/requirements.py) provides a clean abstraction for external tool execution
  • Clear distinction between confirmation, user input, and external execution needs
  • The is_resolved property elegantly handles requirement completion logic

3. Robust Cancellation Management

  • Thread-safe RunCancellationManager (src/upsonic/run/cancel.py:13-61)
  • Proper cleanup with register_run, cancel_run, and cleanup_run lifecycle
  • Cancellation checks integrated throughout the pipeline steps

4. Step-Based Pipeline Architecture

  • Well-structured pipeline with discrete steps (InitializationStep, CacheCheckStep, etc.)
  • Each step handles cancellation, errors, and result tracking
  • StepResult provides excellent execution visibility

5. Comprehensive Storage Refactoring

  • Abstract Storage base class provides clean interface (src/upsonic/storage/base.py)
  • Support for multiple backends (In-Memory, JSON, SQLite, Redis, PostgreSQL, MongoDB, Mem0)
  • Consistent async/sync patterns across implementations

⚠️ Issues & Concerns

Critical Security Issues

1. Unsafe URL Fetching (src/upsonic/run/agent/input.py:54)

response = httpx.get(url_content.url)
response.raise_for_status()

Problem: No timeout, no validation, vulnerable to SSRF attacks.

Recommendation:

response = httpx.get(url_content.url, timeout=30.0, follow_redirects=False)
# Add URL validation to prevent access to internal resources

2. Eval Usage in Test Examples

Multiple test files use eval() for calculator tools (tests/smoke_tests/agent/stream_events_example.py:43):

result = eval(expression)

Problem: While this is in test code, it sets a bad example. Users may copy this pattern.

Recommendation: Use ast.literal_eval() or a safe expression parser like numexpr.

Code Quality Issues

3. Massive File Sizes

  • src/upsonic/agent/agent.py: 1,656 additions (likely 2000+ lines total)
  • src/upsonic/agent/pipeline/steps.py: 2,509 additions
  • src/upsonic/run/events/events.py: 1,410 new lines
  • src/upsonic/run/agent/output.py: 1,073 new lines

Problem: These files are becoming monolithic and hard to maintain.

Recommendation: Consider breaking down into smaller modules:

  • Split steps.py into individual step files
  • Extract event definitions into separate files by category
  • Consider separating agent initialization, execution, and streaming logic

4. Inconsistent Error Handling

In steps.py, error messages are truncated to 500 chars:

message=str(e)[:500]

Problem: This arbitrary limit may hide critical debugging information.

Recommendation:

  • Use structured logging for full error details
  • Store full error in StepResult.error field
  • Keep truncated version for user-facing messages

5. Memory Accumulation Risk

AgentRunOutput.events accumulates all events during streaming (src/upsonic/run/agent/output.py:136):

events: Optional[List["AgentEvent"]] = None

Problem: For long-running agents, this list could grow unbounded.

Recommendation:

  • Add configurable event limit with circular buffer
  • Consider event persistence/streaming instead of in-memory accumulation
  • Add memory usage monitoring

6. Type Safety Concerns

Multiple uses of Optional[Any] reduce type safety:

  • metadata: Optional[Dict[str, Any]] (widespread)
  • session_state: Optional[Dict[str, Any]]

Recommendation: Define Pydantic models or TypedDicts for structured metadata.

Performance Concerns

7. Synchronous URL Downloads in Init

AgentRunInput.__post_init__() does synchronous HTTP calls:

response = httpx.get(url_content.url)

Problem: Blocks initialization for remote resources.

Recommendation: Make this lazy or async-only.

8. Global State in Cancellation Manager

_cancellation_manager = RunCancellationManager()

Problem: Global mutable state can cause issues in multi-threaded/multi-process environments.

Recommendation: Consider context-based or dependency injection pattern.

Testing Concerns

9. Test Organization

Tests moved from unit_tests/ to smoke_tests/ (commit: "fix: moved smoke tests from unit_tests into smoke_tests")

Problem: This reduces fast unit test coverage.

Recommendation:

  • Maintain separate unit tests for business logic
  • Keep smoke tests for integration scenarios
  • Ensure CI runs both test suites

10. Missing Test Coverage

For 80K lines of changes, I'd expect to see more new test files. The PR includes fixes to existing tests but limited new coverage for:

  • RunRequirement class
  • Cancellation manager
  • New event types
  • Storage migration paths

📊 Architecture Review

Good Patterns

✅ Dataclass usage for DTOs (AgentRunInput, AgentRunOutput)
✅ ABC usage for storage providers
✅ Event-driven architecture for streaming
✅ Step-based pipeline with clear lifecycle
✅ Separation of sync/async implementations

Anti-Patterns to Address

❌ Global mutable state (cancellation manager)
❌ Monolithic files (2000+ lines)
❌ Optional[Any] reducing type safety
❌ Unbounded memory accumulation (events list)
❌ Synchronous I/O in initialization

🔍 Additional Observations

Documentation

  • Good inline docstrings in new classes
  • Consider adding migration guide for users of old architecture
  • Architecture decision records would be valuable for this scale of change

Breaking Changes

This appears to be a breaking change for:

  • Agent initialization
  • Memory management
  • Context manager usage
  • HITL flow handling

Recommendation: Ensure changelog clearly documents migration path.

Dependencies

The storage refactoring introduces tight coupling to multiple database drivers. Consider:

  • Optional dependency groups (already partially done)
  • Plugin architecture for storage backends
  • Clear documentation of which extras to install

🎯 Recommendations Summary

Must Fix Before Merge

  1. ✋ Add timeout and validation to httpx.get() calls
  2. ✋ Add documentation for breaking changes and migration
  3. ✋ Address unbounded event accumulation

Should Fix Before Merge

  1. 📋 Split monolithic files into smaller modules
  2. 📋 Add unit tests for new core classes (RunRequirement, RunCancellationManager)
  3. 📋 Replace eval() in examples with safe alternatives
  4. 📋 Add memory usage limits/monitoring

Consider for Follow-up

  1. 💡 Refactor global state to dependency injection
  2. 💡 Define stricter types for metadata fields
  3. 💡 Add performance benchmarks for storage backends
  4. 💡 Create architecture decision records

🎖️ Overall Assessment

This is ambitious and well-structured work that significantly improves the agent execution architecture. The separation of concerns, event streaming, and HITL improvements are excellent.

However, the scale of changes (80K additions) introduces risk. The security issues must be addressed, and the monolithic file sizes will become maintenance burdens.

Recommendation:

  • ✅ Approve with required changes (security fixes, documentation)
  • 🔄 Plan follow-up refactoring to break down large files
  • 📊 Add performance benchmarks to prevent regressions
  • 🧪 Expand test coverage incrementally

Great work on tackling this complex refactoring! The new architecture provides a solid foundation for future features.


Review generated with assistance from Claude Code

@claude
Copy link
Copy Markdown

claude bot commented Jan 17, 2026

🔍 Comprehensive Code Review: PR #504 - Agent Run Implementation

Thank you for this substantial contribution! This is an impressive architectural refactor that brings significant improvements to the agent execution model. I've conducted a thorough review and have feedback across multiple dimensions.


📊 Overall Assessment

Code Quality: ⭐⭐⭐⭐ (4/5)

The implementation demonstrates strong architectural design with excellent separation of concerns, comprehensive type hints, and solid error handling patterns. The pipeline architecture is elegant and maintainable. However, there are several critical gaps around error handling, security, and test coverage that should be addressed.

Recommendation: Request changes - Please address the high-priority issues before merging, particularly around session save reliability and security concerns.


✅ Major Strengths

  1. Pipeline Architecture - Clean step-based execution with clear responsibilities across 24 distinct step classes
  2. AgentRunOutput as Single Source of Truth - Excellent design consolidating all run state in one place
  3. Context Manager Migration - Successfully converted to regular classes with proper aprepare()/afinalize() lifecycle
  4. Comprehensive Type Hints - Extensive use of TYPE_CHECKING guards and proper typing throughout
  5. Event Streaming - Proper event emission at each step for observability
  6. HITL (Human-in-the-Loop) - Intelligent resumption with proper state management

🔴 Critical Issues (High Priority)

1. Session Save Failures Are Silent

File: src/upsonic/agent/pipeline/manager.py:98-123

Issue: Session save exceptions are swallowed with only debug-level warnings:

except Exception as save_error:
    if self.debug:
        warning_log(f"Failed to save session: {save_error}", "PipelineManager")

Impact: Critical data loss risk - HITL state could be lost without user awareness

Fix:

  • Always log session save failures (not just in debug mode)
  • Consider retrying or raising if save is critical
  • Add telemetry/metrics for save failures

2. Memory Manager Finalization Silent Failure

File: src/upsonic/agent/context_managers/memory_manager.py:98-107

Issue: Session won't save if set_run_output() isn't called:

if self.memory and self._agent_run_output:
    await self.memory.save_session_async(output=self._agent_run_output)

Impact: Silent data loss if the run output is never set

Fix: Add warning log if _agent_run_output is None during finalization


3. Message Boundary Tracking Fragility

File: src/upsonic/run/agent/output.py:329-379

Issue: start_new_run() and finalize_run_messages() must be called in correct order manually. If pipeline flow changes, new_messages() could return incorrect data.

Impact: Core feature fragility - incorrect message tracking breaks memory management

Fix:

  • Add assertions to detect lifecycle misuse
  • Consider making this automatic via pipeline lifecycle hooks

4. Security: Unsafe Session Deserialization

File: src/upsonic/session/agent.py:129-192 (around line 870s based on pattern)

Issue: Complex deserialization with importlib.import_module for dynamic class loading

Impact: Potential arbitrary code execution if storage is compromised

Fix: Implement a whitelist of allowed modules for deserialization


5. Parallel Tool Call Implementation Unverified

File: src/upsonic/tools/processor.py:295-300

Issue: Code tracks sequential flag but actual parallel execution logic verification is needed

Impact: Cannot confirm parallel tool execution actually works as designed

Fix:

  • Review actual tool execution implementation
  • Add comprehensive tests with multiple parallel tools
  • Verify error handling when parallel tools fail

🟡 Medium Priority Issues

6. No Maximum Retry Cap

File: src/upsonic/agent/agent.py:308-311

Issue: Validation only checks retry < 1 but doesn't cap maximum:

if retry < 1:
    raise ValueError("The 'retry' count must be at least 1.")

Impact: Could allow extremely high retry counts

Fix: Add maximum retry cap (e.g., 100)


7. External Tool Result Validation Missing

File: src/upsonic/agent/pipeline/manager.py:690-729

Issue: No validation that tool results are actually provided before resumption

Impact: Could resume with incomplete/invalid tool results

Fix: Add validation in continue_run_async to ensure all requirements have valid results


8. External Tool Execution Accepts Unsanitized Input

File: src/upsonic/requirements.py:114-118

Issue: set_external_execution_result accepts any string without validation:

def set_external_execution_result(self, result: str) -> None:
    if not self.tool_execution:
        raise ValueError("No tool execution to set result for")
    self.tool_execution.result = result

Impact: Malicious results could be injected depending on downstream processing

Fix: Add result validation/sanitization


9. Tool Call Limit Not Enforced for Parallel Tools

File: src/upsonic/agent/agent.py:142, 793-795

Issue: tool_call_limit exists but unclear enforcement in parallel execution

Impact: Parallel tools could exceed limit

Fix: Enforce limit before dispatching parallel tools


10. Session Deserialization Robustness

File: src/upsonic/session/agent.py:129-192

Issue: Complex dict/object type checking with many potential failure points

Impact: Deserialization could fail on edge cases

Fix: Add comprehensive error handling with sensible fallback values


🟢 Low Priority / Suggestions

11. Large File Size

File: src/upsonic/agent/agent.py - 3,651 lines

Suggestion: Consider splitting into:

  • agent_core.py - Main execution logic
  • agent_tools.py - Tool management
  • agent_policies.py - Policy management
  • agent_resumption.py - HITL/continuation logic

12. Event Deduplication Logic

File: src/upsonic/agent/pipeline/manager.py:535-536

Issue: Uses object equality which may not work for all event types:

if event not in context.events:
    context.events.append(event)

Suggestion: Use event IDs instead of object equality


13. nest_asyncio Usage

File: src/upsonic/agent/agent.py:894-912

Note: Uses nest_asyncio for tool policy validation in potentially sync context

Suggestion: Document this behavior clearly; could hide async context issues in edge cases


📝 Test Coverage Gaps

Missing Tests:

  1. Parallel Tool Execution - No visible tests for multiple parallel tools
  2. Edge Cases:
    • Resume from mid-pipeline with incomplete manager state
    • Multiple external tool rounds with errors
    • Session save failures during HITL
  3. Streaming + HITL Interaction - Verify that HITL correctly rejects streaming mode

Recommendation:

Add negative test cases and parallel execution tests before merging.


⚡ Performance Considerations

1. Large Object Serialization

File: src/upsonic/run/agent/output.py:592-799

Concern: to_dict() is very comprehensive (200+ lines) - could be slow for large runs

Suggestion:

  • Add lazy serialization option
  • Consider storing large fields (messages, events) separately

2. Unlimited Event List Growth

File: src/upsonic/run/agent/output.py:136

Concern: events: Optional[List["AgentEvent"]] = None - no size limit

Impact: Memory bloat for long-running streaming tasks

Suggestion: Add event list size limit or circular buffer


3. Step Result Storage

File: src/upsonic/run/agent/output.py:130-133

Concern: All step results stored in memory and serialized

Suggestion: Consider optional step result pruning for long pipelines


🎯 Action Items

Immediate (Before Merge):

  • Add session save failure logging (always, not just debug)
  • Add memory manager finalization warnings
  • Implement security whitelist for deserialization
  • Add validation for external tool results before resumption
  • Add comprehensive parallel tool execution tests

Short-term:

  • Implement maximum retry cap
  • Create assertions for message boundary tracking
  • Add error aggregation for parallel tool failures
  • Document streaming limitation for HITL

Long-term Improvements:

  • Refactor agent.py into multiple focused files
  • Implement automatic step finalization
  • Add HITL streaming support
  • Optimize large object serialization

💬 Questions

  1. Can you verify the parallel tool execution implementation? The actual concurrent execution logic wasn't visible in the review.
  2. What's the expected behavior when parallel tools partially fail?
  3. Is there a plan to support streaming for continued runs (HITL)?
  4. Should session save failures block execution or just log warnings?

Overall, this is solid work that significantly improves the agent architecture. The main concerns are around error handling reliability and security hardening. Please address the high-priority issues, and I'm happy to re-review!

Reviewed with: Claude Code + Repository CLAUDE.md guidelines

@onuratakan onuratakan merged commit aac4a17 into master Jan 17, 2026
5 checks passed
@onuratakan onuratakan deleted the agent-run-clean branch January 17, 2026 10:08
@DoganK01 DoganK01 restored the agent-run-clean branch January 17, 2026 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants