Agent Run Implementation by DoganK01 · Pull Request #504 · Upsonic/Upsonic

DoganK01 · 2026-01-05T17:50:39Z

Agent Run Implementation
Memory modes added for UEL. This solves conflicts between placeholder history and model history
Make context managers a regular class as we use step-based architecture in Agent flow
HITL (Human-in-the-loop) logics now managed by RunRequirement class and their usage got changed
Refactored streaming logic usage and removed unnecessary, confusing stream methods in Agent class
debug_level 2
AgentSession class for storing agent related informations
Parallel tool call support for external tool call logic
For one tool at a time calls when external tool call enabled, we provide callable as arg so that we can handle it in the background
Thinking part step added
metadata attribute in Agent class. This enables user to put any metadata into the prompt
Dynamic search tool when KnowledgeBase used as tool (RAG as tool)
tool name prefix for MCP tools. MCPs might have same tool name. In this way, you can use different function tools with same name for different purposes.

…Output

claude · 2026-01-10T15:51:27Z

Pull Request Review: Agent Run Implementation (#504)

This is a substantial architectural refactoring with 38,936 additions and 13,895 deletions across 100 files. The changes introduce a unified pipeline-based execution model with improved HITL support. Below is my detailed review.

🎯 Overall Assessment

Strengths:

Well-architected pipeline system with clear separation of concerns
Comprehensive HITL (Human-in-the-Loop) support through external tool execution
Strong serialization/deserialization infrastructure
Clear state management with AgentRunOutput as single source of truth

Concerns:

Critical: Insufficient test coverage for such a massive refactoring
High Risk: Many breaking changes with limited migration documentation
Security: Cloudpickle usage needs security review
Documentation: Missing comprehensive documentation for new architecture

📋 Detailed Findings

1. Code Quality & Best Practices

✅ Strengths:

Clean Architecture: Pipeline pattern with clear step separation (src/upsonic/agent/pipeline/)
Type Safety: Extensive use of TYPE_CHECKING guards to avoid circular imports (e.g., agent.py:20-64)
Consistent Error Handling: Try-except-finally pattern in all pipeline steps (e.g., steps.py:51-111)
Proper Async Support: Dual async/sync methods throughout (e.g., context_manager.py)
Good Naming: Clear, descriptive names for classes and methods

⚠️ Issues:

1. Lazy Import Anti-Pattern

# src/upsonic/agent/agent.py:20-43
if TYPE_CHECKING:
    from upsonic.models import Model
else:
    Model = "Model"  # String fallback

While this avoids circular imports, it breaks IDE autocomplete and type checking at runtime. Consider restructuring imports or using from __future__ import annotations.

2. Complex Context Management
The AgentRunOutput class has grown very large (636 lines) with many responsibilities:

Message tracking
Tool execution
HITL requirements
Status management
Serialization
Step results
Event streaming

Recommendation: Consider breaking this into smaller, focused classes following Single Responsibility Principle.

3. Inconsistent Error Handling

# src/upsonic/agent/pipeline/steps.py:100-108
except Exception as e:
    step_result = StepResult(
        name=self.name,
        step_number=step_number,
        status=StepStatus.ERROR,
        message=str(e)[:500],  # Truncates error messages
        execution_time=time.time() - start_time,
    )
    raise

Truncating error messages to 500 chars could hide critical debugging information. Consider logging full errors separately.

2. Potential Bugs & Issues

🐛 Critical Issues:

1. Cloudpickle Security Risk

# src/upsonic/run/agent/output.py:625-628
def serialize(self) -> bytes:
    """Serialize to bytes for storage."""
    import cloudpickle
    return cloudpickle.dumps(self.to_dict())

Security Concern: Cloudpickle can execute arbitrary code during deserialization. If AgentRunOutput objects are stored in databases accessible by untrusted users, this is a remote code execution vulnerability.

Recommendation:

Use JSON serialization for untrusted data
If pickle is necessary, implement signature verification
Document security implications clearly
Consider using msgpack or Protocol Buffers

2. Message Confusion

# src/upsonic/run/agent/output.py:106-109
# chat_history: Full conversation history (historical + current) for LLM execution FOR THE SESSION, ALL RUNS
chat_history: List["ModelMessage"] = field(default_factory=list)
# messages: Only NEW messages from THIS run
messages: Optional[List["ModelMessage"]] = None

Having both chat_history and messages is confusing and error-prone. The comment helps, but the dual structure invites bugs where developers use the wrong one.

3. Property Logic Error

# src/upsonic/run/requirements.py:46-54
@property
def needs_confirmation(self) -> bool:
    if self.confirmation is not None:
        return False
    if not self.tool_execution:
        return False
    if self.tool_execution.confirmed is True:  # ← Bug here
        return True
    return self.tool_execution.requires_confirmation or False

Bug: Line 52 returns True when already confirmed, which contradicts the property name. Should be False.

4. Race Condition in RunRequirement

# src/upsonic/run/requirements.py:67-74
@property
def needs_external_execution(self) -> bool:
    if not self.tool_execution:
        return False
    if self.tool_execution.result is not None:
        return False
    return self.tool_execution.external_execution_required or False

No locking mechanism for concurrent access to tool_execution.result. Could cause race conditions in parallel tool execution.

⚠️ Medium Issues:

5. Memory Leak Potential

# src/upsonic/session/agent.py:92-98
def flatten_messages_from_runs_all(self) -> List[ModelMessage]:
    all_msgs = []
    if self.runs:
        for run_data in self.runs.values():
            if run_data.output and run_data.output.messages:
                all_msgs.extend(run_data.output.messages)
    return all_msgs

No limit on the number of runs or messages. Long-running sessions could accumulate unbounded data. Consider:

Max runs limit with LRU eviction
Message pagination
Automatic archiving

6. Silent Failures

# src/upsonic/session/agent.py:136-143
for run_id, run_data in runs_data.items():
    if isinstance(run_data, RunData):
        runs[run_id] = run_data
    elif isinstance(run_data, dict):
        try:
            runs[run_id] = RunData.from_dict(run_data)
        except Exception as e:
            log_warning(f"Failed to deserialize run {run_id}: {e}")  # Silent failure

Failed deserializations are only logged, not raised. Could lead to data loss without user awareness.

3. Performance Considerations

🚀 Optimizations Needed:

1. Inefficient Message Extraction

# src/upsonic/run/agent/output.py:308-312
def new_messages(self) -> List["ModelMessage"]:
    if not self._run_boundaries or not self.messages:
        return (self.messages or []).copy()  # ← Copies entire list
    last_start = self._run_boundaries[-1]
    return self.messages[last_start:].copy()  # ← Another copy

Creates copies of potentially large message lists on every call. Consider:

Returning views/slices without copying
Caching results
Using generators for iteration

2. Repeated Serialization

# Multiple locations serialize the same objects repeatedly
def to_dict(self) -> Dict[str, Any]:
    return {
        "chat_history": [self._serialize_message(m) for m in self.chat_history],
        "messages": self._serialize_list(self.messages),
        # ... many more
    }

No caching of serialized objects. For large sessions with frequent saves, this could be expensive.

3. Pipeline Step Overhead
The new pipeline has 19 steps (direct) vs 14 steps (streaming). Each step has:

Try-except-finally overhead
Event emission
StepResult creation
Statistics updates

For simple tasks, this could add significant latency. Consider:

Step fusion for common paths
Conditional step execution
Lazy evaluation

4. Security Concerns

🔒 Critical Security Issues:

1. Arbitrary Code Execution via Cloudpickle (mentioned above)
Severity: HIGH
CVSS: ~8.1 (High)

2. No Input Validation on External Tool Results

# src/upsonic/run/requirements.py:114-118
def set_external_execution_result(self, result: str) -> None:
    """Set the result from external execution."""
    if not self.tool_execution:
        raise ValueError("No tool execution to set result for")
    self.tool_execution.result = result  # ← No validation

External tool results are accepted without validation. Malicious results could contain:

Code injection payloads
XSS vectors
SQL injection attempts (if results are used in queries)

Recommendation: Add content validation and sanitization.

3. Missing Authentication/Authorization

# src/upsonic/session/agent.py
@dataclass
class AgentSession:
    session_id: str
    user_id: Optional[str] = None  # ← Optional user_id

User ID is optional, and there's no evidence of authorization checks. Any code with access to session_id can manipulate sessions.

4. Metadata Injection

# src/upsonic/agent/agent.py:332-333
# Agent metadata (injected into prompts)
self.metadata = metadata or {}

User-controlled metadata is injected into prompts without sanitization. Could enable prompt injection attacks.

5. Test Coverage

❌ Critical Gap:

Only 1 test file modified in a PR changing 100 files:

tests/doc_examples/agent/adding_tools/adding_tools_1.py

This is insufficient for such a massive refactoring. Required test coverage:

Unit Tests Needed:

✗ Pipeline step execution (all 19 steps)
✗ AgentRunOutput serialization/deserialization
✗ RunRequirement state transitions
✗ AgentSession HITL flows
✗ Message tracking and boundaries
✗ Error recovery (durable execution)
✗ Cancellation handling
✗ Streaming vs non-streaming paths

Integration Tests Needed:

✗ Full pipeline execution (happy path)
✗ External tool pause/resume
✗ Multi-run sessions
✗ Memory modes (update vs replace)
✗ Policy enforcement in pipeline
✗ Culture manager integration

Edge Cases to Test:

✗ Run boundaries with empty messages
✗ Concurrent tool execution
✗ Session serialization with large histories
✗ Pipeline interruption/resumption
✗ Malformed external tool results

Recommendation: Add comprehensive test suite before merging. Consider test-driven development for future features.

6. Breaking Changes

This PR introduces MAJOR BREAKING CHANGES that will affect all existing users:

🔴 Breaking Changes:

AgentRunOutput is now mandatory
- Old: Functions returned simple strings or objects
- New: All execution returns AgentRunOutput
- Migration: Wrap existing code with .output access
RunRequirement scope change
- Old: Handled confirmations, input, and external execution
- New: ONLY handles external tool execution
- Migration: Update HITL code to use AgentRunOutput.status directly
Streaming no longer supports HITL
- astream() explicitly disallows external tool pauses (line 2740-2741)
- Migration: Use do_async() for HITL workflows
Context manager lifecycle changed
- Old: manage_context() primary path
- New: aprepare() → use → afinalize()
- Migration: Update custom context managers
Memory mode changes
- New user_memory_mode: Literal['update', 'replace'] parameter
- Migration: Explicitly set memory mode in configurations
Message API changes
- Separation of chat_history vs messages
- Migration: Update code accessing agent messages

Missing: No migration guide or deprecation warnings provided.

7. Documentation Gaps

📝 Documentation Needed:

Architecture Documentation:

API Documentation:

AgentRunOutput API reference
Pipeline step creation guide
RunRequirement usage guide
Session management patterns

Migration Guides:

v1 to v2 migration guide
Breaking changes checklist
Code examples for all breaking changes
Deprecation timeline

Developer Guides:

Creating custom pipeline steps
Implementing new context managers
Error recovery patterns
Testing pipeline components

🎯 Recommendations

High Priority (Must Address Before Merge):

Add Comprehensive Test Suite
- Target: 80%+ coverage for new code
- Include unit, integration, and e2e tests
- Add performance benchmarks
Fix Security Issues
- Replace cloudpickle with safe serialization for untrusted data
- Add input validation on external tool results
- Implement authorization checks
- Sanitize metadata before prompt injection
Fix Logic Bugs
- Correct needs_confirmation property logic
- Add thread safety for concurrent tool execution
- Handle edge cases in message boundary tracking
Add Migration Documentation
- Write detailed migration guide
- Provide code examples for all breaking changes
- Add deprecation warnings to old APIs
- Create compatibility shims if possible

Medium Priority (Should Address Soon):

Refactor Large Classes
- Break down AgentRunOutput into focused components
- Extract message management to separate class
- Move serialization logic to dedicated module
Add Resource Limits
- Implement max runs per session
- Add message count limits
- Create auto-archiving for old runs
Improve Performance
- Cache serialization results
- Use message views instead of copies
- Optimize pipeline for simple tasks
Documentation
- Architecture overview
- API reference for new components
- Developer guide for extensions

Low Priority (Nice to Have):

Error Messages
- Don't truncate errors to 500 chars
- Add context to error messages
- Include debugging hints
Code Cleanup
- Reduce TYPE_CHECKING string fallbacks
- Consolidate message handling
- Add docstring examples

📊 Risk Assessment

Risk Category	Level	Justification
Test Coverage	🔴 CRITICAL	Only 1 test file for 100 files changed
Security	🔴 CRITICAL	Cloudpickle RCE, no input validation
Breaking Changes	🟠 HIGH	Major API changes without migration guide
Code Quality	🟡 MEDIUM	Good architecture, but some anti-patterns
Performance	🟡 MEDIUM	Some inefficiencies, but not critical
Documentation	🟠 HIGH	Insufficient for scope of changes

Overall Risk: 🔴 HIGH

✅ Approval Criteria

Before this PR can be safely merged:

Add comprehensive test suite (80%+ coverage)
Fix security vulnerabilities (cloudpickle, input validation)
Fix logic bugs (needs_confirmation, race conditions)
Write migration guide with examples
Add architecture documentation
Performance benchmark showing no regression
Security audit completed
Breaking changes documented in CHANGELOG

💡 Positive Notes

Despite the concerns, this is a well-architected refactoring:

✅ Clean pipeline pattern is a significant improvement
✅ Strong separation of concerns
✅ Comprehensive HITL support is well-designed
✅ Consistent error handling patterns
✅ Good use of dataclasses and type hints
✅ Thoughtful serialization infrastructure

With proper testing, security fixes, and documentation, this will be a solid foundation for the framework.

📌 Summary

This is an ambitious and well-architected refactoring that introduces important features like unified pipeline execution and robust HITL support. However, the lack of test coverage, security vulnerabilities, and insufficient documentation make it not ready for merge in its current state.

Recommendation: REQUEST CHANGES

The core architecture is sound, but the risk is too high without adequate testing and security measures. Please address the high-priority items before merging.

Happy to discuss any of these points in detail! 🚀

claude · 2026-01-17T09:12:55Z

Pull Request Review: Agent Run Implementation (#504)

This is a substantial architectural refactoring with 80,397 additions and 18,923 deletions across the codebase. Here's my comprehensive review:

🎯 Summary

This PR introduces a major architectural overhaul of the agent execution system, focusing on:

New AgentRun architecture with AgentRunInput and AgentRunOutput
Improved HITL (Human-in-the-Loop) management via RunRequirement
Refactored storage and memory system
Enhanced streaming and event handling
Parallel tool call support

✅ Strengths

1. Clear Separation of Concerns

The new run/ module provides excellent separation between input, output, events, and requirements
AgentRunInput and AgentRunOutput as dataclasses provide type safety and clear contracts
Single source of truth pattern for run state in AgentRunOutput

2. Improved HITL Architecture

RunRequirement class (src/upsonic/run/requirements.py) provides a clean abstraction for external tool execution
Clear distinction between confirmation, user input, and external execution needs
The is_resolved property elegantly handles requirement completion logic

3. Robust Cancellation Management

Thread-safe RunCancellationManager (src/upsonic/run/cancel.py:13-61)
Proper cleanup with register_run, cancel_run, and cleanup_run lifecycle
Cancellation checks integrated throughout the pipeline steps

4. Step-Based Pipeline Architecture

Well-structured pipeline with discrete steps (InitializationStep, CacheCheckStep, etc.)
Each step handles cancellation, errors, and result tracking
StepResult provides excellent execution visibility

5. Comprehensive Storage Refactoring

Abstract Storage base class provides clean interface (src/upsonic/storage/base.py)
Support for multiple backends (In-Memory, JSON, SQLite, Redis, PostgreSQL, MongoDB, Mem0)
Consistent async/sync patterns across implementations

⚠️ Issues & Concerns

Critical Security Issues

1. Unsafe URL Fetching (src/upsonic/run/agent/input.py:54)

response = httpx.get(url_content.url)
response.raise_for_status()

Problem: No timeout, no validation, vulnerable to SSRF attacks.

Recommendation:

response = httpx.get(url_content.url, timeout=30.0, follow_redirects=False)
# Add URL validation to prevent access to internal resources

2. Eval Usage in Test Examples

Multiple test files use eval() for calculator tools (tests/smoke_tests/agent/stream_events_example.py:43):

result = eval(expression)

Problem: While this is in test code, it sets a bad example. Users may copy this pattern.

Recommendation: Use ast.literal_eval() or a safe expression parser like numexpr.

Code Quality Issues

3. Massive File Sizes

src/upsonic/agent/agent.py: 1,656 additions (likely 2000+ lines total)
src/upsonic/agent/pipeline/steps.py: 2,509 additions
src/upsonic/run/events/events.py: 1,410 new lines
src/upsonic/run/agent/output.py: 1,073 new lines

Problem: These files are becoming monolithic and hard to maintain.

Recommendation: Consider breaking down into smaller modules:

Split steps.py into individual step files
Extract event definitions into separate files by category
Consider separating agent initialization, execution, and streaming logic

4. Inconsistent Error Handling

In steps.py, error messages are truncated to 500 chars:

message=str(e)[:500]

Problem: This arbitrary limit may hide critical debugging information.

Recommendation:

Use structured logging for full error details
Store full error in StepResult.error field
Keep truncated version for user-facing messages

5. Memory Accumulation Risk

AgentRunOutput.events accumulates all events during streaming (src/upsonic/run/agent/output.py:136):

events: Optional[List["AgentEvent"]] = None

Problem: For long-running agents, this list could grow unbounded.

Recommendation:

Add configurable event limit with circular buffer
Consider event persistence/streaming instead of in-memory accumulation
Add memory usage monitoring

6. Type Safety Concerns

Multiple uses of Optional[Any] reduce type safety:

metadata: Optional[Dict[str, Any]] (widespread)
session_state: Optional[Dict[str, Any]]

Recommendation: Define Pydantic models or TypedDicts for structured metadata.

Performance Concerns

7. Synchronous URL Downloads in Init

AgentRunInput.__post_init__() does synchronous HTTP calls:

response = httpx.get(url_content.url)

Problem: Blocks initialization for remote resources.

Recommendation: Make this lazy or async-only.

8. Global State in Cancellation Manager

_cancellation_manager = RunCancellationManager()

Problem: Global mutable state can cause issues in multi-threaded/multi-process environments.

Recommendation: Consider context-based or dependency injection pattern.

Testing Concerns

9. Test Organization

Tests moved from unit_tests/ to smoke_tests/ (commit: "fix: moved smoke tests from unit_tests into smoke_tests")

Problem: This reduces fast unit test coverage.

Recommendation:

Maintain separate unit tests for business logic
Keep smoke tests for integration scenarios
Ensure CI runs both test suites

10. Missing Test Coverage

For 80K lines of changes, I'd expect to see more new test files. The PR includes fixes to existing tests but limited new coverage for:

RunRequirement class
Cancellation manager
New event types
Storage migration paths

📊 Architecture Review

Good Patterns

✅ Dataclass usage for DTOs (AgentRunInput, AgentRunOutput)
✅ ABC usage for storage providers
✅ Event-driven architecture for streaming
✅ Step-based pipeline with clear lifecycle
✅ Separation of sync/async implementations

Anti-Patterns to Address

❌ Global mutable state (cancellation manager)
❌ Monolithic files (2000+ lines)
❌ Optional[Any] reducing type safety
❌ Unbounded memory accumulation (events list)
❌ Synchronous I/O in initialization

🔍 Additional Observations

Documentation

Good inline docstrings in new classes
Consider adding migration guide for users of old architecture
Architecture decision records would be valuable for this scale of change

Breaking Changes

This appears to be a breaking change for:

Agent initialization
Memory management
Context manager usage
HITL flow handling

Recommendation: Ensure changelog clearly documents migration path.

Dependencies

The storage refactoring introduces tight coupling to multiple database drivers. Consider:

Optional dependency groups (already partially done)
Plugin architecture for storage backends
Clear documentation of which extras to install

🎯 Recommendations Summary

Must Fix Before Merge

✋ Add timeout and validation to httpx.get() calls
✋ Add documentation for breaking changes and migration
✋ Address unbounded event accumulation

Should Fix Before Merge

📋 Split monolithic files into smaller modules
📋 Add unit tests for new core classes (RunRequirement, RunCancellationManager)
📋 Replace eval() in examples with safe alternatives
📋 Add memory usage limits/monitoring

Consider for Follow-up

💡 Refactor global state to dependency injection
💡 Define stricter types for metadata fields
💡 Add performance benchmarks for storage backends
💡 Create architecture decision records

🎖️ Overall Assessment

This is ambitious and well-structured work that significantly improves the agent execution architecture. The separation of concerns, event streaming, and HITL improvements are excellent.

However, the scale of changes (80K additions) introduces risk. The security issues must be addressed, and the monolithic file sizes will become maintenance burdens.

Recommendation:

✅ Approve with required changes (security fixes, documentation)
🔄 Plan follow-up refactoring to break down large files
📊 Add performance benchmarks to prevent regressions
🧪 Expand test coverage incrementally

Great work on tackling this complex refactoring! The new architecture provides a solid foundation for future features.

Review generated with assistance from Claude Code

claude · 2026-01-17T10:06:50Z

🔍 Comprehensive Code Review: PR #504 - Agent Run Implementation

Thank you for this substantial contribution! This is an impressive architectural refactor that brings significant improvements to the agent execution model. I've conducted a thorough review and have feedback across multiple dimensions.

📊 Overall Assessment

Code Quality: ⭐⭐⭐⭐ (4/5)

The implementation demonstrates strong architectural design with excellent separation of concerns, comprehensive type hints, and solid error handling patterns. The pipeline architecture is elegant and maintainable. However, there are several critical gaps around error handling, security, and test coverage that should be addressed.

Recommendation: Request changes - Please address the high-priority issues before merging, particularly around session save reliability and security concerns.

✅ Major Strengths

Pipeline Architecture - Clean step-based execution with clear responsibilities across 24 distinct step classes
AgentRunOutput as Single Source of Truth - Excellent design consolidating all run state in one place
Context Manager Migration - Successfully converted to regular classes with proper aprepare()/afinalize() lifecycle
Comprehensive Type Hints - Extensive use of TYPE_CHECKING guards and proper typing throughout
Event Streaming - Proper event emission at each step for observability
HITL (Human-in-the-Loop) - Intelligent resumption with proper state management

🔴 Critical Issues (High Priority)

1. Session Save Failures Are Silent

File: src/upsonic/agent/pipeline/manager.py:98-123

Issue: Session save exceptions are swallowed with only debug-level warnings:

except Exception as save_error:
    if self.debug:
        warning_log(f"Failed to save session: {save_error}", "PipelineManager")

Impact: Critical data loss risk - HITL state could be lost without user awareness

Fix:

Always log session save failures (not just in debug mode)
Consider retrying or raising if save is critical
Add telemetry/metrics for save failures

2. Memory Manager Finalization Silent Failure

File: src/upsonic/agent/context_managers/memory_manager.py:98-107

Issue: Session won't save if set_run_output() isn't called:

if self.memory and self._agent_run_output:
    await self.memory.save_session_async(output=self._agent_run_output)

Impact: Silent data loss if the run output is never set

Fix: Add warning log if _agent_run_output is None during finalization

3. Message Boundary Tracking Fragility

File: src/upsonic/run/agent/output.py:329-379

Issue: start_new_run() and finalize_run_messages() must be called in correct order manually. If pipeline flow changes, new_messages() could return incorrect data.

Impact: Core feature fragility - incorrect message tracking breaks memory management

Fix:

Add assertions to detect lifecycle misuse
Consider making this automatic via pipeline lifecycle hooks

4. Security: Unsafe Session Deserialization

File: src/upsonic/session/agent.py:129-192 (around line 870s based on pattern)

Issue: Complex deserialization with importlib.import_module for dynamic class loading

Impact: Potential arbitrary code execution if storage is compromised

Fix: Implement a whitelist of allowed modules for deserialization

5. Parallel Tool Call Implementation Unverified

File: src/upsonic/tools/processor.py:295-300

Issue: Code tracks sequential flag but actual parallel execution logic verification is needed

Impact: Cannot confirm parallel tool execution actually works as designed

Fix:

Review actual tool execution implementation
Add comprehensive tests with multiple parallel tools
Verify error handling when parallel tools fail

🟡 Medium Priority Issues

6. No Maximum Retry Cap

File: src/upsonic/agent/agent.py:308-311

Issue: Validation only checks retry < 1 but doesn't cap maximum:

if retry < 1:
    raise ValueError("The 'retry' count must be at least 1.")

Impact: Could allow extremely high retry counts

Fix: Add maximum retry cap (e.g., 100)

7. External Tool Result Validation Missing

File: src/upsonic/agent/pipeline/manager.py:690-729

Issue: No validation that tool results are actually provided before resumption

Impact: Could resume with incomplete/invalid tool results

Fix: Add validation in continue_run_async to ensure all requirements have valid results

8. External Tool Execution Accepts Unsanitized Input

File: src/upsonic/requirements.py:114-118

Issue: set_external_execution_result accepts any string without validation:

def set_external_execution_result(self, result: str) -> None:
    if not self.tool_execution:
        raise ValueError("No tool execution to set result for")
    self.tool_execution.result = result

Impact: Malicious results could be injected depending on downstream processing

Fix: Add result validation/sanitization

9. Tool Call Limit Not Enforced for Parallel Tools

File: src/upsonic/agent/agent.py:142, 793-795

Issue: tool_call_limit exists but unclear enforcement in parallel execution

Impact: Parallel tools could exceed limit

Fix: Enforce limit before dispatching parallel tools

10. Session Deserialization Robustness

File: src/upsonic/session/agent.py:129-192

Issue: Complex dict/object type checking with many potential failure points

Impact: Deserialization could fail on edge cases

Fix: Add comprehensive error handling with sensible fallback values

🟢 Low Priority / Suggestions

11. Large File Size

File: src/upsonic/agent/agent.py - 3,651 lines

Suggestion: Consider splitting into:

agent_core.py - Main execution logic
agent_tools.py - Tool management
agent_policies.py - Policy management
agent_resumption.py - HITL/continuation logic

12. Event Deduplication Logic

File: src/upsonic/agent/pipeline/manager.py:535-536

Issue: Uses object equality which may not work for all event types:

if event not in context.events:
    context.events.append(event)

Suggestion: Use event IDs instead of object equality

13. nest_asyncio Usage

File: src/upsonic/agent/agent.py:894-912

Note: Uses nest_asyncio for tool policy validation in potentially sync context

Suggestion: Document this behavior clearly; could hide async context issues in edge cases

📝 Test Coverage Gaps

Missing Tests:

Parallel Tool Execution - No visible tests for multiple parallel tools
Edge Cases:
- Resume from mid-pipeline with incomplete manager state
- Multiple external tool rounds with errors
- Session save failures during HITL
Streaming + HITL Interaction - Verify that HITL correctly rejects streaming mode

Recommendation:

Add negative test cases and parallel execution tests before merging.

⚡ Performance Considerations

1. Large Object Serialization

File: src/upsonic/run/agent/output.py:592-799

Concern: to_dict() is very comprehensive (200+ lines) - could be slow for large runs

Suggestion:

Add lazy serialization option
Consider storing large fields (messages, events) separately

2. Unlimited Event List Growth

File: src/upsonic/run/agent/output.py:136

Concern: events: Optional[List["AgentEvent"]] = None - no size limit

Impact: Memory bloat for long-running streaming tasks

Suggestion: Add event list size limit or circular buffer

3. Step Result Storage

File: src/upsonic/run/agent/output.py:130-133

Concern: All step results stored in memory and serialized

Suggestion: Consider optional step result pruning for long pipelines

🎯 Action Items

Immediate (Before Merge):

Add session save failure logging (always, not just debug)
Add memory manager finalization warnings
Implement security whitelist for deserialization
Add validation for external tool results before resumption
Add comprehensive parallel tool execution tests

Short-term:

Implement maximum retry cap
Create assertions for message boundary tracking
Add error aggregation for parallel tool failures
Document streaming limitation for HITL

Long-term Improvements:

Refactor agent.py into multiple focused files
Implement automatic step finalization
Add HITL streaming support
Optimize large object serialization

💬 Questions

Can you verify the parallel tool execution implementation? The actual concurrent execution logic wasn't visible in the review.
What's the expected behavior when parallel tools partially fail?
Is there a plan to support streaming for continued runs (HITL)?
Should session save failures block execution or just log warnings?

Overall, this is solid work that significantly improves the agent architecture. The main concerns are around error handling reliability and security hardening. Please address the high-priority issues, and I'm happy to re-review!

Reviewed with: Claude Code + Repository CLAUDE.md guidelines

DoganK01 added 10 commits January 5, 2026 20:18

Agent Run Implementation

a3d0e62

fix: system prompt setting up fix

dc9a9d0

fix: test fix, duplicate file remove

363a8be

fix: remove old version files

afd9071

fix: fix tests getting stuck

b1977cd

fix: moved smoke tests from unit_tests into smoke_tests

1f94bfa

fix: fix unit tests for 3.12

f8c8cdb

fix: fix milvus test stucking

70c6b2d

fix: test fix for python versions

aee828d

refactor: remove AgentRunContext and make it all dependet on AgentRun…

6135b41

…Output

DoganK01 added 4 commits January 17, 2026 12:05

fix: notebook fixed

aff5197

refactor: storage providers/memory class refactor

ad636b3

fix: fix test for new architectures

5db686e

refactor: pytest.ini update

aebb910

chor: version 0.70.0

d30743c

onuratakan merged commit aac4a17 into master Jan 17, 2026
5 checks passed

onuratakan deleted the agent-run-clean branch January 17, 2026 10:08

DoganK01 restored the agent-run-clean branch January 17, 2026 10:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agent Run Implementation#504

Agent Run Implementation#504
onuratakan merged 15 commits intomasterfrom
agent-run-clean

DoganK01 commented Jan 5, 2026 •

edited

Loading

Uh oh!

claude bot commented Jan 10, 2026 •

edited by DoganK01

Loading

Uh oh!

claude bot commented Jan 17, 2026

Uh oh!

claude bot commented Jan 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

DoganK01 commented Jan 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Jan 10, 2026 • edited by DoganK01 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Review: Agent Run Implementation (#504)

🎯 Overall Assessment

📋 Detailed Findings

1. Code Quality & Best Practices

✅ Strengths:

⚠️ Issues:

2. Potential Bugs & Issues

🐛 Critical Issues:

⚠️ Medium Issues:

3. Performance Considerations

🚀 Optimizations Needed:

4. Security Concerns

🔒 Critical Security Issues:

5. Test Coverage

❌ Critical Gap:

6. Breaking Changes

🔴 Breaking Changes:

7. Documentation Gaps

📝 Documentation Needed:

🎯 Recommendations

High Priority (Must Address Before Merge):

Medium Priority (Should Address Soon):

Low Priority (Nice to Have):

📊 Risk Assessment

✅ Approval Criteria

💡 Positive Notes

📌 Summary

Uh oh!

claude bot commented Jan 17, 2026

Pull Request Review: Agent Run Implementation (#504)

🎯 Summary

✅ Strengths

1. Clear Separation of Concerns

2. Improved HITL Architecture

3. Robust Cancellation Management

4. Step-Based Pipeline Architecture

5. Comprehensive Storage Refactoring

⚠️ Issues & Concerns

Critical Security Issues

1. Unsafe URL Fetching (src/upsonic/run/agent/input.py:54)

2. Eval Usage in Test Examples

Code Quality Issues

3. Massive File Sizes

4. Inconsistent Error Handling

5. Memory Accumulation Risk

6. Type Safety Concerns

Performance Concerns

7. Synchronous URL Downloads in Init

8. Global State in Cancellation Manager

Testing Concerns

9. Test Organization

10. Missing Test Coverage

📊 Architecture Review

Good Patterns

Anti-Patterns to Address

🔍 Additional Observations

Documentation

Breaking Changes

Dependencies

🎯 Recommendations Summary

Must Fix Before Merge

Should Fix Before Merge

Consider for Follow-up

🎖️ Overall Assessment

Uh oh!

claude bot commented Jan 17, 2026

🔍 Comprehensive Code Review: PR #504 - Agent Run Implementation

📊 Overall Assessment

✅ Major Strengths

🔴 Critical Issues (High Priority)

1. Session Save Failures Are Silent

2. Memory Manager Finalization Silent Failure

3. Message Boundary Tracking Fragility

4. Security: Unsafe Session Deserialization

DoganK01 commented Jan 5, 2026 •

edited

Loading

claude bot commented Jan 10, 2026 •

edited by DoganK01

Loading