Skip to content

External tool fix#457

Merged
onuratakan merged 8 commits intomasterfrom
external-tool-fix
Nov 1, 2025
Merged

External tool fix#457
onuratakan merged 8 commits intomasterfrom
external-tool-fix

Conversation

@DoganK01
Copy link
Copy Markdown
Member

  • some bug and external tool call fix

@claude
Copy link
Copy Markdown

claude bot commented Oct 31, 2025

Pull Request Review - External Tool Fix (#457)

Overview

This is a large PR (16,024 additions, 4,320 deletions) that introduces several major features and bug fixes:

  • Durable execution system with checkpointing
  • OCR engine with multiple provider support
  • Direct class for simplified LLM interactions
  • State graph implementation (graphv2)
  • External tool execution fixes
  • Policy manager improvements

Code Quality & Architecture

Strengths ✅

  1. Well-structured durable execution system

    • Clean separation of concerns with DurableExecution, DurableExecutionStorage, and serializer
    • Multiple storage backend support (In-Memory, File, SQLite, Redis)
    • Good error handling and recovery mechanisms in agent.py:1862-2109
  2. OCR integration is modular

    • Abstract base class pattern in ocr/base.py
    • Multiple provider support (EasyOCR, RapidOCR, Tesseract, PaddleOCR, DeepSeek)
    • Consistent error handling with custom exception hierarchy
  3. Direct class provides clean API

    • Simplified interface for quick LLM interactions in direct.py
    • Proper async/sync handling with builder pattern (with_model, with_settings)
  4. Policy manager improvements

    • Better exception handling for DisallowedOperation in agent.py:911-913, 1251-1253
    • Re-raises disallowed exceptions properly instead of swallowing them

Concerns & Issues ⚠️

1. Missing Test Coverage 🔴 CRITICAL

  • No tests for durable execution: grep -r "durable" tests/ returns 0 results
  • No tests for OCR functionality: No test files in tests/ for OCR
  • No tests for Direct class: Not covered in test suite
  • No tests for graphv2: State graph implementation is untested

Impact: These are major features that could break in production. Durable execution handles failure recovery - this MUST have tests.

Recommendation: Add comprehensive tests for:

# Required test files:
tests/unit_tests/test_durable_execution.py
tests/unit_tests/ocr/test_ocr_providers.py  
tests/unit_tests/test_direct.py
tests/unit_tests/test_graphv2.py

2. Agent Continuation Logic Complexity 🟡

File: agent.py:1740-1860

The continue_async method recreates the pipeline with a subset of steps. This is fragile:

  • Risk: If pipeline step order changes in do_async, continuation breaks
  • Issue: Steps are hardcoded twice (lines 1325-1345 and 1829-1841)
  • Problem: Comment says "SKIP ALL STEPS UNTIL MessageBuildStep" but relies on manual step selection

Recommendation:

  • Extract pipeline step configuration to a shared method
  • Add integration tests for continuation scenarios
  • Consider marking which steps are "continuation-safe" via metadata

3. Pipeline Manager State Tracking 🟡

File: agent/pipeline/manager.py:162-267

# Lines 135-140: Using bare attributes for tracking
self._current_step_index = step_index
self._current_step = step

Issues:

  • State stored in manager instance, not in context
  • Race conditions possible if manager is reused across tasks
  • Error handling at line 267 uses getattr with defaults, masking bugs

Recommendation:

  • Move tracking state into StepContext
  • Make PipelineManager stateless and reusable

4. Security: Policy Manager Debug Flag 🟡

File: agent.py:1328-1330

if debug:
    self.user_policy_manager.debug = True
    self.agent_policy_manager.debug = True

Issue: Mutates shared policy manager state based on per-task debug flag. If policy managers are reused across tasks, this could leak debug output.

Recommendation: Pass debug flag to execute_policies_async instead of mutating manager state.

5. Type Annotations 🟡

Multiple files use Any extensively:

  • agent.py:1865: storage: Optional[Any] = None
  • direct.py:43: model: Union[str, Any, None] = None

Impact: Reduces type safety and IDE autocomplete effectiveness

Recommendation: Use proper types from typing or create TypeAliases

6. Error Handling in Direct Class 🟢 Minor

File: direct.py:171-182

except Exception as e:
    print(f"Warning: Could not load attachment {attachment_path}: {e}")

Issue: Uses print instead of logging framework, swallows all exceptions

Recommendation: Use proper logging and consider raising for critical errors

7. Durable Checkpoint Status Logic 🟡

File: agent/pipeline/manager.py:210-222

The checkpoint logic saves on both SUCCESS and PENDING but determines resume point differently for "failed" vs "success" status in continue_durable_async. This could cause issues:

# Line 210: Saves checkpoint on SUCCESS
if result.status == StepStatus.SUCCESS:
    await self._save_durable_checkpoint(context, step, ...)

# Line 215: Also saves on PENDING  
if result.status == StepStatus.PENDING:
    await self._save_durable_checkpoint(context, step, ..., status="paused")

Concern: What happens if a step succeeds but the next step immediately fails? The logic in continue_durable_async at lines 1934-1945 may skip the wrong step.

Performance Considerations

Positive

  • Durable execution uses efficient serialization
  • OCR supports multiple backends for different use cases
  • Direct class avoids memory/knowledge base overhead

Concerns

  1. Large dependency additions: OCR dependencies add ~6 packages (EasyOCR, PaddleOCR, pdf2image, etc.) - should be in separate extras group ✅ (already done)
  2. Checkpoint I/O: Every pipeline step now triggers a checkpoint save. For long pipelines, this could be slow with file-based storage.

Recommendation: Add configurable checkpoint frequency (e.g., every N steps)

Security Review

Good

✅ Policy manager properly re-raises DisallowedOperation
✅ OCR has proper input validation in base classes
✅ File paths validated in OCR utils

Concerns

🟡 Direct class reads files without size limits (direct.py:172)
🟡 Durable execution deserializes pickled data - ensure storage is trusted

Documentation

Missing

  • No docstrings for several new graphv2 classes
  • Limited examples for durable execution recovery
  • OCR provider comparison/selection guide missing

Good

  • Direct class has clear docstrings with examples
  • Durable execution has good class-level documentation

Breaking Changes

⚠️ Potential Breaking Change: Direct is no longer an alias for Agent (agent.py removed line: Direct = Agent)

This was added to __init__.py but may break existing code that expects from upsonic.agent import Direct.

Recommendations Summary

Must Fix (Before Merge) 🔴

  1. Add comprehensive tests for durable execution, OCR, Direct, and graphv2
  2. Fix pipeline continuation fragility - extract step configuration
  3. Document breaking changes in changelog/migration guide

Should Fix (High Priority) 🟡

  1. Fix policy manager debug state mutation
  2. Improve type annotations (reduce Any usage)
  3. Move pipeline tracking state to context (thread safety)
  4. Review durable checkpoint resume logic for edge cases

Nice to Have 🟢

  1. Add logging instead of print statements
  2. Add file size limits in Direct class
  3. Add checkpoint frequency configuration
  4. Add OCR provider comparison documentation

Conclusion

This PR adds valuable features but should not be merged without tests. The durable execution system is particularly critical - it handles failure recovery and must be thoroughly tested.

Recommendation:

  • Create follow-up PR with test coverage for new features
  • OR add tests to this PR before merging
  • Consider breaking into smaller PRs (durable, OCR, direct, graphv2)

Overall Assessment: 6/10 - Good architecture and features, but missing critical test coverage and has some implementation concerns.


Review generated by Claude Code - Let me know if you'd like me to elaborate on any specific concerns!

@claude
Copy link
Copy Markdown

claude bot commented Oct 31, 2025

Pull Request Review - External Tool Fix (#457)

Overview

This is a substantial PR with 16,101 additions and 4,320 deletions that introduces several major features:

  1. Durable Execution System - checkpoint/recovery for long-running tasks
  2. OCR Integration - multiple OCR provider support
  3. Direct Interface - simplified high-speed LLM interaction
  4. External Tool Continuation - improved handling of paused tasks
  5. Bug fixes - agent policy enforcement and pipeline improvements

🟢 Code Quality & Strengths

1. Well-Structured Architecture

  • Clean separation of concerns with dedicated modules (durable, ocr, direct)
  • Proper lazy imports in __init__.py to avoid loading heavy dependencies
  • Good use of abstract base classes (DurableExecutionStorage, OCRProvider)

2. Comprehensive Error Handling

  • Re-raises DisallowedOperation exceptions from policy managers (agent.py:911, 1251)
  • Proper async/sync handling with fallback patterns in Direct class
  • Good error context preservation in durable execution recovery

3. Good Documentation

  • Docstrings with examples for Direct class and durable methods
  • Clear usage examples in class documentation
  • Type hints throughout new code

4. Production-Ready Features

  • Checkpoint/recovery system with multiple storage backends
  • Sentry transaction tracking and spans in pipeline execution
  • Proper cleanup and resource management

🟡 Issues & Concerns

1. CRITICAL: Invalid Python Syntax ⚠️

Location: src/upsonic/agent/pipeline/manager.py:165

for step_index, step in enumerate[Step](self.steps):

Issue: This is invalid syntax. The correct syntax should be:

for step_index, step in enumerate(self.steps):

Python's enumerate() doesn't take type parameters with brackets. This will cause a runtime error.

Impact: This will break the entire pipeline execution.


2. Agent Policy Step Reordering ⚠️

Location: src/upsonic/agent/agent.py:1349-1354

The AgentPolicyStep was moved before CallManagementStep:

MemoryMessageTrackingStep(),
AgentPolicyStep(),  # Move before CallManagementStep
CallManagementStep(),
TaskManagementStep(),

Concerns:

  • The comment suggests intentional reordering, but no explanation is provided
  • This changes the execution flow significantly - policies now apply before call management
  • Could break existing behavior if code depends on the original order
  • No tests appear to validate this change

Recommendation: Add tests to verify the new ordering is correct and document why this change was necessary.


3. Incomplete Continuation State Management

Location: src/upsonic/agent/agent.py:1786-1803

if hasattr(task, '_continuation_state') and task._continuation_state:
    continuation_messages = task._continuation_state.get('messages', [])
    response_with_tool_calls = task._continuation_state.get('response_with_tool_calls')

Issues:

  • Uses private attribute _continuation_state without property accessor
  • No validation that continuation_state has the expected structure
  • Missing documentation on when/how _continuation_state is set on tasks
  • No clear initialization or cleanup of this state

Recommendation: Create a proper property or method to manage continuation state with validation.


4. Missing Tests for Major Features ⚠️

Based on file analysis, there are no dedicated tests for:

  • Durable execution checkpoint/recovery
  • Direct interface functionality
  • External tool continuation logic
  • OCR provider implementations

The only test files touched are playground/LCEL tests which appear unrelated to the core changes.

Impact: High risk of regressions and unexpected behavior in production.

Recommendation: Add comprehensive test coverage:

tests/durable/test_checkpoint_recovery.py
tests/durable/test_storage_backends.py
tests/test_direct_interface.py
tests/test_external_tool_continuation.py
tests/ocr/test_ocr_providers.py

5. Heavy OCR Dependencies

Location: pyproject.toml:123-130

ocr = [
    "easyocr>=1.7.2",
    "paddleocr>=2.10.0",
    "pdf2image>=1.17.0",
    "pillow>=11.3.0",
    "pytesseract>=0.3.13",
    "rapidocr-onnxruntime>=1.4.4",
]

Concerns:

  • 6 OCR libraries with significant installation size
  • Some require system dependencies (tesseract, poppler for pdf2image)
  • No documentation about when to use which provider
  • ocr-gpu extra is defined but empty

Recommendations:

  • Document installation requirements and system dependencies
  • Provide guidance on provider selection
  • Consider making some providers optional sub-extras
  • Populate ocr-gpu or remove it

6. Durable Execution Retry Logic

Location: src/upsonic/agent/pipeline/manager.py:217-232

if checkpoint_status == "failed":
    resume_from_index = step_index  # Retry the failed step
else:
    resume_from_index = step_index + 1  # Continue from next step

Concerns:

  • No limit on retry attempts for failed steps
  • Could create infinite retry loops if the step consistently fails
  • No exponential backoff or delay between retries
  • Missing configuration for retry behavior

Recommendation: Add retry limits and backoff configuration to durable execution.


7. Type Safety Issues

Several locations have type safety concerns:

  • manager.py:165 - Invalid enumerate syntax (critical)
  • direct.py:70-73 - Duck typing check with hasattr(model, 'request') instead of proper type checking
  • Missing type hints in several callback/handler functions

8. External Tool Result Conversion

Location: src/upsonic/agent/agent.py:1754-1771

for tool_call in task.tools_awaiting_external_execution:
    tool_call_id = None
    if hasattr(tool_call, 'tool_call_id'):
        tool_call_id = tool_call.tool_call_id
    result = getattr(tool_call, 'result', None)

Issues:

  • Inconsistent interface handling with hasattr and getattr
  • Comment mentions both ExternalToolCall and ToolCall but no type checking
  • Could fail silently if objects don't have expected attributes

Recommendation: Use proper type checking or Protocol definitions.


🔒 Security Considerations

1. File Handling in Direct Interface

with open(attachment_path, "rb") as attachment_file:
    attachment_data = attachment_file.read()
  • No path validation or sanitization
  • Could read sensitive files if paths are user-controlled
  • No size limits on file reads

Recommendation: Add path validation, size limits, and allowlist checking.

2. Policy Manager Re-raises

The change to re-raise DisallowedOperation exceptions is good for security - ensures policy violations aren't silently caught.


⚡ Performance Considerations

1. Durable Checkpoint Overhead

  • Checkpoint is saved after every successful step (manager.py:211)
  • For pipelines with many steps, this could add significant I/O overhead
  • Consider making checkpoint frequency configurable

2. External Tool Continuation

  • Creates entirely new pipeline execution instead of true resumption
  • Rebuilds context and state from scratch
  • Could be optimized for better performance

📋 Recommendations

High Priority (Fix Before Merge)

  1. Fix the enumerate[Step] syntax error - This will break execution
  2. Add tests for durable execution, Direct interface, and continuation logic
  3. Document the AgentPolicyStep reordering - Why was this changed?

Medium Priority

  1. Add retry limits and configuration to durable execution
  2. Improve type safety with proper Protocol/ABC usage
  3. Add file security validation in Direct interface
  4. Document OCR provider selection and system requirements

Nice to Have

  1. Make checkpoint frequency configurable
  2. Add benchmarks for performance-critical paths
  3. Consider optimization of continuation flow

📊 Summary

Category Rating Notes
Code Quality 🟡 Good Well-structured but has critical syntax error
Test Coverage 🔴 Poor Major features lack tests
Security 🟡 Adequate Some concerns with file handling
Performance 🟡 Good Some optimization opportunities
Documentation 🟢 Good Good docstrings and examples

Final Verdict

⚠️ Changes Requested

This PR introduces valuable features but has a critical syntax error that must be fixed before merge. The lack of test coverage for major new features is also concerning for production readiness.

Required Changes:

  1. Fix enumerate[Step] syntax error
  2. Add test coverage for new features
  3. Document AgentPolicyStep reordering rationale

Once these issues are addressed, this will be a solid addition to the framework.


Review based on CLAUDE.md guidelines and Python best practices

@claude
Copy link
Copy Markdown

claude bot commented Oct 31, 2025

Pull Request Review: External Tool Fix & Feature Additions

I've reviewed PR #457 which includes 16,136 additions and 4,320 deletions. This is a substantial PR that introduces several major features alongside bug fixes. Here's my comprehensive feedback:


🎯 Overview

This PR introduces three major feature sets:

  1. Durable Execution: Checkpoint-based execution recovery system
  2. External Tool Continuation: Improved handling of paused executions for external tools
  3. OCR Integration: Multi-provider OCR support
  4. GraphV2: New state graph execution system

Plus various bug fixes and improvements to the pipeline, policy system, and safety engine.


✅ Strengths

1. Durable Execution System

  • Well-structured with clear separation of concerns (execution, storage, serialization)
  • Multiple storage backends (Memory, File, SQLite, Redis) following a common interface
  • Good error handling with checkpoint status tracking (success, failed, paused)
  • Clean API design: continue_durable() and continue_durable_async() methods

2. External Tool Continuation Logic

  • Properly handles continuation state with _continuation_state on tasks
  • Converts ExternalToolCall to ToolReturnPart messages correctly
  • Restores pipeline state intelligently by skipping completed initialization steps

3. Code Organization

  • OCR module follows clean provider pattern with base classes
  • Proper lazy loading in __init__.py
  • Good use of dataclasses for structured data (ExternalToolCall, etc.)

⚠️ Critical Issues

1. Missing Test Coverage ❗❗❗

Severity: HIGH

The PR adds ~16K lines of code but includes NO tests for the major new features:

  • ❌ No tests for DurableExecution checkpoint/recovery logic
  • ❌ No tests for continue_async() with external tool results
  • ❌ No tests for continue_durable_async()
  • ❌ No tests for the new GraphV2 system
  • ❌ Only one existing OCR test (test_pdfplumber_ocr.py)

Recommendation: Add comprehensive tests before merging:

# tests/unit_tests/test_durable_execution.py
# tests/unit_tests/test_external_continuation.py
# tests/unit_tests/graphv2/test_state_graph.py

2. Pipeline Step Order Change 🔴

Location: src/upsonic/agent/agent.py:1349

# Before (line order changed):
AgentPolicyStep(),  # Moved BEFORE CallManagementStep
CallManagementStep(),

Issue: This changes execution order without clear justification. The comment says "Move before CallManagementStep" but doesn't explain why. This could:

  • Break existing agent behaviors
  • Affect tool call interception
  • Change when policies are applied vs when calls are managed

Recommendation:

  • Document WHY this change is necessary
  • Add tests verifying the new behavior
  • Consider if this is a breaking change

3. Error Handling in DisallowedOperation Re-raising

Location: src/upsonic/agent/agent.py:911-913, 1251-1253

if result.should_block():
    if result.disallowed_exception:
        raise result.disallowed_exception  # Re-raises caught exception

Issue: This re-raises exceptions that were already caught by PolicyManager. Questions:

  • Will this bypass error handling in the pipeline?
  • Should these exceptions be wrapped or logged first?
  • Is there proper cleanup before re-raising?

Recommendation: Add proper exception context and ensure cleanup happens.

4. Potential Race Condition in Pipeline Manager

Location: src/upsonic/agent/pipeline/manager.py:165-167

for step_index, step in enumerate[Step](self.steps):
    self._current_step_index = step_index
    self._current_step = step

Issue: Instance variables _current_step_index and _current_step could cause issues if:

  • Multiple tasks run concurrently
  • Async operations interleave
  • Error occurs before these are set

Recommendation: Store current step info in the context instead of on the manager instance.


🐛 Bug Fixes & Code Quality

1. Type Annotation Issues

Location: src/upsonic/agent/pipeline/manager.py:165

for step_index, step in enumerate[Step](self.steps):  # ❌ Wrong syntax

Issue: This should be:

for step_index, step in enumerate(self.steps):  # ✅ Correct

enumerate doesn't take type parameters in Python. The type hint should be on the variable or return type.

2. Memory Management in Durable Storage

Location: src/upsonic/durable/storages/file.py

The file storage loads entire checkpoint data into memory. For large tasks with many messages/images:

  • Could cause OOM errors
  • No size limits or warnings
  • Recommend adding size checks or streaming for large checkpoints

3. Timeout Addition to DeepAgent Tools

Location: src/upsonic/agent/deep_agent/tools.py

@tool(
    sequential=False,
    show_result=False,
    docstring_format='google',
    timeout=60.0  # Added timeout
)

Good: Adding timeouts prevents hanging operations
Question: Why 60 seconds specifically? Should this be configurable? What happens on timeout?


🔒 Security Considerations

1. OCR File Processing

The OCR module processes user-provided files (PDFs, images). Ensure:

  • ✅ File type validation (appears to be present)
  • ⚠️ File size limits (not clearly visible)
  • ⚠️ Path traversal protection
  • ⚠️ Malicious file handling (PDF exploits, image bombs)

Recommendation: Add explicit file size limits and validate file paths.

2. Durable Execution State Serialization

Checkpoint data includes task state, messages, and potentially sensitive data:

  • Is checkpoint data encrypted at rest?
  • Are there retention policies?
  • Who can access checkpoint files?

Recommendation: Document security considerations for production use.


📝 Documentation & API Design

1. Missing Migration Guide

This PR introduces breaking changes:

  • Pipeline step order changed
  • New continuation API
  • Agent policy behavior changed

Recommendation: Add a migration guide for users upgrading from previous versions.

2. API Consistency

The continuation APIs are well-designed:

agent.continue_async(task, ...)  # Continue from paused external execution
agent.continue_durable(execution_id, ...)  # Continue from checkpoint

However, consider:

  • Should there be a unified continue() method that detects the scenario?
  • Is continue_durable_async vs continue_async naming clear enough?

3. GraphV2 Documentation

The GraphV2 system is completely new but lacks:

  • README or docs explaining the architecture
  • Examples of usage
  • Migration path from existing graph system

🚀 Performance Considerations

1. Checkpoint Frequency

Location: src/upsonic/agent/pipeline/manager.py:210

if result.status == StepStatus.SUCCESS:
    await self._save_durable_checkpoint(context, step, ...)

Issue: Checkpoints saved after EVERY successful step. For pipelines with many steps:

  • Could cause significant I/O overhead
  • Slow down execution
  • Fill up storage quickly

Recommendation:

  • Add configurable checkpoint frequency
  • Only checkpoint at critical steps
  • Implement checkpoint compression

2. OCR Provider Initialization

Multiple OCR providers are loaded but may not all be used. Consider lazy-loading providers only when requested.


📋 Recommendations Summary

Before Merging:

  1. Add comprehensive test coverage (CRITICAL)
  2. Fix the enumerate[Step] type syntax error (CRITICAL)
  3. Document pipeline step order change rationale (HIGH)
  4. Add security documentation for durable execution (HIGH)
  5. Consider checkpoint performance implications (MEDIUM)

For Future PRs:

  1. Break large features into smaller, focused PRs
  2. Include tests with each feature
  3. Add migration guides for breaking changes
  4. Document security considerations
  5. Consider performance implications of frequent checkpointing

🎬 Conclusion

This PR adds valuable functionality (durable execution, external tool handling, OCR) but needs work before merging:

Blockers:

  • ❌ Missing test coverage for major features
  • ❌ Type syntax error in pipeline manager
  • ⚠️ Undocumented breaking change (pipeline step order)

Overall Assessment: The architecture and design are solid, but the PR is too large, lacks tests, and introduces potential breaking changes without documentation. I recommend:

  1. Split into smaller PRs if possible
  2. Add comprehensive tests
  3. Document breaking changes
  4. Address security considerations

Recommendation: Request Changes - mainly for test coverage and fixing the syntax error.


Let me know if you'd like me to dive deeper into any specific area or help write tests for the new features!

@onuratakan onuratakan merged commit 05fc31f into master Nov 1, 2025
4 checks passed
@onuratakan onuratakan deleted the external-tool-fix branch November 1, 2025 09:45
@sentry
Copy link
Copy Markdown

sentry bot commented Nov 2, 2025

Issues attributed to commits in this pull request

This pull request was merged and Sentry observed the following issues:

DoganK01 pushed a commit that referenced this pull request Jan 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants