Skip to content

refactor: new smoke tests and bug fix#492

Merged
DoganK01 merged 1 commit intomasterfrom
smoke-tests
Dec 7, 2025
Merged

refactor: new smoke tests and bug fix#492
DoganK01 merged 1 commit intomasterfrom
smoke-tests

Conversation

@DoganK01
Copy link
Copy Markdown
Member

@DoganK01 DoganK01 commented Dec 7, 2025

  • Fixed MCP tool handling, builtin tool handling and remove/add tool logics properly
  • image output handling fixed
  • fix Graph class runtime
  • fix evaluation class deepcopy problem
  • add printing for reflection logic

@claude
Copy link
Copy Markdown

claude bot commented Dec 7, 2025

Pull Request Review: refactor: new smoke tests and bug fix

Summary

This PR includes significant improvements to tool management, image handling, reflection logging, and adds comprehensive smoke tests. The changes demonstrate good attention to detail in fixing several critical bugs and improving code quality.


Code Quality & Best Practices

✅ Strengths

  1. Improved Tool Management Architecture (src/upsonic/agent/agent.py)

    • Excellent separation of builtin tools from regular tools throughout the codebase
    • Proper handling of tool registration/removal for both agent-level and task-level tools
    • Consistent pattern applied across _register_agent_tools(), add_tools(), and remove_tools()
  2. Enhanced Safety Policy Integration

    • Good distinction between RAISE and BLOCK action types in tool validation (agent.py:556-583)
    • Proper exception propagation for RAISE actions while gracefully handling BLOCK actions
    • Debug logging for blocked tools improves observability
  3. Image Output Handling (src/upsonic/agent/agent.py:1797-1811, src/upsonic/direct.py:216-232)

    • Smart handling of single vs. multiple images (returns bytes vs. list of bytes)
    • Consistent implementation across both Agent and Direct classes
    • Good separation of concerns (check images first, then text)
  4. Reflection Logging (src/upsonic/reflection/processor.py)

    • Well-structured logging for reflection workflow
    • Good use of the new printing utilities for better UX
    • Clear separation of concerns (evaluation, improvement, completion)
  5. Comprehensive Smoke Tests

    • Excellent coverage of tool management scenarios in test_tool_management.py
    • Good test organization with clear success criteria documented at the top
    • Proper use of fixtures and async/await patterns

Potential Issues & Concerns

🔴 Critical Issues

  1. Agent Deepcopy Removal May Break Stateful Tests (src/upsonic/eval/performance.py:68-86)

    # Don't deepcopy agent_under_test - it contains unpicklable objects like httpx.AsyncClient
    # Each task execution should be stateless anyway

    Issue: While the comment claims "each task execution should be stateless," this assumption may not hold true:

    • Agents can have stateful attributes (memory, context, etc.)
    • Multiple iterations might contaminate state between runs
    • Performance measurements could be affected by state accumulation

    Recommendation: Consider one of these alternatives:

    • Implement a proper reset() or clear_state() method on the Agent class
    • Create a lightweight copy mechanism that preserves necessary state but resets execution-specific attributes
    • Document this limitation clearly if intentional

    Location: src/upsonic/eval/performance.py:68, 76

  2. Potential Tool Removal Side Effects (src/upsonic/agent/agent.py:573-583)

    for registered_tools_dict in registered_tools_dicts:
        self.tool_manager.remove_tools(
            tools=[tool_def.name],
            registered_tools=registered_tools_dict
        )
        
        # Also remove from the tracking dict to keep it clean
        if tool_def.name in registered_tools_dict:
            del registered_tools_dict[tool_def.name]

    Issue:

    • Removing from all dicts in the list without checking which one actually contains the tool
    • Could attempt removal from empty dicts unnecessarily
    • The remove_tools() call might fail silently or log unnecessary warnings

    Recommendation: Add existence check before removal:

    for registered_tools_dict in registered_tools_dicts:
        if tool_def.name in registered_tools_dict:
            self.tool_manager.remove_tools(
                tools=[tool_def.name],
                registered_tools=registered_tools_dict
            )
            del registered_tools_dict[tool_def.name]
            break  # Tool found and removed, no need to check other dicts

    Location: src/upsonic/agent/agent.py:575-583

⚠️ Medium Priority Issues

  1. Graph Node Type Check After Pop (src/upsonic/graph/graph.py:668-672)

    while execution_queue:
        node = execution_queue.pop(0)
        if isinstance(node, TaskNode):
            node.task.context = []

    Issue: node.task.context = [] is now only executed for TaskNode instances, but it was previously executed for all nodes. This could be intentional, but:

    • The context attribute might need to be reset for all node types
    • If other node types have a task attribute, they won't be reset

    Recommendation: Verify this is intentional and document why context clearing is TaskNode-specific.

    Location: src/upsonic/graph/graph.py:671

  2. Missing Error Handling in Guardrail Structured Output Parsing (src/upsonic/agent/agent.py:1383-1393)

    try:
        import json
        parsed = json.loads(final_text_output)
        if hasattr(task.response_format, 'model_validate'):
            guardrail_input = task.response_format.model_validate(parsed)
    except:
        # If parsing fails, use the text output
        guardrail_input = final_text_output

    Issue: Bare except: clause catches all exceptions, including KeyboardInterrupt and SystemExit

    Recommendation: Use specific exception handling:

    except (json.JSONDecodeError, ValidationError, AttributeError) as e:
        # Log the parsing failure for debugging
        if self.debug:
            debug_log(f"Failed to parse structured output for guardrail: {e}")
        guardrail_input = final_text_output

    Location: src/upsonic/agent/agent.py:1392

  3. Reflection Config Auto-Creation (src/upsonic/agent/agent.py:319-323)

    # Handle reflection configuration
    if reflection and not reflection_config:
        # Create default reflection config if reflection=True but no config provided
        from upsonic.reflection import ReflectionConfig
        reflection_config = ReflectionConfig()

    Issue: While convenient, auto-creating config with defaults might lead to unexpected behavior if users don't realize what settings are being used.

    Recommendation: Consider logging a debug message or warning when auto-creating the config to improve visibility.

    Location: src/upsonic/agent/agent.py:320-323

ℹ️ Minor Issues / Suggestions

  1. Code Duplication in Tool Separation Logic

    • The pattern for separating builtin from regular tools is repeated multiple times
    • Consider extracting to a helper method:
    def _separate_builtin_and_regular_tools(self, tools: List[Any]) -> Tuple[List[Any], List[Any]]:
        """Separate builtin tools from regular tools."""
        from upsonic.tools.builtin_tools import AbstractBuiltinTool
        builtin_tools = []
        regular_tools = []
        for tool in tools:
            if tool is not None and isinstance(tool, AbstractBuiltinTool):
                builtin_tools.append(tool)
            else:
                regular_tools.append(tool)
        return builtin_tools, regular_tools

    Locations: agent.py:614-622, 674-682, 730-738, 832-840

  2. Turkish Comment in printing.py

    setup_logging(enable_console=False)  # Console kapalı, Rich kullanıyoruz

    Recommendation: Use English for all code comments for international collaboration.

    Location: src/upsonic/utils/printing.py:16


Performance Considerations

  1. Tool Removal in Validation Loop - The policy validation now removes blocked tools in a loop. For large tool sets, this could be slow. Consider batching removals.

  2. Multiple Dict Iterations - The tool removal logic iterates through registered_tools_dicts for each blocked tool. For better performance, build a removal list first, then process once.

  3. Reflection Overhead - The new reflection logging adds function call overhead. Ensure it's only active when needed and doesn't impact performance significantly.


Security Concerns

✅ Security Improvements

  1. Enhanced Safety Policy Integration - The distinction between RAISE and BLOCK actions provides better security control granularity.

  2. Tool Validation Before Execution - Pre-execution validation with tool removal prevents blocked tools from being used.

⚠️ Security Considerations

  1. Tool Removal Race Conditions - If tools are being added/removed concurrently, the validation-then-remove pattern could have race conditions. Consider adding locks if concurrent access is possible.

  2. Bare Exception Handling - The bare except: in guardrail parsing (mentioned above) could mask security-relevant exceptions.


Test Coverage

✅ Excellent Coverage

  1. Comprehensive Smoke Tests - 42 new test files covering:

    • Tool management (all variants)
    • Image I/O
    • Storage systems
    • Safety policies
    • Team/multi-agent scenarios
    • Knowledge base integration
    • Reflection
    • Structured outputs
    • And many more scenarios
  2. Well-Structured Tests

    • Clear success criteria documented
    • Good use of fixtures
    • Proper async/await patterns
    • Timeout decorators to prevent hanging

📝 Test Quality Notes

  1. Timeout Configuration - Most tests use 120-180 second timeouts. This is reasonable for integration tests but:

    • Consider if any can be optimized to run faster
    • Document why specific timeouts are chosen for different test categories
  2. Docker-based Testing - The DOCKER_SETUP.md provides good documentation for running tests with external dependencies.


Recommendations Summary

High Priority

  1. ✅ Address the agent deepcopy removal issue in performance.py - implement proper state management
  2. ✅ Fix the tool removal logic to check existence before attempting removal
  3. ✅ Replace bare except: with specific exception types in guardrail parsing

Medium Priority

  1. ✅ Verify the Graph node context clearing behavior is intentional
  2. ✅ Add logging when auto-creating ReflectionConfig
  3. ✅ Extract tool separation logic to reduce duplication

Low Priority

  1. ✅ Translate Turkish comment to English
  2. ✅ Consider performance optimizations for tool validation loops

Overall Assessment

Score: 8.5/10

This is a strong PR with significant improvements to the codebase:

  • ✅ Excellent test coverage with comprehensive smoke tests
  • ✅ Well-architected tool management refactoring
  • ✅ Good safety policy integration
  • ✅ Improved image handling
  • ✅ Better observability through reflection logging

The critical issues identified are fixable and don't fundamentally compromise the improvements. The agent deepcopy removal is the most concerning change and should be carefully reviewed or reverted if state isolation is important for performance testing.

Recommendation: Approve with requested changes - Address the critical issues, particularly the performance evaluation state management, before merging.

Great work on the comprehensive testing and thoughtful refactoring! 🚀

@DoganK01 DoganK01 merged commit f586dcf into master Dec 7, 2025
4 checks passed
@DoganK01 DoganK01 deleted the smoke-tests branch December 7, 2025 23:12
DoganK01 added a commit that referenced this pull request Jan 5, 2026
refactor: new smoke tests and bug fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant