refactor: new smoke tests and bug fix by DoganK01 · Pull Request #492 · Upsonic/Upsonic

DoganK01 · 2025-12-07T23:04:59Z

Fixed MCP tool handling, builtin tool handling and remove/add tool logics properly
image output handling fixed
fix Graph class runtime
fix evaluation class deepcopy problem
add printing for reflection logic

claude · 2025-12-07T23:08:26Z

Pull Request Review: refactor: new smoke tests and bug fix

Summary

This PR includes significant improvements to tool management, image handling, reflection logging, and adds comprehensive smoke tests. The changes demonstrate good attention to detail in fixing several critical bugs and improving code quality.

Code Quality & Best Practices

✅ Strengths

Improved Tool Management Architecture (src/upsonic/agent/agent.py)
- Excellent separation of builtin tools from regular tools throughout the codebase
- Proper handling of tool registration/removal for both agent-level and task-level tools
- Consistent pattern applied across _register_agent_tools(), add_tools(), and remove_tools()
Enhanced Safety Policy Integration
- Good distinction between RAISE and BLOCK action types in tool validation (agent.py:556-583)
- Proper exception propagation for RAISE actions while gracefully handling BLOCK actions
- Debug logging for blocked tools improves observability
Image Output Handling (src/upsonic/agent/agent.py:1797-1811, src/upsonic/direct.py:216-232)
- Smart handling of single vs. multiple images (returns bytes vs. list of bytes)
- Consistent implementation across both Agent and Direct classes
- Good separation of concerns (check images first, then text)
Reflection Logging (src/upsonic/reflection/processor.py)
- Well-structured logging for reflection workflow
- Good use of the new printing utilities for better UX
- Clear separation of concerns (evaluation, improvement, completion)
Comprehensive Smoke Tests
- Excellent coverage of tool management scenarios in test_tool_management.py
- Good test organization with clear success criteria documented at the top
- Proper use of fixtures and async/await patterns

Potential Issues & Concerns

🔴 Critical Issues

Agent Deepcopy Removal May Break Stateful Tests (src/upsonic/eval/performance.py:68-86)
```
# Don't deepcopy agent_under_test - it contains unpicklable objects like httpx.AsyncClient
# Each task execution should be stateless anyway
```
Issue: While the comment claims "each task execution should be stateless," this assumption may not hold true:
- Agents can have stateful attributes (memory, context, etc.)
- Multiple iterations might contaminate state between runs
- Performance measurements could be affected by state accumulation
Recommendation: Consider one of these alternatives:
- Implement a proper reset() or clear_state() method on the Agent class
- Create a lightweight copy mechanism that preserves necessary state but resets execution-specific attributes
- Document this limitation clearly if intentional
Location: src/upsonic/eval/performance.py:68, 76

Potential Tool Removal Side Effects (src/upsonic/agent/agent.py:573-583)

for registered_tools_dict in registered_tools_dicts:
    self.tool_manager.remove_tools(
        tools=[tool_def.name],
        registered_tools=registered_tools_dict
    )
    
    # Also remove from the tracking dict to keep it clean
    if tool_def.name in registered_tools_dict:
        del registered_tools_dict[tool_def.name]

Issue:

Removing from all dicts in the list without checking which one actually contains the tool
Could attempt removal from empty dicts unnecessarily
The remove_tools() call might fail silently or log unnecessary warnings

Recommendation: Add existence check before removal:

for registered_tools_dict in registered_tools_dicts:
    if tool_def.name in registered_tools_dict:
        self.tool_manager.remove_tools(
            tools=[tool_def.name],
            registered_tools=registered_tools_dict
        )
        del registered_tools_dict[tool_def.name]
        break  # Tool found and removed, no need to check other dicts

Location: src/upsonic/agent/agent.py:575-583

⚠️ Medium Priority Issues

Graph Node Type Check After Pop (src/upsonic/graph/graph.py:668-672)
```
while execution_queue:
    node = execution_queue.pop(0)
    if isinstance(node, TaskNode):
        node.task.context = []
```
Issue: node.task.context = [] is now only executed for TaskNode instances, but it was previously executed for all nodes. This could be intentional, but:
- The context attribute might need to be reset for all node types
- If other node types have a task attribute, they won't be reset
Recommendation: Verify this is intentional and document why context clearing is TaskNode-specific.

Location: src/upsonic/graph/graph.py:671

Missing Error Handling in Guardrail Structured Output Parsing (src/upsonic/agent/agent.py:1383-1393)

try:
    import json
    parsed = json.loads(final_text_output)
    if hasattr(task.response_format, 'model_validate'):
        guardrail_input = task.response_format.model_validate(parsed)
except:
    # If parsing fails, use the text output
    guardrail_input = final_text_output

Issue: Bare except: clause catches all exceptions, including KeyboardInterrupt and SystemExit

Recommendation: Use specific exception handling:

except (json.JSONDecodeError, ValidationError, AttributeError) as e:
    # Log the parsing failure for debugging
    if self.debug:
        debug_log(f"Failed to parse structured output for guardrail: {e}")
    guardrail_input = final_text_output

Location: src/upsonic/agent/agent.py:1392

Reflection Config Auto-Creation (src/upsonic/agent/agent.py:319-323)
```
# Handle reflection configuration
if reflection and not reflection_config:
    # Create default reflection config if reflection=True but no config provided
    from upsonic.reflection import ReflectionConfig
    reflection_config = ReflectionConfig()
```
Issue: While convenient, auto-creating config with defaults might lead to unexpected behavior if users don't realize what settings are being used.

Recommendation: Consider logging a debug message or warning when auto-creating the config to improve visibility.

Location: src/upsonic/agent/agent.py:320-323

ℹ️ Minor Issues / Suggestions

Code Duplication in Tool Separation Logic

The pattern for separating builtin from regular tools is repeated multiple times
Consider extracting to a helper method:

def _separate_builtin_and_regular_tools(self, tools: List[Any]) -> Tuple[List[Any], List[Any]]:
    """Separate builtin tools from regular tools."""
    from upsonic.tools.builtin_tools import AbstractBuiltinTool
    builtin_tools = []
    regular_tools = []
    for tool in tools:
        if tool is not None and isinstance(tool, AbstractBuiltinTool):
            builtin_tools.append(tool)
        else:
            regular_tools.append(tool)
    return builtin_tools, regular_tools

Locations: agent.py:614-622, 674-682, 730-738, 832-840

Turkish Comment in printing.py
```
setup_logging(enable_console=False)  # Console kapalı, Rich kullanıyoruz
```
Recommendation: Use English for all code comments for international collaboration.

Location: src/upsonic/utils/printing.py:16

Performance Considerations

Tool Removal in Validation Loop - The policy validation now removes blocked tools in a loop. For large tool sets, this could be slow. Consider batching removals.
Multiple Dict Iterations - The tool removal logic iterates through registered_tools_dicts for each blocked tool. For better performance, build a removal list first, then process once.
Reflection Overhead - The new reflection logging adds function call overhead. Ensure it's only active when needed and doesn't impact performance significantly.

Security Concerns

✅ Security Improvements

Enhanced Safety Policy Integration - The distinction between RAISE and BLOCK actions provides better security control granularity.
Tool Validation Before Execution - Pre-execution validation with tool removal prevents blocked tools from being used.

⚠️ Security Considerations

Tool Removal Race Conditions - If tools are being added/removed concurrently, the validation-then-remove pattern could have race conditions. Consider adding locks if concurrent access is possible.
Bare Exception Handling - The bare except: in guardrail parsing (mentioned above) could mask security-relevant exceptions.

Test Coverage

✅ Excellent Coverage

Comprehensive Smoke Tests - 42 new test files covering:
- Tool management (all variants)
- Image I/O
- Storage systems
- Safety policies
- Team/multi-agent scenarios
- Knowledge base integration
- Reflection
- Structured outputs
- And many more scenarios
Well-Structured Tests
- Clear success criteria documented
- Good use of fixtures
- Proper async/await patterns
- Timeout decorators to prevent hanging

📝 Test Quality Notes

Timeout Configuration - Most tests use 120-180 second timeouts. This is reasonable for integration tests but:
- Consider if any can be optimized to run faster
- Document why specific timeouts are chosen for different test categories
Docker-based Testing - The DOCKER_SETUP.md provides good documentation for running tests with external dependencies.

Recommendations Summary

High Priority

✅ Address the agent deepcopy removal issue in performance.py - implement proper state management
✅ Fix the tool removal logic to check existence before attempting removal
✅ Replace bare except: with specific exception types in guardrail parsing

Medium Priority

✅ Verify the Graph node context clearing behavior is intentional
✅ Add logging when auto-creating ReflectionConfig
✅ Extract tool separation logic to reduce duplication

Low Priority

✅ Translate Turkish comment to English
✅ Consider performance optimizations for tool validation loops

Overall Assessment

Score: 8.5/10

This is a strong PR with significant improvements to the codebase:

✅ Excellent test coverage with comprehensive smoke tests
✅ Well-architected tool management refactoring
✅ Good safety policy integration
✅ Improved image handling
✅ Better observability through reflection logging

The critical issues identified are fixable and don't fundamentally compromise the improvements. The agent deepcopy removal is the most concerning change and should be carefully reviewed or reverted if state isolation is important for performance testing.

Recommendation: Approve with requested changes - Address the critical issues, particularly the performance evaluation state management, before merging.

Great work on the comprehensive testing and thoughtful refactoring! 🚀

refactor: new smoke tests and bug fix

refactor: new smoke tests and bug fix

755b18b

DoganK01 merged commit f586dcf into master Dec 7, 2025
4 checks passed

DoganK01 deleted the smoke-tests branch December 7, 2025 23:12

DoganK01 added a commit that referenced this pull request Jan 5, 2026

Merge pull request #492 from Upsonic/smoke-tests

add3f85

refactor: new smoke tests and bug fix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: new smoke tests and bug fix#492

refactor: new smoke tests and bug fix#492
DoganK01 merged 1 commit intomasterfrom
smoke-tests

DoganK01 commented Dec 7, 2025

Uh oh!

claude bot commented Dec 7, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

DoganK01 commented Dec 7, 2025

Uh oh!

claude bot commented Dec 7, 2025

Pull Request Review: refactor: new smoke tests and bug fix

Summary

Code Quality & Best Practices

✅ Strengths

Potential Issues & Concerns

🔴 Critical Issues

⚠️ Medium Priority Issues

ℹ️ Minor Issues / Suggestions

Performance Considerations

Security Concerns

✅ Security Improvements

⚠️ Security Considerations

Test Coverage

✅ Excellent Coverage

📝 Test Quality Notes

Recommendations Summary

High Priority

Medium Priority

Low Priority

Overall Assessment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant