
feat: add comprehensive benchmark system for performance analysis#511

Merged
onuratakan merged 4 commits into master from feat/benchmark-system
Feb 1, 2026

Conversation

@IremOztimur
Collaborator

Improvements:

  • Improved Performance Measurement Infrastructure: Added a comprehensive benchmark system with shared utilities (MemoryProfiler, PerformanceProfiler, BenchmarkReporter) for consistent memory, performance, and cost profiling across all future benchmark projects, enabling data-driven optimization decisions.

  • Improved Framework Comparison Capabilities: Implemented three-way overhead analysis benchmark comparing Direct LLM Call (minimal overhead) vs Agent (no prompt) vs Agent (with prompt), providing detailed metrics including mean, median, stdev, min, max time measurements, memory usage, cost tracking, and sample outputs for quality comparison.

  • Improved Developer Experience with Automation: Added Makefile with common commands (setup, run, clean) and comprehensive documentation (README.md, SETUP.md, QUICKSTART.md) for quick onboarding, plus support for multiple models with automatic validation and organized results in subdirectories by test case name.

  • Improved Testing Coverage: Added 5 diverse test cases (Simple Text Query, Structured Output, Math Problem, Text Analysis, Context-based Query) to measure performance across different use cases, with support for custom iterations and multi-model comparison for comprehensive analysis.


Full Changelog: master...IremOztimur:Upsonic:docs/benchmark-system

Changes:

- Add .gitignore for Python, venv, and result files
- Add __init__.py with project exports
- Add Makefile for benchmark automation and common tasks
- Add MemoryProfiler for object size and memory tracking
- Add PerformanceProfiler for timing and statistical analysis
- Add BenchmarkReporter for result generation (JSON, Markdown, console)
- Add data models: MemoryMetrics, PerformanceMetrics, CostMetrics, BenchmarkResult
- Support for three-way comparison with detailed tables and sample outputs
- Implement three-way comparison: Direct vs Agent (no prompt) vs Agent (with prompt)
- Measure memory usage (object size in bytes)
- Measure performance (mean, median, stdev, min, max)
- Track cost metrics (per iteration and total, including tokens)
- Capture sample outputs for quality comparison
- Add 5 test cases: Simple Text Query, Structured Output, Math Problem, Text Analysis, Context-based Query
- Support multiple models with automatic validation
- Generate detailed comparison tables and sample output reports
- Organize results into subdirectories by test case name
- Add README.md with project overview and available benchmarks
- Add SETUP.md with detailed installation and troubleshooting guide
- Add QUICKSTART.md with fast-track setup instructions (Turkish)
- Document new result format with comparison tables and sample outputs
- Include Makefile command reference and usage examples
- Add environment setup and API key configuration guides
@claude

claude bot commented Jan 28, 2026

Pull Request Review: Comprehensive Benchmark System

Overall Assessment

This PR adds a well-structured benchmark system to the Upsonic framework. The implementation is comprehensive and provides valuable performance insights. However, there are several areas that need attention before merging.

✅ Strengths

  1. Excellent Architecture: The separation of concerns is well done with utils.py providing reusable profiling tools and benchmark.py handling the orchestration.

  2. Comprehensive Metrics: The benchmark captures memory, performance (with statistics), cost, and sample outputs - providing a complete picture of overhead.

  3. Three-way Comparison: Comparing Direct, Agent (no prompt), and Agent (with prompt) is insightful and helps users make informed decisions.

  4. Developer Experience: The Makefile automation and detailed documentation (README, SETUP, QUICKSTART) are excellent additions.

  5. Multi-model Support: Ability to compare across different models with automatic validation is well implemented.

  6. Organized Results: Subdirectory organization by test case name keeps results manageable.

🔴 Critical Issues

1. Security: Hardcoded Model Name (benchmark.py:109, 229, 363)

The default model gpt-5-mini-2025-08-07 is hardcoded throughout. GPT-5 doesn't exist yet (latest is GPT-4). This will cause all benchmarks to fail by default.

Fix:

# Change default to an actual model
model: str = "gpt-4o-mini"  # or "openai/gpt-4o-mini"

2. Missing Error Handling in Cost Calculation (utils.py:201)

Division by zero vulnerability when total_tokens == 0:

cost_per_1k = (total_cost / (total_tokens / 1000)) if total_tokens > 0 else 0.0

The zero check exists, but it silently reports a cost-per-1k of 0.0 whenever total_tokens == 0 even if total_cost > 0 — a state that shouldn't occur but is never validated, so a real accounting bug would be masked rather than surfaced.

3. Type Safety Issues (utils.py:91)

Using the builtin-generic syntax tuple[float, float] without from __future__ import annotations will fail at import time on Python 3.8 (subscripting builtin types requires 3.9+). The project should either:

  • Add from __future__ import annotations at the top
  • Use Tuple[float, float] from typing (already imported)

Current: def stop_tracking(self) -> tuple[float, float]:
Should be: def stop_tracking(self) -> Tuple[float, float]:

⚠️ Major Issues

4. Missing Async Support

The benchmark script imports asyncio (benchmark.py:17) but never uses it. If the framework supports async operations, the benchmarks should measure them too.

5. No Test Coverage

A benchmark system should have its own tests to ensure:

  • Profilers work correctly
  • Report generation produces valid JSON/Markdown
  • Error handling works as expected
  • Statistical calculations are accurate

Recommendation: Add tests/benchmarks/ with unit tests for the utilities.

6. Pympler as Optional Dependency (utils.py:116-120)

The fallback to shallow size when pympler isn't available significantly affects measurement accuracy. This should be a required dependency or at least warn users loudly.

Fix:

try:
    from pympler import asizeof
    deep_size = asizeof.asizeof(obj)
except ImportError:
    import warnings
    warnings.warn(
        "pympler not installed. Deep size calculations will be inaccurate. "
        "Install with: pip install pympler"
    )
    deep_size = shallow_size

7. Warmup Strategy Issues (benchmark.py:166-172)

Warmup failures immediately abort the benchmark with a generic error. Consider:

  • Retrying once for transient network issues
  • Providing more specific error messages
  • Gracefully degrading to skip that specific test case

🟡 Code Quality Issues

8. Duplicate Code (benchmark.py:108-226, 228-359)

The benchmark_direct and benchmark_agent functions share 80%+ identical code. Consider extracting common logic:

def _benchmark_executor(
    executor: Union[Direct, Agent],
    test_case: Dict[str, Any],
    iterations: int,
    name: str,
    metadata: Dict[str, Any]
) -> BenchmarkResult:
    # Common benchmarking logic
    pass

def benchmark_direct(...):
    direct = Direct(model=model)
    return _benchmark_executor(direct, test_case, iterations, "Direct", {...})

9. Magic Numbers

  • utils.py:152: Warmup default is 1 - should be configurable or at least documented why 1 is chosen
  • benchmark.py:109: Default iterations 5 - document why 5 is sufficient for statistical validity

10. Inconsistent Error Messages (benchmark.py:479-493)

The validation help message shows gpt-5-mini-2025-08-07 as an example, which doesn't exist.

11. Long Functions

  • benchmark.py:run_benchmark (96 lines) - too complex, should be split
  • utils.py:BenchmarkReporter.save_markdown_report (188 lines) - needs refactoring

🟢 Minor Issues

12. Documentation Issues

  • Makefile:68: Installing with uv pip install -e .. assumes the parent directory is the Upsonic package, which is correct but should be documented
  • README.md:230: Dead links to docs.upsonic.ai (returns 404)
  • QUICKSTART.md: Written in Turkish, which is inconsistent with the rest of the English documentation

13. Makefile Portability

  • Uses bash-specific syntax ([[, source) that won't work on all systems
  • Color codes might not render correctly on Windows
  • Consider adding a note about requiring bash or providing Windows alternatives

14. Result File Organization

Results are organized by test case name but not by model. With multi-model support, this could get confusing. Consider:

results/
  Simple_Text_Query/
    gpt-4o-mini/
      2024_...json
    claude-3-5-haiku/
      2024_...json

15. Missing Gitignore Entry

The .gitignore excludes */results/* but the path is actually benchmarks/overhead_analysis/results/. Verify this works correctly.

📊 Performance Considerations

16. Memory Profiling Overhead

Using tracemalloc adds overhead to the benchmarks themselves. Consider:

  • Documenting this limitation
  • Optionally disabling memory profiling for pure speed tests
  • Running separate passes for memory vs speed

17. Statistical Validity

Default 5 iterations may not be enough for reliable statistics, especially with network-dependent operations. Consider:

  • Documenting recommended minimum (10-20 iterations)
  • Warning users if iterations < 10
  • Calculating confidence intervals
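The confidence-interval suggestion can be sketched with only the standard library; the function name, and the assumption that per-iteration timings are available as a list of floats, are illustrative rather than part of the PR:

```python
import math
import statistics
from typing import List, Tuple


def confidence_interval_95(times: List[float]) -> Tuple[float, float]:
    """Approximate 95% CI for the mean latency.

    Uses a normal approximation (z = 1.96); for the small samples a
    benchmark typically collects (n < 30), a t-distribution multiplier
    would be slightly more conservative.
    """
    if len(times) < 2:
        raise ValueError("need at least 2 samples for a confidence interval")
    mean = statistics.mean(times)
    sem = statistics.stdev(times) / math.sqrt(len(times))  # standard error
    margin = 1.96 * sem
    return (mean - margin, mean + margin)
```

Reporting the interval alongside mean/median would make it obvious when two configurations are statistically indistinguishable at 5 iterations.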

18. Cost Tracking Accuracy

Cost metrics depend on task.total_cost being set correctly. If the Task class doesn't populate these fields, the benchmarks will show 0 cost. Add validation:

import warnings

if total_cost == 0 and iterations > 0:
    warnings.warn("Cost tracking returned $0. Verify API is returning token counts.")

🔒 Security Concerns

19. Environment Variable Handling

The benchmark disables telemetry with os.environ["UPSONIC_TELEMETRY"] = "False" (benchmark.py:26). This is fine, but:

  • Document why telemetry is disabled in benchmarks
  • Consider making it configurable
  • Ensure this doesn't affect production behavior

20. API Key Exposure Risk

While the .env file is gitignored, the error messages could potentially leak key prefixes. Ensure error messages never include actual keys.

🧪 Test Coverage Recommendations

Priority areas for adding tests:

  1. utils.py:

    • Test MemoryProfiler with known object sizes
    • Test PerformanceProfiler statistical calculations
    • Test BenchmarkReporter JSON/Markdown generation
    • Test error cases (empty results, missing fields)
  2. test_cases.py:

    • Validate all test cases return correct structure
    • Test get_test_case_by_name error handling
  3. benchmark.py:

    • Test model validation logic
    • Test result aggregation
    • Test error handling during benchmark runs

📝 Documentation Improvements Needed

  1. Add docstrings to Makefile targets explaining what each does
  2. Document the statistical methodology (why mean/median/stdev are all included)
  3. Add examples of interpreting benchmark results
  4. Document system requirements (minimum Python version, dependencies)
  5. Add troubleshooting section for common errors

🎯 Recommendations for Merging

Must Fix Before Merge:

  1. ✅ Change default model from gpt-5-mini-2025-08-07 to valid model
  2. ✅ Fix type hints for Python 3.8+ compatibility
  3. ✅ Add pympler as required dependency or improve fallback handling
  4. ✅ Fix documentation links
  5. ✅ Remove or translate Turkish documentation to English

Should Fix Before Merge:

  1. Add basic test coverage for utils.py
  2. Extract duplicate code in benchmark functions
  3. Improve error messages and retry logic
  4. Document statistical methodology and limitations

Nice to Have:

  1. Add async benchmark support
  2. Improve result organization for multi-model comparisons
  3. Add confidence interval calculations
  4. Create Windows-compatible scripts

📊 Final Score

Code Quality: 7/10
Architecture: 9/10
Documentation: 8/10
Test Coverage: 2/10
Security: 8/10

Overall: 7/10 - Good foundation but needs critical fixes and test coverage before merge.


Recommendation: Request changes - the hardcoded invalid model name and type compatibility issues are blockers. Once fixed, this will be a valuable addition to the framework.

@claude

claude bot commented Jan 30, 2026

Pull Request Review: PR #511 - Benchmark System

Thank you for this comprehensive benchmark infrastructure! I've conducted a thorough review covering code quality, potential bugs, performance, security, and test coverage.

📊 Summary Scorecard

Code Quality: 6/10 | Type Safety: 5/10 | Error Handling: 4/10 | Performance: 7/10 | Security: 8/10 | Test Coverage: 0/10 | Overall: 5.75/10


🔴 CRITICAL ISSUES - Must Fix Before Merge

1. Python 3.8 Type Compatibility Bug (utils.py:97, 162)

  • Uses Python 3.9+ syntax tuple[float, float] instead of typing.Tuple[float, float]
  • Fix: Import Tuple from typing and use Tuple[float, float]

2. Invalid Default Model Name (benchmark.py:109, 229, 363, 618)

  • References non-existent model gpt-5-mini-2025-08-07
  • Fix: Change to gpt-4o-mini or openai/gpt-4o-mini

3. Race Condition in Memory Profiling (utils.py:93-108)

  • No check if tracemalloc is already running
  • Add tracemalloc.is_tracing() check before start/stop
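The state check could look like the following; the class and method names mirror the profiler described in this review but are assumptions about the actual API:

```python
import tracemalloc


class MemoryProfiler:
    """Sketch of state-safe tracemalloc handling."""

    def start_tracking(self) -> None:
        # Avoid clobbering a trace started elsewhere in the process.
        self._owns_trace = not tracemalloc.is_tracing()
        if self._owns_trace:
            tracemalloc.start()

    def stop_tracking(self):
        current, peak = tracemalloc.get_traced_memory()
        # Only stop the trace if this profiler started it.
        if self._owns_trace:
            tracemalloc.stop()
        return current, peak
```

This also makes nested profilers safe: an inner profiler no longer kills the trace an outer one is still relying on.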

4. Unsafe Division (utils.py:376-378)

  • No validation that iterations > 0 before division
  • Add validation to prevent ZeroDivisionError

⚠️ MAJOR ISSUES - Should Fix

5. Zero Test Coverage

  • No tests for 1,633 lines of new code
  • Need tests for MemoryProfiler, PerformanceProfiler, BenchmarkReporter

6. Massive Code Duplication (benchmark.py:108-359)

  • ~200 lines of nearly identical code between benchmark_direct() and benchmark_agent()
  • Extract common logic into shared helper function

7. Silent Fallback (utils.py:121-126)

  • Silently falls back to shallow size when pympler missing
  • Add warning to inform users of inaccurate measurements

8. Insufficient Statistical Rigor

  • Default iterations=5 is too low for network operations
  • Increase to 10 minimum, add warning for low iteration counts

9. No Retry Logic (benchmark.py:166-172)

  • Immediately aborts on any error including transient failures
  • Add exponential backoff retry logic (3 attempts recommended)
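The backoff suggestion could be sketched as below; run_with_retry and its defaults are illustrative, not part of the PR:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(fn: Callable[[], T], attempts: int = 3,
                   base_delay: float = 1.0) -> T:
    """Call fn(), retrying failures with exponential backoff."""
    last_exc: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # narrow this to network/API errors in real use
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    assert last_exc is not None
    raise last_exc
```

In the warmup path this would distinguish a transient network hiccup (recoverable) from a misconfigured API key (fails all three attempts and surfaces the original exception).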

🟡 CODE QUALITY ISSUES

10. Long Functions - save_markdown_report() is 188 lines
11. Magic Numbers - Use named constants for warmup=1, iterations=5, bytes conversion
12. Weak Type Hints - Many Dict[str, Any] should use TypedDict/dataclass
13. Unused Import - asyncio imported but never used in benchmark.py


🔒 SECURITY CONCERNS

14. Path Traversal Risk (utils.py:501-509)

  • test_name could contain ../../ for directory traversal
  • Add filename sanitization
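One way to implement the sanitization; safe_result_dir is a hypothetical helper, not the PR's actual function:

```python
import re
from pathlib import Path


def safe_result_dir(base: Path, test_name: str) -> Path:
    """Map an arbitrary test-case name to a directory strictly under base."""
    # Collapse anything outside [A-Za-z0-9_-] so '../' cannot survive.
    cleaned = re.sub(r"[^\w-]", "_", test_name) or "unnamed"
    target = (base / cleaned).resolve()
    # Defense in depth: confirm the resolved path is still inside base.
    if base.resolve() not in target.parents:
        raise ValueError(f"unsafe test name: {test_name!r}")
    return target
```

The regex pass is what actually neutralizes traversal; the parents check is a second line of defense in case the cleaning rule is ever loosened.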

15. No Input Validation (benchmark.py:363-386)

  • No validation of iterations, model, test_case_name

⚡ PERFORMANCE NOTES

16. Memory Profiling Overhead - tracemalloc adds 10-30% overhead, document this
17. No Async Support - Consider adding async benchmark variants


✅ RECOMMENDATIONS FOR MERGE

🔴 MUST FIX (Blocking):

  1. Fix Python 3.8 type hints (tuple → Tuple)
  2. Fix default model name (not GPT-5)
  3. Add tracemalloc state checking
  4. Add basic error handling for file I/O

🟡 SHOULD FIX (Strongly Recommended):

  5. Add pympler fallback warning
  6. Extract duplicate code
  7. Add minimum test coverage
  8. Increase default iterations to 10
  9. Add retry logic for errors

🟢 NICE TO HAVE:

  10. Break up long functions
  11. Add async support
  12. Extract magic numbers
  13. Add path sanitization


🎯 FINAL VERDICT

RECOMMENDATION: REQUEST CHANGES ⚠️

Critical issues that block merging:

  • Python 3.8 compatibility bug
  • Invalid default model breaks all runs
  • Zero test coverage for 1,633 lines
  • Significant code duplication

Estimated effort to fix: 10-18 hours (2-4 for critical, 4-8 for tests, 4-6 for quality)

Great work on the documentation and Makefile! The design is solid, just needs fixes before production-ready.

@claude

claude bot commented Jan 31, 2026

Code Review: PR #511 - Comprehensive Benchmark System

Summary

This PR introduces a well-architected benchmark system for performance analysis. The code demonstrates good separation of concerns and comprehensive metrics collection. However, there are critical issues that must be fixed before merging.

Overall Assessment: APPROVE with changes requested


Critical Issues (Must Fix)

1. Division by Zero Vulnerabilities in benchmarks/utils.py

Location: Lines 772-773, 786-787

Problem: If the denominator is 0, this will raise ZeroDivisionError and crash the benchmark.

Fix: Add guards before division for both memory ratio and cost ratio calculations.
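A guard along the lines requested — the helper name and the ratio call sites are illustrative:

```python
def safe_ratio(numerator: float, denominator: float) -> float:
    """numerator / denominator, or 0.0 when the baseline is zero.

    A zero denominator (e.g. cost tracking returned $0, or a 0-byte
    memory baseline) would otherwise raise ZeroDivisionError and crash
    the run while the report is being written.
    """
    return numerator / denominator if denominator else 0.0


# At the flagged call sites, roughly:
# memory_overhead = safe_ratio(agent_memory, direct_memory)
# cost_overhead = safe_ratio(agent_cost, direct_cost)
```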

2. Unused Import in benchmarks/overhead_analysis/benchmark.py

Location: Line 1482

Fix: Remove the unused asyncio import statement.

3. Overly Broad Exception Handling in benchmark.py

Location: Lines 1554-1564

Problem: A bare except: (or except BaseException) masks process-control exceptions such as KeyboardInterrupt and SystemExit, and even except Exception lumps unexpected programming errors in with transient API failures.

Fix: Allow system exceptions to propagate by catching them first and re-raising.
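The re-raise pattern could be sketched as follows; run_iteration is a hypothetical wrapper around one benchmark iteration:

```python
def run_iteration(fn):
    """Run one iteration, never swallowing process-control exceptions."""
    try:
        return fn()
    except (KeyboardInterrupt, SystemExit):
        raise  # let Ctrl-C and explicit exits propagate immediately
    except Exception as exc:
        # Record the failure and let the caller decide whether to continue.
        return exc
```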


High Priority Recommendations

4. Code Duplication

Lines 1587-1690 vs 1693-1823 in benchmark.py have ~150 lines of nearly identical code. Extract common logic into a shared helper function.

5. Missing Unit Tests

No test files for the benchmark utilities. Add benchmarks/tests/ directory with tests for division by zero scenarios, memory profiler edge cases, performance calculations, and report generation.


Medium Priority Suggestions

  • Add warning when pympler is unavailable (utils.py:430-435)
  • Replace print statements with logging module throughout
  • Fix text formatting in test_cases.py using textwrap.dedent()

Security Notes

  • Environment variable modification (benchmark.py:1491) could affect other parts of the system
  • API key handling looks good with truncated error messages
  • File system operations are safe

What's Done Well

  1. Excellent architecture with separation of concerns
  2. Comprehensive type annotations
  3. User-friendly error messages
  4. Well-designed CLI interface
  5. Comprehensive metrics collection
  6. Both JSON and Markdown report formats

Action Items

Before Merge:

  1. Fix division by zero in utils.py (lines 772-773, 786-787)
  2. Remove unused asyncio import in benchmark.py (line 1482)
  3. Fix overly broad exception handling in benchmark.py (lines 1554-1564)

Strongly Recommended:

  4. Refactor duplicated benchmark code
  5. Add unit tests

Great work on this benchmark system! Once the critical issues are addressed, this will be a valuable addition to Upsonic.

@onuratakan onuratakan merged commit 78eada8 into master Feb 1, 2026
5 checks passed
@onuratakan onuratakan deleted the feat/benchmark-system branch February 1, 2026 22:55