
feat: add comprehensive benchmark system for performance analysis#511

Merged
onuratakan merged 4 commits into master from feat/benchmark-system
Feb 1, 2026

Conversation

@IremOztimur
Collaborator

Improvements:

  • Improved Performance Measurement Infrastructure: Added a comprehensive benchmark system with shared utilities (MemoryProfiler, PerformanceProfiler, BenchmarkReporter) for consistent memory, performance, and cost profiling across all future benchmark projects, enabling data-driven optimization decisions.

  • Improved Framework Comparison Capabilities: Implemented three-way overhead analysis benchmark comparing Direct LLM Call (minimal overhead) vs Agent (no prompt) vs Agent (with prompt), providing detailed metrics including mean, median, stdev, min, max time measurements, memory usage, cost tracking, and sample outputs for quality comparison.

  • Improved Developer Experience with Automation: Added Makefile with common commands (setup, run, clean) and comprehensive documentation (README.md, SETUP.md, QUICKSTART.md) for quick onboarding, plus support for multiple models with automatic validation and organized results in subdirectories by test case name.

  • Improved Testing Coverage: Added 5 diverse test cases (Simple Text Query, Structured Output, Math Problem, Text Analysis, Context-based Query) to measure performance across different use cases, with support for custom iterations and multi-model comparison for comprehensive analysis.


Full Changelog: master...IremOztimur:Upsonic:docs/benchmark-system

Changes:

- Add .gitignore for Python, venv, and result files
- Add __init__.py with project exports
- Add Makefile for benchmark automation and common tasks
- Add MemoryProfiler for object size and memory tracking
- Add PerformanceProfiler for timing and statistical analysis
- Add BenchmarkReporter for result generation (JSON, Markdown, console)
- Add data models: MemoryMetrics, PerformanceMetrics, CostMetrics, BenchmarkResult
- Support for three-way comparison with detailed tables and sample outputs
- Implement three-way comparison: Direct vs Agent (no prompt) vs Agent (with prompt)
- Measure memory usage (object size in bytes)
- Measure performance (mean, median, stdev, min, max)
- Track cost metrics (per iteration and total, including tokens)
- Capture sample outputs for quality comparison
- Add 5 test cases: Simple Text Query, Structured Output, Math Problem, Text Analysis, Context-based Query
- Support multiple models with automatic validation
- Generate detailed comparison tables and sample output reports
- Organize results into subdirectories by test case name
- Add README.md with project overview and available benchmarks
- Add SETUP.md with detailed installation and troubleshooting guide
- Add QUICKSTART.md with fast-track setup instructions (Turkish)
- Document new result format with comparison tables and sample outputs
- Include Makefile command reference and usage examples
- Add environment setup and API key configuration guides
@claude

claude bot commented Jan 28, 2026

Pull Request Review: Comprehensive Benchmark System

Overall Assessment

This PR adds a well-structured benchmark system to the Upsonic framework. The implementation is comprehensive and provides valuable performance insights. However, there are several areas that need attention before merging.

✅ Strengths

  1. Excellent Architecture: The separation of concerns is well done with utils.py providing reusable profiling tools and benchmark.py handling the orchestration.

  2. Comprehensive Metrics: The benchmark captures memory, performance (with statistics), cost, and sample outputs - providing a complete picture of overhead.

  3. Three-way Comparison: Comparing Direct, Agent (no prompt), and Agent (with prompt) is insightful and helps users make informed decisions.

  4. Developer Experience: The Makefile automation and detailed documentation (README, SETUP, QUICKSTART) are excellent additions.

  5. Multi-model Support: Ability to compare across different models with automatic validation is well implemented.

  6. Organized Results: Subdirectory organization by test case name keeps results manageable.

🔴 Critical Issues

1. Security: Hardcoded Model Name (benchmark.py:109, 229, 363)

The default model gpt-5-mini-2025-08-07 is hardcoded throughout. GPT-5 doesn't exist yet (latest is GPT-4). This will cause all benchmarks to fail by default.

Fix:

# Change default to an actual model
model: str = "gpt-4o-mini"  # or "openai/gpt-4o-mini"

2. Missing Error Handling in Cost Calculation (utils.py:201)

Division by zero vulnerability when total_tokens == 0:

cost_per_1k = (total_cost / (total_tokens / 1000)) if total_tokens > 0 else 0.0

The zero check exists, but it silently reports a cost-per-1k of 0.0 whenever total_tokens == 0 even if total_cost > 0 — a state that shouldn't occur but is never validated, so a real accounting bug would be masked rather than surfaced.

3. Type Safety Issues (utils.py:91)

Using the builtin-generic syntax tuple[float, float] without from __future__ import annotations will fail at import time on Python 3.8 (subscripting builtin types requires 3.9+). The project should either:

  • Add from __future__ import annotations at the top
  • Use Tuple[float, float] from typing (already imported)

Current: def stop_tracking(self) -> tuple[float, float]:
Should be: def stop_tracking(self) -> Tuple[float, float]:

⚠️ Major Issues

4. Missing Async Support

The benchmark script imports asyncio (benchmark.py:17) but never uses it. If the framework supports async operations, the benchmarks should measure them too.

5. No Test Coverage

A benchmark system should have its own tests to ensure:

  • Profilers work correctly
  • Report generation produces valid JSON/Markdown
  • Error handling works as expected
  • Statistical calculations are accurate

Recommendation: Add tests/benchmarks/ with unit tests for the utilities.

6. Pympler as Optional Dependency (utils.py:116-120)

The fallback to shallow size when pympler isn't available significantly affects measurement accuracy. This should be a required dependency or at least warn users loudly.

Fix:

try:
    from pympler import asizeof
    deep_size = asizeof.asizeof(obj)
except ImportError:
    import warnings
    warnings.warn(
        "pympler not installed. Deep size calculations will be inaccurate. "
        "Install with: pip install pympler"
    )
    deep_size = shallow_size

7. Warmup Strategy Issues (benchmark.py:166-172)

Warmup failures immediately abort the benchmark with a generic error. Consider:

  • Retrying once for transient network issues
  • Providing more specific error messages
  • Gracefully degrading to skip that specific test case

🟡 Code Quality Issues

8. Duplicate Code (benchmark.py:108-226, 228-359)

The benchmark_direct and benchmark_agent functions share 80%+ identical code. Consider extracting common logic:

def _benchmark_executor(
    executor: Union[Direct, Agent],
    test_case: Dict[str, Any],
    iterations: int,
    name: str,
    metadata: Dict[str, Any]
) -> BenchmarkResult:
    # Common benchmarking logic
    pass

def benchmark_direct(...):
    direct = Direct(model=model)
    return _benchmark_executor(direct, test_case, iterations, "Direct", {...})

9. Magic Numbers

  • utils.py:152: Warmup default is 1 - should be configurable or at least documented why 1 is chosen
  • benchmark.py:109: Default iterations 5 - document why 5 is sufficient for statistical validity

10. Inconsistent Error Messages (benchmark.py:479-493)

The validation help message shows gpt-5-mini-2025-08-07 as an example, which doesn't exist.

11. Long Functions

  • benchmark.py:run_benchmark (96 lines) - too complex, should be split
  • utils.py:BenchmarkReporter.save_markdown_report (188 lines) - needs refactoring

🟢 Minor Issues

12. Documentation Issues

  • Makefile:68: Installing with uv pip install -e .. assumes the parent directory is the Upsonic package, which is correct but should be documented
  • README.md:230: Dead links to docs.upsonic.ai (returns 404)
  • QUICKSTART.md: Written in Turkish, which is inconsistent with the rest of the English documentation

13. Makefile Portability

  • Uses bash-specific syntax ([[, source) that won't work on all systems
  • Color codes might not render correctly on Windows
  • Consider adding a note about requiring bash or providing Windows alternatives

14. Result File Organization

Results are organized by test case name but not by model. With multi-model support, this could get confusing. Consider:

results/
  Simple_Text_Query/
    gpt-4o-mini/
      2024_...json
    claude-3-5-haiku/
      2024_...json

15. Missing Gitignore Entry

The .gitignore excludes */results/* but the path is actually benchmarks/overhead_analysis/results/. Verify this works correctly.

📊 Performance Considerations

16. Memory Profiling Overhead

Using tracemalloc adds overhead to the benchmarks themselves. Consider:

  • Documenting this limitation
  • Optionally disabling memory profiling for pure speed tests
  • Running separate passes for memory vs speed

17. Statistical Validity

Default 5 iterations may not be enough for reliable statistics, especially with network-dependent operations. Consider:

  • Documenting recommended minimum (10-20 iterations)
  • Warning users if iterations < 10
  • Calculating confidence intervals
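The confidence-interval suggestion can be sketched with only the standard library; the function name, and the assumption that per-iteration timings are available as a list of floats, are illustrative rather than part of the PR:

```python
import math
import statistics
from typing import List, Tuple


def confidence_interval_95(times: List[float]) -> Tuple[float, float]:
    """Approximate 95% CI for the mean latency.

    Uses a normal approximation (z = 1.96); for the small samples a
    benchmark typically collects (n < 30), a t-distribution multiplier
    would be slightly more conservative.
    """
    if len(times) < 2:
        raise ValueError("need at least 2 samples for a confidence interval")
    mean = statistics.mean(times)
    sem = statistics.stdev(times) / math.sqrt(len(times))  # standard error
    margin = 1.96 * sem
    return (mean - margin, mean + margin)
```

Reporting the interval alongside mean/median would make it obvious when two configurations are statistically indistinguishable at 5 iterations.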

18. Cost Tracking Accuracy

Cost metrics depend on task.total_cost being set correctly. If the Task class doesn't populate these fields, the benchmarks will show 0 cost. Add validation:

import warnings

if total_cost == 0 and iterations > 0:
    warnings.warn("Cost tracking returned $0. Verify API is returning token counts.")

🔒 Security Concerns

19. Environment Variable Handling

The benchmark disables telemetry with os.environ["UPSONIC_TELEMETRY"] = "False" (benchmark.py:26). This is fine, but:

  • Document why telemetry is disabled in benchmarks
  • Consider making it configurable
  • Ensure this doesn't affect production behavior

20. API Key Exposure Risk

While the .env file is gitignored, the error messages could potentially leak key prefixes. Ensure error messages never include actual keys.

🧪 Test Coverage Recommendations

Priority areas for adding tests:

  1. utils.py:

    • Test MemoryProfiler with known object sizes
    • Test PerformanceProfiler statistical calculations
    • Test BenchmarkReporter JSON/Markdown generation
    • Test error cases (empty results, missing fields)
  2. test_cases.py:

    • Validate all test cases return correct structure
    • Test get_test_case_by_name error handling
  3. benchmark.py:

    • Test model validation logic
    • Test result aggregation
    • Test error handling during benchmark runs

📝 Documentation Improvements Needed

  1. Add docstrings to Makefile targets explaining what each does
  2. Document the statistical methodology (why mean/median/stdev are all included)
  3. Add examples of interpreting benchmark results
  4. Document system requirements (minimum Python version, dependencies)
  5. Add troubleshooting section for common errors

🎯 Recommendations for Merging

Must Fix Before Merge:

  1. ✅ Change default model from gpt-5-mini-2025-08-07 to valid model
  2. ✅ Fix type hints for Python 3.8+ compatibility
  3. ✅ Add pympler as required dependency or improve fallback handling
  4. ✅ Fix documentation links
  5. ✅ Remove or translate Turkish documentation to English

Should Fix Before Merge:

  1. Add basic test coverage for utils.py
  2. Extract duplicate code in benchmark functions
  3. Improve error messages and retry logic
  4. Document statistical methodology and limitations

Nice to Have:

  1. Add async benchmark support
  2. Improve result organization for multi-model comparisons
  3. Add confidence interval calculations
  4. Create Windows-compatible scripts

📊 Final Score

Code Quality: 7/10
Architecture: 9/10
Documentation: 8/10
Test Coverage: 2/10
Security: 8/10

Overall: 7/10 - Good foundation but needs critical fixes and test coverage before merge.


Recommendation: Request changes - the hardcoded invalid model name and type compatibility issues are blockers. Once fixed, this will be a valuable addition to the framework.

@claude

claude bot commented Jan 30, 2026

Pull Request Review: PR #511 - Benchmark System

Thank you for this comprehensive benchmark infrastructure! I've conducted a thorough review covering code quality, potential bugs, performance, security, and test coverage.

📊 Summary Scorecard

Code Quality: 6/10 | Type Safety: 5/10 | Error Handling: 4/10 | Performance: 7/10 | Security: 8/10 | Test Coverage: 0/10 | Overall: 5.75/10


🔴 CRITICAL ISSUES - Must Fix Before Merge

1. Python 3.8 Type Compatibility Bug (utils.py:97, 162)

  • Uses Python 3.9+ syntax tuple[float, float] instead of typing.Tuple[float, float]
  • Fix: Import Tuple from typing and use Tuple[float, float]

2. Invalid Default Model Name (benchmark.py:109, 229, 363, 618)

  • References non-existent model gpt-5-mini-2025-08-07
  • Fix: Change to gpt-4o-mini or openai/gpt-4o-mini

3. Race Condition in Memory Profiling (utils.py:93-108)

  • No check if tracemalloc is already running
  • Add tracemalloc.is_tracing() check before start/stop
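The state check could look like the following; the class and method names mirror the profiler described in this review but are assumptions about the actual API:

```python
import tracemalloc


class MemoryProfiler:
    """Sketch of state-safe tracemalloc handling."""

    def start_tracking(self) -> None:
        # Avoid clobbering a trace started elsewhere in the process.
        self._owns_trace = not tracemalloc.is_tracing()
        if self._owns_trace:
            tracemalloc.start()

    def stop_tracking(self):
        current, peak = tracemalloc.get_traced_memory()
        # Only stop the trace if this profiler started it.
        if self._owns_trace:
            tracemalloc.stop()
        return current, peak
```

This also makes nested profilers safe: an inner profiler no longer kills the trace an outer one is still relying on.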

4. Unsafe Division (utils.py:376-378)

  • No validation that iterations > 0 before division
  • Add validation to prevent ZeroDivisionError

⚠️ MAJOR ISSUES - Should Fix

5. Zero Test Coverage

  • No tests for 1,633 lines of new code
  • Need tests for MemoryProfiler, PerformanceProfiler, BenchmarkReporter

6. Massive Code Duplication (benchmark.py:108-359)

  • ~200 lines of nearly identical code between benchmark_direct() and benchmark_agent()
  • Extract common logic into shared helper function

7. Silent Fallback (utils.py:121-126)

  • Silently falls back to shallow size when pympler missing
  • Add warning to inform users of inaccurate measurements

8. Insufficient Statistical Rigor

  • Default iterations=5 is too low for network operations
  • Increase to 10 minimum, add warning for low iteration counts

9. No Retry Logic (benchmark.py:166-172)

  • Immediately aborts on any error including transient failures
  • Add exponential backoff retry logic (3 attempts recommended)
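The backoff suggestion could be sketched as below; run_with_retry and its defaults are illustrative, not part of the PR:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(fn: Callable[[], T], attempts: int = 3,
                   base_delay: float = 1.0) -> T:
    """Call fn(), retrying failures with exponential backoff."""
    last_exc: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # narrow this to network/API errors in real use
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    assert last_exc is not None
    raise last_exc
```

In the warmup path this would distinguish a transient network hiccup (recoverable) from a misconfigured API key (fails all three attempts and surfaces the original exception).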

🟡 CODE QUALITY ISSUES

10. Long Functions - save_markdown_report() is 188 lines
11. Magic Numbers - Use named constants for warmup=1, iterations=5, bytes conversion
12. Weak Type Hints - Many Dict[str, Any] should use TypedDict/dataclass
13. Unused Import - asyncio imported but never used in benchmark.py


🔒 SECURITY CONCERNS

14. Path Traversal Risk (utils.py:501-509)

  • test_name could contain ../../ for directory traversal
  • Add filename sanitization
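One way to implement the sanitization; safe_result_dir is a hypothetical helper, not the PR's actual function:

```python
import re
from pathlib import Path


def safe_result_dir(base: Path, test_name: str) -> Path:
    """Map an arbitrary test-case name to a directory strictly under base."""
    # Collapse anything outside [A-Za-z0-9_-] so '../' cannot survive.
    cleaned = re.sub(r"[^\w-]", "_", test_name) or "unnamed"
    target = (base / cleaned).resolve()
    # Defense in depth: confirm the resolved path is still inside base.
    if base.resolve() not in target.parents:
        raise ValueError(f"unsafe test name: {test_name!r}")
    return target
```

The regex pass is what actually neutralizes traversal; the parents check is a second line of defense in case the cleaning rule is ever loosened.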

15. No Input Validation (benchmark.py:363-386)

  • No validation of iterations, model, test_case_name

⚡ PERFORMANCE NOTES

16. Memory Profiling Overhead - tracemalloc adds 10-30% overhead, document this
17. No Async Support - Consider adding async benchmark variants


✅ RECOMMENDATIONS FOR MERGE

🔴 MUST FIX (Blocking):

  1. Fix Python 3.8 type hints (tuple → Tuple)
  2. Fix default model name (not GPT-5)
  3. Add tracemalloc state checking
  4. Add basic error handling for file I/O

🟡 SHOULD FIX (Strongly Recommended):

  5. Add pympler fallback warning
  6. Extract duplicate code
  7. Add minimum test coverage
  8. Increase default iterations to 10
  9. Add retry logic for errors

🟢 NICE TO HAVE:

  10. Break up long functions
  11. Add async support
  12. Extract magic numbers
  13. Add path sanitization


🎯 FINAL VERDICT

RECOMMENDATION: REQUEST CHANGES ⚠️

Critical issues that block merging:

  • Python 3.8 compatibility bug
  • Invalid default model breaks all runs
  • Zero test coverage for 1,633 lines
  • Significant code duplication

Estimated effort to fix: 10-18 hours (2-4 for critical, 4-8 for tests, 4-6 for quality)

Great work on the documentation and Makefile! The design is solid, just needs fixes before production-ready.

@claude

claude bot commented Jan 31, 2026

Code Review: PR #511 - Comprehensive Benchmark System

Summary

This PR introduces a well-architected benchmark system for performance analysis. The code demonstrates good separation of concerns and comprehensive metrics collection. However, there are critical issues that must be fixed before merging.

Overall Assessment: APPROVE with changes requested


Critical Issues (Must Fix)

1. Division by Zero Vulnerabilities in benchmarks/utils.py

Location: Lines 772-773, 786-787

Problem: If the denominator is 0, this will raise ZeroDivisionError and crash the benchmark.

Fix: Add guards before division for both memory ratio and cost ratio calculations.
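A guard along the lines requested — the helper name and the ratio call sites are illustrative:

```python
def safe_ratio(numerator: float, denominator: float) -> float:
    """numerator / denominator, or 0.0 when the baseline is zero.

    A zero denominator (e.g. cost tracking returned $0, or a 0-byte
    memory baseline) would otherwise raise ZeroDivisionError and crash
    the run while the report is being written.
    """
    return numerator / denominator if denominator else 0.0


# At the flagged call sites, roughly:
# memory_overhead = safe_ratio(agent_memory, direct_memory)
# cost_overhead = safe_ratio(agent_cost, direct_cost)
```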

2. Unused Import in benchmarks/overhead_analysis/benchmark.py

Location: Line 1482

Fix: Remove the unused asyncio import statement.

3. Overly Broad Exception Handling in benchmark.py

Location: Lines 1554-1564

Problem: A bare except: (or except BaseException) masks process-control exceptions such as KeyboardInterrupt and SystemExit, and even except Exception lumps unexpected programming errors in with transient API failures.

Fix: Allow system exceptions to propagate by catching them first and re-raising.
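The re-raise pattern could be sketched as follows; run_iteration is a hypothetical wrapper around one benchmark iteration:

```python
def run_iteration(fn):
    """Run one iteration, never swallowing process-control exceptions."""
    try:
        return fn()
    except (KeyboardInterrupt, SystemExit):
        raise  # let Ctrl-C and explicit exits propagate immediately
    except Exception as exc:
        # Record the failure and let the caller decide whether to continue.
        return exc
```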


High Priority Recommendations

4. Code Duplication

Lines 1587-1690 vs 1693-1823 in benchmark.py have ~150 lines of nearly identical code. Extract common logic into a shared helper function.

5. Missing Unit Tests

No test files for the benchmark utilities. Add benchmarks/tests/ directory with tests for division by zero scenarios, memory profiler edge cases, performance calculations, and report generation.


Medium Priority Suggestions

  • Add warning when pympler is unavailable (utils.py:430-435)
  • Replace print statements with logging module throughout
  • Fix text formatting in test_cases.py using textwrap.dedent()

Security Notes

  • Environment variable modification (benchmark.py:1491) could affect other parts of the system
  • API key handling looks good with truncated error messages
  • File system operations are safe

What's Done Well

  1. Excellent architecture with separation of concerns
  2. Comprehensive type annotations
  3. User-friendly error messages
  4. Well-designed CLI interface
  5. Comprehensive metrics collection
  6. Both JSON and Markdown report formats

Action Items

Before Merge:

  1. Fix division by zero in utils.py (lines 772-773, 786-787)
  2. Remove unused asyncio import in benchmark.py (line 1482)
  3. Fix overly broad exception handling in benchmark.py (lines 1554-1564)

Strongly Recommended:

  4. Refactor duplicated benchmark code
  5. Add unit tests

Great work on this benchmark system! Once the critical issues are addressed, this will be a valuable addition to Upsonic.

@onuratakan onuratakan merged commit 78eada8 into master Feb 1, 2026
5 checks passed
@onuratakan onuratakan deleted the feat/benchmark-system branch February 1, 2026 22:55