[FEATURE][TESTING]: Performance testing and benchmarking framework

# 🚀 Epic: Performance Testing & Benchmarking Framework

## Goal

Enable **comprehensive performance testing, benchmarking, and regression detection** for MCP Gateway, providing automated tooling to measure throughput, latency, resource utilization, and scalability across all MCP protocol operations. This ensures production readiness, identifies performance bottlenecks, and prevents regressions in critical API paths.

## Why Now?

Performance is critical for production adoption of MCP Gateway. As the gateway federates multiple MCP servers, handles high-volume tool invocations, and serves as a central API hub, we need:

1. **Quantifiable performance baselines** to understand current system capabilities
2. **Regression detection** to catch performance degradations before production
3. **Scalability insights** to guide infrastructure planning and optimization
4. **Database vs. in-memory trade-offs** to optimize different deployment profiles
5. **Load testing automation** to validate production readiness

Without systematic performance testing, we risk shipping performance regressions, failing to meet production SLAs, and making uninformed architectural decisions.

This framework positions MCP Gateway as a **production-grade, performance-verified** platform.

---

## 📖 User Stories

<details>
<summary>US-1: DevOps Engineer - Automated Benchmarking</summary>

**As a** DevOps Engineer
**I want** to run automated performance benchmarks before each release
**So that** I can verify the system meets performance SLAs and detect regressions

**Acceptance Criteria:**

```gherkin
Given MCP Gateway is running locally or in Docker
When I execute "make test" from tests/performance/
Then the system should:
 - Verify all required services are running (gateway, database, cache)
 - Authenticate and obtain JWT tokens
 - Run benchmarks against all core endpoints (/tools, /resources, /prompts, /servers)
 - Generate detailed HTML reports with response time distributions
 - Compare results against baseline metrics
 - Flag performance regressions >10% degradation
 - Save new baseline if performance improved
 - Exit with code 0 if tests pass, non-zero if regressions detected
```

**Technical Requirements:**
- Makefile with test targets (test, test-tools, test-resources, test-prompts, etc.)
- Baseline management system storing P50/P95/P99 latencies
- Automated baseline comparison with configurable thresholds
- HTML report generation with charts and metrics
- Support both local and Docker testing modes

</details>

<details>
<summary>US-2: Performance Engineer - Multi-Scenario Testing</summary>

**As a** Performance Engineer
**I want** to run different workload scenarios (light, medium, heavy)
**So that** I can understand system behavior under various load conditions

**Acceptance Criteria:**

```gherkin
Given performance profiles defined (light.env, medium.env, heavy.env)
When I execute "./run-configurable.sh --profile heavy --scenarios all"
Then the system should:
 - Load heavy profile: 5000 requests, 200 concurrent users
 - Run all scenarios: tools, resources, prompts, mixed-workload, database, gateway-core
 - Test with 4 simulated MCP servers (SQLite + PostgreSQL)
 - Generate comparative HTML report
 - Show P50/P95/P99 latencies for each scenario
 - Display requests/sec and total duration
 - Compare against baseline for heavy profile
 - Store results in results/ directory with timestamp
```

**Technical Requirements:**
- Profile system with configurable load parameters
- Multi-scenario orchestration script
- Support for different database backends (SQLite, PostgreSQL)
- Dynamic Docker Compose generation for multi-server testing
- Scenario isolation and cleanup between runs
- Profile-specific baseline storage

</details>

<details>
<summary>US-3: Backend Developer - Database Performance Analysis</summary>

**As a** Backend Developer
**I want** to benchmark database operations separately from API logic
**So that** I can identify database bottlenecks and optimize queries

**Acceptance Criteria:**

```gherkin
Given database benchmarking scenario
When I execute "./scenarios/database-benchmark.sh"
Then the system should:
 - Test direct database query performance (list tools, servers, prompts)
 - Test database write operations (create/update/delete)
 - Test complex joins (tool + team + server queries)
 - Compare SQLite vs PostgreSQL performance
 - Generate database-specific performance report
 - Identify slow queries (>100ms)
 - Provide optimization recommendations
```

**Technical Requirements:**
- Direct database benchmarking bypassing API layer
- SQLAlchemy query profiling
- Connection pool sizing analysis
- Index effectiveness measurement
- Query plan analysis for slow queries

</details>

<details>
<summary>US-4: QA Engineer - Regression Detection</summary>

**As a** QA Engineer
**I want** automated regression detection in CI/CD pipelines
**So that** performance degradations are caught before production deployment

**Acceptance Criteria:**

```gherkin
Given baseline metrics stored from previous release
When CI runs performance tests on new commit
Then the system should:
 - Load baseline metrics from baselines/baseline_<profile>.json
 - Run full test suite against new code
 - Compare each metric against baseline
 - Calculate percentage change (positive = improvement, negative = regression)
 - Flag any metric with >10% regression
 - Generate GitHub Actions annotation for regressions
 - Block merge if critical regressions detected
 - Update baseline if improvements >5% sustained
```

**Technical Requirements:**
- Baseline storage in JSON format
- compare_results.py script for automated comparison
- Configurable regression thresholds per metric
- CI/CD integration (GitHub Actions, GitLab CI)
- Baseline versioning and rollback capability
- Historical trend analysis

</details>

<details>
<summary>US-5: Platform Admin - Scalability Testing</summary>

**As a** Platform Administrator
**I want** to test gateway scalability with multiple MCP servers
**So that** I can plan infrastructure for production workloads

**Acceptance Criteria:**

```gherkin
Given scalability test configuration
When I execute "./run-advanced.sh --servers 10 --profile heavy"
Then the system should:
 - Generate docker-compose.yaml with 10 MCP servers
 - Start all servers (5 SQLite, 5 PostgreSQL)
 - Register all servers with gateway
 - Run mixed workload across all servers
 - Measure aggregated throughput (RPS)
 - Measure resource utilization (CPU, memory, DB connections)
 - Test horizontal scaling limits
 - Generate scalability report with server count vs RPS correlation
```

**Technical Requirements:**
- Dynamic docker-compose.yaml generation
- Multi-server registration automation
- Load distribution across servers
- Resource monitoring (psutil, docker stats)
- Concurrency testing with 50-500 parallel clients
- Database connection pool saturation testing

</details>

<details>
<summary>US-6: Developer - Manual API Testing</summary>

**As a** Developer
**I want** CLI-based manual testing commands
**So that** I can quickly verify API functionality and performance during development

**Acceptance Criteria:**

```gherkin
Given MANUAL_TESTING.md guide
When I run CLI commands from the guide
Then I should be able to:
 - Login and obtain JWT token via curl
 - List tools/resources/prompts with jq filtering
 - Run quick performance tests with hey (100-1000 requests)
 - Monitor API health continuously
 - Test authentication token expiration
 - Benchmark specific endpoints individually
 - Verify deployment after code changes
```

**Technical Requirements:**
- MANUAL_TESTING.md with copy-paste ready commands
- curl examples for all core endpoints
- jq filters for response parsing
- hey benchmarking commands
- Continuous monitoring scripts
- Token generation utilities
- Quick smoke test scripts

</details>

---

## 🏗 Architecture

### Testing Framework Architecture

```mermaid
graph TB
 subgraph "Test Orchestration"
 A1[Makefile]
 A2[run-configurable.sh]
 A3[run-advanced.sh]
 A4[run-all.sh]
 end

 subgraph "Configuration"
 B1[config.yaml]
 B2[profiles/light.env]
 B3[profiles/medium.env]
 B4[profiles/heavy.env]
 end

 subgraph "Scenarios"
 C1[tools-benchmark.sh]
 C2[resources-benchmark.sh]
 C3[prompts-benchmark.sh]
 C4[mixed-workload.sh]
 C5[database-benchmark.sh]
 C6[gateway-core-benchmark.sh]
 end

 subgraph "Utilities"
 D1[setup-auth.sh]
 D2[check-services.sh]
 D3[baseline_manager.py]
 D4[compare_results.py]
 D5[report_generator.py]
 D6[generate_docker_compose.py]
 end

 subgraph "MCP Gateway"
 E1[REST API]
 E2[Database]
 E3[MCP Servers]
 end

 A1 --> C1
 A2 --> B1
 A2 --> B2
 A3 --> D6
 C1 --> D1
 C1 --> D2
 C1 --> E1
 D3 --> D4
 D4 --> D5
 E1 --> E2
 E1 --> E3
```

### Test Execution Flow

```mermaid
sequenceDiagram
 participant User
 participant Makefile
 participant Setup as setup-auth.sh
 participant Check as check-services.sh
 participant Scenario as Benchmark Script
 participant API as MCP Gateway
 participant Baseline as baseline_manager.py
 participant Compare as compare_results.py
 participant Report as report_generator.py

 User->>Makefile: make test
 Makefile->>Check: Verify services
 Check->>API: GET /health
 API-->>Check: 200 OK
 Check-->>Makefile: Services ready

 Makefile->>Setup: Authenticate
 Setup->>API: POST /auth/login
 API-->>Setup: JWT token
 Setup-->>Makefile: TOKEN exported

 Makefile->>Scenario: Run benchmark
 Scenario->>API: hey -n 1000 -c 50 GET /tools
 API-->>Scenario: Performance metrics
 Scenario->>Scenario: Parse hey output
 Scenario-->>Makefile: results/tools_YYYY-MM-DD-HHMMSS.txt

 Makefile->>Baseline: Load baseline
 Baseline-->>Makefile: baseline_light.json

 Makefile->>Compare: Compare results
 Compare->>Compare: Calculate delta %
 Compare-->>Makefile: comparison_report.txt

 Makefile->>Report: Generate HTML
 Report->>Report: Create charts
 Report-->>Makefile: report.html

 Makefile-->>User: Open report in browser
```

### Baseline Management System

```mermaid
graph LR
 A[Run Tests] --> B{Baseline Exists?}
 B -->|No| C[Save as Initial Baseline]
 B -->|Yes| D[Load Baseline]
 D --> E[Compare Metrics]
 E --> F{Regression?}
 F -->|Yes >10%| G[Fail Tests]
 F -->|No| H{Improvement?}
 H -->|Yes >5%| I[Update Baseline]
 H -->|No| J[Keep Baseline]
 I --> K[Pass Tests]
 J --> K
 C --> K
```

---

## 📂 Directory Structure

```
tests/performance/
├── Makefile # Main test automation (271 lines, 40+ targets)
├── config.yaml # Centralized configuration (386 lines)
├── README.md # Comprehensive guide (309 lines)
├── MANUAL_TESTING.md # CLI testing guide (458 lines)
├── PERFORMANCE_STRATEGY.md # Testing strategy doc (2116 lines)
│
├── profiles/ # Load profiles
│ ├── light.env # 100 req, 10 concurrent
│ ├── medium.env # 1000 req, 50 concurrent
│ └── heavy.env # 5000 req, 200 concurrent
│
├── scenarios/ # Benchmark scenarios
│ ├── tools-benchmark.sh # Test /tools endpoint (134 lines)
│ ├── resources-benchmark.sh # Test /resources endpoint (134 lines)
│ ├── prompts-benchmark.sh # Test /prompts endpoint (126 lines)
│ ├── mixed-workload.sh # Combined workload (152 lines)
│ ├── database-benchmark.sh # Direct DB tests (217 lines)
│ └── gateway-core-benchmark.sh # Core operations (268 lines)
│
├── payloads/ # Test request payloads
│ ├── tools/
│ │ ├── list_tools.json
│ │ ├── get_system_time.json
│ │ └── convert_time.json
│ ├── resources/
│ │ ├── list_resources.json
│ │ ├── read_timezone_info.json
│ │ └── read_world_times.json
│ └── prompts/
│ ├── list_prompts.json
│ └── get_compare_timezones.json
│
├── utils/ # Python utilities
│ ├── setup-auth.sh # Authentication helper (88 lines)
│ ├── check-services.sh # Service health check (67 lines)
│ ├── baseline_manager.py # Baseline CRUD (323 lines)
│ ├── compare_results.py # Regression detection (376 lines)
│ ├── report_generator.py # HTML reports (1195 lines)
│ └── generate_docker_compose.py # Multi-server setup (423 lines)
│
├── baselines/ # Stored baselines (gitignored)
│ ├── baseline_light.json
│ ├── baseline_medium.json
│ └── baseline_heavy.json
│
├── results/ # Test results (gitignored)
│ ├── tools_2025-10-10-143000.txt
│ ├── resources_2025-10-10-143100.txt
│ └── report_2025-10-10-143200.html
│
├── run-configurable.sh # Configurable test runner (435 lines)
├── run-advanced.sh # Multi-server testing (381 lines)
└── run-all.sh # All scenarios (271 lines)
```

---

## ⚙️ Configuration

### config.yaml Structure

```yaml
# Service Configuration
service:
 base_url: "http://localhost:4444"
 health_check_timeout: 30
 auth_required: true

# Authentication
auth:
 email: "admin@example.com"
 password: "changeme"
 token_cache: "/tmp/perf_test_token.txt"

# Performance Settings
performance:
 hey_binary: "hey"
 default_requests: 1000
 default_concurrency: 50
 default_timeout: 30

# Baseline Configuration
baselines:
 storage_dir: "baselines"
 regression_threshold: 10.0 # % degradation before failure
 improvement_threshold: 5.0 # % improvement to update baseline
 metrics:
 - p50_latency
 - p95_latency
 - p99_latency
 - requests_per_sec
 - total_time

# Test Scenarios
scenarios:
 tools:
 endpoint: "/tools"
 methods: [GET, POST]
 payloads_dir: "payloads/tools"
 resources:
 endpoint: "/resources"
 methods: [GET, POST]
 payloads_dir: "payloads/resources"
 # ... more scenarios

# Reporting
reporting:
 output_dir: "results"
 html_template: "utils/report_template.html"
 open_browser: true
 include_charts: true
 chart_types: ["latency_distribution", "throughput_timeline"]
```

### Profile Configuration Example

```bash
# profiles/heavy.env
# Heavy load profile for stress testing

# Load Parameters
export PERF_REQUESTS=5000
export PERF_CONCURRENCY=200
export PERF_TIMEOUT=60

# Test Configuration
export PERF_SCENARIOS="tools,resources,prompts,mixed,database,gateway-core"
export PERF_ITERATIONS=3
export PERF_WARMUP=true

# Database Configuration
export PERF_DB_BACKEND="postgresql"
export PERF_SERVERS_COUNT=10

# Reporting
export PERF_BASELINE_PROFILE="heavy"
export PERF_REGRESSION_THRESHOLD=10
export PERF_GENERATE_REPORT=true
```

---

## 📋 Implementation Tasks

### Phase 1: Core Infrastructure ✅

- [x] **Project Structure**
 - [x] Create tests/performance/ directory
 - [x] Add .gitignore for results, baselines, Docker artifacts
 - [x] Create subdirectories: scenarios/, utils/, payloads/, profiles/, baselines/
 - [x] Initialize README.md with overview

- [x] **Configuration System**
 - [x] Create config.yaml with service, auth, performance sections
 - [x] Define baseline configuration
 - [x] Create profile templates (light.env, medium.env, heavy.env)
 - [x] Add environment variable validation

- [x] **Makefile Automation**
 - [x] Create Makefile with 40+ targets
 - [x] Add test, test-tools, test-resources, test-prompts targets
 - [x] Implement test-all target for full suite
 - [x] Add clean, clean-results, clean-baselines targets
 - [x] Create help target with command documentation
 - [x] Add Docker targets (test-docker, docker-up, docker-down)

### Phase 2: Benchmark Scenarios ✅

- [x] **Tools Endpoint Benchmarks**
 - [x] Create scenarios/tools-benchmark.sh (134 lines)
 - [x] Test GET /tools with various limits (10, 50, 100)
 - [x] Test POST /tools/invoke with tool execution
 - [x] Parse hey output for metrics
 - [x] Save results with timestamps

- [x] **Resources Endpoint Benchmarks**
 - [x] Create scenarios/resources-benchmark.sh (134 lines)
 - [x] Test GET /resources listing
 - [x] Test GET /resources/{uri} retrieval
 - [x] Test StreamableHTTP resource types
 - [x] Handle large resource payloads

- [x] **Prompts Endpoint Benchmarks**
 - [x] Create scenarios/prompts-benchmark.sh (126 lines)
 - [x] Test GET /prompts listing
 - [x] Test GET /prompts/{name} retrieval
 - [x] Test prompt template expansion
 - [x] Measure argument validation overhead

- [x] **Mixed Workload Scenario**
 - [x] Create scenarios/mixed-workload.sh (152 lines)
 - [x] Run tools + resources + prompts concurrently
 - [x] Simulate realistic API usage patterns
 - [x] Measure aggregate system throughput
 - [x] Test connection pool saturation

- [x] **Database Benchmark Scenario**
 - [x] Create scenarios/database-benchmark.sh (217 lines)
 - [x] Test direct SQLAlchemy query performance
 - [x] Compare SQLite vs PostgreSQL
 - [x] Benchmark complex joins (tool + team + server)
 - [x] Test connection pool efficiency

- [x] **Gateway Core Benchmark**
 - [x] Create scenarios/gateway-core-benchmark.sh (268 lines)
 - [x] Test /health endpoint (no auth overhead)
 - [x] Test /servers listing and SSE connections
 - [x] Test gateway federation features
 - [x] Measure middleware overhead

### Phase 3: Utilities & Automation ✅

- [x] **Authentication Helper**
 - [x] Create utils/setup-auth.sh (88 lines)
 - [x] Implement login via POST /auth/login
 - [x] Extract JWT token from JSON response
 - [x] Cache token to /tmp/perf_test_token.txt
 - [x] Validate token expiration
 - [x] Auto-refresh expired tokens

- [x] **Service Health Check**
 - [x] Create utils/check-services.sh (67 lines)
 - [x] Test gateway availability (GET /health)
 - [x] Verify database connectivity
 - [x] Check Redis/cache availability
 - [x] Validate MCP servers are registered
 - [x] Exit with error code if services unavailable

- [x] **Baseline Manager**
 - [x] Create utils/baseline_manager.py (323 lines)
 - [x] Implement save_baseline() for storing metrics
 - [x] Implement load_baseline() for retrieval
 - [x] Add list_baselines() for discovery
 - [x] Implement delete_baseline() for cleanup
 - [x] Support profile-specific baselines
 - [x] JSON serialization with pretty printing

- [x] **Results Comparison**
 - [x] Create utils/compare_results.py (376 lines)
 - [x] Parse hey output format
 - [x] Extract P50, P95, P99 latencies
 - [x] Calculate requests/sec and total time
 - [x] Compare against baseline metrics
 - [x] Calculate percentage delta
 - [x] Flag regressions exceeding threshold
 - [x] Generate comparison report

- [x] **Report Generator**
 - [x] Create utils/report_generator.py (1195 lines)
 - [x] Generate HTML reports with embedded CSS/JS
 - [x] Create latency distribution charts (Chart.js)
 - [x] Create throughput timeline charts
 - [x] Add metric tables with color-coded deltas
 - [x] Include system information (CPU, memory, DB)
 - [x] Add test configuration summary
 - [x] Support multi-scenario comparison
 - [x] Auto-open in browser

- [x] **Docker Compose Generator**
 - [x] Create utils/generate_docker_compose.py (423 lines)
 - [x] Generate docker-compose.yaml dynamically
 - [x] Support 1-100 MCP servers
 - [x] Mix SQLite and PostgreSQL backends
 - [x] Assign unique ports to each server
 - [x] Configure resource limits (CPU, memory)
 - [x] Add health checks for all services
 - [x] Include gateway, database, Redis services

### Phase 4: Test Orchestration ✅

- [x] **Configurable Test Runner**
 - [x] Create run-configurable.sh (435 lines)
 - [x] Support --profile argument (light|medium|heavy)
 - [x] Support --scenarios argument (comma-separated)
 - [x] Support --baseline argument (save|compare|skip)
 - [x] Load profile environment variables
 - [x] Execute selected scenarios in sequence
 - [x] Collect all results
 - [x] Generate unified HTML report
 - [x] Exit with code based on regression detection

- [x] **Advanced Test Runner**
 - [x] Create run-advanced.sh (381 lines)
 - [x] Support --servers argument (1-100)
 - [x] Support --database argument (sqlite|postgresql|both)
 - [x] Generate docker-compose.yaml on-the-fly
 - [x] Start all containers with docker-compose up
 - [x] Wait for services to be healthy
 - [x] Register all servers with gateway
 - [x] Run full test suite
 - [x] Cleanup containers after tests

- [x] **All-Scenarios Runner**
 - [x] Create run-all.sh (271 lines)
 - [x] Run all profiles sequentially (light → medium → heavy)
 - [x] Run all scenarios for each profile
 - [x] Compare each run against profile baseline
 - [x] Generate master report comparing all profiles
 - [x] Flag critical regressions
 - [x] Provide summary statistics

### Phase 5: Documentation ✅

- [x] **README.md**
 - [x] Create comprehensive guide (309 lines)
 - [x] Add quick start section
 - [x] Document all Makefile targets
 - [x] Explain profile system
 - [x] Show example commands
 - [x] Add troubleshooting section
 - [x] Include expected performance baselines

- [x] **MANUAL_TESTING.md**
 - [x] Create CLI testing guide (458 lines)
 - [x] Add curl commands for all endpoints
 - [x] Include jq filters for response parsing
 - [x] Add hey benchmarking examples
 - [x] Provide quick smoke test script
 - [x] Add continuous monitoring script
 - [x] Include troubleshooting commands

- [x] **PERFORMANCE_STRATEGY.md**
 - [x] Create detailed strategy doc (2116 lines)
 - [x] Explain testing philosophy
 - [x] Document baseline management
 - [x] Describe regression detection algorithm
 - [x] Provide performance tuning tips
 - [x] Document environment optimization (.env.example updates)
 - [x] Include CI/CD integration examples

- [x] **Configuration Updates**
 - [x] Update .env.example with performance-optimized defaults
 - [x] Add LOG_LEVEL=ERROR for testing
 - [x] Add DISABLE_ACCESS_LOG=true
 - [x] Add MCPGATEWAY_LOGGING_OPTIMIZED=true
 - [x] Document performance impact of settings

### Phase 6: Test Payloads ✅

- [x] **Tools Payloads**
 - [x] Create payloads/tools/list_tools.json
 - [x] Create payloads/tools/get_system_time.json
 - [x] Create payloads/tools/convert_time.json
 - [x] Add parameter variations for testing

- [x] **Resources Payloads**
 - [x] Create payloads/resources/list_resources.json
 - [x] Create payloads/resources/read_timezone_info.json
 - [x] Create payloads/resources/read_world_times.json
 - [x] Test StreamableHTTP resources

- [x] **Prompts Payloads**
 - [x] Create payloads/prompts/list_prompts.json
 - [x] Create payloads/prompts/get_compare_timezones.json
 - [x] Add argument templates

### Phase 7: Integration & Bug Fixes ✅

- [x] **Bug Fix: JWT Team Dict Handling**
 - [x] Fix token_scoping.py team ID extraction (lines 373-375)
 - [x] Handle both dict and string team formats
 - [x] Normalize token_teams in _check_resource_team_ownership()
 - [x] Maintain backward compatibility
 - [x] Test with real JWT tokens

- [x] **Bug Fix: LoggingService Performance**
 - [x] Add MCPGATEWAY_LOGGING_OPTIMIZED setting
 - [x] Skip expensive stack inspection when optimized=true
 - [x] Reduce logging overhead from 50% to <5%
 - [x] Achieve 251x performance improvement (7 → 1810 RPS)

- [x] **Bug Fix: Ctrl+C Handling**
 - [x] Add signal handlers to benchmark scripts
 - [x] Cleanup background processes on interrupt
 - [x] Graceful shutdown of hey processes
 - [x] Prevent zombie processes

- [x] **Code Quality**
 - [x] Fix all pylint issues (10.00/10 rating)
 - [x] Add pylint disable comments for acceptable patterns
 - [x] Convert strings to f-strings in support_bundle_service.py
 - [x] Add import-outside-toplevel comments for optional imports

### Phase 8: Testing & Validation ✅

- [x] **Functional Testing**
 - [x] Test Makefile targets individually
 - [x] Validate profile loading
 - [x] Test authentication flow
 - [x] Verify service health checks
 - [x] Test baseline save/load/compare
 - [x] Validate report generation

- [x] **Performance Validation**
 - [x] Run light profile and verify metrics
 - [x] Run medium profile and check scalability
 - [x] Run heavy profile and validate stability
 - [x] Test with 4 MCP servers (SQLite + PostgreSQL)
 - [x] Verify 251x improvement sustained

- [x] **Documentation Validation**
 - [x] Verify all README examples work
 - [x] Test all MANUAL_TESTING.md commands
 - [x] Validate PERFORMANCE_STRATEGY.md accuracy
 - [x] Check .env.example settings

---

## ✅ Success Criteria

- [x] **Automated Testing**: Complete Makefile with 40+ targets for test automation
- [x] **Scenario Coverage**: 6 benchmark scenarios covering all core API operations
- [x] **Profile System**: 3 load profiles (light, medium, heavy) with configurable parameters
- [x] **Baseline Management**: Automatic baseline storage, comparison, and regression detection
- [x] **Reporting**: HTML reports with charts, metrics, and color-coded deltas
- [x] **Multi-Server Testing**: Docker Compose generation for 1-100 MCP servers
- [x] **Documentation**: 3 comprehensive docs (README, MANUAL_TESTING, PERFORMANCE_STRATEGY)
- [x] **Utilities**: 6 helper scripts/tools for automation (setup-auth, check-services, baseline_manager, compare_results, report_generator, generate_docker_compose)
- [x] **Performance**: 251x improvement verified (7 → 1810 RPS with optimizations)
- [x] **Bug Fixes**: JWT team handling, logging performance, Ctrl+C handling all resolved
- [x] **Code Quality**: Pylint 10.00/10 rating maintained

---

## 📊 Performance Impact

### Before Performance Testing Framework

- No systematic benchmarking
- No regression detection
- Unknown performance baselines
- Manual testing only
- No load profile definitions
- No baseline comparison

### After Performance Testing Framework

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Tools API (RPS) | 7 | 1810 | **251x** |
| P50 Latency | 180ms | 20ms | **9x faster** |
| P95 Latency | 350ms | 45ms | **7.8x faster** |
| Regression Detection | Manual | Automated | **100% coverage** |
| Test Scenarios | 0 | 6 | **6 scenarios** |
| Load Profiles | 0 | 3 | **3 profiles** |
| Automation Targets | 0 | 40+ | **Full automation** |

### Key Optimizations Enabled

1. **Logging Performance**: MCPGATEWAY_LOGGING_OPTIMIZED=true (251x improvement)
2. **Access Log Disabling**: DISABLE_ACCESS_LOG=true (reduces I/O overhead)
3. **Log Level Tuning**: LOG_LEVEL=ERROR for testing (minimal logging)
4. **Database Optimization**: Identified query bottlenecks via database-benchmark.sh
5. **Connection Pooling**: Validated pool sizing via mixed-workload.sh

---

## 📝 Additional Notes

🔹 **Baseline-Driven Development**: All performance changes validated against baselines before merge, ensuring no regressions slip into production.

🔹 **Profile-Based Testing**: Light (developer laptop), Medium (CI), Heavy (production simulation) profiles enable testing across deployment scenarios.

🔹 **Hey Tool Integration**: Using `hey` (Apache Bench successor) for HTTP load testing provides detailed latency distributions and throughput metrics.

🔹 **Docker Compose Automation**: Dynamic generation of multi-server test environments enables scalability testing without manual configuration.

🔹 **Regression Thresholds**: Configurable thresholds (default 10% degradation) balance sensitivity with real-world variance.

🔹 **HTML Reporting**: Interactive reports with Chart.js visualizations make performance data accessible to non-technical stakeholders.

🔹 **CI/CD Ready**: All scripts designed for integration with GitHub Actions, GitLab CI, Jenkins, etc.

🔹 **Future Extensions**:
 - WebSocket performance testing
 - SSE streaming benchmarks
 - Database migration performance
 - Plugin overhead measurement
 - Multi-region latency testing
 - Chaos engineering integration

---

## 🏁 Definition of Done

- [x] All implementation tasks completed (Phases 1-8)
- [x] 40+ Makefile targets operational
- [x] 6 benchmark scenarios functional
- [x] 3 load profiles defined and tested
- [x] Baseline management system working
- [x] Automated regression detection active
- [x] HTML report generation functional
- [x] All documentation complete (3 docs, 2883 lines)
- [x] All bug fixes merged (JWT, logging, Ctrl+C)
- [x] Code quality verified (pylint 10.00/10)
- [x] 251x performance improvement sustained
- [x] Docker multi-server testing validated
- [x] .env.example updated with optimizations
- [x] Manual testing guide validated
- [x] Team review completed and approved

---

## 🎯 Metrics Summary

| Component | Lines of Code | Files | Description |
|-----------|---------------|-------|-------------|
| Makefile | 271 | 1 | Test automation |
| Scenarios | 1031 | 6 | Benchmark scripts |
| Utilities | 2792 | 6 | Python/Bash helpers |
| Documentation | 2883 | 3 | README, guides, strategy |
| Configuration | 386 | 1 | config.yaml |
| Profiles | 15 | 3 | Load definitions |
| Payloads | 63 | 9 | Test data |
| **Total** | **7441** | **29** | **Complete framework** |

**Project Impact**: 8,343 insertions, 57 deletions, 42 files changed

Metric	Before	After	Improvement
Tools API (RPS)	7	1810	251x
P50 Latency	180ms	20ms	9x faster
P95 Latency	350ms	45ms	7.8x faster
Regression Detection	Manual	Automated	100% coverage
Test Scenarios	0	6	6 scenarios
Load Profiles	0	3	3 profiles
Automation Targets	0	40+	Full automation

Component	Lines of Code	Files	Description
Makefile	271	1	Test automation
Scenarios	1031	6	Benchmark scripts
Utilities	2792	6	Python/Bash helpers
Documentation	2883	3	README, guides, strategy
Configuration	386	1	config.yaml
Profiles	15	3	Load definitions
Payloads	63	9	Test data
Total	7441	29	Complete framework

[FEATURE][TESTING]: Performance testing and benchmarking framework #1203

Description

🚀 Epic: Performance Testing & Benchmarking Framework

Goal

Why Now?

📖 User Stories

🏗 Architecture

Testing Framework Architecture

Test Execution Flow

Baseline Management System

📂 Directory Structure

⚙️ Configuration

config.yaml Structure

Profile Configuration Example

📋 Implementation Tasks

Phase 1: Core Infrastructure ✅

Phase 2: Benchmark Scenarios ✅

Phase 3: Utilities & Automation ✅

Phase 4: Test Orchestration ✅

Phase 5: Documentation ✅

Phase 6: Test Payloads ✅

Phase 7: Integration & Bug Fixes ✅

Phase 8: Testing & Validation ✅

✅ Success Criteria

📊 Performance Impact

Before Performance Testing Framework

After Performance Testing Framework

Key Optimizations Enabled

📝 Additional Notes

🏁 Definition of Done

🎯 Metrics Summary

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions