[PERFORMANCE]: Admin UI endpoint /admin/ has high latency under load #1907
Description
Summary
The /admin/ endpoint exhibits high latency (6+ seconds average, 12-15s P95) under extreme load, making it the primary bottleneck for overall P95 response times. This affects administrator experience when many users access the dashboard concurrently, although the tested load of 3000+ concurrent users is unlikely for an admin UI in practice.
Problem
Under load testing with 3000 concurrent users:
| Metric | Value |
|---|---|
| Average response time | 6,326ms |
| P95 | 12,000ms |
| P99 | 13,000ms |
| Response size | 587KB HTML |
Meanwhile, REST API endpoints (/tools, /gateways, etc.) respond in 2-5ms due to Redis caching.
Root Cause Analysis
1. Large HTML Response (587KB)
The admin dashboard renders everything in a single page:
- All tools, resources, prompts, servers, gateways
- Aggregated metrics for each entity type
- Team information and user roles
- Top performers lists
2. Heavy Template Rendering
Jinja2 template rendering is CPU-intensive. With 3 gateway replicas at 670-720% CPU, template rendering is a significant contributor.
3. Multiple Database Queries Per Request
Each /admin/ request queries:
- `tools` table
- `resources` table
- `prompts` table
- `servers` table
- `gateways` table
- `email_teams` table
- `a2a_agents` table
- Metrics aggregations for each type
4. No Response Caching
Unlike the REST API endpoints which are Redis-cached, the admin HTML response is regenerated on every request.
Evidence
Load Test Results
=== TOP ENDPOINTS BY LATENCY ===
Endpoint Reqs Avg P95
/admin/ 10629 6133 12000
/admin/tools 2614 2151 4500
/rpc tools/list 26607 1539 3300
=== FAST ENDPOINTS (Redis-cached, 2ms avg) ===
/tools, /servers, /gateways, /resources
Resource Usage During Load
gateway-1: 675% CPU, 4.6GB memory
gateway-2: 671% CPU, 4.7GB memory
gateway-3: 716% CPU, 4.7GB memory
postgres: 116% CPU, 592MB memory
Proposed Solutions
Phase 1: Quick Wins
- Add response caching for `/admin/`
  - Cache rendered HTML in Redis with short TTL (10-30 seconds)
  - Invalidate on data changes
  - Expected improvement: 90%+ reduction in render time for cached responses
- Lazy-load dashboard sections
  - Load summary counts first (fast)
  - Load detailed lists via HTMX on scroll/click
  - Already partially implemented with `/admin/*/partial` endpoints
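
A minimal sketch of the short-TTL response cache described above. It is not the gateway's actual implementation: an in-memory dict stands in for Redis so the example runs on its own, and `TTLResponseCache`, `get_admin_html`, and the `admin:dashboard` key are hypothetical names.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLResponseCache:
    """In-memory stand-in for a Redis-backed cache of rendered HTML (hypothetical)."""

    def __init__(
        self,
        ttl_seconds: float = 20.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, html = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it so the caller re-renders.
            del self._entries[key]
            return None
        return html

    def set(self, key: str, html: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, html)

    def invalidate(self, key: str) -> None:
        # Call this from write paths (create/update/delete) so changes show up
        # without waiting for the TTL to expire.
        self._entries.pop(key, None)


def get_admin_html(
    cache: TTLResponseCache,
    render: Callable[[], str],
    key: str = "admin:dashboard",
) -> str:
    """Return cached HTML when fresh, otherwise render once and store the result."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    html = render()
    cache.set(key, html)
    return html
```

If the rendered page differs per user or team, the cache key would need to include that scope; a single shared key is only safe when every administrator sees the same HTML.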
Phase 2: Pagination
- Paginate entity lists by default
  - Don't load all tools/resources/prompts at once
  - Default to 25-50 items per page
  - Add search/filter to find specific items
- Defer metrics loading
  - Load dashboard layout immediately
  - Fetch metrics via async HTMX calls
  - Show loading spinners during fetch
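
The pagination item could look roughly like the sketch below. `Page` and `paginate` are hypothetical helpers, not existing gateway code, and the in-memory slice is only for illustration: a real handler would push the same offset and limit into the database query so that unneeded rows are never loaded.

```python
from dataclasses import dataclass
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class Page(Generic[T]):
    """One page of results plus the metadata a template needs for page links."""

    items: List[T]
    page: int
    per_page: int
    total_items: int

    @property
    def total_pages(self) -> int:
        # Ceiling division; an empty list still counts as one (empty) page.
        return max(1, -(-self.total_items // self.per_page))

    @property
    def has_next(self) -> bool:
        return self.page < self.total_pages


def paginate(
    items: Sequence[T],
    page: int = 1,
    per_page: int = 50,
    max_per_page: int = 100,  # assumption: cap to keep responses small
) -> Page[T]:
    """Clamp the requested page and size, then slice out that page."""
    per_page = max(1, min(per_page, max_per_page))
    total = len(items)
    total_pages = max(1, -(-total // per_page))
    page = max(1, min(page, total_pages))
    start = (page - 1) * per_page
    return Page(
        items=list(items[start:start + per_page]),
        page=page,
        per_page=per_page,
        total_items=total,
    )
```

Clamping out-of-range page numbers instead of raising keeps stale bookmarks and hand-edited URLs working after items are deleted.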
Phase 3: Architecture
- Pre-compute dashboard data
  - Background job to compute dashboard summary
  - Store in Redis, refresh every 30-60 seconds
  - Dashboard reads from cache only
- Consider lighter admin summary endpoint
  - New `/admin/summary` with just counts and key metrics
  - Full data loaded on-demand per section
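
One way to sketch the pre-computation idea: a background loop refreshes a summary on a fixed interval and request handlers only ever read the stored copy. Everything here is an assumption for illustration. `DashboardSummaryStore` and `run_refresher` are hypothetical names, and a lock-protected dict stands in for Redis.

```python
import threading
import time
from typing import Callable, Dict, Optional


class DashboardSummaryStore:
    """Holds the latest pre-computed summary (in-memory stand-in for Redis)."""

    def __init__(self, compute: Callable[[], Dict[str, int]]) -> None:
        self._compute = compute
        self._lock = threading.Lock()
        self._summary: Optional[Dict[str, int]] = None
        self._refreshed_at: Optional[float] = None

    def refresh(self) -> None:
        # Run the expensive aggregation outside the lock so readers never wait on it.
        summary = self._compute()
        with self._lock:
            self._summary = dict(summary)
            self._refreshed_at = time.time()

    def read(self) -> Optional[Dict[str, int]]:
        # Request handlers call only this; it never touches the database.
        with self._lock:
            return None if self._summary is None else dict(self._summary)


def run_refresher(
    store: DashboardSummaryStore,
    interval_seconds: float,
    stop: threading.Event,
) -> None:
    """Refresh until `stop` is set; intended to run in a daemon thread."""
    while not stop.is_set():
        store.refresh()
        stop.wait(interval_seconds)
```

A caller would start it with something like `threading.Thread(target=run_refresher, args=(store, 30.0, stop), daemon=True).start()`. With several gateway replicas, the shared store (Redis in the proposal) lets one refresher serve all of them instead of each replica recomputing the summary.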
Files to Modify
| File | Change |
|---|---|
| `mcpgateway/admin.py` | Add response caching, pagination |
| `mcpgateway/cache/` | Add admin response cache |
| `templates/admin/` | Add lazy-loading, pagination UI |
| `mcpgateway/config.py` | Add `ADMIN_RESPONSE_CACHE_TTL` setting |
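
As a rough illustration of the `ADMIN_RESPONSE_CACHE_TTL` setting, the sketch below reads it from the environment using only the standard library. How `mcpgateway/config.py` actually declares settings is not shown in this issue, so the loader function, the default of 20 seconds, and the "0 disables caching" convention are all assumptions.

```python
import os
from typing import Mapping, Optional

# Assumption: a default inside the 10-30 second range proposed for the cache TTL.
DEFAULT_TTL_SECONDS = 20


def load_admin_response_cache_ttl(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the admin response cache TTL in seconds (hypothetical loader)."""
    env = os.environ if env is None else env
    raw = env.get("ADMIN_RESPONSE_CACHE_TTL")
    if raw is None or raw.strip() == "":
        return DEFAULT_TTL_SECONDS
    try:
        ttl = int(raw)
    except ValueError:
        raise ValueError(
            f"ADMIN_RESPONSE_CACHE_TTL must be an integer number of seconds, got {raw!r}"
        ) from None
    if ttl < 0:
        raise ValueError("ADMIN_RESPONSE_CACHE_TTL must not be negative")
    # In this sketch a TTL of 0 means "do not cache".
    return ttl
```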
Related Issues
- [PERFORMANCE]: Metrics aggregation queries cause full table scans under load #1906 - Metrics aggregation full table scans (fixed with cache TTL)
- [PERFORMANCE]: N+1 query pattern in EmailTeam.get_member_count() #1892 - N+1 query in team member counts (fixed)
- [PERFORMANCE]: Admin UI endpoints have high tail latency (5-10s p95) #1894 - Admin UI high tail latency
Acceptance Criteria
- `/admin/` average response time < 500ms under 1000 user load
- P95 response time < 2 seconds
- No increase in error rate
- Dashboard remains functional and usable
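
The latency criteria above can be checked mechanically against raw samples from a load test run. The sketch below is a hypothetical helper, not part of the existing test tooling. It uses the nearest-rank definition of a percentile, which may differ slightly from the interpolation a given load-testing tool reports.

```python
import math
from typing import Dict, Sequence, Union


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency_criteria(
    samples_ms: Sequence[float],
    max_avg_ms: float = 500.0,   # average response time < 500ms
    max_p95_ms: float = 2000.0,  # P95 response time < 2 seconds
) -> Dict[str, Union[float, bool]]:
    """Summarise a run and report whether each latency threshold is met."""
    avg = sum(samples_ms) / len(samples_ms)
    p95 = percentile(samples_ms, 95)
    return {
        "avg_ms": avg,
        "p95_ms": p95,
        "avg_ok": avg < max_avg_ms,
        "p95_ok": p95 < max_p95_ms,
    }
```

Checking the average and P95 separately matters: a run where 90% of requests take 100ms and 10% take 3 seconds passes the average threshold but still fails on P95.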