[PERFORMANCE]: Admin UI endpoint /admin/ has high latency under load #1907

@crivetimihai

Description

Summary

The /admin/ endpoint exhibits high latency (6+ seconds average, 12-15s P95) under extreme load, making it the primary bottleneck for overall P95 response times. This degrades the administrator experience when many users access the dashboard concurrently, although that scenario is unlikely in practice because the test simulates 3000+ users.

Problem

Under load testing with 3000 concurrent users:

Metric                   Value
Average response time    6,326ms
P95                      12,000ms
P99                      13,000ms
Response size            587KB HTML

Meanwhile, REST API endpoints (/tools, /gateways, etc.) respond in 2-5ms due to Redis caching.

Root Cause Analysis

1. Large HTML Response (587KB)

The admin dashboard renders everything in a single page:

  • All tools, resources, prompts, servers, gateways
  • Aggregated metrics for each entity type
  • Team information and user roles
  • Top performers lists

2. Heavy Template Rendering

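Jinja2 template rendering is CPU-intensive. With each of the 3 gateway replicas at 670-720% CPU while postgres sits at 116%, the load is concentrated in the gateway processes, and template rendering is likely a significant contributor.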

3. Multiple Database Queries Per Request

Each /admin/ request queries:

  • tools table
  • resources table
  • prompts table
  • servers table
  • gateways table
  • email_teams table
  • a2a_agents table
  • Metrics aggregations for each type

4. No Response Caching

Unlike the REST API endpoints, which are Redis-cached, the admin HTML response is regenerated on every request.

Evidence

Load Test Results

=== TOP ENDPOINTS BY LATENCY ===
Endpoint                                   Reqs      Avg     P95
/admin/                                   10629     6133   12000
/admin/tools                               2614     2151    4500
/rpc tools/list                           26607     1539    3300

=== FAST ENDPOINTS (Redis-cached, 2ms avg) ===
/tools, /servers, /gateways, /resources

Resource Usage During Load

gateway-1: 675% CPU, 4.6GB memory
gateway-2: 671% CPU, 4.7GB memory  
gateway-3: 716% CPU, 4.7GB memory
postgres:  116% CPU, 592MB memory

Proposed Solutions

Phase 1: Quick Wins

  1. Add response caching for /admin/

    • Cache rendered HTML in Redis with short TTL (10-30 seconds)
    • Invalidate on data changes
    • Expected improvement: 90%+ reduction in render time for cached responses
  2. Lazy-load dashboard sections

    • Load summary counts first (fast)
    • Load detailed lists via HTMX on scroll/click
    • Already partially implemented with /admin/*/partial endpoints
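The response-caching idea in item 1 could look roughly like the sketch below. It is not the gateway's actual code: `TTLResponseCache` and `get_admin_html` are hypothetical names, and a plain dictionary stands in for Redis so the example runs on its own. Because the dashboard includes team information and user roles, the sketch assumes the cache key is per user.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLResponseCache:
    """In-memory stand-in for a Redis-backed cache of rendered HTML (hypothetical)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        # key -> (expiry timestamp, rendered HTML)
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, html = entry
        if self._clock() >= expires_at:
            # Entry is past its TTL: drop it and report a miss.
            del self._store[key]
            return None
        return html

    def set(self, key: str, html: str) -> None:
        self._store[key] = (self._clock() + self.ttl_seconds, html)

    def invalidate(self, key: str) -> None:
        # Called when underlying data changes, so the next request re-renders.
        self._store.pop(key, None)


def get_admin_html(cache: TTLResponseCache, user_key: str, render: Callable[[], str]) -> str:
    """Return cached HTML when available; otherwise render once and cache it."""
    cached = cache.get(user_key)
    if cached is not None:
        return cached
    html = render()
    cache.set(user_key, html)
    return html
```

With a 10-30 second TTL, the expensive `render` callable runs at most once per key per TTL window, and `invalidate` covers the "invalidate on data changes" bullet.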

Phase 2: Pagination

  1. Paginate entity lists by default

    • Don't load all tools/resources/prompts at once
    • Default to 25-50 items per page
    • Add search/filter to find specific items
  2. Defer metrics loading

    • Load dashboard layout immediately
    • Fetch metrics via async HTMX calls
    • Show loading spinners during fetch
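As an illustration of the pagination defaults above, here is a minimal sketch. `Page` and `paginate` are hypothetical helpers, not existing gateway functions. The sketch slices an in-memory list for simplicity; a real implementation would presumably push the limit and offset down into the database query so that unused rows are never loaded.

```python
from dataclasses import dataclass
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class Page(Generic[T]):
    items: List[T]
    page: int
    per_page: int
    total_items: int
    total_pages: int


def paginate(items: Sequence[T], page: int = 1, per_page: int = 25, max_per_page: int = 50) -> Page[T]:
    """Return one page of items, clamping page and per_page to valid ranges."""
    per_page = max(1, min(per_page, max_per_page))
    total_items = len(items)
    # Ceiling division; an empty list still yields one (empty) page.
    total_pages = max(1, -(-total_items // per_page))
    page = max(1, min(page, total_pages))
    start = (page - 1) * per_page
    return Page(
        items=list(items[start:start + per_page]),
        page=page,
        per_page=per_page,
        total_items=total_items,
        total_pages=total_pages,
    )
```

Clamping out-of-range values instead of raising keeps a stale or hand-edited page number in the URL from producing an error page.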

Phase 3: Architecture

  1. Pre-compute dashboard data

    • Background job to compute dashboard summary
    • Store in Redis, refresh every 30-60 seconds
    • Dashboard reads from cache only
  2. Consider lighter admin summary endpoint

    • New /admin/summary with just counts and key metrics
    • Full data loaded on-demand per section
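The pre-computation idea in item 1 can be sketched as a background refresher that request handlers only ever read from. `DashboardSummaryCache` is a hypothetical name, the `compute` callable stands in for the real aggregation queries, and an in-process variable stands in for Redis.

```python
import threading
from typing import Callable, Dict, Optional

Summary = Dict[str, int]


class DashboardSummaryCache:
    """Holds the latest pre-computed dashboard summary (hypothetical sketch)."""

    def __init__(self, compute: Callable[[], Summary]) -> None:
        self._compute = compute
        self._lock = threading.Lock()
        self._summary: Optional[Summary] = None

    def refresh(self) -> None:
        # Run the expensive computation outside the lock so readers never wait on it.
        summary = self._compute()
        with self._lock:
            self._summary = summary

    def read(self) -> Optional[Summary]:
        # Request handlers call only this; it never touches the database.
        with self._lock:
            return dict(self._summary) if self._summary is not None else None

    def run(self, stop: threading.Event, interval_seconds: float = 30.0) -> None:
        # Background loop: refresh, then sleep until the interval elapses or stop is set.
        while not stop.is_set():
            self.refresh()
            stop.wait(interval_seconds)
```

A worker would start it with something like `threading.Thread(target=cache.run, args=(stop_event, 30.0), daemon=True).start()`. Until the first refresh completes, `read()` returns `None`, so the handler needs a fallback such as a loading placeholder.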

Files to Modify

File                    Change
mcpgateway/admin.py     Add response caching, pagination
mcpgateway/cache/       Add admin response cache
templates/admin/        Add lazy-loading, pagination UI
mcpgateway/config.py    Add ADMIN_RESPONSE_CACHE_TTL setting
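For the proposed `ADMIN_RESPONSE_CACHE_TTL` setting, one possible shape is sketched below using only the standard library. How mcpgateway/config.py actually defines settings is not shown in this issue, so the `load_admin_cache_ttl` helper, the 30-second default, the 300-second upper bound, and the convention that 0 disables caching are all assumptions.

```python
import os
from typing import Mapping


def load_admin_cache_ttl(
    env: Mapping[str, str] = os.environ,
    default: int = 30,
    maximum: int = 300,
) -> int:
    """Read ADMIN_RESPONSE_CACHE_TTL (seconds) from the environment.

    Assumed semantics: unset or blank -> default; 0 -> caching disabled;
    values are clamped to the range [0, maximum].
    """
    raw = env.get("ADMIN_RESPONSE_CACHE_TTL")
    if raw is None or raw.strip() == "":
        return default
    try:
        ttl = int(raw)
    except ValueError:
        raise ValueError(f"ADMIN_RESPONSE_CACHE_TTL must be an integer, got {raw!r}")
    return max(0, min(ttl, maximum))
```

Capping the value guards against a misconfigured TTL that would leave administrators looking at stale data for long periods.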

Related Issues

Acceptance Criteria

  • /admin/ average response time < 500ms under a 1000-user load
  • P95 response time < 2 seconds
  • No increase in error rate
  • Dashboard remains functional and usable
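The two latency criteria can be checked mechanically against the per-request timings from a load test. The sketch below is one way to do that, using a nearest-rank percentile. The `percentile` and `meets_latency_criteria` helpers are hypothetical, and the load-testing tool's own percentile method may differ slightly.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_criteria(
    samples_ms: Sequence[float],
    max_avg_ms: float = 500.0,
    max_p95_ms: float = 2000.0,
) -> bool:
    """True when both the average and the P95 are under their thresholds."""
    average = sum(samples_ms) / len(samples_ms)
    return average < max_avg_ms and percentile(samples_ms, 95) < max_p95_ms
```

Checking both numbers matters because a run can pass the average threshold while its slowest requests still breach the P95 limit.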

Labels

  • SHOULD (P2: Important but not vital; high-value items that are not crucial for the immediate release)
  • database
  • duplicate (This issue or pull request already exists)
  • performance (Performance related items)