[Preview] Mooncake performance optimization by xiaguan · Pull Request #934 · LMCache/LMCache

xiaguan · 2025-07-01T03:42:36Z

Overview

This PR introduces significant performance optimizations for Mooncake integration with LMCache, focusing on three key areas:

🚀 Performance Optimizations

1. Scheduler Query Performance Enhancement

Optimized scheduler lookup performance: Query latency reduced to a consistent 1-2 ms
Stable performance under load: Scheduler lookup latency stays consistent even when the system is busy, which is not the case on the dev branch
Improved cache hit detection: Enhanced the lookup mechanism for better responsiveness

2. Zero-Copy Implementation

Direct memory access: Implemented get_into and put_from APIs for zero-copy data transfer
Buffer registration: Added CPU buffer registration for RDMA operations
Eliminated memory copies: Direct data transfer between Mooncake store and LMCache buffers
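To make the difference between a copying read and a buffer-filling read concrete, here is a minimal sketch. `ToyStore` is a hypothetical in-memory stand-in written for illustration; it is not Mooncake's API, and the method signatures are assumptions that only borrow the `get_into` name from the list above. The toy still performs one in-process copy; what it shows is the calling convention, where the caller owns and reuses the destination buffer so no intermediate object is allocated per read.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a remote KV store (not Mooncake's API)."""

    def __init__(self) -> None:
        self._data: dict[str, bytearray] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = bytearray(value)

    def get(self, key: str) -> bytes:
        # Copying path: allocates a new bytes object on every call.
        return bytes(self._data[key])

    def get_into(self, key: str, buf: memoryview) -> int:
        # Buffer-filling path: writes into caller-owned memory and
        # returns the number of bytes written.
        src = self._data[key]
        if len(src) > len(buf):
            raise ValueError("destination buffer too small")
        buf[: len(src)] = src
        return len(src)


store = ToyStore()
store.put("chunk-0", b"kv-cache-bytes")

# Allocate the destination once and reuse it across reads.
buffer = bytearray(64)
written = store.get_into("chunk-0", memoryview(buffer))
print(written, bytes(buffer[:written]))
```

In a real RDMA setup the destination buffer would additionally have to be registered with the transport before use, which is the role of the buffer registration step listed above.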

3. Batch Interface Implementation

Batch get operations: Added batch_get and batch_get_into for parallel data retrieval
Batch put operations: Implemented batch_put and batch_put_from for efficient data storage
Maximum bandwidth utilization: Leverages Mooncake's aggregated bandwidth through batch operations
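The benefit of batching can be sketched with a simulated store. Everything below is hypothetical: `fetch_one` fakes a 10 ms round trip, and `batch_get` here is a thread-pool illustration, not the implementation in this PR or in Mooncake. Assuming one key per 256-token chunk, an 8192-token prompt such as those in the benchmark on this page would map to 32 keys.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_one(key: str) -> bytes:
    """Hypothetical single-key fetch with a simulated 10 ms round trip."""
    time.sleep(0.01)
    return key.encode()


def batch_get(keys: list[str], max_workers: int = 8) -> list[bytes]:
    """Fetch all keys concurrently; results keep the order of `keys`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))


# Assumption: one key per chunk, 8192 tokens / 256 tokens per chunk = 32 keys.
keys = [f"chunk-{i}" for i in range(8192 // 256)]

start = time.perf_counter()
sequential = [fetch_one(k) for k in keys]
t_seq = time.perf_counter() - start

start = time.perf_counter()
batched = batch_get(keys)
t_batch = time.perf_counter() - start

print(f"sequential: {t_seq * 1000:.0f} ms, batched: {t_batch * 1000:.0f} ms")
```

With per-key latency dominating, issuing the requests together hides most of the round-trip time, which is the effect a batch interface aims for.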

🔧 Technical Changes

📊 Performance Impact

LMCache config

chunk_size: 256
local_device: "cpu"
remote_url: "mooncakestore://127.0.0.1:50051/"
remote_serde: "naive"
pipelined_backend: False
local_cpu: False
max_local_cpu_size: 5

Mooncake config (Mooncake main branch installed)

{
    "local_hostname": "localhost",
    "metadata_server": "http://localhost:8080/metadata",
    "protocol": "rdma",
    "device_name": "erdma_0,erdma_1",
    "global_segment_size": 16106127360,
    "master_server_address": "localhost:50051",
    "local_buffer_size": 2147483648
}
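The byte counts in this config are easier to sanity-check in GiB. The snippet below is only unit arithmetic on the values above; it does not call Mooncake and makes no claim about how Mooncake interprets each field.

```python
import json

# Config copied verbatim from the PR description.
config_text = """
{
    "local_hostname": "localhost",
    "metadata_server": "http://localhost:8080/metadata",
    "protocol": "rdma",
    "device_name": "erdma_0,erdma_1",
    "global_segment_size": 16106127360,
    "master_server_address": "localhost:50051",
    "local_buffer_size": 2147483648
}
"""

config = json.loads(config_text)
GIB = 1024 ** 3

global_gib = config["global_segment_size"] / GIB  # 15.0
local_gib = config["local_buffer_size"] / GIB     # 2.0
devices = config["device_name"].split(",")        # two RDMA device names

print(f"global_segment_size = {global_gib} GiB")
print(f"local_buffer_size   = {local_gib} GiB")
print(f"devices             = {devices}")
```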

dev branch with Mooncake: 8192 input tokens, 128 output tokens, 50 prompts, all cache hits, request rate inf

============ Serving Benchmark Result ============
Successful requests:                     50        
Benchmark duration (s):                  6.99      
Total input tokens:                      409494    
Total generated tokens:                  6400      
Request throughput (req/s):              7.16      
Output token throughput (tok/s):         916.14    
Total Token throughput (tok/s):          59534.17  
---------------Time to First Token----------------
Mean TTFT (ms):                          2418.55   
Median TTFT (ms):                        3253.96   
P99 TTFT (ms):                           3267.32   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.73     
Median TPOT (ms):                        29.25     
P99 TPOT (ms):                           51.84     
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.73     
Median ITL (ms):                         29.13     
P99 ITL (ms):                            40.21     
==================================================

This branch, same benchmark

============ Serving Benchmark Result ============
Successful requests:                     50        
Benchmark duration (s):                  5.12      
Total input tokens:                      409494    
Total generated tokens:                  6400      
Request throughput (req/s):              9.77      
Output token throughput (tok/s):         1250.74   
Total Token throughput (tok/s):          81277.18  
---------------Time to First Token----------------
Mean TTFT (ms):                          814.67    
Median TTFT (ms):                        810.21    
P99 TTFT (ms):                           1443.45   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.95     
Median TPOT (ms):                        33.10     
P99 TPOT (ms):                           36.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.95     
Median ITL (ms):                         29.49     
P99 ITL (ms):                            115.38    
==================================================
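The two result tables can be compared directly. The sketch below copies a subset of the reported numbers and computes an improvement factor per metric; the dictionary keys and the `improvement` helper are names introduced here for illustration.

```python
# Values copied from the two benchmark tables above.
dev_branch = {
    "duration_s": 6.99,
    "req_per_s": 7.16,
    "total_tok_per_s": 59534.17,
    "mean_ttft_ms": 2418.55,
    "median_ttft_ms": 3253.96,
    "p99_ttft_ms": 3267.32,
    "p99_itl_ms": 40.21,
}
this_branch = {
    "duration_s": 5.12,
    "req_per_s": 9.77,
    "total_tok_per_s": 81277.18,
    "mean_ttft_ms": 814.67,
    "median_ttft_ms": 810.21,
    "p99_ttft_ms": 1443.45,
    "p99_itl_ms": 115.38,
}

# Metrics where a higher value is better; for all others lower is better.
HIGHER_IS_BETTER = {"req_per_s", "total_tok_per_s"}


def improvement(metric: str) -> float:
    """Improvement factor of this branch over dev (>1 means better)."""
    old, new = dev_branch[metric], this_branch[metric]
    return new / old if metric in HIGHER_IS_BETTER else old / new


for metric in dev_branch:
    print(f"{metric:>16}: {improvement(metric):.2f}x")
```

On these numbers, mean TTFT improves by roughly 3x and request throughput by roughly 1.4x, while the P99 ITL factor comes out below 1x because that metric rose from 40.21 ms to 115.38 ms.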

Note

This PR is a preview and will NOT be merged directly. We will collaborate with the LMCache team to gradually decompose and integrate these changes into the repository. This preview is provided for those who want to test Mooncake performance optimizations early.

Early adopters: please use this branch with caution and share feedback on the performance you observe.

xiaguan · 2025-08-20T05:12:40Z

Duplicate of #1269; closing.

xiaguan added 4 commits June 24, 2025 15:04

remove layerwise

8ab2fc6

add batch get

b8d051d

add gpu batch

49cd007

add batch put

fb781b2

xiaguan marked this pull request as draft July 1, 2025 03:46

xiaguan mentioned this pull request Jul 1, 2025

[Store] Python store API supports batch operation kvcache-ai/Mooncake#511

Closed

xiaguan mentioned this pull request Jul 11, 2025

[Performance] use redis and two vllm instance, second request is slower than first #914

Closed

xiaguan closed this Aug 20, 2025

xiaguan wants to merge 4 commits into LMCache:dev from xiaguan:mooncake_adapter
