[Preview] Mooncake performance optimization #934

Closed
xiaguan wants to merge 4 commits into LMCache:dev from xiaguan:mooncake_adapter

xiaguan commented Jul 1, 2025

Overview

This PR introduces significant performance optimizations for Mooncake integration with LMCache, focusing on three key areas:

🚀 Performance Optimizations

1. Scheduler Query Performance Enhancement

  • Optimized scheduler lookup performance: Reduced query latency to a consistent 1-2 ms
  • Stable performance under load: Unlike the dev branch, scheduler lookup latency stays consistent even when the system is busy
  • Improved cache hit detection: Enhanced the lookup mechanism for better responsiveness

2. Zero-Copy Implementation

  • Direct memory access: Implemented get_into and put_from APIs for zero-copy data transfer
  • Buffer registration: Added CPU buffer registration for RDMA operations
  • Eliminated memory copies: Direct data transfer between Mooncake store and LMCache buffers
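
The calling convention behind the zero-copy bullets can be sketched as follows. This is a minimal illustration, not Mooncake's actual API: `ToyStore` is a hypothetical in-memory stand-in, and its method signatures are assumptions made for the example. Only the names get_into and put_from come from this PR. The toy still copies bytes internally; the point is that the caller hands over a preallocated buffer instead of receiving a fresh object that must be copied again.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a key-value store (not Mooncake's API)."""

    def __init__(self):
        self._data = {}

    def put_from(self, key, buf, size):
        # Store the first `size` bytes of the caller's buffer.
        self._data[key] = bytes(memoryview(buf)[:size])

    def get(self, key):
        # Copying path: the caller receives an object and must copy it
        # into its own buffer afterwards.
        return self._data[key]

    def get_into(self, key, buf):
        # Zero-copy-style path: write straight into the caller's buffer.
        value = self._data[key]
        if len(value) > len(buf):
            raise ValueError("destination buffer too small")
        memoryview(buf)[:len(value)] = value
        return len(value)


store = ToyStore()
src = bytearray(b"kv-cache-chunk")
store.put_from("chunk-0", src, len(src))

dst = bytearray(64)  # allocated once and reused across requests
n = store.get_into("chunk-0", dst)
```

In a real RDMA setup the destination buffer would presumably be registered with the transport ahead of time, which is what the buffer registration bullet refers to.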

3. Batch Interface Implementation

  • Batch get operations: Added batch_get and batch_get_into for parallel data retrieval
  • Batch put operations: Implemented batch_put and batch_put_from for efficient data storage
  • Maximum bandwidth utilization: Leverages Mooncake's aggregated bandwidth through batch operations
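
The idea of a batch read can be sketched as below. This is an illustration under stated assumptions: the store is a plain dict, the helper signatures are hypothetical, and a thread pool stands in for whatever mechanism the real batch calls use to issue transfers together. Only the name batch_get_into comes from this PR.

```python
from concurrent.futures import ThreadPoolExecutor


def get_into(store, key, buf):
    # Copy one value into the caller's preallocated buffer; return its size.
    value = store[key]
    memoryview(buf)[:len(value)] = value
    return len(value)


def batch_get_into(store, keys, bufs, max_workers=4):
    # Issue all reads concurrently; pool.map keeps results in key order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda kb: get_into(store, kb[0], kb[1]),
                             zip(keys, bufs)))


# Toy data: chunk-i holds (i + 1) bytes, each with value i.
store = {f"chunk-{i}": bytes([i]) * (i + 1) for i in range(4)}
keys = list(store)
bufs = [bytearray(8) for _ in keys]
sizes = batch_get_into(store, keys, bufs)
```

Submitting many chunk transfers at once is what lets a backend overlap them across links or devices, instead of paying one round trip per chunk.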

🔧 Technical Changes

📊 Performance Impact

LMCache config

chunk_size: 256
local_device: "cpu"
remote_url: "mooncakestore://127.0.0.1:50051/"
remote_serde: "naive"
pipelined_backend: False
local_cpu: False
max_local_cpu_size: 5

Mooncake config (Mooncake main branch installed)

{
    "local_hostname": "localhost",
    "metadata_server": "http://localhost:8080/metadata",
    "protocol": "rdma",
    "device_name": "erdma_0,erdma_1",
    "global_segment_size": 16106127360,
    "master_server_address": "localhost:50051",
    "local_buffer_size": 2147483648
}
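
For readers checking the byte counts in the Mooncake config above, a small sketch that parses the same JSON and converts the two size fields to GiB. The variable names are illustrative, and GiB here means 1024^3 bytes.

```python
import json

# The Mooncake config from this PR, reproduced verbatim.
config_text = """
{
    "local_hostname": "localhost",
    "metadata_server": "http://localhost:8080/metadata",
    "protocol": "rdma",
    "device_name": "erdma_0,erdma_1",
    "global_segment_size": 16106127360,
    "master_server_address": "localhost:50051",
    "local_buffer_size": 2147483648
}
"""

config = json.loads(config_text)
GIB = 1024 ** 3

# Convert the byte-valued fields to GiB for readability.
sizes_gib = {
    name: config[name] / GIB
    for name in ("global_segment_size", "local_buffer_size")
}
# device_name is a comma-separated list of RDMA devices.
devices = config["device_name"].split(",")
```

The values work out to a 15 GiB global segment and a 2 GiB local buffer, spread over two RDMA devices.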

dev branch with Mooncake: 8192 input tokens, 128 output tokens, 50 prompts, all cache hits, request rate inf

============ Serving Benchmark Result ============
Successful requests:                     50        
Benchmark duration (s):                  6.99      
Total input tokens:                      409494    
Total generated tokens:                  6400      
Request throughput (req/s):              7.16      
Output token throughput (tok/s):         916.14    
Total Token throughput (tok/s):          59534.17  
---------------Time to First Token----------------
Mean TTFT (ms):                          2418.55   
Median TTFT (ms):                        3253.96   
P99 TTFT (ms):                           3267.32   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.73     
Median TPOT (ms):                        29.25     
P99 TPOT (ms):                           51.84     
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.73     
Median ITL (ms):                         29.13     
P99 ITL (ms):                            40.21     
==================================================

This branch, same benchmark

============ Serving Benchmark Result ============
Successful requests:                     50        
Benchmark duration (s):                  5.12      
Total input tokens:                      409494    
Total generated tokens:                  6400      
Request throughput (req/s):              9.77      
Output token throughput (tok/s):         1250.74   
Total Token throughput (tok/s):          81277.18  
---------------Time to First Token----------------
Mean TTFT (ms):                          814.67    
Median TTFT (ms):                        810.21    
P99 TTFT (ms):                           1443.45   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.95     
Median TPOT (ms):                        33.10     
P99 TPOT (ms):                           36.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.95     
Median ITL (ms):                         29.49     
P99 ITL (ms):                            115.38    
==================================================
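
To make the comparison between the two runs explicit, the sketch below computes the relative change for a few of the reported metrics. The numbers are copied from the two result tables above; the dictionary keys and the helper name are illustrative.

```python
# Selected metrics from the dev-branch run and from this branch's run.
dev = {"req_per_s": 7.16, "mean_ttft_ms": 2418.55,
       "p99_ttft_ms": 3267.32, "p99_itl_ms": 40.21}
new = {"req_per_s": 9.77, "mean_ttft_ms": 814.67,
       "p99_ttft_ms": 1443.45, "p99_itl_ms": 115.38}


def pct_change(before, after):
    # Relative change in percent; negative means the value went down.
    return (after - before) / before * 100.0


changes = {name: round(pct_change(dev[name], new[name]), 1) for name in dev}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

On these figures, request throughput rises by about 36%, mean TTFT drops by about 66%, and P99 TTFT drops by about 56%. P99 ITL moves the other way, from 40.21 ms to 115.38 ms, which is worth watching when testing this branch.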

Note

This PR is a preview and will NOT be merged directly. We will collaborate with the LMCache team to gradually decompose and integrate these changes into the repository. This preview is provided for those who want to test Mooncake performance optimizations early.


For early adopters wanting to test these optimizations, please use this branch with caution and provide feedback on performance improvements.

xiaguan commented Aug 20, 2025

Duplicate of #1269; closing.

xiaguan closed this Aug 20, 2025