Label
Please label your issue with "bug" and any other relevant labels so that it can be easily categorized under LMCache Onboarding.
Describe the bug
When using the Redis remote connector, I noticed a local-memory leak: TTFT P50 degrades from ~0.4s to ~3s after several hours of testing with https://github.com/LMCache/LMBenchmark/blob/main/real-multi-round-qa/multi-round-qa.py#L104. The number of hours it takes to degrade appears to scale linearly with the max_local_cpu_size config. The issue should not be specific to the Redis connector, since other connectors (such as the SageMaker HyperPod connector) interact with the local CPU backend in a similar way; I use the Redis connector as the example here because it has been around for a long time.
In the logs, I'm seeing lots of `No eviction candidates found in local cpu backend.` After enabling the additional logs from PR #1972, I'm seeing multiple cases of local CPU memory leaking:
- All items are pinned and never unpinned: `Local CPU backend state: total_items=18, pinned_count=18, ref_count_distribution={1: 18}`. I added some pin/unpin logs locally and noticed the pinning comes from vLLM. For some reason, wait_for_save(), which handles the lookup_pin() bookkeeping, is not called after lookup(), so the pins are never released (see the first sketch after this list).
- No items are in the hot cache: `Local CPU backend state: total_items=0, pinned_count=0, ref_count_distribution={}`. I suspect this is because items fetched via batched_get() are not registered in the hot cache (lmcache/v1/storage_backend/storage_manager.py, line 450 at 8353b67), whereas get() does register them (line 424) via `local_cpu_backend.submit_put_task(key, memory_obj)` (see the second sketch below).
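To make the two suspected leaks concrete, here is a minimal sketch of each. These are my own simplifications for illustration: the method names (lookup(), lookup_pin(), wait_for_save(), get(), batched_get(), submit_put_task()) come from the report and the code at 8353b67, but the surrounding classes and control flow are assumptions, not actual vLLM/LMCache source.

```python
# Sketch 1: the suspected pin leak. HotCacheEntry and the call sequence
# below are hypothetical, illustrating why pinned entries block eviction.
class HotCacheEntry:
    def __init__(self):
        self.ref_count = 0

    def pin(self):
        self.ref_count += 1

    def unpin(self):
        self.ref_count -= 1

    def is_evictable(self):
        # An entry with a nonzero ref_count can never be an eviction candidate.
        return self.ref_count == 0


entry = HotCacheEntry()

# lookup() pins the entry so it survives until the request consumes it.
entry.pin()

# Expected: wait_for_save() later runs the matching unpin bookkeeping.
# Observed: wait_for_save() is never called after lookup(), so every entry
# stays at ref_count == 1 forever, matching the logged state
#   total_items=18, pinned_count=18, ref_count_distribution={1: 18}
# and producing "No eviction candidates found in local cpu backend."
assert not entry.is_evictable()
```

And a sketch of the get()/batched_get() asymmetry. Only the `submit_put_task(key, memory_obj)` call is quoted from storage_manager.py; the rest of the class is approximated:

```python
# Sketch 2: get() registers remote fetches in the local CPU hot cache,
# while batched_get() apparently does not.
class StorageManager:
    """Approximation of lmcache/v1/storage_backend/storage_manager.py
    at 8353b67; only the paths relevant to the suspected bug are shown."""

    def __init__(self, remote_backend, local_cpu_backend):
        self.remote_backend = remote_backend
        self.local_cpu_backend = local_cpu_backend

    def get(self, key):
        memory_obj = self.remote_backend.get(key)
        if memory_obj is not None:
            # Around line 424: the fetched object is registered in the
            # local CPU hot cache, so it can later be reused and evicted.
            self.local_cpu_backend.submit_put_task(key, memory_obj)
        return memory_obj

    def batched_get(self, keys):
        memory_objs = self.remote_backend.batched_get(keys)
        # Suspected gap around line 450: no matching submit_put_task()
        # on this path, so batched fetches never enter the hot cache,
        # leaving total_items=0 even under sustained load.
        return memory_objs
```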
To Reproduce
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: lmcache-config
  namespace: lmcache
data:
  lmcache.yaml: |
    local_cpu: true  # Set to 'true' in production to enable both LMCache native CPU offload AND distributed caching
    chunk_size: 6082
    max_local_cpu_size: 5
    remote_url: "redis://redis.lmcache.svc.cluster.local:6379"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  namespace: lmcache
spec:
  replicas: 40
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      namespace: lmcache
      labels:
        app: test
    spec:
      containers:
        - name: vllm
          image: lmcache/vllm-openai:v0.3.9post2
          command:
            - /opt/venv/bin/vllm
            - serve
            - meta-llama/Llama-3.1-8B-Instruct
            - --host
            - 0.0.0.0
            - --port
            - "8000"
            - --enable-prefix-caching
            - --max-model-len
            - "70000"
            - --tensor-parallel-size
            - "4"
            - --kv-transfer-config
            - '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
          resources:
            limits:
              nvidia.com/gpu: "4"
            requests:
              nvidia.com/gpu: "4"
          startupProbe:
            failureThreshold: 60
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: hf_token_llama
                  name: vllm-secrets
            - name: LMCACHE_CONFIG_FILE
              value: /etc/lmcache/lmcache.yaml
            - name: PYTHONHASHSEED
              value: "0"
            - name: PROMETHEUS_MULTIPROC_DIR
              value: "/tmp"
            - name: LMCACHE_LOG_LEVEL
              value: "DEBUG"
          volumeMounts:
            - name: lmcache-config
              mountPath: /etc/lmcache
      volumes:
        - name: lmcache-config
          configMap:
            name: lmcache-config
```
Expected behavior
TTFT should remain consistent, and items should flow through the local-memory hot cache and be evicted as needed.
Desktop (please complete the following information):
EKS GPU instances