Label
Please label your issue with "bug" and any other relevant labels so that it can be easily categorized under LMCache Onboarding.
Describe the bug
When using the Redis remote connector, I noticed a local-memory leak: TTFT P50 degrades from ~0.4s to ~3s after several hours of testing with https://github.com/LMCache/LMBenchmark/blob/main/real-multi-round-qa/multi-round-qa.py#L104. The number of hours it takes to degrade appears to scale linearly with the max_local_cpu_size config. The issue should not be specific to the Redis connector, since other connectors (such as the SageMaker HyperPod connector) interact with the local CPU backend in a similar way; I use the Redis connector as the example here because it has been around for a long time.
In the logs, I'm seeing lots of `No eviction candidates found in local cpu backend.` After enabling the additional logs from PR #1972, I'm seeing multiple cases of local CPU memory leaking:
- All items are pinned and never unpinned: `Local CPU backend state: total_items=18, pinned_count=18, ref_count_distribution={1: 18}`. I added some pin/unpin logs locally and noticed the pinning comes from vLLM. For some reason, wait_for_save(), which handles the lookup_pin() bookkeeping, is not called after lookup(), so the pins are never released (see the first sketch after this list).
- No items are in the hot cache: `Local CPU backend state: total_items=0, pinned_count=0, ref_count_distribution={}`. I suspect this is because items fetched via batched_get() are not registered in the hot cache (lmcache/v1/storage_backend/storage_manager.py, line 450 at 8353b67), whereas get() does register them (line 424) via `local_cpu_backend.submit_put_task(key, memory_obj)` (see the second sketch below).
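To make the two suspected leaks concrete, here is a minimal sketch of each. These are my own simplifications for illustration: the method names (lookup(), lookup_pin(), wait_for_save(), get(), batched_get(), submit_put_task()) come from the report and the code at 8353b67, but the surrounding classes and control flow are assumptions, not actual vLLM/LMCache source.

```python
# Sketch 1: the suspected pin leak. HotCacheEntry and the call sequence
# below are hypothetical, illustrating why pinned entries block eviction.
class HotCacheEntry:
    def __init__(self):
        self.ref_count = 0

    def pin(self):
        self.ref_count += 1

    def unpin(self):
        self.ref_count -= 1

    def is_evictable(self):
        # An entry with a nonzero ref_count can never be an eviction candidate.
        return self.ref_count == 0


entry = HotCacheEntry()

# lookup() pins the entry so it survives until the request consumes it.
entry.pin()

# Expected: wait_for_save() later runs the matching unpin bookkeeping.
# Observed: wait_for_save() is never called after lookup(), so every entry
# stays at ref_count == 1 forever, matching the logged state
#   total_items=18, pinned_count=18, ref_count_distribution={1: 18}
# and producing "No eviction candidates found in local cpu backend."
assert not entry.is_evictable()
```

And a sketch of the get()/batched_get() asymmetry. Only the `submit_put_task(key, memory_obj)` call is quoted from storage_manager.py; the rest of the class is approximated:

```python
# Sketch 2: get() registers remote fetches in the local CPU hot cache,
# while batched_get() apparently does not.
class StorageManager:
    """Approximation of lmcache/v1/storage_backend/storage_manager.py
    at 8353b67; only the paths relevant to the suspected bug are shown."""

    def __init__(self, remote_backend, local_cpu_backend):
        self.remote_backend = remote_backend
        self.local_cpu_backend = local_cpu_backend

    def get(self, key):
        memory_obj = self.remote_backend.get(key)
        if memory_obj is not None:
            # Around line 424: the fetched object is registered in the
            # local CPU hot cache, so it can later be reused and evicted.
            self.local_cpu_backend.submit_put_task(key, memory_obj)
        return memory_obj

    def batched_get(self, keys):
        memory_objs = self.remote_backend.batched_get(keys)
        # Suspected gap around line 450: no matching submit_put_task()
        # on this path, so batched fetches never enter the hot cache,
        # leaving total_items=0 even under sustained load.
        return memory_objs
```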
To Reproduce
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: lmcache-config
  namespace: lmcache
data:
  lmcache.yaml: |
    local_cpu: true  # Set to 'true' in production to enable both LMCache native CPU offload AND distributed caching
    chunk_size: 6082
    max_local_cpu_size: 5
    remote_url: "redis://redis.lmcache.svc.cluster.local:6379"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  namespace: lmcache
spec:
  replicas: 40
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      namespace: lmcache
      labels:
        app: test
    spec:
      containers:
        - name: vllm
          image: lmcache/vllm-openai:v0.3.9post2
          command:
            - /opt/venv/bin/vllm
            - serve
            - meta-llama/Llama-3.1-8B-Instruct
            - --host
            - 0.0.0.0
            - --port
            - "8000"
            - --enable-prefix-caching
            - --max-model-len
            - "70000"
            - --tensor-parallel-size
            - "4"
            - --kv-transfer-config
            - '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
          resources:
            limits:
              nvidia.com/gpu: "4"
            requests:
              nvidia.com/gpu: "4"
          startupProbe:
            failureThreshold: 60
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: hf_token_llama
                  name: vllm-secrets
            - name: LMCACHE_CONFIG_FILE
              value: /etc/lmcache/lmcache.yaml
            - name: PYTHONHASHSEED
              value: "0"
            - name: PROMETHEUS_MULTIPROC_DIR
              value: "/tmp"
            - name: LMCACHE_LOG_LEVEL
              value: "DEBUG"
          volumeMounts:
            - name: lmcache-config
              mountPath: /etc/lmcache
      volumes:
        - name: lmcache-config
          configMap:
            name: lmcache-config
```
Expected behavior
TTFT should remain consistent, and items should flow through the local-memory hot cache and be evicted as needed.
Desktop (please complete the following information):
EKS GPU instances