fix: auto-generate lmcache_instance_id when value is None #2732

DongDongJu merged 12 commits into LMCache:dev
Conversation
Code Review
This pull request addresses a bug where the `lmcache_instance_id` was not being auto-generated because the check for its existence was incorrect. The fix replaces `hasattr` with `getattr` to correctly check whether the instance ID has a value or is `None`. This change is correct and effectively resolves the issue, ensuring that a unique instance ID is generated when one is not provided.
Force-pushed 6b48980 to 4fe9741
```diff
 except Exception as e:
     self._init_failed = True
-    self._init_failed_reason = str(e)
+    self._init_failed_reason = traceback.format_exc()
```
One minor nit: traceback.format_exc() relies on implicit sys.exc_info() state. Since you already have e in scope, "".join(traceback.format_exception(e)) would be slightly more explicit and resilient to future refactors where the exception context might shift.
Got it, raised another revision leveraging format_exception instead
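For reference, a minimal self-contained sketch of the suggested pattern (the handler shape follows the diff above; `risky_init` and the variable names are stand-ins, not the actual LMCache code):

```python
import traceback

def risky_init():
    # Hypothetical initialization step that may raise.
    raise RuntimeError("backend unavailable")

try:
    risky_init()
except Exception as e:
    init_failed = True
    # traceback.format_exception(e) renders the traceback from the exception
    # object itself, so it does not rely on the implicit sys.exc_info() state
    # that traceback.format_exc() reads. (Passing only the exception object
    # requires Python 3.10+.)
    init_failed_reason = "".join(traceback.format_exception(e))
    print(init_failed_reason)
```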
Signed-off-by: Can Sun <sucan@amazon.com>
Force-pushed 4fe9741 to 5cfbf25
Signed-off-by: Can Sun <sucan@amazon.com>
Force-pushed a413268 to bdde32a
```diff
 if not lmcache_worker_ids:
     # start lmcache worker on all ranks
-    assert len(config.lmcache_worker_ports) == metadata.world_size
+    assert len(config.lmcache_worker_ports) >= metadata.world_size
```
Why do we change to >= here?
From our testing, this check is too strict. For simplicity we make 8 ports available, but TP may only be 4. The strict assertion fails the check in that case, even though the redundant ports can simply be ignored. All we need to guarantee is that the number of available ports is at least the number needed.
Same question. I don't quite follow: why is it too strict? Could you help me understand it better?
Here is the summary provided by AI:

The change from `==` to `>=` in both assertions is justified because:

- The strict `==` is unnecessarily restrictive. The code only needs `config.lmcache_worker_ports[self.worker_id]` (first branch) or `config.lmcache_worker_ports[index]` (second branch); it indexes into the list by position. As long as there are enough ports to cover all workers, the access is safe. Extra ports in the list are simply unused and harmless.
- It enables config reuse across different deployment topologies. A user may define a single `lmcache_worker_ports` list (e.g., `[8500, 8501, 8502, 8503]`) and reuse the same config file across deployments with varying `world_size` (e.g., TP=2 vs TP=4). With the strict `==`, they'd have to maintain separate configs for each topology even though the smaller deployment would just use a prefix of the port list.
- It's consistent with how `_get_lmcache_worker_ids` works. That function already returns a subset of worker IDs (e.g., `[0]` for MLA mode regardless of `world_size`). The number of active workers can be less than the total ports configured. The `>=` check correctly reflects this "at least enough" semantic.
- The correctness guarantee is preserved. The actual port selection logic (`config.lmcache_worker_ports[self.worker_id]` and `config.lmcache_worker_ports[index]`) is unchanged. The `>=` still guarantees no out-of-bounds access: `self.worker_id < world_size <= len(ports)` in the first branch, and `index < len(worker_ids) <= len(ports)` in the second branch.
- The second assertion (`len(ports) >= len(worker_ids)`) is more correct than the original. The original `len(worker_ids) == len(ports)` would fail if a user provides extra ports as a superset. Since `index = lmcache_worker_ids.index(self.worker_id)` is bounded by `len(lmcache_worker_ids)`, `len(ports) >= len(worker_ids)` is the precise safety condition.

In short: `>=` is the minimum necessary precondition for safe indexing, while `==` was an overly strict invariant that rejected valid configurations.
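To make the indexing argument concrete, a small illustrative sketch (names mirror the snippets quoted in this thread; this is not the actual LMCache source):

```python
# Illustrative only: port list and world size follow the examples in this
# thread, not real LMCache configuration code.
lmcache_worker_ports = [8001, 8002, 8003, 8004, 8005, 8006, 8007, 8008]
world_size = 4  # e.g. TP=4 on a g5.24xlarge

# Relaxed precondition: at least one port per rank; extras are ignored.
assert len(lmcache_worker_ports) >= world_size

# First branch: one worker per rank, indexed by worker_id.
for worker_id in range(world_size):
    # Safe because worker_id < world_size <= len(lmcache_worker_ports).
    port = lmcache_worker_ports[worker_id]

# Second branch: only a subset of ranks runs a worker (e.g. [0] in MLA mode).
lmcache_worker_ids = [0]
assert len(lmcache_worker_ports) >= len(lmcache_worker_ids)
for worker_id in lmcache_worker_ids:
    index = lmcache_worker_ids.index(worker_id)
    # Safe because index < len(lmcache_worker_ids) <= len(lmcache_worker_ports).
    port = lmcache_worker_ports[index]
```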
I still don't get it. Two questions:

- Is it related to this PR?
- In what case will `==` fail? (E.g., running LMCache with model X using parallel strategy Y.)
Thanks for the explanation!
1. Yes, it is related. We encountered two main issues when using the LMCache controller, both caused by assertion failures during cache engine initialization. One is that the instance ID is None when not provided via an env var. The second is this issue, caused by the too-strict assertion. We always provide 8 ports for workers (a p5 instance has 8 GPUs), but this currently causes initialization failures.
2. The condition fails when you create a model deployment whose worker count differs from the tensor-parallel size. In our case, we always provide 8 ports, but this fails the assertion on g5.24xlarge because it only has 4 GPUs. Besides, MLA (multi-head latent attention) models always use 1 LMCache worker regardless of TP size, so it is quite hard for us to determine the right port count for each customer's model deployment definition.

I think there is no harm in relaxing this restriction. We just need to guarantee there are enough ports to allocate to each worker. This also aligns with the P2P ports, which do not fail when the number of p2p init/lookup ports is greater than the TP rank count:
```python
# p2p_backend.py
self.peer_init_port = config.p2p_init_ports[self.tp_rank]
self.peer_lookup_port = config.p2p_lookup_ports[self.tp_rank]
```
Verified locally that the change works without assertion errors and vLLM responds successfully. Note that I made 8 ports available for use: 8001,8002,8003,8004,8005,8006,8007,8008.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "17"
  creationTimestamp: "2026-03-11T20:18:01Z"
  generation: 26
  labels:
    app: tinyllama-kvcache
    deploying-service: hyperpod-inference
  name: tinyllama-kvcache
  namespace: default
  resourceVersion: "17420107"
  uid: 2fb9b9c6-32fa-407c-92c9-a1e32c346089
spec:
  progressDeadlineSeconds: 3600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: tinyllama-kvcache
      deploying-service: hyperpod-inference
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2026-03-22T23:30:42-07:00"
      creationTimestamp: null
      labels:
        app: tinyllama-kvcache
        deploying-service: hyperpod-inference
        namespace: default
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - ml.g5.24xlarge
              - key: topology.kubernetes.io/region
                operator: In
                values:
                - us-west-2
            weight: 100
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - ml.g5.24xlarge
              - key: topology.kubernetes.io/region
                operator: In
                values:
                - us-west-2
      containers:
      - args:
        - /opt/ml/model
        - --max-model-len
        - "8192"
        - --tensor-parallel-size
        - "4"
        - --kv-transfer-config
        - '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
        env:
        - name: OPTION_ROLLING_BATCH
          value: vllm
        - name: SAGEMAKER_SUBMIT_DIRECTORY
          value: /opt/ml/model/code
        - name: MODEL_CACHE_ROOT
          value: /opt/ml/model
        - name: SAGEMAKER_MODEL_SERVER_WORKERS
          value: "1"
        - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
          value: "3600"
        - name: LMCACHE_CONTROLLER_PULL_URL
          value: tinyllama-kvcache-default-routing-service.hyperpod-inference-system.svc.cluster.local:9000
        - name: LMCACHE_CONTROLLER_URL
          value: tinyllama-kvcache-default-routing-service.hyperpod-inference-system.svc.cluster.local:9000
        - name: LMCACHE_DISTRIBUTED_URL
          value: localhost:8009
        - name: LMCACHE_ENABLE_CONTROLLER
          value: "true"
        - name: LMCACHE_LMCACHE_WORKER_PORT
          value: "8001"
        - name: LMCACHE_LMCACHE_WORKER_PORTS
          value: 8001,8002,8003,8004,8005,8006,8007,8008
        - name: LMCACHE_LOCAL_CPU
          value: "true"
        - name: LMCACHE_MAX_LOCAL_CPU_SIZE
          value: "32"
        - name: LMCACHE_P2P_INIT_PORTS
          value: 9001,9002,9003,9004,9005,9006,9007,9008
        - name: LMCACHE_P2P_LOOKUP_PORTS
          value: 9011,9012,9013,9014,9015,9016,9017,9018
        - name: LMCACHE_USE_EXPERIMENTAL
          value: "true"
        - name: PROMETHEUS_MULTIPROC_DIR
          value: /tmp
        - name: PYTHONHASHSEED
        - name: LMCACHE_CONTROLLER_REPLY_URL
          value: tinyllama-kvcache-default-routing-service.hyperpod-inference-system.svc.cluster.local:9001
        - name: LMCACHE_LMCACHE_WORKER_HEARTBEAT_TIME
          value: "10"
        image: 006280839251.dkr.ecr.us-west-2.amazonaws.com/lmcache/vllm-openai:dev
        imagePullPolicy: Always
        name: tinyllama-kvcache
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "4"
          requests:
            cpu: "6"
            memory: 30Gi
            nvidia.com/gpu: "4"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /opt/ml/model
          name: model-weights
          subPath: models/Llama-3.1-8B-Instruct
        - mountPath: /dev/shm
          name: shm
      - env:
        - name: MODEL_HOST
          value: localhost
        - name: MODEL_PORT
          value: "8000"
        - name: RESOURCE_NAME
          value: tinyllama-kvcache
        - name: NAMESPACE
          value: default
        image: 861276088053.dkr.ecr.us-west-2.amazonaws.com/nginx-reverse-proxy:v1.0.1
        imagePullPolicy: IfNotPresent
        name: sidecar-reverse-proxy
        ports:
        - containerPort: 8081
          name: http-proxy
          protocol: TCP
        - containerPort: 9113
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/local/openresty/nginx/conf/nginx.conf
          name: sidecar-config
          subPath: nginx.conf
        - mountPath: /etc/nginx/lua/metrics.lua
          name: metrics-lua
          subPath: metrics.lua
      - args:
        - --config=/etc/otel/config.yaml
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/hyperpod/otel_collector:v1751654772563
        imagePullPolicy: IfNotPresent
        name: otel-collector
        ports:
        - containerPort: 4317
          name: otlp-grpc
          protocol: TCP
        - containerPort: 4318
          name: otlp-http
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 50m
            memory: 64Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/otel/config.yaml
          name: otel-collector-config
          subPath: config.yaml
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: tinyllama-kvcache
            deploying-service: hyperpod-inference
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      volumes:
      - emptyDir:
          medium: Memory
        name: shm
      - name: model-weights
        persistentVolumeClaim:
          claimName: inference-s3-pvc-tinyllama-kvcache-inf
      - configMap:
          defaultMode: 420
          name: iec-tinyllama-kvcache-nginx
        name: sidecar-config
      - configMap:
          defaultMode: 420
          name: iec-tinyllama-kvcache-lua
        name: metrics-lua
      - configMap:
          defaultMode: 420
          name: iec-tinyllama-kvcache-otel
        name: otel-collector-config
status:
  availableReplicas: 3
  conditions:
  - lastTransitionTime: "2026-03-23T04:22:56Z"
    lastUpdateTime: "2026-03-23T04:22:56Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2026-03-23T03:51:04Z"
    lastUpdateTime: "2026-03-23T17:57:59Z"
    message: ReplicaSet "tinyllama-kvcache-68db6c44df" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 26
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3
```
So this PR includes two fixes, not just the one for "fix: auto-generate lmcache_instance_id when value is None". Is this correct?

Yes, sorry about the confusion in the commit message. I introduced this change after I fixed the instance_id issue. I can update the commit message.
Signed-off-by: Can Sun <sucan@amazon.com>
Force-pushed 6a1f046 to 11a95b7
Signed-off-by: cansun <80425164+can-sun@users.noreply.github.com>
DongDongJu left a comment:

LGTM as well. Please follow up in the next PR with what you mentioned.

Thank you!
* fix: auto-generate lmcache_instance_id when value is None
* fix: include full traceback in LMCacheManager init failure logs
* fix: use traceback.format_exception(e) instead of format_exc()
* Relax worker port count assertion
* Revert worker port count assertion to strict equality

---------

Signed-off-by: Can Sun <sucan@amazon.com>
Signed-off-by: cansun <80425164+can-sun@users.noreply.github.com>
Problem

`_post_init` in `config_base.py` uses `hasattr()` to check if `lmcache_instance_id` needs auto-generation. Since the attribute always exists with default `None` from the config schema, the check always passes and auto-generation is skipped.

This causes `LMCacheWorker` initialization to fail with a bare `AssertionError` when `enable_controller=True` and no explicit `LMCACHE_LMCACHE_INSTANCE_ID` env var is set.

Fix

Use `getattr()` with a `None` fallback instead of `hasattr()` to properly detect unset values.
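A minimal sketch of the described change (the class shape and ID format are hypothetical placeholders; only the hasattr-to-getattr pattern is from this PR):

```python
import uuid

class LMCacheConfig:
    # Schema default: the attribute always exists, so hasattr() is always True.
    lmcache_instance_id = None

    def _post_init(self):
        # Before the fix (per the description), a check along the lines of
        # `if not hasattr(self, "lmcache_instance_id")` never triggered
        # auto-generation. getattr() with a None fallback detects the
        # unset value instead.
        if getattr(self, "lmcache_instance_id", None) is None:
            # Hypothetical ID format, for illustration only.
            self.lmcache_instance_id = f"lmcache-{uuid.uuid4().hex[:8]}"

cfg = LMCacheConfig()
cfg._post_init()
assert cfg.lmcache_instance_id is not None
```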