LMCache Blog


About us

  • LMCache Home
  • LMCache on GitHub

Categories

  • ascend (1)
  • Benchmark (8)
  • Best practices (3)
  • lmcache (2)
  • New features (7)
  • News (16)
  • Performance (3)
  • Tutorial (9)
  • Uncategorized (1)


  • AMD × LMcache: AMD GPU Acceleration with LMcache

    By Andy Luo, Haichen Zhang, AMD AIG, Yihua, nijaba and LMCache Lab

    Jan 9, 2026
    Uncategorized

    Introduction: LLM inference becomes increasingly challenging as context length grows and workloads scale. Traditional serving engines rely on prefix-based KV cache reuse, which limits opportunities for optimization, especially when processing long, repeated, or overlapping text across different requests. LMCache addresses this challenge. It is an extension to LLM serving engines that dramatically reduces time-to-first-token (TTFT)…

    Read more: AMD × LMcache: AMD GPU Acceleration with LMcache
  • Context Engineering & Reuse Pattern Under the Hood of Claude Code

    By Kobe and Mengbing

    Dec 23, 2025
    Benchmark, lmcache
    cacheblend, claude-code, lmcache

    Over the last few months, Claude Code has quietly become one of the most interesting and widely adopted real-world agentic systems available to ordinary developers. It is unlike cloud-only agents such as Perplexity, Devin, or Manus, whose internals remain hidden behind API gateways, and unlike fully open-source agents such as Mini SWE Agent or Terminus 2, where you can…

    Read more: Context Engineering & Reuse Pattern Under the Hood of Claude Code
  • LMCache x Ascend: Accelerating LLM inference on Ascend NPUs

    By LMCache Lab

    Nov 4, 2025
    ascend, lmcache, News

    Supporting Ascend NPUs: We’re delighted to announce that LMCache now officially supports Ascend NPUs with the release of the LMCache-Ascend plugin. LMCache-Ascend supports a broad range of Ascend compute platforms from the cloud to the edge. This major platform expansion underscores LMCache’s commitment to delivering leading performance across a diverse hardware ecosystem, enabling developers to…

    Read more: LMCache x Ascend: Accelerating LLM inference on Ascend NPUs
  • Tensormesh unveiled and LMCache joins the PyTorch Foundation

    By Junchen Jiang

    Oct 31, 2025
    News
    lmcache, pytorch, tensormesh

    Announcing Tensormesh: First, I wanted to repeat here what I posted on the LMCache #general Slack channel last week: I am delighted to announce that the team that founded the LMCache project decided a few months ago to form a company, Tensormesh. As we are announcing the beta of our first product, we have…

    Read more: Tensormesh unveiled and LMCache joins the PyTorch Foundation
  • Breaking the Memory Barrier: How LMCache and CoreWeave Power Efficient LLM Inference for Cohere

    By Walter Beller-Morales (Cohere), Kishor Aher (CoreWeave) and Samuel Shen (Tensormesh)

    Oct 29, 2025
    Benchmark, Performance
    benchmark, CAIOS, cohere, coreweave, RAG, storage, tensormesh

    The challenge: Scaling enterprise AI. Enterprises today are racing to integrate large language models (LLMs) into their products and workflows, but doing it at scale brings challenges in performance, cost, and accuracy. Organizations need models to be based on their specific data, while making sure that this information remains private. Cohere, one of the leading…

    Read more: Breaking the Memory Barrier: How LMCache and CoreWeave Power Efficient LLM Inference for Cohere
  • LMCache on Google Kubernetes Engine: Boosting LLM Inference Performance with KV Cache on Tiered Storage

    By Danna Wang (Google)

    Oct 7, 2025
    Benchmark
    benchmark, gke, Google, storage, vLLM

    Overview of the Collaboration: The KV Cache is a memory optimization that makes Large Language Models (LLMs) run the forward pass faster by storing Key (K) and Value (V) matrices to prevent the model from recalculating them for the entire text sequence with every new generated token (a toy illustration of this idea appears after the post list below). Maximizing the KV Cache hit rate with storage is…

    Read more: LMCache on Google Kubernetes Engine: Boosting LLM Inference Performance with KV Cache on Tiered Storage
  • Implementing LMCache Plugin Framework & lmcache_frontend: Design Philosophy

    By Kobe and Baolong

    Sep 23, 2025
    New features
    lmcache, vLLM

    A flexible plugin system for enhanced observability and management. Abstract: In large-scale language model inference scenarios, efficient memory management and KV cache optimization are crucial. As a KV cache management system designed specifically for vLLM, LMCache needs more flexible extension mechanisms to support monitoring, troubleshooting, and state inspection in complex production…

    Read more: Implementing LMCache Plugin Framework & lmcache_frontend: Design Philosophy
  • NVIDIA Dynamo integrates LMCache, Accelerating LLM Inference

    By LMCache Team

    Sep 18, 2025
    News
    dynamo, lmcache, nvidia, vLLM

    We’re thrilled to announce that Nvidia Dynamo has integrated LMCache as a KV caching layer solution. This is a big milestone: Dynamo gets a battle-tested caching solution, and LMCache becomes part of a data center-scale inference platform used by many developers worldwide to deploy AI at scale. For comprehensive details about Dynamo’s KV cache optimization…

    Read more: NVIDIA Dynamo integrates LMCache, Accelerating LLM Inference
  • Extending LMCache Backends: A Comprehensive Guide to Custom Backend Development

    By Baolong and Kobe

    Sep 11, 2025
    Tutorial
    backend, customization, extension, lmcache, storage

    In large language model inference scenarios, the performance and flexibility of the KV cache system directly impact overall service efficiency. As a high-performance caching framework for large models, LMCache provides developers with rich extension capabilities through its modular backend design. This article starts from the LMCache backend extension mechanism, using the officially provided lmc_external_log_backend as an example,…

    Read more: Extending LMCache Backends: A Comprehensive Guide to Custom Backend Development
  • Nvidia Dynamo + LMCache: Accelerating the Future of LLM Inference

    By LMCache Team

    Sep 7, 2025
    Best practices, Performance
    collaboration, distributed-inference, dynamo, nvidia, performance

    We’re thrilled to announce that the Nvidia Dynamo project has integrated LMCache as its KV caching layer solution. This is a big milestone: Dynamo gets a battle-tested caching solution, and LMCache becomes part of a production-scale ecosystem used by many developers worldwide. Why KV Caching Matters KV caching is a foundational optimization for modern LLM…

    Read more: Nvidia Dynamo + LMCache: Accelerating the Future of LLM Inference
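
Several of the posts above describe the same underlying mechanism: reusing stored Key/Value (KV) matrices so a serving engine does not recompute attention state for text it has already processed. The toy Python sketch below illustrates that idea only; it is not LMCache or vLLM code, and the dimensions, projection weights, and cache keys are made up purely for illustration.

import numpy as np

d_model = 64
W_k = np.random.randn(d_model, d_model)   # toy key projection weights
W_v = np.random.randn(d_model, d_model)   # toy value projection weights

kv_cache = {}  # maps a prefix identifier -> cached (K, V) matrices


def project_kv(token_embeddings):
    """Compute toy K and V matrices for a block of token embeddings."""
    return token_embeddings @ W_k, token_embeddings @ W_v


def prefill(prefix_id, prefix_emb, suffix_emb):
    """Reuse cached K/V for a known prefix; compute K/V only for the new suffix."""
    if prefix_id in kv_cache:
        k_prefix, v_prefix = kv_cache[prefix_id]     # cache hit: no recomputation
    else:
        k_prefix, v_prefix = project_kv(prefix_emb)  # cache miss: compute once, then store
        kv_cache[prefix_id] = (k_prefix, v_prefix)
    k_suffix, v_suffix = project_kv(suffix_emb)      # new tokens are always computed
    return np.vstack([k_prefix, k_suffix]), np.vstack([v_prefix, v_suffix])


# Two requests sharing the same long system prompt: the second call only pays
# for its short suffix, which is the effect that lowers time-to-first-token.
shared_prompt = np.random.randn(2048, d_model)
k1, v1 = prefill("system-prompt-v1", shared_prompt, np.random.randn(32, d_model))
k2, v2 = prefill("system-prompt-v1", shared_prompt, np.random.randn(16, d_model))

In a real deployment the cache holds per-layer attention tensors and can spill from GPU memory to CPU memory or storage tiers, which is what the posts above benchmark; this sketch only shows why a cache hit avoids recomputing the shared portion of a request.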

LMCache Team  •  (c)2025  •  lmcache.github.io 

 
