Skip to content

fix(datastorage): remove unbounded in-memory cache of request/response bodies (OOM)#1781

Merged
looplj merged 1 commit into
looplj:unstablefrom
suixinio:fix/datastorage-memcache-oom
Jun 4, 2026
Merged

fix(datastorage): remove unbounded in-memory cache of request/response bodies (OOM)#1781
looplj merged 1 commit into
looplj:unstablefrom
suixinio:fix/datastorage-memcache-oom

Conversation

@suixinio

@suixinio suixinio commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Problem

DataStorageService builds the S3 and GCS filesystems by wrapping the remote FS in
afero.NewCacheOnReadFs(base, afero.NewMemMapFs(), 5*time.Minute)
(internal/server/biz/data_storage.go, createS3Fs / createGcsFs).

The in-memory MemMapFs cache layer is never evicted: afero's cacheTime
is only a read-staleness check (it re-reads from base when a cached file is older
than the TTL) — it never removes entries from the layer. With store_request_body /
store_response_body enabled, every request writes body files keyed by a unique
request id, so the in-memory layer accumulates a copy of every body ever written
and grows without bound until the process is OOM-killed and restarts — repeatedly.

Evidence

Heap profiling of an affected instance (go tool pprof -inuse_space) showed the
live heap dominated by this cache:

  • github.com/spf13/afero/mem.(*File).Write97% of in-use heap, reached via
    afero.(*UnionFile).Write (the CacheOnReadFs layer) on the HTTP request path.
  • Live heap grew monotonically (~10 MB/min under traffic) until the memory limit,
    then OOM — reproduced across multiple independent capture windows.
  • Goroutine count was stable, ruling out a goroutine leak.

Fix

createS3Fs / createGcsFs now return the base filesystem directly, with no
in-memory CacheOnReadFs layer. Request/response bodies are write-once and read
rarely (the admin request-detail view), so reads go straight to the object store.
This removes the unbounded retention entirely and keeps object storage
authoritative (no cache staleness on overwrite, delete, or storage reconfiguration).

A bounded in-memory LRU read cache was considered, but for a write-once /
read-rarely blob workload it provides marginal benefit while adding
cache-invalidation surface, so it was dropped in favor of the simpler, fully
correct approach.

Verification

  • Built and deployed the fixed image; UI and API healthy.
  • Observed ~80 min under real traffic: live heap peaks at ~86 MB and oscillates
    around ~50 MB (vs. a monotonic climb to ~470 MB before the fix); 0 OOM kills,
    0 restarts
    (previously OOM-killed every 25–80 min).
  • golangci-lint reports no new issues.

Diff: internal/server/biz/data_storage.go only (+11 / −9).

createS3Fs/createGcsFs wrapped the remote filesystem in
afero.NewCacheOnReadFs(base, afero.NewMemMapFs(), 5m). That in-memory
layer is never evicted (afero's cacheTime only re-checks freshness on
read, it never deletes cached files), so with store_request_body /
store_response_body enabled it retained an in-memory copy of every
request/response body ever written, keyed by unique request id. The live
heap grew unbounded (~10MB/min under traffic; confirmed via pprof:
afero/mem.(*File).Write = 97% of live heap) until the container hit its
memory limit and was OOM-killed, repeatedly.

Fix: createS3Fs/createGcsFs return the base filesystem directly. Reads go
straight to the object store; request/response bodies are write-once and
read rarely (admin request-detail view), so an in-memory read cache is
unnecessary. This removes the unbounded retention entirely and keeps
object storage authoritative (no cache staleness on overwrite, delete, or
storage reconfiguration).
@greptile-apps

greptile-apps Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Removes the afero.NewCacheOnReadFs + afero.NewMemMapFs layer from both createS3Fs and createGcsFs, returning the underlying object-store filesystem directly to eliminate an unbounded in-memory cache that caused recurring OOM crashes.

  • S3: createS3Fs now returns s3fs.NewFsFromClient(...) directly instead of wrapping it in a CacheOnReadFs backed by MemMapFs.
  • GCS: createGcsFs similarly drops the CacheOnReadFs wrapper and returns afero.NewBasePathFs(fs, bucketName) directly.
  • Both changes include explanatory comments discouraging re-introduction of the cache layer.

Confidence Score: 5/5

Safe to merge — the change is a targeted two-line removal of a well-understood leak with no new logic introduced.

The fix removes the CacheOnReadFs+MemMapFs layer from both S3 and GCS constructors and returns the base filesystem directly. The remaining imports (time, afero) are still used elsewhere in the file. WebDAV is unaffected. No new control flow, no new dependencies, and the author provided heap-profile evidence confirming the root cause and post-fix stability.

No files require special attention.

Important Files Changed

Filename Overview
internal/server/biz/data_storage.go Removes CacheOnReadFs wrapper from createS3Fs and createGcsFs; the time import remains valid for other usages; WebDAV path is unaffected; no logic errors introduced.

Sequence Diagram

sequenceDiagram
    participant Handler as HTTP Handler
    participant DSS as DataStorageService
    participant Fs as afero.Fs (S3/GCS)
    participant Store as Object Store (S3/GCS)

    Note over DSS,Store: Before fix
    Handler->>DSS: WriteFile(body)
    DSS->>Fs: CacheOnReadFs.Write(body)
    Fs->>Fs: MemMapFs.Write(body) [NEVER evicted]
    Fs->>Store: baseFs.Write(body)
    Note over Fs: Memory grows unbounded → OOM

    Note over DSS,Store: After fix
    Handler->>DSS: WriteFile(body)
    DSS->>Fs: baseFs.Write(body)
    Fs->>Store: Write(body)
    Note over Fs: No in-memory retention
Loading

Reviews (1): Last reviewed commit: "fix(datastorage): remove unbounded in-me..." | Re-trigger Greptile

@looplj looplj merged commit cd1157c into looplj:unstable Jun 4, 2026
4 checks passed
junjiangao pushed a commit to junjiangao/axonhub that referenced this pull request Jun 5, 2026
…ooplj#1781)

createS3Fs/createGcsFs wrapped the remote filesystem in
afero.NewCacheOnReadFs(base, afero.NewMemMapFs(), 5m). That in-memory
layer is never evicted (afero's cacheTime only re-checks freshness on
read, it never deletes cached files), so with store_request_body /
store_response_body enabled it retained an in-memory copy of every
request/response body ever written, keyed by unique request id. The live
heap grew unbounded (~10MB/min under traffic; confirmed via pprof:
afero/mem.(*File).Write = 97% of live heap) until the container hit its
memory limit and was OOM-killed, repeatedly.

Fix: createS3Fs/createGcsFs return the base filesystem directly. Reads go
straight to the object store; request/response bodies are write-once and
read rarely (admin request-detail view), so an in-memory read cache is
unnecessary. This removes the unbounded retention entirely and keeps
object storage authoritative (no cache staleness on overwrite, delete, or
storage reconfiguration).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants