fix(datastorage): remove unbounded in-memory cache of request/response bodies (OOM)#1781
Conversation
createS3Fs/createGcsFs wrapped the remote filesystem in afero.NewCacheOnReadFs(base, afero.NewMemMapFs(), 5m). That in-memory layer is never evicted (afero's cacheTime only re-checks freshness on read, it never deletes cached files), so with store_request_body / store_response_body enabled it retained an in-memory copy of every request/response body ever written, keyed by unique request id. The live heap grew unbounded (~10MB/min under traffic; confirmed via pprof: afero/mem.(*File).Write = 97% of live heap) until the container hit its memory limit and was OOM-killed, repeatedly. Fix: createS3Fs/createGcsFs return the base filesystem directly. Reads go straight to the object store; request/response bodies are write-once and read rarely (admin request-detail view), so an in-memory read cache is unnecessary. This removes the unbounded retention entirely and keeps object storage authoritative (no cache staleness on overwrite, delete, or storage reconfiguration).
Greptile SummaryRemoves the
Confidence Score: 5/5Safe to merge — the change is a targeted two-line removal of a well-understood leak with no new logic introduced. The fix removes the CacheOnReadFs+MemMapFs layer from both S3 and GCS constructors and returns the base filesystem directly. The remaining imports (time, afero) are still used elsewhere in the file. WebDAV is unaffected. No new control flow, no new dependencies, and the author provided heap-profile evidence confirming the root cause and post-fix stability. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant Handler as HTTP Handler
participant DSS as DataStorageService
participant Fs as afero.Fs (S3/GCS)
participant Store as Object Store (S3/GCS)
Note over DSS,Store: Before fix
Handler->>DSS: WriteFile(body)
DSS->>Fs: CacheOnReadFs.Write(body)
Fs->>Fs: MemMapFs.Write(body) [NEVER evicted]
Fs->>Store: baseFs.Write(body)
Note over Fs: Memory grows unbounded → OOM
Note over DSS,Store: After fix
Handler->>DSS: WriteFile(body)
DSS->>Fs: baseFs.Write(body)
Fs->>Store: Write(body)
Note over Fs: No in-memory retention
Reviews (1): Last reviewed commit: "fix(datastorage): remove unbounded in-me..." | Re-trigger Greptile |
…ooplj#1781) createS3Fs/createGcsFs wrapped the remote filesystem in afero.NewCacheOnReadFs(base, afero.NewMemMapFs(), 5m). That in-memory layer is never evicted (afero's cacheTime only re-checks freshness on read, it never deletes cached files), so with store_request_body / store_response_body enabled it retained an in-memory copy of every request/response body ever written, keyed by unique request id. The live heap grew unbounded (~10MB/min under traffic; confirmed via pprof: afero/mem.(*File).Write = 97% of live heap) until the container hit its memory limit and was OOM-killed, repeatedly. Fix: createS3Fs/createGcsFs return the base filesystem directly. Reads go straight to the object store; request/response bodies are write-once and read rarely (admin request-detail view), so an in-memory read cache is unnecessary. This removes the unbounded retention entirely and keeps object storage authoritative (no cache staleness on overwrite, delete, or storage reconfiguration).
Problem
DataStorageServicebuilds the S3 and GCS filesystems by wrapping the remote FS inafero.NewCacheOnReadFs(base, afero.NewMemMapFs(), 5*time.Minute)(
internal/server/biz/data_storage.go,createS3Fs/createGcsFs).The in-memory
MemMapFscache layer is never evicted: afero'scacheTimeis only a read-staleness check (it re-reads from base when a cached file is older
than the TTL) — it never removes entries from the layer. With
store_request_body/store_response_bodyenabled, every request writes body files keyed by a uniquerequest id, so the in-memory layer accumulates a copy of every body ever written
and grows without bound until the process is OOM-killed and restarts — repeatedly.
Evidence
Heap profiling of an affected instance (
go tool pprof -inuse_space) showed thelive heap dominated by this cache:
github.com/spf13/afero/mem.(*File).Write≈ 97% of in-use heap, reached viaafero.(*UnionFile).Write(theCacheOnReadFslayer) on the HTTP request path.then OOM — reproduced across multiple independent capture windows.
Fix
createS3Fs/createGcsFsnow return the base filesystem directly, with noin-memory
CacheOnReadFslayer. Request/response bodies are write-once and readrarely (the admin request-detail view), so reads go straight to the object store.
This removes the unbounded retention entirely and keeps object storage
authoritative (no cache staleness on overwrite, delete, or storage reconfiguration).
A bounded in-memory LRU read cache was considered, but for a write-once /
read-rarely blob workload it provides marginal benefit while adding
cache-invalidation surface, so it was dropped in favor of the simpler, fully
correct approach.
Verification
around ~50 MB (vs. a monotonic climb to ~470 MB before the fix); 0 OOM kills,
0 restarts (previously OOM-killed every 25–80 min).
golangci-lintreports no new issues.Diff:
internal/server/biz/data_storage.goonly (+11 / −9).