Description
Background
Our environment has a Prometheus server that remote-writes to a "long-term" Prometheus server (fewer events, longer retention).
What did you do?
Upgraded the Prometheus Helm chart from 27.42.0 -> 27.50.0 (which includes v3.8.0)
NOTE: issue seems to resolve itself after downgrading to chart version 27.48 (just before v3.8.0)
What did you expect to see?
- Nothing -- business as usual.
- Expected to still see remote writes happening to our "long-term" server
What did you see instead? Under which circumstances?
- Began seeing these logs quite often in the prometheus server logs:
time=2025-12-07T19:58:52.582Z level=ERROR source=queue_manager.go:1709 msg="we got 2xx status code from the Receiver yet statistics indicate some data was not written; investigation needed" component=remote remote_name=25df39 url=http://prometheus-longterm-server.observability.svc/api/v1/write failedSampleCount=1861 failedHistogramCount=0 failedExemplarCount=0
- As far as I could tell based on our dashboards, etc. (which pull from our "long-term" server), it doesn't seem like metrics are actually being dropped.
As soon as I downgraded the chart to 27.48.0 (before prometheus v3.8.0 was introduced), these errors seemed to disappear.
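For anyone who wants to quantify how often this error fires, here is a rough sketch that tallies the failed counts from log lines like the one above. It is not part of Prometheus; the names `parse_logfmt` and `summarize` are my own, and it assumes the logs are in the logfmt style shown in this report.

```python
import shlex

# Prefix of the error message quoted in this report.
MSG = "we got 2xx status code from the Receiver"

COUNT_KEYS = ("failedSampleCount", "failedHistogramCount", "failedExemplarCount")


def parse_logfmt(line):
    """Split a logfmt line into a dict; shlex handles the quoted msg value."""
    fields = {}
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes: treat the line as unparseable and skip it.
        return fields
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def summarize(lines):
    """Return (number of matching error lines, summed failed counts)."""
    totals = {key: 0 for key in COUNT_KEYS}
    matches = 0
    for line in lines:
        fields = parse_logfmt(line)
        if not fields.get("msg", "").startswith(MSG):
            continue
        matches += 1
        for key in COUNT_KEYS:
            value = fields.get(key, "0")
            if value.isdigit():
                totals[key] += int(value)
    return matches, totals
```

Calling `summarize(open("prometheus.log"))` on a saved copy of the server logs (the file name is only an example) gives the number of occurrences and the summed counts, which can then be compared against what the "long-term" server actually received.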
System information
No response
Prometheus version
prometheus, version 3.7.3 (branch: HEAD, revision: 0a41f0000705c69ab8e0f9a723fc73e39ed62b07)
build user: root@08c890a84441
build date: 20251030-07:26:10
go version: go1.25.3
platform: linux/amd64
tags: netgo,builtinassets
Prometheus configuration file
Alertmanager version
Alertmanager configuration file
Logs