Description
We run many workers on GKE that essentially pull messages from a Pubsub subscription at a rate of about 120k/min, do some processing and write the result into Bigtable. As part of a recent change we upgraded this Pubsub client library from 0.29.1 to 1.2 and immediately started to see alerts.
After the upgrade we started to see hourly spikes in the reported backlog (and in the oldest unacked message age). However, our service did not appear to suffer and continued to produce its output at a steady rate.
Here is an overview of this upgrade as seen in Stackdriver (running Pubsub v1.2 highlighted in red, then after 10am Friday we reverted JUST the pubsub version and the process returned to normal):
Zooming into approx Friday 12am until noon, and showing backlog at the top and oldest message at the bottom:
It is pretty clearly something that happens every hour.
I know there's another GitHub issue for memory spikes, but as far as we can tell that's not the case for us. In fact, I don't think we saw any real impact on processing output. This assessment is based on two observations: 1) we didn't see lag in our downstream client, which is usually the case with actual backlogs, and 2) we didn't see an increase in worker CPU when the backlog recovered. The biggest problem is that we use the alerts on this subscription as our main indicator that the process may be experiencing problems.
Environment details
- OS: GKE
- Node.js version: 12.x
- npm version: 6.11.3
- `@google-cloud/pubsub` version: 1.2
Steps to reproduce
Not sure we do anything special. Each worker instance creates a simple client which processes messages. We run this on GKE, with one node instance per pod. Approximately 64 workers are all pulling from the same subscription. We generally just ack messages regardless of successful processing because in this application it's ok to just drop them (a separate process will take care of it).
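For illustration, here is a minimal sketch of the pattern described above. It is not the actual worker code: `makeHandler`, `attachWorker`, `processMessage`, and the subscription name `my-subscription` are placeholders, and it assumes the ack is sent without waiting for processing to finish.

```javascript
// Sketch only: all names here are placeholders, not the real worker code.

// Wraps a processing function so the message is always acked,
// whether processing succeeds or fails. Failures are only reported.
function makeHandler(processFn, onError = console.error) {
  return function handleMessage(message) {
    try {
      // processFn may be synchronous or return a promise.
      Promise.resolve(processFn(message)).catch(onError);
    } catch (err) {
      onError(err);
    } finally {
      // Ack regardless of the outcome; dropped messages are assumed
      // to be handled by a separate process.
      message.ack();
    }
  };
}

// Attaches the handler to a subscription-like event emitter.
function attachWorker(subscription, processFn, onError = console.error) {
  subscription.on('message', makeHandler(processFn, onError));
  subscription.on('error', onError);
  return subscription;
}

// Wiring with the real client (requires @google-cloud/pubsub to be installed):
//   const {PubSub} = require('@google-cloud/pubsub');
//   const subscription = new PubSub().subscription('my-subscription');
//   attachWorker(subscription, processMessage);
```

Each pod would run one such process, with all of them attached to the same subscription.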
Hope this is helpful. Let me know if I can provide any additional data.
Thanks!

