
cache descriptors to reduce API calls#218

Merged
SuperQ merged 1 commit into prometheus-community:master from fbs:descriptorcache
May 25, 2023

Conversation


@fbs fbs commented Apr 20, 2023

See issue #217

This commit introduces a simple prefix -> []descriptor cache. If the prefix isn't in the cache, the original page-based lookup is performed and the cache is updated afterwards; if the prefix is in the cache, the cached value is used.
No attempt is made to prevent two goroutines from updating the cache concurrently. If two concurrent requests see an expired cache entry, they'll both do the ListDescriptor call and update the cache. This keeps the cache simple and doesn't block one request on the other. The situation will also be rare in practice, as there are usually only one or two prometheuses scraping.

The cache itself comes in 3 flavors:

  • noop, does nothing
  • descriptorCache, caches all descriptors
  • googleDescriptorCache, caches only descriptors of google metrics

The cache is disabled by default and can be enabled with the new `--monitoring.descriptor-cache-ttl` CLI flag: setting the TTL to a value >0s enables the cache. It defaults to the google-only cache; the new `--no-monitoring.descriptor-cache-only-google` flag switches to caching all descriptors.

@fbs force-pushed the descriptorcache branch from b9c0c6d to ac4876e on April 20, 2023 at 12:34
return true
}

func TestDescriptorCache(t *testing.T) {
@fbs fbs Apr 20, 2023


The test feels fragile, but I'm not sure how bad it turns out to be in practice. Will see what CI has to say about it.

@fbs force-pushed the descriptorcache branch from ac4876e to 1350904 on April 21, 2023 at 11:47
See issue prometheus-community#217

Currently the prefix -> descriptor mapping happens on every /metrics
call. In our setup we have a lot of exporters running with specific
prefixes that usually map to only a single/handful of metrics. This
makes the ratio of ListDescriptor to ListTimeseries calls often close to
50/50 (43% of total requests were ListDescriptor in a random sampling).

In our case we're only interested in scraping the metrics produced by
google services. Those services very rarely change their metrics so
fetching the descriptors every time is just wasted work.
While it's only a few calls per exporter, it adds up to millions of calls
a month (2 (prometheuses in HA) * 1 (scrape per minute) * 60 * 24 * 30
(minutes per month) * 20 prefixes = 1'728'000), across many google
projects.

This commit introduces a simple prefix -> []descriptor cache. If the
prefix isn't in the cache, the original page-based lookup is performed
and the cache is updated afterwards; if the prefix is in the cache,
the cached value is used.
No attempt is made to prevent two goroutines from updating the cache
concurrently. If two concurrent requests see an expired cache entry,
they'll both do the ListDescriptor call and update the cache. This
keeps the cache simple and doesn't block one request on the other.
The situation will also be rare in practice, as there are usually only
one or two prometheuses scraping.

The cache itself comes in 3 flavors:

- noop, does nothing
- descriptorCache, caches all descriptors
- googleDescriptorCache, caches only descriptors of google metrics

The cache is disabled by default and can be enabled with the new
`--monitoring.descriptor-cache-ttl` CLI flag: setting the TTL to a
value >0s enables the cache. It defaults to the google-only cache;
the new `--no-monitoring.descriptor-cache-only-google` flag switches
to caching all descriptors.

Signed-off-by: bas smit <bsmit@bol.com>
@fbs force-pushed the descriptorcache branch from 1350904 to a16df59 on May 4, 2023 at 17:31
// descriptorCache is a MetricTypePrefix -> MetricDescriptor cache
type descriptorCache struct {
	cache map[string]*descriptorCacheEntry
	lock  sync.Mutex
}
A reviewer commented:

Would it make sense to use an RWMutex?

@fbs replied:

Don't think it matters in this case. The lock is not held during the API call itself, only for read/write of the map itself.
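The locking pattern under discussion can be shown in a small self-contained sketch (all names here are illustrative, not the exporter's): the mutex is held only around the map access, while the slow fetch, standing in for the ListDescriptors API call, runs unlocked, so an RWMutex would buy little.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry and cache are stand-ins, not the exporter's types.
type entry struct {
	vals   []string
	expiry time.Time
}

type cache struct {
	mu sync.Mutex
	m  map[string]entry
}

// get holds the mutex only around map reads/writes; the slow fetch
// (e.g. the ListDescriptors API call) runs with no lock held, so two
// concurrent misses may both fetch, which is accepted for simplicity.
func (c *cache) get(prefix string, fetch func(string) []string) []string {
	c.mu.Lock()
	e, ok := c.m[prefix]
	c.mu.Unlock() // release before any slow work
	if ok && time.Now().Before(e.expiry) {
		return e.vals
	}
	vals := fetch(prefix) // slow network call, no lock held
	c.mu.Lock()
	c.m[prefix] = entry{vals: vals, expiry: time.Now().Add(time.Minute)}
	c.mu.Unlock()
	return vals
}

func main() {
	c := &cache{m: map[string]entry{}}
	calls := 0
	fetch := func(string) []string { calls++; return []string{"d1"} }
	c.get("p/", fetch)
	c.get("p/", fetch) // served from cache, fetch not called again
	fmt.Println(calls) // 1
}
```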


@SuperQ SuperQ left a comment


Question about whether we should use an RWMutex/RLock(); otherwise LGTM.

@fbs

fbs commented May 25, 2023

Had >200 exporters running with this patch for a while, haven't seen any weird behavior, and achieved a 40% reduction in the number of API calls.

@SuperQ SuperQ merged commit 50c3be2 into prometheus-community:master May 25, 2023
SuperQ added a commit that referenced this pull request May 25, 2023
* [FEATURE] cache descriptors to reduce API calls #218

Signed-off-by: SuperQ <superq@gmail.com>
@weyert

weyert commented May 25, 2023

Had >200 exporters running with this patch for a while, haven't seen any weird behavior, and achieved a 40% reduction in the number of API calls.

40% fewer API calls, that's great! This collector gets expensive really quickly.

SuperQ added a commit that referenced this pull request May 26, 2023
* [FEATURE] cache descriptors to reduce API calls #218

Signed-off-by: SuperQ <superq@gmail.com>
SuperQ added a commit that referenced this pull request May 26, 2023
Release v0.14.0

* [FEATURE] cache descriptors to reduce API calls #218

---------

Signed-off-by: SuperQ <superq@gmail.com>
@fbs

fbs commented May 26, 2023

The issue has more details, but the reduction is because in a lot of cases we have very specific prefixes to reduce cost. If you only have a broad prefix like storage.googleapis.com/ the reduction will be way less.

Semi-related: another thing you could look at is 'caching' the reply. In an HA environment you have 2 prometheuses scraping the exporter, resulting in 2x the API calls to google. Depending on your use case it might be OK to give one of them the cached results of the other.

@fbs fbs deleted the descriptorcache branch May 26, 2023 12:22
@WojciechKuk

WojciechKuk commented Feb 3, 2025

Semi related, another thing you could look at is 'caching' the reply.

@fbs is this parameter (--monitoring.descriptor-cache-ttl) enough for this? (I have 2 prometheuses in an HA setup scraping one stackdriver_exporter instance; I need all metrics+values to be cached for 60s, not only the descriptor data.)

@fbs

fbs commented Feb 3, 2025

@WojciechKuk no, AFAIK there is no way to cache the /metrics response yet. If there is no ticket for it, it might be good to create one. I'm still interested in the feature, just never got to it.

4 participants