cache descriptors to reduce API calls #218
Conversation
	return true
}

func TestDescriptorCache(t *testing.T) {
The test feels fragile, but I'm not sure how bad it turns out to be in practice. We'll see what the CI has to say about it.
See issue prometheus-community#217.

Currently the prefix -> descriptor mapping happens on every /metrics call. In our setup we have a lot of exporters running with specific prefixes that usually map to only a single metric or a handful of metrics. This makes the ratio of ListDescriptor to ListTimeseries calls often close to 50/50 (43% of total requests were ListDescriptor in a random sampling). In our case we're only interested in scraping the metrics produced by Google services. Those services very rarely change their metrics, so fetching the descriptors every time is just wasted work. While it's only a few calls per exporter, it adds up to millions of calls a month across many Google projects: 2 (Prometheuses in HA) * 1 (scrape per minute) * 60 * 24 * 30 (minutes per month) * 20 prefixes = 1'728'000.

This commit introduces a simple prefix -> []descriptor cache. If the prefix isn't in the cache, the original page-based lookup is performed and at the end of it the cache is updated. If it does exist in the cache, the cached value is used.

No attempt is made to prevent two routines from updating the cache concurrently. If two concurrent requests see an expired cache entry, they'll both do the ListDescriptor call and update the cache. Doing this keeps the cache simple and doesn't block one request on the other. This situation will also be rare in practice, as there are usually only one or two Prometheuses scraping.

The cache itself comes in 3 flavors:

- noop, does nothing
- descriptorCache, caches all descriptors
- googleDescriptorCache, caches only descriptors of Google metrics

The cache is disabled by default and can be enabled with the new `--monitoring.descriptor-cache-ttl` CLI flag; if the TTL is set to a value >0s the cache is enabled. It defaults to using the Google-only cache; to control that, the `--no-monitoring.descriptor-cache-only-google` flag is introduced.

Signed-off-by: bas smit <bsmit@bol.com>
// descriptorCache is a MetricTypePrefix -> MetricDescriptor cache
type descriptorCache struct {
	cache map[string]*descriptorCacheEntry
	lock  sync.Mutex
}
Would it make sense to use an RWMutex?
I don't think it matters in this case. The lock is not held during the API call itself, only for reads and writes of the map.
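To illustrate the point: the critical section covers only a map access, so the slow ListDescriptor call never runs under the lock, and Mutex vs RWMutex makes little practical difference. A self-contained sketch of that scrape path, where `fetchDescriptors` is a hypothetical stand-in for the paged API call:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cache guards only its map with the mutex; the network call below
// runs with the lock released.
type cache struct {
	mu sync.Mutex
	m  map[string][]string
}

func (c *cache) get(k string) ([]string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.m[k]
	return v, ok
}

func (c *cache) set(k string, v []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[k] = v
}

// fetchDescriptors is a hypothetical stand-in for the paged
// ListDescriptor call; note it takes no lock.
func fetchDescriptors(prefix string) []string {
	time.Sleep(10 * time.Millisecond) // simulate network latency
	return []string{prefix + "some.metric"}
}

func descriptorsFor(c *cache, prefix string) []string {
	if ds, ok := c.get(prefix); ok { // brief lock inside get
		return ds
	}
	ds := fetchDescriptors(prefix) // slow call, no lock held
	c.set(prefix, ds)              // brief lock inside set
	return ds
}

func main() {
	c := &cache{m: map[string][]string{}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Concurrent misses may each fetch and store; the last
			// write wins, which is the accepted trade-off.
			descriptorsFor(c, "compute.googleapis.com/")
		}()
	}
	wg.Wait()
	fmt.Println(len(c.m), "cache entry")
}
```

With lock hold times this short, contention is negligible at one scrape per minute, which is presumably why a plain Mutex was considered sufficient.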
SuperQ
left a comment
Question about whether we should use an RWMutex/RLock(); otherwise LGTM.
Had >200 exporters running with this patch for a while, haven't seen any weird behavior, and achieved a 40% reduction in the number of API calls.
* [FEATURE] cache descriptors to reduce API calls #218 Signed-off-by: SuperQ <superq@gmail.com>
40% fewer API calls, that's great! This collector gets expensive really quickly.
Release v0.14.0

* [FEATURE] cache descriptors to reduce API calls #218

Signed-off-by: SuperQ <superq@gmail.com>
The issue has more details, but the reduction is because in a lot of cases we have very specific prefixes to reduce cost. If you only have a broad prefix like

Semi-related: another thing you could look at is 'caching' the reply. In an HA env you have two Prometheuses scraping the exporter, resulting in 2x the API calls to Google. Depending on your use case it might be OK to give one of them the cached results of the other.
@fbs this parameter (
@WojciechKuk no, AFAIK there is no way to cache the