Use a linked list for memSeries.headChunk #11818
jesusvazquez merged 6 commits into prometheus:main
Conversation
Trying to debug those failures from

This should be fixed now

Now I got a fuzzing failure; not sure if that's legit or just a flaky test as reported in #11810

Works after pushing more commits

/prombench main

I have not checked the code yet, but I'm running the benchmark for a few hours. The fuzzer failure is likely flaky; we have been seeing it a lot lately.

/prombench cancel

This is looking better. Review is on my TODO list!

Benchmark cancel is in progress.

If this looks mergeable then the next step could be turning the mmap and OoO slices into linked lists. Hopefully that would also reduce the number of lines in those verbose methods that iterate all slices.
Force-pushed from a9fefd9 to 5d6f1cd
I am reviewing the PR now (thanks for the work and patience!), and I am having second thoughts about the linked list. @bboreham, from #10709 (comment) we figured that having it as a slice is +16 bytes per series (Edit: it is actually 24, see the comments below, hence the doubly linked list is already better than using a slice). Also, @prymitive, do you think we can get away with a single-direction linked list, or is it bi-directional for efficiency reasons? But even if slices did add 16 bytes per series, that is like 15MiB of memory for 1M series (30MiB with GC overhead). Take 100M series: that is ~1.5GiB (~3GiB with GC overhead), which would be a small % of total memory consumption at that scale. I prefer we go simple and use slices. @bboreham, do you see any problem here or anything off in my analysis?
I checked one of our Mimir deployments which uses this TSDB underneath; assuming 32 bytes total overhead per series with GC, it is 0.6-0.7% more memory (the same with a linked list or slices, I presume?).
A slice header is 24 bytes, so 16 more than one 8-byte pointer, or 8 more than 2 pointers. However I envisaged a singly-linked list would be sufficient. Also I envisaged the list should be intrusive.
So if we are able to use a singly linked list, we save 16 bytes per series over the slice. Makes sense. Thanks for correcting my math!
FYI I did that in a different branch: main...prymitive:prometheus:v3_to_v2420. Edit: this one uses a single link instead of two.
Can you update this branch to have a singly linked list for only the in-order part? It will be easier to review the out-of-order thing separately. But if that makes things messy, it is ok to have both in the same PR.
codesome
left a comment
I added the same comments from above as code review comments for easy tracking. I did not see any major concern, so I will do a deep review once the suggested refactor is in place, because I assume a few things will change with that.
tsdb/head.go
Outdated
```go
type memChunkList struct {
	memChunk
	prev *memChunkList // older element, nil for the last (oldest) element on the list
	next *memChunkList // newer element, nil for the first (most recent) element on the list
}
```
Can we get away with not having next?
tsdb/head.go
Outdated
```go
	minTime, maxTime int64
}

type memChunkList struct {
```
We can merge this with memChunk and embed the pointers there.
Sorry, it was for m-mapped chunks, not OoO, didn't touch that part at all.
Oh, okay. Yeah, let's skip the mmap chunks for now. I did not look at the code and assumed it was for the ooo head chunk. It would be interesting to see for mmap chunks later as well, but that list would be longer than the head chunk list and might have some performance implications.
The mmap list would only be long(er) if you use a scrape interval that's small enough to create enough chunks. With a 2h block duration and a 1m scrape interval we end up with at most 1 m-mapped chunk. But with a 30s scrape interval we end up with up to 6. At that point (having 6 mmap chunks) we not only pay more for the linked list but also have to deal with the fact that tsdb code will iterate mmap chunks in both directions, which adds a bit of code when there's only a single link. Slices are simply a bit easier to iterate in whatever way one wishes.
In what sense?
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
As suggested, we should m-map chunks immediately during WAL replay to avoid memory spikes from a large number of in-memory chunks. Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Rebased again
jesusvazquez
left a comment
Hello @prymitive! Awesome work 💪
I've reviewed this and I think we are ready to merge. There seems to be enough coverage from the unit tests and your changes are looking good. Please note this is the kind of PR where the contributor has spent many hours and the reviewer a few minutes, so I might have missed something! I did look thoroughly, though!
I have an ask, though: the changes are only present in the in-order part of memSeries. Some memSeries may now have out-of-order (OOO) data if the setting is enabled, and to keep things consistent I'd like both implementations of the headChunk to be identical.
Because this PR is already quite large, we can do this in two ways: a) add the changes in this PR before we merge it, or b) merge this now and work on updating the OOO part separately.
I'm fine with either way, but I'm inclined to go with b only if you think you'll have some time to work on this in the coming weeks/months; otherwise I'd rather wait.
I'll keep this PR open to watch out for your response!
Thanks for reviewing it! It would be great if we could finally get this one done.
@prymitive congrats on getting this done! Merging was the easy part. Please feel free to cc me in the follow-up PRs. I'm interested in having a similar implementation with the ooo head chunk so we reduce the cognitive load on future readers and improve consistency in the codebase. Converting the m-mapped chunks to a linked list also sounds interesting and challenging 👍
/prombench v2.46.0
(Just want to run a benchmark for 1-2 days after the rebase changes to make sure everything works fine)
Benchmark tests are running for 3 days! If this is intended, ignore this message; otherwise you can cancel it by commenting:
Benchmark tests are running for 5 days! If this is intended, ignore this message; otherwise you can cancel it by commenting:
/prombench cancel
Benchmark cancel is in progress. |
Actually @jesusvazquez the mmapped chunks slice is the easy one; ooo, on the other hand, isn't. With ooo, mmapping is used differently, like here: https://github.com/prometheus/prometheus/blob/main/tsdb/head_append.go#L953-L973 Mmap references are also used here: https://github.com/prometheus/prometheus/blob/main/tsdb/record/record.go#L393 There's a lot going on there. Overall it seems like mmapping of chunks in ooo code is more of a design decision than an implementation detail, which raises a few different issues for me:
Follow up on prometheus#11818 and make the list of mmapped chunks a linked list as well. Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
I was a bit shocked to discover that this PR made PromQL benchmarks 3x slower, but this turns out to be down to the benchmark creating an unusual number of head chunks in the list. See #12755. However, I had already started thinking about how to reduce the number of times the list is walked.
This is a linked list version of #10709