Use a linked list for memSeries.headChunk#11818

Merged
jesusvazquez merged 6 commits into prometheus:main from prymitive:headChunkV2
Jul 31, 2023

Conversation

@prymitive
Contributor

This is a linked list version of #10709

@prymitive prymitive marked this pull request as ready for review January 6, 2023 13:22
@prymitive prymitive requested a review from codesome as a code owner January 6, 2023 13:22
@prymitive
Contributor Author

Trying to debug those failures from TestRulesUnitTest but struggling a bit with it.
It's not always failing so it seems that there's a race somewhere, but didn't yet manage to find the issue.

@prymitive
Contributor Author

> Trying to debug those failures from TestRulesUnitTest but struggling a bit with it. It's not always failing so it seems that there's a race somewhere, but didn't yet manage to find the issue.

This should be fixed now

@prymitive
Contributor Author

Now I got a fuzzing failure, not sure if that's legit or just a flaky test as reported in #11810

@prymitive
Contributor Author

> Now I got a fuzzing failure, not sure if that's legit or just a flaky test as reported in #11810

Works after pushing more commits

@codesome
Member

/prombench main

I have not checked the code yet, but running the benchmark for a few hours.

And the fuzzer failure is likely flaky; we have been seeing it a lot lately.

@prombot
Contributor

prombot commented Jan 10, 2023

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-11818 and main

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart main

@codesome
Member

/prombench cancel

This is looking better. Review is on my TODO list!

@prombot
Contributor

prombot commented Jan 10, 2023

Benchmark cancel is in progress.

@prymitive
Contributor Author

If this looks mergeable then the next step could be turning mmap and OoO slices into linked lists. Hopefully that would also reduce the number of lines in those verbose methods that iterate all slices.

@codesome
Member

codesome commented Mar 27, 2023

I am reviewing the PR now (thanks for the work and patience!), and I am having second thoughts about the linked list.

@bboreham from #10709 (comment), we figured that having it as a slice is +16 bytes for a slice (Edit: it is actually 24. See below comments. Hence the doubly linked list is already better than using a slice). But now, memChunkList also adds +16 bytes for the 2 pointers. So I don't see a benefit of a linked list.

Also, @prymitive, do you think we can get away with a single-direction linked list? Or is it bi-directional for efficiency reasons?

But even with that, let's say we did add 16 bytes to the series with slices; that is about 15MiB of memory for 1M series (30MiB with GC overhead). Take 100M series and that is ~1.5GiB (~3GiB with GC overhead), which will be a small % of total memory consumption at that scale. I prefer we go simple and use slices. @bboreham do you see any problem here or anything off in my analysis?
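For reference, the arithmetic above can be checked with a quick back-of-the-envelope sketch (plain Go, not Prometheus code; the 16-byte-per-series and GC-doubling figures are simply taken from the comment above):

```go
package main

import "fmt"

func main() {
	const perSeriesBytes = 16 // extra bytes per series being debated above
	const gcFactor = 2.0      // rough doubling for GC overhead

	for _, series := range []float64{1e6, 100e6} {
		mib := series * perSeriesBytes / (1 << 20)
		fmt.Printf("%.0f series: %.1f MiB (%.1f MiB with GC overhead)\n",
			series, mib, mib*gcFactor)
	}
}
```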

@codesome
Member

I checked one of our Mimir deployments which uses this TSDB underneath; assuming 32 bytes of total overhead per series with GC, it is 0.6-0.7% more memory (same with linked list or slices, I presume?).

@bboreham
Member

> having it as a slice is +16 bytes for a slice. But now, memChunkList also adds +16 bytes for the 2 pointers.

A slice header is 24 bytes, so 16 more than one 8-byte pointer, or 8 more than 2 pointers.

However I envisaged a singly-linked list would be sufficient. Also I envisaged the list should be intrusive - next would be part of memChunk.

@codesome
Member

So if we are able to use a singly linked list, we save 16 bytes per series over the slice. Makes sense. Thanks for correcting my math!

> Also I envisaged the list should be intrusive - next would be part of memChunk.
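As a rough sketch of the intrusive, singly-linked shape being discussed (field and function names here are illustrative, not the final PR code):

```go
package main

import "fmt"

// Illustrative sketch only: the real memChunk in tsdb/head.go holds a
// chunkenc.Chunk plus min/max times; here it is reduced to the list shape.
// The link is intrusive: prev lives inside memChunk itself, costing one
// pointer per chunk instead of a 24-byte slice header per series.
type memChunk struct {
	minTime, maxTime int64
	prev             *memChunk // older chunk, nil for the oldest element
}

// push makes c the new head chunk; appending is O(1).
func push(head, c *memChunk) *memChunk {
	c.prev = head
	return c
}

func main() {
	var head *memChunk
	for t := int64(0); t < 3; t++ {
		head = push(head, &memChunk{minTime: t * 100, maxTime: t*100 + 99})
	}
	// Walk from newest to oldest - the natural direction for a single link.
	for c := head; c != nil; c = c.prev {
		fmt.Println(c.minTime, c.maxTime)
	}
}
```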

@prymitive
Contributor Author

prymitive commented Mar 27, 2023

> If this looks mergeable then the next step could be turning mmap and OoO slices into linked lists. Hopefully that would also reduce the number of lines in those verbose methods that iterate all slices.

FYI I did that in a different branch - main...prymitive:prometheus:v3_to_v2420

Edit: this one uses a single link instead of two.

@codesome
Member

> FYI I did that in a different branch

Can you update this branch to have a singly linked list for only the in-order part? It will be easier to review the out-of-order thing separately. But if that makes things messy, it is ok to have both in the same PR

Member

@codesome codesome left a comment


I added the same comments from above as code review comments for easy tracking. I did not see any major concerns. I will do a deep review once the suggested refactor is in place, because I expect a few things to change with it.

tsdb/head.go Outdated
type memChunkList struct {
memChunk
prev *memChunkList // older element, nil for the last (oldest) element on the list
next *memChunkList // newer element, nil for the first (most recent) element on the list
Member


Can we get away with not having next?

tsdb/head.go Outdated
minTime, maxTime int64
}

type memChunkList struct {
Member


We can merge this with memChunk and embed the pointers there.

Contributor Author


Sure, will do

@prymitive
Contributor Author

>> FYI I did that in a different branch
>
> Can you update this branch to have a singly linked list for only the in-order part? It will be easier to review the out-of-order thing separately. But if that makes things messy, it is ok to have both in the same PR

Sorry, it was for m-mapped chunks, not OoO, didn't touch that part at all.
Yes, I can skip the mmap commit.

@codesome
Member

codesome commented Mar 27, 2023

> Sorry, it was for m-mapped chunks, not OoO, didn't touch that part at all.

Oh, okay. Yeah let's skip the mmap chunks for now. I did not look at the code and assumed it was for the ooo head chunk. Would be interesting to see for mmap chunks later as well, but that list would be longer than the head chunk list and might have some performance implications.

@prymitive
Contributor Author

>> Sorry, it was for m-mapped chunks, not OoO, didn't touch that part at all.
>
> Oh, okay. Yeah let's skip the mmap chunks for now. I did not look at the code and assumed it was for the ooo head chunk. Would be interesting to see for mmap chunks later as well, but that list would be longer than the head chunk list and might have some performance implications.

The mmap list would only be long(er) if you use a scrape interval that's small enough to create enough chunks. With a 2h block duration and a 1m scrape interval we end up with at most 1 mmapped chunk. But with a 30s scrape interval we end up with up to 6.

At that point (having 6 mmap chunks) we not only pay more for the linked list but also have to deal with the fact that tsdb code will iterate mmap chunks in both directions, which adds a bit of code when there's only a single link. Slices are simply a bit easier to iterate in whatever way one wishes.
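A minimal sketch of that iteration-direction point (illustrative names, not the PR code): with only a prev link, the natural walk is newest-to-oldest, so an oldest-to-newest pass has to buffer the nodes (or recurse) first:

```go
package main

import "fmt"

// memChunk is reduced to the list shape for illustration.
type memChunk struct {
	minTime int64
	prev    *memChunk // older chunk, nil for the oldest element
}

// oldestFirst reverses the natural newest-to-oldest walk by collecting
// the nodes into a slice and flipping it - the extra code a slice-backed
// implementation would not need.
func oldestFirst(head *memChunk) []*memChunk {
	var out []*memChunk
	for c := head; c != nil; c = c.prev {
		out = append(out, c)
	}
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}

func main() {
	head := &memChunk{minTime: 2, prev: &memChunk{minTime: 1, prev: &memChunk{minTime: 0}}}
	for _, c := range oldestFirst(head) {
		fmt.Print(c.minTime, " ") // prints: 0 1 2
	}
	fmt.Println()
}
```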

@bboreham
Member

> pay more for linked list

In what sense?

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
As suggested, we should m-map chunks immediately during WAL replay to avoid memory spikes from a large number of in-memory chunks.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
@prymitive
Contributor Author

Rebased again

Member

@jesusvazquez jesusvazquez left a comment


Hello @prymitive! Awesome work 💪

I've reviewed this and I think we are ready to merge this. Seems there is enough coverage from the unit tests and your changes are looking good. Please note this is the kind of PR where the contributor has spent many hours and the reviewer a few minutes so I might have missed something! I did look thoroughly though!

I have an ask though: the changes are only present in the in-order part of memSeries. Some memSeries may now also have out-of-order (OOO) data if the setting is enabled, and to keep things consistent I'd like both implementations of the headChunk to be identical.

Because this PR is already quite large, we can do this in one of two ways: a) add the changes to this PR before we merge it, or b) merge this now and work on updating the OOO part separately.

I'm fine with either way but I'm inclined to go with b only if you think you'll have some time to work on this in the coming weeks/months else I'd rather wait.

I'll have this PR open to watch out for your response!

@prymitive
Contributor Author

Thanks for reviewing it! It would be great if we could finally get this one done.
Given that this PR has been open for over a year now (see #10709) I'd rather not risk it waiting even more for another round of reviews (especially if it was to grow even more).
Could we please merge it now and I'll follow up with the OoO changes?
I would also like to try converting the m-mapped chunks slice to a linked list, so that would be another potential follow-up.

@jesusvazquez jesusvazquez merged commit 3c80963 into prometheus:main Jul 31, 2023
@jesusvazquez
Member

jesusvazquez commented Jul 31, 2023

@prymitive congrats on getting this done! Merging was the easy part.

Please feel free to cc me in the follow up PRs. I'm interested in having a similar implementation with the ooo head chunk so we reduce the cognitive load on future readers and improve consistency in the codebase.

Converting the m-mapped chunks to a linked list also sounds interesting and challenging 👍

@jesusvazquez
Member

/prombench v2.46.0

@prombot
Contributor

prombot commented Aug 3, 2023

⏱️ Welcome to Prometheus Benchmarking Tool. ⏱️

Compared versions: PR-11818 and v2.46.0

After successful deployment, the benchmarking metrics can be viewed at:

Other Commands:
To stop benchmark: /prombench cancel
To restart benchmark: /prombench restart v2.46.0

@jesusvazquez
Member

(Just want to run a benchmark for 1-2 days after the rebase changes to make sure everything works fine.)

@prombot
Contributor

prombot commented Aug 6, 2023

Benchmark tests are running for 3 days! If this is intended ignore this message otherwise you can cancel it by commenting: /prombench cancel

@prombot
Contributor

prombot commented Aug 8, 2023

Benchmark tests are running for 5 days! If this is intended ignore this message otherwise you can cancel it by commenting: /prombench cancel

@roidelapluie
Member

/prombench cancel

@prombot
Contributor

prombot commented Aug 8, 2023

Benchmark cancel is in progress.

@prymitive
Contributor Author

> Please feel free to cc me in the follow up PRs. I'm interested in having a similar implementation with the ooo head chunk so we reduce the cognitive load on future readers and improve consistency in the codebase.
>
> Converting the m-mapped chunks to a linked list also sounds interesting and challenging 👍

Actually @jesusvazquez, the mmapped chunks slice is the easy one; ooo, on the other hand, isn't.
From what I can see, ooo uses mmapped chunks in a different way than regular memChunk does.
With head chunks for regular samples, mmapping was just a way of offloading "closed" (no longer accepting new samples) chunks from memory, and so it was really an implementation detail whether it was a slice or a linked list, or when the mmapping would occur. The main challenge was that any change to it involved touching a lot of places.

With ooo mmapping is used differently, like here:

https://github.com/prometheus/prometheus/blob/main/tsdb/head_append.go#L953-L973

Mmap references are also used here:

https://github.com/prometheus/prometheus/blob/main/tsdb/record/record.go#L393

There's a lot of this.

Overall it seems like mmapping of chunks in ooo code is more of a design decision than an implementation detail. Which raises a few different issues for me:

  • refactoring ooo code to mmap chunks in the background might require a redesign of ooo code, especially given that mmapping seems to be almost required there
  • it's unclear if mmapping chunks in the background would help ooo code in any way
  • the goal of "reduce cognitive load of future reader and improve consistency in the codebase" might be difficult to achieve given how different ooo logic already is

prymitive added a commit to prymitive/prometheus that referenced this pull request Aug 10, 2023
Follow up on prometheus#11818 and make the list of mmapped chunks a linked list as well.

Signed-off-by: Łukasz Mierzwa <l.mierzwa@gmail.com>
@bboreham
Member

I was a bit shocked to discover that this PR made PromQL benchmarks 3x slower, but this turns out to be down to the benchmark creating an unusual number of head chunks in the list. See #12755.

However I had already started thinking about how to reduce the number of times the list is walked.
I'd be interested in your thoughts. main...bboreham:head-chunk-list
