Fix quadratic file reading in restic mount #2790
MichaelEischer merged 3 commits into restic:master from
Conversation
MichaelEischer
left a comment
I like the usage of a binary search. It could be possible to speed this up even further by caching the result of the search.
The new cache implementation loses one useful property of the old one: if file blocks are read before and after a blob boundary, the old implementation would end up caching both blobs. As blobs are always read from the repository in full, this could cause a large slowdown. (I think PDF files trigger rather erratic disk access patterns.)
The proper solution here would be something like a global LRU cache. That could store the last 50-100 MB of accessed blobs.
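A global LRU cache along these lines can be sketched with Go's standard `container/list`. This is only an illustration of the idea (the PR ended up reusing the simplelru package instead); all type and function names here are made up:

```go
package main

import (
	"container/list"
	"fmt"
)

// blobCache is a hypothetical fixed-capacity LRU cache for blob contents.
// It caps the number of cached blobs rather than total bytes, to keep the
// sketch short; a byte-budget variant would track sizes on put/evict.
type blobCache struct {
	capacity int // maximum number of cached blobs
	order    *list.List
	entries  map[int]*list.Element
}

type cacheEntry struct {
	key  int
	blob []byte
}

func newBlobCache(capacity int) *blobCache {
	return &blobCache{
		capacity: capacity,
		order:    list.New(),
		entries:  make(map[int]*list.Element),
	}
}

func (c *blobCache) get(key int) ([]byte, bool) {
	if e, ok := c.entries[key]; ok {
		c.order.MoveToFront(e) // mark as most recently used
		return e.Value.(*cacheEntry).blob, true
	}
	return nil, false
}

func (c *blobCache) put(key int, blob []byte) {
	if e, ok := c.entries[key]; ok {
		c.order.MoveToFront(e)
		e.Value.(*cacheEntry).blob = blob
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back() // evict least recently used
		c.order.Remove(oldest)
		delete(c.entries, oldest.Value.(*cacheEntry).key)
	}
	c.entries[key] = c.order.PushFront(&cacheEntry{key: key, blob: blob})
}

func main() {
	c := newBlobCache(2)
	c.put(0, []byte("a"))
	c.put(1, []byte("b"))
	c.get(0)              // touch blob 0 so blob 1 becomes least recently used
	c.put(2, []byte("c")) // evicts blob 1
	_, ok := c.get(1)
	fmt.Println("blob 1 cached:", ok)
}
```

Unlike a per-file cache, a global cache like this keeps blobs alive across file handles, which also helps when comparing files across snapshots.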
Force-pushed from e035196 to d868196
I added an LRU cache. Regardless of whether the kernel caches FUSE blocks, it can still be useful when comparing files across snapshots.
Force-pushed from cfde670 to d89da7d
@MichaelEischer Merge?
Force-pushed from 55084f2 to ba702ad
MichaelEischer
left a comment
The PR looks mostly fine. I just think there's still a small race condition in the LRU cache usage.
Force-pushed from d2bdda5 to 5a31546
aawsome
left a comment
I would suggest not to re-implement binary search.
The other changes look good to me.
@greatroar Just realized that you kicked out "import sort" with your last commit. I just took a look at the source: I can't believe this is slower than your code, unless it really comes down to too many function calls...
The speed difference is more than a factor of 2, see the commit message for 5a31546. And it really is a failure to inline: the compiler reports "can inline", but prints no "inlined" message. From what I can tell from the assembler, it's the call (stack handling?) that's expensive. Anyway, the throughput difference will be very small compared to the sort.Search version. I wrote the custom binary search first, but only pushed it in response to @MichaelEischer's performance concerns, to avoid the who-locks-what-in-FUSE debate. So it comes down to readability, and I'll happily revert the last commit if the maintainers want that.
Also just did the comparison myself 😉 - the missing inlining explains it. However, I would vote for the sort.Search solution for easier readability. If we do want to improve further, I would suggest optimizing for the sequential-read case, which is likely still the most common access pattern. That is, remember the last blob read (or rather its index into Content) and check whether the newly requested offset falls into it (or into the next blob, if the last one was read completely). If not, start the binary search, which should already be fast enough for all other kinds of access.
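The suggested fast path might look something like the sketch below. It assumes a precomputed array of cumulative blob end offsets; the type and field names are illustrative, not restic's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// file is a hypothetical handle for a file stored as a sequence of blobs.
type file struct {
	cumsize []uint64 // cumsize[i] = offset just past the end of blob i
	lastIdx int      // index of the most recently accessed blob
}

// blobAt returns the index of the blob containing byte offset off.
func (f *file) blobAt(off uint64) int {
	// Fast path: sequential reads hit the last blob or the one after it.
	for i := f.lastIdx; i <= f.lastIdx+1 && i < len(f.cumsize); i++ {
		start := uint64(0)
		if i > 0 {
			start = f.cumsize[i-1]
		}
		if start <= off && off < f.cumsize[i] {
			f.lastIdx = i
			return i
		}
	}
	// Slow path: binary search for random access.
	i := sort.Search(len(f.cumsize), func(i int) bool { return f.cumsize[i] > off })
	f.lastIdx = i
	return i
}

func main() {
	// Three blobs of sizes 10, 20, 30 bytes.
	f := &file{cumsize: []uint64{10, 30, 60}}
	fmt.Println(f.blobAt(5), f.blobAt(12), f.blobAt(35), f.blobAt(0))
}
```

Sequential reads then cost O(1) per lookup, while random access falls back to the O(log n) search.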
@greatroar That's what I had in mind for "caching the result of the search", but in my test with a 60GB file the binary search seemed to be fast enough. I actually prefer the sort.Search solution to the reimplemented binary search. I've noticed that the PR now reuses the simplelru package instead of adding another dependency to restic; I think it's rather important to keep the number of dependencies at a reasonable level. So the only thing remaining is a small changelog entry that mentions the optimized file access in the mount command. That changelog entry could later be extended to cover the other mount PRs.
I tested all three variants with a 10GB file containing zeros (thus being able to reuse the already-cached blob). However, there was no notable time difference, and the searching part could not be seen in the CPU profile.
@aawsome I took a look at the blob lookup performance using the CPU profiler. Even for a 50GB file, the old simple loop just consumed 10% in
Force-pushed from 668b728 to e6cd22f
OK, added a changelog entry and restored the sort.Search version.
Force-pushed from 76779e9 to 4cf1c8e
What is the purpose of this change? What does it change?
Doing a single read on a file in a restic mount takes time linear in the size of the file, since each read does a linear scan through an array of blob sizes, then another linear scan through an array of cached blob contents. That means that reading a whole file takes quadratic time O(n²/B), with B the block size.
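The binary-search half of the change can be sketched roughly as follows, assuming the cumulative blob end offsets are precomputed once per file. The names are illustrative, not restic's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// precompute builds cumsize, where cumsize[i] is the offset of the first
// byte after blob i. This runs once per file, not once per read.
func precompute(blobSizes []int) []uint64 {
	cumsize := make([]uint64, len(blobSizes))
	var off uint64
	for i, s := range blobSizes {
		off += uint64(s)
		cumsize[i] = off
	}
	return cumsize
}

// blobAt returns the index of the blob containing byte offset off and the
// position of that byte within the blob, in O(log n) instead of O(n).
func blobAt(cumsize []uint64, off uint64) (idx int, within uint64) {
	// Find the first blob whose end offset lies beyond off.
	idx = sort.Search(len(cumsize), func(i int) bool {
		return cumsize[i] > off
	})
	if idx > 0 {
		within = off - cumsize[idx-1]
	} else {
		within = off
	}
	return idx, within
}

func main() {
	cumsize := precompute([]int{10, 20, 30}) // blob end offsets: 10, 30, 60
	for _, off := range []uint64{0, 9, 10, 29, 59} {
		i, w := blobAt(cumsize, off)
		fmt.Println(off, "-> blob", i, "offset", w)
	}
}
```

With this lookup plus a constant-time cache check per read, the per-read cost no longer grows with the number of blobs already read.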
This patch changes the linear scans to a binary search and a constant-time operation. I measured the effect by backing up a 100GiB file of zeros (204800 blobs), then running cat inside a mount. The result is a more than 2× speedup. Before:

```
$ /usr/bin/time cat snapshots/latest/stdin >/dev/null
1.07user 138.08system 26:04.32elapsed 8%CPU (0avgtext+0avgdata 1960maxresident)k
0inputs+0outputs (0major+133minor)pagefaults 0swaps
```

That's half an hour wasted scanning arrays. Obviously, this gets much worse at terabyte scale.

After:
The cache semantics have changed, but IMHO the old semantics didn't make much sense. Reading blob i would flush blobs 0 through i-1 from the cache, so reading backwards would cause the whole file to get cached in RAM without any eviction, while reading forwards would only cache the last block, with a linear-time eviction every time.
Existing tests pass; they fail when I introduce off-by-one errors in the new block skipping code.
Was the change discussed in an issue or in the forum before?
#2587 optimizes reading performance, but misses the quadratic time issue.
Checklist
- There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
- gofmt has been run on the code in all commits