
Fix quadratic file reading in restic mount #2790

Merged
MichaelEischer merged 3 commits into restic:master from greatroar:fix-quadratic-read on Jul 12, 2020

Conversation

@greatroar (Contributor) commented Jun 14, 2020

What is the purpose of this change? What does it change?

Doing a single read on a file in a restic mount takes time linear in the size of the file, since each read does a linear scan through an array of blob sizes, then another linear scan through an array of cached blob contents. That means that reading a whole file takes quadratic time O(n²/B), with B the block size.

This patch changes the linear scans to a binary search and a constant time operation. I measured the effect by backing up a 100GiB file of zeros (204800 blobs), then running cat inside a mount. The result is a more than 2× speedup. Before:

$ /usr/bin/time cat snapshots/latest/stdin >/dev/null 
2.45user 228.15system 1:07:33elapsed 5%CPU (0avgtext+0avgdata 2104maxresident)k
8inputs+0outputs (1major+132minor)pagefaults 0swaps

After:

$ /usr/bin/time cat snapshots/latest/stdin >/dev/null 
1.07user 138.08system 26:04.32elapsed 8%CPU (0avgtext+0avgdata 1960maxresident)k
0inputs+0outputs (0major+133minor)pagefaults 0swaps

That's half an hour wasted scanning arrays. Obviously, this gets much worse at terabyte scale.
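The idea behind the fix can be sketched as follows (illustrative Go with hypothetical names, not restic's actual code): precompute cumulative blob start offsets once, then locate the blob containing a given file offset with an O(log n) binary search instead of a linear scan.

```go
package main

import "fmt"

// cumOffsets turns per-blob sizes into cumulative start offsets:
// off[i] is the file offset where blob i starts, and the final
// entry is the total file size.
func cumOffsets(sizes []int) []int64 {
	off := make([]int64, len(sizes)+1)
	for i, s := range sizes {
		off[i+1] = off[i] + int64(s)
	}
	return off
}

// blobAt returns the index of the blob containing file offset pos.
// Loop invariant: off[lo] <= pos < off[hi].
func blobAt(off []int64, pos int64) int {
	lo, hi := 0, len(off)-1
	for hi-lo > 1 {
		mid := (lo + hi) / 2
		if off[mid] <= pos {
			lo = mid
		} else {
			hi = mid
		}
	}
	return lo
}

func main() {
	off := cumOffsets([]int{512, 1024, 256}) // starts: 0, 512, 1536; size 1792
	fmt.Println(blobAt(off, 0), blobAt(off, 511), blobAt(off, 512), blobAt(off, 1791))
	// prints: 0 0 1 2
}
```

With the cumulative offsets computed once per file, each read pays O(log n) for the lookup rather than O(n), which removes the quadratic behaviour.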

The cache semantics have changed, but IMHO the old semantics didn't make much sense. Reading blob i would flush blobs 0 through i-1 from the cache, so reading backwards would cause the whole file to get cached in RAM without any eviction, while reading forwards would only cache the last block, with a linear-time eviction every time.

Existing tests pass; they fail when I introduce off-by-one errors in the new block skipping code.

Was the change discussed in an issue or in the forum before?

#2587 optimizes reading performance, but misses the quadratic time issue.

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@MichaelEischer (Member) left a comment:

I like the usage of a binary search. It might be possible to speed this up even further by caching the result of the search.

The new cache implementation loses one useful property of the old one: if file blocks are read before and after a blob boundary, the old implementation ends up caching both blobs. As blobs are always read from the repository, this could cause a large slowdown. (I think PDF files trigger rather erratic disk access patterns.)

The proper solution here would be something like a global LRU cache that stores the last 50-100 MB of accessed blobs.

@greatroar force-pushed the fix-quadratic-read branch from e035196 to d868196 on June 17, 2020 10:29
@greatroar (Contributor, Author):

I added an LRU cache. Regardless of whether the kernel caches FUSE blocks, it can still be useful when comparing files across snapshots.
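A minimal LRU cache of the kind discussed here can be sketched with only the standard library (illustrative, hypothetical names; not the PR's actual implementation): a map for O(1) lookup plus a doubly linked list ordering entries by recency. Capacity is counted in blobs here for brevity.

```go
package main

import (
	"container/list"
	"fmt"
)

// blobCache is a minimal LRU cache mapping blob IDs to their contents.
type blobCache struct {
	maxEntries int
	order      *list.List // front = most recently used
	items      map[string]*list.Element
}

type cacheEntry struct {
	id   string
	data []byte
}

func newBlobCache(maxEntries int) *blobCache {
	return &blobCache{
		maxEntries: maxEntries,
		order:      list.New(),
		items:      make(map[string]*list.Element),
	}
}

// get returns the cached contents of blob id, marking it most recently used.
func (c *blobCache) get(id string) ([]byte, bool) {
	e, ok := c.items[id]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(e)
	return e.Value.(*cacheEntry).data, true
}

// add inserts (or refreshes) a blob and evicts the least recently
// used entry once the cache exceeds its capacity.
func (c *blobCache) add(id string, data []byte) {
	if e, ok := c.items[id]; ok {
		c.order.MoveToFront(e)
		e.Value.(*cacheEntry).data = data
		return
	}
	c.items[id] = c.order.PushFront(&cacheEntry{id: id, data: data})
	if c.order.Len() > c.maxEntries {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*cacheEntry).id)
	}
}

func main() {
	c := newBlobCache(2)
	c.add("blob-a", []byte("aaa"))
	c.add("blob-b", []byte("bbb"))
	c.get("blob-a")                // touch a; b is now the LRU entry
	c.add("blob-c", []byte("ccc")) // evicts b
	_, ok := c.get("blob-b")
	fmt.Println("blob-b still cached:", ok) // prints: blob-b still cached: false
}
```

Unlike the old scheme, eviction order depends only on access recency, so forward and backward reads behave symmetrically.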

@greatroar force-pushed the fix-quadratic-read branch 7 times, most recently from cfde670 to d89da7d on June 22, 2020 07:39
@greatroar (Contributor, Author):

@MichaelEischer Merge?

@greatroar closed this Jun 30, 2020
@greatroar reopened this Jun 30, 2020
@greatroar force-pushed the fix-quadratic-read branch from 55084f2 to ba702ad on June 30, 2020 12:46
@MichaelEischer (Member) left a comment:

The PR looks mostly fine. I just think that there's still a small race condition regarding the LRU-cache usage.

@greatroar force-pushed the fix-quadratic-read branch 2 times, most recently from d2bdda5 to 5a31546 on July 5, 2020 20:57
@aawsome (Contributor) left a comment:

I would suggest not to re-implement binary search.

The other changes look good to me.

@aawsome (Contributor) commented Jul 10, 2020

@greatroar I just realized that your last commit kicked out exactly "import sort".
Is the standard sort really slowing things down that much? Doesn't the closure get inlined?

Just took a look at the source:
https://golang.org/src/sort/search.go?s=2246:2286#L49

I can't believe this is slower than your code, unless this is really about too many function calls...

@greatroar (Contributor, Author) commented Jul 10, 2020

The speed difference is more than a factor of 2, see the commit message for 5a31546. And it really is failure to inline:

$ go build -gcflags=-m |& grep -F Read.func
./file.go:125:51: can inline (*file).Read.func1

Can inline, but no "inlined" message. From what I can tell from the assembler, it's the call (stack handling?) that's expensive.

Anyway, the throughput difference will be very small compared to the sort.Search version. I wrote the custom binary search first, but only pushed it in response to @MichaelEischer's performance concerns, to avoid the who-locks-what-in-FUSE debate. So it comes down to readability and I'll happily revert the last commit if the maintainers want it.
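For reference, the sort.Search variant under discussion looks roughly like this (illustrative, hypothetical names). The closure passed to sort.Search is the call that the compiler declares inlinable but does not actually inline into the search loop.

```go
package main

import (
	"fmt"
	"sort"
)

// blobIndex locates the blob containing file offset pos, given cumulative
// blob start offsets (off[i] = start of blob i, final entry = file size).
// sort.Search returns the smallest index for which the predicate is true,
// i.e. the first blob starting beyond pos; the containing blob is the one
// just before it.
func blobIndex(off []int64, pos int64) int {
	return sort.Search(len(off), func(i int) bool { return off[i] > pos }) - 1
}

func main() {
	off := []int64{0, 512, 1536, 1792}
	fmt.Println(blobIndex(off, 0), blobIndex(off, 512), blobIndex(off, 1791))
	// prints: 0 1 2
}
```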

@aawsome (Contributor) commented Jul 10, 2020

The speed difference is more than a factor of 2, see the commit message for 5a31546. And it really is failure to inline:

I also just did the comparison myself 😉; the missing inlining explains it. However, I would vote for the sort.Search solution for easier readability.

If we do want to improve further, I would suggest optimizing for the sequential read case, which is probably still the most common access pattern. That is, remember the last blob read (respectively its index into Content) and check whether the newly requested offset falls into it (or the next blob, if the last one was read completely). If not, fall back to the binary search, which should already be fast enough for all other kinds of access.
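That sequential fast path can be sketched as follows (illustrative Go, hypothetical names): try the previously used blob and its successor before falling back to binary search.

```go
package main

import (
	"fmt"
	"sort"
)

// file remembers the blob index used by the previous read so that
// sequential reads can skip the binary search entirely.
type file struct {
	off  []int64 // cumulative blob start offsets; final entry = file size
	last int     // blob index hit by the previous read
}

// blobAt first tries the previously used blob and its successor
// (the common sequential cases), then falls back to binary search.
func (f *file) blobAt(pos int64) int {
	for i := f.last; i <= f.last+1 && i < len(f.off)-1; i++ {
		if f.off[i] <= pos && pos < f.off[i+1] {
			f.last = i
			return i
		}
	}
	// Arbitrary seek: O(log n) lookup, then remember the result.
	f.last = sort.Search(len(f.off), func(i int) bool { return f.off[i] > pos }) - 1
	return f.last
}

func main() {
	f := &file{off: []int64{0, 512, 1536, 1792}}
	// The first three reads are sequential (fast path); the last seeks back.
	fmt.Println(f.blobAt(0), f.blobAt(600), f.blobAt(1700), f.blobAt(100))
	// prints: 0 1 2 0
}
```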

@MichaelEischer (Member):

If we do want to improve further, I would suggest optimizing for the sequential read case, which is probably still the most common access pattern. That is, remember the last blob read (respectively its index into Content) and check whether the newly requested offset falls into it (or the next blob, if the last one was read completely). If not, fall back to the binary search, which should already be fast enough for all other kinds of access.

@greatroar That's what I had in mind for "caching the result of the search", but in my test with a 60GB file the binary search seemed to be fast enough. I actually prefer the 'sort.Search' solution to the reimplemented binary search.

I've noticed that the PR now reuses the simplelru package instead of adding another dependency to restic; I think it's rather important to keep the number of dependencies at a reasonable level.

So the only thing remaining is a small changelog entry that mentions the optimized file access in the mount command. That changelog entry could later also be extended to cover the other mount PRs.

@aawsome (Contributor) commented Jul 11, 2020

@greatroar That's what I had in mind for "caching the result of the search", but in my test with a 60GB file the binary search seemed to be fast enough. I actually prefer the 'sort.Search' solution to the reimplemented binary search.

I tested all three variants:

  • sort.Search
  • the manual binary search implemented by @greatroar
  • saving the index to be used by the next sequential read

with a 10GB file containing zeros (thus able to reuse the already-cached blob), using time cat file >/dev/null. This means the third variant always gets a hit here.

However, there was no notable time difference, and the searching part could not be seen in the CPU profile.
IMO this speaks for using the simple sort.Search solution.

@MichaelEischer (Member):

@aawsome I took a look at the blob lookup performance using the CPU profiler. Even for a 50GB file, the old simple loop consumed just 10% of the time in fuse/file.Read. As the slowdown is quadratic, this means the replaced code probably didn't consume more than 1% of the Read method's runtime at smaller file sizes, even with the inefficient implementation. So in total you'll only see a noticeable difference for 100GB+ files.

@greatroar force-pushed the fix-quadratic-read branch from 668b728 to e6cd22f on July 12, 2020 08:16
@greatroar (Contributor, Author):

OK, I've added a changelog entry and restored the sort.Search version.

@MichaelEischer (Member) left a comment:

LGTM, thanks. I've rebased the branch onto the current master and force pushed it to restart the CI run, which once again had problems with the rclone test.

@MichaelEischer merged commit b84f517 into restic:master on Jul 12, 2020
@greatroar deleted the fix-quadratic-read branch on July 13, 2020 04:11