-
Notifications
You must be signed in to change notification settings - Fork 14
Dirty tracking performance improvements #210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
csegarragonz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, very minor changes but will approve so that you don't have to re-request a review.
| /* | ||
| * Returns a list of flags marking which bytes differ between the two arrays. | ||
| */ | ||
| std::vector<bool> diffArrays(std::span<const uint8_t> a, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this a bit-wise XOR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's currently byte-wise, not bit-wise, but this may change. I noticed that this function isn't actually used, so I should probably just delete it.
|
|
||
| uint32_t diffPageStart = 0; | ||
| bool diffInProgress = false; | ||
| for (int i = 0; i < nPages; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, isn't this essentially the same as faabric::util::getDiffRegions()?
eigenraven
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just had a quick look at this, because it sounds very similar to what I'm doing to reduce the size of incremental snapshots for offloading in my project. I'd suggest not creating byte-wise diffs, as that's pretty costly performance-wise, you can play around with variants in the quickbench link I put in a comment. The code in faabric/util/delta uses a configurable "page" size (any difference in a page = emit a diff for the whole page), in my testing around 64 yielded the best performance without making the diffs larger than a few percent - you might want to test this for your applications. The code could be relatively easily modified to take in an array of modified OS pages to skip known-unmodified pages, I'm happy to do this if you're interested as it's on my todo list anyway, and then you could use the generate/applyDelta functions
|
|
||
| std::vector<bool> diffs(a.size(), false); | ||
| for (int i = 0; i < a.size(); i++) { | ||
| diffs[i] = a.data()[i] != b.data()[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be very slow due to the vector<bool> specialization, just swapping the type to uint8_t (at the cost of 8x more memory) makes it 11x faster: https://quick-bench.com/q/d8INA76hIDM3m4gfC4xJKI4HcQM
|
I'm going to merge this and take discussions of improvements and tweaks offline. |
This PR is the third (and final) in a series of overhauls of the threading, snapshotting and dirty tracking model.
This PR addresses the underlying issue of the performance of the dirty tracking, which is currently based on soft dirty PTEs. While it works, it has a couple of shortcomings:
/proc/self/clear_refsand reading from/proc/self/pagemaprespectively, which can be a drain on performance when done repeatedly in a tight loop.Ideally we would do this tracking with
userfaultfdwrite-protected pages, however, this is only available in kernels 5.7+, i.e. Ubuntu 22+ which isn't yet stable. In the interim we can use anmprotect+SIGSEGVapproach, where we make pages read-only usingmprotect, then catch the resulting segfault when they're written to, and mark them as dirty.Although in our current use-cases the PTEs are less performant that the segfault approach, this may not be the case for all workloads, so I'd like to keep the soft PTEs approach around.
PTEs, segfaults and
userfaultfdall result in a different approach to aggregating diffs for a batch of threads. PTEs are system-wide, while segfaults are handled by individual threads (so we have to introduce thread-local tracking), anduserfaultfdwould use a single background tracker thread per application. To abstract the boilerplate for doing this and support switching more easily, I've introduced a singleDirtyTrackerclass that abstracts all the details, and a configuration parameterDIRTY_TRACKING_MODEwhich can be set tosoftpteorsegfaultfor now, anduffdin future.This results in quite a few changes to the code:
DirtyTrackerclass.Executorclass (previously this was scattered across Faabric and Faasm).SnapshotDataclass.