[DNM] Performance changes to reduce false sharing - for discussion. #9431
joemario wants to merge 1 commit into ceph:hammer from
Conversation
1) Align classes and contended hot data members to cache-line boundaries. 2) Demonstrate how expensive OpTracker::_mark_event can be, and discuss opportunities.
@joemario I would love to resurrect this and get the simple changes (e.g., cacheline alignments) into mergeable patches...
I went back and found the presentation (because Mark indexed all the recordings; thank you, Mark): https://bluejeans.com/playback/s/ha97eh8zOC5OPKy63VBWM1XSIDCAs6UVV8dWWhu04RyvrRjOuEG13JZ6dEpOiczK slides at:
Joe's primary finding was that the C++ iostream class usage in Ceph's logging code was the primary cause of cacheline contention, or "false sharing". Joe's PR was there so he could present an analysis of exactly where the problem was in the Ceph code, and was admittedly not ready to be pulled in as-is. But a huge part of the work of performance optimization is analysis, and IMHO Joe did just about all that was needed there. Is there a Ceph developer who could help? This probably needs to be broken up into commit(s) to remove the iostream problem and commit(s) to add cacheline alignment to Ceph data structures. IMHO optimizations like this are extremely relevant to helping Ceph keep up with faster storage and networks.
Sage:
To follow up on Ben's reply earlier, here are the performance opportunities I saw:
* The biggest slowdown I saw came from using the slow stringstream objects in the default logging path. It showed up mostly under mark_event(). I understand some Ceph benchmarking turns off the default logging in /etc/ceph/ceph.conf. Fixing this stringstream cost should recover most of the performance that you get by disabling that default logging. One of the C++ compiler developers looked at that Ceph code and came up with ideas for how to replace stringstream.
* The next biggest item was fixing the CPU cacheline false sharing. The PoC patches I submitted last year show how easy some of those are to fix. Although I saw only a 1-3% speedup in my small "1-disk/2-osd" setup, the speedup should be larger in a larger, CPU-bound Ceph setup. It will also be larger once the above stringstream problem is fixed.
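To illustrate the kind of change the cacheline-alignment PoC patches make (with invented names here, not the actual Ceph members), a minimal before/after sketch:

```cpp
#include <atomic>
#include <cstdint>

// Before: two hot counters written by different threads usually land in
// the same 64-byte cache line, so each write invalidates the line in the
// other CPU's cache ("false sharing").
struct StatsShared {
  std::atomic<uint64_t> ops{0};    // bumped by thread A
  std::atomic<uint64_t> bytes{0};  // bumped by thread B
};

// After: give each hot member its own cache line so the writers no
// longer interfere with each other.
struct StatsPadded {
  alignas(64) std::atomic<uint64_t> ops{0};
  alignas(64) std::atomic<uint64_t> bytes{0};
};

static_assert(sizeof(StatsShared) <= 64, "both counters share one line");
static_assert(sizeof(StatsPadded) >= 128, "one 64-byte line per counter");
```

The cost is only memory (padding); whether it pays off depends on how contended the members actually are, which is what perf c2c measures.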
I'd be more than happy to work with someone to reprofile Ceph on a larger installation for false sharing and to show them all the source code locations that need to be modified. (I'm not a Ceph developer and don't want to take the time to learn the Ceph patch testing and submission process).
Let me know if anyone is interested in working with me on this, and whether you want me to connect you with the C++ compiler developer for his ideas on removing the stringstream slowdown. He's happy to consult.
Joe
…----- Original Message -----
From: "Sage Weil" ***@***.***>
To: "ceph/ceph" ***@***.***>
Cc: "joemario" ***@***.***>, "Mention" ***@***.***>
Sent: Tuesday, March 28, 2017 2:25:26 PM
Subject: Re: [ceph/ceph] [DNM] Performance changes to reduce false sharing -
for discussion. (#9431)
@joemario I would love to resurrect this and get the simple changes (e.g.,
cacheline alignments) into mergeable patches...
  {
    if (!tracking_enabled)
      return;
@joemario Please keep the PR title under 50 characters for future PRs, and prepend the component name to the PR title.
https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst#title-of-pull-requests-and-title-of-commits
@aclamk do you mind looking at this and seeing if we can/should apply it? |
@liewegas Idea is sound; the implementation could use some polishing. Should I make an adaptation of this for master?
Hi Adam and Sage:
The "perf c2c" cacheline contention functionality got checked in upstream
last winter, so it's now available. It's also in RHEL 7.4 and CentOS
7.4. If the setup is still on CentOS 7.3, the 7.4 version of perf can be
copied back to 7.3 and it should work. (No guarantee, but it always has for
me.)
Since I did the work for that PR on a very small setup (single system ceph
setup with one SSD), and since the ceph codebase has changed considerably,
I recommend rerunning the perf c2c tool on the latest ceph codebase in a
much larger environment. The larger environment will more clearly show the
cost of cacheline contention.
I'm not a ceph developer, but am intimately familiar with the perf c2c tool
and interpreting its output. I'd be very happy to work with a ceph
developer on running the tool and analyzing the results.
The skipping of _mark_event in my PR was only to demonstrate the
performance cost of using C++ iostream in the default logging path. One of
our C++ compiler developers looked at the ceph logging implementation
(_mark_event) and had recommendations for speeding it up. I can pull him
in to work with someone if the interest is there.
That _mark_event code path is the biggest bottleneck I found with respect
to contention, and it will only get bigger on larger installations.
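As a hedged sketch of the pattern being described (this is not Ceph's actual logging API; the function and variable names here are invented), the iostream cost comes from formatting the message even when it will never be emitted, and the fix is to gate the formatting behind a cheap level check:

```cpp
#include <atomic>
#include <cstdio>
#include <iostream>
#include <sstream>

// Hypothetical log level; Ceph's real per-subsystem gather levels differ.
std::atomic<int> g_log_level{0};

// Costly pattern: the ostringstream is constructed and the message
// formatted on every call, even when the result is thrown away.
void mark_event_slow(const char* evt, int seq) {
  std::ostringstream ss;               // allocation + locale machinery
  ss << "event=" << evt << " seq=" << seq;
  if (g_log_level.load(std::memory_order_relaxed) > 0)
    std::cout << ss.str() << '\n';
}

// Cheaper pattern: check the level first, then format with snprintf into
// a stack buffer only when the message will actually be emitted.
void mark_event_fast(const char* evt, int seq) {
  if (g_log_level.load(std::memory_order_relaxed) <= 0)
    return;                            // near-zero cost when disabled
  char buf[128];
  std::snprintf(buf, sizeof(buf), "event=%s seq=%d", evt, seq);
  std::cout << buf << '\n';
}
```

With logging disabled, the fast path is one relaxed load and a branch, while the slow path still pays the full iostream construction cost on every call.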
I did try both grouping together and separating the two m_queue_mutex*
members you mention below. I had mixed results, probably due to my small
setup. I agree with you that if they're used together they can be
grouped, but it should be tested first, as there are a few cases where
that's not always true.
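The two layouts being compared can be sketched like this (the member names follow the PR's m_queue_mutex/m_queue_mutex_holder; the wrapping structs are invented for illustration):

```cpp
#include <pthread.h>

// Separated: each field is padded onto its own cache line. This avoids
// false sharing with neighboring fields, but a lock-then-record-holder
// sequence now touches two cache lines.
struct QueueLockSeparated {
  alignas(64) pthread_mutex_t m_queue_mutex;
  alignas(64) pthread_t m_queue_mutex_holder;
};

// Grouped: both fields share one dedicated cache line, on the theory
// that they are always written together (take the lock, then record
// the holder thread).
struct alignas(64) QueueLockGrouped {
  pthread_mutex_t m_queue_mutex;
  pthread_t m_queue_mutex_holder;
};
```

Which layout wins depends on the actual access pattern, which is why measuring first (e.g. with perf c2c) matters before committing to either.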
I'm busy till mid October, but after that if you want to get a reasonably
sized ceph installation (where the OSDs can be pushed hard), I'd be happy
to work with you on rerunning and analyzing perf c2c to find the contention
locations in the latest codebase and to attack the iostream logging issue.
Joe
…On Thu, Sep 14, 2017 at 6:53 AM, Adam Kupczyk ***@***.***> wrote:
@liewegas <https://github.com/liewegas>
Idea is sound. Having a cache entry changed by one CPU and read by another
is known to be slower.
Strategies like the one proposed here are likely to slightly improve
performance.
Downside: it is difficult to measure precisely.
Implementation could use some polishing:
1. Hot variables that are used together should go into one cache line; for example
pthread_mutex_t m_queue_mutex __attribute__((aligned(64)));
pthread_t m_queue_mutex_holder __attribute__((aligned(64)));
should become a struct.
2. I guess alignas should replace __attribute__((aligned)).
3. We can't skip the _mark_event method, can we?
Should I make an adaptation of this for master?
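Adam's second point, in standard C++ terms: alignas (C++11) expresses the same alignment request as the GCC/Clang-specific attribute used in the PR, and is portable across compilers. A small sketch (member name reused from the PR for illustration; the struct is invented):

```cpp
#include <pthread.h>

struct OpQueue {
  // GCC/Clang extension, as written in the PR:
  //   pthread_mutex_t m_queue_mutex __attribute__((aligned(64)));

  // Standard C++11 spelling of the same request:
  alignas(64) pthread_mutex_t m_queue_mutex;
};

// The member's alignment requirement propagates to the enclosing struct.
static_assert(alignof(OpQueue) == 64,
              "struct inherits the 64-byte alignment of its member");
```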
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
This commit demonstrates the benefit of reducing cacheline contention, in two ways:
This commit is not to be checked in as-is.
Although I am not a Ceph developer, I have been able to realize worthwhile performance gains from these changes. Ben England and I would like to share these findings at an upcoming weekly Ceph performance meeting.