Skip to content

Hines/thread padding#2951

Open
nrnhines wants to merge 10 commits into
masterfrom
hines/thread-padding
Open

Hines/thread padding#2951
nrnhines wants to merge 10 commits into
masterfrom
hines/thread-padding

Conversation

@nrnhines

Copy link
Copy Markdown
Member

An experiment to see if performance improves if destructive (write) cacheline sharing is avoided (cache lines are assumed to be 64bytes).

Using the model of #2787, improvement is small or nonexistent on an Apple M1. Timing results for 8 threads are

nthread   8.2.4.   master.  padding(this PR)
8 run.      16.69.   17.57.       17.38
8 solve    15.41     15.56.      15.49

Perhaps there is a datahandle overhead for plotting that could be solved by caching.

@azure-pipelines

Copy link
Copy Markdown

✔️ 7492f82 -> Azure artifacts URL

@bbpbuildbot

This comment has been minimized.

@azure-pipelines

Copy link
Copy Markdown

✔️ 7ab6c4e -> Azure artifacts URL

@bbpbuildbot

This comment has been minimized.

@mgeplf

mgeplf commented Jul 5, 2024

Copy link
Copy Markdown
Collaborator

An experiment to see if performance improves if destructive (write) cacheline sharing is avoided (cache lines are assumed to be 64bytes).

Do you have a sense how often there is false sharing? IE: How often they are written to?
I'm not familiar enough with the code, but rather than padding, most of the time I've seen false sharing issues, it can be fixed by each thread having its own data area, and only doing a single write to potential shared cache lines once it's done the work.

@nrnhines

nrnhines commented Jul 5, 2024

Copy link
Copy Markdown
Member Author

Do you have a sense how often there is false sharing? IE: How often they are written to?

Not really. We've been circling around the 8.2 vs 9.0 performance issue #2787 for quite a while and have made a lot of progress (mostly caching pointers to doubles such as diam and area). With perf I got close to measuring false sharing but didn't get to the point where I could focus on a specific section of code. As you can see from this PR, the code changes for the padding experiment were quite simple and I'm delighted that the SoA permutation support allowed it to be carried out.

I'm not familiar enough with the code, but rather than padding, most of the time I've seen false sharing issues, it can be fixed by each thread having its own data area, and only doing a single write to potential shared cache lines once it's done the work.

You are exactly correct. Ironically, that was how it was done in long ago precursors to 8.2. In 9.0, threads constitute merely a permutation of the data (into thread partitions) where each partition for each SoA variable (with this PR) begins on a cacheline boundary. It is actually quite beautiful to me to see the tremendous simplification of threads in 9.0 into a fairly trivial low level permutation. The 8.2 and before implementation of threads was a far reaching and complex change involving a great deal of memory management and copying between representations.

At the moment, even with this PR, 8.2 has a slight performance edge over 9.0 on the #2787 model issue with gcc. That performance edge disappears for the Apple M1 and is much smaller with the intel compiler on x86_64.

The bottom line so far with this experiment is that it appears that destructive cache line sharing is not a significant performance issue for #2787 . But pinning that down with confidence requires a bit more testing.

@pramodk

pramodk commented Jul 8, 2024

Copy link
Copy Markdown
Member

With perf I got close to measuring false sharing but didn't get to the point where I could focus on a specific section of code

With the perf c2c we discussed, did you see significant false sharing reported?

Just for reference, with h.tstop = 1000, the timings for this PR against the master:

Compiler Build Type Threads Time (seconds)
gcc master 8 11.45
gcc master 8 11.48
gcc padding 8 11.11
gcc padding 8 11.08
intel master 8 8.37
intel master 8 8.39
intel padding 8 8.34
intel padding 8 8.41

@mgeplf

mgeplf commented Jul 8, 2024

Copy link
Copy Markdown
Collaborator

With perf I got close to measuring false sharing but didn't get to the point where I could focus on a specific section of code.

Interesting, it's too bad more concrete proof didn't surface.

You are exactly correct. Ironically, that was how it was done in long ago precursors to 8.2. [...]

Also very interesting. It's certainly nice that it's relatively simple to implement with the updated system.

The bottom line so far with this experiment is that it appears that destructive cache line sharing is not a significant performance issue for #2787 .

and

[...] the timings for this PR against the master:

Whenever I see small perf changes, like the above, I always remember the Producing Wrong Data Without Doing Anything Obviously Wrong!. It's old, but the take away is pretty stunning; that there can be measurement bias simply due to link order, or environment size. Their showings aren't necessarily applicable here, it's more that small perf improvements may be have different root causes, and teasing out causal relationships is important.

@azure-pipelines

Copy link
Copy Markdown

✔️ 19c2887 -> Azure artifacts URL

@bbpbuildbot

This comment has been minimized.

@sonarqubecloud

Copy link
Copy Markdown

@azure-pipelines

Copy link
Copy Markdown

✔️ 13d7fd7 -> Azure artifacts URL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants