Hines/thread padding by nrnhines · Pull Request #2951 · neuronsimulator/nrn

nrnhines · 2024-06-30T14:21:43Z

An experiment to see if performance improves if destructive (write) cacheline sharing is avoided (cache lines are assumed to be 64bytes).

Using the model of #2787, improvement is small or nonexistent on an Apple M1. Timing results for 8 threads are

nthread   8.2.4.   master.  padding(this PR)
8 run.      16.69.   17.57.       17.38
8 solve    15.41     15.56.      15.49

Perhaps there is a datahandle overhead for plotting that could be solved by caching.

so introduce Memb_list m_cache_offset in analogy with m_storage_offset.

azure-pipelines · 2024-06-30T14:46:46Z

✔️ 7492f82 -> Azure artifacts URL

azure-pipelines · 2024-07-04T17:20:17Z

✔️ 7ab6c4e -> Azure artifacts URL

mgeplf · 2024-07-05T07:35:51Z

An experiment to see if performance improves if destructive (write) cacheline sharing is avoided (cache lines are assumed to be 64bytes).

Do you have a sense how often there is false sharing? IE: How often they are written to?
I'm not familiar enough with the code, but rather than padding, most of the time I've seen false sharing issues, it can be fixed by each thread having its own data area, and only doing a single write to potential shared cache lines once it's done the work.

nrnhines · 2024-07-05T12:02:49Z

Do you have a sense how often there is false sharing? IE: How often they are written to?

Not really. We've been circling around the 8.2 vs 9.0 performance issue #2787 for quite a while and have made a lot of progress (mostly caching pointers to doubles such as diam and area). With perf I got close to measuring false sharing but didn't get to the point where I could focus on a specific section of code. As you can see from this PR, the code changes for the padding experiment were quite simple and I'm delighted that the SoA permutation support allowed it to be carried out.

I'm not familiar enough with the code, but rather than padding, most of the time I've seen false sharing issues, it can be fixed by each thread having its own data area, and only doing a single write to potential shared cache lines once it's done the work.

You are exactly correct. Ironically, that was how it was done in long ago precursors to 8.2. In 9.0, threads constitute merely a permutation of the data (into thread partitions) where each partition for each SoA variable (with this PR) begins on a cacheline boundary. It is actually quite beautiful to me to see the tremendous simplification of threads in 9.0 into a fairly trivial low level permutation. The 8.2 and before implementation of threads was a far reaching and complex change involving a great deal of memory management and copying between representations.

At the moment, even with this PR, 8.2 has a slight performance edge over 9.0 on the #2787 model issue with gcc. That performance edge disappears for the Apple M1 and is much smaller with the intel compiler on x86_64.

The bottom line so far with this experiment is that it appears that destructive cache line sharing is not a significant performance issue for #2787 . But pinning that down with confidence requires a bit more testing.

pramodk · 2024-07-08T10:08:11Z

With perf I got close to measuring false sharing but didn't get to the point where I could focus on a specific section of code

With the perf c2c we discussed, did you see significant false sharing reported?

Just for reference, with h.tstop = 1000, the timings for this PR against the master:

Compiler	Build Type	Threads	Time (seconds)
gcc	master	8	11.45
gcc	master	8	11.48
gcc	padding	8	11.11
gcc	padding	8	11.08
intel	master	8	8.37
intel	master	8	8.39
intel	padding	8	8.34
intel	padding	8	8.41

mgeplf · 2024-07-08T10:34:35Z

With perf I got close to measuring false sharing but didn't get to the point where I could focus on a specific section of code.

Interesting, it's too bad more concrete proof didn't surface.

You are exactly correct. Ironically, that was how it was done in long ago precursors to 8.2. [...]

Also very interesting. It's certainly nice that it's relatively simple to implement with the updated system.

The bottom line so far with this experiment is that it appears that destructive cache line sharing is not a significant performance issue for #2787 .

and

[...] the timings for this PR against the master:

Whenever I see small perf changes, like the above, I always remember the Producing Wrong Data Without Doing Anything Obviously Wrong!. It's old, but the take away is pretty stunning; that there can be measurement bias simply due to link order, or environment size. Their showings aren't necessarily applicable here, it's more that small perf improvements may be have different root causes, and teasing out causal relationships is important.

azure-pipelines · 2024-07-09T12:45:53Z

✔️ 19c2887 -> Azure artifacts URL

sonarqubecloud · 2024-07-15T12:33:16Z

Quality Gate passed

Issues
33 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

azure-pipelines · 2024-07-15T13:31:18Z

✔️ 13d7fd7 -> Azure artifacts URL

bbpbuildbot · 2024-07-15T13:46:56Z

nrnhines added 7 commits June 24, 2024 08:37

Tentative design for NrnThread padding to avoid cacheline sharing.

ac8d660

print anticipated padding needed to avoid cacheline sharing.

41287d6

node padding (avoids a problem with forward declaration of Node)

9e47932

bug in mech_padding

b627b67

Do not need to calculate ipad. It is the current_row.

d466d27

clang-format

a60bd0c

The cache range has a nodecount size

7492f82

so introduce Memb_list m_cache_offset in analogy with m_storage_offset.

This comment has been minimized.

Sign in to view

Merge branch 'master' into hines/thread-padding

7ab6c4e

This comment has been minimized.

Sign in to view

Merge branch 'master' into hines/thread-padding

19c2887

This comment has been minimized.

Sign in to view

pramodk mentioned this pull request Jul 9, 2024

Add documentation about using Intel Vtune with NEURON #2971

Merged

Merge branch 'master' into hines/thread-padding

13d7fd7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Hines/thread padding#2951

Hines/thread padding#2951
nrnhines wants to merge 10 commits into
masterfrom
hines/thread-padding

nrnhines commented Jun 30, 2024

Uh oh!

azure-pipelines Bot commented Jun 30, 2024

Uh oh!

This comment has been minimized.

azure-pipelines Bot commented Jul 4, 2024

Uh oh!

This comment has been minimized.

mgeplf commented Jul 5, 2024

Uh oh!

nrnhines commented Jul 5, 2024

Uh oh!

pramodk commented Jul 8, 2024

Uh oh!

mgeplf commented Jul 8, 2024

Uh oh!

azure-pipelines Bot commented Jul 9, 2024

Uh oh!

This comment has been minimized.

sonarqubecloud Bot commented Jul 15, 2024

Uh oh!

azure-pipelines Bot commented Jul 15, 2024

Uh oh!

bbpbuildbot commented Jul 15, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

nrnhines commented Jun 30, 2024

Uh oh!

azure-pipelines Bot commented Jun 30, 2024

Uh oh!

This comment has been minimized.

azure-pipelines Bot commented Jul 4, 2024

Uh oh!

This comment has been minimized.

mgeplf commented Jul 5, 2024

Uh oh!

nrnhines commented Jul 5, 2024

Uh oh!

pramodk commented Jul 8, 2024

Uh oh!

mgeplf commented Jul 8, 2024

Uh oh!

azure-pipelines Bot commented Jul 9, 2024

Uh oh!

This comment has been minimized.

sonarqubecloud Bot commented Jul 15, 2024

Quality Gate passed

Uh oh!

azure-pipelines Bot commented Jul 15, 2024

Uh oh!

bbpbuildbot commented Jul 15, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants