Adopt CoreNEURON cell permute into NEURON. by nrnhines · Pull Request #2598 · neuronsimulator/nrn

nrnhines · 2023-11-01T13:23:12Z

Introduced ParallelContext.optimize_node_order(arg) where arg is 0, 1, or 2. 0 means no permutation, 1 and 2 are equivalent to CoreNEURON cellpermute value.

Permutation, if requested, takes place in nrn/src/nrnoc/treeset.cpp:v_setup_vectors after constructing NrnThread and before nrn_ensure_model_data_are_sorted by means of a call to neuron::nrn_permute_node_order(). Permutation changes
NrnThread : _v_node, _v_parent, v_parent_index, and Node.v_node_index. Additionally it permutes the instances of each mechanism so that each mechanism nodelist is monotonically ordered.

The PR started out copying src/coreneuron/permute files into a new src/node_order_optim folder. Since these file copies share predominantly the same code, it was decided for the nrn build to reach into src/coreneuron/permute and compile with
CORENRN_BUILD=0 while the corenrn build compiles those files with CORENRN_BUILD=1.

At time of PR opening, Release performance on an old x86_64 Intel Xeon E5606 @ 2.13GHz for the nrntraub model is

mpiexec -n 4 nrniv -mpi -c nthread=1 -c mytstop=200 -c use_gap=1 -c coreneuron=0 -c one_tenth_ncell=1 init.hoc

perm          8.2.3       nrn         corenrn
0                150.8     146.2       
1                              151.5        98.15
2                               147.5       94.66

Note that permutation was primarily intended for GPU performance.
On an Apple M1 the performance is

perm          8.2.3       nrn         corenrn
0                40.08     31.16       
1                              31.51        16.46
2                              30.66        16.25

Some questions about this PR are:
Should the permutation vectors be converted to std::vector?
Should the permutation algorithm be in-place instead of temporary use of a vector copy?
Is it worthwhile now to factor so that coreneuron and neuron share as much as possible the permutation code?
One would expect the permutation for cpu to allow use of more pipelineing or simd data processing. Should we pursue now?
Is now the best time to convert many NrnThread A* to std::vector<A> and clean up NrnThread?

[x] document ParallelContext.optimize_node_order
[x] test
[ ] cover

azure-pipelines · 2023-11-01T13:51:28Z

✔️ 0032d4e -> Azure artifacts URL

codecov · 2023-11-01T14:15:00Z

Codecov Report

Attention: Patch coverage is 80.00000% with 27 lines in your changes are missing coverage. Please review.

Project coverage is 67.17%. Comparing base (6de86fc) to head (ef9613e).
Report is 1 commits behind head on master.

Files	Patch %	Lines
src/coreneuron/permute/cellorder.cpp	71.64%	19 Missing ⚠️
src/coreneuron/permute/node_permute.cpp	60.00%	8 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2598      +/-   ##
==========================================
+ Coverage   67.16%   67.17%   +0.01%     
==========================================
  Files         562      564       +2     
  Lines      104118   104246     +128     
==========================================
+ Hits        69929    70028      +99     
- Misses      34189    34218      +29

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

iomaganaris · 2023-11-07T15:49:49Z

I have started having a look at the changes in this PR and here are my initial answers to some of your questions:

Should the permutation vectors be converted to std::vector?

I think it's a good idea. That way we can make sure that we don't leak anything, for example p gets leaked IIUC here:

nrn/src/node_order_optim/cellorder.cpp

Line 356 in 0032d4e

auto p = inverse_permute(perm, nt.end);

Should the permutation algorithm be in-place instead of temporary use of a vector copy?

It's a good idea as well. Those permutation vectors can be pretty large so it would be good to do this operation in-place. We already have a similar algorithm for that in

nrn/src/neuron/container/soa_container.hpp

Line 1080 in 0032d4e

for (std::size_t i = 0; i < my_size; ++i) {

but I'm not sure whether we can reuse this since the logic is a bit different. Let me know if I can help with that.

Is it worthwhile now to factor so that coreneuron and neuron share as much as possible the permutation code?

That's a non trivial question. I think it would be nice to share some parts that should be (with some work) independent of the NEURON/CoreNEURON details. However I think this would increase the complexity in the build system since at the moment we don't link together NEURON and CoreNEURON libraries. I am wondering in the end whether it makes sense to have separate copies for the two (minus the GPU part of the algorithms in NEURON) and when we are ready to merge the CoreNEURON logic and GPU support in NEURON we can add the needed parts. Anyway I guess that the permutation code is not something that we expect to change very frequently.

One would expect the permutation for cpu to allow use of more pipelineing or simd data processing. Should we pursue now?

This could probably provide some small benefits but I don't think we would see any great increase in CPU performance. We can study this with some more detailed profiling though as an exercise with different circuit sizes and CPU architectures (more or less memory bandwidth).

nrnhines · 2023-11-07T17:12:47Z

have started having a look at the changes in this PR

Thanks @iomaganaris for looking at this.
I would say that if transforming the permutation int* to std::vector can be localized to permutation code then let's go for it.
As I was looking into that, I noticed many int* and other types of arrays in NrnThread that also would seem to benefit from the same treatment. I guess I was unsure of whether it is best to mix this PR with just a partial permutation related transformation or do it in a separate transformation. One issue here that gets too close to my cognitive clarity limits is that this pr's permutation is apparently a reverse permutation in the context that is used by the SoA change. I think it makes sense to modify this pr so that it is consistent with the direction of SoA permutations.

I have yet to understand in detail, how in-place permutation works. At my present level of understanding, it seems like it would be a tradeoff between space and time. One can indeed avoid copying an array of length N by using a series of just N + 1 moves with only one extra scalar variable. But I get stuck on the apparently quadratic number of comparisons to figure out if an element has already been moved (was done in a previous cycle). At least we have a working version, so substituting with an in place version should still work whether I understand it or not.

I was hoping for a performance benefit on cpu but the issue was complicated by the fact that all the parallel pragma details for
solve_interleaved2 are turned off with if(0) and pragmas for solve_interleaved1 contain if(nt->compute_gpu)

I agree that it is probably a mistake for now to thing about factoring. Need some more examples of adopting coreneuron algorithms into NEURON.

iomaganaris · 2023-11-10T12:13:33Z

Hi @nrnhines ,

One issue here that gets too close to my cognitive clarity limits is that this pr's permutation is apparently a reverse permutation in the context that is used by the SoA change. I think it makes sense to modify this pr so that it is consistent with the direction of SoA permutations.

It's true that those are two different things. We could think whether it makes sense to create a generalized function for both cases but if it's too hard and complicates things too much I wouldn't go into it now.

I have yet to understand in detail, how in-place permutation works. At my present level of understanding, it seems like it would be a tradeoff between space and time. One can indeed avoid copying an array of length N by using a series of just N + 1 moves with only one extra scalar variable. But I get stuck on the apparently quadratic number of comparisons to figure out if an element has already been moved (was done in a previous cycle). At least we have a working version, so substituting with an in place version should still work whether I understand it or not.

I thought that this was a fun exercise and I definitely spent more time than necessary for that but I found various permutation algorithms, played with the a bit and created a microbenchmark to check the performance and their tradeoffs.
Here is the C++ mini app I wrote:

#include <vector>
#include <random>
#include <iostream>
#include <algorithm> // for copy
#include <iterator> // for ostream_iterator
#include <unordered_set>
#include <chrono>
#include <functional>

// #include <format>

using std::chrono::high_resolution_clock;
using std::chrono::nanoseconds;
using std::chrono::milliseconds;

template<typename T>
void debug_print(const std::vector<T>& v, const std::vector<int>& p) {
    std::cout << "v" << std::endl;
    std::copy(v.begin(), v.end(), std::ostream_iterator<int>(std::cout, " "));
    std::cout << std::endl;
    std::cout << "p" << std::endl;
    std::copy(p.begin(), p.end(), std::ostream_iterator<int>(std::cout, " "));
    std::cout << std::endl;
}

// From https://devblogs.microsoft.com/oldnewthing/20170102-00/?p=95095
// Issue is that indices are copied by value
template<typename T>
void
apply_permutation1(
    std::vector<T>& v,
    std::vector<int> indices)
{
    for (size_t i = 0; i < indices.size(); i++) {
        T t{std::move(v[i])};
        auto current = i;
        while (i != indices[current]) {
            auto next = indices[current];
            v[current] = std::move(v[next]);
            indices[current] = current;
            current = next;
        }
        v[current] = std::move(t);
        indices[current] = current;
    }
}

// From https://devblogs.microsoft.com/oldnewthing/20170102-00/?p=95095
// A different version of apply_permutation1
// Issue is that indices are copied by value
template<typename T>
void
apply_permutation2(
    std::vector<T>& v,
    std::vector<int> indices)
{
    using std::swap; // to permit Koenig lookup
    for (size_t i = 0; i < indices.size(); i++) {
        auto current = i;
        while (i != indices[current]) {
            auto next = indices[current];
            swap(v[current], v[next]);
            indices[current] = current;
            current = next;
        }
        indices[current] = current;
    }
}

// I forgot where I found that but it's really slow (O(N^2))
template<typename T>
void
apply_permutation3(
    std::vector<T>& A,
    std::vector<int>& p)
{
    using std::swap; // to permit Koenig lookup
    size_t ind{};
    double temp{};
    for (size_t i = 0; i < p.size(); i++) {
        ind = p[i];
        while (ind < i) {
            ind = p[ind];
        }
        swap(A[i], A[ind]);
    }
}

// From https://cstheory.stackexchange.com/questions/6711/complexity-of-applying-a-permutation-in-place
// void compute_permutation(Object[] array, int[] pi)
// {
//     for(int i=0; i<array.length; i++)
//     {
//         if (pi[i] >= 0)
//         {
//             for(int j=i, k; i != (k = pi[j]); j = k)
//             {
//                 Object tmp = array[j];
//                 array[j] = array[k];
//                 array[k] = tmp;
//                 pi[j] = ~pi[j];
//             }
//         }
//     }

//     for(int i=0; i<pi.length; i++) pi[i] = ~pi[i];
// }

// TODO: C++ version not working
// template<typename T>
// void
// apply_permutation4(
//     std::vector<T>& v,
//     std::vector<int>& indices)
// {
//     debug_print(v,indices);
//     using std::swap; // to permit Koenig lookup
//     for (size_t i = 0; i < indices.size(); i++) {
//         if (indices[i] >= 0) {
//             auto k = indices[i];
//             for (size_t j = i; j != indices[j]; j = k) {
//                 swap(v[j], v[k]);
//                 k = indices[j];
//                 indices[j] = -indices[j];
//                 debug_print(v,indices);
//             }
//         }
//     }
//     for (auto& indice : indices) {
//         indice = -indice;
//     }
// }

// Simpler and better way for the current algorithm
template<typename T>
void
apply_permutation5(
    std::vector<T>& v,
    std::vector<int>& indices)
{
    std::vector<T> new_v;
    new_v.reserve(v.size());
    std::transform(indices.begin(), indices.end(), std::back_inserter(new_v), [&v](int ind) { return v[ind]; });
    v = std::move(new_v);
}

// Optimized algorithm based on https://devblogs.microsoft.com/oldnewthing/20170102-00/?p=95095
// and `Option 5` from https://cstheory.stackexchange.com/a/52365
// O(N) runtime and no O(1) space
// `indices` need to be threadsafe and writable (they return to their original state at the end)
template<typename T>
void
apply_permutation6(
    std::vector<T>& v,
    std::vector<int>& indices)
{
    using std::swap; // to permit Koenig lookup
    for (size_t i = 0; i < indices.size(); i++) {
        if (indices[i] >= 0) {
            auto current = i;
            while (indices[current] != i) {
                const auto next = indices[current];
                swap(v[current], v[next]);
                // -1 is needed because otherwise there is an issue if
                // indices[current] is 0
                indices[current] = -indices[current]-1;
                current = next;
            }
            indices[current] = -indices[current]-1;
        }
    }
    std::transform(indices.begin(), indices.end(), indices.begin(), [](int ind) { return -ind-1; });
}

// Similar to apply_permutation6 but with extra boolean vector
// to find visited elements
// Runtime complexity O(N) but space complexity O(N)
template<typename T>
void
apply_permutation7(
    std::vector<T>& v,
    const std::vector<int>& indices)
{
    std::vector<bool> visited(indices.size(), false);
    using std::swap; // to permit Koenig lookup
    for (size_t i = 0; i < indices.size(); i++) {
        if (!visited[i]) {
            auto current = i;
            while (indices[current] != i) {
                const auto next = indices[current];
                swap(v[current], v[next]);
                visited[current] = true;
                current = next;
            }
            visited[current] = true;
        }
    }
}

template<typename T, typename F, int n>
void run_experiment(std::vector<T>& v, std::vector<int>& p, F f) {
    const auto num_experiments = 100;

    size_t sum_times{};
    for (auto experiment = 0; experiment < num_experiments; ++experiment) {
        for (int i = 0; i < v.size(); ++i) {
            v[i] = i;
        }

        auto start = high_resolution_clock::now();
        f(v, p);
        auto end = high_resolution_clock::now();
        sum_times += duration_cast<nanoseconds>(end - start).count();
    }
    std::cout << "  apply_permutation" << n << ": "
                << sum_times / num_experiments << " ns"
                << std::endl;
    if (v != p) {
        std::cout << "v and p after apply_permutation" << n << " are different" << std::endl;
        // debug_print(v,p);
    }
}

int main(int argc, char *argv[]) {
    const auto N = argc > 1 ? std::atoi(argv[1]) : 1000000;
    std::cout << "Running experiment with " << N << " elements" << std::endl;
    std::vector<int> v;
    v.resize(N);
    std::vector<int> p;
    p.reserve(N);

    std::random_device dev;
    std::mt19937 rng(dev());
    std::uniform_int_distribution<std::mt19937::result_type> distN(0, N-1);

    std::unordered_set<int> p_unique;
    
    while (p_unique.size() < N) {
        auto rand_num = distN(rng);
        while (p_unique.contains(rand_num)) {
            rand_num = distN(rng);
        }
        p_unique.insert(rand_num);
    }

    for (int i = 0; i < N; ++i) {
        v[i] = i;
    }
    for (const auto& perm : p_unique) {
        p.push_back(perm);
    }

    // debug_print(v,p);

    run_experiment<int, decltype(apply_permutation1<int>), 1>(v, p, apply_permutation1);

    run_experiment<int, decltype(apply_permutation2<int>), 2>(v, p, apply_permutation2);

    run_experiment<int, decltype(apply_permutation3<int>), 3>(v, p, apply_permutation3);
    
    // run_experiment<int, decltype(apply_permutation4<int>), 4>(v, p, apply_permutation4);

    run_experiment<int, decltype(apply_permutation5<int>), 5>(v, p, apply_permutation5);

    run_experiment<int, decltype(apply_permutation6<int>), 6>(v, p, apply_permutation6);

    run_experiment<int, decltype(apply_permutation7<int>), 7>(v, p, apply_permutation7);

}

And here are the results on an Intel Xeon Cascade Lake:

I compiled the code with icpx -O2 -g -std=c++20 permute.cpp -o permute.
My conclusion is that apply_permutation5 is the best choice in general. It only costs some memory however if we don't expect the numbers of elements of the vectors and their size to be extremely large there is an large benefit in runtime. If the size of the vectors can grow a lot then apply_permutation6 might be a better option that does the permute in place.

I was hoping for a performance benefit on cpu but the issue was complicated by the fact that all the parallel pragma details for
solve_interleaved2 are turned off with if(0) and pragmas for solve_interleaved1 contain if(nt->compute_gpu)

Regarding the increase in CPU performance with the permutations I think that most of the algorithms are already memory bound so I don't expect really large improvements in performance compared to cell permute 1 at least.

nrnhines · 2023-11-10T14:05:32Z

check the [in place permutation algorithm] performance and their tradeoffs.

Great job! It is nice to see that they are practically linear in time. It wasn't clear to me if 1 and 2 changed the permutation vector so at the end it is different than from the beginning. If so, they are non-starters since we have to apply the permutation to several data vectors in several different contexts. (I notice the SoA avoids this by permuting entire container rows at once. But, actually, that SoA permutation adds to my cognitive confusion. Why permute twice? Once to properly order nodes and mechanism instances in NrnThread, and once to do the SoA. Does NrnThread even need to be an actualized data structure except for parent node indices along with some of its administrative information?) Anyway, we can leave the in place apply_permutation6 in the wings for potential future consideration, and replace this PR's permutation method with conceptually similar apply_permutation5.
In regard to apply_permutation6, I think it is amusing that the use of the sign bit in the permutation vector to signal if that element has already been used can be considered cheating in regard to space complexity (i.e. using a boolean vector by the backdoor so to speak. No question that it reduces the time order complexity, though.)

Regarding the increase in CPU performance with the permutations I think that most of the algorithms are already memory bound so I don't expect really large improvements in performance compared to cell permute 1 at least.

You are likely correct. It just seems to me that "memory bound" opens the issue of memory latency, bandwidth, and data ordering.
The whole point of these permutations was, given an ordering of adjacent nodes, the parents of those nodes are also adjacently ordered. The idea was that gaussian elimination using cachelines of child node data, the parent node data would also be in cachelines (albeit with some possible parent cacheline alignment issues).

iomaganaris · 2023-11-13T11:07:24Z

It wasn't clear to me if 1 and 2 changed the permutation vector so at the end it is different than from the beginning. If so, they are non-starters since we have to apply the permutation to several data vectors in several different contexts.

Indeed the algorithms of apply_permutation1 and apply_permutation2 change the permute vector so I had to pass it by value. In general in all the algorithms I had as a requirement to keep the permutation vector intact after the end of the function.

I notice the SoA avoids this by permuting entire container rows at once. But, actually, that SoA permutation adds to my cognitive confusion. Why permute twice? Once to properly order nodes and mechanism instances in NrnThread, and once to do the SoA.

IIUC the comments in

nrn/src/nrnoc/treeset.cpp

Line 2215 in ae9b284

/** @brief Sort the underlying storage for Nodes.

we need to permute the nodes based on the permutation of the NrnThreads and that's why we do it twice.

Does NrnThread even need to be an actualized data structure except for parent node indices along with some of its administrative information?

I didn't understand what you mean by an actualized data structure here. Something important that NrnThread also holds is the sparse13 vectors as well.

Anyway, we can leave the in place apply_permutation6 in the wings for potential future consideration, and replace this PR's permutation method with conceptually similar apply_permutation5.

👍
Let me know if you want me to push this change in the PR.

You are likely correct. It just seems to me that "memory bound" opens the issue of memory latency, bandwidth, and data ordering.
The whole point of these permutations was, given an ordering of adjacent nodes, the parents of those nodes are also adjacently ordered. The idea was that gaussian elimination using cachelines of child node data, the parent node data would also be in cachelines (albeit with some possible parent cacheline alignment issues).

Without remembering atm all the details about how cell permute 2 works what you suggest could potentially improve the memory accesses I believe. I guess by how much it depends on the circuit. Also I'm wondering whether for the CPU the nwarps variable should be different to get better results and also whether we can help the compiler with some extra vectorization pragmas similar to the OpenACC pragmas. Maybe the first step towards this is cleaning up the NEURON version of the code from the GPU related aspects?

nrnhines · 2023-11-14T00:21:49Z

@iomaganaris > Let me know if you want me to push this change in the PR.

Please do so

I didn't understand what you mean by an actualized data structure

In coreneuron, there are no such objects as Node. There is a lengthy comment in nrn/src/coreneuron/sim/multicore.cpp that presents the history of Node. Now, with the new SoA containers in NEURON, Node in NEURON gets a lot of legacy use but seems in more or less reduced to a container::Node::storage instance containing the data of all Nodes.
Since SoA is now doing the threads mostly in terms of segregating the threads via data reordering, I was wondering if parent indices should become part of the SoA node_data. So, for the most part a NrnThread conceptually is little more than a range in node_data. Anyway, it is just something to think about for future simplification and memory saving.

I'm wondering whether for the CPU the nwarps variable should be different to get better results...

Yes! Also, now that this PR shares, with slight modifications, the cell_permute gaussian elimination algorithms with CoreNEURON, there is a clear issue involved in that CoreNEURON is twice as fast as NEURON. Doubtless, some of this is due to nocmodl vs new NMODL. But it seems very worthwhile to begin a detailed diagnosis of the reasons for the performance difference. Well... Maybe that should await a version of NMODL that emits NEURON compatible translations.

sonarqubecloud · 2023-11-20T12:00:44Z

SonarCloud Quality Gate failed.

6 Bugs
0 Vulnerabilities
0 Security Hotspots
272 Code Smells

No Coverage information
77.3% Duplication

Catch issues before they fail your Quality Gate with our IDE extension SonarLint

azure-pipelines · 2023-11-20T12:37:47Z

✔️ e944815 -> Azure artifacts URL

* Added permute benchmark file * A bit of refactoring with new permute agorithm * Clearing up more unused code in NEURON and added implementations for reverse_permute * Improve return values of new functions based on C++ core guidelines * Added unit test for permutation algorithms

sonarqubecloud · 2024-01-07T16:16:30Z

Quality Gate failed

Failed conditions

73.1% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarCloud

Catch issues before they fail your Quality Gate with our IDE extension SonarLint

azure-pipelines · 2024-01-07T18:14:00Z

✔️ 7b744f5 -> Azure artifacts URL

azure-pipelines · 2024-03-01T01:15:40Z

✔️ 56f86b9 -> Azure artifacts URL

azure-pipelines · 2024-03-26T11:10:15Z

✔️ 8071fc3 -> Azure artifacts URL

…s sanitizer" This reverts commit 9863fb5.

azure-pipelines · 2024-03-27T13:34:22Z

✔️ f8f9033 -> Azure artifacts URL

azure-pipelines · 2024-04-01T12:33:54Z

✔️ 84feb23 -> Azure artifacts URL

#if CORENRN_BUILD used where code fragments differ

azure-pipelines · 2024-04-02T15:42:15Z

✔️ de71e34 -> Azure artifacts URL

nrnhines · 2024-04-02T17:04:54Z

@pramodk The gitlab CI is consistently failing, apparently in something related to ringtest. It's not clear to me what the problem is. However I did add a an optim_node_order test to a new external_ringtest_nrn group so maybe there is some kind of bad interaction with what gitlab is doing (see the change in test/external/ringtest/CMakeLists.txt and the new ringtest tag in test/external/CMakeLists.txt)

sonarqubecloud · 2024-04-09T19:17:50Z

Quality Gate passed

Issues
13 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

azure-pipelines · 2024-04-09T19:45:54Z

✔️ ef9613e -> Azure artifacts URL

bbpbuildbot · 2024-04-09T20:33:12Z

nrnhines added 8 commits October 8, 2023 20:19

experiment with coreneuron/permute in neuron

1c524ef

tentative solve_interleave for neuron.

9ab7e72

temporary test

8bc7ed0

Simple test.py of pc.optimize_node_order(1) works.

68f6f94

augment test.py with several ExpSyn in same segment.

4f67ede

comment out print of permutation info

2dc374d

Merge branch 'master' into hines/node_order_optim

f104712

Merge branch 'master' into hines/node_order_optim

0032d4e