Adopt CoreNEURON cell permute into NEURON.#2598
Conversation
|
✔️ 0032d4e -> Azure artifacts URL |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2598 +/- ##
==========================================
+ Coverage 67.16% 67.17% +0.01%
==========================================
Files 562 564 +2
Lines 104118 104246 +128
==========================================
+ Hits 69929 70028 +99
- Misses 34189 34218 +29 ☔ View full report in Codecov by Sentry. |
This comment has been minimized.
This comment has been minimized.
|
I have started having a look at the changes in this PR and here are my initial answers to some of your questions:
I think it's a good idea. That way we can make sure that we don't leak anything, for example nrn/src/node_order_optim/cellorder.cpp Line 356 in 0032d4e
It's a good idea as well. Those permutation vectors can be pretty large so it would be good to do this operation in-place. We already have a similar algorithm for that in nrn/src/neuron/container/soa_container.hpp Line 1080 in 0032d4e
That's a non trivial question. I think it would be nice to share some parts that should be (with some work) independent of the NEURON/CoreNEURON details. However I think this would increase the complexity in the build system since at the moment we don't link together NEURON and CoreNEURON libraries. I am wondering in the end whether it makes sense to have separate copies for the two (minus the GPU part of the algorithms in NEURON) and when we are ready to merge the CoreNEURON logic and GPU support in NEURON we can add the needed parts. Anyway I guess that the permutation code is not something that we expect to change very frequently.
This could probably provide some small benefits but I don't think we would see any great increase in CPU performance. We can study this with some more detailed profiling though as an exercise with different circuit sizes and CPU architectures (more or less memory bandwidth). |
Thanks @iomaganaris for looking at this. I have yet to understand in detail, how in-place permutation works. At my present level of understanding, it seems like it would be a tradeoff between space and time. One can indeed avoid copying an array of length N by using a series of just N + 1 moves with only one extra scalar variable. But I get stuck on the apparently quadratic number of comparisons to figure out if an element has already been moved (was done in a previous cycle). At least we have a working version, so substituting with an in place version should still work whether I understand it or not. I was hoping for a performance benefit on cpu but the issue was complicated by the fact that all the parallel pragma details for I agree that it is probably a mistake for now to thing about factoring. Need some more examples of adopting coreneuron algorithms into NEURON. |
|
Hi @nrnhines ,
It's true that those are two different things. We could think whether it makes sense to create a generalized function for both cases but if it's too hard and complicates things too much I wouldn't go into it now.
I thought that this was a fun exercise and I definitely spent more time than necessary for that but I found various permutation algorithms, played with the a bit and created a microbenchmark to check the performance and their tradeoffs. #include <vector>
#include <random>
#include <iostream>
#include <algorithm> // for copy
#include <iterator> // for ostream_iterator
#include <unordered_set>
#include <chrono>
#include <functional>
// #include <format>
using std::chrono::high_resolution_clock;
using std::chrono::nanoseconds;
using std::chrono::milliseconds;
template<typename T>
void debug_print(const std::vector<T>& v, const std::vector<int>& p) {
std::cout << "v" << std::endl;
std::copy(v.begin(), v.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
std::cout << "p" << std::endl;
std::copy(p.begin(), p.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
}
// From https://devblogs.microsoft.com/oldnewthing/20170102-00/?p=95095
// Issue is that indices are copied by value
template<typename T>
void
apply_permutation1(
std::vector<T>& v,
std::vector<int> indices)
{
for (size_t i = 0; i < indices.size(); i++) {
T t{std::move(v[i])};
auto current = i;
while (i != indices[current]) {
auto next = indices[current];
v[current] = std::move(v[next]);
indices[current] = current;
current = next;
}
v[current] = std::move(t);
indices[current] = current;
}
}
// From https://devblogs.microsoft.com/oldnewthing/20170102-00/?p=95095
// A different version of apply_permutation1
// Issue is that indices are copied by value
template<typename T>
void
apply_permutation2(
std::vector<T>& v,
std::vector<int> indices)
{
using std::swap; // to permit Koenig lookup
for (size_t i = 0; i < indices.size(); i++) {
auto current = i;
while (i != indices[current]) {
auto next = indices[current];
swap(v[current], v[next]);
indices[current] = current;
current = next;
}
indices[current] = current;
}
}
// I forgot where I found that but it's really slow (O(N^2))
template<typename T>
void
apply_permutation3(
std::vector<T>& A,
std::vector<int>& p)
{
using std::swap; // to permit Koenig lookup
size_t ind{};
double temp{};
for (size_t i = 0; i < p.size(); i++) {
ind = p[i];
while (ind < i) {
ind = p[ind];
}
swap(A[i], A[ind]);
}
}
// From https://cstheory.stackexchange.com/questions/6711/complexity-of-applying-a-permutation-in-place
// void compute_permutation(Object[] array, int[] pi)
// {
// for(int i=0; i<array.length; i++)
// {
// if (pi[i] >= 0)
// {
// for(int j=i, k; i != (k = pi[j]); j = k)
// {
// Object tmp = array[j];
// array[j] = array[k];
// array[k] = tmp;
// pi[j] = ~pi[j];
// }
// }
// }
// for(int i=0; i<pi.length; i++) pi[i] = ~pi[i];
// }
// TODO: C++ version not working
// template<typename T>
// void
// apply_permutation4(
// std::vector<T>& v,
// std::vector<int>& indices)
// {
// debug_print(v,indices);
// using std::swap; // to permit Koenig lookup
// for (size_t i = 0; i < indices.size(); i++) {
// if (indices[i] >= 0) {
// auto k = indices[i];
// for (size_t j = i; j != indices[j]; j = k) {
// swap(v[j], v[k]);
// k = indices[j];
// indices[j] = -indices[j];
// debug_print(v,indices);
// }
// }
// }
// for (auto& indice : indices) {
// indice = -indice;
// }
// }
// Simpler and better way for the current algorithm
template<typename T>
void
apply_permutation5(
std::vector<T>& v,
std::vector<int>& indices)
{
std::vector<T> new_v;
new_v.reserve(v.size());
std::transform(indices.begin(), indices.end(), std::back_inserter(new_v), [&v](int ind) { return v[ind]; });
v = std::move(new_v);
}
// Optimized algorithm based on https://devblogs.microsoft.com/oldnewthing/20170102-00/?p=95095
// and `Option 5` from https://cstheory.stackexchange.com/a/52365
// O(N) runtime and no O(1) space
// `indices` need to be threadsafe and writable (they return to their original state at the end)
template<typename T>
void
apply_permutation6(
std::vector<T>& v,
std::vector<int>& indices)
{
using std::swap; // to permit Koenig lookup
for (size_t i = 0; i < indices.size(); i++) {
if (indices[i] >= 0) {
auto current = i;
while (indices[current] != i) {
const auto next = indices[current];
swap(v[current], v[next]);
// -1 is needed because otherwise there is an issue if
// indices[current] is 0
indices[current] = -indices[current]-1;
current = next;
}
indices[current] = -indices[current]-1;
}
}
std::transform(indices.begin(), indices.end(), indices.begin(), [](int ind) { return -ind-1; });
}
// Similar to apply_permutation6 but with extra boolean vector
// to find visited elements
// Runtime complexity O(N) but space complexity O(N)
template<typename T>
void
apply_permutation7(
std::vector<T>& v,
const std::vector<int>& indices)
{
std::vector<bool> visited(indices.size(), false);
using std::swap; // to permit Koenig lookup
for (size_t i = 0; i < indices.size(); i++) {
if (!visited[i]) {
auto current = i;
while (indices[current] != i) {
const auto next = indices[current];
swap(v[current], v[next]);
visited[current] = true;
current = next;
}
visited[current] = true;
}
}
}
template<typename T, typename F, int n>
void run_experiment(std::vector<T>& v, std::vector<int>& p, F f) {
const auto num_experiments = 100;
size_t sum_times{};
for (auto experiment = 0; experiment < num_experiments; ++experiment) {
for (int i = 0; i < v.size(); ++i) {
v[i] = i;
}
auto start = high_resolution_clock::now();
f(v, p);
auto end = high_resolution_clock::now();
sum_times += duration_cast<nanoseconds>(end - start).count();
}
std::cout << " apply_permutation" << n << ": "
<< sum_times / num_experiments << " ns"
<< std::endl;
if (v != p) {
std::cout << "v and p after apply_permutation" << n << " are different" << std::endl;
// debug_print(v,p);
}
}
int main(int argc, char *argv[]) {
const auto N = argc > 1 ? std::atoi(argv[1]) : 1000000;
std::cout << "Running experiment with " << N << " elements" << std::endl;
std::vector<int> v;
v.resize(N);
std::vector<int> p;
p.reserve(N);
std::random_device dev;
std::mt19937 rng(dev());
std::uniform_int_distribution<std::mt19937::result_type> distN(0, N-1);
std::unordered_set<int> p_unique;
while (p_unique.size() < N) {
auto rand_num = distN(rng);
while (p_unique.contains(rand_num)) {
rand_num = distN(rng);
}
p_unique.insert(rand_num);
}
for (int i = 0; i < N; ++i) {
v[i] = i;
}
for (const auto& perm : p_unique) {
p.push_back(perm);
}
// debug_print(v,p);
run_experiment<int, decltype(apply_permutation1<int>), 1>(v, p, apply_permutation1);
run_experiment<int, decltype(apply_permutation2<int>), 2>(v, p, apply_permutation2);
run_experiment<int, decltype(apply_permutation3<int>), 3>(v, p, apply_permutation3);
// run_experiment<int, decltype(apply_permutation4<int>), 4>(v, p, apply_permutation4);
run_experiment<int, decltype(apply_permutation5<int>), 5>(v, p, apply_permutation5);
run_experiment<int, decltype(apply_permutation6<int>), 6>(v, p, apply_permutation6);
run_experiment<int, decltype(apply_permutation7<int>), 7>(v, p, apply_permutation7);
}And here are the results on an Intel Xeon Cascade Lake: I compiled the code with
Regarding the increase in CPU performance with the permutations I think that most of the algorithms are already memory bound so I don't expect really large improvements in performance compared to cell permute 1 at least. |
Great job! It is nice to see that they are practically linear in time. It wasn't clear to me if 1 and 2 changed the permutation vector so at the end it is different than from the beginning. If so, they are non-starters since we have to apply the permutation to several data vectors in several different contexts. (I notice the SoA avoids this by permuting entire container rows at once. But, actually, that SoA permutation adds to my cognitive confusion. Why permute twice? Once to properly order nodes and mechanism instances in NrnThread, and once to do the SoA. Does NrnThread even need to be an actualized data structure except for parent node indices along with some of its administrative information?) Anyway, we can leave the in place apply_permutation6 in the wings for potential future consideration, and replace this PR's permutation method with conceptually similar apply_permutation5.
You are likely correct. It just seems to me that "memory bound" opens the issue of memory latency, bandwidth, and data ordering. |
Indeed the algorithms of
IIUC the comments in Line 2215 in ae9b284 NrnThreads and that's why we do it twice.
I didn't understand what you mean by
👍
Without remembering atm all the details about how cell permute 2 works what you suggest could potentially improve the memory accesses I believe. I guess by how much it depends on the circuit. Also I'm wondering whether for the CPU the |
|
@iomaganaris > Let me know if you want me to push this change in the PR. Please do so
In coreneuron, there are no such objects as Node. There is a lengthy comment in nrn/src/coreneuron/sim/multicore.cpp that presents the history of Node. Now, with the new SoA containers in NEURON, Node in NEURON gets a lot of legacy use but seems in more or less reduced to a
Yes! Also, now that this PR shares, with slight modifications, the cell_permute gaussian elimination algorithms with CoreNEURON, there is a clear issue involved in that CoreNEURON is twice as fast as NEURON. Doubtless, some of this is due to nocmodl vs new NMODL. But it seems very worthwhile to begin a detailed diagnosis of the reasons for the performance difference. Well... Maybe that should await a version of NMODL that emits NEURON compatible translations. |
|
SonarCloud Quality Gate failed.
|
|
✔️ e944815 -> Azure artifacts URL |
This comment has been minimized.
This comment has been minimized.
* Added permute benchmark file * A bit of refactoring with new permute agorithm * Clearing up more unused code in NEURON and added implementations for reverse_permute * Improve return values of new functions based on C++ core guidelines * Added unit test for permutation algorithms
|
This comment has been minimized.
This comment has been minimized.
|
✔️ 7b744f5 -> Azure artifacts URL |
|
✔️ 56f86b9 -> Azure artifacts URL |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
✔️ 8071fc3 -> Azure artifacts URL |
This comment has been minimized.
This comment has been minimized.
…s sanitizer" This reverts commit 9863fb5.
|
✔️ f8f9033 -> Azure artifacts URL |
This comment has been minimized.
This comment has been minimized.
|
✔️ 84feb23 -> Azure artifacts URL |
This comment has been minimized.
This comment has been minimized.
#if CORENRN_BUILD used where code fragments differ
|
✔️ de71e34 -> Azure artifacts URL |
This comment has been minimized.
This comment has been minimized.
|
@pramodk The gitlab CI is consistently failing, apparently in something related to ringtest. It's not clear to me what the problem is. However I did add a an optim_node_order test to a new external_ringtest_nrn group so maybe there is some kind of bad interaction with what gitlab is doing (see the change in test/external/ringtest/CMakeLists.txt and the new ringtest tag in test/external/CMakeLists.txt) |
|
|
✔️ ef9613e -> Azure artifacts URL |

















Introduced ParallelContext.optimize_node_order(arg) where arg is 0, 1, or 2. 0 means no permutation, 1 and 2 are equivalent to CoreNEURON cellpermute value.
Permutation, if requested, takes place in nrn/src/nrnoc/treeset.cpp:v_setup_vectors after constructing NrnThread and before
nrn_ensure_model_data_are_sortedby means of a call toneuron::nrn_permute_node_order(). Permutation changesNrnThread : _v_node, _v_parent, v_parent_index, and Node.v_node_index. Additionally it permutes the instances of each mechanism so that each mechanism nodelist is monotonically ordered.
The PR started out copying src/coreneuron/permute files into a new src/node_order_optim folder. Since these file copies share predominantly the same code, it was decided for the nrn build to reach into src/coreneuron/permute and compile with
CORENRN_BUILD=0while the corenrn build compiles those files withCORENRN_BUILD=1.At time of PR opening, Release performance on an old x86_64 Intel Xeon E5606 @ 2.13GHz for the nrntraub model is
Note that permutation was primarily intended for GPU performance.
On an Apple M1 the performance is
Some questions about this PR are:
Should the permutation vectors be converted to
std::vector?Should the permutation algorithm be in-place instead of temporary use of a vector copy?
Is it worthwhile now to factor so that coreneuron and neuron share as much as possible the permutation code?
One would expect the permutation for cpu to allow use of more pipelineing or simd data processing. Should we pursue now?
Is now the best time to convert many
NrnThreadA*tostd::vector<A>and clean upNrnThread?[x] document ParallelContext.optimize_node_order
[x] test
[ ] cover