SpectrumScale_NETWORK_READINESS

Fix when running nsdperf over RoCE

Open cristeab opened this issue 5 years ago • 2 comments

The original version used only the first GID of a given network interface. This fix collects every GID it can find for the interface into a vector, then finds the first interface name and uses it to run the tests.
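The idea described above can be sketched in Python. This is not the patch itself (nsdperf is not written in Python, and the thread does not show the exact selection rule); it is a minimal illustration that assumes the GID table is read from the standard Linux sysfs layout under /sys/class/infiniband and that "the first interface name" means the first populated GID entry with an associated network device. The names `GidEntry`, `read_gid_table`, and `first_interface_name` are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

# An unpopulated GID table slot reads back as all zeros in sysfs.
ZERO_GID = ":".join(["0000"] * 8)


@dataclass
class GidEntry:
    index: int
    gid: str
    netdev: str  # empty string when no network device is associated


def read_gid_table(dev: str, port: int,
                   sysfs_root: str = "/sys/class/infiniband") -> List[GidEntry]:
    """Collect every populated GID of dev/port, not just index 0."""
    port_dir = Path(sysfs_root) / dev / "ports" / str(port)
    gids_dir = port_dir / "gids"
    if not gids_dir.is_dir():
        return []
    entries = []
    for gid_file in sorted(gids_dir.iterdir(), key=lambda p: int(p.name)):
        try:
            gid = gid_file.read_text().strip()
        except OSError:
            continue  # slot could not be read; skip it
        if gid == ZERO_GID:
            continue  # empty slot
        try:
            ndev_file = port_dir / "gid_attrs" / "ndevs" / gid_file.name
            netdev = ndev_file.read_text().strip()
        except OSError:
            netdev = ""  # no network device recorded for this slot
        entries.append(GidEntry(int(gid_file.name), gid, netdev))
    return entries


def first_interface_name(entries: List[GidEntry]) -> Optional[str]:
    """Assumed rule: take the first entry that names a network device."""
    for entry in entries:
        if entry.netdev:
            return entry.netdev
    return None
```

A caller would run `first_interface_name(read_gid_table("mlx5_0", 1))` and treat a `None` result as "this port has no usable address", which is the condition the original version reported after inspecting only the first GID.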

To show the GIDs of a Mellanox ConnectX-6 adapter, use:

[root@localhost ~]# show_gids mlx5_0
DEV     PORT  INDEX  GID                                      IPv4         VER  DEV
mlx5_0  1     0      fe80:0000:0000:0000:0e42:a1ff:fe5d:4db8               v1   ens1f0
mlx5_0  1     1      fe80:0000:0000:0000:0e42:a1ff:fe5d:4db8               v2   ens1f0
mlx5_0  1     2      fe80:0000:0000:0000:1186:a6b1:3f0b:c441               v1   ens1f0
mlx5_0  1     3      fe80:0000:0000:0000:1186:a6b1:3f0b:c441               v2   ens1f0
mlx5_0  1     4      0000:0000:0000:0000:0000:ffff:ac13:003c  172.19.0.60  v1   ens1f0
mlx5_0  1     5      0000:0000:0000:0000:0000:ffff:ac13:003c  172.19.0.60  v2   ens1f0
n_gids_found=6

Note that in this case the original nsdperf version fails because it uses only the first GID, and nsdperf exits with the error below:

[root@localhost ~]# ./nsdperf-rdma -r mlx5_0/1 -s -d
05:32:39.904017 nsdperf-rdma 1.28 server started
Connection from 172.19.4.2
05:32:49.594867 got msg Version ID 2 len 0 from 172.19.4.2/0
05:32:49.620581 got msg Parms ID 4 len 56 from 172.19.4.2/0
05:32:49.642896 RDMA port mlx5_0:1 has no address
05:32:49.643393 sending msg ReplyErr ID 4 len 24 to 172.19.4.2/0
05:33:22.192687 got msg Kill ID 6 len 0 from 172.19.4.2/0
Connection to 172.19.4.2/0 broken
05:33:22.193039 Closed connection to 172.19.4.2/0

cristeab • Mar 24 '21 16:03

Thanks a lot. As soon as we can test it in our lab, we will merge. Many thanks for the work.

bolinches • Mar 29 '21 12:03

@cristeab Sorry for the delay.

We have tested it and it works nicely; thank you so much for the effort you put into this. You are going to have to bear with us here a little bit, so let me explain.

We (and I especially) were not expecting to get contributions to the code this "soon". We have an internal repository with nsdperf version 1.29, where RoCE is already supported, plus other things. But your contribution has clearly shown a few things.

First and foremost, the current model we use to develop this tool does not work as-is. We are, at best, de facto alienating non-IBMers from helping us with the development by using internal repos instead of a public one.

Second, now that we finally have a good contribution, we are not sure how to proceed: we have 1.29 there with this and other changes.

And last but certainly not least, we need to change how we work here. It won't happen overnight, but the conversations have already started and I hope we can come up with something clearer later on. To you, and to any other contributor: please bear with us a bit longer.

Thanks a lot for your effort, but for now I will not merge the changes until we have a clearer idea of how to proceed here: whether we fully move 1.29 here, or a variation of your changes and 1.29.

Once again, I really thank you for what you have done. I hope you do not feel it is going to waste even if we go with 1.29. Even if that is the case, this clearly shows that we need to change how we move forward with this excellent network benchmark tool.

bolinches • Apr 05 '21 18:04