Convert network interface and queuing to rust#3480
Conversation
Since this involves network-related changes, I'm running Tor+TGen benchmarks to test on a larger scale too. Edited to add results:
We are seeing non-deterministic performance results here.
I could give some speculation about what could maybe be going on here, but I'll resist that temptation since I don't know how much confidence we have in the significance of the performance results. Also, I'm not sure there is enough evidence here to suggest that the Rust version is buggy whereas the C version was not.
Cool, it's nice to see all of that remaining C code finally go away :) Usually I'm not too worried about the network graphs changing, but this one is a little concerning as the tail has grown significantly, with some download times almost doubling. It's not clear to me that there being a tail in tor metrics means that we should see a tail in a tgen-only simulation. And considering only the tgen simulation, this seems like an unnatural result from the hub-and-spoke network graph where all bandwidths are the same. But I don't have a good grasp on the tgen sim and what the expected results are. I don't think it's worth blocking this PR on that though, since I don't see anything incorrect in the implementation.
I hit a failure in the tor extra test: I'm not sure if this is known to be flaky for torbridgeclients.
It looks like the cause of the test failure was an error during the TGen <-> Tor proxy handshake. The relevant bit:
which could be more helpful. Here is what we have in the Tor log at that time: I think the relevant bit is:
So it looks like there was a conflux sequence number check error on the circuit set chosen for the tgen stream. That conflux error showed up in the Tor logs at the exact instant that the TGen logs reported the error too (00:25:06.014). Haven't been able to reproduce locally yet.
Since this seems to be conflux related, perhaps it is related to the bug that was fixed in Tor 0.4.8.13: https://gitlab.torproject.org/tpo/core/tor/-/raw/release-0.4.8/ReleaseNotes It might be worth bumping Tor from version 0.4.8.12 to version 0.4.8.13 in our CI test.
I asked on #tor-dev, and the conflux fix should be unrelated to this error. So you may have found a new tor bug :) I don't think this would be related to the new Shadow changes in this PR. If we had accidentally reordered a packet or something and corrupted one of the cells, it would have caused an openssl error. |
Hmm, OK. Now that I got the Tor extra test working locally, I can report that the tor minimal test is passing for me when using
BTW, in case you missed it (there were a lot of comments in the PR) I had a solution to the non-determinism caused by my If you get a chance to quickly look at 584ffa8, let me know what you think. Thanks!
WRT the long tail in tgen, I agree with you that this is concerning. And the long tail still exists after my fix to the non-deterministic priority problem. (Tgen benchmark results: https://github.com/shadow/benchmark-results/tree/master/tgen/2025-01-09-T16-56-24)

So, I downloaded the raw benchmark data from the nightly run and the latest benchmark of this PR (after 584ffa8) and analyzed a few things. Raw number of tgen successes and errors: For the 3 errors in the PR, the downloads made it to 71-85% completed before the error occurred. For the 8 errors in nightly, the range was 50-95%. So in the case of the PR, it sure doesn't seem like the extra network load from the more successful transfers would fully explain the long tail we observe.

I also looked at client goodput over time (the number of bytes sent+received by each tgen client). This graph maps pretty closely to the existing tgen graphs in the benchmark so I didn't really learn anything new.
FWIW, all 40 TGen extra tests do pass. Those are designed to ensure that we can maintain a minimum level of network performance across a range of slow and fast networks. More details: https://github.com/shadow/shadow/tree/main/src/test/tgen
I discovered the source of the performance difference as explained in #3489 |
The new NetworkQueue is generic over the type it holds, which lets us store sockets as we do now (to maintain consistency with the legacy C code), or possibly store packets in the future.
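For illustration, here is a minimal sketch of what a FIFO queue generic over its element type could look like. The names and methods are hypothetical, not Shadow's actual `NetworkQueue` API:

```rust
use std::collections::VecDeque;

/// Hypothetical sketch: a FIFO queue generic over its element type.
/// `T` could be a shared socket reference today, or a packet type later,
/// without any change to the queue itself.
pub struct NetworkQueue<T> {
    items: VecDeque<T>,
}

impl<T> NetworkQueue<T> {
    pub fn new() -> Self {
        Self {
            items: VecDeque::new(),
        }
    }

    /// Enqueue an item at the back of the queue.
    pub fn push(&mut self, item: T) {
        self.items.push_back(item);
    }

    /// Dequeue the item at the front of the queue, if any.
    pub fn pop(&mut self) -> Option<T> {
        self.items.pop_front()
    }

    /// Number of items currently queued.
    pub fn len(&self) -> usize {
        self.items.len()
    }
}
```

The point of the generic parameter is that swapping what the queue stores (socket refs vs. packets) becomes a change to the instantiation site rather than to the queue implementation.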


The new Rust code mostly follows the legacy C implementation. We may want to explore better designs for handling netif functionality in the future, such as storing packets directly rather than shared socket refs, but such a change will be easier now that the code is in Rust, and easier still once the legacy TCP stack is removed.
Completes step 7 in the network stack migration plan in #2729.