
Convert network interface and queuing to rust#3480

Merged
robgjansen merged 6 commits into shadow:main from robgjansen:netif-rs
Jan 12, 2025

Conversation

@robgjansen
Member

The new rust code mostly follows the legacy C implementation. We may want to explore better designs for handling netif functionalities in the future, such as just storing packets directly rather than shared socket refs, but such a change will be easier when the code is in rust first, and then even easier when the legacy TCP stack is removed.

Completes step 7 in the network stack migration plan in #2729.

@robgjansen robgjansen self-assigned this Dec 30, 2024
@github-actions github-actions bot added Component: Testing Unit and integration tests and frameworks Component: Main Composing the core Shadow executable labels Dec 30, 2024
@robgjansen
Member Author

robgjansen commented Dec 30, 2024

Since this involves network-related changes, I'm running Tor+TGen benchmarks to test on a larger scale too.

Edited to add results:

We are seeing non-deterministic performance results here.

  • The Tor results don't seem very informative: performance changes, but we can't read too much into them because the differences are probably not significant.
  • The TGen results are more interesting: because randomness has a smaller effect in that simulation, the simulated network performance differences we see there are likely more meaningful. It seems like we're seeing (1) fewer streams failing, (2) slightly lower latency (time to first byte), and (3) slightly higher download time (time to last byte), including a long tail that we've commonly observed in Tor metrics data but have previously not seen reproduced well in Shadow.

I could speculate about what might be going on here, but I'll resist that temptation since I don't know how much confidence we have in the significance of the performance results. Also, I'm not sure there is enough evidence here to suggest that the Rust version is buggy while the C version was not.

@robgjansen robgjansen marked this pull request as ready for review December 30, 2024 01:12
@github-actions github-actions bot added the Component: Documentation In-repository documentation, under docs/ label Dec 30, 2024
@stevenengler
Contributor

Cool, it's nice to see all of that remaining C code finally go away :)

Usually I'm not too worried about the network graphs changing, but this one is a little concerning as the tail has grown significantly, with some download times almost doubling.

[Figure: time_to_last_byte_recv_1048576_cdf]

It's not clear to me that there being a tail in tor metrics means that we should see a tail in a tgen-only simulation. And considering only the tgen simulation, this seems like an unnatural result from the hub-and-spoke network graph where all bandwidths are the same. But I don't have a good grasp on the tgen sim and what the expected results are.

I don't think it's worth blocking this PR on that though, since I don't see anything incorrect in the implementation.

@robgjansen
Member Author

I hit a failure in the tor extra test:

** Shadow completed successfully
Bootstrapped count: 13/13
Successful client stream count: 10/10 (minimum 10)
Successful torclient stream count: 10/10 (minimum 10)
Successful torbridgeclient stream count: 9/10 (minimum 10)
Verification failed: Not enough torbridgeclient streams were successful :(

I'm not sure if this is known to be flaky for torbridgeclients.

@robgjansen
Member Author

It looks like the cause of the test failure was an error during the TGen <-> Tor proxy handshake:

2000-01-01 00:25:06 946686306.314000 [warning] [tgen-transport.c:1160] [_tgentransport_receiveSocksResponseStatus] connection from localhost:127.0.0.1:29297 through socks proxy localhost:127.0.0.1:9000 to fileserver:11.0.0.4:80 failed: our request was not granted by the SOCKS server: error status 1: general failure
2000-01-01 00:25:06 946686306.314000 [critical] [tgen-stream.c:1787] [_tgenstream_runTransportEventLoop] transport connection or proxy handshake failed, stream cannot begin
2000-01-01 00:25:06 946686306.314000 [message] [tgen-stream.c:1723] [_tgenstream_log] [stream-error] transport [fd=7,local=localhost:127.0.0.1:29297,proxy=localhost:127.0.0.1:9000,remote=fileserver:11.0.0.4:80,state=ERROR,error=STATUS] stream [id=1,vertexid=stream,name=torbridgeclient,peername=(null),sendsize=1024,recvsize=1048576,sendstate=SEND_COMMAND,recvstate=RECV_NONE,error=PROXY] bytes [total-bytes-recv=0,total-bytes-send=0,payload-bytes-recv=0,payload-bytes-send=0,payload-progress-recv=0.00%,payload-progress-send=0.00%] times [created-ts=946686306014000,usecs-to-socket-create=0,usecs-to-socket-connect=0,usecs-to-proxy-init=0,usecs-to-proxy-choice=0,usecs-to-proxy-request=0,usecs-to-proxy-response=-1,usecs-to-command=-1,usecs-to-response=-1,usecs-to-first-byte-recv=-1,usecs-to-last-byte-recv=-1,usecs-to-checksum-recv=-1,usecs-to-first-byte-send=-1,usecs-to-last-byte-send=-1,usecs-to-checksum-send=-1,now-ts=946686306314000]

The relevant bit:

our request was not granted by the SOCKS server: error status 1: general failure

which could be more helpful. Here is what we have in the Tor log at that time:

Jan 01 00:25:06.014 [info] connection_handle_listener_read: New SOCKS connection opened from 127.0.0.1.
Jan 01 00:25:06.014 [info] rep_hist_note_used_port: New port prediction added. Will continue predictive circ building for 2371 more seconds.
Jan 01 00:25:06.014 [info] connection_edge_process_inbuf: data from edge while in 'waiting for circuit' state. Leaving it on buffer.
Jan 01 00:25:06.014 [info] exit (high-uptime) circ (length 3): $A52CA5B56C64D864F6AE43E56F29ACBD5706DDA1(open) $2350D51B1F2550DED67603DE1F402007C2B63A0A(open) $4EBB385C80A2CA5D671E16F1C722FBFB5F176891(open)
Jan 01 00:25:06.014 [info] link_apconn_to_circ: Looks like completed circuit to hidden service doesn't allow optimistic data for connection to 11.0.0.4
Jan 01 00:25:06.014 [info] connection_ap_handshake_send_begin: Sending relay cell 0 on circ 3298338958 to begin stream 2658.
Jan 01 00:25:06.014 [info] connection_ap_handshake_send_begin: Address/port sent, ap socket 11, n_circ_id 3298338958
Jan 01 00:25:06.314 [info] cfx_del_leg: Conflux sequence number check failed, tearing down entire set.
Jan 01 00:25:06.314 [info] cfx_del_leg: Conflux current circuit has closed, tearing down entire set.
Jan 01 00:25:06.314 [info] circuit_mark_for_close_: Circuit 4124250453 (id: 14) marked for close at src/core/or/conflux_pool.c:708 (orig reason: 9, new reason: 0)
Jan 01 00:25:06.314 [info] circuit_mark_for_close_: Circuit 3298338958 (id: 13) marked for close at src/core/or/command.c:577 (orig reason: 1, new reason: 0)
Jan 01 00:25:06.314 [info] connection_edge_destroy: CircID 0: At an edge. Marking connection for close.
Jan 01 00:25:06.314 [info] circuit_free_: Circuit 0 (id: 14) has been freed.
Jan 01 00:25:06.314 [info] circuit_free_: Circuit 0 (id: 13) has been freed.

I think the relevant bit is

cfx_del_leg: Conflux sequence number check failed, tearing down entire set.

So it looks like there was a conflux sequence number check error on the circuit set chosen for the tgen stream. That conflux error showed up in the Tor logs at the exact instant that the TGen logs reported the error too (00:25:06.014).

Haven't been able to reproduce locally yet.

@robgjansen
Member Author

Since this seems to be conflux related, perhaps it is related to the bug that was fixed in Tor 0.4.8.13:

Changes in version 0.4.8.13 - 2024-10-24
  This is minor release fixing an important client circuit building (Conflux
  related) bug which lead to performance degradation and extra load on the
  network. Some minor memory leaks fixes as well as an important minor feature
  for pluggable transports. We strongly recommend to update as soon as possible
  for clients in order to neutralize this conflux bug.

o Major bugfixes (circuit building):
    - Conflux circuit building was ignoring the "predicted ports"
      feature, which aims to make Tor stop building circuits if there
      have been no user requests lately. This bug led to every idle Tor
      on the network building and discarding circuits every 30 seconds,
      which added overall load to the network, used bandwidth and
      battery from clients that weren't actively using their Tor, and
      kept sockets open on guards which added connection padding
      essentially forever. Fixes bug 40981; bugfix on 0.4.8.1-alpha;

https://gitlab.torproject.org/tpo/core/tor/-/raw/release-0.4.8/ReleaseNotes

It might be worth bumping Tor from version 0.4.8.12 to version 0.4.8.13 in our CI test.

@stevenengler
Contributor

stevenengler commented Jan 9, 2025

Since this seems to be conflux related, perhaps it is related to the bug that was fixed in Tor 0.4.8.13:

I asked on #tor-dev, and the conflux fix should be unrelated to this error. So you may have found a new tor bug :)

I don't think this would be related to the new Shadow changes in this PR. If we had accidentally reordered a packet or something and corrupted one of the cells, it would have caused an openssl error.

@robgjansen
Member Author

Hmm, OK. Now that I have the Tor extra test working locally, I can report that the Tor minimal test passes for me when using tor-0.4.9.1-alpha. I'll rebase the PR now so that we can see what happens in the CI with tor-0.4.8.13.

@robgjansen
Member Author

robgjansen commented Jan 9, 2025

BTW, in case you missed it (there were a lot of comments in the PR), I had a solution to the non-determinism caused by my Prioritized struct. See the comments in the thread here: #3480 (comment)

If you get a chance to quickly look at 584ffa8, let me know what you think. Thanks!

@robgjansen
Member Author

WRT the long tail in tgen, I agree with you that this is concerning. And the long tail still exists after my fix to the non-deterministic priority problem. (Tgen benchmark results: https://github.com/shadow/benchmark-results/tree/master/tgen/2025-01-09-T16-56-24)

So, I downloaded the raw benchmark data from the nightly run and the latest benchmark of this PR (after 584ffa8) and analyzed a few things.

Raw number of tgen successes and errors:

$ xzcat nightly/shadow.data/hosts/peer*/tgen.1001*xz | grep "stream-success" | wc -l
99989
$ xzcat nightly/shadow.data/hosts/peer*/tgen.1001*xz | grep "stream-error" | wc -l
11
$ xzcat 3480/shadow.data/hosts/peer*/tgen.1001*xz | grep "stream-success" | wc -l
99997
$ xzcat 3480/shadow.data/hosts/peer*/tgen.1001*xz | grep "stream-error" | wc -l
3

For the 3 errors in the PR, the downloads made it to 71-85% completed before the error occurred. For the 8 errors in nightly, the range was 50-95%.

So in the case of the PR, it sure doesn't seem like the extra network load from the more successful transfers would fully explain the long tail we observe.

I also looked at client goodput over time (the number of bytes sent+received by each tgen client). This graph maps pretty closely to the existing tgen graphs in the benchmark so I didn't really learn anything new.

[Screenshot: client goodput over time, 2025-01-09 4:27:46 PM]

@robgjansen
Member Author

FWIW, all 40 TGen extra tests do pass. Those are designed to ensure that we can maintain a minimum level of network performance across a range of slow and fast networks.

More details: https://github.com/shadow/shadow/tree/main/src/test/tgen

@robgjansen
Member Author

I discovered the source of the performance difference, as explained in #3489.

The new NetworkQueue is generic over the type it holds, enabling
us to store sockets as we do now (to maintain consistency with the
legacy C code), or to possibly store packets in the future.
@robgjansen robgjansen merged commit 1d7e0a9 into shadow:main Jan 12, 2025
@robgjansen robgjansen deleted the netif-rs branch January 12, 2025 22:48