-
Notifications
You must be signed in to change notification settings - Fork 268
Nondeterminism in chutney+tor+arti #3610
Description
chutney tests involving arti hidden services appear to be non-deterministic under shadow. I haven't narrowed down yet whether the non-determinism extends to other chutney networks (e.g. not involving arti or not involving hidden services). It's more obvious in the (arti) hidden service case since those transfers are flaky, resulting in corresponding tests run under shadow also being flaky. (instead of always passing or always failing with a given simulation seed)
Locally running a slightly modified version of arti's "shadow-test-networks" CI job, extracted to this script:
set -xeuo pipefail
for flavor in \
hs-v3-arti \
; do
shadow-exec \
--preserve=always \
--shadow-args="--model-unblocked-syscall-latency=true --max-unapplied-cpu-latency=10ms --strace-logging-mode=deterministic" \
-- \
bash -c \
"
set -x
set -euo pipefail
export CHUTNEY_ARTI=\"$CHUTNEY_ARTI\"
# Too noisy
export CHUTNEY_DEBUG=0
# Tell the "verify" test to connect to hidden services with *every*
# client. We want this to e.g. validate cross-compatibility between tor
# and arti clients and servers.
export CHUTNEY_HS_MULTI_CLIENT=1
export PATH=\"$PATH\"
tools/test-network.sh \
--chutney-path \"$CHUTNEY_PATH\" \
--net-dir \"$CHUTNEY_DATA_DIR\" \
--sandbox 0 \
--no-controlsocket \
--no-ipv6 \
--tor \"$(which tor)\" \
--tor-gencert \"$(which tor-gencert)\" \
--stop-time 0 \
--flavor \"$flavor\" || \"$CHUTNEY_PATH\"/chutney stop \"$flavor\""
done
I eliminated some sources of nondeterminism by setting the output directory (CHUTNEY_DATA_DIR) to the same fresh empty directory between runs (and moving previous results out of the way).
I still get substantial diffs using:
git diff --no-index --text --word-diff -- ~/tmp/net[01]/shadow-exec-*/shadow.data/hosts/host/bash.1000.strace
Going roughly in order:
There are a few of these sched_getaffinity calls using different PIDs (in arti). I think this is the native PID that the dynamic linker gets hold of before we get control. The result is the same in both cases ESRCH, so it doesn't seem likely to itself cause further divergence.
@@ -72207,7 +72207,7 @@
00:00:01.000000000 [tid 1051] read(3, <pointer>, 1024) = 1024
00:00:01.000000000 [tid 1051] read(3, <pointer>, 1024) = 125
00:00:01.000000000 [tid 1051] close(3) = 0
00:00:01.000000000 [tid 1051] [-sched_getaffinity(1570339,-]{+sched_getaffinity(1570910,+} 32, <pointer>) = -3 (ESRCH)
00:00:01.000000000 [tid 1051] rt_sigaction(11, <pointer>, <pointer>, 8) = 0
00:00:01.000000000 [tid 1051] sigaltstack(<pointer>, <pointer>) = 0
00:00:01.000000000 [tid 1051] mmap(<pointer>, 12288, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_STACK | MAP_PRIVATE, 4294967295, 0) = <pointer>Next we have a sequence of syscalls on a file descriptor (EDIT: in a tor relay process), with different descriptor numbers in the runs. That shouldn't happen; something has gone wrong by this point.
@@ -1589285,10 +1589285,10 @@
000000087654062000 [tid 1087] time(...) = 946684887
000000087654062000 [tid 1087] clock_gettime(...) = 0
000000087654062000 [tid 1087] time(...) = 946684887
00:01:27.654062000 [tid 1087] [-getsockopt(16,-]{+getsockopt(24,+} 6, 11, <pointer>, <pointer>) = 0
00:01:27.654062000 [tid 1087] [-ioctl(16,-]{+ioctl(24,+} 35147, <non-deterministic>) = 0
00:01:27.654062000 [tid 1087] [-getsockopt(17,-]{+getsockopt(18,+} 6, 11, <pointer>, <pointer>) = 0
00:01:27.654062000 [tid 1087] [-ioctl(17,-]{+ioctl(18,+} 35147, <non-deterministic>) = 0
000000087654062000 [tid 1087] clock_gettime(...) = 0
000000087654062000 [tid 1087] clock_gettime(...) = 0
000000087654062000 [tid 1087] time(...) = 946684887A little later we see sequences of syscalls that are identical but with a different PID.
EDIT: these are tor authority threads
EDIT: these happen after a blocked epoll_wait. In one case the original thread (1067) is run next (why did it get blocked at all??), and in the second case a different thread runs instead. Because both threads were blocked in the same place in the same code, we get the same sequence of syscalls, but with different pids. But of course this will lead to further divergence.
00:01:27.654062000 [tid 1067] epoll_wait(3, <pointer>, 32, 2) = <blocked>
00:01:27.654062000 [tid [-1067]-]{+1072]+} epoll_wait(3, <pointer>, 32, [-2)-]{+376)+} = 1
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} time(...) = 946684887
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
00:01:27.654062000 [tid [-1067]-]{+1072]+} read(19, <pointer>, 5) = 5
00:01:27.654062000 [tid [-1067]-]{+1072]+} read(19, <pointer>, 531) = 531
00:01:27.654062000 [tid [-1067]-]{+1072]+} read(19, <pointer>, 5) = -11 (EWOULDBLOCK)
000000087654062000 [tid [-1067]-]{+1072]+} time(...) = 946684887
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} time(...) = 946684887
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} gettimeofday(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} gettimeofday(...) = 0
00:01:27.654062000 [tid [-1067]-]{+1072]+} futex(<pointer>, 129, 1, <pointer>, <pointer>, <non-deterministic>) = 1
000000087654062000 [tid [-1067]-]{+1072]+} time(...) = 946684887
00:01:27.654062000 [tid [-1067]-]{+1072]+} epoll_wait(3, <pointer>, 32, [-2)-]{+346)+} = <blocked>
00:01:27.654062000 [tid [-1057]-]{+1077]+} epoll_wait(3, <pointer>, 32, [-350)-]{+354)+} = 1
000000087654062000 [tid [-1057]-]{+1077]+} clock_gettime(...) = 0
000000087654062000 [tid [-1057]-]{+1077]+} time(...) = 946684887
000000087654062000 [tid [-1057]-]{+1077]+} clock_gettime(...) = 0
000000087654062000 [tid [-1057]-]{+1077]+} clock_gettime(...) = 0I'll stop here; the traces continue to diverge further and further by this point.