Skip to content

Nondeterminism in chutney+tor+arti #3610

@sporksmith

Description

@sporksmith

chutney tests involving arti hidden services appear to be non-deterministic under shadow. I haven't narrowed down yet whether the non-determinism extends to other chutney networks (e.g. not involving arti or not involving hidden services). It's more obvious in the (arti) hidden service case since those transfers are flaky, resulting in corresponding tests run under shadow also being flaky. (instead of always passing or always failing with a given simulation seed)

Locally running a slightly modified version of arti's "shadow-test-networks" CI job, extracted to this script:

set -xeuo pipefail

for flavor in \
      hs-v3-arti \
      ; do
  shadow-exec \
    --preserve=always \
    --shadow-args="--model-unblocked-syscall-latency=true --max-unapplied-cpu-latency=10ms --strace-logging-mode=deterministic" \
    -- \
  bash -c \
  "
    set -x
    set -euo pipefail
    export CHUTNEY_ARTI=\"$CHUTNEY_ARTI\"
    # Too noisy
    export CHUTNEY_DEBUG=0
    # Tell the "verify" test to connect to hidden services with *every*
    # client. We want this to e.g. validate cross-compatibility between tor
    # and arti clients and servers.
    export CHUTNEY_HS_MULTI_CLIENT=1
    export PATH=\"$PATH\"
    tools/test-network.sh \
    --chutney-path \"$CHUTNEY_PATH\" \
    --net-dir \"$CHUTNEY_DATA_DIR\" \
    --sandbox 0 \
    --no-controlsocket \
    --no-ipv6 \
    --tor \"$(which tor)\" \
    --tor-gencert \"$(which tor-gencert)\" \
    --stop-time 0 \
    --flavor \"$flavor\" || \"$CHUTNEY_PATH\"/chutney stop \"$flavor\""
done

I eliminated some sources of nondeterminism by setting the output directory (CHUTNEY_DATA_DIR) to the same fresh empty directory between runs (and moving previous results out of the way).

I still get substantial diffs using:
git diff --no-index --text --word-diff -- ~/tmp/net[01]/shadow-exec-*/shadow.data/hosts/host/bash.1000.strace

Going roughly in order:

There are a few of these sched_getaffinity calls using different PIDs (in arti). I think this is the native PID that the dynamic linker gets hold of before we get control. The result is the same in both cases ESRCH, so it doesn't seem likely to itself cause further divergence.

@@ -72207,7 +72207,7 @@
00:00:01.000000000 [tid 1051] read(3, <pointer>, 1024) = 1024
00:00:01.000000000 [tid 1051] read(3, <pointer>, 1024) = 125
00:00:01.000000000 [tid 1051] close(3) = 0
00:00:01.000000000 [tid 1051] [-sched_getaffinity(1570339,-]{+sched_getaffinity(1570910,+} 32, <pointer>) = -3 (ESRCH)
00:00:01.000000000 [tid 1051] rt_sigaction(11, <pointer>, <pointer>, 8) = 0
00:00:01.000000000 [tid 1051] sigaltstack(<pointer>, <pointer>) = 0
00:00:01.000000000 [tid 1051] mmap(<pointer>, 12288, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_STACK | MAP_PRIVATE, 4294967295, 0) = <pointer>

Next we have a sequence of syscalls on a file descriptor (EDIT: in a tor relay process), with different descriptor numbers in the runs. That shouldn't happen; something has gone wrong by this point.

@@ -1589285,10 +1589285,10 @@
000000087654062000 [tid 1087] time(...) = 946684887
000000087654062000 [tid 1087] clock_gettime(...) = 0
000000087654062000 [tid 1087] time(...) = 946684887
00:01:27.654062000 [tid 1087] [-getsockopt(16,-]{+getsockopt(24,+} 6, 11, <pointer>, <pointer>) = 0
00:01:27.654062000 [tid 1087] [-ioctl(16,-]{+ioctl(24,+} 35147, <non-deterministic>) = 0
00:01:27.654062000 [tid 1087] [-getsockopt(17,-]{+getsockopt(18,+} 6, 11, <pointer>, <pointer>) = 0
00:01:27.654062000 [tid 1087] [-ioctl(17,-]{+ioctl(18,+} 35147, <non-deterministic>) = 0
000000087654062000 [tid 1087] clock_gettime(...) = 0
000000087654062000 [tid 1087] clock_gettime(...) = 0
000000087654062000 [tid 1087] time(...) = 946684887

A little later we see sequences of syscalls that are identical but with a different PID.

EDIT: these are tor authority threads

EDIT: these happen after a blocked epoll_wait. In one case the original thread (1067) is run next (why did it get blocked at all??), and in the second case a different thread runs instead. Because both threads were blocked in the same place in the same code, we get the same sequence of syscalls, but with different pids. But of course this will lead to further divergence.

00:01:27.654062000 [tid 1067] epoll_wait(3, <pointer>, 32, 2) = <blocked>
00:01:27.654062000 [tid [-1067]-]{+1072]+} epoll_wait(3, <pointer>, 32, [-2)-]{+376)+} = 1
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} time(...) = 946684887
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
00:01:27.654062000 [tid [-1067]-]{+1072]+} read(19, <pointer>, 5) = 5
00:01:27.654062000 [tid [-1067]-]{+1072]+} read(19, <pointer>, 531) = 531
00:01:27.654062000 [tid [-1067]-]{+1072]+} read(19, <pointer>, 5) = -11 (EWOULDBLOCK)
000000087654062000 [tid [-1067]-]{+1072]+} time(...) = 946684887
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} time(...) = 946684887
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} clock_gettime(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} gettimeofday(...) = 0
000000087654062000 [tid [-1067]-]{+1072]+} gettimeofday(...) = 0
00:01:27.654062000 [tid [-1067]-]{+1072]+} futex(<pointer>, 129, 1, <pointer>, <pointer>, <non-deterministic>) = 1
000000087654062000 [tid [-1067]-]{+1072]+} time(...) = 946684887
00:01:27.654062000 [tid [-1067]-]{+1072]+} epoll_wait(3, <pointer>, 32, [-2)-]{+346)+} = <blocked>
00:01:27.654062000 [tid [-1057]-]{+1077]+} epoll_wait(3, <pointer>, 32, [-350)-]{+354)+} = 1
000000087654062000 [tid [-1057]-]{+1077]+} clock_gettime(...) = 0
000000087654062000 [tid [-1057]-]{+1077]+} time(...) = 946684887
000000087654062000 [tid [-1057]-]{+1077]+} clock_gettime(...) = 0
000000087654062000 [tid [-1057]-]{+1077]+} clock_gettime(...) = 0

I'll stop here; the traces continue to diverge further and further by this point.

Metadata

Metadata

Assignees

Labels

Type: BugError or flaw producing unexpected results

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions