Skip to content

Non-determinism in golang runtime (etcd example, golang tests, obfs4proxy, ...) #2693

@stevenengler

Description

@stevenengler

The syscall order isn't always consistent in the etcd example. For example, see the following two runs:

diff --git a/shadow.data/hosts/server1/server1.etcd.1000.strace b/shadow.data.2/hosts/server1/server1.etcd.1000.strace
index 3ac9e59f..17eb6044 100644
--- a/shadow.data/hosts/server1/server1.etcd.1000.strace
+++ b/shadow.data.2/hosts/server1/server1.etcd.1000.strace
@@ -207,10 +207,13 @@
 00:00:00.000000000 [tid 1003] gettid(...) = 1003
 00:00:00.000000000 [tid 1003] futex(...) = 1
 00:00:00.000000000 [tid 1000] futex(...) = 0
+000000000000000000 [tid 1000] clock_gettime(...) = 0
 00:00:00.000000000 [tid 1000] futex(...) = 1
 00:00:00.000000000 [tid 1003] futex(...) = 0
+000000000000000000 [tid 1003] clock_gettime(...) = 0
 00:00:00.000000000 [tid 1003] futex(...) = 1
 00:00:00.000000000 [tid 1000] futex(...) = 0
+000000000000000000 [tid 1000] clock_gettime(...) = 0
 00:00:00.000000000 [tid 1000] clock_gettime(...) = 0
 00:00:00.000000000 [tid 1000] rt_sigprocmask(...) = 0
 00:00:00.000000000 [tid 1000] mmap(0x0, 8392704, PROT_NONE, MAP_PRIVATE | MAP_ANON | MAP_ANONYMOUS | MAP_STACK, -1, 0) = 0x7fffceedd000
@@ -221,15 +224,15 @@
 00:00:00.000000000 [tid 1000] rt_sigprocmask(...) = 0
 00:00:00.000000000 [tid 1000] rt_sigprocmask(...) = 0
 000000000000000000 [tid 1000] clock_gettime(...) = 0
-000000000000000000 [tid 1000] clock_gettime(...) = 0
-000000000000000000 [tid 1000] clock_gettime(...) = 0
 00:00:00.000000000 [tid 1000] fcntl(0, 3) = 32768
 00:00:00.000000000 [tid 1000] fcntl(1, 3) = 32769
 00:00:00.000000000 [tid 1000] fcntl(2, 3) = 32769
 000000000000000000 [tid 1000] clock_gettime(...) = 0
 000000000000000000 [tid 1000] clock_gettime(...) = 0
 00:00:00.000000000 [tid 1000] socket(Unix, 526338, 0) = 3
+000000000000000000 [tid 1000] clock_gettime(...) = 0
 00:00:00.000000000 [tid 1000] bind(3, sockaddr_un { unnamed }, 2) = 0
+000000000000000000 [tid 1000] clock_gettime(...) = 0
 00:00:00.000000000 [tid 1000] epoll_create1(...) = 4
 00:00:00.000000000 [tid 1000] pipe2([1040768, 65728] (0xc000107348), O_CLOEXEC | O_NDELAY | O_NONBLOCK) = 0
 00:00:00.000000000 [tid 1000] epoll_ctl(...) = 0

Most of the diff is clock_gettime, but occasionally some other syscalls are moved around as well. When combined with the model_unblocked_syscall_latency option, the diff varies much more and sometimes the simulation will fail (see #2161 (comment)).

While we don't necessarily need to support etcd (we don't know of anyone using etcd in shadow), this does point towards a non-determinism issue that we should fix.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type: BugError or flaw producing unexpected results

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions