Flaky test: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey #38885

@thaJeztah

Let's create a separate issue for this one (also tracked in #33041 and #37306).

Seen failing in https://jenkins.dockerproject.org/job/Docker-PRs-experimental/44501/console (and many other times)

03:25:28 FAIL: docker_cli_swarm_test.go:1316: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
03:25:28
03:25:28 Creating a new daemon
03:25:28 [dcd909916369d] waiting for daemon to start
03:25:28 [dcd909916369d] daemon started
03:25:28
03:25:28 Creating a new daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28
03:25:28 Creating a new daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28
03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28
03:25:28 [d899b634e4c28] exiting daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28
03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28
03:25:28 [d899b634e4c28] exiting daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28
03:25:28 docker_cli_swarm_test.go:1386:
03:25:28     c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
03:25:28 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc0008f00a0), Stderr:[]uint8(nil)} ("exit status 1")
03:25:28 ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
03:25:28
03:25:28
03:25:28 [dcd909916369d] exiting daemon
03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [d899b634e4c28] exiting daemon
03:25:32

This is the test:

// This differs from `TestSwarmRotateUnlockKey` because that one rotates a single node, which is the leader.
// This one keeps the leader up, and asserts that other manager nodes in the cluster also have their unlock
// key rotated.
func (s *DockerSwarmSuite) TestSwarmClusterRotateUnlockKey(c *check.C) {
    if runtime.GOARCH == "s390x" {
        c.Skip("Disabled on s390x")
    }
    if runtime.GOARCH == "ppc64le" {
        c.Skip("Disabled on ppc64le")
    }

    d1 := s.AddDaemon(c, true, true) // leader - don't restart this one, we don't want leader election delays
    d2 := s.AddDaemon(c, true, true)
    d3 := s.AddDaemon(c, true, true)

    outs, err := d1.Cmd("swarm", "update", "--autolock")
    c.Assert(err, checker.IsNil, check.Commentf("%s", outs))

    unlockKey := getUnlockKey(d1, c, outs)

    // Rotate multiple times
    for i := 0; i != 3; i++ {
        outs, err = d1.Cmd("swarm", "unlock-key", "-q", "--rotate")
        c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
        // Strip \n
        newUnlockKey := outs[:len(outs)-1]
        c.Assert(newUnlockKey, checker.Not(checker.Equals), "")
        c.Assert(newUnlockKey, checker.Not(checker.Equals), unlockKey)

        d2.RestartNode(c)
        d3.RestartNode(c)

        for _, d := range []*daemon.Daemon{d2, d3} {
            c.Assert(getNodeStatus(c, d), checker.Equals, swarm.LocalNodeStateLocked)

            outs, _ := d.Cmd("node", "ls")
            c.Assert(outs, checker.Contains, "Swarm is encrypted and needs to be unlocked")

            cmd := d.Command("swarm", "unlock")
            cmd.Stdin = bytes.NewBufferString(unlockKey)
            result := icmd.RunCmd(cmd)

            if result.Error == nil {
                // On occasion, the daemon may not have finished
                // rotating the KEK before restarting. The test is
                // intentionally written to explore this behavior.
                // When this happens, unlocking with the old key will
                // succeed. If we wait for the rotation to happen and
                // restart again, the new key should be required this
                // time.

                time.Sleep(3 * time.Second)

                d.RestartNode(c)

                cmd = d.Command("swarm", "unlock")
                cmd.Stdin = bytes.NewBufferString(unlockKey)
                result = icmd.RunCmd(cmd)
            }
            result.Assert(c, icmd.Expected{
                ExitCode: 1,
                Err:      "invalid key",
            })

            outs, _ = d.Cmd("node", "ls")
            c.Assert(outs, checker.Contains, "Swarm is encrypted and needs to be unlocked")

            cmd = d.Command("swarm", "unlock")
            cmd.Stdin = bytes.NewBufferString(newUnlockKey)
            icmd.RunCmd(cmd).Assert(c, icmd.Success)

            c.Assert(getNodeStatus(c, d), checker.Equals, swarm.LocalNodeStateActive)

            outs, err = d.Cmd("node", "ls")
            c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
            c.Assert(outs, checker.Not(checker.Contains), "Swarm is encrypted and needs to be unlocked")
        }

        unlockKey = newUnlockKey
    }
}

d1 = dcd909916369d
d2 = de6869a9c7827
d3 = d899b634e4c28
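
To make the log easier to follow, the ID-to-variable mapping above can be applied mechanically. This is a small illustrative sketch; `annotate` and the `names` map are made up for this example and are not part of the test suite.

```go
package main

import (
	"fmt"
	"strings"
)

// names maps the daemon IDs seen in this CI run to the variables in the test.
var names = map[string]string{
	"dcd909916369d": "d1",
	"de6869a9c7827": "d2",
	"d899b634e4c28": "d3",
}

// annotate rewrites "[<id>]" in a log line to "[<name>=<id>]" when the ID
// is known; other lines are returned unchanged.
func annotate(line string) string {
	start := strings.Index(line, "[")
	end := strings.Index(line, "]")
	if start < 0 || end < start {
		return line
	}
	id := line[start+1 : end]
	name, ok := names[id]
	if !ok {
		return line
	}
	return line[:start+1] + name + "=" + id + line[end:]
}

func main() {
	fmt.Println(annotate("03:25:28 [de6869a9c7827] exiting daemon"))
	// prints: 03:25:28 [d2=de6869a9c7827] exiting daemon
}
```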

03:25:28 FAIL: docker_cli_swarm_test.go:1316: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey
03:25:28

Create 3 daemons:

Daemon 1 (d1 = dcd909916369d)

d1 := s.AddDaemon(c, true, true) // leader - don't restart this one, we don't want leader election delays

03:25:28 Creating a new daemon
03:25:28 [dcd909916369d] waiting for daemon to start
03:25:28 [dcd909916369d] daemon started
03:25:28

Daemon 2 (d2 = de6869a9c7827)

d2 := s.AddDaemon(c, true, true)

03:25:28 Creating a new daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28

Daemon 3 (d3 = d899b634e4c28)

d3 := s.AddDaemon(c, true, true)

03:25:28 Creating a new daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28

In a loop (3 times):

Iteration 1:

Restart daemon d2

03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28

Restart daemon d3

03:25:28 [d899b634e4c28] exiting daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28

Iteration 2:

Restart daemon d2

03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [de6869a9c7827] waiting for daemon to start
03:25:28 [de6869a9c7827] daemon started
03:25:28

Restart daemon d3

03:25:28 [d899b634e4c28] exiting daemon
03:25:28 [d899b634e4c28] waiting for daemon to start
03:25:28 [d899b634e4c28] daemon started
03:25:28

Failing here:

outs, err = d.Cmd("node", "ls")
c.Assert(err, checker.IsNil, check.Commentf("%s", outs))

03:25:28 docker_cli_swarm_test.go:1386:
03:25:28     c.Assert(err, checker.IsNil, check.Commentf("%s", outs))
03:25:28 ... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc0008f00a0), Stderr:[]uint8(nil)} ("exit status 1")
03:25:28 ... Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
03:25:28
03:25:28
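
The error message asks for "more than half of the managers" to be online. As a rough illustration of that rule for this 3-manager cluster, here is a sketch of the majority arithmetic. The function names are made up for this example, and it is an assumption (not something the log confirms) that d2 and d3 were both unavailable to the cluster at the same moment in this run.

```go
package main

import "fmt"

// quorum returns the smallest number of managers that is
// "more than half" of the given cluster size.
func quorum(managers int) int {
	return managers/2 + 1
}

// hasQuorum reports whether the online managers form a majority.
func hasQuorum(managers, online int) bool {
	return online >= quorum(managers)
}

func main() {
	fmt.Println(quorum(3))       // 2
	fmt.Println(hasQuorum(3, 1)) // false: only d1 reachable
	fmt.Println(hasQuorum(3, 2)) // true: d1 plus one unlocked manager
}
```

With three managers, restarting d2 and d3 back to back can leave d1 as the only reachable manager until one of them is unlocked again, which is below the majority of 2.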

Teardown:

03:25:28 [dcd909916369d] exiting daemon
03:25:28 [de6869a9c7827] exiting daemon
03:25:28 [d899b634e4c28] exiting daemon
03:25:32

Logs:

Labels: area/testing, kind/bug
