Fix TestCreateServiceSecretFileMode, TestCreateServiceConfigFileMode#42960
Conversation
|
@AkihiroSuda @cpuguy83 PTAL |
| }) | ||
| assert.NilError(t, err) | ||
| assert.Check(t, is.Equal(len(tasks), 1)) | ||
| tasks := swarm.GetRunningTasks(t, client, serviceID) |
There was a problem hiding this comment.
This utility should probably also be updated to check the actual state, but leaving that for a follow-up (similar to;
moby/integration/internal/swarm/states.go
Lines 59 to 70 in 7b9275c
|
LOL! okay, okay, I'll fix |
f674eaa to
9b41144
Compare
|
So this PR should fix the bug in the test (looking for "any" task), but doesn't solve flakiness in tasks failing to start (i.e., we end up without running tasks within XX seconds); Let me try get a bundle to see why the tasks fail to start |
4941809 to
634b490
Compare
|
I think I found the actual problem with this test (also see #37132 (comment)), and ... it's silly 🤦 Updated this PR, and the description. |
| swarm.ServiceWithReplicas(instances), | ||
| swarm.ServiceWithName(serviceName), | ||
| swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret || /bin/top"}), | ||
| swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret && sleep 600"}), |
There was a problem hiding this comment.
FWIW, we don't really need these tasks to keep running (running once and print the output would be sufficient), but keeping them running makes sure SwarmKit doesn't continue reconciling / spinning up new tasks.
Possibly we could simplify the test to get the results out in some different way, but I'm gonna leave that for a future exercise.
There was a problem hiding this comment.
Not super familiar with this test code, but what image is it running? Some sleep(1) implementations (including the GNU coreutils version) accept inf as an argument to sleep indefinitely. That might help here rather than assuming the test will be done before 600s have elapsed.
There was a problem hiding this comment.
Not super familiar with this test code, but what image is it running?
The default for these tests is to use busybox.
The Dockerfile has a frozen-images stage, which downloads some images that are used in integration tests. Originally this was added to;
- be sure we always use the exact same image
- speed up the tests (no
docker pullfor each of the tests); this also helped with hiccups / outages of Docker Hub, if they would happen
Nowadays it also comes in handy with rate limits 😅 (although, we should set up a "pull only" token in CI)
Some sleep(1) implementations (including the GNU coreutils version) accept inf as an argument to sleep indefinitely. That might help here rather than assuming the test will be done before 600s have elapsed.
Oh, TIL about inf being a valid option. Looks like busybox sleep supports that value.
I'll update the test; I suspect there will be various other tests that used <some arbitrary, randomly picked long timeout>; we can go through our tests and update those as well.
634b490 to
c3ba756
Compare
|
|
||
| body, err := client.ContainerLogs(ctx, tasks[0].Status.ContainerStatus.ContainerID, types.ContainerLogsOptions{ | ||
| body, err := client.ServiceLogs(ctx, serviceID, types.ContainerLogsOptions{ | ||
| Tail: "1", |
There was a problem hiding this comment.
1 line should be enough 😅 (famous last words)
|
@samuelkarp @AkihiroSuda this LGTY? This should bring CI in a slightly better shape |
samuelkarp
left a comment
There was a problem hiding this comment.
LGTM with a minor suggestion
| swarm.ServiceWithReplicas(instances), | ||
| swarm.ServiceWithName(serviceName), | ||
| swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret || /bin/top"}), | ||
| swarm.ServiceWithCommand([]string{"/bin/sh", "-c", "ls -l /etc/secret && sleep 600"}), |
There was a problem hiding this comment.
Not super familiar with this test code, but what image is it running? Some sleep(1) implementations (including the GNU coreutils version) accept inf as an argument to sleep indefinitely. That might help here rather than assuming the test will be done before 600s have elapsed.
…ileMode Looks like this test was broken from the start, and fully relied on a race condition. (Test was added in 65ee7ff) The problem is in the service's command: `ls -l /etc/config || /bin/top`, which will either: - exit immediately if the secret is mounted correctly at `/etc/config` (which it should) - keep running with `/bin/top` if the above failed After the service is created, the test enters a race-condition, checking for 1 task to be running (which it ocassionally is), after which it proceeds, and looks up the list of tasks of the service, to get the log output of `ls -l /etc/config`. This is another race: first of all, the original filter for that task lookup did not filter by `running`, so it would pick "any" task of the service (either failed, running, or "completed" (successfully exited) tasks). In the meantime though, SwarmKit kept reconciling the service, and creating new tasks, so even if the test was able to get the ID of the correct task, that task may already have been exited, and removed (task-limit is 5 by default), so only if the test was "lucky", it would be able to get the logs, but of course, chances were likely that it would be "too late", and the task already gone. The problem can be easily reproduced when running the steps manually: echo 'CONFIG' | docker config create myconfig - docker service create --config source=myconfig,target=/etc/config,mode=0777 --name myservice busybox sh -c 'ls -l /etc/config || /bin/top' The above creates the service, but it keeps retrying, because each task exits immediately (followed by SwarmKit reconciling and starting a new task); mjntpfkkyuuc1dpay4h00c4oo overall progress: 0 out of 1 tasks 1/1: ready [======================================> ] verify: Detected task failure ^COperation continuing in background. Use `docker service ps mjntpfkkyuuc1dpay4h00c4oo` to check progress. And checking the tasks for the service reveals that tasks exit cleanly (no error), but _do exit_, so swarm just keeps up reconciling, and spinning up new tasks; docker service ps myservice --no-trunc ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS 2wmcuv4vffnet8nybg3he4v9n myservice.1 busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57 docker-desktop Ready Ready less than a second ago 5p8b006uec125iq2892lxay64 \_ myservice.1 busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57 docker-desktop Shutdown Complete less than a second ago k8lpsvlak4b3nil0zfkexw61p \_ myservice.1 busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57 docker-desktop Shutdown Complete 6 seconds ago vsunl5pi7e2n9ol3p89kvj6pn \_ myservice.1 busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57 docker-desktop Shutdown Complete 11 seconds ago orxl8b6kt2l6dfznzzd4lij4s \_ myservice.1 busybox:latest@sha256:f7ca5a32c10d51aeda3b4d01c61c6061f497893d7f6628b92f822f7117182a57 docker-desktop Shutdown Complete 17 seconds ago This patch changes the service's command to `sleep`, so that a successful task (after successfully performing `ls -l /etc/config`) continues to be running until the service is deleted. With that change, the service should (usually) reconcile immediately, which removes the race condition, and should also make it faster :) This patch changes the tests to use client.ServiceLogs() instead of using the service's tasklist to directly access container logs. This should also fix some failures that happened if some tasks failed to start before reconciling, in which case client.TaskList() (with the current filters), could return more tasks than anticipated (as it also contained the exited tasks); === RUN TestCreateServiceSecretFileMode create_test.go:291: assertion failed: 2 (int) != 1 (int) --- FAIL: TestCreateServiceSecretFileMode (7.88s) === RUN TestCreateServiceConfigFileMode create_test.go:355: assertion failed: 2 (int) != 1 (int) --- FAIL: TestCreateServiceConfigFileMode (7.87s) Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
c3ba756 to
13cff6d
Compare
|
Changed I'll merge this one, once CI completes |
|
CI passed, but failed in cleanup afterwards; |
hopefully fixes #37132
Looks like this test was broken from the start, and fully relied on a race
condition. (Test was added in 65ee7ff / #36130)
The problem is in the service's command:
ls -l /etc/config || /bin/top, whichwill either:
/etc/config(which it should)/bin/topif the above failedAfter the service is created, the test enters a race-condition, checking for 1
task to be running (which it ocassionally is), after which it proceeds, and looks
up the list of tasks of the service, to get the log output of
ls -l /etc/config.This is another race: first of all, the original filter for that task lookup did
not filter by
running, so it would pick "any" task of the service (either failed,running, or "completed" (successfully exited) tasks).
In the meantime though, SwarmKit kept reconciling the service, and creating new
tasks, so even if the test was able to get the ID of the correct task, that task
may already have been exited, and removed (task-limit is 5 by default), so only
if the test was "lucky", it would be able to get the logs, but of course, chances
were likely that it would be "too late", and the task already gone.
The problem can be easily reproduced when running the steps manually:
The above creates the service, but it keeps retrying, because each task exits
immediately (followed by SwarmKit reconciling and starting a new task);
And checking the tasks for the service reveals that tasks exit cleanly (no error),
but do exit, so swarm just keeps up reconciling, and spinning up new tasks;
This patch changes the service's command to
sleep, so that a successful task(after successfully performing
ls -l /etc/config) continues to be running untilthe service is deleted. With that change, the service should (usually) reconcile
immediately, which removes the race condition, and should also make it faster :)
This patch changes the tests to use client.ServiceLogs() instead of using the
service's tasklist to directly access container logs. This should also fix some
failures that happened if some tasks failed to start before reconciling, in which
case client.TaskList() (with the current filters), could return more tasks than
anticipated (as it also contained the exited tasks);