support cluster events#32421

Merged
aaronlehmann merged 1 commit into moby:master from dongluochen:swarmkit_events
May 17, 2017

Conversation

@dongluochen
Contributor

@dongluochen dongluochen commented Apr 6, 2017

This PR is ready for review.

===
This is a work item in progress. Tests need to be added. It also needs a change in the Swarmkit Watch API (moby/swarmkit#2099). The initial design discussion is at moby/swarmkit#491. I'd like to get design feedback.

Signed-off-by: Dong Chen <dongluo.chen@docker.com>

- What I did
Add swarm events to docker event stream.

- How I did it
Use Swarm Store Watch API to get change notification from Raft store. Translate that into Docker event format and push it into Docker event structure.
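A minimal sketch of that translation step, using hypothetical stand-in types (the real ones live in swarmapi and Docker's events package, with different fields):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified stand-ins for a swarmkit watch notification and
// the Docker event message; illustrative only, not the actual API types.
type WatchMessage struct {
	Action string // e.g. "create", "update", "remove"
	Object string // e.g. "node", "service", "network", "secret"
	ID     string
}

type Message struct {
	Scope  string
	Type   string
	Action string
	Actor  string
	Time   time.Time
}

// translate converts one swarm store notification into the Docker event
// format, tagging it with the global ("swarm") scope.
func translate(w WatchMessage) Message {
	return Message{
		Scope:  "swarm",
		Type:   w.Object,
		Action: w.Action,
		Actor:  w.ID,
		Time:   time.Now(),
	}
}

func main() {
	ev := translate(WatchMessage{Action: "create", Object: "node", ID: "s0ugk1wi0vgxnspxx0ptmon47"})
	fmt.Println(ev.Scope, ev.Type, ev.Action, ev.Actor)
}
```

The translated message is then pushed into the daemon's existing event stream, so cluster events interleave with the node's local events.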

- How to verify it
On a manager node, docker events includes both the node's local events and cluster events. Cluster events have global scope, while the node's own events have local scope. Existing event filters apply to cluster events, and a new scope filter is added.

# run this on a manager node
$ docker events -f scope=global

# a node joins a cluster
2017-04-06T18:03:46.551104594Z node create s0ugk1wi0vgxnspxx0ptmon47 (name=)
2017-04-06T18:03:46.553642227Z node update s0ugk1wi0vgxnspxx0ptmon47 (name=)
2017-04-06T18:03:46.556070184Z node update s0ugk1wi0vgxnspxx0ptmon47 (name=)
2017-04-06T18:03:46.562025609Z node update s0ugk1wi0vgxnspxx0ptmon47 (name=)
2017-04-06T18:03:46.689608127Z node update s0ugk1wi0vgxnspxx0ptmon47 (name=ip-172-19-71-145, state.new=ready, state.old=unknown)
# a node goes down
2017-04-06T18:04:32.705082118Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51, state.new=down, state.old=ready)
# a node is back up
2017-04-06T18:05:06.288643169Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51, state.new=ready, state.old=down)
# promote a node
2017-04-06T18:05:29.965972204Z node update 0urgktyobae3etk5e4n4331es (desiredrole.new=manager, desiredrole.old=worker, name=ip-172-19-147-51)
2017-04-06T18:05:29.972791974Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:05:29.992599632Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:05:30.007889135Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:05:30.084421029Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:05:30.088454689Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
# demote a node
2017-04-06T18:06:15.004097376Z node update 0urgktyobae3etk5e4n4331es (desiredrole.new=worker, desiredrole.old=manager, name=ip-172-19-147-51)
2017-04-06T18:06:20.015726988Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:06:20.035951960Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:06:20.048380156Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:06:20.068198599Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:06:20.072033661Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
# change a node's availability to pause
2017-04-06T18:07:01.620569322Z node update 0urgktyobae3etk5e4n4331es (availability.new=pause, availability.old=active, name=ip-172-19-147-51)
# change a node's availability to active
2017-04-06T18:07:35.825859057Z node update 0urgktyobae3etk5e4n4331es (availability.new=active, availability.old=pause, name=ip-172-19-147-51)

# create a service
2017-04-06T18:08:43.303340796Z service create 9vvofszhb6iv4k3tmphras96u (name=nginx)
2017-04-06T18:08:43.307625739Z service update 9vvofszhb6iv4k3tmphras96u (name=nginx)
# scale a service
2017-04-06T18:09:37.018798236Z service update 9vvofszhb6iv4k3tmphras96u (name=nginx, replicas.new=3, replicas.old=2)
# update image of a service
2017-04-06T18:11:19.122732546Z service update 9vvofszhb6iv4k3tmphras96u (image.new=nginx:1.10.3@sha256:6202beb06ea61f44179e02ca965e8e13b961d12640101fca213efbfd145d7575, image.old=nginx:latest@sha256:e6693c20186f837fc393390135d8a598a96a833917917789d63766cab6c59582, name=nginx)
2017-04-06T18:11:19.126619069Z service update 9vvofszhb6iv4k3tmphras96u (name=nginx, updatestate.new=updating, updatestate.old=nil)
2017-04-06T18:11:41.552581741Z service update 9vvofszhb6iv4k3tmphras96u (name=nginx, updatestate.new=completed, updatestate.old=updating)
# remove a service
2017-04-06T18:12:23.466026101Z service remove 9vvofszhb6iv4k3tmphras96u (name=nginx)

# create a network
2017-04-06T18:12:45.047497468Z network create qew8jjh6riv75r8tj4wf0imfw (name=tnet)
2017-04-06T18:12:45.053957928Z network update qew8jjh6riv75r8tj4wf0imfw (name=tnet)
# create a container associated with this network
# create a service associated with this network
2017-04-06T18:14:38.161988206Z service create v6md2fr205au7c6wq5zdo6sz1 (name=nginx)
2017-04-06T18:14:38.166951070Z service update v6md2fr205au7c6wq5zdo6sz1 (name=nginx)
# remove the service
2017-04-06T18:15:29.441350106Z service remove v6md2fr205au7c6wq5zdo6sz1 (name=nginx)
# remove a network
2017-04-06T18:15:43.558832104Z network remove qew8jjh6riv75r8tj4wf0imfw (name=tnet)

# create a secret
2017-04-06T18:16:27.127307662Z secret create p3z8b3f2q40fun94yq3vd2g9y (name=mysecret)
# remove a secret
2017-04-06T18:16:55.393649748Z secret remove p3z8b3f2q40fun94yq3vd2g9y (name=mysecret)

- Description for the changelog
Add cluster events to Docker event stream.

daemon/events.go Outdated

Maybe we should create a func eventTimestamp(swarmapi.Meta, swarmapi.WatchActionKind) time.Time function that handles extracting the timestamp. Then we can call it in generateClusterEvent and pass in the timestamp instead of needing to handle it in all the different switch statements.


Would it be useful for Watch to support something like Kind: "*"?

In general I want people to filter for the specific events of interest, but the Docker events API is a bit of an unusual case.

Contributor Author

@dongluochen dongluochen Apr 7, 2017


We don't want to get task change notifications. There are too many of them, and they are implementation details that we don't want users to rely on.

@dongluochen
Contributor Author

Test failure looks unrelated. It's already tracked by #33041.

23:20:09 ----------------------------------------------------------------------
23:20:09 FAIL: check_test.go:355: DockerSwarmSuite.TearDownTest
23:20:09 
23:20:09 check_test.go:360:
23:20:09     d.Stop(c)
23:20:09 daemon/daemon.go:392:
23:20:09     t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
23:20:09 ... Error: Error while stopping the daemon d95be62fe1bc6 : exit status 2

@dongluochen dongluochen changed the title from [WIP] support cluster events to support cluster events May 9, 2017
@dongluochen
Contributor Author

Please review.

daemon/events.go Outdated
Member


logrus.Warn on default?

daemon/events.go Outdated
Member


Can we make these string literals constants?

Contributor Author


Ideally they should be defined in swarmkit, but I can't find them there. I don't think they should be added here.

@AkihiroSuda
Member

cc @aluzzardi @aaronlehmann



Comment doesn't match the field name.

Maybe this can just be Events? I like to avoid putting type information in the field names.

Contributor Author


Changed to WatchStream.



I can't see any code path where it is possible for n.cluster to be nil. I'd suggest removing this - it is a little misleading to suggest that n.cluster can be nil, when other code would blow up in this case.



logrus.WithError(err).Error("failed to receive changes from store watch API")



Should this log anything?



Select on this channel write and <-ctx.Done().

Contributor Author


When <-ctx.Done() fires, shouldn't watch.Recv() return with an error? If it doesn't, it may cause a goroutine leak. Adding <-ctx.Done() alongside the channel write wouldn't help, because execution may never reach this point.

daemon/events.go Outdated


Empty case

daemon/events.go Outdated


Empty case

daemon/events.go Outdated


Don't we need something to check if oldNode is nil? The old value is not guaranteed to be available.

Contributor Author


Why is that? The watch API asks for the old object. In an update case, the old object should always be available.



Providing the old object is best-effort. We're not able to do it if you're looking at past events. I don't think the events code is using this right now, but the point is that the API doesn't guarantee it can provide the old object. Even if it did, this code should still check. It's no good for the client side of a client/server system to crash if the server omits expected values.
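A defensive sketch of that check, with a hypothetical trimmed-down Node type (the real object comes from swarmapi and has many more fields):

```go
package main

import "fmt"

// Minimal stand-in for a swarm node object; illustrative only.
type Node struct {
	Name         string
	Availability string
}

// nodeAttributes builds the event attributes, treating the old object as
// optional: the watch API provides it best-effort only, so a nil oldNode
// simply means no old/new diffs are emitted.
func nodeAttributes(newNode, oldNode *Node) map[string]string {
	attrs := map[string]string{"name": newNode.Name}
	if oldNode != nil && oldNode.Availability != newNode.Availability {
		attrs["availability.old"] = oldNode.Availability
		attrs["availability.new"] = newNode.Availability
	}
	return attrs
}

func main() {
	n := &Node{Name: "ip-172-19-147-51", Availability: "pause"}
	fmt.Println(nodeAttributes(n, nil)) // old object missing: no crash, name only
	fmt.Println(nodeAttributes(n, &Node{Name: "ip-172-19-147-51", Availability: "active"}))
}
```

With the nil guard, a missing old object degrades to an event without diff attributes instead of a panic.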

daemon/events.go Outdated


Don't we need something to check if oldService is nil? The old value is not guaranteed to be available.



"and publishes it"

daemon/events.go Outdated


Don't we need to check if OldObject is nil before calling GetNode, etc.?

Contributor Author

@dongluochen dongluochen May 10, 2017


The GetNode function handles it directly.

   func (m *Object) GetNode() *Node {
       if x, ok := m.GetObject().(*Object_Node); ok {
         return x.Node
       }
       return nil
   }

   func (m *Object) GetObject() isObject_Object {
       if m != nil {
         return m.Object
       }
       return nil
   }

@aaronlehmann

LGTM

@AkihiroSuda AkihiroSuda added the status/failing-ci label May 11, 2017
@aaronlehmann

Please rebase

daemon/events.go Outdated
Member


Maybe this should be swarm. Not sure though.

@dongluochen @aaronlehmann ?

Contributor Author


I'll update it to swarm.

daemon/events.go Outdated
Member


I think we should output node.Role (attributes["role..."]) rather than node.Spec.DesiredRole

Contributor Author


Changed to role.new.



The code is still using node.Spec.DesiredRole. I believe @aluzzardi wanted this to look at node.Role instead.

Contributor Author


Thanks @aaronlehmann. Updated.

@thaJeztah thaJeztah removed the status/failing-ci label May 15, 2017
@dongluochen
Contributor Author

Ping @AkihiroSuda @vdemeester @tonistiigi. Please take a look if you are interested.

@thaJeztah thaJeztah added this to the 17.06.0 milestone May 17, 2017
Member


There shouldn't be a need to close this. I think there is a possibility that the cluster can still attempt to write to this chan after Cleanup has been called (although it's very unlikely, as shutdownDaemon is not fast).

Contributor Author


Thanks. Let me remove it.



If it is not closed, the ProcessClusterNotifications goroutine will never exit.

Member


@aaronlehmann It should take a context then. Or cluster.Cleanup() should close it after it knows there are no writes coming.

Contributor Author


Added a context to ProcessClusterNotifications.

Member


No need to change, but I'm wondering why you chose to pass in a channel and handle this in dockerd instead of defining a LogEvent interface in cluster that would be called to process events.

Contributor Author


It was decided to add a watch API to expose cluster changes, but not to add a separate LogEvent interface, for simplicity.
moby/swarmkit#491 (comment)

Member


I meant an interface in Docker's cluster pkg, like cluster.Backend or cluster.NetworkSubnetsProvider, that would be implemented by daemon.Daemon.

Contributor Author


I'll look at it in a separate change.



Does this need to also select on the Done method of the context passed to ProcessClusterNotifications, so that this won't block forever if ProcessClusterNotifications exits due to context cancellation?

Contributor Author


I don't think it's necessary to do that. c.Cleanup() at https://github.com/moby/moby/blob/master/cmd/dockerd/daemon.go#L278 would trigger this goroutine to return. The call path is Cluster.Cleanup() -> nodeRunner.Stop() -> set nodeRunner.closed -> nodeRunner.handleNodeExit() -> cancel nodeRunner.handleControlSocketChange context -> store watch API gets cancelled.



store watch API got cancelled

This will never happen if this function is blocked on the channel write instead of calling Recv. I don't see what prevents ProcessClusterNotifications from exiting before this function.

As an aside, I think it's bad for ProcessClusterNotifications to exit on either a channel close or a context cancellation. There should be only one way to shut it down.

Contributor Author


Let's use the context from nodeRunner.handleControlSocketChange to make sure it exits.

Member

@vdemeester vdemeester left a comment


LGTM 🦁 😍
(one small question, but can't wait to have that !!)

daemon/events.go Outdated
Member


We don't display label changes; is that by design?

Contributor Author


The attributes are extra information for an event. For example, the node update event 2017-04-06T18:07:01.620569322Z node update 0urgktyobae3etk5e4n4331es (availability.new=pause, availability.old=active, name=ip-172-19-147-51) has a fixed part (timestamp, node update, node ID) and an attributes part (availability.new, availability.old, name). We can add label changes to the attributes if they are useful.

Attributes are added to events in an ad-hoc way. Since swarm objects carry a lot of information, revealing all changes may reduce readability. For example, a node goes through a series of changes when joining a cluster; that is an internal procedure, and users shouldn't need to inspect it.

I'm not very sure how this will be used. We are starting with basic events moby/swarmkit#491 (comment). I expect it to be extended based on user feedback.

Member


I've seen many tools use labels to add configuration that's used when listening for events, so labels would be a likely candidate to add.

Signed-off-by: Dong Chen <dongluo.chen@docker.com>

10 participants