support cluster events#32421

Merged
aaronlehmann merged 1 commit into moby:master from dongluochen:swarmkit_events
May 17, 2017

Conversation

@dongluochen
Contributor

@dongluochen dongluochen commented Apr 6, 2017

This PR is ready for review.

===
This is a work item in progress. Tests need to be added. It also needs a change in the Swarmkit Watch API (moby/swarmkit#2099). The initial design discussion is at moby/swarmkit#491. I'd like to get design feedback.

Signed-off-by: Dong Chen <dongluo.chen@docker.com>

- What I did
Add swarm events to docker event stream.

- How I did it
Use Swarm Store Watch API to get change notification from Raft store. Translate that into Docker event format and push it into Docker event structure.
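A minimal sketch of that translation step, using hypothetical stand-in types (the real ones live in swarmapi and Docker's events package, with different fields):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified stand-ins for a swarmkit watch notification and
// the Docker event message; illustrative only, not the actual API types.
type WatchMessage struct {
	Action string // e.g. "create", "update", "remove"
	Object string // e.g. "node", "service", "network", "secret"
	ID     string
}

type Message struct {
	Scope  string
	Type   string
	Action string
	Actor  string
	Time   time.Time
}

// translate converts one swarm store notification into the Docker event
// format, tagging it with the global ("swarm") scope.
func translate(w WatchMessage) Message {
	return Message{
		Scope:  "swarm",
		Type:   w.Object,
		Action: w.Action,
		Actor:  w.ID,
		Time:   time.Now(),
	}
}

func main() {
	ev := translate(WatchMessage{Action: "create", Object: "node", ID: "s0ugk1wi0vgxnspxx0ptmon47"})
	fmt.Println(ev.Scope, ev.Type, ev.Action, ev.Actor)
}
```

The translated message is then pushed into the daemon's existing event stream, so cluster events interleave with the node's local events.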

- How to verify it
On a manager node, docker events includes both the node's local events and cluster events. Cluster events have global scope, while the node's own events have local scope. Existing event filters apply to cluster events, and a new scope filter is added.

# run this on a manager node
$ docker events -f scope=global

# a node joins a cluster
2017-04-06T18:03:46.551104594Z node create s0ugk1wi0vgxnspxx0ptmon47 (name=)
2017-04-06T18:03:46.553642227Z node update s0ugk1wi0vgxnspxx0ptmon47 (name=)
2017-04-06T18:03:46.556070184Z node update s0ugk1wi0vgxnspxx0ptmon47 (name=)
2017-04-06T18:03:46.562025609Z node update s0ugk1wi0vgxnspxx0ptmon47 (name=)
2017-04-06T18:03:46.689608127Z node update s0ugk1wi0vgxnspxx0ptmon47 (name=ip-172-19-71-145, state.new=ready, state.old=unknown)
# a node goes down
2017-04-06T18:04:32.705082118Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51, state.new=down, state.old=ready)
# a node is back up
2017-04-06T18:05:06.288643169Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51, state.new=ready, state.old=down)
# promote a node
2017-04-06T18:05:29.965972204Z node update 0urgktyobae3etk5e4n4331es (desiredrole.new=manager, desiredrole.old=worker, name=ip-172-19-147-51)
2017-04-06T18:05:29.972791974Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:05:29.992599632Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:05:30.007889135Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:05:30.084421029Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:05:30.088454689Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
# demote a node
2017-04-06T18:06:15.004097376Z node update 0urgktyobae3etk5e4n4331es (desiredrole.new=worker, desiredrole.old=manager, name=ip-172-19-147-51)
2017-04-06T18:06:20.015726988Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:06:20.035951960Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:06:20.048380156Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:06:20.068198599Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
2017-04-06T18:06:20.072033661Z node update 0urgktyobae3etk5e4n4331es (name=ip-172-19-147-51)
# change a node's availability to pause
2017-04-06T18:07:01.620569322Z node update 0urgktyobae3etk5e4n4331es (availability.new=pause, availability.old=active, name=ip-172-19-147-51)
# change a node's availability to active
2017-04-06T18:07:35.825859057Z node update 0urgktyobae3etk5e4n4331es (availability.new=active, availability.old=pause, name=ip-172-19-147-51)

# create a service
2017-04-06T18:08:43.303340796Z service create 9vvofszhb6iv4k3tmphras96u (name=nginx)
2017-04-06T18:08:43.307625739Z service update 9vvofszhb6iv4k3tmphras96u (name=nginx)
# scale a service
2017-04-06T18:09:37.018798236Z service update 9vvofszhb6iv4k3tmphras96u (name=nginx, replicas.new=3, replicas.old=2)
# update image of a service
2017-04-06T18:11:19.122732546Z service update 9vvofszhb6iv4k3tmphras96u (image.new=nginx:1.10.3@sha256:6202beb06ea61f44179e02ca965e8e13b961d12640101fca213efbfd145d7575, image.old=nginx:latest@sha256:e6693c20186f837fc393390135d8a598a96a833917917789d63766cab6c59582, name=nginx)
2017-04-06T18:11:19.126619069Z service update 9vvofszhb6iv4k3tmphras96u (name=nginx, updatestate.new=updating, updatestate.old=nil)
2017-04-06T18:11:41.552581741Z service update 9vvofszhb6iv4k3tmphras96u (name=nginx, updatestate.new=completed, updatestate.old=updating)
# remove a service
2017-04-06T18:12:23.466026101Z service remove 9vvofszhb6iv4k3tmphras96u (name=nginx)

# create a network
2017-04-06T18:12:45.047497468Z network create qew8jjh6riv75r8tj4wf0imfw (name=tnet)
2017-04-06T18:12:45.053957928Z network update qew8jjh6riv75r8tj4wf0imfw (name=tnet)
# create a container associated with this network
# create a service associated with this network
2017-04-06T18:14:38.161988206Z service create v6md2fr205au7c6wq5zdo6sz1 (name=nginx)
2017-04-06T18:14:38.166951070Z service update v6md2fr205au7c6wq5zdo6sz1 (name=nginx)
# remove the service
2017-04-06T18:15:29.441350106Z service remove v6md2fr205au7c6wq5zdo6sz1 (name=nginx)
# remove a network
2017-04-06T18:15:43.558832104Z network remove qew8jjh6riv75r8tj4wf0imfw (name=tnet)

# create a secret
2017-04-06T18:16:27.127307662Z secret create p3z8b3f2q40fun94yq3vd2g9y (name=mysecret)
# remove a secret
2017-04-06T18:16:55.393649748Z secret remove p3z8b3f2q40fun94yq3vd2g9y (name=mysecret)

- Description for the changelog
Add cluster events to Docker event stream.

daemon/events.go Outdated

Maybe we should create a func eventTimestamp(swarmapi.Meta, swarmapi.WatchActionKind) time.Time function that handles extracting the timestamp. Then we can call it in generateClusterEvent and pass in the timestamp instead of needing to handle it in all the different switch statements.


Would it be useful for Watch to support something like Kind: "*"?

In general I want people to filter for the specific events of interest, but the Docker events API is a bit of an unusual case.

Contributor Author

@dongluochen dongluochen Apr 7, 2017


We don't want to get task change notifications. There are too many of them, and they are implementation details that we don't want users to rely on.

@dongluochen
Contributor Author

Test failure looks unrelated. It's already tracked by #33041.

23:20:09 ----------------------------------------------------------------------
23:20:09 FAIL: check_test.go:355: DockerSwarmSuite.TearDownTest
23:20:09 
23:20:09 check_test.go:360:
23:20:09     d.Stop(c)
23:20:09 daemon/daemon.go:392:
23:20:09     t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
23:20:09 ... Error: Error while stopping the daemon d95be62fe1bc6 : exit status 2

@dongluochen dongluochen changed the title from [WIP] support cluster events to support cluster events May 9, 2017
@dongluochen
Contributor Author

Please review.

daemon/events.go Outdated
Member


logrus.Warn on default?

daemon/events.go Outdated
Member


Can we make these string literals constants?

Contributor Author


Ideally they should be defined in swarmkit, but I can't find them there. I don't think they should be added here.

@AkihiroSuda
Member

cc @aluzzardi @aaronlehmann



Comment doesn't match the field name.

Maybe this can just be Events? I like to avoid putting type information in the field names.

Contributor Author


Changed to WatchStream.



I can't see any code path where it is possible for n.cluster to be nil. I'd suggest removing this - it is a little misleading to suggest that n.cluster can be nil, when other code would blow up in this case.



logrus.WithError(err).Error("failed to receive changes from store watch API")



Should this log anything?



Select on this channel write and <-ctx.Done().

Contributor Author


When <-ctx.Done() fires, shouldn't watch.Recv() return with an error? If it doesn't, it may cause a goroutine leak. Adding <-ctx.Done() alongside the channel write wouldn't help, because execution may never reach this point.

daemon/events.go Outdated


Empty case

daemon/events.go Outdated


Empty case

daemon/events.go Outdated


Don't we need something to check if oldNode is nil? The old value is not guaranteed to be available.

Contributor Author


Why is that? The watch API asks for the old object. In an update case, the old object should always be available.



Providing the old object is best-effort. We're not able to do it if you're looking at past events. I don't think the events code is using this right now, but the point is that the API doesn't guarantee it can provide the old object. Even if it did, this code should still check. It's no good for the client side of a client/server system to crash if the server omits expected values.
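A defensive sketch of that check, with a hypothetical trimmed-down Node type (the real object comes from swarmapi and has many more fields):

```go
package main

import "fmt"

// Minimal stand-in for a swarm node object; illustrative only.
type Node struct {
	Name         string
	Availability string
}

// nodeAttributes builds the event attributes, treating the old object as
// optional: the watch API provides it best-effort only, so a nil oldNode
// simply means no old/new diffs are emitted.
func nodeAttributes(newNode, oldNode *Node) map[string]string {
	attrs := map[string]string{"name": newNode.Name}
	if oldNode != nil && oldNode.Availability != newNode.Availability {
		attrs["availability.old"] = oldNode.Availability
		attrs["availability.new"] = newNode.Availability
	}
	return attrs
}

func main() {
	n := &Node{Name: "ip-172-19-147-51", Availability: "pause"}
	fmt.Println(nodeAttributes(n, nil)) // old object missing: no crash, name only
	fmt.Println(nodeAttributes(n, &Node{Name: "ip-172-19-147-51", Availability: "active"}))
}
```

With the nil guard, a missing old object degrades to an event without diff attributes instead of a panic.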

daemon/events.go Outdated


Don't we need something to check if oldService is nil? The old value is not guaranteed to be available.



"and publishes it"

daemon/events.go Outdated


Don't we need to check if OldObject is nil before calling GetNode, etc.?

Contributor Author

@dongluochen dongluochen May 10, 2017


The GetNode function handles it directly.

   func (m *Object) GetNode() *Node {
       if x, ok := m.GetObject().(*Object_Node); ok {
         return x.Node
       }
       return nil
   }

   func (m *Object) GetObject() isObject_Object {
       if m != nil {
         return m.Object
       }
       return nil
   }

@aaronlehmann

LGTM

@AkihiroSuda AkihiroSuda added the status/failing-ci label May 11, 2017
@aaronlehmann

Please rebase

daemon/events.go Outdated
Member


Maybe this should be swarm. Not sure though.

@dongluochen @aaronlehmann ?

Contributor Author


I'll update it to swarm.

daemon/events.go Outdated
Member


I think we should output node.Role (attributes["role..."]) rather than node.Spec.DesiredRole

Contributor Author


Changed to role.new.



The code is still using node.Spec.DesiredRole. I believe @aluzzardi wanted this to look at node.Role instead.

Contributor Author


Thanks @aaronlehmann. Updated.

@thaJeztah thaJeztah removed the status/failing-ci label May 15, 2017
@dongluochen
Contributor Author

Ping @AkihiroSuda @vdemeester @tonistiigi. Please take a look if you are interested.

@thaJeztah thaJeztah added this to the 17.06.0 milestone May 17, 2017
Member


There shouldn't be a need to close this. I think there is a possibility that the cluster can still attempt to write to this chan after Cleanup has been called (although it's very unlikely, as shutdownDaemon is not fast).

Contributor Author


Thanks. Let me remove it.



If it is not closed, the ProcessClusterNotifications goroutine will never exit.

Member


@aaronlehmann It should take a context then. Or cluster.Cleanup() should close it after it knows there are no writes coming.

Contributor Author


Added a context to ProcessClusterNotifications.

Member


No need to change, but I'm wondering why you chose to pass in a channel and handle this in dockerd instead of defining a LogEvent interface in cluster that would be called to process events.

Contributor Author


It was decided to add a watch API to expose cluster changes, but not to add a separate LogEvent interface, for simplicity.
moby/swarmkit#491 (comment)

Member


I meant an interface in Docker's cluster pkg, like cluster.Backend or cluster.NetworkSubnetsProvider, that would be implemented by daemon.Daemon.

Contributor Author


I'll look at it in a separate change.



Does this need to also select on the Done method of the context passed to ProcessClusterNotifications, so that this won't block forever if ProcessClusterNotifications exits due to context cancellation?

Contributor Author


I don't think it's necessary to do that. c.Cleanup() at https://github.com/moby/moby/blob/master/cmd/dockerd/daemon.go#L278 would trigger this goroutine to return. The call path is Cluster.Cleanup() -> nodeRunner.Stop() -> set nodeRunner.closed -> nodeRunner.handleNodeExit() -> cancel nodeRunner.handleControlSocketChange context -> store watch API gets cancelled.



store watch API got cancelled

This will never happen if this function is blocked on the channel write instead of calling Recv. I don't see what prevents ProcessClusterNotifications from exiting before this function.

As an aside, I think it's bad for ProcessClusterNotifications to exit on either a channel close or a context cancellation. There should be only one way to shut it down.

Contributor Author


Let's use the context from nodeRunner.handleControlSocketChange to make sure it exits.

Member

@vdemeester vdemeester left a comment


LGTM 🦁 😍
(one small question, but can't wait to have that !!)

daemon/events.go Outdated
Member


We don't display label changes; is that by design?

Contributor Author


The attributes are extra information for an event. For example, the node update event 2017-04-06T18:07:01.620569322Z node update 0urgktyobae3etk5e4n4331es (availability.new=pause, availability.old=active, name=ip-172-19-147-51) has a fixed part (timestamp, node update, node ID) and an attributes part (availability.new, availability.old, name). We can add label changes to the attributes if they are useful.

Attributes are added to events in an ad-hoc way. Since swarm objects carry a lot of information, revealing all changes may reduce readability. For example, a node goes through a series of changes when joining a cluster; that is an internal procedure, and users shouldn't need to inspect it.

I'm not very sure how this will be used. We are starting with basic events moby/swarmkit#491 (comment). I expect it to be extended based on user feedback.

Member


I've seen many tools use labels to add configuration that's used when listening for events, so labels would be a likely candidate to add.

Signed-off-by: Dong Chen <dongluo.chen@docker.com>

10 participants