Skip to content

Add cluster health metrics to NodeController#30686

Merged
k8s-github-robot merged 1 commit intokubernetes:masterfrom
gmarek:metrics
Aug 19, 2016
Merged

Add cluster health metrics to NodeController#30686
k8s-github-robot merged 1 commit intokubernetes:masterfrom
gmarek:metrics

Conversation

@gmarek
Copy link
Copy Markdown
Contributor

@gmarek gmarek commented Aug 16, 2016

Follow up of #28832

This adds metrics to monitor cluster/zone status.

cc @alex-mohr @fabioy @wojtek-t @Q-Lee


This change is Reviewable

@gmarek gmarek added the release-note-none Denotes a PR that doesn't merit a release note. label Aug 16, 2016
@k8s-github-robot k8s-github-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 16, 2016
prometheus.GaugeOpts{
Subsystem: NodeControllerSubsystem,
Name: ZoneHealthStatisticKey,
Help: "Gauge measuring percentage of healty nodes per zone.",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer that this be a raw number (e.g. number of healthy nodes per zone) rather than a percentage. This allows for broader usage of this metric, including allowing a dashboard creator to combine with the other "ZoneSize" metric for percentage graphs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if you feel strongly about it. I used percentage because NodeController logic is percentage-based, so it'll be easier to reason about cluster behavior in the current version.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just do both? (i.e. three metrics)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that;)

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 17, 2016
@gmarek
Copy link
Copy Markdown
Contributor Author

gmarek commented Aug 17, 2016

@daviopp @fabioy - I added "evictionsInLast" metric together with pretty complex logic needed for it. Can you please review this PR ASAP, so it might have a chance of going in 1.4?

@gmarek
Copy link
Copy Markdown
Contributor Author

gmarek commented Aug 17, 2016

I must figure out what to do with the fact that metrics appear only after eviction takes place. I need to make sure that we'll be showing appropriate set of zones if they change during runtime.

@davidopp davidopp assigned wojtek-t and unassigned davidopp and wojtek-t Aug 18, 2016
@davidopp
Copy link
Copy Markdown
Contributor

Unfortunately I'm not going to have time to review this before the deadline. Assigning to @wojtek-t to find a reviewer.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 18, 2016
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 18, 2016
}

// Register prometheus metrics
Register()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we generally want to register metrics in init() function (instead of here) - that is the patern that we are using in the code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Done.

@gmarek
Copy link
Copy Markdown
Contributor Author

gmarek commented Aug 18, 2016

@wojtek-t PTAL

newState := nc.computeZoneStateFunc(v)
ZoneSize.WithLabelValues(k).Set(float64(len(v)))
unhealthy, newState := nc.computeZoneStateFunc(v)
ZoneHealth.WithLabelValues(k).Set(float64(100*(len(v)-unhealthy)) / float64(len(v)))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure that len(v) > 0 here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if it were 0, then there would be no Nodes in the given zone, and hence the zone would not exist. This map is created from scratch each time monitorNodeStatus is called.

@wojtek-t
Copy link
Copy Markdown
Member

LGTM

@wojtek-t wojtek-t added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Aug 18, 2016
@k8s-bot
Copy link
Copy Markdown

k8s-bot commented Aug 18, 2016

GCE e2e build/test passed for commit 5d8cb17.

@gmarek gmarek added this to the v1.4 milestone Aug 19, 2016
@gmarek gmarek added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Aug 19, 2016
@k8s-github-robot
Copy link
Copy Markdown

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot
Copy link
Copy Markdown

k8s-bot commented Aug 19, 2016

GCE e2e build/test passed for commit 5d8cb17.

@k8s-github-robot
Copy link
Copy Markdown

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 6b20896 into kubernetes:master Aug 19, 2016
@davidopp
Copy link
Copy Markdown
Contributor

Thanks a lot for reviewing this, @wojtek-t

@gmarek gmarek deleted the metrics branch August 30, 2016 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants