add metrics for image pulling: error; in progress count; thoughput #7313
add metrics for image pulling: error; in progress count; thoughput #7313fuweid merged 1 commit intocontainerd:mainfrom
Conversation
|
Hi @pacoxu. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
I will do some test locally later. |
afc79c3 to
e80b52c
Compare
The metrics now is like above. |
e80b52c to
cde8646
Compare
Jenkins-J
left a comment
There was a problem hiding this comment.
This looks good to me.
pkg/cri/server/metrics.go
Outdated
| containerStopTimer = ns.NewLabeledTimer("container_stop", "time to stop a container", "runtime") | ||
| containerStartTimer = ns.NewLabeledTimer("container_start", "time to start a container", "runtime") | ||
|
|
||
| imagePullingError = ns.NewLabeledCounter("image_pulling_error", "error count of image pulling by image name", "image_name") |
There was a problem hiding this comment.
Using image name will make the cardinality too high. Prometheus' doc recommends to keep the cardinality lower than 10, so I guess registry name should be a better option here.
[1] https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
There was a problem hiding this comment.
imagePulls = ns.NewLabeledCounter("image_pulling", "count of image pulling by registry", "status", "registry")
I changed to this.
There was a problem hiding this comment.
I don't think it's reasonable to assume that containerd will only pull from 10 distinct domains during its lifetime (which appears to be the actual distinction, not registries?). Some container registry providers use many distinct domains (for example, Amazon ECR has a different domain name for each customer account).
pkg/cri/server/metrics.go
Outdated
| containerStopTimer metrics.LabeledTimer | ||
| containerStartTimer metrics.LabeledTimer | ||
|
|
||
| imagePullingError metrics.LabeledCounter |
There was a problem hiding this comment.
Should we use a latency metric instead, with error, registry as labels? By doing that users will be able to
- see the image pull latencies,
- see both error count and success count, and therefore calculate the error ratio.
There was a problem hiding this comment.
To add a count for both error pulling and success pulling, I agree.
imagePulls = ns.NewLabeledCounter("image_pulling", "count of image pulling by registry", "error", "registry")
For image pull latencies, do you mean the pulling duration? Could you explain?
- for the duration, I think imagePullThroughput would be better as the image size would 1MiB or 1GB. The latency does not make sense in some scenarios.
There was a problem hiding this comment.
If we add the error as a label, some error message contains the image name and some detailed information. If so, the registry label is not that important. If not, we have to make the error more groupable.
There was a problem hiding this comment.
I was thinking of making this metric something like this:
imagePullTimer = ns.NewLabeledTimer("image_pull", "time to pull an image", "error", "registry")
, where error is just a boolean indicating whether the pull succeeds or not.
By doing this:
- if user wants to get the overall pulling duration, regardless of each image size, they can get this metric with
error=='false'. - if user wants to get the number of failed image pulls, they can get counts of this metric with
error=='true'
But yeah, I think what you are doing here (i.e. separating the counter metric from duration/throughput metric) is also fine. I don't have a strong preference here.
There was a problem hiding this comment.
I added like
imagePulls = ns.NewLabeledCounter("image_pulls", "count of image pulling by registry", "status", "registry")
Two value is valid: status=failed and status=succeed
cde8646 to
b8de211
Compare
pkg/cri/server/image_pull.go
Outdated
| inProgressImagePulls.Inc() | ||
| defer inProgressImagePulls.Dec() | ||
| var domain string | ||
| var err error |
There was a problem hiding this comment.
I guess a common practice of handling error in defer is to use named return value, instead of using a var like here, because in this case you can't be sure err will always be the return value.
For example, if you look at
containerd/pkg/cri/server/image_pull.go
Line 196 in b8de211
the returned error is based on a local err created in the for-loop. So your defer here will not catch that local error.
(It's similar to https://go.dev/play/p/LNkan9NkAUt)
There was a problem hiding this comment.
If so, I have to wrap like below.
func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (*runtime.PullImageResponse, error) {
resp, domain, err := c.PullImageImp(ctx, r)
if err != nil {
imagePulls.WithValues("failed", domain).Inc()
} else {
imagePulls.WithValues("succeed", domain).Inc()
}
return resp, err
}
PullImageImp is the original function. Is this OK?
There was a problem hiding this comment.
no you dont need to. Just use a named return value for the returned error and use the named return value in your defer, and golang will make sure it is always the returned error. Like this:
func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
defer func() {
if retErr != nil {
...
} else {
...
}
}()
There was a problem hiding this comment.
As I need the domain as a label of the metrics, and the image registry domain is not returned.
This still does not work.
So I kept the change, would you take a look again?
There was a problem hiding this comment.
Sorry for the late reply. I am just trying to minimize the number of wrapper methods. Can we do something like
func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest) (_ *runtime.PullImageResponse, retErr error) {
var domain string
defer func() {
if retErr != nil {
...
} else {
...
}
}()
and domain will be empty if it errs when trying to parse it, and will be non-empty if the parsing succeeds.
Commit b8de211 is missing a the |
Thanks. I will sign off all commits. |
b8de211 to
0b70c38
Compare
pkg/cri/server/image_pull.go
Outdated
| imagePullThroughput.WithLabelValues(domain).Observe(imagePullingSpeed) | ||
|
|
||
| log.G(ctx).Infof("Pulled image %q with image id %q, repo tag %q, repo digest %q, size %q in %v s", imageRef, imageID, | ||
| repoTag, repoDigest, strconv.FormatInt(size, 10), time.Since(startTime).Seconds()) |
There was a problem hiding this comment.
nod one metric per domain ref is probably enough..
If someone needs per image log data... maybe that should be retrieved from the log?
We might need to add the size and duration to the CRI PullImage Response..
pkg/cri/server/image_pull.go
Outdated
| imagePullThroughput.WithLabelValues(domain).Observe(imagePullingSpeed) | ||
|
|
||
| log.G(ctx).Infof("Pulled image %q with image id %q, repo tag %q, repo digest %q, size %q in %v s", imageRef, imageID, | ||
| repoTag, repoDigest, strconv.FormatInt(size, 10), time.Since(startTime).Seconds()) |
There was a problem hiding this comment.
I think per domain by default is too granular/too high cardinality.
8e53068 to
83ba4b4
Compare
|
@mikebrow @samuelkarp Thanks for your review. |
|
Code LGTM, but can you squash your commits into one? |
…nt; thoughput Signed-off-by: Paco Xu <paco.xu@daocloud.io>
83ba4b4 to
c59f163
Compare
|
Squashed. |
|
apologies for commenting on an old MR, but any ideas if and how CRI metrics can be enabled for an AKS cluster running containerd? I'd like to measure image pull times for a cluster, and getting it from the CRI metrics is the only way I have found so far? |
Does |
|
Hi, If I implement |
We may discuss it in kubernetes/enhancements#2371. @shivam99aa |
/cc @cpuguy83
Fixes #7241
This pr tries to add some metrics for image pulling.
image throughout may be related to CPU/disk io/network.
Further options to add for image pulling:
TODO(some unfinished work in this PR)
already exist layers or imageinto account. (This means the metrics will show a quick pulling for existing images. As user always want to know which image pulling is slow, this can be fixed later.)