[cri] Handle pod transition states gracefully while listing pod stats#8671
Conversation
|
Hi @jsturtevant. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
|
||
| processes, err := task.Pids(ctx) | ||
| if err != nil { | ||
| if errdefs.IsNotFound(err) { |
There was a problem hiding this comment.
this requires microsoft/hcsshim#1807 but should be ok to merge this and follow up via hcsshim version bump. Alternatively, I can handle the errors directly here.
13c357b to
74cfd2f
Compare
b0f40c9 to
d0e8742
Compare
|
looks like the ci integration was a flake |
|
/hold |
d0e8742 to
05b5bcb
Compare
|
/hold cancel |
|
/ok-to-test |
mikebrow
left a comment
There was a problem hiding this comment.
would prefer doing this in a common way across platform impls over in podsandboxstats and listpodsandboxstats...
not sure it makes sense to return nil vs err for podsandboxstats.. but agree it makes sense to not fail all of listpodsandboxstats for one pod being not ready or unknown at the point of gathering the stats, that feels like an error
@bobbypage @haircommander do you have any thoughts? I assume some of the changes to soften the errors would apply to Linux side as well |
I am leaning towards returning an err for It might make sense to wrap some of the transient errs to make the decision easier to understand but at this time I'm not sure I know which errors we would really want to fail on for |
|
I believe most of the time an error is treated as transient. if it's worth logging in the kubelet, then I'd report it. If it's part of normal state changes in containerd, then I would say hide the errors. e.g: is there action a user can take, or a maintainer should be aware of? if so, error. if not, hide |
05b5bcb to
67ea3f4
Compare
|
@haircommander @mikebrow I've made the following change e489ff5 |
d3de050 to
5cba93b
Compare
|
@haircommander @mikebrow PTAL |
|
LGTM |
mikebrow
left a comment
There was a problem hiding this comment.
See comments regarding using switch for err processing and changing the default for err to continue setting up to return the best can do results and the first/last/list of transient/errors etc.. vs return nil with err on unknown type of err..
|
will be interesting on the kubelet side if they will choose to parse and eval the transient stats errors or log and ignore as we are moving to here... or if there will be a subsequent request to go even further and clear the transient errors. FYI @haircommander @dims @saschagrunert @SergeyKanzhelev |
3d7fd8b to
f1f144f
Compare
@mikebrow - I updated the PR to use a switch statement for style |
|
just a couple logistical things.. pls remove the .idea files and squash.. thx! |
.idea/.gitignore
Outdated
| @@ -0,0 +1,8 @@ | |||
| # Default ignored files | |||
When the pods are transitioning there are several cases where containers might not be in valid state. There were several cases where the stats where failing hard but we should just continue on as they are transient and will be picked up again when kubelet queries for the stats again. Signed-off-by: James Sturtevant <jstur@microsoft.com> Signed-off-by: Mark Rossetti <marosset@microsoft.com>
f1f144f to
f914edf
Compare
done and done! |
|
any objections to flagging these as cherry pick for 1.7 and 1.6? |
When the pods are transitioning there are several cases where containers might not be in valid state. There were several cases where the stats where failing hard but we should just continue on as they are transient and will be picked up again when kubelet queries for the stats again.
I've found these after running the e2e tests in kubernetes/kubernetes#116968 several times.