Skip to content

criu checkpoint/restore: print errors from criu log#3816

Merged
lifubang merged 4 commits intoopencontainers:mainfrom
kolyshkin:show-criu-errors
Aug 5, 2023
Merged

criu checkpoint/restore: print errors from criu log#3816
lifubang merged 4 commits intoopencontainers:mainfrom
kolyshkin:show-criu-errors

Conversation

@kolyshkin
Copy link
Copy Markdown
Contributor

@kolyshkin kolyshkin commented Apr 6, 2023

When criu fails, it does not give us much context to understand what was the cause of an error -- for that, we need to take a look into its log file.

This is somewhat complicated to do (as you can see in parts of checkpoint.bats removed by this commit), and not very convenient.

Add a function to find and log errors from criu logs, in case either checkpoint or restore has failed.

Fixes: #3711

@kolyshkin
Copy link
Copy Markdown
Contributor Author

@adrianreber @avagin PTAL

@kolyshkin kolyshkin added this to the 1.2.0 milestone Apr 9, 2023
@adrianreber
Copy link
Copy Markdown
Contributor

That is good idea. A bit wild maybe. When using CRIU in Podman or CRI-O users usually get the error message from runc and pass it to the user in combination with the location of the log file.

I guess it would be important to know if users of runc's checkpoint functionality (like containerd, Podman and CRI-O) can handle this changed runc behaviour.

Your log scanner only seems to run during a failure so the added time to scan the log file should normally not be a problem.

I like this idea of better error reporting to the user but I am not sure how well runc's user (container engines) can handle multiline error messages.

@kolyshkin
Copy link
Copy Markdown
Contributor Author

I like this idea of better error reporting to the user but I am not sure how well runc's user (container engines) can handle multiline error messages.

I've seen quite a few cases when runc emits multiple errors/warnings etc., so this should not be something that's entirely new. Surely, we'll have 1.2.0rc released to test this.

@avagin
Copy link
Copy Markdown
Contributor

avagin commented Apr 24, 2023

LGTM

@kolyshkin kolyshkin force-pushed the show-criu-errors branch 3 times, most recently from 7579873 to 15c1c60 Compare August 1, 2023 17:32
@kolyshkin kolyshkin requested review from avagin and rst0git August 1, 2023 17:33
Copy link
Copy Markdown
Contributor

@rst0git rst0git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

No code change, only added periods to some comments to make godot happy.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Use "switch t" since we only check t.

2. Remove unneeded t assignment.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When criu fails, it does not give us much context to understand what
was the cause of an error -- for that, we need to take a look into its
log file.

This is somewhat complicated to do (as you can see in parts of
checkpoint.bats removed by this commit), and not very user-friendly.

Add a function to find and log errors from criu logs, together with some
preceding context, in case either checkpoint or restore has failed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As we now log the log file name in logCriuErrors.

While at it, there is no need to use var.String() with %s as it is done
by the runtime.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@avagin
Copy link
Copy Markdown
Contributor

avagin commented Aug 3, 2023

LGTM

@kolyshkin
Copy link
Copy Markdown
Contributor Author

@adrianreber @AkihiroSuda @cyphar PTAL

Copy link
Copy Markdown
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Seems a little bit crazy but I guess it's better to get some information in the actual runc log rather than requiring the user to go looking themselves. We will have to see if this breaks anything in 1.2-rc1.

@kolyshkin
Copy link
Copy Markdown
Contributor Author

LGTM. Seems a little bit crazy but I guess it's better to get some information in the actual runc log rather than requiring the user to go looking themselves. We will have to see if this breaks anything in 1.2-rc1.

The alternative is to make criu report extended errors. This is somewhat complicated because when debugging criu some context is usually needed (to see what happened just before the error) and grep -B N log naturally provides this context. Nevertheless, maybe something better will replace this in, alas not yet, so for now this PR is (way) better than spewing the log file name.

@kolyshkin
Copy link
Copy Markdown
Contributor Author

@lifubang PTAL

@lifubang lifubang merged commit 7b1fc98 into opencontainers:main Aug 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement c/r error reporting

6 participants