rseq() c/r support by mihalicyn · Pull Request #1706 · checkpoint-restore/criu

mihalicyn · 2021-12-21T20:48:31Z

This patchset provides rseq() C/R support.

Four patches provide the desired support:
cr-dump: handle rseq/rseq_cs flags properly

Userspace may configure the rseq cs abort policy by
setting RSEQ_CS_FLAG_NO_RESTART_ON_* flags.

In ("cr-dump: fixup thread IP when inside rseq cs") we added support for
the case when a process is caught by CRIU while executing an rseq cs, by
fixing up the IP to abort_ip. That's the common case, but there is a special flag
called RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL; when it is set we have to leave
the process IP as it was before CRIU seized it. Unfortunately, that's not
all we need here. We must also preserve the (struct rseq)->rseq_cs field.

You may ask: "Why do we need to preserve it by hand? CRIU dumps
all process memory and restores it." That's true, but it's not that simple. The problem
is that the kernel clears this field when it notices that
the process has left the rseq cs. During the dump/restore procedures we
execute the parasite/restorer in the process context, which means the process
leaves the rseq cs in any case and (struct rseq)->rseq_cs gets cleared
by the kernel. So we need to restore this field by hand at the *last* stage
of restore, just before releasing the processes.
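The save-and-write-back idea can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: `struct demo_rseq`, `struct demo_thread_image`, `demo_dump_rseq_cs()` and `demo_restore_rseq_cs()` are hypothetical names and not CRIU's actual code, and the kernel clearing the field is something the caller simulates with a plain assignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct rseq (hypothetical, for illustration). */
struct demo_rseq {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs; /* address of the active critical section descriptor, or 0 */
	uint32_t flags;
};

/* Hypothetical per-thread record kept in the checkpoint image. */
struct demo_thread_image {
	bool has_rseq_cs;
	uint64_t rseq_cs;
};

/* Dump side: remember rseq_cs as it was when the thread was seized. */
void demo_dump_rseq_cs(struct demo_thread_image *img, const struct demo_rseq *abi)
{
	assert(img != NULL && abi != NULL);
	img->has_rseq_cs = abi->rseq_cs != 0;
	img->rseq_cs = abi->rseq_cs;
}

/* Restore side: meant to run as the very last step, after the restorer code is done. */
void demo_restore_rseq_cs(const struct demo_thread_image *img, struct demo_rseq *abi)
{
	assert(img != NULL && abi != NULL);
	if (img->has_rseq_cs)
		abi->rseq_cs = img->rseq_cs;
}
```

The point of the ordering is that any write made earlier than the last stage would be undone again as soon as the thread runs more restorer code outside the critical section.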

cr-dump: fixup thread IP when inside rseq cs

If we catch a process while it is inside an rseq
critical section, we have to handle it properly.

From the kernel's point of view, if a process
is executing inside an rseq cs and gets a signal,
execution of the critical section is interrupted
and, after the signal handler has run, execution continues
at the rseq cs abort handler instead of resuming the
critical section (unless RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
is set).

From the rseq point of view, being seized by CRIU is the same
as receiving a signal. So we need to fix up the instruction
pointer to the rseq cs abort handler
address.
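The fixup decision described above can be sketched as a pure function. This is an illustration, not the code from the patch: the `demo_*` names are hypothetical, `struct demo_rseq_cs` mirrors the kernel's `struct rseq_cs` layout, and combining the flags from `struct rseq` and from the critical section descriptor reflects my understanding of how kernels of that time evaluate them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as the kernel's struct rseq_cs, redefined to stay self-contained. */
struct demo_rseq_cs {
	uint32_t version;
	uint32_t flags;
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/* Value of RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL in the kernel UAPI header. */
#define DEMO_RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL (1U << 1)

/* True if ip lies in [start_ip, start_ip + post_commit_offset). */
bool demo_ip_in_rseq_cs(uint64_t ip, const struct demo_rseq_cs *cs)
{
	/* Unsigned subtraction wraps around, so ip < start_ip is rejected too. */
	return ip - cs->start_ip < cs->post_commit_offset;
}

/*
 * Returns the IP the thread should be resumed at after being seized:
 * abort_ip if it was interrupted inside the critical section and restarts
 * on signal are not disabled, the unchanged ip otherwise.
 */
uint64_t demo_fixup_ip(uint64_t ip, const struct demo_rseq_cs *cs, uint32_t rseq_flags)
{
	if (cs == NULL || !demo_ip_in_rseq_cs(ip, cs))
		return ip;
	if ((rseq_flags | cs->flags) & DEMO_RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)
		return ip;
	return cs->abort_ip;
}
```

With a critical section covering 0x1000..0x101f and abort_ip 0x2000, an IP of 0x1010 is moved to 0x2000, while 0x1020 (the first address after the section) is left alone.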

rseq: fail dump if rseq is used but host doesn't support get_rseq_conf feature

Many kernel versions lack support for ptrace(PTRACE_GET_RSEQ_CONFIGURATION),
but the userspace may be recent (for instance, a container with a recent Fedora running
on a CentOS 7 host). Consider two scenarios:

- kernel has no ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support

1. there is a process which uses rseq => fail the dump
2. there is no process which uses rseq => we can dump without any problems

But how can we determine whether a process uses rseq without the get_rseq_conf feature?
Let's just try to register rseq from the parasite. If rseq is already
registered we'll get an EBUSY error. If not, the registration will succeed.
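A minimal sketch of such a probe, run here in the calling thread rather than from a parasite. The `demo_*` names and the signature value are made up for this example. One assumption worth flagging: as far as I can tell from the kernel's rseq() implementation, EBUSY is returned only when the same address, size and signature are registered again, while a different address yields EINVAL, so the sketch treats both as "already registered".

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal 32-byte, 32-byte-aligned stand-in for the kernel's original struct rseq. */
struct demo_rseq {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
	uint32_t flags;
} __attribute__((aligned(32)));

#define DEMO_RSEQ_FLAG_UNREGISTER 1 /* RSEQ_FLAG_UNREGISTER in the kernel UAPI */
#define DEMO_RSEQ_SIG 0x53053053U   /* arbitrary signature chosen for this sketch */

/*
 * Returns 1 if the calling thread already has rseq registered,
 * 0 if it had not (the probe registration is undone again),
 * -1 if this cannot be determined (e.g. the rseq syscall is unavailable).
 * Single-threaded sketch: the probe area is shared static storage.
 */
int demo_probe_rseq_in_use(void)
{
#ifdef SYS_rseq
	static struct demo_rseq probe; /* must stay valid while it is registered */

	if (syscall(SYS_rseq, &probe, sizeof(probe), 0, DEMO_RSEQ_SIG) == 0) {
		/* Nothing was registered before: undo our own registration. */
		if (syscall(SYS_rseq, &probe, sizeof(probe), DEMO_RSEQ_FLAG_UNREGISTER, DEMO_RSEQ_SIG))
			return -1;
		return 0;
	}
	if (errno == EBUSY || errno == EINVAL)
		return 1;
	return -1;
#else
	return -1;
#endif
}
```

On a system whose libc registers rseq at thread start the probe is expected to report 1; on a kernel without the rseq syscall it reports -1, in which case no process can be using rseq anyway.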

rseq: initial support

Support basic rseq C/R scenario. Assume that:
- there are no processes with IP inside the rseq critical section (CS)
- kernel has ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support

On dump:
1. use ptrace(PTRACE_GET_RSEQ_CONFIGURATION) to get
struct rseq pointer, rseq size and signature from the kernel.
2. save to the image

On restore:
1. get rseq ptr, size, signature from the image
2. register it back using rseq() from the restorer parasite
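The dump step can be sketched as below. This is not the code from the patch: `demo_get_rseq_conf()` and `struct demo_rseq_conf` are hypothetical names, the struct mirrors the kernel's `struct ptrace_rseq_configuration`, and the request number is taken from the kernel UAPI header. The tracee is assumed to be ptrace-attached and stopped already.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* PTRACE_GET_RSEQ_CONFIGURATION from the kernel UAPI (Linux >= 5.13). */
#define DEMO_PTRACE_GET_RSEQ_CONFIGURATION 0x420f

/* Same layout as the kernel's struct ptrace_rseq_configuration. */
struct demo_rseq_conf {
	uint64_t rseq_abi_pointer; /* address of struct rseq in the tracee */
	uint32_t rseq_abi_size;
	uint32_t signature;
	uint32_t flags;
	uint32_t pad;
};

/*
 * Fills *conf with the rseq registration of the stopped tracee pid.
 * Returns 0 on success and -1 on failure (errno is set when ptrace
 * itself fails, e.g. EIO on kernels without this request).
 */
int demo_get_rseq_conf(pid_t pid, struct demo_rseq_conf *conf)
{
	/* The "addr" argument carries the size of the buffer passed as "data". */
	long ret = ptrace(DEMO_PTRACE_GET_RSEQ_CONFIGURATION, pid,
			  (void *)(uintptr_t)sizeof(*conf), conf);

	return ret == (long)sizeof(*conf) ? 0 : -1;
}
```

On restore, the three saved values would be passed back to rseq() from inside the restored thread, roughly as rseq(rseq_abi_pointer, rseq_abi_size, 0, signature).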

Have done:

Fixes #1696

https://criu.org/Rseq

adrianreber · 2021-12-21T21:10:00Z

You can run tests in CI with a kernel >= 5.13 if you use our KVM based setup which we use for vdso=0 testing. If you start a rawhide container on the Fedora Vagrant VM you should be able to get latest glibc with a new enough kernel.

codecov-commenter · 2021-12-22T14:38:32Z

Codecov Report

Merging #1706 (a381fb0) into criu-dev (7d7d25f) will decrease coverage by 0.05%.
The diff coverage is 65.85%.

❗ Current head a381fb0 differs from pull request most recent head da6a243. Consider uploading reports for the commit da6a243 to get more accurate results

@@             Coverage Diff              @@
##           criu-dev    #1706      +/-   ##
============================================
- Coverage     69.35%   69.30%   -0.06%     
============================================
  Files           128      128              
  Lines         32087    32213     +126     
============================================
+ Hits          22255    22326      +71     
- Misses         9832     9887      +55

Impacted Files	Coverage Δ
criu/arch/x86/include/asm/types.h	`100.00% <ø> (ø)`
criu/include/parasite.h	`100.00% <ø> (ø)`
criu/include/pstree.h	`100.00% <ø> (ø)`
criu/include/rst_info.h	`100.00% <ø> (ø)`
criu/include/util.h	`100.00% <ø> (ø)`
criu/util.c	`60.32% <40.54%> (-0.78%)`	⬇️
criu/cr-check.c	`62.94% <60.00%> (+1.00%)`	⬆️
criu/cr-restore.c	`67.06% <65.78%> (-0.23%)`	⬇️
criu/cr-dump.c	`75.30% <72.13%> (-0.21%)`	⬇️
criu/pstree.c	`85.26% <86.66%> (+0.03%)`	⬆️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7d7d25f...da6a243.

mihalicyn · 2021-12-22T14:39:53Z

Hehe, great. We have CentOS 8 working. That's good news because it's the corner case where we have the rseq() syscall but lack ptrace(PTRACE_GET_RSEQ_CONFIGURATION).
Fedora novdso (with a fresh kernel) passes too! Fine.

mihalicyn · 2021-12-22T14:56:45Z

> You can run tests in CI with a kernel >= 5.13 if you use our KVM based setup which we use for vdso=0 testing. If you start a rawhide container on the Fedora Vagrant VM you should be able to get latest glibc with a new enough kernel.

Yes, thanks for pointing it out. I've taken a look at scripts/ci/vagrant.sh; it looks like for our needs we can just adjust the FEDORA_VERSION / FEDORA_BOX_VERSION variables. Unfortunately, the https://app.vagrantup.com/fedora repository has no image for fedora-rawhide. :(

adrianreber · 2021-12-22T15:04:39Z

Just run a rawhide container via Podman on the vagrant VM. You can basically re-use our rawhide on GitHub Actions setup in the VM and you should have everything you need.

mihalicyn · 2021-12-22T15:08:48Z

thanks, Adrian ;) Will try!

mihalicyn · 2022-04-13T14:59:41Z

> I have a fix in #1738

great! Let's reopen that PR?

Support basic rseq C/R scenario. Assume that:
- there are no processes with IP inside the rseq critical section (CS)
- kernel has ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support

On dump:
1. use ptrace(PTRACE_GET_RSEQ_CONFIGURATION) to get
struct rseq pointer, rseq size and signature from the kernel.
2. save to the image

On restore:
1. get rseq ptr, size, signature from the image
2. register it back using rseq() from the restorer parasite

Fixes: checkpoint-restore#1696
Reported-by: Radostin Stoyanov <radostin@redhat.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

figiel · 2022-04-14T15:59:30Z

@mihalicyn OK, I'll try to review this tomorrow.

test/zdtm/static/rseq00.c

test/zdtm/transition/rseq01.c

criu/cr-dump.c

We have the ability to use nested virtualization on Cirrus and already have the "Vagrant Fedora based test (no VDSO)" test; let's add an analogous one for Fedora Rawhide to get a fresh kernel.
Suggested-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

Let's take the thread_pointer() implementation from Glibc. It will be useful later on because Glibc stores struct rseq in TLS. The absolute address can be calculated as __criu_thread_pointer() + __rseq_offset, where __rseq_offset is a symbol exported by Glibc itself.

We need to be able to determine where struct rseq is stored in order to unregister it in CRIU during the restore stage. For other libcs, such as musl-libc, we will have to handle rseq separately, depending on how struct rseq is stored. Right now that's not a problem because musl-libc has no rseq support, so there is nothing to unregister.

https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=8dbeb0561eeb876f557ac9eef5721912ec074ea5
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=cb976fba4c51ede7bf8cee5035888527c308dfbc
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
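The address calculation can be sketched as follows. The `demo_*` names are hypothetical and this is not the helper from the patch. `__rseq_offset` is declared weak here, which is an assumption made so that the example also links against a libc that does not export it (older Glibc, musl-libc); in that case the function returns NULL. The thread pointer read follows the approach of Glibc's <thread_pointer.h> as I understand it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Exported by Glibc >= 2.35; weak so that the sketch links without it too. */
extern const ptrdiff_t __rseq_offset __attribute__((weak));

/* Thread pointer of the calling thread. */
static inline void *demo_thread_pointer(void)
{
#if defined(__x86_64__)
	void *tp;
	/* The TCB starts with a pointer to itself, reachable through %fs. */
	__asm__("mov %%fs:0, %0" : "=r"(tp));
	return tp;
#elif defined(__i386__)
	void *tp;
	__asm__("mov %%gs:0, %0" : "=r"(tp));
	return tp;
#else
	return __builtin_thread_pointer();
#endif
}

/*
 * Address of the libc-managed struct rseq of the calling thread,
 * or NULL if the libc does not export __rseq_offset.
 */
void *demo_libc_rseq_addr(void)
{
	if (&__rseq_offset == NULL)
		return NULL;
	return (char *)demo_thread_pointer() + __rseq_offset;
}
```

The returned address is what would have to be passed to rseq() with the unregister flag before the memory holding it is unmapped.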

mihalicyn · 2022-04-17T18:51:24Z

Dear friends,
zdtm/transition/rseq01 on Alpine Linux is fixed. Thanks to Mathieu @compudj for looking into this.
I shot myself in the foot by corrupting the stack with an unpaired fldl instruction. It's really strange that we got crashes only on musl-libc.

Andrei suggests a simpler and better approach: make the rseq cs execution time longer by using the pause instruction. I will do that in a later version, but let's keep the current version with FPU instructions in memory of the hours spent searching for the mistake. :D

criu/cr-restore.c

Fresh glibc does rseq registration by default during start_thread().
[ see https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=95e114a0919d844d8fe07839cb6538b7f5ee920e ]

This causes process crashes during the memory restore procedure, because the memory which holds struct rseq will be unmapped and overwritten in __export_restore_task.

Let's perform rseq unregistration just before unmap_old_vmas(). To achieve that we first need to determine the (struct rseq) address while we are still in Glibc (we do that in prep_libc_rseq_info using Glibc exported symbols).

See also
("nptl: Add public rseq symbols and <sys/rseq.h>")
https://sourceware.org/git?p=glibc.git;a=commit;h=c901c3e764d7c7079f006b4e21e877d5036eb4f5
("nptl: Add <thread_pointer.h> for defining __thread_pointer")
https://sourceware.org/git?p=glibc.git;a=commit;h=8dbeb0561eeb876f557ac9eef5721912ec074ea5

TODO: do the same for musl-libc if it starts to register rseq by default

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

Strangely, the rseq02 test fails with:
09:06:16.222: 51: exit 555f52082120 555f52082120
09:06:16.282: 51: exit 555f52082120 555f52082120
09:06:16.340: 51: exit 555f52082120 555f52082120
09:06:16.397: 51: exit 555f52082120 555f52082120
09:06:16.503: 51: exit 0 555f52082120
09:06:16.503: 51: FAIL: rseq02.c:235: Failed to increment per-cpu counter (errno = 2 (No such file or directory))
09:06:16.503: 51: FAIL: rseq02.c:246: (errno = 16 (Device or resource busy))

It means that the rseq_cs pointer was cleared by the kernel despite the NO_RESTART* flags. This is hard to reproduce and I will investigate it.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

We have a separate target for alpine in script/ci/Makefile which defines some extra opts for zdtm using the ZDTM_OPTIONS variable. But it doesn't actually work. First of all, the variable should be named ZDTM_OPTS, and we also have to pass it directly on the CONTAINER_RUNTIME command line to make it work.

I've also changed the variable value to make it consistent with the docker.env value which was actually used.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

mihalicyn · 2022-04-18T08:22:30Z

JFYI: I'm working on the article https://criu.org/Rseq

mihalicyn force-pushed the rseq branch 2 times, most recently from 93192f5 to 4d7f1d1 Compare December 21, 2021 21:06

mihalicyn force-pushed the rseq branch 3 times, most recently from d46609b to 7d23bde Compare December 22, 2021 14:24

mihalicyn requested review from 0x7f454c46, Snorch, avagin, rppt and rst0git December 22, 2021 14:41

mihalicyn force-pushed the rseq branch 14 times, most recently from 49b90a6 to c6d2ff1 Compare December 23, 2021 13:50

mihalicyn added 5 commits April 14, 2022 11:11

compel: add rseq syscall into compel std plugin syscall tables

8d9143a

Add rseq syscall numbers for: arm/aarch64, mips64, ppc64le, s390, x86_64/x86
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

kerndat: check for rseq syscall support

8a6ef04

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

util: move fork_and_ptrace_attach helper from cr-check

36ffd78

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

cr-check: Add ptrace rseq conf dump feature

e4226e9

Add "get_rseq_conf" feature corresponding to the ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

figiel suggested changes Apr 15, 2022

mihalicyn added 5 commits April 17, 2022 18:33

zdtm: add basic static/rseq00 test for rseq C/R

38a9576

Here we just want to check that if rseq was registered before C/R it remains registered after it.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

Revert "ci: disable glibc rseq support"

373318f

Let's see how the rseq() C/R feature works.
This reverts commit d99def7.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

avagin reviewed Apr 18, 2022

mihalicyn added 12 commits April 18, 2022 09:32

zdtm/static/rseq00: fix rseq test when linking with a fresh Glibc

db12a56

Fresh Glibc registers rseq() by default. We need to unregister rseq before registering our own.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

compel: add helpers to get/set instruction pointer

6e0a642

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

zdtm: add transition/rseq01 test for amd64

97a6ef9

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

Revert "test: disable rseq also on Archlinux"

e23cf94

This reverts commit f008f74.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

zdtm: add rseq02 transition test with NO_RESTART CS flag

d2e3dd1

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

ci: criu-fault: skip inotify_irmap fault-injection on btrfs

8a8c8ad

It looks like we've got broken fhandles from fdinfo for inotifies/fanotifies for btrfs. I will look into that.
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

ci: print mountinfo instead of mount cmd output

da6a243

mountinfo contains more info than just "mount" output
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

edsantiago mentioned this pull request Apr 21, 2022

Workaround criu re-linking output in system test containers/podman#13958

Merged

compor mentioned this pull request Feb 26, 2023

CRIU segfault on restore for ARM systems-nuts/unifico#228

Open

2 tasks

9 participants
