rseq() c/r support #1706

Merged
avagin merged 22 commits into checkpoint-restore:criu-dev from mihalicyn:rseq
Apr 18, 2022

Conversation

@mihalicyn
Member

@mihalicyn mihalicyn commented Dec 21, 2021

This patchset provides rseq() C/R support.

Four patches provide the core of the support:
cr-dump: handle rseq/rseq_cs flags properly

Userspace may configure rseq cs abort policy by
setting RSEQ_CS_FLAG_NO_RESTART_ON_* flags.

In ("cr-dump: fixup thread IP when inside rseq cs") we added support for
the case where CRIU catches a process during rseq cs execution, by
fixing up the IP to abort_ip. That's the common case, but there is a special flag
called RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL; when it is set we have to leave
the process IP as it was before CRIU seized it. Unfortunately, that's not
all we need here. We must also preserve the (struct rseq)->rseq_cs field.

You may ask: "Why do we need to preserve it by hand? CRIU dumps
all process memory and restores it." That's true, but it's not that easy. The problem
is that the kernel clears this field when it notices that
the process has left the rseq cs. During the dump/restore procedures we
execute the parasite/restorer in the process context. It means that the process
leaves the rseq cs in any case and (struct rseq)->rseq_cs is cleared
by the kernel. So we need to restore this field by hand at the *last* stage
of restore, just before releasing the processes.
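
A minimal sketch of the decision described above, not the actual CRIU code. The names `choose_cs_action`, `enum cs_action` and `SKETCH_NO_RESTART_ON_SIGNAL` are made up for illustration; the macro is assumed to mirror the kernel's RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL value (bit 1), and the sketch assumes the flag may come from either struct rseq or the rseq_cs descriptor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the kernel's RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL (bit 1). */
#define SKETCH_NO_RESTART_ON_SIGNAL (1U << 1)

/* Hypothetical names, for illustration only. */
enum cs_action {
	CS_NOTHING,              /* thread was not inside an rseq cs */
	CS_JUMP_TO_ABORT,        /* rewrite IP to abort_ip at dump time */
	CS_KEEP_IP_SAVE_RSEQ_CS  /* leave IP alone, write rseq_cs back at the end of restore */
};

static enum cs_action choose_cs_action(bool inside_cs, uint32_t rseq_flags, uint32_t cs_flags)
{
	if (!inside_cs)
		return CS_NOTHING;
	/* Assumption: the flag counts whether set on struct rseq or on rseq_cs. */
	if ((rseq_flags | cs_flags) & SKETCH_NO_RESTART_ON_SIGNAL)
		return CS_KEEP_IP_SAVE_RSEQ_CS;
	return CS_JUMP_TO_ABORT;
}
```

In the CS_KEEP_IP_SAVE_RSEQ_CS case the saved rseq_cs pointer has to be written back as late as possible, since running the restorer in the process context makes the kernel clear it.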

cr-dump: fixup thread IP when inside rseq cs

If we catch a process while it is inside an rseq
critical section, we have to handle it properly.

From the kernel's point of view, if a process
is executing inside an rseq cs and gets a signal,
the rseq critical section is interrupted
and, after the signal handler has run, execution proceeds
to the rseq cs abort handler instead of continuing the normal
rseq cs execution (unless RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
is set).

When CRIU seizes a process, that is the same as
receiving a signal from the rseq point of view. So we need
to fix up the instruction pointer to the rseq cs abort handler
address.
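
The fixup above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the patch: `struct rseq_cs_sketch` is assumed to mirror the field layout of the kernel's struct rseq_cs, and `ip_in_rseq_cs` / `fixup_ip` are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel's struct rseq_cs descriptor. */
struct rseq_cs_sketch {
	uint32_t version;
	uint32_t flags;
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/* True if ip lies in [start_ip, start_ip + post_commit_offset). */
static bool ip_in_rseq_cs(const struct rseq_cs_sketch *cs, uint64_t ip)
{
	/* Unsigned subtraction wraps, so ip < start_ip is rejected too. */
	return ip - cs->start_ip < cs->post_commit_offset;
}

/* Returns the IP that should be saved for the thread. */
static uint64_t fixup_ip(const struct rseq_cs_sketch *cs, uint64_t ip)
{
	if (cs && ip_in_rseq_cs(cs, ip))
		return cs->abort_ip;
	return ip;
}
```

A NULL descriptor stands for "no critical section active" here, in which case the IP is left untouched.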

rseq: fail dump if rseq is used but host doesn't support get_rseq_conf feature

Many kernel versions lack support for ptrace(PTRACE_GET_RSEQ_CONFIGURATION),
but the userspace may be recent (for instance, containers with a recent Fedora running
on a CentOS 7 host). Consider two scenarios on a kernel without
ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support:

1. there is a process that uses rseq => fail the dump
2. there is no process that uses rseq => we can dump without any problems

But how do we determine whether a process uses rseq without the get_rseq_conf feature?
Let's just try to do an rseq registration from the parasite. If rseq is already
registered, we'll get an EBUSY error. If not, the registration will succeed.
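
A hedged sketch of how the probe result could be interpreted. All names here (`classify_rseq_probe`, `can_dump`, `enum rseq_probe`) are hypothetical. Only the EBUSY case comes from the description above; treating ENOSYS as "no rseq syscall at all" and every other errno as inconclusive (and therefore a reason to fail the dump) is a conservative assumption of this sketch, since the kernel may report other errors depending on the arguments passed.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical result of a trial rseq() registration done from the parasite. */
enum rseq_probe {
	RSEQ_PROBE_NOT_USED,   /* registration succeeded: nothing was registered before */
	RSEQ_PROBE_IN_USE,     /* EBUSY: the task already has rseq registered */
	RSEQ_PROBE_NO_SYSCALL, /* ENOSYS: the kernel has no rseq() at all */
	RSEQ_PROBE_UNCLEAR     /* anything else: be conservative */
};

static enum rseq_probe classify_rseq_probe(long ret, int err)
{
	if (ret == 0)
		return RSEQ_PROBE_NOT_USED; /* the caller must undo the trial registration */
	if (err == EBUSY)
		return RSEQ_PROBE_IN_USE;
	if (err == ENOSYS)
		return RSEQ_PROBE_NO_SYSCALL;
	return RSEQ_PROBE_UNCLEAR;
}

/* The two scenarios: without get_rseq_conf, dump only if rseq is provably unused. */
static bool can_dump(bool has_get_rseq_conf, enum rseq_probe probe)
{
	if (has_get_rseq_conf)
		return true;
	return probe == RSEQ_PROBE_NOT_USED || probe == RSEQ_PROBE_NO_SYSCALL;
}
```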

rseq: initial support

Support basic rseq C/R scenario. Assume that:
- there are no processes with IP inside the rseq critical section (CS)
- kernel has ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support

On dump:
1. use ptrace(PTRACE_GET_RSEQ_CONFIGURATION) to get
struct rseq pointer, rseq size and signature from the kernel.
2. save to the image

On restore:
1. get rseq ptr, size, signature from the image
2. register it back using rseq() from the restorer parasite
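
The dump side of the steps above could look roughly like this. It is a sketch under assumptions: `struct rseq_conf_sketch` is assumed to mirror the kernel's struct ptrace_rseq_configuration, the fallback request number 0x420f is taken from the kernel uapi headers, and `struct rseq_image_sketch`, `get_rseq_conf` and `rseq_conf_to_image` are hypothetical names that do not reflect CRIU's real image format.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/ptrace.h>

#ifndef PTRACE_GET_RSEQ_CONFIGURATION
#define PTRACE_GET_RSEQ_CONFIGURATION 0x420f /* value from the kernel uapi headers */
#endif

/* Assumed to mirror the kernel's struct ptrace_rseq_configuration. */
struct rseq_conf_sketch {
	uint64_t rseq_abi_pointer;
	uint32_t rseq_abi_size;
	uint32_t signature;
	uint32_t flags;
	uint32_t pad;
};

/* Hypothetical image record: just the three values needed on restore. */
struct rseq_image_sketch {
	uint64_t rseq_abi_pointer;
	uint32_t rseq_abi_size;
	uint32_t signature;
};

/* pid must be a ptrace-stopped tracee; returns -1 with errno set on failure. */
static inline long get_rseq_conf(pid_t pid, struct rseq_conf_sketch *conf)
{
	return ptrace(PTRACE_GET_RSEQ_CONFIGURATION, pid, (void *)sizeof(*conf), conf);
}

static struct rseq_image_sketch rseq_conf_to_image(const struct rseq_conf_sketch *conf)
{
	struct rseq_image_sketch img = {
		.rseq_abi_pointer = conf->rseq_abi_pointer,
		.rseq_abi_size = conf->rseq_abi_size,
		.signature = conf->signature,
	};
	return img;
}
```

On restore, the same three values would be passed back to rseq() from the restorer.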

Done so far:

  • basic case support (just dump/restore task_struct rseq-related fields)
  • basic zdtm static test for amd64
  • support kernels where rseq() syscall is present but ptrace(PTRACE_GET_RSEQ_CONFIGURATION) is not.
  • fix cross-compile issues
  • prepare CI for fedora rawhide env + kernel >= 5.13
  • fix an issue when CRIU was compiled against new glibc (with rseq support)
  • fix COMPAT segfault
  • support "transitional" states
  • support for RSEQ_CS_* flags
  • write transition test

Fixes #1696

https://criu.org/Rseq

@mihalicyn mihalicyn force-pushed the rseq branch 2 times, most recently from 93192f5 to 4d7f1d1 Compare December 21, 2021 21:06
@adrianreber
Member

You can run tests in CI with a kernel >= 5.13 if you use our KVM based setup which we use for vdso=0 testing. If you start a rawhide container on the Fedora Vagrant VM you should be able to get latest glibc with a new enough kernel.

@mihalicyn mihalicyn force-pushed the rseq branch 3 times, most recently from d46609b to 7d23bde Compare December 22, 2021 14:24
@codecov-commenter

codecov-commenter commented Dec 22, 2021

Codecov Report

Merging #1706 (a381fb0) into criu-dev (7d7d25f) will decrease coverage by 0.05%.
The diff coverage is 65.85%.

❗ Current head a381fb0 differs from pull request most recent head da6a243. Consider uploading reports for the commit da6a243 to get more accurate results

@@             Coverage Diff              @@
##           criu-dev    #1706      +/-   ##
============================================
- Coverage     69.35%   69.30%   -0.06%     
============================================
  Files           128      128              
  Lines         32087    32213     +126     
============================================
+ Hits          22255    22326      +71     
- Misses         9832     9887      +55     
| Impacted Files | Coverage Δ |
|---|---|
| criu/arch/x86/include/asm/types.h | 100.00% <ø> (ø) |
| criu/include/parasite.h | 100.00% <ø> (ø) |
| criu/include/pstree.h | 100.00% <ø> (ø) |
| criu/include/rst_info.h | 100.00% <ø> (ø) |
| criu/include/util.h | 100.00% <ø> (ø) |
| criu/util.c | 60.32% <40.54%> (-0.78%) ⬇️ |
| criu/cr-check.c | 62.94% <60.00%> (+1.00%) ⬆️ |
| criu/cr-restore.c | 67.06% <65.78%> (-0.23%) ⬇️ |
| criu/cr-dump.c | 75.30% <72.13%> (-0.21%) ⬇️ |
| criu/pstree.c | 85.26% <86.66%> (+0.03%) ⬆️ |
| ... and 4 more | |

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d7d25f...da6a243. Read the comment docs.

@mihalicyn
Member Author

mihalicyn commented Dec 22, 2021

Hehe, great. We have CentOS 8 working. That's good news because it's the corner case where we have the rseq() syscall but lack ptrace(PTRACE_GET_RSEQ_CONFIGURATION).
Fedora novdso (with a fresh kernel) passes too! Fine.

@mihalicyn
Member Author

Yes, thanks for pointing it out. I've taken a look at scripts/ci/vagrant.sh, and it looks like for our needs we can just adjust the FEDORA_VERSION / FEDORA_BOX_VERSION variables. But unfortunately, there is no image for fedora-rawhide in the https://app.vagrantup.com/fedora repository. :(

@adrianreber
Member

Just run a rawhide container via Podman on the vagrant VM. You can basically re-use our rawhide on GitHub Actions setup in the VM and you should have everything you need.

@mihalicyn
Member Author

thanks, Adrian ;) Will try!

@mihalicyn mihalicyn force-pushed the rseq branch 14 times, most recently from 49b90a6 to c6d2ff1 Compare December 23, 2021 13:50
@mihalicyn
Member Author

mihalicyn commented Apr 13, 2022

I have a fix in #1738

great! Let's reopen that PR?

Add rseq syscall numbers for:
arm/aarch64, mips64, ppc64le, s390, x86_64/x86

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Add "get_rseq_conf" feature corresponding to the
ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Support basic rseq C/R scenario. Assume that:
- there are no processes with IP inside the rseq critical section (CS)
- kernel has ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support

On dump:
1. use ptrace(PTRACE_GET_RSEQ_CONFIGURATION) to get
struct rseq pointer, rseq size and signature from the kernel.
2. save to the image

On restore:
1. get rseq ptr, size, signature from the image
2. register it back using rseq() from the restorer parasite

Fixes: checkpoint-restore#1696

Reported-by: Radostin Stoyanov <radostin@redhat.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
@figiel

figiel commented Apr 14, 2022

@mihalicyn OK, I'll try to review this tomorrow.

…f feature

Many kernel versions lack support for ptrace(PTRACE_GET_RSEQ_CONFIGURATION),
but the userspace may be recent (for instance, containers with a recent Fedora running
on a CentOS 7 host). Consider two scenarios on a kernel without
ptrace(PTRACE_GET_RSEQ_CONFIGURATION) support:

1. there is a process that uses rseq => fail the dump
2. there is no process that uses rseq => we can dump without any problems

But how do we determine whether a process uses rseq without the get_rseq_conf feature?
Let's just try to do an rseq registration from the parasite. If rseq is already
registered, we'll get an EBUSY error. If not, the registration will succeed.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Here we just want to check that if rseq was registered before C/R
it remains registered after it.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Let's see how rseq() C/R feature works

This reverts commit d99def7.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
We have the ability to use nested virtualization on
Cirrus, and we already have the "Vagrant Fedora based test (no VDSO)"
test, so let's add an analogous one for Fedora Rawhide to get a fresh kernel.

Suggested-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Let's take the thread_pointer() implementation from Glibc.
It will be useful later because Glibc stores
struct rseq in the TLS. The absolute address can be calculated
as __criu_thread_pointer() + __rseq_offset.
__rseq_offset is a symbol exported by Glibc itself.

We need to be able to determine where struct
rseq is stored so that CRIU can unregister it during the restore
stage.

For a different libc such as musl-libc we will have to handle
rseq separately, depending on how struct rseq is stored.

Right now that's not a problem because musl-libc has no
rseq support, so we don't need to unregister it.

https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=8dbeb0561eeb876f557ac9eef5721912ec074ea5
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=cb976fba4c51ede7bf8cee5035888527c308dfbc

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
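
The address calculation from the commit message above, as a small sketch. `rseq_area_address` is a hypothetical helper; the thread pointer, offset and size are passed in as plain parameters so the arithmetic is visible, whereas in practice they would come from __criu_thread_pointer() and Glibc's exported __rseq_offset / __rseq_size. Treating a size of 0 as "Glibc did not register rseq" is an assumption of this sketch.

```c
#include <stddef.h>

/*
 * Absolute address of the libc-owned struct rseq for the current thread.
 * Returns NULL when rseq_size is 0 (assumed to mean: nothing registered).
 */
static void *rseq_area_address(void *thread_pointer, ptrdiff_t rseq_offset, unsigned int rseq_size)
{
	if (rseq_size == 0)
		return NULL;
	/* The offset is relative to the thread pointer and may be negative. */
	return (char *)thread_pointer + rseq_offset;
}
```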
@mihalicyn
Member Author

Dear friends,
zdtm/transition/rseq01 on Alpine Linux was fixed. Thanks to Mathieu @compudj for looking into this.
I shot myself in the foot by corrupting the stack with an unpaired fldl instruction. It's really strange that we got crashes only on musl-libc.

Andrei suggests a simpler and better approach: making the rseq cs execution time longer by using the pause instruction. I will do that in a later version, but let's keep the current version with FPU instructions in memory of the hours spent searching for the mistake. :D

Fresh glibc does rseq registration by default during start_thread().
[ see https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=95e114a0919d844d8fe07839cb6538b7f5ee920e ]

This causes process crashes during the memory restore procedure, because
the memory that holds struct rseq is unmapped and overwritten
in __export_restore_task.

Let's perform rseq unregistration just before unmap_old_vmas(). To achieve
that we first need to determine the (struct rseq) address while we are still in Glibc
(we do that in prep_libc_rseq_info using symbols exported by Glibc).

See also
("nptl: Add public rseq symbols and <sys/rseq.h>")
https://sourceware.org/git?p=glibc.git;a=commit;h=c901c3e764d7c7079f006b4e21e877d5036eb4f5
("nptl: Add <thread_pointer.h> for defining __thread_pointer")
https://sourceware.org/git?p=glibc.git;a=commit;h=8dbeb0561eeb876f557ac9eef5721912ec074ea5

TODO: do the same for musl-libc if it starts to register rseq by default

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Fresh Glibc registers rseq by default. We need to unregister
it before registering our own.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
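
The unregistration mentioned in the two commit messages above can be sketched like this. It is only an illustration: it goes through libc's syscall() wrapper, which the restorer itself cannot rely on, `rseq_unregister_sketch` is a hypothetical name, and `SKETCH_RSEQ_FLAG_UNREGISTER` is assumed to match the kernel's RSEQ_FLAG_UNREGISTER (bit 0). The address, length and signature must be the ones used at registration time.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed to match the kernel's RSEQ_FLAG_UNREGISTER. */
#define SKETCH_RSEQ_FLAG_UNREGISTER (1 << 0)

/* Returns 0 on success, -1 with errno set otherwise. */
static long rseq_unregister_sketch(void *rseq_area, uint32_t len, uint32_t sig)
{
#ifdef __NR_rseq
	return syscall(__NR_rseq, rseq_area, len, SKETCH_RSEQ_FLAG_UNREGISTER, sig);
#else
	/* Headers too old to know the syscall number. */
	(void)rseq_area;
	(void)len;
	(void)sig;
	errno = ENOSYS;
	return -1;
#endif
}
```

Only after this call succeeds is it safe to unmap the old memory that held struct rseq and register the area recorded in the image.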
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
If we catch a process while it is inside an rseq
critical section, we have to handle it properly.

From the kernel's point of view, if a process
is executing inside an rseq cs and gets a signal,
the rseq critical section is interrupted
and, after the signal handler has run, execution proceeds
to the rseq cs abort handler instead of continuing the normal
rseq cs execution (unless RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
is set).

When CRIU seizes a process, that is the same as
receiving a signal from the rseq point of view. So we need
to fix up the instruction pointer to the rseq cs abort handler
address.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
This reverts commit f008f74.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Userspace may configure rseq cs abort policy by
setting RSEQ_CS_FLAG_NO_RESTART_ON_* flags.

In ("cr-dump: fixup thread IP when inside rseq cs") we added support for
the case where CRIU catches a process during rseq cs execution, by
fixing up the IP to abort_ip. That's the common case, but there is a special flag
called RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL; when it is set we have to leave
the process IP as it was before CRIU seized it. Unfortunately, that's not
all we need here. We must also preserve the (struct rseq)->rseq_cs field.

You may ask: "Why do we need to preserve it by hand? CRIU dumps
all process memory and restores it." That's true, but it's not that easy. The problem
is that the kernel clears this field when it notices that
the process has left the rseq cs. During the dump/restore procedures we
execute the parasite/restorer in the process context. It means that the process
leaves the rseq cs in any case and (struct rseq)->rseq_cs is cleared
by the kernel. So we need to restore this field by hand at the *last* stage
of restore, just before releasing the processes.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Strangely, the rseq02 test fails with:
09:06:16.222:    51: exit 555f52082120 555f52082120
09:06:16.282:    51: exit 555f52082120 555f52082120
09:06:16.340:    51: exit 555f52082120 555f52082120
09:06:16.397:    51: exit 555f52082120 555f52082120
09:06:16.503:    51: exit 0 555f52082120
09:06:16.503:    51: FAIL: rseq02.c:235: Failed to increment per-cpu counter (errno = 2 (No such file or directory))
09:06:16.503:    51: FAIL: rseq02.c:246:  (errno = 16 (Device or resource busy))

It means that the rseq_cs pointer was cleared by the kernel despite the
NO_RESTART* flags. This is hard to reproduce, and I will investigate it.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
We have a separate target for alpine in scripts/ci/Makefile
which defines some extra opts for zdtm using the ZDTM_OPTIONS
variable. But it doesn't actually work. First of all, the variable
should be named ZDTM_OPTS, and we also have to specify
it directly on the CONTAINER_RUNTIME command line to make it work.

I've also changed the variable value to make it consistent
with the docker.env value, which is what was really used.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
It looks like we get broken fhandles from fdinfo
for inotify/fanotify on btrfs. I will look into that.

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
mountinfo contains more info than just "mount" output

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
@mihalicyn
Member Author

JFYI: I'm working on the article https://criu.org/Rseq

Labels

no-auto-close Don't auto-close as a stale issue

Development

Successfully merging this pull request may close these issues.

Implement rseq support, as required by glibc 2.35

9 participants