Skip to content

udev: fix race condition in symlink generation code#9551

Closed
mwilck wants to merge 42 commits intosystemd:masterfrom
mwilck:ups/udev-fix-symlink-race
Closed

udev: fix race condition in symlink generation code#9551
mwilck wants to merge 42 commits intosystemd:masterfrom
mwilck:ups/udev-fix-symlink-race

Conversation

@mwilck
Copy link
Copy Markdown
Contributor

@mwilck mwilck commented Jul 9, 2018

This request replaces the previous pull requests #8667 and #8819. These requests belong together, as #8819 introduced currently failing tests that were fixed by #8667. Especially in #8667, several updates from my side may confuse readers. Therefore I decided to create a new pull request.

With lots of contenders for a certain symlink, the current udev code suffers from two problems:

  1. the symlink creation is racy if udev events from several contenders are processed simultaneously. This is a real world problem that has been reported to SUSE by customers. The patch set extends the udev test suite by new test cases that reproduce this failure reliably, in particular on multicore systems (my 4-core laptop suffices).
  2. the algorithm for selecting the best symlink is highly inefficient for many contenders, because all potential symlink targets are represented by files flatly stored under /run/udev/links/$LINK. To obtain the highest-prio link, udev must read all entries and retrieve their priority from the udev db, which is again inherently racy. This inefficiency is strongly aggravated if locking is used to solve the first problem.

Another issue which I faced when looking for a solution for this problem was the bad scaling of fcntl() based file locking in current kernels. Neil Brown has developed a fix for this, but it will take a lot of time until that's broadly available. Therefore I am using a different approach for locking based on a set of SysV semaphores, which doesn't have kernel-side scaling issues.

In the discussion for #8667, @poettering had asked for a solution without locking ("simple rescan"). After having conducted lots of experiment and research in alternative solutions I can assure that no such solution exists. Like in other, similar cases, locking the shared data structure is the only safe way to avoid the race.

Solving both 1.) and 2.) above requires both locking and changing the internal data structures udev uses for managing symlink targets. With the method presented here, the race is avoided, and yet the code scales better than the original in my tests.

The patch set also fixes some minor issues I encountered while working on the udev test suite.

Copy link
Copy Markdown
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an impressive patch series. For the delay in review. The perl part looks great.

I think that they python part has some issues — maybe it can be split out to a separate PR, to get the rest merged.

I made some comments below about the C part.

@keszybz keszybz added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Aug 22, 2018
@yuwata
Copy link
Copy Markdown
Member

yuwata commented Sep 20, 2018

Needs to be rebased. Could you update this?

@mwilck mwilck force-pushed the ups/udev-fix-symlink-race branch from 4cb19d4 to 46cbe12 Compare January 4, 2019 21:11
@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Jan 4, 2019

@keszybz, @yuwata,

thanks a lot for the review. And big apologies that it took me so long to get back to this issue.
I refreshed the pull request now. It's rebased, most style issues should be fixed, and I implemented
more benign failure mode if semaphores aren't supported on certain systems.
Testing showed no regressions from the previous submission. I also did valgrind tests to make sure I got the cleanup stuff right. It's looking good so far.

Again, sorry for the long delay, and please give this another shot.

@keszybz keszybz added good-to-merge/after-next-release and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Jan 9, 2019
mwilck added 18 commits January 9, 2019 14:04
Allow testing cases where multiple devices are added and removed.
This implies a change of the data structure: every test allows
for multiple devices to be added, and "exp_name" etc. are now properties
of the device, not of the test.
It's not necessary to write the rules for every udev run, as we
now may have many (rather than just 2) per test.
Allow testing cases where multiple devices are added and removed
simultaneously. Tests are started as synchronously as possible using a
semaphore, in order to test possible race conditions. If this isn't desired,
the test parameter "sleep_us" can be set to the number of microseconds to wait
between udev invocations.
More often than not, the created devnode is the basename of the
sysfs entry. The "devnode" device may be used to override the
auto-detected node name.

Permissions and major/minor number are now verified on the devnode
itself, not on symlinks.

For those tests where exp_name is set to the computed devnode name,
the explicit "exp_name" can be removed. "exp_name" is only required for
symlinks.

This allows separate testing for devnodes and symlinks an a follow-up
patch.
Test if symlinks are created correctly by comparing the symlink
targets to the devnode path. This implies (for the symlink) that
major/minor numbers and permissions are correct, as we have tested
that on the devnode already.
Instead of testing the existence or non-exisitence of just a single
symlink, allow testing of several links per device.

Change the test definitions accordingly.
udev hasn't supported renaming device nodes for some time.
the "last_rule" option hasn't been supported for some time.
Therefore this test fails if a "not_exp_links" attribute is added,
as it should be. Mark it appropriately.
Add some rules that make it a bit harder to pass, mainly the
non-existence checks.
These rules have survived from an ancient version of the code
and save no purpose any more.
As we can check multiple links in a single test now, these 3
tests can be merged into one.
As we can test multiple devices and multiple links per device
in one test now, these two tests can be merged into one.
This is helpful to catch possible regressions in the test.
Also, don't count wait() errors, they are likely not udev errors.
Add 4 new tests using multiple devices. Number 2-4 use many
devices claiming the same symlink, where only one device has
a higher priority thatn the others. They fail sporadically with
the current code, if a race condition causes the symlink to point
to the wrong device. Test 4 is like test 2 with sleeps in between,
it's much less likely to fail.
for easier reproduction of sporadic test failures.
Manually listing all devices in the test definition becomes cumbersome with
lots of devices. Add a function that scans on all block devices in
the test sysfs and generates a list of devices to test.
Script for generating LOTS of SCSI disk and partition devices in
the fake sysfs we use for udev testing.

This script is supposed to be run after sys-script.py. It uses
code from sys-script.py as template to generate additional SCSI disk data
structures. Together with the "generator" code in udev-test.pl
added in the previous patch, it allows to run udev tests with almost
arbitrarily many devices, and thus to do performance scaling tests.
node_symlink() itself creates the directory structure, no need
to do it twice.
Increase the timeout of the udev test, which has become more
time-consuming because of the multi-device tests added in previous
patches.
@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Jan 14, 2019

@yuwata, thanks again for the thorough review.

This latest push should fix the issues you raised. I also increased the timeout value to 90s, let's see if it's enough.

FTR: systemd is currently using "timeout-mutiplier=3" for the ASAN + UBSAN test, but looking at the actual times required (also in "good" case), a higher multiplier seems to be in order. The ASAN+UBSAN test make take ~40s where the "normal" test completes in less than a second. I'd vote for a timeout multiplier of 10 or even 100. At first glance this seems to apply to gcc + ASAN only, not clang. But I can't claim that I've looket at this in depth. Probably up to other people to analyze.

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Jan 14, 2019

The Travis CI test has completed successfully now after the timeout had been increased. Actual runtime in the "ASAN + UBSAN" test was 107.5s. Test duration without ASAN+UBSAN was 24.6s.

@poettering
Copy link
Copy Markdown
Member

poettering commented Jan 16, 2019

I am not convinced we really need locking here. What about this algorithm:

  1. read the entry, i.e. where it currently points with readlink() (will result in ENOENT if symlink was missing so far)
  2. iterate through the list, figure out what symlink will win now
  3. if the same symlink would win as determined in 1, shortcut things, all is good, exit
  4. otherwise, if in step 1 the old symlink could be read: create the new symlink under a temporary name, then use renameat2() with RENAME_EXCHANGE. Then validate that we just moved away matches our expectations from step 1. If not, goto 1. If this worked, all is goot, exit.
  5. otherwise if in step 2 the old symlink could not be read due to ENOENT: create the new symlink under a temporary name, then use renameat2() with RENAME_NOREPLACE. This will fail if the symlink already exists now, in which case goto 1. If this worked, all is good, exit.

Wouldn't that work, too? this approach basically use RENAME_EXCHANGE and RENAME_NOREPLACE as a method to determine whether the symlink previously in place still matters our expectaions from the beginning of the algorithm. And if it doesn't we just restart things.

All without SysV IPC and everything, with only file system primitives.

@poettering
Copy link
Copy Markdown
Member

(And to say this clearly, I don't think SysV IPC should ever be an option. There are so many better options, BSD file locks or even robust pthread mutexes in a shared mmap buffer, but SysV IPC is really not a good idea. Best of course is not to serialize here at all though, as I proposed above.)

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Jan 16, 2019

I am not convinced we really need locking here. What about this algorithm:
[...]
Wouldn't that work, too? this approach basically use RENAME_EXCHANGE and RENAME_NOREPLACE as a method to determine whether the symlink previously in place still matters our expectaions from the beginning of the algorithm. And if it doesn't we just restart things.

You made a similar suggestion on #8667. I replied in #8667 (comment) and #8667 (comment). The racy part is not the creation of the symlink itself. It's the determination of the best priority. Also, as long as the best priority is determined by calling udev_device_get_devlink_priority() for each contender, there's a race between link_update and updating the udev db file, too.

This is not only a theoretical argument. As pointed out in #8667 (comment), I experimented with various approaches after you suggestion in #8667 (comment), and while the algorithm I came up with may not have been exactly the same as what you suggested here, I can clearly say that nothing except locking avoided the race - which totally makes sense from my point of view. Unfortunately you never found the time to reply to my arguments there. We could have saved time if you did. But never mind. If you insist, I can try what you just suggested instead of my locking approach, but I would be very surprised if it worked.

(And to say this clearly, I don't think SysV IPC should ever be an option. There are so many better options, BSD file locks or even robust pthread mutexes in a shared mmap buffer, but SysV IPC is really not a good idea.

Would you mind explaining why you're against SysV IPC? File locks as currently available in systemd don't work because the kernel implementation scales very badly. I've provided details about that in #8667 (comment) (kernel patches are under way to improve that but it'll take time until they're in widespread use). I haven't tried pthread mutexes in shared mmap, but I wonder why you consider them "better" than SysV semaphores. The SysV implementation in the kernel is rock solid and very efficient. Also, the concept of semaphore sets is very useful to avoid using a single lock for every symlink without resorting to inefficient file locking.

Copy link
Copy Markdown
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's a beginning of a review, but quite frankly, sysv ipc is a no-go, this needs to be solved differently

return log_error_errno(-errno, "Failed to initialize semaphores: %m");

/* Dummy semop to set sem_otime */
if (semop(semid, dummy_op, (sizeof(dummy_op)/sizeof(*dummy_op))) == -1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, sysv IPC is just plain awful, let's not even start with this. They have no sane clean-up story, and just weird all over the place. I am sure in 2019 sysv IP is a total nogo for new code.

(Oh, and please don't reinvent ELEMENTSOF()) here...

#include "udev-node.h"
#include "siphash24.h"
#include "hash-funcs.h"
#include "udev.h"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nor sorted alphabetically

* The default maximum semaphore set size under Linux (SEMMSL) is 32000.
*/
#define N_SEMAPHORES 1024
static const char links_dirname[] = "/run/udev/links/";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why define a static array for this? sounds unnecessary, just use the string directly each time? it's more readable that way I am sure

for (i = 0; i < N_SEMAPHORES; i++)
val[i] = 1;
if (semctl(semid, 0, SETALL, val) == -1)
return log_error_errno(-errno, "Failed to initialize semaphores: %m");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to negate errno here, log_error_errno() dies that implicitly

semid = semget(key, N_SEMAPHORES, 0600|IPC_CREAT|IPC_EXCL);
if (semid != -1) {
unsigned short val[N_SEMAPHORES];
int i;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you use this to iterate through an array, should be size_t hence

return log_error_errno(-errno, "Failed to set sem_otime: %m");
return semid;
} else {
const unsigned int RETRIES = 10, SLEEP_US = 10000;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we generally call "unsigned int" just "unsigned".

Also, so far we have avoid using "const" on anything that is not a pointer, since it gives useless guarantees here... maybe make this a macro?

return semid;
} else {
const unsigned int RETRIES = 10, SLEEP_US = 10000;
unsigned int i;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unsigned int → unsigned, plz


semid = semget(key, N_SEMAPHORES, 0);
if (semid == -1)
return log_error_errno(-errno, "Failed to get semaphore set: %m");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

negation of errno is not necessary

ds.sem_otime != 0)
return semid;
usleep(SLEEP_US);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

busy looping? urks!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's not beautiful, but there's no event we can listen for. This loop is only executed if there's a race between two processes trying to initialize the semaphore set at the same time, that is, at most once during a system's uptime. AFAIK, this is how SysV semaphore initalization is supposed to be done. (FTR, as long as usleep() is called, I wouldn't call it busy-looping).

I guess what I could do instead is use some other locking mechanism, e.g. a file lock, just for the purpose of running semget(). That way we'd ensure that the semaphore set would either not exist or be fully initialized when semget() is installed. From what I saw in earlier tests though, even that file lock, which must be taken once by every udev worker, might cause a scaling problem.

But it seems that you'd reject any SysV IPC-based approach anyway, so does it make sense for me to bother?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, sysvipc is not the way to go, sorry. the usleep() thing is just one issue among many with sysvipc...

But again, I don't see why the renameat() loop proposal I gave wouldn't work. What's essential is that the determination which symlink to use is enclosed with the readlinkat() before and the renameat() after and repeated until the result of the latter matches what we expect based on the results of the former. can you provide me with a runthrough ordering where this wouldn't work correctly?

}

static unsigned short get_sema_index(const char *link) {
static const unsigned char seed[16] = { 0x6b, 0xb0, 0xb1, 0x28,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why "unsignec char"? siphash24_init expects an uint8_t[] array there... which is of course the same, but let's not create needless confusion

@poettering
Copy link
Copy Markdown
Member

You made a similar suggestion on #8667. I replied in #8667 (comment) and #8667 (comment). The racy part is not the creation of the symlink itself. It's the determination of the best priority. Also, as long as the best priority is determined by calling udev_device_get_devlink_priority() for each contender, there's a race between link_update and updating the udev db file, too.

Well, i proposed ding any determination of which links wins between the initial readlink and the final update of the symlink. This means when multiple clients run through this concurrently, they will automatically retry if during the determination of the winning symlink the preconditions change or the result doesn't match with what the other contenders come up with.

@poettering
Copy link
Copy Markdown
Member

Would you mind explaining why you're against SysV IPC? File locks as currently available in systemd don't work because the kernel implementation scales very badly.

sysv ipc are sticky objects outside of any namespace binding in a global world-writable namespace. This means, they are easily DoS-able, and their clean-up doesn't really work, they don't get cleaned up implicitly if the object they belong to goes away... And let's not even start with the busy looping around usleep()! In 2019?

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Jan 16, 2019

sysv ipc are sticky objects outside of any namespace binding in a global world-writable namespace.

AFAIK there's CLONE_NEWIPC, so IPC objects support namespaces just like anything. It's true that someone else could access my semaphore set in the same container and lock the semaphores, causing udev to get stalled. But the permissions are set such that only root could do that. If someone has root priviliges, I'd guess he'd have better means to DoS the system then blocking these semaphores. In general I fail to see what kind DoS attacks would be possible here that couldn't be done just as easily with file locks or mutexes.

This means, they are easily DoS-able, and their clean-up doesn't really work, they don't get cleaned up implicitly if the object they belong to goes away...

You're right about the cleanup. But the same would apply to any locking mechanism we'd use for udev. The locks must be shared between all udev processes, so at what point in time could we reasonably assume that the resources can be freed? IMHO, no sooner than when we'd be sure that no more udev workers could be spawned, which would be almost immediately before the final system shutdown.

And let's not even start with the busy looping around usleep()! In 2019?

If it really hurts you that much... I'd implement a different initialization sequence if I could conceive of one, and if I'd see a chance to get it merged.

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Jan 16, 2019

Suggestion: Maybe you can merge the udev testsuite part of the patch set. After that, we could then give @poettering's idea another try. If we can get it to work, fine. If not, we'll have to reiterate the locking discussion.

@poettering
Copy link
Copy Markdown
Member

poettering commented Jan 16, 2019

AFAIK there's CLONE_NEWIPC, so IPC objects support namespaces just like anything. I

Nah, that's not what I meant, I wasn't referring to Linux IPC namespaces, but simply to the space of names that a sysv IPC object can have: it's a single, system-wide one, and if I am unprivileged I can still take any object I like, and then it's mine, and everybody else has a problem.

This is different from let's say AF_UNIX sockets in the file system: they have an hierarchial namespace where through UNIX access modes on some dir I can control exactly who can create sockets below it and who cannot. Thus, there's no DoS by unpriv processes possible.

@poettering
Copy link
Copy Markdown
Member

Suggestion: Maybe you can merge the udev testsuite part of the patch set. After that, we could then give @poettering's idea another try. If we can get it to work, fine. If not, we'll have to reiterate the locking discussion.

Sure, can you submit that as a separate PR, please? of course, any test that checks for this race should not be enabled at first, so that the CI doesn't suddenly fail all the time, as long as the race isn#t fixed.

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Jan 16, 2019

Suggestion: Maybe you can merge the udev testsuite part of the patch set.

Sure, can you submit that as a separate PR, please?

There was #8819, which was just that and which had been superseded by this one. It needs to be updated of course.

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Jan 16, 2019

Nah, that's not what I meant, I wasn't referring to Linux IPC namespaces, but simply to the space of names that a sysv IPC object can have: it's a single, system-wide one, and if I am unprivileged I can still take any object I like, and then it's mine, and everybody else has a problem.

OK, got it. But you have to admit that in practice, that's not a problem for a semaphore set that udev would grab while still running in the initrd.

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Jan 16, 2019

Well, i proposed ding any determination of which links wins between the initial readlink and the final update of the symlink.

A necessary condition for this to work is that we change the udev code to create the db file first, then the symlink. Not sure if it's sufficient though.

@systemd systemd deleted a comment Feb 12, 2020
@systemd systemd deleted a comment Feb 20, 2020
@msekletar
Copy link
Copy Markdown
Contributor

@mwilck do you still plan to work on this? I think reservations that @poettering had regarding use of sysV APIs (semaphores) could be fixed.

My idea is that main daemon could allocate memory using memfd_create() + ftruncate(), next it would map memfd using mmap() and initialze unnamed POSIX semaphore in that memory region. Then, dup of memfd could be passed to worker (e.g. via AF_UNIX socket pair) and worker could then map memfd memory segment and would hence get access to unnamed semaphore. With this model we wouldn't have any externally visible artifacts and hence nothing to clean-up.

@poettering
Copy link
Copy Markdown
Member

if you really need to use locking, then please use BSD file locks (i.e. flock()). they are the only ones that have somewhat sensible semantics. take on some suitable file or dir that isn't accessible by unpriv users.

@poettering
Copy link
Copy Markdown
Member

but again, i still don't see why locking is required here. i.e. update the db, then update the symlinks to match the db. do so with RENAME_EXCHANGE so that you can then validate if you succeeded without anyone interfering by looking at the symlink you replace after you replaced it. If you noticed that things changed you just start the algorithm anew until you complete it without anyone interfering.

i.e. the symlink updating logic should be idempotent: it should always synthesize the highest priority symlink based on the current state, and if the state changes during this determination this can be detected through the RENAME_EXCHANGE trick because we know what we replaced. the RENAME_EXCHANGE thing is our "cas" if you so will.

to illustrate here the supposed algorithm in pseudo code:

  1. update_the_db(device, …)
  2. stat() on the symblink db dir
  3. foreach device symlink:
    a. open(O_PATH|O_NOFOLLOW) the device symlink we are looking at, then readlinkat() it, (with "" as filename arg), and fstat() it (to get inode nr), close it
    b. determine_highest_prio_symlink_from_db()
    c. if resulting highest prio symlink matches the symlink we found in a, goto 4
    d. create resulting symlink with renameat2() with atomic exchange
    fe look at the old symlink that you just renamed away, with lstat(), does it match the fstat() data from step a? if so, good, goto 4. if not, goto a, try again.
  4. stat() on the symlink db dir again. if mtime matches, great we are done, exit. if not goto 2

The outer loop makes sure that what we do operates on a congruent snapshot of the symlinks dir: we keep repeating things until the state of the db before and after we did things didn't change. The inner loop adjusts the device node symlinks to match the db.

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Oct 26, 2020

Thanks for looking into this again; good to know that there's still interest. I wanted to look into Lennart's previous suggestions, but unfortunately I've had no time to work on it. I'd need to rebase and try to reproduce the issue before doing anything else.

The scaling issues with fcntl-based locking that I had encountered should be strongly mitigated by Neil Brown's patch series in kernel v5.0 (7bbd1fc0e9f1 and parents).

@mwilck
Copy link
Copy Markdown
Contributor Author

mwilck commented Oct 26, 2020

A loosely related question that keeps occurring to me: I wonder if the udev "data base" couldn't be implemented more efficiently in some other way, like using an actual data base code such as SQLite. Sorry if that's a totally dumb idea. It's obviously out of scope for this issue.

@poettering
Copy link
Copy Markdown
Member

I guess we can close this now that #17431 is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks tests udev

Development

Successfully merging this pull request may close these issues.

5 participants