udev: lock symlink directories in link_update()#8667
mwilck wants to merge 3 commits into systemd:master
Conversation
|
not sure i follow. it shouldn't matter in which order the stuff is applied, no? i mean, if they have different link prios then the one with the higher one should always win... |
|
@poettering: I thought so too. But (in particular during coldplug) the events can occur simultaneously, and then the following may happen: The end result is that the symlink points (wrongly) to the device with the lower priority (path device). Here "symlink dir" refers to the symlink db directory under This is particularly racy because |
|
hmm, but wouldn't it be better to simply rescan the dir if symlink() suggests that there's a symlink after all? |
|
That won't work. The problem is that while the symlink handling under /dev/ is atomic, the determination of the "best target" is not. |
|
@poettering, meanwhile partner testing has confirmed that this patch fixes a real-world bug. Please hint how we can make progress. As indicated, I don't think that your suggestion above works, but of course I'm open to other ideas. So far nothing short of a lock has occurred to me that would avoid the race. Are you worried about deadlocks or slow-down? I've been pondering that too, but as this race is relatively rare, I'd expect that contention on these locks would be rare, too. There are some pathological symlinks that are claimed by many devices ("/dev/disk/by-partlabel/primary" is the prime example), but those aside, even in big multipath setups, there usually won't be more than 10 contenders for a given symlink. |
|
The changes proposed by @mwilck seem the simplest way to fix this race to me. @poettering, any chance you can clarify what's wrong with this approach? |
|
@poettering, any updates? I can perceive a problem myself that should be discussed. In certain cases, there can be a lot of contenders for a certain symlink. On older SUSE distributions at least, due to a historic shortcoming of parted, all GPT partitions receive the partlabel "primary" by default. This can lead to thousands of contenders for the label Real, meaningful device identifiers normally don't have so many contenders. I can imagine a multipath setup with 16 paths per device and a multipath device, adding up to 17. And anyway, a large number of contenders might also increase the probability of a race. |
|
I added extensions to the udev test suite in #8819. With the new tests I added, which allow testing simultaneous addition and removal of devices claiming the same symlink, the problem described here can be reproduced even on my laptop. |
When uevents are processed in parallel, in particular during coldplug, it can happen that several events for the same symlink arrive at the same time (e.g. a SCSI disk and a multipath DM device for that disk). This opens up a race window where a higher-priority link target may not be visible under /run/udev/links/ yet when a lower-priority one is processed, leading to the creation of the wrong symlink. Avoid this by locking the directory under /run/udev/links/.
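The locking idea from the commit message can be sketched as follows. This is an illustrative Python sketch, not udev's C code; the function names are mine, and it uses flock() for brevity (the patch discussion below mentions fcntl()-based locks). The point is that the lock is taken on the per-symlink directory itself, so only one worker at a time scans it and updates the link:

```python
import fcntl
import os

def lock_symlink_dir(links_dir, encoded_name):
    """Take an exclusive lock on <links_dir>/<encoded_name> so that only
    one worker at a time scans the directory and updates the symlink.
    The directory is created if missing, like udev does on 'add'."""
    path = os.path.join(links_dir, encoded_name)
    os.makedirs(path, exist_ok=True)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other worker holds it
    return fd                       # the lock lives as long as this fd

def unlock_symlink_dir(fd):
    """Release the lock and close the directory fd."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A second open of the same directory cannot take the lock while the first holds it, which is exactly the serialization link_update() needs.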
The patch "udev: lock symlink directories in link_update()" adds locking
in the link_update() code path which figures out the highest-prio target for
any symlink claimed by a current uevent. While this fixes the incorrectness
of the current code and avoids races, it may also cause a slowdown of udev
operation due to lock contention; in particular, for symlinks which are claimed
by a lot of different devices.
Typical real-world examples would be multipath setups with 16 paths per device
and a dm-multipath device with higher link priority. According to my research
up to this point, the code with locking handles this amount of devices quite
well. But there are also pathological cases, where a symlink is claimed by
hundreds or thousands of devices in a system. Although this indicates a system
misconfiguration, udev should be able to handle it without grinding to a halt
due to lock contention. An example for such symlinks is
"/dev/disk/by-partlabel/primary", which is claimed by every partition in a GPT
partition table on certain distributions, due to some weirdness of past
versions of the parted program.
Currently udev handles symlink targets using files that look like this:
/run/udev/links/${encoded_symlink_name}/${dev_id}
where dev_id is b$major:$minor or c$major:$minor.
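For clarity, the dev_id naming rule can be written out as a tiny helper (a sketch; the function name and signature are mine, not udev's):

```python
def dev_id(is_block: bool, major: int, minor: int) -> str:
    """Build udev's dev_id string for a device node:
    'b<major>:<minor>' for block devices, 'c<major>:<minor>' for
    character devices."""
    return ("b" if is_block else "c") + f"{major}:{minor}"
```

So the whole-disk SCSI device 8:0 is stored as "b8:0", and a device-mapper device 253:0 as "b253:0".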
When an event is processed, udev adds or removes these files according to event
type and scans the whole directory /run/udev/links/${encoded_symlink_name} to
find the entry with the highest priority. In the process, it needs to get a
udev device reference for every entry and derive the priority via libudev.
If there are many contenders for the symlink at hand, this procedure is
inefficient.
This patch changes the data structure as follows:
/run/udev/links/${encoded_symlink_name}/L:$prio/${dev_id}
Thus entries are grouped by priority, and finding the highest-prio entry is
much quicker because we hardly ever have more than 2 or 3 prio levels for any
given symlink. In combination with the locking patch, the time during which the
lock is held can be drastically reduced.
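With the grouped layout, the scan only has to compare the priority subdirectories. Another illustrative sketch under the same assumptions as above:

```python
import os

def find_highest_prio_grouped(symlink_dir):
    """Scan a grouped /run/udev/links/<name>/ directory: entries live in
    L:<prio> subdirectories, so the winning priority is found by comparing
    the few priority levels, with no per-device db lookup."""
    best_prio = None
    for entry in os.listdir(symlink_dir):
        if not entry.startswith("L:"):
            continue
        prio = int(entry[2:])               # handles negative prios too
        if best_prio is None or prio > best_prio:
            best_prio = prio
    if best_prio is None:
        return None
    # any device in the highest-prio group is an acceptable link owner
    contenders = os.listdir(os.path.join(symlink_dir, f"L:{best_prio}"))
    return contenders[0] if contenders else None
```

Even with thousands of contenders, the top-level listing stays at the handful of distinct priority levels.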
I verified that this patch doesn't cause any test suite regressions.
If udevd is restarted after applying the changes for /run/udev/links from "udev: change structure in /run/udev/links to improve efficiency", it may find the old data structures left over from a previous version. Deal with this situation by detecting it and moving the target files to the new places.
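The migration step amounts to moving each old flat entry into its L:<prio> subdirectory. A hedged sketch of that one-time conversion (names and the priority-lookup callback are illustrative, not the patch's actual code):

```python
import os

def migrate_symlink_dir(symlink_dir, get_link_priority):
    """Move old flat entries /run/udev/links/<name>/<dev_id> into the new
    per-priority layout /run/udev/links/<name>/L:<prio>/<dev_id>."""
    for entry in os.listdir(symlink_dir):
        old_path = os.path.join(symlink_dir, entry)
        # skip already-migrated L:<prio> subdirectories
        if entry.startswith("L:") or not os.path.isfile(old_path):
            continue
        prio_dir = os.path.join(symlink_dir, f"L:{get_link_priority(entry)}")
        os.makedirs(prio_dir, exist_ok=True)
        os.rename(old_path, os.path.join(prio_dir, entry))
```

Since rename() within /run is atomic, a half-finished migration never leaves an entry invisible to both the old and the new scan.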
Force-pushed from f8d8898 to 6614f8f.
|
Here is a major update to this pull request. With the help of the udev-test code I submitted in #8819, I verified the following:
I don't see an alternative to locking to achieve correctness. Any re-scanning as suggested by @poettering, without locking, would be susceptible to the same race, even if it might be triggered under slightly different timing conditions. Therefore I came up with a different approach that drastically reduces the time spent holding the lock, and the results are promising. Results for a "worst case" scenario (simultaneous events for lots of devices that all claim the same link and nothing else) are summarized in symlink-lock.pdf, where "old" refers to the current upstream, "lock" to the "lock symlink directories" patch, and "new" to the code with the patch "change structure in /run/udev/links" on top of it. (The "migration code" patch has no influence on the test results). My new approach changes data structures used by udev under |
|
At the risk of confusing everybody, I have tried another locking approach to avoid the large CPU consumption of the "lock symlink directories" patch for lots of contenders. The code is in mwilck@da9a945; I'm not pushing it here right now. The background is that locking using fcntl() is very CPU-intensive - instead of sleeping, waiters end up busy-waiting in spin_lock() in the kernel. I tried this on 4.12 and 4.15 kernels with very similar results. I'm now using SysV semaphores for locking instead. Locks are not per-symlink-file but per-hash-of-symlink-file. Details are in the patch. I've tested this approach both A) with only the "lock symlink directories" patch (5ac8373), and B) with the whole patch set including "change structure in /run/udev/links" (39ff14e). In both cases it improves performance and reduces CPU time. Details are in the updated symlink-lock.pdf. So, locking with semaphores is actually a big improvement, scaling-wise, over locking with fcntl(). But 39ff14e is necessary; without it, the scaling is improved by using semaphores, but it's still not good enough for the pathological case where 100 or more targets contend for the same symlink. |
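The per-hash-of-symlink locking mentioned above boils down to mapping each symlink name onto a fixed, small set of semaphores. This sketch shows only that slot mapping; the actual patch (mwilck@da9a945) uses a SysV semaphore set and semop(), and the slot count and hash function here are my illustrative assumptions, not the patch's values:

```python
import zlib

NSLOTS = 64  # assumed size of the semaphore set

def lock_slot(encoded_symlink_name: str) -> int:
    """Map a symlink name to one of NSLOTS semaphore slots. Unrelated
    symlinks may share a slot (harmless false contention), but the set
    stays small and waiters sleep in the kernel instead of busy-waiting
    the way contended fcntl() locks were observed to."""
    return zlib.crc32(encoded_symlink_name.encode()) % NSLOTS
```

The trade-off is bounded kernel state (one fixed semaphore set) against occasional contention between symlinks that happen to hash to the same slot.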
|
@poettering, I delved a bit into your idea to rescan the directory, trying to use a changed directory mtime as the indicator for "I need to rescan". I didn't succeed, but I gained some more insight. The current code compares priorities with udev_device_get_devlink_priority(). That requires that the device's db entry has already been written and contains an up-to-date priority value. But if many add events occur simultaneously, there's no guarantee for that. Thus, the actual race is not only between processes for different devices running link_update() for the same symlink, but also between link_update() and udev_device_update_db() calls. In udev_event_execute_rules(), udev_device_update_db() is called after udev_node_add->link_update(), thus the race window is actually larger than I originally thought. Therefore my "lock symlink directories in link_update()" patch is not sufficient to fully avoid the race, although it seemed to be good enough in my testing. This is another argument for my "change structure in /run/udev/links to improve efficiency" patch. Please have another look. |
|
Superseded by #9551. |