udev: lock symlink directories in link_update()#8667
mwilck wants to merge 3 commits into systemd:master
Conversation
|
not sure i follow. it shouldn't matter in which order the stuff is applied, no? i mean, if they have different link prios then the one with the higher one should always win... |
|
@poettering: I thought so too. But (in particular during coldplug) the events can occur simultaneously, and then the following may happen: The end result is that the symlink points (wrongly) to the device with the lower priority (path device). Here "symlink dir" refers to the symlink db directory under This is particularly racy because |
|
hmm, but wouldn't it be better to simply rescan the dir if symlink() suggests that there's a symlink after all? |
|
That won't work. The problem is that while the symlink handling under /dev/ is atomic, the determination of the "best target" is not. |
|
@poettering, meanwhile partner testing has confirmed that this patch fixes a real-world bug. Please hint how we can make progress. As indicated, I don't think that your suggestion above works, but of course I'm open to other ideas. So far nothing short of a lock has occurred to me that would avoid the race. Are you worried about deadlocks or slow-down? I've been pondering that too, but as this race is relatively rare, I'd expect that contention on these locks would be rare, too. There are some pathological symlinks that are claimed by many devices ("/dev/disk/by-partlabel/primary" is the prime example), but those aside, even in big multipath setups, there usually won't be more than 10 contenders for a given symlink. |
|
The changes proposed by @mwilck seem the simplest way to fix this race to me. @poettering, any chance you can clarify what's wrong with this approach? |
|
@poettering, any updates? I can perceive a problem myself that should be discussed. In certain cases, there can be a lot of contenders for a certain symlink. On older SUSE distributions at least, due to a historic shortcoming of parted, all GPT partitions receive the partlabel "primary" by default. This can lead to thousands of contenders for the label Real, meaningful device identifiers normally don't have so many contenders. I can imagine a multipath setup with 16 paths per device and a multipath device, adding up to 17. And anyway, a large number of contenders might also increase the probability of a race. |
|
I added extensions to the udev test suite in #8819. With the new tests I added, which allow testing simultaneous addition and removal of devices claiming the same symlink, the problem described here can be reproduced even on my laptop. |
When uevents are processed in parallel, in particular during coldplug, it can happen that several events for the same symlink arrive at the same time (e.g. a SCSI disk and a multipath DM device for that disk). This opens up a race window where a higher-priority link target may not be visible under /run/udev/links/ yet when a lower-priority one is processed, leading to the creation of the wrong symlink. Avoid this by locking the directory under /run/udev/links/.
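The locking idea from the commit message can be sketched as follows. This is an illustrative Python sketch, not udev's C code; the function names are mine, and it uses flock() for brevity (the patch discussion below mentions fcntl()-based locks). The point is that the lock is taken on the per-symlink directory itself, so only one worker at a time scans it and updates the link:

```python
import fcntl
import os

def lock_symlink_dir(links_dir, encoded_name):
    """Take an exclusive lock on <links_dir>/<encoded_name> so that only
    one worker at a time scans the directory and updates the symlink.
    The directory is created if missing, like udev does on 'add'."""
    path = os.path.join(links_dir, encoded_name)
    os.makedirs(path, exist_ok=True)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other worker holds it
    return fd                       # the lock lives as long as this fd

def unlock_symlink_dir(fd):
    """Release the lock and close the directory fd."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A second open of the same directory cannot take the lock while the first holds it, which is exactly the serialization link_update() needs.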
The patch "udev: lock symlink directories in link_update()" adds locking
in the link_update() code path which figures out the highest-prio target for
any symlink claimed by a current uevent. While this fixes the incorrectness
of the current code and avoids races, it may also cause a slowdown of udev
operation due to lock contention; in particular, for symlinks which are claimed
by a lot of different devices.
Typical real-world examples would be multipath setups with 16 paths per device
and a dm-multipath device with higher link priority. According to my research
up to this point, the code with locking handles this amount of devices quite
well. But there are also pathological cases, where a symlink is claimed by
hundreds or thousands of devices in a system. Although this indicates a system
misconfiguration, udev should be able to handle it without grinding to a halt
due to lock contention. An example for such symlinks is
"/dev/disk/by-partlabel/primary", which is claimed by every partition in a GPT
partition table on certain distributions, due to some weirdness of past
versions of the parted program.
Currently udev handles symlink targets using files that look like this:
/run/udev/links/${encoded_symlink_name}/${dev_id}
where dev_id is b$major:$minor or c$major:$minor.
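For clarity, the dev_id naming rule can be written out as a tiny helper (a sketch; the function name and signature are mine, not udev's):

```python
def dev_id(is_block: bool, major: int, minor: int) -> str:
    """Build udev's dev_id string for a device node:
    'b<major>:<minor>' for block devices, 'c<major>:<minor>' for
    character devices."""
    return ("b" if is_block else "c") + f"{major}:{minor}"
```

So the whole-disk SCSI device 8:0 is stored as "b8:0", and a device-mapper device 253:0 as "b253:0".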
When an event is processed, udev adds or removes these files according to event
type and scans the whole directory /run/udev/links/${encoded_symlink_name} to
find the entry with the highest priority. In the process, it needs to get a
udev device reference for every entry and derive the priority via libudev.
If there are many contenders for the symlink at hand, this procedure is
inefficient.
This patch changes the data structure as follows:
/run/udev/links/${encoded_symlink_name}/L:$prio/${dev_id}
Thus entries are grouped by priority, and finding the highest-prio entry is
much quicker because we hardly ever have more than 2 or 3 prio levels for any
given symlink. In combination with the locking patch, the time during which the
lock is held can be drastically reduced.
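With the grouped layout, the scan only has to compare the priority subdirectories. Another illustrative sketch under the same assumptions as above:

```python
import os

def find_highest_prio_grouped(symlink_dir):
    """Scan a grouped /run/udev/links/<name>/ directory: entries live in
    L:<prio> subdirectories, so the winning priority is found by comparing
    the few priority levels, with no per-device db lookup."""
    best_prio = None
    for entry in os.listdir(symlink_dir):
        if not entry.startswith("L:"):
            continue
        prio = int(entry[2:])               # handles negative prios too
        if best_prio is None or prio > best_prio:
            best_prio = prio
    if best_prio is None:
        return None
    # any device in the highest-prio group is an acceptable link owner
    contenders = os.listdir(os.path.join(symlink_dir, f"L:{best_prio}"))
    return contenders[0] if contenders else None
```

Even with thousands of contenders, the top-level listing stays at the handful of distinct priority levels.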
I verified that this patch doesn't cause any test suite regressions.
If udevd is restarted after applying the changes for /run/udev/links from "udev: change structure in /run/udev/links to improve efficiency", it may find the old data structures left over from a previous version. Deal with this situation by detecting it and moving the target files to the new places.
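The migration step amounts to moving each old flat entry into its L:<prio> subdirectory. A hedged sketch of that one-time conversion (names and the priority-lookup callback are illustrative, not the patch's actual code):

```python
import os

def migrate_symlink_dir(symlink_dir, get_link_priority):
    """Move old flat entries /run/udev/links/<name>/<dev_id> into the new
    per-priority layout /run/udev/links/<name>/L:<prio>/<dev_id>."""
    for entry in os.listdir(symlink_dir):
        old_path = os.path.join(symlink_dir, entry)
        # skip already-migrated L:<prio> subdirectories
        if entry.startswith("L:") or not os.path.isfile(old_path):
            continue
        prio_dir = os.path.join(symlink_dir, f"L:{get_link_priority(entry)}")
        os.makedirs(prio_dir, exist_ok=True)
        os.rename(old_path, os.path.join(prio_dir, entry))
```

Since rename() within /run is atomic, a half-finished migration never leaves an entry invisible to both the old and the new scan.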
Force-pushed from f8d8898 to 6614f8f.
|
Here is a major update to this pull request. With the help of the udev-test code I submitted in #8819, I verified the following:
I don't see an alternative to locking to achieve correctness. Any re-scanning as suggested by @poettering, without locking, would be susceptible to the same race, even if it might be triggered under slightly different timing conditions. Therefore I came up with a different approach that drastically reduces the time spent holding the lock, and the results are promising. Results for a "worst case" scenario (simultaneous events for lots of devices that all claim the same link and nothing else) are summarized in symlink-lock.pdf, where "old" refers to the current upstream, "lock" to the "lock symlink directories" patch, and "new" to the code with the patch "change structure in /run/udev/links" on top of it. (The "migration code" patch has no influence on the test results). My new approach changes data structures used by udev under |
|
At the risk of confusing everybody, I have tried another locking approach to avoid the large CPU consumption of the "lock symlink directories" patch for lots of contenders. The code is in mwilck@da9a945; I'm not pushing it here right now. The background is that locking using fcntl() is very CPU-intensive - instead of sleeping, waiters end up busy-waiting in spin_lock() in the kernel. I tried this on 4.12 and 4.15 kernels with very similar results. I'm now using SysV semaphores for locking instead. Locks are not per-symlink-file but per-hash-of-symlink-file. Details are in the patch. I've tested this approach both A) with only the "lock symlink directories" patch (5ac8373), and B) with the whole patch set including "change structure in /run/udev/links" (39ff14e). In both cases it improves performance and reduces CPU time. Details are in the updated symlink-lock.pdf. So, locking with semaphores is actually a big improvement, scaling-wise, over locking with fcntl(). But 39ff14e is necessary; without it, the scaling is improved by using semaphores, but it's still not good enough for the pathological case where 100 or more targets contend for the same symlink. |
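The per-hash-of-symlink locking mentioned above boils down to mapping each symlink name onto a fixed, small set of semaphores. This sketch shows only that slot mapping; the actual patch (mwilck@da9a945) uses a SysV semaphore set and semop(), and the slot count and hash function here are my illustrative assumptions, not the patch's values:

```python
import zlib

NSLOTS = 64  # assumed size of the semaphore set

def lock_slot(encoded_symlink_name: str) -> int:
    """Map a symlink name to one of NSLOTS semaphore slots. Unrelated
    symlinks may share a slot (harmless false contention), but the set
    stays small and waiters sleep in the kernel instead of busy-waiting
    the way contended fcntl() locks were observed to."""
    return zlib.crc32(encoded_symlink_name.encode()) % NSLOTS
```

The trade-off is bounded kernel state (one fixed semaphore set) against occasional contention between symlinks that happen to hash to the same slot.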
|
@poettering, I delved a bit into your idea to rescan the directory, trying to use a changed directory mtime as the indicator for "I need to rescan". I didn't succeed, but I gained some more insight. The current code compares priorities with udev_device_get_devlink_priority(). That requires that the device's db entry has already been written and contains an up-to-date priority value. But if many add events occur simultaneously, there's no guarantee for that. Thus, the actual race is not only between processes for different devices running link_update() for the same symlink, but also between link_update() and udev_device_update_db() calls. In udev_event_execute_rules(), udev_device_update_db() is called after udev_node_add->link_update(), thus the race window is actually larger than I originally thought. Therefore my "lock symlink directories in link_update()" patch is not sufficient to fully avoid the race, although it seemed to be good enough in my testing. This is another argument for my "change structure in /run/udev/links to improve efficiency" patch. Please have another look. |
|
Superseded by #9551. |