udev: fix race condition in symlink generation code#9551
udev: fix race condition in symlink generation code#9551mwilck wants to merge 42 commits intosystemd:masterfrom
Conversation
keszybz
left a comment
There was a problem hiding this comment.
This is an impressive patch series. For the delay in review. The perl part looks great.
I think that they python part has some issues — maybe it can be split out to a separate PR, to get the rest merged.
I made some comments below about the C part.
|
Needs to be rebased. Could you update this? |
4cb19d4 to
46cbe12
Compare
|
thanks a lot for the review. And big apologies that it took me so long to get back to this issue. Again, sorry for the long delay, and please give this another shot. |
Allow testing cases where multiple devices are added and removed. This implies a change of the data structure: every test allows for multiple devices to be added, and "exp_name" etc. are now properties of the device, not of the test.
It's not necessary to write the rules for every udev run, as we now may have many (rather than just 2) per test.
Allow testing cases where multiple devices are added and removed simultaneously. Tests are started as synchronously as possible using a semaphore, in order to test possible race conditions. If this isn't desired, the test parameter "sleep_us" can be set to the number of microseconds to wait between udev invocations.
More often than not, the created devnode is the basename of the sysfs entry. The "devnode" device may be used to override the auto-detected node name. Permissions and major/minor number are now verified on the devnode itself, not on symlinks. For those tests where exp_name is set to the computed devnode name, the explicit "exp_name" can be removed. "exp_name" is only required for symlinks. This allows separate testing for devnodes and symlinks an a follow-up patch.
Test if symlinks are created correctly by comparing the symlink targets to the devnode path. This implies (for the symlink) that major/minor numbers and permissions are correct, as we have tested that on the devnode already.
Instead of testing the existence or non-exisitence of just a single symlink, allow testing of several links per device. Change the test definitions accordingly.
udev hasn't supported renaming device nodes for some time.
the "last_rule" option hasn't been supported for some time. Therefore this test fails if a "not_exp_links" attribute is added, as it should be. Mark it appropriately.
Add some rules that make it a bit harder to pass, mainly the non-existence checks.
These rules have survived from an ancient version of the code and save no purpose any more.
As we can check multiple links in a single test now, these 3 tests can be merged into one.
As we can test multiple devices and multiple links per device in one test now, these two tests can be merged into one.
This is helpful to catch possible regressions in the test. Also, don't count wait() errors, they are likely not udev errors.
Add 4 new tests using multiple devices. Number 2-4 use many devices claiming the same symlink, where only one device has a higher priority thatn the others. They fail sporadically with the current code, if a race condition causes the symlink to point to the wrong device. Test 4 is like test 2 with sleeps in between, it's much less likely to fail.
for easier reproduction of sporadic test failures.
Manually listing all devices in the test definition becomes cumbersome with lots of devices. Add a function that scans on all block devices in the test sysfs and generates a list of devices to test.
Script for generating LOTS of SCSI disk and partition devices in the fake sysfs we use for udev testing. This script is supposed to be run after sys-script.py. It uses code from sys-script.py as template to generate additional SCSI disk data structures. Together with the "generator" code in udev-test.pl added in the previous patch, it allows to run udev tests with almost arbitrarily many devices, and thus to do performance scaling tests.
node_symlink() itself creates the directory structure, no need to do it twice.
Increase the timeout of the udev test, which has become more time-consuming because of the multi-device tests added in previous patches.
|
@yuwata, thanks again for the thorough review. This latest push should fix the issues you raised. I also increased the timeout value to 90s, let's see if it's enough. FTR: systemd is currently using "timeout-mutiplier=3" for the ASAN + UBSAN test, but looking at the actual times required (also in "good" case), a higher multiplier seems to be in order. The ASAN+UBSAN test make take ~40s where the "normal" test completes in less than a second. I'd vote for a timeout multiplier of 10 or even 100. At first glance this seems to apply to gcc + ASAN only, not clang. But I can't claim that I've looket at this in depth. Probably up to other people to analyze. |
|
The Travis CI test has completed successfully now after the timeout had been increased. Actual runtime in the "ASAN + UBSAN" test was 107.5s. Test duration without ASAN+UBSAN was 24.6s. |
|
I am not convinced we really need locking here. What about this algorithm:
Wouldn't that work, too? this approach basically use RENAME_EXCHANGE and RENAME_NOREPLACE as a method to determine whether the symlink previously in place still matters our expectaions from the beginning of the algorithm. And if it doesn't we just restart things. All without SysV IPC and everything, with only file system primitives. |
|
(And to say this clearly, I don't think SysV IPC should ever be an option. There are so many better options, BSD file locks or even robust pthread mutexes in a shared mmap buffer, but SysV IPC is really not a good idea. Best of course is not to serialize here at all though, as I proposed above.) |
You made a similar suggestion on #8667. I replied in #8667 (comment) and #8667 (comment). The racy part is not the creation of the symlink itself. It's the determination of the best priority. Also, as long as the best priority is determined by calling This is not only a theoretical argument. As pointed out in #8667 (comment), I experimented with various approaches after you suggestion in #8667 (comment), and while the algorithm I came up with may not have been exactly the same as what you suggested here, I can clearly say that nothing except locking avoided the race - which totally makes sense from my point of view. Unfortunately you never found the time to reply to my arguments there. We could have saved time if you did. But never mind. If you insist, I can try what you just suggested instead of my locking approach, but I would be very surprised if it worked.
Would you mind explaining why you're against SysV IPC? File locks as currently available in systemd don't work because the kernel implementation scales very badly. I've provided details about that in #8667 (comment) (kernel patches are under way to improve that but it'll take time until they're in widespread use). I haven't tried pthread mutexes in shared mmap, but I wonder why you consider them "better" than SysV semaphores. The SysV implementation in the kernel is rock solid and very efficient. Also, the concept of semaphore sets is very useful to avoid using a single lock for every symlink without resorting to inefficient file locking. |
poettering
left a comment
There was a problem hiding this comment.
here's a beginning of a review, but quite frankly, sysv ipc is a no-go, this needs to be solved differently
src/udev/udev-node.c
Outdated
| return log_error_errno(-errno, "Failed to initialize semaphores: %m"); | ||
|
|
||
| /* Dummy semop to set sem_otime */ | ||
| if (semop(semid, dummy_op, (sizeof(dummy_op)/sizeof(*dummy_op))) == -1) |
There was a problem hiding this comment.
No, sysv IPC is just plain awful, let's not even start with this. They have no sane clean-up story, and just weird all over the place. I am sure in 2019 sysv IP is a total nogo for new code.
(Oh, and please don't reinvent ELEMENTSOF()) here...
| #include "udev-node.h" | ||
| #include "siphash24.h" | ||
| #include "hash-funcs.h" | ||
| #include "udev.h" |
src/udev/udev-node.c
Outdated
| * The default maximum semaphore set size under Linux (SEMMSL) is 32000. | ||
| */ | ||
| #define N_SEMAPHORES 1024 | ||
| static const char links_dirname[] = "/run/udev/links/"; |
There was a problem hiding this comment.
why define a static array for this? sounds unnecessary, just use the string directly each time? it's more readable that way I am sure
src/udev/udev-node.c
Outdated
| for (i = 0; i < N_SEMAPHORES; i++) | ||
| val[i] = 1; | ||
| if (semctl(semid, 0, SETALL, val) == -1) | ||
| return log_error_errno(-errno, "Failed to initialize semaphores: %m"); |
There was a problem hiding this comment.
no need to negate errno here, log_error_errno() dies that implicitly
src/udev/udev-node.c
Outdated
| semid = semget(key, N_SEMAPHORES, 0600|IPC_CREAT|IPC_EXCL); | ||
| if (semid != -1) { | ||
| unsigned short val[N_SEMAPHORES]; | ||
| int i; |
There was a problem hiding this comment.
you use this to iterate through an array, should be size_t hence
| return log_error_errno(-errno, "Failed to set sem_otime: %m"); | ||
| return semid; | ||
| } else { | ||
| const unsigned int RETRIES = 10, SLEEP_US = 10000; |
There was a problem hiding this comment.
we generally call "unsigned int" just "unsigned".
Also, so far we have avoid using "const" on anything that is not a pointer, since it gives useless guarantees here... maybe make this a macro?
| return semid; | ||
| } else { | ||
| const unsigned int RETRIES = 10, SLEEP_US = 10000; | ||
| unsigned int i; |
There was a problem hiding this comment.
unsigned int → unsigned, plz
src/udev/udev-node.c
Outdated
|
|
||
| semid = semget(key, N_SEMAPHORES, 0); | ||
| if (semid == -1) | ||
| return log_error_errno(-errno, "Failed to get semaphore set: %m"); |
There was a problem hiding this comment.
negation of errno is not necessary
| ds.sem_otime != 0) | ||
| return semid; | ||
| usleep(SLEEP_US); | ||
| } |
There was a problem hiding this comment.
I agree that it's not beautiful, but there's no event we can listen for. This loop is only executed if there's a race between two processes trying to initialize the semaphore set at the same time, that is, at most once during a system's uptime. AFAIK, this is how SysV semaphore initalization is supposed to be done. (FTR, as long as usleep() is called, I wouldn't call it busy-looping).
I guess what I could do instead is use some other locking mechanism, e.g. a file lock, just for the purpose of running semget(). That way we'd ensure that the semaphore set would either not exist or be fully initialized when semget() is installed. From what I saw in earlier tests though, even that file lock, which must be taken once by every udev worker, might cause a scaling problem.
But it seems that you'd reject any SysV IPC-based approach anyway, so does it make sense for me to bother?
There was a problem hiding this comment.
yeah, sysvipc is not the way to go, sorry. the usleep() thing is just one issue among many with sysvipc...
But again, I don't see why the renameat() loop proposal I gave wouldn't work. What's essential is that the determination which symlink to use is enclosed with the readlinkat() before and the renameat() after and repeated until the result of the latter matches what we expect based on the results of the former. can you provide me with a runthrough ordering where this wouldn't work correctly?
| } | ||
|
|
||
| static unsigned short get_sema_index(const char *link) { | ||
| static const unsigned char seed[16] = { 0x6b, 0xb0, 0xb1, 0x28, |
There was a problem hiding this comment.
why "unsignec char"? siphash24_init expects an uint8_t[] array there... which is of course the same, but let's not create needless confusion
Well, i proposed ding any determination of which links wins between the initial readlink and the final update of the symlink. This means when multiple clients run through this concurrently, they will automatically retry if during the determination of the winning symlink the preconditions change or the result doesn't match with what the other contenders come up with. |
sysv ipc are sticky objects outside of any namespace binding in a global world-writable namespace. This means, they are easily DoS-able, and their clean-up doesn't really work, they don't get cleaned up implicitly if the object they belong to goes away... And let's not even start with the busy looping around usleep()! In 2019? |
AFAIK there's CLONE_NEWIPC, so IPC objects support namespaces just like anything. It's true that someone else could access my semaphore set in the same container and lock the semaphores, causing udev to get stalled. But the permissions are set such that only root could do that. If someone has root priviliges, I'd guess he'd have better means to DoS the system then blocking these semaphores. In general I fail to see what kind DoS attacks would be possible here that couldn't be done just as easily with file locks or mutexes.
You're right about the cleanup. But the same would apply to any locking mechanism we'd use for udev. The locks must be shared between all udev processes, so at what point in time could we reasonably assume that the resources can be freed? IMHO, no sooner than when we'd be sure that no more udev workers could be spawned, which would be almost immediately before the final system shutdown.
If it really hurts you that much... I'd implement a different initialization sequence if I could conceive of one, and if I'd see a chance to get it merged. |
|
Suggestion: Maybe you can merge the udev testsuite part of the patch set. After that, we could then give @poettering's idea another try. If we can get it to work, fine. If not, we'll have to reiterate the locking discussion. |
Nah, that's not what I meant, I wasn't referring to Linux IPC namespaces, but simply to the space of names that a sysv IPC object can have: it's a single, system-wide one, and if I am unprivileged I can still take any object I like, and then it's mine, and everybody else has a problem. This is different from let's say AF_UNIX sockets in the file system: they have an hierarchial namespace where through UNIX access modes on some dir I can control exactly who can create sockets below it and who cannot. Thus, there's no DoS by unpriv processes possible. |
Sure, can you submit that as a separate PR, please? of course, any test that checks for this race should not be enabled at first, so that the CI doesn't suddenly fail all the time, as long as the race isn#t fixed. |
There was #8819, which was just that and which had been superseded by this one. It needs to be updated of course. |
OK, got it. But you have to admit that in practice, that's not a problem for a semaphore set that udev would grab while still running in the initrd. |
A necessary condition for this to work is that we change the udev code to create the db file first, then the symlink. Not sure if it's sufficient though. |
|
@mwilck do you still plan to work on this? I think reservations that @poettering had regarding use of sysV APIs (semaphores) could be fixed. My idea is that main daemon could allocate memory using memfd_create() + ftruncate(), next it would map memfd using mmap() and initialze unnamed POSIX semaphore in that memory region. Then, dup of memfd could be passed to worker (e.g. via AF_UNIX socket pair) and worker could then map memfd memory segment and would hence get access to unnamed semaphore. With this model we wouldn't have any externally visible artifacts and hence nothing to clean-up. |
|
if you really need to use locking, then please use BSD file locks (i.e. flock()). they are the only ones that have somewhat sensible semantics. take on some suitable file or dir that isn't accessible by unpriv users. |
|
but again, i still don't see why locking is required here. i.e. update the db, then update the symlinks to match the db. do so with RENAME_EXCHANGE so that you can then validate if you succeeded without anyone interfering by looking at the symlink you replace after you replaced it. If you noticed that things changed you just start the algorithm anew until you complete it without anyone interfering. i.e. the symlink updating logic should be idempotent: it should always synthesize the highest priority symlink based on the current state, and if the state changes during this determination this can be detected through the RENAME_EXCHANGE trick because we know what we replaced. the RENAME_EXCHANGE thing is our "cas" if you so will. to illustrate here the supposed algorithm in pseudo code:
The outer loop makes sure that what we do operates on a congruent snapshot of the symlinks dir: we keep repeating things until the state of the db before and after we did things didn't change. The inner loop adjusts the device node symlinks to match the db. |
|
Thanks for looking into this again; good to know that there's still interest. I wanted to look into Lennart's previous suggestions, but unfortunately I've had no time to work on it. I'd need to rebase and try to reproduce the issue before doing anything else. The scaling issues with |
|
A loosely related question that keeps occurring to me: I wonder if the udev "data base" couldn't be implemented more efficiently in some other way, like using an actual data base code such as SQLite. Sorry if that's a totally dumb idea. It's obviously out of scope for this issue. |
|
I guess we can close this now that #17431 is merged. |
This request replaces the previous pull requests #8667 and #8819. These requests belong together, as #8819 introduced currently failing tests that were fixed by #8667. Especially in #8667, several updates from my side may confuse readers. Therefore I decided to create a new pull request.
With lots of contenders for a certain symlink, the current udev code suffers from two problems:
/run/udev/links/$LINK. To obtain the highest-prio link, udev must read all entries and retrieve their priority from the udev db, which is again inherently racy. This inefficiency is strongly aggravated if locking is used to solve the first problem.Another issue which I faced when looking for a solution for this problem was the bad scaling of
fcntl()based file locking in current kernels. Neil Brown has developed a fix for this, but it will take a lot of time until that's broadly available. Therefore I am using a different approach for locking based on a set of SysV semaphores, which doesn't have kernel-side scaling issues.In the discussion for #8667, @poettering had asked for a solution without locking ("simple rescan"). After having conducted lots of experiment and research in alternative solutions I can assure that no such solution exists. Like in other, similar cases, locking the shared data structure is the only safe way to avoid the race.
Solving both 1.) and 2.) above requires both locking and changing the internal data structures udev uses for managing symlink targets. With the method presented here, the race is avoided, and yet the code scales better than the original in my tests.
The patch set also fixes some minor issues I encountered while working on the udev test suite.