-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
udev: don't close FDs before dropping them from epoll #238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Make sure we never close fds before we drop their related event-source. This will cause horrible disruptions if the fd-num is re-used by someone else. Under normal conditions, this should not cause any problems as the close() will drop the fd from the epoll-set automatically. However, this changes if you have any child processes with a copy of that fd. This fixes issue systemd#163. Background: If you create an epoll-set via epoll_create() (lets call it 'EFD') you can add file-descriptors to it to watch for events. Whenever you call EPOLL_CTL_ADD on a file-descriptor you want to watch, the kernel looks up the attached "struct file" pointer, that this FD refers to. This combination of the FD-number and the "struct file" pointer is used as key to link it into the epoll-set (EFD). This means, if you duplicate your file-descriptor, you can watch this file-descriptor, too (because the duplicate will have a different FD-number, hence, the combination of FD-number and "struct file" is different as before). If you want to stop watching an FD, you use EPOLL_CTL_DEL and pass the FD to the kernel. The kernel again looks up your file-descriptor in your FD-table to find the linked "struct file". This FD-number and "struct file" combination is then dropped from the epoll-set (EFD). Last, but not least: If you close a file-descriptor that is linked to an epoll-set, the kernel does *NOTHING* regarding the epoll-set. This is a vital observation! Because this means, your epoll_wait() calls will still return the metadata you used to watch/subscribe your file-descriptor to events. There is one exception to this rule: If the file-descriptor that you just close()ed was the last FD that referred to the underlying "struct file", then _all_ epoll-set watches/subscriptions are destroyed. Hence, if you never dup()ed your FD, then a simple close() will also unsubscribe it from any epoll-set. With this in mind, lets look at fork(): Assume you have an epoll-set (EFD) and a bunch of FDs subscribed to events on that EFD. If you now call fork(), the new process gets a copy of your file-descriptor table. This means, the whole table is copied and the "struct file" reference of each FD is increased by 1. It is important to notice that the FD-numbers in the child are exactly the same as in the parent (eg., FD systemd#5 in the child refers to the same "struct file" as FD systemd#5 in the parent). This means, if the child calls EPOLL_CTL_DEL on an FD, the kernel will look up the linked "struct file" and drop the FD-number and "struct file" combination from the epoll-set (EFD). However, this will effectively drop the subscription that was installed by the parent. To sum up: even though the child gets a duplicate of the EFD and all FDs, the subscriptions in the EFD are *NOT* duplicated! Now, with this in mind, lets look at what udevd does: Udevd has a bunch of file-descriptors that it watches in its sd-event main-loop. Whenever a uevent is received, the event is dispatched on its workers. If no suitable worker is present, a new worker is fork()ed to handle the event. Inside of this worker, we try to free all resources we inherited. However, the fork() call is done from a call-stack that is never rewinded. Therefore, this call stack might own references that it drops once it is left. Those references we cannot deduce from the fork()'ed process; effectively causing us to leak objects in the worker (eg., the call to sd_event_dispatch() that dispatched our uevent owns a reference to the sd_event object it used; and drops it again once the function is left). (Another example is udev_monitor_ref() for each 'worker' that is also inherited by all children; thus keeping the udev-monitor and the uevent-fd alive in all children (which is the real cause for bug systemd#163)) (The extreme variant is sd_event_source_unref(), which explicitly keeps event-sources alive, if they're currently dispatched, knowing that the dispatcher will free the event once done. But if the dispatcher is in the parent, the child will never ever free that object, thus leaking it) This is usually not an issue. However, if such an object has a file-descriptor embedded, this FD is left open and never closed in the child. In manager_exit(), if we now destroy an object (i.e., close its embedded file-descriptor) before we destroy its related sd_event_source, then sd-event will not be able to drop the FD from the epoll-set (EFD). This is, because the FD is no longer valid at the time we call EPOLL_CTL_DEL. Hence, the kernel cannot figure out the linked "struct file" and thus cannot remove the FD-number plus "struct file" combination; effectively leaving the subscription in the epoll-set. Since we leak the uevent-fd in the children, they retain a copy of the FD pointing to the same "struct file". Thus, the EFD-subscription are not automatically removed by close() (as described above). Therefore, the main daemon will still get its metadata back on epoll_watch() whenever an event occurs (even though it already freed the metadata). This then causes the free-after-use bug described in systemd#163. This patch fixes the order in which we destruct objects and related sd-event-sources. Some open questions remain: * Why does source_io_unregister() not warn on EPOLL_CTL_DEL failures? This really needs to be turned into an assert_return(). * udevd really should not leak file-descriptors into its children. Fixing this would *not* have prevented this bug, though (since the child-setup is still async). It's non-trivial to fix this, though. The stack-context of the caller cannot be rewinded, so we cannot figure out temporary refs. Maybe it's time to exec() the udev-workers? * Why does the kernel not copy FD-subscriptions across fork()? Or at least drop subscriptions if you close() your FD (it uses the FD-number as key, so it better subscribe to it)? Or it better used FD+"struct file_table*"+"struct file*" as key to not allow the childen to share the subscription table.. *sigh* Seems like we have to live with that API forever.
kaysievers
added a commit
that referenced
this pull request
Jun 17, 2015
udev: don't close FDs before dropping them from epoll
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
(CC: @teg @kaysievers)
This fixes #163 and makes #227 obsolete. See the commit description for details.
This was fun debugging... :)