Skip to content

Inotify: fix watcher.watches and paths leaks#73

Closed
chamaken wants to merge 3 commits intofsnotify:masterfrom
chamaken:inotify
Closed

Inotify: fix watcher.watches and paths leaks#73
chamaken wants to merge 3 commits intofsnotify:masterfrom
chamaken:inotify

Conversation

@chamaken
Copy link
Copy Markdown
Contributor

Thanks to @PieterD Watcher.Close() can stop goroutine cleanly.
But I think watches and paths map still leaks, and feel a bit strange
Remove() invalid filename returns differently between at first time and
after. I tried to fix those, would you revew the patches?

Watcher.watches and paths are not deleted when entry was removed
automatically. Deleting those entry at ignoreLinux() to fix it.
This changes removing invalid watch name, it (should) always
returns ``can't remove non-existent...''.
@PieterD
Copy link
Copy Markdown
Contributor

PieterD commented Mar 12, 2015

Sure thing, I'll take a look at this tonight or tomorrow. Thanks.

@nathany nathany added the linux label Mar 12, 2015
@PieterD
Copy link
Copy Markdown
Contributor

PieterD commented Mar 13, 2015

Don't have much time (I'm on holiday!) but I've got a few minutes.

inotify.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big fan that all these comments were removed.

@PieterD
Copy link
Copy Markdown
Contributor

PieterD commented Mar 13, 2015

Sorry, I don't have a ton of time :(

I'm a bit confused about what the patch is meant to fix. Could you elaborate?

@chamaken
Copy link
Copy Markdown
Contributor Author

Thank you for taking your time.
I will restore commens. How about changing first comment:
inotify_rm_watch will return EINVAL if the file has been deleted;
the inotify will already have been removed.
watches and pathes are deleted in ignoreLinux() implicitly and asynchronously
by calling inotify_rm_watch() below. e.g. readEvents() goroutine receives IN_IGNORE
so that EINVAL means that the wd is being rm_watch()ed or its file removed
by another thread and we have not received IN_IGNORE event.

@PieterD
Copy link
Copy Markdown
Contributor

PieterD commented Mar 14, 2015

Thanks. There was some stuff in there I thought might save some time looking stuff up in the future :)

@PieterD
Copy link
Copy Markdown
Contributor

PieterD commented Mar 14, 2015

LGTM. I don't have a linux machine handy so I can't play around with it, but it looks good.

@nathany
Copy link
Copy Markdown
Contributor

nathany commented Mar 15, 2015

Thanks @chamaken. I can take a look sometime over the next few days.

@nathany
Copy link
Copy Markdown
Contributor

nathany commented Apr 28, 2015

Sorry, I still haven't checked this. :-(

@nathany nathany closed this in e253224 Nov 17, 2015
@nathany
Copy link
Copy Markdown
Contributor

nathany commented Nov 17, 2015

Merged. Thanks.

@tsuna
Copy link
Copy Markdown

tsuna commented Mar 1, 2016

This change e253224 is making some of my unit tests fail non-deterministically. After adding some tracing to the code, here's what I'm seeing (both on Linux 3.4.43 and 4.1.18):

When Remove() is called on a Watcher, it now Wait()s on a condition variable after calling syscall.InotifyRmWatch(). The condition variable is notified from ignoreLinux(), at the point that we see an Event with the IN_IGNORED bit set. This seems reasonable as the man page for inotify_rm_watch says:

Removing a watch causes an IN_IGNORED event to be generated for this watch descriptor.

However I'm seeing that in two of my unit tests sometimes (more than 50% of the time) there is no IN_IGNORED event that follows the call to syscall.InotifyRmWatch(). I verified that syscall.InotifyRmWatch() doesn't return any error. There is just no event on the watch descriptor that was passed to syscall.InotifyRmWatch().

Has anybody seen this before?

@tsuna
Copy link
Copy Markdown

tsuna commented Mar 1, 2016

OK I think I got it... With this change it's no longer safe to consume Events from the same goroutine that calls Remove(). In our case our code basically looks this:

func (shared *InotifyTracker) run() {
    // ...

    for {
        select {
        case winfo := <-shared.watch:
            shared.error <- shared.addWatch(winfo.fname)

        case fname := <-shared.remove:
            shared.removeWatch(fname) // <-------- no longer legit

        // ...
    }
}

Removing the watch leads to a deadlock: we're trying to call Remove() and Remove() is waiting to be notified that the IN_IGNORED event was received, but the IN_IGNORED event can't be sent because fsnotify is trying to write another event to the Watch.

@nathany
Copy link
Copy Markdown
Contributor

nathany commented Mar 1, 2016

Thanks for investigating this @tsuna. Would you mind opening a new issue so this doesn't get lost?

Arista-Jenkins pushed a commit to aristanetworks/tail that referenced this pull request Mar 1, 2016
Calling watcher.Remove() from the run() goroutine is now problematic,
because with the change made in fsnotify/fsnotify#73 Remove() can now
take an arbitrary amount of time, which means we can deadlock if run()
is waiting for fsnotify to acknowledge the removal and fsnotify is
trying to send an unrelated Event.

So instead we now do part of the cleanup, including calling Remove(),
synchronously, in the goroutine trying to unsubscribe.

This fixes hpcloud#75.

Change-Id: I346c9eecc34b2378312b07b3c3efc41616b95380
tsuna added a commit to aristanetworks/tail that referenced this pull request Mar 1, 2016
Calling watcher.Remove() from the run() goroutine is now problematic,
because with the change made in fsnotify/fsnotify#73 Remove() can now
take an arbitrary amount of time, which means we can deadlock if run()
is waiting for fsnotify to acknowledge the removal and fsnotify is
trying to send an unrelated Event.

So instead we now do part of the cleanup, including calling Remove(),
synchronously, in the goroutine trying to unsubscribe.

This fixes hpcloud#75.  Thanks to Aaron Beitch for the fix.

Change-Id: I346c9eecc34b2378312b07b3c3efc41616b95380
@tsuna
Copy link
Copy Markdown

tsuna commented Mar 1, 2016

@nathany I'm happy to open an issue if you deem that this new code creates an undesirable side effect that needs to be fixed. As you can see above, I've submitted a PR to workaround this quirk in the tail library. But I agree that the new behavior of fsnotify as a result of this change is unfortunate because it will be more easily prone to deadlocks, depending on how the client code is written.

@nathany
Copy link
Copy Markdown
Contributor

nathany commented Mar 3, 2016

Yes. I think we should take another look at this code change. Adding and removing watches from the same goroutine should not cause deadlocks.

@tsuna
Copy link
Copy Markdown

tsuna commented Mar 3, 2016

I filed bug #123. Thanks.

Nino-K pushed a commit to hpcloud/tail that referenced this pull request Apr 8, 2016
Calling watcher.Remove() from the run() goroutine is now problematic,
because with the change made in fsnotify/fsnotify#73 Remove() can now
take an arbitrary amount of time, which means we can deadlock if run()
is waiting for fsnotify to acknowledge the removal and fsnotify is
trying to send an unrelated Event.

So instead we now do part of the cleanup, including calling Remove(),
synchronously, in the goroutine trying to unsubscribe.

This fixes #75.  Thanks to Aaron Beitch for the fix.

Change-Id: I346c9eecc34b2378312b07b3c3efc41616b95380
tsuna added a commit to aristanetworks/fsnotify that referenced this pull request Oct 17, 2016
This partially reverts pull request fsnotify#73, but helps prevent bug fsnotify#123,
which is a lot worse than the problem that was fixed by fsnotify#73.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants