
Semaphore: Ensure semaphore picks correct IOLoop for threadpool workers#4060

Merged
mrocklin merged 4 commits into dask:master from fjetter:gh4057/ensure_ioloop_threadpool
Aug 24, 2020

Conversation


@fjetter fjetter commented Aug 21, 2020

When running workers in a threadpool, the Tornado periodic callback (PC) doesn't seem to pick the correct IOLoop instance. The PC is effectively never scheduled, causing all semaphore leases to time out.

Closes #4057
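The underlying pitfall is general: work scheduled from a worker thread has to be handed to the client's event loop explicitly, because asking for the "current" loop inside that thread yields the wrong (or no) loop. A minimal sketch with plain asyncio (no Tornado; all names here are illustrative, not distributed's API):

```python
import asyncio
import threading


def demo():
    """Hand a callback from a worker thread to a specific event loop.

    'refreshed' is only recorded because the worker targets `loop`
    explicitly via call_soon_threadsafe; relying on the thread's own
    "current" loop would schedule nothing on `loop` -- the analogue of
    the never-scheduled PC described above.
    """
    results = []
    loop = asyncio.new_event_loop()

    def worker():
        # Explicitly target the known loop; call_soon_threadsafe is the
        # thread-safe way to schedule onto a loop owned by another thread.
        loop.call_soon_threadsafe(results.append, "refreshed")
        loop.call_soon_threadsafe(loop.stop)

    t = threading.Thread(target=worker)
    t.start()
    loop.run_forever()  # runs until the worker schedules loop.stop()
    t.join()
    loop.close()
    return results
```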

with dask.config.set(
    {
        "distributed.scheduler.locks.lease-validation-interval": 0.01,
        "distributed.scheduler.locks.lease-timeout": 0.05,

One comment here: during some quick testing, there seem to be some delays with these timeouts. I think the semaphore was released after 10 seconds when I set the timeout to 200 ms, and when I set it very low, the refresh didn't happen quickly enough (I think that was < 20 ms, though).

fjetter (Member Author):

Can you reproduce this? If so, another ticket would be appreciated.


I'm not sure. The settings might be a bit confusing. You can set the timeout much lower than the validation interval, which might have caused it. Although the default is 1 second?


Testing it locally: the test is actually fine; it seems a timeout can occur at the latest at validation-interval + lease-timeout.

fjetter (Member Author):

If the timeout is lower than the validation interval, this guarantees that every lease times out. We do not check for this, though.

interval: how often the scheduler checks the leases
timeout: the maximum time between refreshes before we consider the lease stale

I just realised that the description in the config is very misleading. I'll adjust it. Thanks
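To make the worst case concrete: a lease that is refreshed just after a validation pass can go stale for up to lease-timeout and then sit undetected until the next pass, so detection can lag by up to validation-interval + lease-timeout. A sketch of such a validation pass (illustrative names, not distributed's actual implementation):

```python
import time


def find_stale_leases(last_refresh, lease_timeout, now=None):
    """Return ids of leases whose last refresh is older than lease_timeout.

    Sketch of the check a scheduler would run once per
    lease-validation-interval; `last_refresh` maps lease id to the
    timestamp of its last refresh.
    """
    if now is None:
        now = time.monotonic()
    return [lid for lid, ts in last_refresh.items() if now - ts > lease_timeout]
```

With lease_timeout=1.0, a lease last refreshed at t=0 is reported stale by any check run after t=1; if checks only happen every validation-interval, that report may come almost a full interval later.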

Comment on lines +376 to +378
# Ensure that the PC below takes the correct IOLoop
if not self.client.io_loop.is_current:
self.client.io_loop.make_current()
Reviewer (Member):

Instead I recommend that we start (and maybe create) the periodic callback in the __await__ method. Changing default event loops in different threads seems like it might cause trouble later on.

fjetter (Member Author):

Setting this in __await__ is not sufficient, since the await is usually only called on the client side. The PC must be registered on the worker side, though, since the worker acquires the lease and needs to take care of the refresh. Even if it was awaited on the worker, we would still have the same problem, wouldn't we?

fjetter (Member Author):

I could reset the default once the PC is created and started. Not really nice, but I'm not sure how else to deal with this as long as the PC doesn't allow setting the event loop (I think this was possible pre-5.0, but I do not know why it was removed).

Reviewer (Member):

"Setting this in __await__ is not sufficient since the await is usually only called on client side"

Hrm, I see. My expectation was that we would create things in __init__ and then either await, or call sync(await) when creating these objects. I see that that isn't always the case though.

In that case I would use client.loop.add_callback to make sure that this gets run in the event loop.

I think that it should work to create the PeriodicCallback here and then start it in the event loop

self.refresh_callback = PeriodicCallback(...)
self.client.loop.add_callback(self.refresh_callback.start)
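A plain-asyncio sketch of that pattern (hypothetical Refresher class, not distributed's API): the constructor only records state, and start() hands the actual scheduling over to the target loop, so it is safe to call from any thread.

```python
import asyncio
import threading
import time


class Refresher:
    """Start a periodic callback on a specific event loop from any thread.

    Hypothetical sketch of the client.loop.add_callback(pc.start) idea.
    """

    def __init__(self, loop, interval, fn):
        self.loop = loop
        self.interval = interval
        self.fn = fn
        self._task = None

    def start(self):
        # The task must be created on the loop's own thread; hand it over.
        self.loop.call_soon_threadsafe(self._start_on_loop)

    def _start_on_loop(self):
        self._task = self.loop.create_task(self._run())

    async def _run(self):
        while True:
            self.fn()
            await asyncio.sleep(self.interval)


def demo(duration=0.2, interval=0.02):
    """Run a refresher for `duration` seconds; return how often it fired."""
    calls = []
    loop = asyncio.new_event_loop()
    t = threading.Thread(target=loop.run_forever, daemon=True)
    t.start()
    # Called from this (non-loop) thread, yet lands on the right loop.
    Refresher(loop, interval, lambda: calls.append(1)).start()
    time.sleep(duration)
    loop.call_soon_threadsafe(loop.stop)
    t.join()
    loop.close()
    return len(calls)
```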

Reviewer (Member):

I noticed that in a previous comment @pvanderlinden said that this might not work. I would be a little bit surprised by this.

Reviewer (Member):

"but I do not know why it was removed"

I think that the compromise that we worked out with the Tornado team was that the event loop would be set in pc.start, so we just needed to make sure that that gets run in the right place.


@mrocklin That's in the issue, not in the PR, but I was using IOLoop.current(), so that's why it didn't work.
Should the refresh_callback be started on acquire, which is already a coroutine?

fjetter (Member Author):

Indeed, the add_callback seems to work. I'm wondering how other PCs are affected and if we should replace all pc.start() with this.

"Should the refresh_callback be started on acquire, which is already a coroutine?"

Theoretically, that should work, but for whatever reason a few of the tests break when trying this. Instead of debugging that, I'd prefer to use add_callback.

Reviewer (Member):

"Indeed, the add_callback seems to work. I'm wondering how other PCs are affected and if we should replace all pc.start() with this."

I think that we're usually pretty careful with these, but I wouldn't be surprised if something else has slipped.

@mrocklin

This looks good to me. Merging. Thanks @fjetter for the work and @pvanderlinden for raising the issue and helping with review.

@mrocklin mrocklin merged commit e08aa22 into dask:master Aug 24, 2020

Linked issue: Semaphores release automatically - refresh won't run (#4057)