Conversation
* code cleanup * code cleanup * Early exit pattern * refactor * cleanup * fix regression * annotations * revert * backend prototype * slightly faster both in CPython and Cython * slightly faster both in CPython and Cython * polish * polish * early exit * polish * polish * backend review * nonfunctional GUI prototype * GUI prototype (unpolished) * tooltip * refactor * GUI * GUI * GUI * refactor * polish * simpler tooltip * Reduce spilled size on delitem * tweak cluster-wide nbytes gauge * workers tab * Self-review * bokeh unit tests * test SpillBuffer * Code review * cython optimizations * test MemoryState * test backend * Remove unnecessary casts uint->sint * Self-review * Test edge cases * fix test failure * redesign test * relax maximums * fix test * lint * fix test * fix test * fix bar on small screens * height in em * larger * fix flaky test
I confirmed locally this fixes the problem for me too, thanks so much for the quick fix @crusaderky!
Stress test outcome:
```python
# ws._nbytes is updated at a different time and sizeof() may not be accurate,
# so size may be (temporarily) negative; floor it to zero.
size = max(0, (metrics["memory"] or 0) - ws._nbytes + metrics["spilled_nbytes"])
```
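The floor-to-zero logic above can be sketched standalone. In this sketch, `metrics` and `ws_nbytes` are stand-ins for the heartbeat payload and `WorkerState._nbytes`; the helper name is hypothetical, not part of distributed's API:

```python
# Hedged sketch of the computation above; `unmanaged_size`, `metrics`
# and `ws_nbytes` are illustrative stand-ins, not the real API.

def unmanaged_size(metrics: dict, ws_nbytes: int) -> int:
    """Process memory minus managed in-RAM bytes, plus spilled bytes.

    ws_nbytes is updated at a different time than the process RSS sample,
    and sizeof() is approximate, so the difference can transiently go
    negative; floor it to zero.
    """
    return max(0, (metrics.get("memory") or 0) - ws_nbytes + metrics["spilled_nbytes"])

# Stale ws_nbytes larger than the sampled RSS would go negative; floored to 0
assert unmanaged_size({"memory": 100, "spilled_nbytes": 0}, 150) == 0
# Normal case
assert unmanaged_size({"memory": 500, "spilled_nbytes": 50}, 200) == 350
# "memory" may be None before the first monitor tick; treated as 0
assert unmanaged_size({"memory": None, "spilled_nbytes": 20}, 0) == 20
```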
I can't see a way to write a unit test for this short of monkey-patching SystemMonitor?
It seems like it's going to be the only way; another test does exactly that: `distributed/distributed/tests/test_worker.py`, lines 1697 to 1717 in e4b534a
There is `distributed.admin.system-monitor.interval`, which controls how often the monitor runs. You could set it to an incredibly high value so that it is never executed during the test run.
Another option, using monkeypatch: you could remove the PC before it is even started. Something like

```python
@pytest.mark.asyncio
async def test_foo():
    s = Scheduler()
    s.periodic_callbacks["monitor"] = None
    w = Worker(s)
    w.periodic_callbacks["monitor"] = None
    await s
    await w
    async with Client(s):
        ...
```
Setting `distributed.admin.system-monitor.interval` to a very high value before I create the Scheduler has no effect (I can see data arriving in the heartbeat from `SystemMonitor.update`).
Setting

```python
s.periodic_callbacks["monitor"] = None
w.periodic_callbacks["monitor"] = None
```

fails with

```
    for pc in self.periodic_callbacks.values():
>       pc.stop()
E   AttributeError: 'NoneType' object has no attribute 'stop'
```

This has no effect:

```python
del s.periodic_callbacks["monitor"]
del w.periodic_callbacks["monitor"]
```

and this has no effect either:

```python
pc = PeriodicCallback(lambda: None, 999999999)
s.periodic_callbacks["monitor"] = pc
w.periodic_callbacks["monitor"] = pc
```
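The `AttributeError` above can be reproduced in isolation. This is a hypothetical sketch, not distributed's actual implementation: a server-like class that keeps periodic callbacks in a dict and calls `stop()` on every value at close time. Replacing a value with `None` breaks close, while deleting the key does not:

```python
# Hypothetical minimal reproduction; FakePC / FakeServer are invented names,
# standing in for tornado's PeriodicCallback and distributed's Server.

class FakePC:
    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

class FakeServer:
    def __init__(self):
        self.periodic_callbacks = {"monitor": FakePC()}

    def close(self):
        # Mirrors the traceback above: stop() is called on every dict value,
        # so a None entry raises AttributeError.
        for pc in self.periodic_callbacks.values():
            pc.stop()

s = FakeServer()
s.periodic_callbacks["monitor"] = None
err = None
try:
    s.close()
except AttributeError as exc:
    err = exc  # 'NoneType' object has no attribute 'stop'
assert err is not None

s2 = FakeServer()
del s2.periodic_callbacks["monitor"]  # removing the key leaves close() safe
s2.close()
```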
Would people be ok with merging this even without a test and leaving a test for a different PR? This would unblock CI in Dask-CUDA and UCX-Py (potentially other RAPIDS projects too).
I'm fine with postponing tests since our CI is also failing hard. @crusaderky, your call
I need an extra 2-3 hours to cook a unit test. I'm ok merging without one.
jrbourbeau left a comment
Thanks for your work here @crusaderky. I'm going to merge this as is to unblock CI, but left a comment below
```python
assert_memory(s, "managed_spilled", 1, 999)
# Wait for the spilling to finish. Note that this does not make the test take
# longer as we're waiting for recent_to_old_time anyway.
sleep(10)
```
Is there something more direct we can probe here instead of sleeping for 10 seconds?
Not really because the unmanaged memory is very volatile, so we don't know how many keys are going to be spilled out exactly. Also, as noted it doesn't slow the test down.
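For similar situations where a fixed sleep does cost test time, the usual alternative is to poll a condition with a deadline. This is a generic sketch (distributed ships its own test helpers; `wait_until` here is an invented name), and as noted above the fixed sleep is free in this particular test anyway:

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.05):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Generic sketch of deadline-based waiting, not distributed's own helper.
    Returns the final value of predicate(), so callers can assert on it.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())

# A condition that is already true returns immediately
assert wait_until(lambda: True)
# A condition that never becomes true times out and returns False
assert not wait_until(lambda: False, timeout=0.1)
```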
Fix regressions introduced in #4651: