Fixed flaky test-rearrange #5977
Conversation
Puts file cleanup in the task graph, rather than relying on Partd.
@mrocklin you mentioned that you have concerns about this approach. Can you share them when you get a chance, before I put more time into finishing it off / finding an alternative?
mrocklin
left a comment
OK, so now I see that we're still using the existing cleanup mechanism in partd, so presumably this is unlikely to be less robust. I guess I'm coming around. I do have a few comments below though.
    p.append(d, fsync=True)
except Exception:
    try:
        p.drop()
I think this matches the behavior on master in the case where something in shuffle_group_3 raises an exception. Previously, p would go out of scope and be garbage collected, which includes a call to partd.File.drop.
Because we're now explicitly providing a path to create the partd.File, we're responsible for cleaning up.
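To illustrate the point, here is a minimal, self-contained sketch of the cleanup responsibility being discussed. `FileStore` and `safe_append` are hypothetical stand-ins (not dask or partd names) for a store created at an explicit, caller-provided path:

```python
import os
import shutil
import tempfile

# Hypothetical stand-in for partd.File: a tiny on-disk store created at
# an explicit, caller-provided path.
class FileStore:
    def __init__(self, path):
        self.path = path

    def append(self, data):
        for key, value in data.items():
            with open(os.path.join(self.path, key), "ab") as f:
                f.write(value)

    def drop(self):
        shutil.rmtree(self.path, ignore_errors=True)


def safe_append(store, data):
    # Mirror the old behavior: previously p went out of scope on error
    # and garbage collection called drop(); with an explicit path we
    # must drop (and remove the directory) ourselves.
    try:
        store.append(data)
    except Exception:
        try:
            store.drop()
        finally:
            shutil.rmtree(store.path, ignore_errors=True)
        raise


path = tempfile.mkdtemp(suffix=".partd")
store = FileStore(path)
safe_append(store, {"x": b"some bytes"})
wrote = os.path.exists(os.path.join(path, "x"))
store.drop()  # cleanup is now our job, not the garbage collector's
```

The key change from master is that nothing implicit removes the directory anymore, so both the error path and the success path must call `drop()` explicitly.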
    path = None

    if path:
        shutil.rmtree(path, ignore_errors=True)
It's low priority, but some folks might create their own PartD objects here. In the future we might want a more robust solution to finding a File object.
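One possible shape for that "more robust solution" would be walking the chain of wrapper objects to find the underlying File. This is purely a sketch: the `File` and `Wrapper` classes below are illustrative stand-ins for partd wrappers (which expose the wrapped store under attributes such as `partd`, `fast`, or `slow`), and `find_file` is a hypothetical helper, not an existing dask or partd function:

```python
# Walk a chain of wrapper objects depth-first, looking for an
# underlying File with an on-disk path.
class File:
    def __init__(self, path):
        self.path = path


class Wrapper:  # stand-in for a partd wrapper holding an inner store
    def __init__(self, partd):
        self.partd = partd


def find_file(p, attrs=("partd", "fast", "slow")):
    """Search wrapper attributes recursively for a File instance."""
    if isinstance(p, File):
        return p
    for attr in attrs:
        inner = getattr(p, attr, None)
        if inner is not None:
            found = find_file(inner, attrs)
            if found is not None:
                return found
    return None  # a custom store with no File underneath


p = Wrapper(Wrapper(File("/tmp/example.partd")))
f = find_file(p)
```

A search like this would let cleanup work even when users compose their own PartD stacks, at the cost of relying on wrapper attribute conventions.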
dask/dataframe/shuffle.py
Outdated
    (name1, i): (collect, p, i, df._meta, barrier_token) for i in range(npartitions)
}
cleanup_token = "cleanup-" + always_new_token
dsk5 = {cleanup_token: (cleanup_partd_files, p, list(dsk4))}
I think that you're bringing back all of the results into a single task with list(dsk4).
If we take this approach then we might need an intermediary set of tasks that take the results of dsk4 and return None for each. That way we maintain the dependencies but don't move the data around.
Mmm OK. Do you know, does dsk3 at https://github.com/dask/dask/pull/5977/files#diff-83ae5352ddc87bd80c831102addd9b1eL407 have a similar problem? Perhaps that result isn't as large.
I don't think so, my guess is that the output tasks in dsk2 write things to disk and then return None.
The output tasks of dsk4 read from disk and return dataframes, so this is more of a concern.
Instead, I think that the solution is to have a dsk4b that maps lambda x: None across the outputs of dsk4.
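The dsk4b idea can be sketched end to end with plain dicts and a toy scheduler. The names `dsk4`/`dsk4b`/`dsk5` follow the discussion; `collect`, `cleanup_partd_files`, the `"partd-path"` string, and the minimal `get` function are all simplified stand-ins, not dask's real implementations:

```python
def collect(i):
    return list(range(1000))  # stands in for a large dataframe partition

def cleanup_partd_files(path, barriers):
    # the real task would shutil.rmtree(path, ignore_errors=True)
    return None

npartitions = 3
dsk4 = {("collect", i): (collect, i) for i in range(npartitions)}
# dsk4b maps each (large) output of dsk4 to None, so the cleanup task
# depends on all of dsk4 without the dataframes themselves moving.
dsk4b = {("noop", i): (lambda x: None, ("collect", i)) for i in range(npartitions)}
dsk5 = {"cleanup": (cleanup_partd_files, "partd-path", list(dsk4b))}
graph = {**dsk4, **dsk4b, **dsk5}

def get(dsk, key):
    """Minimal recursive scheduler, just enough to run this sketch."""
    if isinstance(key, list):
        return [get(dsk, k) for k in key]
    task = dsk.get(key, key)
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        return func(*(get(dsk, a) for a in args))
    return task

result = get(graph, "cleanup")
```

The dependency edges are preserved (cleanup cannot run until every collect has run), but only Nones flow into the cleanup task instead of npartitions worth of dataframes.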
    compute_divisions(c)

    # TODO: Fix sporadic failure on Python 3.8 and remove this xfail mark
I've removed this test, but I can restore it if needed.
It's asserting that there are some files left around, but I'm not sure that we want that / can reliably assert that. My understanding was that we wanted them to be cleaned up automatically.
This concerns me. We will often fail at this. What if some exception happens during computation and we never get to the cleanup step?
1b704d7 attempts to make this clearer. The idea is to wrap calls using the
TomAugspurger
left a comment
Gave this another look and I'm sufficiently happy with where it's at.
Thanks for your work on this @TomAugspurger

Puts file cleanup in the task graph, rather than relying on Partd to do it for us. I'm not really happy with my approach, but wanted to try this on CI a few times.
Closes #5867