Skip to content

Specify scheduler in DataFrame assert_eq#8559

Merged
jrbourbeau merged 4 commits intodask:mainfrom
gjoseph92:assert-eq-set-scheduler
Jan 18, 2022
Merged

Specify scheduler in DataFrame assert_eq#8559
jrbourbeau merged 4 commits intodask:mainfrom
gjoseph92:assert-eq-set-scheduler

Conversation

@gjoseph92
Copy link
Collaborator

Support passing scheduler= to dask.dataframe.utils.assert_eq. By default, assert_eq continues to use the sync scheduler, but this allows specific tests to override that as necessary.

Split out from #8392.

  • Tests added / passed
  • Passes pre-commit run --all-files

@gjoseph92 gjoseph92 requested a review from jrbourbeau January 11, 2022 21:57
@gjoseph92 gjoseph92 self-assigned this Jan 11, 2022
@jrbourbeau
Copy link
Member

Thanks @gjoseph92! cc @rjzamora who I recall wanted something like this previously

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Jan 12, 2022
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gjoseph92 -- I left a couple comments about the test that was added, but am looking forward to seeing this added

with pytest.raises(AssertionError):
assert_eq(ddf2, ddf2)

assert_eq(ddf2, ddf2, scheduler=custom_scheduler)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is close to what we want, but it's not clear to me that this is actually using custom_scheduler. For example, if we just added a scheduler= keyword to assert_eq and then did nothing with it, this line would still pass. If custom_scheduler incremented some counter when it was called and we then asserted that the counter has been incremented the expected number of times after calling assert_eq it would be more straightforward to see that custom_scheduler is actually being used (there certainly could be other approaches too -- I'm just throwing one possibility out there)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point here was that if you weren't using custom_scheduler, the computation would fail (as just tested above). I could also have the custom_scheduler function increment a counter every time it's called, but that seems kind of redundant to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Thanks for the clarification.

Somewhat related, what about updating the assert_eq calls to be the original pandas DataFrame (i.e. something along the of assert_eq(df + 1, ddf2))? That should result in us exercising all the same logic currently in the test, while also giving more confidence since we know we're comparing the computed ddf2 result to a known object which doesn't require a dask computation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked it against itself just because it seemed like an easier way to verify that both sides of the assert_eq were respecting the scheduler= logic. Otherwise, besides parametrizing the test to flip them around, it would be possible to have a future regression where one of the calls inside assert_eq forgot to pass the argument through.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gjoseph92

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants