
Fixes a metadata issue with store's `return_stored` #3064

Merged
jcrist merged 6 commits into dask:master from jakirkham:fix_store_ret_arrs_meta_bug
Jan 16, 2018

Conversation

@jakirkham (Member)

Tweaks one of the tests to work with two different arrays (differing in both shape and type) to demonstrate the bug that was occurring. Then provides a fix to `store` that makes this test pass.

Make sure that each array's metadata is handled correctly and uniquely.
@jakirkham (Member, Author)

Passing (with the exception of the pandas dev build, which is failing everywhere). Not sure if we want a changelog entry for this. Otherwise this is good to go on my end.

sources[i].chunks,
sources[i].dtype
))
result = tuple(result)
Member

I prefer comprehensions when they're not too complicated (this one certainly isn't), as it's clear that the loop is just for building a result. They're also more efficient (not that this matters here).

return tuple(Array(load_dsks_mrg, name, src.chunks, src.dtype)
             for name, src in zip(load_names, sources))

Member Author

Actually I'd prefer to keep this as a for-loop if it's all the same. I already messed up this code once and would much rather keep it in a form where what it does is clear to me. Hope that makes sense.

Member

Could you replace the indexing with zip at least?

@jakirkham (Member, Author)

Would you like a changelog entry?

There was an issue of applying the same metadata from the last array to
all preceding arrays when `return_stored=True`. This fixes that issue by
adding a `for`-loop and selecting out each array from `sources`
appropriately to match the data loaded from disk.
@jakirkham jakirkham force-pushed the fix_store_ret_arrs_meta_bug branch from 776182e to b72e346 on January 13, 2018 19:30
Array(load_dsks_mrg, n, src.chunks, src.dtype) for n in load_names
)
result = []
for i, sources_i in enumerate(sources):
Member

This still is an antipattern in Python. Semantically you're iterating over two equal-length iterators, but this is masked here by using enumerate and indexing for one of them. It'd be better to do:

for name, source in zip(load_names, sources):
    result.append(Array(load_dsks_mrg, name, source.chunks, source.dtype))

as it's immediately clear what the loop is for (see the example in the Python antipatterns guide). Using common Python idioms makes understanding the codebase easier for maintainers.

Member Author

The downside of using zip for things like this is that it quietly (without erroring) truncates to the shorter list when the lengths differ. This leaves users to figure out the problem downstream of where it was introduced. While one could suggest using zip_longest, it has essentially the same problem, except it fills in dummy values instead of erroring when erroring would be preferred.

Indexing, on the other hand, would actually error and would do so in the right location (in the for-loop), clearly pointing out that one of the lists was too short, even giving some indication of how short. Personally I was happy with range, but switched to enumerate as a compromise. If you find enumerate to be worse than range, I would be happy to revert. I suppose one could add asserts after, but that feels more like an antipattern than just indexing.

I don't buy that indexing is an antipattern. Iterating by index is a common pattern in just about every imperative language and is well understood. IMHO this is perfectly clear in what it is doing.
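The trade-off being argued here can be sketched with a toy example (the lists below are hypothetical stand-ins, not dask code):

```python
# Hypothetical lists of unequal length, standing in for load_names/sources.
names = ["a", "b", "c"]
sources = ["s1", "s2"]  # one element short

# zip silently drops the unmatched element -- no error is raised.
pairs = list(zip(names, sources))
assert pairs == [("a", "s1"), ("b", "s2")]  # "c" is quietly lost

# Indexing fails loudly, and at the point of the mismatch.
try:
    for i in range(len(names)):
        pair = (names[i], sources[i])
except IndexError as e:
    print("length mismatch detected:", e)
```

Both behaviors are deliberate in Python; which one is preferable depends on whether a length mismatch is a programming error you want surfaced immediately.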

I'm sorry to be so adamant about this, but I'm unhappy that this bug was introduced and feel it was in part because I did not stick up for my viewpoints. So am doing so now to avoid the same issue again.

Member

Two sets of thoughts coming from two different roles of "dask code reader" and "dask dev community member"

Dask code reader

The downside of using zip for things like this is that it quietly (without erroring) truncates to the shorter list when the lengths differ

If we are concerned about list length equality in this case then maybe this would be a good use of assert?

FWIW I agree with @jcrist about the preference for `for a, b in zip(A, B)`. This isn't based on any first-principles argument (these are hard to argue about) but rather on the style that I see in related projects in the ecosystem. While I agree that `for (int i = 0; i < n; i++)` is super-common in other languages, I rarely see it in PyData projects, which, I suspect, is where most of our readers come from. In C-like languages that have a foreach concept (Java, C#, ...) the foreach construct tends to be preferred over iterating on indexes (or at least this seemed to be the case several years ago when I had more exposure here).

I also have a preference for list comprehensions when possible. This is subjective though. My reasoning is that when I see a list comprehension I know immediately that it's some sort of map+filter operation. My brain can quickly get the gist of what is happening and move on. When I see a for loop my brain actually has to stop and think a bit to identify what's going on. I find list comprehensions enable my laziness, which I like, but I'm obviously not the only person reading this code. Others may have alternate views. List comprehensions can definitely be overdone. I've been chronically guilty of this in the past.
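As a toy illustration of that reading (hypothetical data, not from this PR): both forms below build the same list, but the comprehension announces the map+filter intent up front.

```python
chunks = [("x", 4), ("y", 0), ("z", 2)]

# Loop form: the reader must scan the body to see it only builds a list.
sizes = []
for name, n in chunks:
    if n:
        sizes.append((name, n * 2))

# Comprehension form: map (double) + filter (drop zero-sized) in one line.
sizes2 = [(name, n * 2) for name, n in chunks if n]

assert sizes == sizes2 == [("x", 8), ("z", 4)]
```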

Dask community member

This brings up a more general question of how we determine style within the project. I personally see two viewpoints:

  1. Those who maintain the project should determine project style. As the pool of maintainers shifts, we should expect the style to shift as well. Although to be clear, I'm using maintain separately from develop. Development is often about writing new features, while maintenance is about solving problems that arise from other people's issues. As @jakirkham spends more time doing maintenance we should naturally expect the style of the project to evolve. This certainly happened when @jcrist started doing more maintenance and I started doing less. I think the project has been greatly improved by the extra perspective.
  2. We should optimize far more to community standards and to our reader-base rather than our own personal thoughts on how code should be written. The people involved in this discussion are far more capable of writing code to meet community standards than the community is capable of understanding our code.

We obviously need to balance these. I think that Dask developers are somewhat forward thinking and can push the needle a little bit on community standards. However if we went entirely with our own objectives then Dask would have been written in Clojure three years ago and, I suspect, no one would be working on it today :)

Thanks for listening, all.

Member Author

I don't want to drag out this discussion. However I have to admit that I'm a bit bothered by the level of resistance I'm experiencing in a bug fix over a style choice. In recent memory, it has felt like there has been an uptick of similar discussions in PRs I have made. Hence I propose the following changes.

  1. We should explore using other settings from flake8, pylint, etc.
  2. We should write up a style guide for the Dask project.

I propose these changes so that we can have these style discussions exactly once hopefully saving time on both contributor and reviewer sides in the future. All further discussions will be something along the lines of "the linter failed because of..." or "please see this line in the style guide". If other style points come up down the road, they should make their way into new linter settings and/or the style guide.

Thoughts?

Member

Yeah, sorry about that. At the same time consistent style does have value for long-term maintenance. There is a trade-off here.

I'm totally in favor of automating these decisions when we can. My experience has been that no tool is able to automate all of these kinds of questions. However I do think that developer communities naturally gravitate to a common average over time, so I hope that this improves. If having a style guide would help then I'm fine with that, though I'm slightly concerned that it will favor the style of older contributors over newer ones and so cause some stagnation.

Perhaps this is a topic of conversation for a future community meeting?

Member Author

It's ok. I'd just like to improve the communication about style.

Sure, talking about it in a meeting is fine.

Agree that no tool is perfect. Though as we use various tools more, we can certainly encourage useful changes upstream. In the future, tools may be able to learn the style of a codebase and enforce consistency.

If having a style guide would help then I'm fine with that, though I'm slightly concerned that it will favor the style of older contributors over newer ones and so cause some stagnation.

Just as an observation, that bias likely already exists. A style guide would make it clearer what that bias means in terms of expected style.

Actually I would think the style guide would make it easier to shift style over time as we would be able to do one style-focused PR (e.g. to the style guide) instead of multiple unrelated PRs (e.g. to the codebase that use different styles). Though I could be wrong about this.

@jcrist (Member)

jcrist commented Jan 14, 2018

Would you like a changelog entry?

Up to you. It couldn't hurt.

To alleviate some style concerns raised, switch to using a generator
expression in a `tuple` using `zip`. Add an `assert` afterwards to
ensure we got the number of results we expected (same as number of
sources).
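A rough sketch of what that compromise looks like (names such as `load_names`, `sources`, and `Array` mirror the snippets quoted above; the surrounding dask internals are elided, so this is illustrative rather than the merged code):

```python
def build_results(load_dsks_mrg, load_names, sources, Array):
    # Generator expression inside tuple(), pairing names and sources via zip.
    result = tuple(
        Array(load_dsks_mrg, name, src.chunks, src.dtype)
        for name, src in zip(load_names, sources)
    )
    # Guard against zip silently truncating on a length mismatch.
    assert len(result) == len(sources)
    return result
```

The assert keeps zip's concise pairing while restoring the loud failure on mismatched lengths that indexing would have provided.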
We forgot to include a changelog entry for the original PR. This
corrects that error.
- Compatibility for changes to structured arrays in the upcoming NumPy 1.14 release (:pr:`2964`) `Tom Augspurger`_
- Add ``block`` (:pr:`2650`) `John A Kirkham`_
- Add ``frompyfunc`` (:pr:`3030`) `Jim Crist`_
- Add the ``return_stored`` option to ``store`` for chaining stored results (:pr:`2980`) `John A Kirkham`_
Member Author

Sorry, it appears I had forgotten this before, so I added it here. Hope that is ok.

@jakirkham (Member, Author)

Think this addresses the concerns raised. Please let me know if there is anything else.

@jakirkham (Member, Author)

Planning on merging tomorrow if no other comments.

@jcrist (Member)

jcrist commented Jan 16, 2018

Looks good, thanks @jakirkham.

@jcrist jcrist merged commit f29cf97 into dask:master Jan 16, 2018
@jakirkham jakirkham deleted the fix_store_ret_arrs_meta_bug branch January 16, 2018 18:35
@jakirkham (Member, Author)

Thanks for the reviews.
