
Fixes a metadata issue with store's `return_stored` #3064

Merged
jcrist merged 6 commits into dask:master from jakirkham:fix_store_ret_arrs_meta_bug
Jan 16, 2018

Conversation

@jakirkham (Member)

Tweaks one of the tests to work with two different arrays (differing in both shape and type) to demonstrate the bug that was occurring. Then provides a fix to `store` that makes this test pass.

Make sure that each array's metadata is handled correctly and uniquely.
@jakirkham (Member, Author)

Passing (with the exception of the pandas dev build, which is failing everywhere). Not sure if we want a changelog entry for this. Otherwise this is good to go on my end.

sources[i].chunks,
sources[i].dtype
))
result = tuple(result)
Member

I prefer comprehensions when they're not too complicated (this one certainly isn't), as it's clear that the loop is just for building a result. They're also more efficient (not that this matters here).

return tuple(Array(load_dsks_mrg, name, src.chunks, src.dtype)
             for name, src in zip(load_names, sources))

Member Author

Actually I'd prefer to keep this as a for-loop if it's all the same. I already messed up this code once and would much rather keep it in a form where what it does is clear to me. Hope that makes sense.

Member

Could you replace the indexing with zip at least?

@jakirkham (Member, Author)

Would you like a changelog entry?

There was an issue of applying the same metadata from the last array to
all preceding arrays when `return_stored=True`. This fixes that issue by
adding a `for`-loop and selecting out each array from `sources`
appropriately to match the data loaded from disk.
@jakirkham jakirkham force-pushed the fix_store_ret_arrs_meta_bug branch from 776182e to b72e346 on January 13, 2018 19:30
Array(load_dsks_mrg, n, src.chunks, src.dtype) for n in load_names
)
result = []
for i, sources_i in enumerate(sources):
Member

This still is an antipattern in Python. Semantically you're iterating over two equal-length iterators, but this is masked here by using enumerate and indexing for one of them. It'd be better to do:

for name, source in zip(load_names, sources):
    result.append(Array(load_dsks_mrg, name, source.chunks, source.dtype))

as it's immediately clear what the loop is for (see the example in the Python antipatterns guide). Using common Python idioms makes understanding the codebase easier for maintainers.

Member Author

The downside of using zip for things like this is that it quietly (without erroring) truncates to the shorter list when the lengths differ. This leaves users to figure out the problem downstream of where it was introduced. While one could suggest using zip_longest, it has essentially the same problem, except it fills in dummy values instead of erroring when erroring would be preferred.

Indexing, on the other hand, would actually error and would do so in the right location (in the for-loop), clearly pointing out that one of the lists was too short, even giving some indication of how short. Personally I was happy with range, but switched to enumerate as a compromise. If you find enumerate to be worse than range, I would be happy to revert. I suppose one could add asserts after, but that feels more like an antipattern than just indexing.

I don't buy that indexing is an antipattern. Iterating by index is a common pattern in just about every imperative language and is well understood. IMHO this is perfectly clear in what it is doing.
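The trade-off being argued here can be sketched with a toy example (the lists below are hypothetical stand-ins, not dask code):

```python
# Hypothetical lists of unequal length, standing in for load_names/sources.
names = ["a", "b", "c"]
sources = ["s1", "s2"]  # one element short

# zip silently drops the unmatched element -- no error is raised.
pairs = list(zip(names, sources))
assert pairs == [("a", "s1"), ("b", "s2")]  # "c" is quietly lost

# Indexing fails loudly, and at the point of the mismatch.
try:
    for i in range(len(names)):
        pair = (names[i], sources[i])
except IndexError as e:
    print("length mismatch detected:", e)
```

Both behaviors are deliberate in Python; which one is preferable depends on whether a length mismatch is a programming error you want surfaced immediately.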

I'm sorry to be so adamant about this, but I'm unhappy that this bug was introduced and feel it was in part because I did not stick up for my viewpoints. So am doing so now to avoid the same issue again.

Member

Two sets of thoughts coming from two different roles of "dask code reader" and "dask dev community member"

Dask code reader

The downside of using zip for things like this is that it quietly (without erroring) truncates to the shorter list when the lengths differ

If we are concerned about list length equality in this case then maybe this would be a good use of assert?

FWIW I agree with @jcrist about the preference for `for a, b in zip(A, B)`. This isn't based on any first-principles argument (these are hard to argue about) but rather on the style that I see in related projects in the ecosystem. While I agree that `for (int i = 0; i < n; i++)` is super-common in other languages, I rarely see it in PyData projects, which, I suspect, is where most of our readers come from. In C-like languages that have a foreach concept (Java, C#, ...) the foreach construct tends to be preferred over iterating on indexes (or at least this seemed to be the case several years ago when I had more exposure here).

I also have a preference for list comprehensions when possible. This is subjective though. My reasoning is that when I see a list comprehension I know immediately that it's some sort of map+filter operation. My brain can quickly get the gist of what is happening and move on. When I see a for loop my brain actually has to stop and think a bit to identify what's going on. I find list comprehensions enable my laziness, which I like, but I'm obviously not the only person reading this code. Others may have alternate views. List comprehensions can definitely be overdone. I've been chronically guilty of this in the past.
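As a toy illustration of that reading (hypothetical data, not from this PR): both forms below build the same list, but the comprehension announces the map+filter intent up front.

```python
chunks = [("x", 4), ("y", 0), ("z", 2)]

# Loop form: the reader must scan the body to see it only builds a list.
sizes = []
for name, n in chunks:
    if n:
        sizes.append((name, n * 2))

# Comprehension form: map (double) + filter (drop zero-sized) in one line.
sizes2 = [(name, n * 2) for name, n in chunks if n]

assert sizes == sizes2 == [("x", 8), ("z", 4)]
```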

Dask community member

This brings up a more general question of how we determine style within the project. I personally see two viewpoints:

  1. Those who maintain the project should determine project style. As the pool of maintainers shifts, we should expect the style to shift as well. Although to be clear, I'm using maintain separately from develop. Development is often about writing new features, while maintenance is about solving problems that arise from other people's issues. As @jakirkham spends more time doing maintenance we should naturally expect the style of the project to evolve. This certainly happened when @jcrist started doing more maintenance and I started doing less. I think the project has been greatly improved by the extra perspective.
  2. We should optimize far more to community standards and to our reader-base rather than our own personal thoughts on how code should be written. The people involved in this discussion are far more capable of writing code to meet community standards than the community is capable of understanding our code.

We obviously need to balance these. I think that Dask developers are somewhat forward thinking and can push the needle a little bit on community standards. However if we went entirely with our own objectives then Dask would have been written in Clojure three years ago and, I suspect, no one would be working on it today :)

Thanks for listening, all.

Member Author

I don't want to drag out this discussion. However I have to admit that I'm a bit bothered by the level of resistance I'm experiencing in a bug fix over a style choice. In recent memory, it has felt like there has been an uptick of similar discussions in PRs I have made. Hence I propose the following changes.

  1. We should explore using other settings from flake8, pylint, etc.
  2. We should write up a style guide for the Dask project.

I propose these changes so that we can have these style discussions exactly once hopefully saving time on both contributor and reviewer sides in the future. All further discussions will be something along the lines of "the linter failed because of..." or "please see this line in the style guide". If other style points come up down the road, they should make their way into new linter settings and/or the style guide.

Thoughts?

Member

Yeah, sorry about that. At the same time consistent style does have value for long-term maintenance. There is a trade-off here.

I'm totally in favor of automating these decisions when we can. My experience has been that no tool is able to automate all of these kinds of questions. However I do think that developer communities naturally gravitate to a common average over time, so I hope that this improves. If having a style guide would help then I'm fine with that, though I'm slightly concerned that it will favor the style of older contributors over newer ones and so cause some stagnation.

Perhaps this is a topic of conversation for a future community meeting?

Member Author

It's ok. I'd just like to improve the communication about style.

Sure, talking about it in a meeting is fine.

Agree that no tool is perfect. Though as we use various tools more, we can certainly encourage useful changes upstream. In the future, tools may be able to learn the style of a codebase and enforce consistency.

If having a style guide would help then I'm fine with that, though I'm slightly concerned that it will favor the style of older contributors over newer ones and so cause some stagnation.

Just as an observation, that bias likely already exists. A style guide would make it clearer what that bias means in terms of expected style.

Actually I would think the style guide would make it easier to shift style over time as we would be able to do one style-focused PR (e.g. to the style guide) instead of multiple unrelated PRs (e.g. to the codebase that use different styles). Though I could be wrong about this.

@jcrist (Member)

jcrist commented Jan 14, 2018

Would you like a changelog entry?

Up to you. It couldn't hurt.

To alleviate some style concerns raised, switch to using a generator
expression in a `tuple` using `zip`. Add an `assert` afterwards to
ensure we got the number of results we expected (same as number of
sources).
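A rough sketch of what that compromise looks like (names such as `load_names`, `sources`, and `Array` mirror the snippets quoted above; the surrounding dask internals are elided, so this is illustrative rather than the merged code):

```python
def build_results(load_dsks_mrg, load_names, sources, Array):
    # Generator expression inside tuple(), pairing names and sources via zip.
    result = tuple(
        Array(load_dsks_mrg, name, src.chunks, src.dtype)
        for name, src in zip(load_names, sources)
    )
    # Guard against zip silently truncating on a length mismatch.
    assert len(result) == len(sources)
    return result
```

The assert keeps zip's concise pairing while restoring the loud failure on mismatched lengths that indexing would have provided.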
We forgot to include a changelog entry for the original PR. This
corrects that error.
- Compatibility for changes to structured arrays in the upcoming NumPy 1.14 release (:pr:`2964`) `Tom Augspurger`_
- Add ``block`` (:pr:`2650`) `John A Kirkham`_
- Add ``frompyfunc`` (:pr:`3030`) `Jim Crist`_
- Add the ``return_stored`` option to ``store`` for chaining stored results (:pr:`2980`) `John A Kirkham`_
Member Author

Sorry, it appears I had forgotten this before, so I added it here. Hope that is ok.

@jakirkham (Member, Author)

Think this addresses the concerns raised. Please let me know if there is anything else.

@jakirkham (Member, Author)

Planning on merging tomorrow if no other comments.

@jcrist (Member)

jcrist commented Jan 16, 2018

Looks good, thanks @jakirkham.

@jcrist jcrist merged commit f29cf97 into dask:master Jan 16, 2018
@jakirkham jakirkham deleted the fix_store_ret_arrs_meta_bug branch January 16, 2018 18:35
@jakirkham (Member, Author)

Thanks for the reviews.
