Dask dataframe - divisions don't seem to be consistent #1057

@thequackdaddy

Hello,

I wanted to point out that the definition of a "division" is not consistent between the different input formats.

divisions appears to be a tuple in one of three styles:

  • Style A
    (first index in first division, last index in first division, last index in second division, ... , last index in n-1 division, last index in n division)
  • Style B
    (first index in first division, first index in second division, ... , first index in n division, last index in n division)
  • Style C
    All divisions are labeled as None.
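
To make the three styles concrete, here is a small pure-Python sketch. It does not use dask at all: `style_a`, `style_b`, and `style_c` are hypothetical helpers I made up for illustration, and the partition contents are invented. I'm also assuming Style C has one `None` per boundary, so that all three tuples have n + 1 entries for n partitions.

```python
# Each inner list holds the (sorted) index values of one partition.
# These values are made up for illustration.
partitions = [[0, 1, 2], [3, 4, 5], [6, 7]]


def style_a(parts):
    """First index of the first partition, then the last index of every partition."""
    return (parts[0][0],) + tuple(p[-1] for p in parts)


def style_b(parts):
    """First index of every partition, then the last index of the final partition."""
    return tuple(p[0] for p in parts) + (parts[-1][-1],)


def style_c(parts):
    """Every boundary is None (assumed: one entry per boundary, n + 1 in total)."""
    return (None,) * (len(parts) + 1)


print(style_a(partitions))  # (0, 2, 5, 7)
print(style_b(partitions))  # (0, 3, 6, 7)
print(style_c(partitions))  # (None, None, None, None)
```

Styles A and B agree on the first and last entries and differ only on the interior boundaries, which is why the mismatch is easy to miss.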

I think this is how things are currently set up:

  • read_csv uses Style C
  • from_array uses Style B (except that it skips the last division when the index is already the last record; based on the logic, I'm not sure this can ever happen, actually. See here.)
  • from_pandas uses Style B
  • from_bcolz uses Style B
  • from_dask_array uses Style B
  • from_castra uses Style A
  • read_hdf uses Style C

So which one is the most "correct"?

Essentially, I'm trying to use a series.where statement: if a condition is met, use one column; otherwise, use another (the "other" is actually a series created with from_pandas). I think the slightly different divisions are what is breaking this.
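
As a rough illustration of how I suspect the mismatch could bite (plain Python again, not dask's actual code): `partition_of` below is a hypothetical lookup that reads every divisions tuple as Style B boundaries. If it is handed a Style A tuple for the same data, it places a boundary index in the wrong partition.

```python
from bisect import bisect_right


def partition_of(divisions, key):
    """Hypothetical lookup that treats divisions as Style B boundaries:
    every entry except the last is the first index of a partition."""
    last_partition = len(divisions) - 2
    # Clip so that the final (inclusive) boundary maps to the last partition.
    return min(bisect_right(divisions, key) - 1, last_partition)


# Same made-up data as before: partitions [0, 1, 2], [3, 4, 5], [6, 7].
style_a_divisions = (0, 2, 5, 7)
style_b_divisions = (0, 3, 6, 7)

key = 2  # this index really lives in partition 0
print(partition_of(style_b_divisions, key))  # 0
print(partition_of(style_a_divisions, key))  # 1
```

If two series are aligned partition by partition and their divisions were built in different styles, a lookup like this would pair up the wrong rows at the boundaries, which would be consistent with what I'm seeing.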

Happy to program a fix. It looks straightforward enough for all but the from_castra and read_hdf.

Thanks.


Update: I started watching March Madness too early and messed up some of these.
