Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Jul 23, 2019

Also fix a bug in Take / Filter for large binary types.

@kszucs
Copy link
Member

kszucs commented Jul 24, 2019

@ursabot build

@pitrou pitrou force-pushed the ARROW-6000-py-large-binary branch 2 times, most recently from 3bdd5c9 to 63b4dfa Compare July 29, 2019 15:44
@codecov-io
Copy link

Codecov Report

Merging #4927 into master will increase coverage by 0.43%.
The diff coverage is 93.43%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4927      +/-   ##
==========================================
+ Coverage    87.5%   87.93%   +0.43%     
==========================================
  Files         998      908      -90     
  Lines      141869   133088    -8781     
  Branches     1418     1418              
==========================================
- Hits       124139   117036    -7103     
+ Misses      17368    16042    -1326     
+ Partials      362       10     -352
Impacted Files Coverage Δ
cpp/src/arrow/testing/random.h 100% <ø> (ø) ⬆️
cpp/src/arrow/pretty_print.cc 83.33% <ø> (ø) ⬆️
cpp/src/arrow/python/helpers.h 88.88% <ø> (ø) ⬆️
cpp/src/arrow/visitor.h 50% <ø> (ø) ⬆️
cpp/src/arrow/visitor.cc 0% <0%> (ø) ⬆️
cpp/src/arrow/python/helpers.cc 77.05% <0%> (-0.92%) ⬇️
cpp/src/arrow/json/converter-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/visitor_inline.h 88.78% <100%> (ø) ⬆️
cpp/src/arrow/array-binary-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/pretty_print-test.cc 100% <100%> (ø) ⬆️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 171c3f7...63b4dfa. Read the comment docs.

@pitrou pitrou force-pushed the ARROW-6000-py-large-binary branch from 63b4dfa to 022173e Compare July 30, 2019 08:12
@pitrou pitrou changed the title [WIP] ARROW-6000: [Python] Add support for LargeString and LargeBinary types ARROW-6000: [Python] Add support for LargeString and LargeBinary types Jul 30, 2019
@pitrou pitrou force-pushed the ARROW-6000-py-large-binary branch from 022173e to aaf009e Compare July 30, 2019 08:14
@pitrou pitrou force-pushed the ARROW-6000-py-large-binary branch from aaf009e to 9672ca4 Compare July 30, 2019 08:44
@pitrou
Copy link
Member Author

pitrou commented Jul 30, 2019

@pitrou
Copy link
Member Author

pitrou commented Jul 31, 2019

@jorisvandenbossche Want to take a look?

@jorisvandenbossche
Copy link
Member

Yes, was planning to do that, thanks for the ping!

I played a bit with the branch (mainly with large_string), some things I noticed (to be clear: I suppose most of them are expected to not yet working, so not wanting to say they should be fixed in this PR, but some items might be worth JIRA issues):

  • With dictionary_encode() is not working on the array (a = pa.array(['a', 'b', 'c'], pa.large_string()); a.dictionary_encode() gives NotImplementedError), but constructing a dictionary array manually with large string dictionary works OK. unique is also not implemented.

  • to_pandas is not yet working (both on Table or Array). to_pylist is working fine. Also not yet for parquet, but I don't know for sure that parquet has a type that could be used for this (so not sure it should be working).

  • Do we want to be able to cast string to large_string (or the other way around) ?

  • take is not working correctly (but also not raising that it is not supported), while it works correctly for normal string:

    In [29]: a = pa.array(['a', 'b', 'c'], pa.large_string()) 
    
    In [30]: a.take(pa.array([0, 2])) 
    Out[30]: 
    <pyarrow.lib.LargeStringArray object at 0x7f19719d40f8>
    [
      "",
      ""
    ]
    
    In [31]: b = pa.array(['a', 'b', 'c'], pa.string())
    
    In [32]: b.take(pa.array([0, 2]))
    Out[32]: 
    <pyarrow.lib.StringArray object at 0x7f19719cd2b0>
    [
      "a",
      "c"
    ]
    

Will take a look at the actual code now.

@pitrou
Copy link
Member Author

pitrou commented Jul 31, 2019

Ok, some answers:

  • dictionary encoding isn't implemented, that's expected.
  • to_pandas: similar. Perhaps that's something you would like to tackle in a separate issue?
  • casting from string to large_string and vice-versa: yes, I opened https://issues.apache.org/jira/browse/ARROW-6071 for that and will leave it to interested contributors
  • take not working correctly: that's unexpected; either it should work or raise NotImplementedError...

@jorisvandenbossche
Copy link
Member

Ah, I see you already opened https://issues.apache.org/jira/browse/ARROW-6071 for the casting yesterday

to_pandas: similar. Perhaps that's something you would like to tackle in a separate issue?

Yes, that's something I could look into.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already went through the python part (need to run now), which looks good.

Create large variable-length binary type
This data type may not be supported by all Arrow implementations. Unless
you need to represent data larger than 2GB, you should prefer binary().
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we clarify this is about the total size of a single array?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, forget, a single value bigger than 2GB of course implies an array that is bigger than 2GB .. ;), so the distinction does not matter much.

@pitrou
Copy link
Member Author

pitrou commented Jul 31, 2019

Ok, I've pushed a fix for the take() issue. Thanks for noticing!

@pitrou
Copy link
Member Author

pitrou commented Aug 1, 2019

Will merge soon if no more comments.

public:
static constexpr Type::type type_id = Type::STRING;
static constexpr bool is_utf8 = true;
using EquivalentBinaryType = BinaryType;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fsaintjacques Any issue with the naming?

@pitrou pitrou closed this in eb73b96 Aug 1, 2019
@pitrou pitrou deleted the ARROW-6000-py-large-binary branch August 1, 2019 10:50
@jorisvandenbossche
Copy link
Member

to_pandas: similar. Perhaps that's something you would like to tackle in a separate issue?

Have to see a bit with my priorities, but at least created an issue for it: https://issues.apache.org/jira/browse/ARROW-6115

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants