-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-16756: [C++] Introduce non-owning ArraySpan, ExecSpan data structures and refactor ScalarKernels to use them #13364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @save-buffer |
save-buffer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall seems fine, a couple of nits here and there. Biggest concerns have to do with unnecessary heap allocations
cpp/src/arrow/compute/exec.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little sus about this std::vector. May be ok for now but eventually I'd like to avoid small heap allocations. When we add a bump allocator for this kind of stuff within Acero, I'd like to switch to that.
If we want to support that now, we can just make this ExecSpan take a pointer to ExecValue and the number of ExecValues, so that ExecSpan doesn't have to touch the heap at all either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- potentially we could use SmallVector when there are a small number of fields and then switch to std::vector only for "wide" spans, either way it's something we can optimize and at minimum limit the exposure of this std::vector to users of ExecSpan
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see major issues so take these comments more as potential follow-ups
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments to help with reviewing and also notes to myself. I'll respond to the other comments and work on getting the CI to pass
cpp/src/arrow/array/array_nested.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a shared_ptr copy that is often not needed
cpp/src/arrow/array/data.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open Jira for follow up work
cpp/src/arrow/chunked_array.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another place where shared_ptr being copied needlessly
cpp/src/arrow/compute/exec.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: I noticed a couple places where very expensive std::vector<Datum> copies were happening
cpp/src/arrow/type.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this because getting the byte width from a fixed width type is quite tedious at the moment (there's many places where we do the cast-- I think having the bit_width()/byte_width() methods available on non-fixed-width types (returning -1) is pretty harmless, but if others don't like it we can remove these and add a helper function or something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method doesn't need to be virtual, does it?
cpp/src/arrow/util/bitmap.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, we should watch out for classes that take copies of shared_ptr<T> without a good reason to; this was a good example
cpp/src/arrow/util/int_util.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's plenty of other places where we should nix the use of Datum because of its bloat. We might consider promoting ExecValue to the Datum level (like ValueSpan or something)
8d11ee7 to
20dc538
Compare
|
I'm working on some incidental cleanups that I ran into fixing the Python UDF code paths. I'll push the closer-to-CI-green branch with the changes I cited in review comments (e.g. removing |
20dc538 to
3cde3e7
Compare
|
I went ahead and tackled ARROW-16819 which involved cleaning up the various places where nullary / zero-arguments functions were being called hackishly to get a batch length in. This also added a |
|
I hope that we can merge this soon after getting a green build and tackle further improvements in follow up patches because this PR is getting a little unwieldy (every time I touch something it results in a game of whackamole) |
|
I found it curious that FTR, here are the largest symbols in libarrow.so right now (from this branch): https://gist.github.com/wesm/51c97a47336da682dedd918bb7b90015 If we want to reduce binary sizes, the comparison kernels and the temporal kernels is the place to start |
method of scalar kernel evaluation Checkpoint, getting closer to compilation libarrow builds again Get everything compiling again, compute internals tests passing Get Bitmap test cases passing again Don't try filling validity bitmap if none was allocated Fix bit block visitors and a few arithmetic kernels Refactor int_util.h to use ArraySpan Some more tests passing Fix some more unit tests, compilation Another fix Fix some more tests Fix some more bugs Fixed some more tests All scalar_if_else.cc tests passing again Work partway through scalar_nested.cc Down to only a few failing scalar tests scalar kernels tests passing again fix scalar executor, tests all passing now
|
For the record, did the continuous benchmark results give any improvement (re: code size increase)? |
|
Thanks for the reviews -- I'll make sure that these things get taken care of. Several items:
|
|
I think the best solution for performance is to remove shared_ptr to data types and have a canonical owner. The current shared_ptrs have a reference count of at least 1 at all times anyway since they're stored and initialized in that vector. |
I think it's
Yes, that sounds like a reasonable solution. |
Who would be the canonical owner of a datatype? |
I'm thinking hard about how to actually do this. With the way the library is designed, if DataType instances were stored in some central place, absent some form of reference counting mechanism, it would be basically impossible to know if it is ever safe to garbage collect a data type instance. So one proposal would be something like:
I can visualize how to implement this but it would be a truly chaotic refactor So the new structure of DataType would be something like class DataType {
private:
Type::type id_;
TypeMetadata* metadata_ = NULLPTR;
};
struct TypeMetadata {
std::atomic<int> ref_count;
};
struct NestedTypeMetadata : public TypeMetadata {
std::vector<Field> children_;
};etc. |
OK, I'll see if I can do just a local run on my desktop using |
|
An atomic reference count is just how shared_ptr works under the hood, so I don't really understand how your proposal would make a difference (except perhaps for trivial datatypes?). |
|
@pitrou indeed — it would cause the reference-counting cost to only be paid for non-trivial data types: see the example structure I added to my PR comment |
|
I think we should first run benchmarks after the ExecSpan changes have been propagated to the compute engine, to see if it's still worth doing something on the DataType front. |
|
Yes I agree, no need to go overboard right now, but in principle what I described is a straightforward design that would enable us to go from |
|
If we mostly want to optimize the trivial data types, then why not just resort to |
|
A couple of thoughts:
|
Imagine a user reading a Parquet file. Right now non-trivial types get automagically created and destroyed through the creation and deallocation of Schema objects (note that APIs like |
That's possible -- do you know what is the cost of a null |
|
Garbage collecting types is probably essential for bindings such as Python and R. |
Yep I think it's just two pointers (16 bytes) that would be set to 0.
Ok then that does throw a wrench in the data type registry idea. |
Or simply use a |
We can do better than variant since the type id is the discriminant, which I don't think you can accomplish with std::variant |
|
True, but that's more code to redevelop as well :-) And it might not be more compact either, due to internal structure padding. |
|
Here is the raw benchmark results (contender is this patch merged to master versus the commit before it) on my Intel i9 (has avx512) with cpu frequency scaling disabled, seems to have yielded some speedups but also some regressions: Some of the regressions looks spurious, but some others we ought to profile to better understand, for example |
|
Since the only change to vector_hash.cc is changing https://issues.apache.org/jira/browse/ARROW-16837 to investigate |
Not a detailed thought but could the canonical owner be
+1 |
|
I spent some time playing around with the We still fail to be able to run as efficiently with batches of 10k rows.
I'm not as convinced as before that this has anything to do with thread contention. It may not have been an issue in the past so I may have been too eager to jump on that particular bandwagon.
There are however, just plain more instructions. This is obviously expected, to some degree, when going from 100k batches to 10k batches. I don't really have a solid way of knowing how much additional overhead is right.
The overhead, in terms of instructions, is very widely dispersed. If I had to pick any particular theme I would say allocation/deallocation. Here are the top-15 non-critical section calls along with my crude attempt at categorization.
|
|
Yes, that was my experience as well. When looking with Apple's TimeProfiler I saw a ton of 0.1% to 1% stack traces. The execution really is spread out such that it's hard to pinpoint an exact bottleneck. This was before this change, but I think it mostly still applies. |
|
I wouldn’t expect this PR to make much difference for the moment. We need to write a new implementation of ExecuteScalarExpression and also look at options for migrating the ExecNodes to use ExecSpan (which will require some rethinking of memory management) |
Parent issue: ARROW-16755. Also resolves ARROW-16819
ArraySpan has no shared pointers at all and is much cheaper to pass around, copy, and basically eliminates the current significant overhead associated with ExecBatch ExecBatchIterator.
This PR isn't going to show meaningful performance gains in function or expression evaluation -- that will require implementing a more streamlined expression evaluator that is based on ArraySpan.
This is only an intermediate patch to try to limit the scope of work as much as possible and facilitate follow up PRs. I have a long list of things I would like to do pretty much right away in follow up patches
Some notes:
shared_ptr<ArrayData>const DataType*orconst Scalar*would disallow the use of APIs that require eithershared_ptr<DataType>orshared_ptr<Scalar>, so I addedstd::enable_shared_from_thison these classes. I don't know whether this increases the initialization cost ofmake_shared<T>if anyone knows, but I hope that in the future we can removestd::enable_shared_from_this. It would be better to haveScalar::CopyandDataType::Copymethods so this isn't necessary, but rather than try to hack this in this PR, I left this for follow on work(IsIn, IndexIn, IsNull, IsValid)
ArrayBuilder::AppendArraySlice, stuff in arrow/util/int_util.h
In the interest of getting this merged sooner rather than later, rather than trying to make everything perfect here let's try to fix any glaring / serious issues that you see otherwise leave many improvements for follow up patches, otherwise any work in the scalar kernels codebase will be blocked.