
Conversation

@wesm wesm commented Jun 10, 2022

Parent issue: ARROW-16755. Also resolves ARROW-16819

ArraySpan has no shared pointers at all and is much cheaper to pass around and copy; it essentially eliminates the significant overhead currently associated with ExecBatch and ExecBatchIterator.

This PR isn't going to show meaningful performance gains in function or expression evaluation -- that will require implementing a more streamlined expression evaluator that is based on ArraySpan.

This is an intermediate patch intended to limit the scope of work as much as possible and facilitate follow-up PRs. I have a long list of things I would like to do pretty much right away in follow-up patches.

Some notes:

  • The ArraySpan retains pointers to the buffers that were used to populate it, because in many places in the existing scalar kernels we have to "go back" to a shared_ptr<ArrayData> (a rough sketch of the span layout follows these notes)
  • There are multiple places where having only a const DataType* or const Scalar* would disallow the use of APIs that require a shared_ptr<DataType> or shared_ptr<Scalar>, so I added std::enable_shared_from_this to these classes. I don't know whether this increases the initialization cost of make_shared<T> (please chime in if you know), but I hope that in the future we can remove std::enable_shared_from_this. It would be better to have Scalar::Copy and DataType::Copy methods so this isn't necessary, but rather than try to hack that into this PR, I left it for follow-on work
  • A few kernels have been refactored to always write into preallocated memory
    (IsIn, IndexIn, IsNull, IsValid)
  • Some internal APIs were best refactored to use ArraySpan, such as
    ArrayBuilder::AppendArraySlice and parts of arrow/util/int_util.h
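
For orientation, here is a rough sketch of the shape of a non-owning span (illustrative only -- the field names and layout below are simplified stand-ins, not the actual arrow::ArraySpan declaration):

#include <cstdint>

namespace sketch {

class DataType;  // stand-in for arrow::DataType

// A non-owning view of one buffer: a raw pointer plus a size, no refcount.
struct BufferSpan {
  const uint8_t* data = nullptr;
  int64_t size = 0;
};

// A non-owning view of one array. Copying this is a few word-sized copies,
// versus bumping several atomic reference counts when copying an ArrayData
// held through shared_ptr.
struct ArraySpan {
  const DataType* type = nullptr;
  int64_t length = 0;
  int64_t null_count = 0;
  int64_t offset = 0;
  BufferSpan buffers[3];  // e.g. validity, offsets/values, data
};

}  // namespace sketch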

In the interest of getting this merged sooner rather than later, rather than trying to make everything perfect here, let's fix any glaring or serious issues that you see and leave many improvements for follow-up patches; otherwise, any work in the scalar kernels codebase will be blocked.

@wesm wesm requested review from lidavidm, pitrou and westonpace June 10, 2022 16:54

wesm commented Jun 10, 2022

cc @save-buffer

@save-buffer save-buffer (Contributor) left a comment

Overall this seems fine, with a couple of nits here and there. My biggest concerns have to do with unnecessary heap allocations.

Contributor

I am a little sus about this std::vector. It may be OK for now, but eventually I'd like to avoid small heap allocations. When we add a bump allocator for this kind of stuff within Acero, I'd like to switch to that.

If we want to support that now, we can just make this ExecSpan take a pointer to ExecValue and the number of ExecValues, so that ExecSpan doesn't have to touch the heap at all either.
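
Roughly something like this (a sketch only; the member names are made up, not the actual ExecSpan/ExecValue definitions):

#include <cstdint>

namespace sketch {

struct ExecValue {};  // stand-in for the real ExecValue

// The span borrows caller-owned storage (stack array, bump allocator, ...)
// instead of holding a std::vector, so constructing or copying it never
// touches the heap.
struct ExecSpan {
  const ExecValue* values = nullptr;
  int num_values = 0;
  int64_t length = 0;
};

// Usage:
//   ExecValue stack_values[2];
//   ExecSpan span{stack_values, 2, /*length=*/1024};

}  // namespace sketch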

Member Author

I agree -- potentially we could use SmallVector when there is a small number of fields and switch to std::vector only for "wide" spans. Either way, it's something we can optimize; at a minimum we can limit the exposure of this std::vector to users of ExecSpan.

@lidavidm lidavidm (Member) left a comment

I don't see major issues, so take these comments more as potential follow-ups.

@wesm wesm (Member Author) left a comment

I left some comments to help with reviewing and also notes to myself. I'll respond to the other comments and work on getting the CI to pass

Member Author

This makes a shared_ptr copy that is often not needed

Member Author

Open a Jira for follow-up work.

Member Author

Another place where a shared_ptr is being copied needlessly.

Member Author

Aside: I noticed a couple of places where very expensive std::vector<Datum> copies were happening.

Member Author

I added this because getting the byte width from a fixed-width type is quite tedious at the moment (there are many places where we do the cast). I think having the bit_width()/byte_width() methods available on non-fixed-width types (returning -1) is pretty harmless, but if others don't like it we can remove these and add a helper function or something.
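
For comparison, the helper-function alternative would look roughly like this (the helper name and the stand-in classes are made up; only the idea of a bit_width() accessor on fixed-width types is assumed from the existing interface):

#include <cstdint>

namespace sketch {

struct DataType {
  virtual ~DataType() = default;
};

struct FixedWidthType : DataType {
  virtual int bit_width() const = 0;
};

// Hides the cast that callers currently repeat: returns the byte width of a
// fixed-width type, or -1 for types without a fixed width.
inline int byte_width_or_minus_one(const DataType& type) {
  if (const auto* fixed = dynamic_cast<const FixedWidthType*>(&type)) {
    return fixed->bit_width() / 8;
  }
  return -1;
}

}  // namespace sketch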

Member

This method doesn't need to be virtual, does it?

Member Author

In general, we should watch out for classes that take copies of shared_ptr<T> without a good reason to; this was a good example

Member Author

There are plenty of other places where we should nix the use of Datum because of its bloat. We might consider promoting ExecValue to the Datum level (as a ValueSpan or something).

@wesm wesm force-pushed the ARROW-16756-lightweight-exec-batch branch from 8d11ee7 to 20dc538 Compare June 10, 2022 22:16

wesm commented Jun 11, 2022

I'm working on some incidental cleanups that I ran into while fixing the Python UDF code paths. I'll push the closer-to-CI-green branch with the changes I cited in review comments (e.g. removing VisitArrayDataInline) sometime tomorrow, so we can hopefully merge this in the next couple of days.


wesm commented Jun 12, 2022

I went ahead and tackled ARROW-16819, which involved cleaning up the various places where nullary / zero-argument functions were being called hackishly to get a batch length in. This also added a length argument to pyarrow.compute.call_function, which allows e.g. pc.call_function('random', [], length=100) rather than passing the length in through RandomOptions, which was a hack.


wesm commented Jun 12, 2022

I hope that we can merge this soon after getting a green build and tackle further improvements in follow-up patches, because this PR is getting a little unwieldy (every time I touch something it turns into a game of whack-a-mole).


wesm commented Jun 12, 2022

I found it curious that libarrow.so in release mode with gcc10 got about 4-5% bigger. I did some investigation of the binaries (and added a tool we can refine in cpp/tools/binary_symbol_explore.py) and my conclusion is that the absence of shared_ptr interactions is leading gcc to do more aggressive inlining of various functions, which is resulting in larger code size (but probably also better performance). There's probably a good amount we can do to further reduce the amount of generated code, but it will have to come in follow up work.

FTR, here are the largest symbols in libarrow.so right now (from this branch):

https://gist.github.com/wesm/51c97a47336da682dedd918bb7b90015

If we want to reduce binary sizes, the comparison kernels and the temporal kernels are the place to start.

wesm added 12 commits June 12, 2022 21:04

  • method of scalar kernel evaluation
  • Checkpoint, getting closer to compilation
  • libarrow builds again
  • Get everything compiling again, compute internals tests passing
  • Get Bitmap test cases passing again
  • Don't try filling validity bitmap if none was allocated
  • Fix bit block visitors and a few arithmetic kernels
  • Refactor int_util.h to use ArraySpan
  • Some more tests passing
  • Fix some more unit tests, compilation
  • Another fix
  • Fix some more tests
  • Fix some more bugs
  • Fixed some more tests
  • All scalar_if_else.cc tests passing again
  • Work partway through scalar_nested.cc
  • Down to only a few failing scalar tests
  • scalar kernels tests passing again
  • fix scalar executor, tests all passing now

pitrou commented Jun 14, 2022

For the record, did the continuous benchmark results give any improvement (re: code size increase)?


wesm commented Jun 14, 2022

Thanks for the reviews -- I'll make sure that these things get taken care of.

Several items:

  • What's the latest and greatest way to see a full comparison table of microbenchmarks? I wouldn't expect this PR to have too much immediate impact
  • Regarding the Copy methods -- I have read through some of Scott Meyers (https://www.oreilly.com/library/view/effective-modern-c/9781491908419/ch04.html) and I can't find any evidence that adding std::enable_shared_from_this necessarily makes things slower or adds overhead to instantiations. What do you think about just calling this GetSharedPtr and letting the std::enable_shared_from_this slide (see the sketch below)? The alternative is that we add Copy implementations that actually copy, plus a virtual ExtensionType::CopyImpl that must be implemented.
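
For illustration, the GetSharedPtr idea amounts to roughly this (simplified; the class shown is a stand-in, not the actual arrow::DataType):

#include <memory>

namespace sketch {

class DataType : public std::enable_shared_from_this<DataType> {
 public:
  virtual ~DataType() = default;

  // Recovers an owning pointer from a raw DataType*. This only works for
  // instances that are already managed by some shared_ptr, which is the case
  // for types created through the usual factory functions.
  std::shared_ptr<DataType> GetSharedPtr() { return shared_from_this(); }
};

}  // namespace sketch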

@save-buffer
Contributor

I think the best solution for performance is to remove shared_ptr to data types and have a canonical owner. The current shared_ptrs have a reference count of at least 1 at all times anyway since they're stored and initialized in that vector.


pitrou commented Jun 15, 2022

  • What's the latest and greatest way to see a full comparison table of microbenchmarks?

I think it's @ursabot please benchmark. It might not work on a merged-then-deleted PR branch though.

What do you think about just calling this GetSharedPtr and letting the std::enable_shared_from_this slide?

Yes, that sounds like a reasonable solution.


pitrou commented Jun 15, 2022

I think the best solution for performance is to remove shared_ptr to data types and have a canonical owner

Who would be the canonical owner of a datatype?


wesm commented Jun 15, 2022

I think the best solution for performance is to remove shared_ptr to data types and have a canonical owner.

I'm thinking hard about how to actually do this. With the way the library is designed, if DataType instances were stored in some central place, absent some form of reference counting mechanism, it would be basically impossible to know if it is ever safe to garbage collect a data type instance.

So one proposal would be something like:

  • Central type metadata registry where every data type with non-trivial state (something more than a type id, like child fields) has an associated atomic "reference count"
  • Any non-trivial DataType (i.e. one whose definition is more complex than just a Type::type) has a back-reference to the central metadata registry, and its deallocation decrements the refcount there. Trivial types (e.g. Int64Type) do nothing
  • The type metadata registry performs a garbage collection pass every so often

I can visualize how to implement this but it would be a truly chaotic refactor

So the new structure of DataType would be something like

class DataType {

 private:
  Type::type id_;
  TypeMetadata* metadata_ = NULLPTR;
};

struct TypeMetadata {
  std::atomic<int> ref_count;
};

struct NestedTypeMetadata : public TypeMetadata {
  std::vector<Field> children_;
};

etc.


wesm commented Jun 15, 2022

I think it's @ursabot please benchmark. It might not work on a merged-then-deleted PR branch though.

OK, I'll see if I can do just a local run on my desktop using archery and link the results here.


pitrou commented Jun 15, 2022

An atomic reference count is just how shared_ptr works under the hood, so I don't really understand how your proposal would make a difference (except perhaps for trivial datatypes?).


wesm commented Jun 15, 2022

@pitrou indeed — it would cause the reference-counting cost to only be paid for non-trivial data types: see the example structure I added to my PR comment


pitrou commented Jun 15, 2022

I think we should first run benchmarks after the ExecSpan changes have been propagated to the compute engine, to see if it's still worth doing something on the DataType front.


wesm commented Jun 15, 2022

Yes, I agree -- no need to go overboard right now. But in principle what I described is a straightforward design that would enable us to go from shared_ptr<DataType> to DataType quite easily (albeit with a lot of mechanical refactoring), so we have the option to do it if it can be demonstrated that it's worth it performance-wise. I think that once a new expression evaluator based on ArraySpan has been implemented, it won't be as serious of an issue.

@lidavidm
Member

If we mostly want to optimize the trivial data types, then why not just resort to shared_ptr (or even, unique_ptr) for the nested type metadata instead of a global registry/garbage collection? And perhaps some non-trivial fields could be inlined -- say, the time unit or the fixed-size binary length, the same way it's being done in arrow-c -- to avoid making certain common nested types require the shared_ptr, at the cost of structure size (though you could play games with reusing a single field for storage of different types' metadata).
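
Concretely, the inlining idea could look something like this (purely illustrative; the field names are made up):

#include <cstdint>
#include <memory>

namespace sketch {

struct NestedMetadata {
  // e.g. child fields, field names, etc.
};

struct DataType {
  int8_t type_id = 0;
  // Small per-type parameters share one inline field whose meaning depends
  // on type_id (time unit, fixed-size binary width, ...), so common
  // parameterized types need no extra allocation.
  int32_t inline_param = 0;
  // Only genuinely nested types allocate this; it stays null otherwise.
  std::unique_ptr<NestedMetadata> nested;
};

}  // namespace sketch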


save-buffer commented Jun 15, 2022

A couple of thoughts:

  • I haven't thought about garbage collecting types before - is it a big problem? I guess I'm not sure how often a new data type instance gets created, we may not even have to worry about it. If we were worried about having lots of duplicates (e.g. someone uses decimal(12, 2) a lot instead of reusing the pointer), we could add some sort of interning system. Maybe this gets difficult with struct types though.
  • If we do want to garbage collect, then presumably if we went with a type registry for the basic types, we'd replace every instance of shared_ptr<DataType> with DataType*. If so, then a user of a complicated type could allocate it themselves, pass it around, and destroy it when they're done, something like:

    std::unique_ptr<DataType> type(new MyCustomType);
    arrow::DoTheThing(type.get());


wesm commented Jun 15, 2022

I haven't thought about garbage collecting types before - is it a big problem? I guess I'm not sure how often a new data type instance gets created.

Imagine a user reading a Parquet file. Right now non-trivial types get automagically created and destroyed through the creation and deallocation of Schema objects (note that APIs like arrow::int64() return static instances so no new types are created). End-users of the Arrow C++ libraries don't have to think about this at all right now, so to have explicit owners of types would mean a significant amount of API changes and new things to think about when using the project.


wesm commented Jun 15, 2022

then why not just resort to shared_ptr (or even, unique_ptr) for the nested type metadata instead of a global registry/garbage collection

That's possible -- do you know what the cost of a null shared_ptr is? I would hope that having a default-initialized shared_ptr on a class is nearly costless.


pitrou commented Jun 15, 2022

Garbage collecting types is probably essential for bindings such as Python and R.


save-buffer commented Jun 15, 2022

I would hope that having a default-initialized shared_ptr on a class is nearly costless

Yep I think it's just two pointers (16 bytes) that would be set to 0.
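
A quick self-contained check of that (the exact size is implementation-defined, but it is two pointer-sized fields on the mainstream standard libraries):

#include <cstdio>
#include <memory>

int main() {
  // A default-constructed shared_ptr allocates nothing: it is just an object
  // pointer plus a control-block pointer, both null.
  std::shared_ptr<int> empty;
  std::printf("sizeof(std::shared_ptr<int>) = %zu\n",
              sizeof(empty));  // typically 16 on 64-bit platforms
  return 0;
}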

Garbage collecting types is probably essential for bindings such as Python and R.

Ok then that does throw a wrench in the data type registry idea.


pitrou commented Jun 15, 2022

though you could play games with reusing a single field for storage of different types' metadata

Or simply use a variant for that. Though, of course, it might come with its own overhead due to compilers being imperfect :-)

@lidavidm
Member

though you could play games with reusing a single field for storage of different types' metadata

Or simply use a variant for that. Though, of course, it might come with its own overhead due to compilers being imperfect :-)

We can do better than variant since the type id is the discriminant, which I don't think you can accomplish with std::variant


pitrou commented Jun 15, 2022

True, but that's more code to redevelop as well :-) And it might not be more compact either, due to internal structure padding.


wesm commented Jun 15, 2022

Here are the raw benchmark results (contender is this patch merged to master, versus the commit before it) on my Intel i9 (has AVX-512) with CPU frequency scaling disabled. It seems to have yielded some speedups but also some regressions:

https://gist.githubusercontent.com/wesm/e103586caa0b4144350b2a754a7c96a4/raw/374383d9e1af5186e072a327277d40d427bd5430/benchmark_diff_execspan.txt

Some of the regressions look spurious, but others we ought to profile to understand better; for example, Unique seems to have been affected and I don't know why (some here are clearly spurious, e.g. BitmapWriter was not touched in this PR).

benchmark    baseline    contender    change (%)
UniqueUInt8/1    1.732 GiB/sec    1.255 GiB/sec    -27.544
ArrayArrayKernel<KleeneAnd>/32768/0    11.853 GiB/sec    8.581 GiB/sec    -27.600
BitmapEqualsWithOffset/8192    6.547 GiB/sec    4.735 GiB/sec    -27.671
ExecuteScalarExpressionOverhead/simple_expression/rows_per_batch:100000/real_time/threads:36    227450.971042    164478.721836    -27.686
UniqueInt64/5    11.135 GiB/sec    7.692 GiB/sec    -30.919
TakeInt64RandomIndicesWithNulls/1048576/1    2.787G items/sec    1.917G items/sec    -31.231
ArrayScalarKernel<SubtractChecked, Int32Type>/1048576/0    8.185 GiB/sec    5.553 GiB/sec    -32.153
ArrayScalarKernel<SubtractChecked, UInt32Type>/1048576/0    8.199 GiB/sec    5.550 GiB/sec    -32.312
ArrayArrayKernel<KleeneAnd>/1048576/0    6.401 GiB/sec    4.306 GiB/sec    -32.731
UniqueUInt8/0    2.121 GiB/sec    1.421 GiB/sec    -33.013
ArrayScalarKernel<AddChecked, UInt16Type>/1048576/0    4.203 GiB/sec    2.813 GiB/sec    -33.064
ArrayScalarKernel<SubtractChecked, UInt16Type>/1048576/0    4.204 GiB/sec    2.811 GiB/sec    -33.131
ArrayRangeEqualsFloat32/32768/100    1.449G items/sec    958.666M items/sec    -33.837
BitmapWriter/8192    164.977 MiB/sec    100.866 MiB/sec    -38.861
ArrayRangeEqualsFloat32/32768/10000    1.905G items/sec    1.088G items/sec    -42.904
BM_PlainDecodingSpacedFloat/32768/10000    82.335 GiB/sec    46.504 GiB/sec    -43.518
BenchmarkTemporalBinary<MonthDayNanoBetween, time32_type>/1048576/100    118.598 MiB/sec    44.037 MiB/sec    -62.868
UniqueUInt8/6    24.516 GiB/sec    1.994 GiB/sec    -91.865
UniqueInt64/13    195.807 GiB/sec    15.848 GiB/sec    -91.906
UniqueInt64/6    196.137 GiB/sec    15.835 GiB/sec    -91.927


wesm commented Jun 15, 2022

Since the only change to vector_hash.cc is changing VisitArrayDataInline to VisitArraySpanInline it makes me think that something got messed up in the implementation chain of ArraySpanInlineVisitor<T, enable_if_has_c_type<T>>. I opened

https://issues.apache.org/jira/browse/ARROW-16837

to investigate

@westonpace
Member

Who would be the canonical owner of a datatype?

Not a detailed thought but could the canonical owner be arrow::Schema? Outside of Acero, every record batch has a schema (might have to do some work so that collections of record batches, like arrow::Table, share a single schema instance). Within Acero the canonical owner could be the ExecPlan (each node has an output schema and any intermediate types needed could be managed by the nodes). That being said...

I think we should first run benchmarks after the ExecSpan changes have been propagated to the compute engine, to see if it's still worth doing something on the DataType front.

+1

@westonpace
Member

I spent some time playing around with the ExecuteScalarExpressionOverhead/complex_expression benchmark today to see if this PR made much change there. I didn't see too much change (which might not be surprising, I don't really know if any was expected).

We still don't run as efficiently with batches of 10k rows:

  • 10k row batches => 6G rows/s
  • 100k row batches => 7G rows/s

I'm not as convinced as before that this has anything to do with thread contention. It may not have been an issue in the past so I may have been too eager to jump on that particular bandwagon.

  • When going from 1 thread to 16 threads there is a significant amount of "front-end latency" that gets introduced. vTune can find no attributable cause to this (e.g. not branch resteers, icache misses, or anything else there is a category for).
  • However, the above happens at both 10k rows per batch and 100k rows per batch and the extra latency appears to be in the execution of the critical section itself, and not just the overhead parts.

There are, however, just plain more instructions. This is expected, to some degree, when going from 100k-row batches to 10k-row batches. I don't really have a solid way of knowing how much additional overhead is reasonable.

  • With 100k batches we spend 96% of the time in the critical section
  • With 10k batches we spend 75% of the time in the critical section

The overhead, in terms of instructions, is very widely dispersed. If I had to pick any particular theme, I would say allocation/deallocation. Here are the top 15 non-critical-section calls along with my crude attempt at categorization.

Call (% of instructions, category guess):

  • __GI___libc_malloc: 1.1% (Allocation)
  • std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release: 1.0% (Deallocation)
  • arrow::compute::ExecuteScalarExpression: 1.0% (self-time / branchy?)
  • _GI: 0.9% (Unknown)
  • arrow::compute::detail::(anonymous namespace)::KernelExecutorImpl<arrow::compute::ScalarKernel>::PrepareOutput: 0.6% (self-time / allocation?)
  • std::vector<arrow::ValueDescr, std::allocator<arrow::ValueDescr>>::~vector: 0.6% (Deallocation)
  • tcache_get: 0.5% (Allocation)
  • arrow::CPUDevice::Instance: 0.5% (Allocation)
  • arrow::util::detail::VariantImpl<arrow::util::Variant<arrow::Datum::Empty, std::shared_ptr<arrow::Scalar>, std::shared_ptr<arrow::ArrayData>, std::shared_ptr<arrow::ChunkedArray>, std::shared_ptr<arrow::RecordBatch>, std::shared_ptr<arrow::Table>>, std::shared_ptr<arrow::Scalar>, std::shared_ptr<arrow::ArrayData>, std::shared_ptr<arrow::ChunkedArray>, std::shared_ptr<arrow::RecordBatch>, std::shared_ptr<arrow::Table>>::copy_to<arrow::util::Variant<arrow::Datum::Empty, std::shared_ptr<arrow::Scalar>, std::shared_ptr<arrow::ArrayData>, std::shared_ptr<arrow::ChunkedArray>, std::shared_ptr<arrow::RecordBatch>, std::shared_ptr<arrow::Table>>>: 0.5% (Allocation?)
  • arrow::BaseMemoryPoolImpl<arrow::memory_pool::internal::JemallocAllocator>::Allocate: 0.5% (Allocation)
  • arrow::Datum::descr: 0.4% (shared_ptr copy)
  • arrow::compute::detail::(anonymous namespace)::ScalarExecutor::~ScalarExecutor: 0.4% (Deallocation? there are some vectors in here that will get deallocated)
  • arrow::compute::detail::(anonymous namespace)::ScalarExecutor::Execute: 0.4% (Unknown)
  • arrow::CPUMemoryManager::Make: 0.4% (Allocation)
  • arrow::compute::detail::(anonymous namespace)::ScalarExecutor::ExecuteSpans: 0.4% (Unknown)

@save-buffer
Contributor

Yes, that was my experience as well. When looking with Apple's TimeProfiler I saw a ton of 0.1% to 1% stack traces. The execution really is spread out such that it's hard to pinpoint an exact bottleneck.
There is a handy tool in TimeProfiler that's called "invert call tree", maybe VTune has something similar. You can then search the inverted call tree for "free" and "Alloc" and such to see what percent of time is spent there. Here's what I saw:

57.00 ms    7.0%	57.00 ms	 	tiny_free_no_lock
51.00 ms    6.3%	51.00 ms	 	tiny_malloc_from_free_list
45.00 ms    5.5%	45.00 ms	 	tiny_malloc_should_clear
40.00 ms    4.9%	40.00 ms	 	tiny_free_list_add_ptr
31.00 ms    3.8%	31.00 ms	 	tiny_size
23.00 ms    2.8%	23.00 ms	 	free_tiny
20.00 ms    2.4%	20.00 ms	 	free
16.00 ms    1.9%	16.00 ms	 	tiny_free_list_remove_ptr
16.00 ms    1.9%	16.00 ms	 	_malloc_zone_malloc
14.00 ms    1.7%	14.00 ms	 	szone_free_definite_size

This was before this change, but I think it mostly still applies.


wesm commented Jun 18, 2022

I wouldn’t expect this PR to make much difference for the moment. We need to write a new implementation of ExecuteScalarExpression and also look at options for migrating the ExecNodes to use ExecSpan (which will require some rethinking of memory management)
