ARROW-15583: [C++] The Substrait consumer could potentially use a massive amount of RAM if the producer uses large anchors #12852
Conversation
Why the hardcoded limit? If it's important, can it be made constexpr?
Refactored that code; I'm not currently using the SetMapElement() function. Is this approach better?
Does this need to be exposed in the public header? The pImpl at least kept these details out of the header.
Do we have a guideline for hiding? I find it hard to know when details are "too messy" and deserve a pimpl (other than mutex, which we have lint checks for).
Given that users of this header are going to have to deal with Id (ExtensionIdRegistry exposes the idea), IdHashEq could be considered useful.
I don't think we have a real guideline. IdHashEq could indeed be useful, but unordered_map is a "heavy" include to have (that said we lost that battle a long time ago). I don't think we have to hide this, though, just wondering if it's useful.
This is no longer applicable when a map is used. The reserve() call is actually highly counterproductive, since a map with the single key 100000 will reserve room for 100001 key-value pairs (this is, in fact, the core problem that needs solving).
Once these lines are removed, I'd argue that the whole function reduces to nothing and should just be inlined.
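For illustration, here is a minimal sketch (not the code under review; Id is a hypothetical stand-in) of why sizing the container by the anchor value is exactly the problem this PR is trying to fix:

#include <cstdint>
#include <unordered_map>

struct Id { const char* uri; const char* name; };  // hypothetical stand-in for the real Id

void AddUri(std::unordered_map<uint32_t, Id>& uris, uint32_t anchor, Id id) {
  // Counterproductive: a single anchor of 100000 would ask for ~100001 buckets.
  // uris.reserve(anchor + 1);
  // The map only needs capacity proportional to the number of entries, so no
  // reserve() call (or one based on the expected entry count) is sufficient.
  uris[anchor] = id;
}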
I'm not using the function anymore and am doing it directly in GetExtensionSetFromPlan(). Is this approach better?
GitHub won't let me add comments to code this far away from a change, but these vectors a few lines further down (as well as the equivalent fields of ExtensionSet) should also be replaced with maps:
std::vector<Id> type_ids, function_ids;
std::vector<bool> type_is_variation;
When you change this and everything affected by it, you'll find that ExtensionSet::EncodeType and ExtensionSet::EncodeFunction won't work anymore as written. Functionally, they include code to generate a key that wasn't used before, which is easy with a vector, but for a map would require searching for an unused key by iterating over all possible keys until you find one. It might therefore be better to use an ordered map, in which case you can use something like map::rbegin()->first + 1 for the typical case (though you'd still need to implement the empty map and overflow edge cases; in the former case you should return 1, and in the latter case you'd need to exhaustively search for a free key anyway). Don't use anchor 0 (the current implementation probably does); that anchor value is reserved for undefined references in some contexts. If you really want to use an unordered_map instead of a map, you can also keep track of the latest key that was added to the map and use it as a start position for looking for free anchors; that'd still be pretty efficient in the typical case, but requires a bit more bookkeeping.
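A minimal sketch of the anchor-generation scheme described above, assuming an ordered std::map and a hypothetical stand-in Id (illustrative only, not Arrow's actual code):

#include <cstdint>
#include <limits>
#include <map>

struct Id { const char* uri; const char* name; };  // hypothetical stand-in

uint32_t NextFreeAnchor(const std::map<uint32_t, Id>& anchors) {
  if (anchors.empty()) return 1;  // anchor 0 is reserved for undefined references
  uint32_t largest = anchors.rbegin()->first;
  if (largest != std::numeric_limits<uint32_t>::max()) return largest + 1;
  // Overflow edge case: exhaustively search for an unused nonzero key.
  for (uint32_t candidate = 1; candidate != 0; ++candidate) {
    if (anchors.count(candidate) == 0) return candidate;
  }
  return 0;  // every anchor is taken (practically unreachable)
}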
Furthermore, as I pointed out in my comment to the JIRA, using a single anchor map for both types and type variations, and then using type_is_variation to distinguish between types and type variation, was a misinterpretation of the Substrait specification as far as I'm concerned: it's perfectly valid (or at least, unambiguous) for a plan to define a type with anchor 1 as well as a completely unrelated type variation with anchor 1. To be clear, a correct, complete implementation should look something like this:
std::[unordered_]map<uint32_t, Id> type_ids, type_variation_ids, function_ids;
That's a lot more involved though, as the assumption that a user-defined type or type variation is uniquely identified by a single number is pretty deeply ingrained in the round-tripping code. However, since type variations are not (correctly) implemented anyway, it's probably better for now to just remove the broken and incomplete code related to type variations, so it'd just become
std::[unordered_]map<uint32_t, Id> type_ids, function_ids;
On the consumer end, a nonzero type variation reference already yields an error (this is why anchor 0 should not be used, by the way). The type variation anchor definitions in the kExtensionTypeVariation case in the loop below could just be silently ignored (which is what I would personally do), but it'd be more in line with the rest of the implementation to emit an error when encountering those too, as the implementation aims to error out on anything that wouldn't round-trip perfectly.
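A hedged sketch of the two options on the consumer side, using a hypothetical enum rather than the real Substrait SimpleExtensionDeclaration cases:

#include "arrow/status.h"

// Hypothetical enum standing in for the declaration kinds in the real loop.
enum class DeclKind { kExtensionType, kExtensionTypeVariation, kExtensionFunction };

arrow::Status HandleDeclaration(DeclKind kind) {
  switch (kind) {
    case DeclKind::kExtensionTypeVariation:
      // Option 1: silently ignore variation anchors (they don't affect execution):
      //   return arrow::Status::OK();
      // Option 2: reject them, consistent with erroring on anything that
      // wouldn't round-trip perfectly:
      return arrow::Status::NotImplemented("type variations are not supported");
    default:
      return arrow::Status::OK();
  }
}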
I don't think an ordered_map is needed because we can come up with unique ids by just looking at the size of the map (if 0 is not valid then we can add 1).
I agree that the vectors won't work and I also agree that removing type variations entirely is maybe the best idea until it's clear what they actually represent.
I don't think an ordered_map is needed because we can come up with unique ids by just looking at the size of the map (if 0 is not valid then we can add 1).
That doesn't work, because there's no requirement for the indices to be sequential for incoming plans. For example, if the indices 1 and 3 are defined, and a new type is added on the production side, it would erroneously try to reuse map_size + 1 = 3 by this logic. The whole reason why we want to use maps at all is because they don't have to be sequential, otherwise it'd be much cheaper to just use vectors. You could use it as a cheap starting point when looking for unused anchors though; if they are sequential it would immediately yield a valid index.
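A tiny sketch of the collision just described (Id is again a hypothetical stand-in):

#include <cassert>
#include <cstdint>
#include <unordered_map>

struct Id { const char* uri; const char* name; };  // hypothetical stand-in

void SizePlusOneIsNotFree() {
  std::unordered_map<uint32_t, Id> anchors;
  anchors[1] = {"urn:example", "a"};  // an incoming plan defined anchors 1 and 3
  anchors[3] = {"urn:example", "b"};
  uint32_t guess = static_cast<uint32_t>(anchors.size()) + 1;  // size is 2, so guess = 3
  assert(anchors.count(guess) == 1);  // 3 is already taken: the guess is not free
}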
until it's clear what they actually represent.
They're like a poor-man's version of polymorphism, and a way to attach logical type information that doesn't impact the physical representation. For example, one might define a custom type polygon, with type variations triangle and quad, written as polygon[triangle] and polygon[quad] respectively. They'd all have the same type index and the same physical storage (probably some list of coordinate structs), but a function can then be written to only accept polygon[triangle], and a function that accepts polygons may or may not also accept polygon[quad]s and polygon[triangle]s (this depends on how the variation is defined in the YAML files). Note that variations can also be defined for Substrait's built-in types. Ultimately, a Substrait type consists of its physical base type (a simple type, compound type, or user-defined type), its nullability, a variation index (using 0 for "no variation"), and (for compound types and hopefully at some point also user-defined types) a set of template parameters.
Unless we really want to round-trip Substrait plans or validate them completely (having spent months on just a validator that still can't do everything, I don't recommend it), we could just ignore variations entirely. Even if someone else defined a variation in an extension we don't know about, it would have no bearing on how the plan is executed, since we only need to know what the physical representation of a type is. It's more of a producer or optimizer thing, to restrict which functions are available.
That doesn't work, because there's no requirement for the indices to be sequential for incoming plans
If it's an incoming plan we don't need to invent anchors. This "come up with a new anchor to use" problem only occurs when going from Arrow -> Substrait. I don't think it makes sense to share an ExtensionSet between production and consumption scenarios.
Thanks for the explanation of variations. What you are describing makes sense. It would allow functions to restrict themselves to a certain "class" of a type. I'd be curious to see a real world use case for such a thing but otherwise I agree that consumers can safely ignore them (provided they aren't expected to be validating consumers)
I'm inclined to agree, because personally I still don't understand why Arrow should have producer code at all. The best argument I've heard yet is to communicate plans across language boundaries within Arrow, but it'd be so much easier to just make protobuf messages that actually correspond to the execution engine data structures and type system for that. Heck, throw out protobuf in favor of something that links sanely. The second best is roundtrip testing, but is all the code needed to not only produce but also bit-by-bit reproduce the same plan really worth a factor 2 to 3 in code size?
But anyway, what you're describing as a non-goal for those functions is exactly how they are currently used. In practice the anchors will all already exist, since the producer code is AFAIK incomplete and can only reproduce an incoming plan, so the new-anchor-generation code would never trigger under normal circumstances. IMO that's no excuse for that function to just silently overwrite the same anchor over and over again in that case, though. It should at least detect it and throw an error.
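A minimal sketch of the suggested check (RegisterAnchor and Id are hypothetical helpers, not Arrow functions):

#include <cstdint>
#include <unordered_map>
#include "arrow/status.h"

struct Id { const char* uri; const char* name; };  // hypothetical stand-in

arrow::Status RegisterAnchor(std::unordered_map<uint32_t, Id>& anchors,
                             uint32_t anchor, Id id) {
  auto result = anchors.emplace(anchor, id);
  if (!result.second) {
    // Detect the collision instead of silently overwriting the earlier entry.
    return arrow::Status::Invalid("anchor ", anchor, " is already defined");
  }
  return arrow::Status::OK();
}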
westonpace
left a comment
Looks pretty close. Just a few minor suggestions.
- return types_.at(anchor);
+ return types_[anchor];
We don't handle exceptions, so there is generally no benefit to using at() over [].
I haven't looked at the PR again as of yet, but this struck me as an odd take in my e-mail feed. Assuming types_ is now a map of some kind, keep in mind that if the key does not exist, an uninitialized value is silently added to the map and then returned by mutable reference, to make x[y] = z work at the expense of sanity in all other use cases. Both [] and at() are bad if there's a real possibility of this occurring (I didn't check), but I'd much rather get an unhandled exception than pants-on-head crazy behavior. For vectors there'd be a minor performance benefit to [] over at(), but that also doesn't apply to maps AFAIK.
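A minimal illustration of the difference described above (plain standard-library code, not Arrow's):

#include <cstdint>
#include <string>
#include <unordered_map>

void AtVersusBracket() {
  std::unordered_map<uint32_t, std::string> anchors = {{1, "urn:example"}};
  // operator[] on a missing key silently inserts a default-constructed value:
  std::string a = anchors[42];    // anchors now has size 2 and a is ""
  // at() on a missing key throws std::out_of_range and inserts nothing:
  std::string b = anchors.at(7);  // throws
}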
Ah, fair enough. I think you're probably right on the behavior for maps. Thanks for the catch. Let's keep this as at().
westonpace
left a comment
The code looks good at this point. I looked through and nothing caught my eye. Unfortunately, I think this test error is legitimate:
In file included from /usr/local/include/arrow/util/hashing.h:49,
from /usr/local/include/arrow/engine/substrait/extension_set.h:31,
from /usr/local/include/arrow/engine/substrait/api.h:22,
from compute-exec.cpp:337:
/usr/local/include/arrow/vendored/xxhash.h:18:10: fatal error: arrow/vendored/xxhash/xxhash.h: No such file or directory
18 | #include "arrow/vendored/xxhash/xxhash.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It turns out that arrow/util/hashing.h is a private header (we should probably rename it hashing_internal.h but not as part of this PR).
Can you either move to a pimpl pattern to hide IdHashEq entirely or move the definition of IdHashEq::operator() to the .cc file? That should allow you to move the import back into the .cc file.
FWIW, see https://issues.apache.org/jira/browse/ARROW-16609 (we may have to make it public?)
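A sketch of the suggested split, with hypothetical Id members and a placeholder hash; the real declarations in extension_set.h may differ:

#include <cstddef>
#include <functional>
#include <string_view>

// Hypothetical stand-in for the real Id.
struct Id {
  std::string_view uri;
  std::string_view name;
};

// In extension_set.h: declarations only, so the private arrow/util/hashing.h
// include is no longer needed in the public header.
struct IdHashEq {
  size_t operator()(Id id) const;
  bool operator()(Id l, Id r) const;
};

// In extension_set.cc: the definitions (which may use arrow/util/hashing.h).
size_t IdHashEq::operator()(Id id) const {
  std::hash<std::string_view> h;
  return h(id.uri) ^ (h(id.name) << 1);  // simple combine, for the sketch only
}
bool IdHashEq::operator()(Id l, Id r) const {
  return l.uri == r.uri && l.name == r.name;
}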
fix: removed SetMapElement
feat: used [] operator instead of at() method to fetch from ext_set.uris()
docs: wording correction
feat: using util::string_view instead of auto
feat: using DCHECK instead of only returning Invalid status
5fb8068 to d0d5ccb
westonpace
left a comment
Thanks for your help in getting these changes through.
Benchmark runs are scheduled for baseline = 5994fd8 and contender = 8bbc695. 8bbc695 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This PR modifies the ExtensionSet in the C++ consumer to use an unordered_map instead of a vector to store the uri anchors as the lookup table. This also modifies the usage of the impl struct, as the included functions are now defined directly with the ExtensionSet implementation.
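For illustration, the shape of the change reads roughly like this (member names are illustrative, not the exact Arrow fields):

#include <cstdint>
#include <string_view>
#include <unordered_map>

class ExtensionSet {
  // Before: std::vector<std::string_view> uris_;  // indexed by anchor value
  // After: keyed by anchor, so a large anchor no longer forces a huge allocation.
  std::unordered_map<uint32_t, std::string_view> uris_;
};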