ARROW-15583: [C++] The Substrait consumer could potentially use a massive amount of RAM if the producer uses large anchors #12852
Conversation
Why the hardcoded limit? If it's important, can it be made constexpr?
Refactored that code; I'm not currently using the SetMapElement() function. Is this approach better?
Does this need to be exposed in the public header? The pImpl at least kept these details out of the header.
Do we have a guideline for hiding? I find it hard to know when details are "too messy" and deserve a pimpl (other than mutex, which we have lint checks for).
Given that users of this header are going to have to deal with Id (ExtensionIdRegistry exposes the idea), IdHashEq could be considered useful.
I don't think we have a real guideline. IdHashEq could indeed be useful, but unordered_map is a "heavy" include to have (that said we lost that battle a long time ago). I don't think we have to hide this, though, just wondering if it's useful.
This is no longer applicable when a map is used. The reserve() call is actually highly counterproductive, since a map with the single key 100000 will reserve room for 100001 key-value pairs (this is, in fact, the core problem that needs solving).
Once these lines are removed, I'd argue that the whole function reduces to nothing and should just be inlined.
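For illustration, here is a minimal sketch (not the code under review; Id is a hypothetical stand-in) of why sizing the container by the anchor value is exactly the problem this PR is trying to fix:

#include <cstdint>
#include <unordered_map>

struct Id { const char* uri; const char* name; };  // hypothetical stand-in for the real Id

void AddUri(std::unordered_map<uint32_t, Id>& uris, uint32_t anchor, Id id) {
  // Counterproductive: a single anchor of 100000 would ask for ~100001 buckets.
  // uris.reserve(anchor + 1);
  // The map only needs capacity proportional to the number of entries, so no
  // reserve() call (or one based on the expected entry count) is sufficient.
  uris[anchor] = id;
}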
I'm not using the function anymore and am doing it directly in GetExtensionSetFromPlan(). Is this approach better?
GitHub won't let me add comments to code this far away from a change, but these vectors a few lines further down (as well as the equivalent fields of ExtensionSet) should also be replaced with maps:
std::vector<Id> type_ids, function_ids;
std::vector<bool> type_is_variation;
When you change this and everything affected by it, you'll find that ExtensionSet::EncodeType and ExtensionSet::EncodeFunction won't work anymore as written. Functionally, they include code to generate a key that wasn't used before, which is easy with a vector, but for a map would require searching for an unused key by iterating over all possible keys until you find one. It might therefore be better to use an ordered map, in which case you can use something like map::rbegin()->first + 1 for the typical case (though you'd still need to implement the empty map and overflow edge cases; in the former case you should return 1, and in the latter case you'd need to exhaustively search for a free key anyway). Don't use anchor 0 (the current implementation probably does); that anchor value is reserved for undefined references in some contexts. If you really want to use an unordered_map instead of a map, you can also keep track of the latest key that was added to the map and use it as a start position for looking for free anchors; that'd still be pretty efficient in the typical case, but requires a bit more bookkeeping.
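A minimal sketch of the anchor-generation scheme described above, assuming an ordered std::map and a hypothetical stand-in Id (illustrative only, not Arrow's actual code):

#include <cstdint>
#include <limits>
#include <map>

struct Id { const char* uri; const char* name; };  // hypothetical stand-in

uint32_t NextFreeAnchor(const std::map<uint32_t, Id>& anchors) {
  if (anchors.empty()) return 1;  // anchor 0 is reserved for undefined references
  uint32_t largest = anchors.rbegin()->first;
  if (largest != std::numeric_limits<uint32_t>::max()) return largest + 1;
  // Overflow edge case: exhaustively search for an unused nonzero key.
  for (uint32_t candidate = 1; candidate != 0; ++candidate) {
    if (anchors.count(candidate) == 0) return candidate;
  }
  return 0;  // every anchor is taken (practically unreachable)
}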
Furthermore, as I pointed out in my comment to the JIRA, using a single anchor map for both types and type variations, and then using type_is_variation to distinguish between types and type variation, was a misinterpretation of the Substrait specification as far as I'm concerned: it's perfectly valid (or at least, unambiguous) for a plan to define a type with anchor 1 as well as a completely unrelated type variation with anchor 1. To be clear, a correct, complete implementation should look something like this:
std::[unordered_]map<uint32_t, Id> type_ids, type_variation_ids, function_ids;
That's a lot more involved though, as the assumption that a user-defined type or type variation is uniquely identified by a single number is pretty deeply ingrained in the round-tripping code. However, since type variations are not (correctly) implemented anyway, it's probably better for now to just remove the broken and incomplete code related to type variations, so it'd just become
std::[unordered_]map<uint32_t, Id> type_ids, function_ids;
On the consumer end, a nonzero type variation reference already yields an error (this is why anchor 0 should not be used, by the way). The type variation anchor definitions in the kExtensionTypeVariation case in the loop below could just be silently ignored (which is what I would personally do), but it'd be more in line with the rest of the implementation to emit an error when encountering those too, as the implementation aims to error out on anything that wouldn't round-trip perfectly.
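A hedged sketch of the two options on the consumer side, using a hypothetical enum rather than the real Substrait SimpleExtensionDeclaration cases:

#include "arrow/status.h"

// Hypothetical enum standing in for the declaration kinds in the real loop.
enum class DeclKind { kExtensionType, kExtensionTypeVariation, kExtensionFunction };

arrow::Status HandleDeclaration(DeclKind kind) {
  switch (kind) {
    case DeclKind::kExtensionTypeVariation:
      // Option 1: silently ignore variation anchors (they don't affect execution):
      //   return arrow::Status::OK();
      // Option 2: reject them, consistent with erroring on anything that
      // wouldn't round-trip perfectly:
      return arrow::Status::NotImplemented("type variations are not supported");
    default:
      return arrow::Status::OK();
  }
}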
I don't think an ordered_map is needed because we can come up with unique ids by just looking at the size of the map (if 0 is not valid then we can add 1).
I agree that the vectors won't work and I also agree that removing type variations entirely is maybe the best idea until it's clear what they actually represent.
I don't think an ordered_map is needed because we can come up with unique ids by just looking at the size of the map (if 0 is not valid then we can add 1).
That doesn't work, because there's no requirement for the indices to be sequential for incoming plans. For example, if the indices 1 and 3 are defined, and a new type is added on the production side, it would erroneously try to reuse map_size + 1 = 3 by this logic. The whole reason why we want to use maps at all is because they don't have to be sequential, otherwise it'd be much cheaper to just use vectors. You could use it as a cheap starting point when looking for unused anchors though; if they are sequential it would immediately yield a valid index.
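A tiny sketch of the collision just described (Id is again a hypothetical stand-in):

#include <cassert>
#include <cstdint>
#include <unordered_map>

struct Id { const char* uri; const char* name; };  // hypothetical stand-in

void SizePlusOneIsNotFree() {
  std::unordered_map<uint32_t, Id> anchors;
  anchors[1] = {"urn:example", "a"};  // an incoming plan defined anchors 1 and 3
  anchors[3] = {"urn:example", "b"};
  uint32_t guess = static_cast<uint32_t>(anchors.size()) + 1;  // size is 2, so guess = 3
  assert(anchors.count(guess) == 1);  // 3 is already taken: the guess is not free
}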
until it's clear what they actually represent.
They're like a poor-man's version of polymorphism, and a way to attach logical type information that doesn't impact the physical representation. For example, one might define a custom type polygon, with type variations triangle and quad, written as polygon[triangle] and polygon[quad] respectively. They'd all have the same type index and the same physical storage (probably some list of coordinate structs), but a function can then be written to only accept polygon[triangle], and a function that accepts polygons may or may not also accept polygon[quad]s and polygon[triangle]s (this depends on how the variation is defined in the YAML files). Note that variations can also be defined for Substrait's built-in types. Ultimately, a Substrait type consists of its physical base type (a simple type, compound type, or user-defined type), its nullability, a variation index (using 0 for "no variation"), and (for compound types and hopefully at some point also user-defined types) a set of template parameters.
Unless we really want to round-trip Substrait plans or validate them completely (having spent months on just a validator that still can't do everything, I don't recommend it), we could just ignore variations entirely. Even if someone else defined a variation in an extension we don't know about, it would have no bearing on how the plan is executed, since we only need to know what the physical representation of a type is. It's more of a producer or optimizer thing, to restrict which functions are available.
That doesn't work, because there's no requirement for the indices to be sequential for incoming plans
If it's an incoming plan we don't need to invent anchors. This "come up with a new anchor to use" problem only occurs when going from Arrow -> Substrait. I don't think it makes sense to share an ExtensionSet between production and consumption scenarios.
Thanks for the explanation of variations. What you are describing makes sense. It would allow functions to restrict themselves to a certain "class" of a type. I'd be curious to see a real world use case for such a thing but otherwise I agree that consumers can safely ignore them (provided they aren't expected to be validating consumers)
I'm inclined to agree, because personally I still don't understand why Arrow should have producer code at all. The best argument I've heard yet is to communicate plans across language boundaries within Arrow, but it'd be so much easier to just make protobuf messages that actually correspond to the execution engine data structures and type system for that. Heck, throw out protobuf in favor of something that links sanely. The second best is roundtrip testing, but is all the code needed to not only produce but also bit-by-bit reproduce the same plan really worth a factor 2 to 3 in code size?
But anyway, what you're describing as a non-goal for those functions is exactly how they are currently used. In practice the anchors will all already exist, since the producer code is AFAIK incomplete and can only reproduce an incoming plan, so the new-anchor-generation code would never trigger under normal circumstances. IMO that's no excuse for that function to just silently overwrite the same anchor over and over again in that case, though. It should at least detect it and throw an error.
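A minimal sketch of the suggested check (RegisterAnchor and Id are hypothetical helpers, not Arrow functions):

#include <cstdint>
#include <unordered_map>
#include "arrow/status.h"

struct Id { const char* uri; const char* name; };  // hypothetical stand-in

arrow::Status RegisterAnchor(std::unordered_map<uint32_t, Id>& anchors,
                             uint32_t anchor, Id id) {
  auto result = anchors.emplace(anchor, id);
  if (!result.second) {
    // Detect the collision instead of silently overwriting the earlier entry.
    return arrow::Status::Invalid("anchor ", anchor, " is already defined");
  }
  return arrow::Status::OK();
}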
westonpace
left a comment
Looks pretty close. Just a few minor suggestions.
- return types_.at(anchor);
+ return types_[anchor];
We don't handle exceptions, so there is generally no benefit to using at() over [].
I haven't looked at the PR again as of yet, but this struck me as an odd take in my e-mail feed. Assuming types_ is now a map of some kind, keep in mind that if the key does not exist, an uninitialized value is silently added to the map and then returned by mutable reference, to make x[y] = z work at the expense of sanity in all other use cases. Both [] and at() are bad if there's a real possibility of this occurring (I didn't check), but I'd much rather get an unhandled exception than pants-on-head crazy behavior. For vectors there'd be a minor performance benefit to [] over at(), but that also doesn't apply to maps AFAIK.
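A minimal illustration of the difference described above (plain standard-library code, not Arrow's):

#include <cstdint>
#include <string>
#include <unordered_map>

void AtVersusBracket() {
  std::unordered_map<uint32_t, std::string> anchors = {{1, "urn:example"}};
  // operator[] on a missing key silently inserts a default-constructed value:
  std::string a = anchors[42];    // anchors now has size 2 and a is ""
  // at() on a missing key throws std::out_of_range and inserts nothing:
  std::string b = anchors.at(7);  // throws
}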
Ah, fair enough. I think you're probably right on the behavior for maps. Thanks for the catch. Let's keep this as at().
westonpace
left a comment
The code looks good at this point. I looked through and nothing caught my eye. Unfortunately, I think this test error is legitimate:
In file included from /usr/local/include/arrow/util/hashing.h:49,
from /usr/local/include/arrow/engine/substrait/extension_set.h:31,
from /usr/local/include/arrow/engine/substrait/api.h:22,
from compute-exec.cpp:337:
/usr/local/include/arrow/vendored/xxhash.h:18:10: fatal error: arrow/vendored/xxhash/xxhash.h: No such file or directory
18 | #include "arrow/vendored/xxhash/xxhash.h"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It turns out that arrow/util/hashing.h is a private header (we should probably rename it hashing_internal.h but not as part of this PR).
Can you either move to a pimpl pattern to hide IdHashEq entirely or move the definition of IdHashEq::operator() to the .cc file? That should allow you to move the import back into the .cc file.
FWIW, see https://issues.apache.org/jira/browse/ARROW-16609 (we may have to make it public?)
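A sketch of the suggested split, with hypothetical Id members and a placeholder hash; the real declarations in extension_set.h may differ:

#include <cstddef>
#include <functional>
#include <string_view>

// Hypothetical stand-in for the real Id.
struct Id {
  std::string_view uri;
  std::string_view name;
};

// In extension_set.h: declarations only, so the private arrow/util/hashing.h
// include is no longer needed in the public header.
struct IdHashEq {
  size_t operator()(Id id) const;
  bool operator()(Id l, Id r) const;
};

// In extension_set.cc: the definitions (which may use arrow/util/hashing.h).
size_t IdHashEq::operator()(Id id) const {
  std::hash<std::string_view> h;
  return h(id.uri) ^ (h(id.name) << 1);  // simple combine, for the sketch only
}
bool IdHashEq::operator()(Id l, Id r) const {
  return l.uri == r.uri && l.name == r.name;
}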
fix: removed SetMapElement
feat: used [] operator instead of at() method to fetch from ext_set.uris()
docs: wording correction
feat: using util::string_view instead of auto
feat: using DCHECK instead of only returning Invalid status
5fb8068 to d0d5ccb
westonpace
left a comment
Thanks for your help in getting these changes through.
Benchmark runs are scheduled for baseline = 5994fd8 and contender = 8bbc695. 8bbc695 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This PR modifies the ExtensionSet in the C++ consumer to use an unordered_map instead of a vector to store the uri anchors as the lookup table. This also modifies the usage of the impl struct, as the included functions are now defined directly with the ExtensionSet implementation.
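For illustration, the shape of the change reads roughly like this (member names are illustrative, not the exact Arrow fields):

#include <cstdint>
#include <string_view>
#include <unordered_map>

class ExtensionSet {
  // Before: std::vector<std::string_view> uris_;  // indexed by anchor value
  // After: keyed by anchor, so a large anchor no longer forces a huge allocation.
  std::unordered_map<uint32_t, std::string_view> uris_;
};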