Requiring keys provided to map to be unique #3760
Conversation

Maybe I should change GetHistogramFunction so it doesn't always throw an exception if the type isn't supported, so I can handle it at the call site and throw a more descriptive exception?

Some of the CI tests are failing because of the Histogram aggregate issue I mentioned. Might it be worth it to temporarily create an inefficient brute-force check for uniqueness for those types? Or to just allow non-unique keys for those types for now?
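Such a brute-force fallback could look roughly like this — a minimal Python sketch of the idea (hypothetical function, not actual DuckDB code), comparing every pair of keys in a row's key list in O(n²):

```python
def keys_are_unique(keys):
    """Brute-force O(n^2) uniqueness check: compare every pair of keys.

    The point of the fallback is that it only needs equality comparison,
    so it would not depend on the histogram aggregate supporting the type.
    """
    n = len(keys)
    for i in range(n):
        for j in range(i + 1, n):
            if keys[i] == keys[j]:
                return False  # found a duplicate key
    return True
```

For example, `keys_are_unique([3, 3])` returns `False`, while `keys_are_unique([3, 5])` returns `True`.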
@pdet What do you think is the right course of action to take for now? Should I allow non-unique keys for types that aren't supported by the Histogram aggregate yet?

What types are missing? Only nested types?
I'm not sure; this is the current `GetHistogramFunction`:

```cpp
template <bool IS_ORDERED = true>
AggregateFunction GetHistogramFunction(const LogicalType &type) {
	switch (type.id()) {
	case LogicalTypeId::BOOLEAN:
		return GetMapType<HistogramFunctor, bool, IS_ORDERED>(type);
	case LogicalTypeId::UTINYINT:
		return GetMapType<HistogramFunctor, uint8_t, IS_ORDERED>(type);
	case LogicalTypeId::USMALLINT:
		return GetMapType<HistogramFunctor, uint16_t, IS_ORDERED>(type);
	case LogicalTypeId::UINTEGER:
		return GetMapType<HistogramFunctor, uint32_t, IS_ORDERED>(type);
	case LogicalTypeId::UBIGINT:
		return GetMapType<HistogramFunctor, uint64_t, IS_ORDERED>(type);
	case LogicalTypeId::TINYINT:
		return GetMapType<HistogramFunctor, int8_t, IS_ORDERED>(type);
	case LogicalTypeId::SMALLINT:
		return GetMapType<HistogramFunctor, int16_t, IS_ORDERED>(type);
	case LogicalTypeId::INTEGER:
		return GetMapType<HistogramFunctor, int32_t, IS_ORDERED>(type);
	case LogicalTypeId::BIGINT:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::FLOAT:
		return GetMapType<HistogramFunctor, float, IS_ORDERED>(type);
	case LogicalTypeId::DOUBLE:
		return GetMapType<HistogramFunctor, double, IS_ORDERED>(type);
	case LogicalTypeId::VARCHAR:
		return GetMapType<HistogramStringFunctor, string, IS_ORDERED>(type);
	case LogicalTypeId::TIMESTAMP:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIMESTAMP_TZ:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIMESTAMP_S:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIMESTAMP_MS:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIMESTAMP_NS:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIME:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIME_TZ:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::DATE:
		return GetMapType<HistogramFunctor, int32_t, IS_ORDERED>(type);
	default:
		throw InternalException("Unimplemented histogram aggregate");
	}
}
```

@pdet TL;DR: yes, probably only nested types aren't supported.
I think it is fine if you just throw an error if it's not one of these types.

That is also an option, haha, I hadn't considered that one if I'm honest.

@pdet Could you take a look at this, see if there's anything I still need to fix/change?
src/function/scalar/map/map.cpp
Outdated
```cpp
bool KeyListIsEmpty(list_entry_t *data, idx_t rows) {
	for (idx_t i = 0; i < rows; i++) {
		auto size = data[i].length;
		if (size != 0) {
			return false;
		}
	}
	return true;
}

static void CheckKeyValidity(DataChunk &args) {
	auto types = args.GetTypes();
	if (types.empty()) {
		return;
	}
	auto key_type = ListType::GetChildType(types[0]).id();
	auto &array = args.data[0];
	auto arg_data = FlatVector::GetData<list_entry_t>(array);
	auto &entries = ListVector::GetEntry(array);
	if (key_type == LogicalTypeId::SQLNULL) {
		if (KeyListIsEmpty(arg_data, args.size())) {
			return;
		}
		// The entire key list is NULL for one (or more) of the rows: (ARRAY[NULL, NULL, NULL])
		throw InvalidInputException("Map keys can not be NULL");
	}

	VectorData list_data;
	auto count = ListVector::GetListSize(array);
	args.data[0].Orrify(args.size(), list_data);
	auto validity = FlatVector::Validity(entries);
	if (!validity.CheckAllValid(count)) {
		throw InvalidInputException("Map keys can not be NULL");
	}
}

static void CheckForKeyUniqueness(DataChunk &args, ExpressionState &state) {
	// Create a copy of the arguments
	auto types = args.GetTypes();
	if (types.empty() || ListType::GetChildType(types[0]).id() == LogicalTypeId::SQLNULL) {
		return;
	}

	auto arg_data = FlatVector::GetData<list_entry_t>(args.data[0]);

	DataChunk keys;
	keys.Initialize(args.GetTypes());
	args.Copy(keys);

	// Split the copy to separate the keys
	DataChunk remaining_columns;
	keys.Split(remaining_columns, 1);

	Vector unique_result(LogicalType::UBIGINT, args.size());
	ListUniqueFunction(keys, state, unique_result);
	for (idx_t i = 0; i < args.size(); i++) {
		auto keys_length = arg_data[i].length;
		auto unique_keys = FlatVector::GetValue<uint64_t>(unique_result, i);
		if (unique_keys != keys_length) {
			throw InvalidInputException("Map keys have to be unique!");
		}
	}
}
```
Nitpicking, wouldn't it be better if these were `AreKeysUnique` and `AreKeysValid` functions that return a bool, with the exception being thrown in the `MapFunction`?
Also, how are these functions called when you have a map type not created by SQL? E.g., if you have a malformed pyarrow table with a map type and then consume it through duckdb, I think these functions won't be called, am I right?
Hmm, that could definitely be the case, I will test those cases.
```python
import duckdb
import pyarrow as pa

map_type = pa.map_(pa.int32(), pa.int32())
values = [
    [
        (3, 12),
        (3, 21)
    ],
    [
        (5, 42)
    ]
]
arrow_table = pa.table(
    {'detail': pa.array(values, map_type)}
)
rel = duckdb.from_arrow(arrow_table).fetchall()
# [({'key': [3, 3], 'value': [12, 21]},), ({'key': [5], 'value': [42]},)]
```

Your hunch was correct, this error isn't caught.
Now the error is caught and tested for 👍
Mytherin left a comment:
Thanks for the updates! Some more comments:

Mytherin left a comment:
Thanks for the updates! Looking great. Some more comments:

Mytherin left a comment:
Thanks for the updates! Almost there. Some more pointers:
src/function/scalar/map/map.cpp
Outdated
```cpp
namespace duckdb {

// TODO: this doesn't recursively verify maps if maps are nested
void VerifyMap(Vector &map, idx_t count, const SelectionVector &sel) {
```
Can we split this function into a function that returns an `enum class MapInvalidReason : uint8_t { MAP_VALID, MAP_NULL_KEY_LIST, MAP_NULL_KEY, MAP_NOT_UNIQUE };` and a function that throws the `InvalidInputException`s?
In `Vector::Verify` we want to throw an `InternalException` if the map is not valid, and not an `InvalidInputException`. We want to do something like:
`D_ASSERT(VerifyMapInternal(map, count, sel) == MapInvalidReason::MAP_VALID);`

Done 👍
I turned `VerifyMap` into a static method of `Vector` that just asserts that it's valid.
`CheckMapValidity` now returns the enum, and I've added two functions that handle the errors: one in arrow_conversion.cpp and one in map.cpp.
Most of the error checks in the Arrow conversion should never be hit, because the only thing we don't share is the uniqueness requirement, but I added them for completeness.
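The split being discussed — one function that computes a validity reason, and separate call sites that decide which error to raise — can be sketched like this (a Python model with hypothetical names, not the actual C++):

```python
from enum import Enum

class MapInvalidReason(Enum):
    MAP_VALID = 0
    MAP_NULL_KEY_LIST = 1
    MAP_NULL_KEY = 2
    MAP_NOT_UNIQUE = 3

def check_map_validity(key_lists):
    """Return a reason instead of throwing, so each call site can pick
    the appropriate exception type (a user-facing input error in map.cpp
    or arrow_conversion.cpp, an internal assertion in Vector::Verify)."""
    for keys in key_lists:
        if keys is None:
            return MapInvalidReason.MAP_NULL_KEY_LIST
        if any(k is None for k in keys):
            return MapInvalidReason.MAP_NULL_KEY
        if len(set(keys)) != len(keys):
            return MapInvalidReason.MAP_NOT_UNIQUE
    return MapInvalidReason.MAP_VALID

def map_function_error(key_lists):
    """Call-site wrapper: turn the reason into a user-facing error."""
    reason = check_map_validity(key_lists)
    if reason == MapInvalidReason.MAP_NOT_UNIQUE:
        raise ValueError("Map keys have to be unique!")
    if reason != MapInvalidReason.MAP_VALID:
        raise ValueError("Map keys can not be NULL")
```

The design point is that the checker itself is error-agnostic, so the same logic can back both an assertion and an input-validation error.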
Mytherin left a comment:
Thanks for the updates - last comment, then it is good to go.
Thanks, now all that's left is to pray to the CI gods to be merciful.

Looks like they were. Great, thanks!
This PR adds a key-uniqueness requirement to the `map` function, as discussed in #3640. It uses the `list_unique` function to verify that all keys are unique, which is why it is also subject to the same limitations as `list_unique`: the key type has to be supported by the histogram aggregate function. That's why I've changed the exception type in `GetHistogramFunction` to `InvalidInputException` and made it slightly more descriptive. If the provided keys are not unique, an `InvalidInputException` is thrown and the operation is aborted.
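The invariant that `list_unique` is used to enforce — the number of distinct keys in each row must equal the number of keys supplied — can be modeled in a few lines (a Python sketch of the idea, not the actual implementation):

```python
def validate_map_keys(rows):
    """For each row's key list, the count of unique keys must match the
    key list's length; a mismatch means at least one duplicate key."""
    for keys in rows:
        if len(set(keys)) != len(keys):
            raise ValueError("Map keys have to be unique!")
```

For example, `validate_map_keys([[1, 2, 3], [5]])` passes, while `validate_map_keys([[3, 3]])` raises.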