Requiring keys provided to map to be unique #3760
Conversation

Maybe I should change GetHistogramFunction so it doesn't always throw an exception if the type isn't supported, so I can handle it at the call site and throw a more descriptive exception?

Some of the CI tests are failing because of the Histogram aggregate issue I mentioned. Might it be worth it to temporarily create an inefficient brute-force check for uniqueness for those types? Or to just allow non-unique keys for those types for now?
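Such a brute-force fallback could look roughly like this — a minimal Python sketch of the idea (hypothetical function, not actual DuckDB code), comparing every pair of keys in a row's key list in O(n²):

```python
def keys_are_unique(keys):
    """Brute-force O(n^2) uniqueness check: compare every pair of keys.

    The point of the fallback is that it only needs equality comparison,
    so it would not depend on the histogram aggregate supporting the type.
    """
    n = len(keys)
    for i in range(n):
        for j in range(i + 1, n):
            if keys[i] == keys[j]:
                return False  # found a duplicate key
    return True
```

For example, `keys_are_unique([3, 3])` returns `False`, while `keys_are_unique([3, 5])` returns `True`.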
@pdet What do you think is the right course of action to take for now? Should I allow non-unique keys for types that aren't supported by the Histogram aggregate yet?

What types are missing? Only nested types?
I'm not sure; this is the current `GetHistogramFunction`:

```cpp
template <bool IS_ORDERED = true>
AggregateFunction GetHistogramFunction(const LogicalType &type) {
	switch (type.id()) {
	case LogicalTypeId::BOOLEAN:
		return GetMapType<HistogramFunctor, bool, IS_ORDERED>(type);
	case LogicalTypeId::UTINYINT:
		return GetMapType<HistogramFunctor, uint8_t, IS_ORDERED>(type);
	case LogicalTypeId::USMALLINT:
		return GetMapType<HistogramFunctor, uint16_t, IS_ORDERED>(type);
	case LogicalTypeId::UINTEGER:
		return GetMapType<HistogramFunctor, uint32_t, IS_ORDERED>(type);
	case LogicalTypeId::UBIGINT:
		return GetMapType<HistogramFunctor, uint64_t, IS_ORDERED>(type);
	case LogicalTypeId::TINYINT:
		return GetMapType<HistogramFunctor, int8_t, IS_ORDERED>(type);
	case LogicalTypeId::SMALLINT:
		return GetMapType<HistogramFunctor, int16_t, IS_ORDERED>(type);
	case LogicalTypeId::INTEGER:
		return GetMapType<HistogramFunctor, int32_t, IS_ORDERED>(type);
	case LogicalTypeId::BIGINT:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::FLOAT:
		return GetMapType<HistogramFunctor, float, IS_ORDERED>(type);
	case LogicalTypeId::DOUBLE:
		return GetMapType<HistogramFunctor, double, IS_ORDERED>(type);
	case LogicalTypeId::VARCHAR:
		return GetMapType<HistogramStringFunctor, string, IS_ORDERED>(type);
	case LogicalTypeId::TIMESTAMP:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIMESTAMP_TZ:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIMESTAMP_S:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIMESTAMP_MS:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIMESTAMP_NS:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIME:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::TIME_TZ:
		return GetMapType<HistogramFunctor, int64_t, IS_ORDERED>(type);
	case LogicalTypeId::DATE:
		return GetMapType<HistogramFunctor, int32_t, IS_ORDERED>(type);
	default:
		throw InternalException("Unimplemented histogram aggregate");
	}
}
```

@pdet TL;DR: yes, probably only nested types aren't supported.
I think it is fine if you just throw an error if it's not one of these types.

That is also an option, haha, I hadn't considered that one if I'm honest.

@pdet Could you take a look at this, see if there's anything I still need to fix/change?
src/function/scalar/map/map.cpp
Outdated
```cpp
bool KeyListIsEmpty(list_entry_t *data, idx_t rows) {
	for (idx_t i = 0; i < rows; i++) {
		auto size = data[i].length;
		if (size != 0) {
			return false;
		}
	}
	return true;
}

static void CheckKeyValidity(DataChunk &args) {
	auto types = args.GetTypes();
	if (types.empty()) {
		return;
	}
	auto key_type = ListType::GetChildType(types[0]).id();
	auto &array = args.data[0];
	auto arg_data = FlatVector::GetData<list_entry_t>(array);
	auto &entries = ListVector::GetEntry(array);
	if (key_type == LogicalTypeId::SQLNULL) {
		if (KeyListIsEmpty(arg_data, args.size())) {
			return;
		}
		// The entire key list is NULL for one (or more) of the rows: (ARRAY[NULL, NULL, NULL])
		throw InvalidInputException("Map keys can not be NULL");
	}

	VectorData list_data;
	auto count = ListVector::GetListSize(array);
	args.data[0].Orrify(args.size(), list_data);
	auto validity = FlatVector::Validity(entries);
	if (!validity.CheckAllValid(count)) {
		throw InvalidInputException("Map keys can not be NULL");
	}
}

static void CheckForKeyUniqueness(DataChunk &args, ExpressionState &state) {
	// Create a copy of the arguments
	auto types = args.GetTypes();
	if (types.empty() || ListType::GetChildType(types[0]).id() == LogicalTypeId::SQLNULL) {
		return;
	}

	auto arg_data = FlatVector::GetData<list_entry_t>(args.data[0]);

	DataChunk keys;
	keys.Initialize(args.GetTypes());
	args.Copy(keys);

	// Split the copy to separate the keys
	DataChunk remaining_columns;
	keys.Split(remaining_columns, 1);

	Vector unique_result(LogicalType::UBIGINT, args.size());
	ListUniqueFunction(keys, state, unique_result);
	for (idx_t i = 0; i < args.size(); i++) {
		auto keys_length = arg_data[i].length;
		auto unique_keys = FlatVector::GetValue<uint64_t>(unique_result, i);
		if (unique_keys != keys_length) {
			throw InvalidInputException("Map keys have to be unique!");
		}
	}
}
```
Nitpicking, wouldn't it be better if these were `AreKeysUnique` and `AreKeysValid` functions that return a bool, with the exception being thrown in the `MapFunction`?
Also, how are these functions called when you have a map type not created by SQL? E.g., if you have a malformed pyarrow table with a map type and then consume it through duckdb, I think these functions won't be called, am I right?
Hmm, that could definitely be the case, I will test those cases.
```python
import duckdb
import pyarrow as pa

map_type = pa.map_(pa.int32(), pa.int32())
values = [
    [
        (3, 12),
        (3, 21)
    ],
    [
        (5, 42)
    ]
]
arrow_table = pa.table(
    {'detail': pa.array(values, map_type)}
)
rel = duckdb.from_arrow(arrow_table).fetchall()
# [({'key': [3, 3], 'value': [12, 21]},), ({'key': [5], 'value': [42]},)]
```

Your hunch was correct, this error isn't caught.
Now the error is caught and tested for 👍
Mytherin left a comment:
Thanks for the updates! Some more comments:

Mytherin left a comment:
Thanks for the updates! Looking great. Some more comments:

Mytherin left a comment:
Thanks for the updates! Almost there. Some more pointers:
src/function/scalar/map/map.cpp
Outdated
```cpp
namespace duckdb {

// TODO: this doesn't recursively verify maps if maps are nested
void VerifyMap(Vector &map, idx_t count, const SelectionVector &sel) {
```
Can we split this function into a function that returns an `enum class MapInvalidReason : uint8_t { MAP_VALID, MAP_NULL_KEY_LIST, MAP_NULL_KEY, MAP_NOT_UNIQUE };` and a function that throws the `InvalidInputException`s?
In `Vector::Verify` we want to throw an `InternalException` if the map is not valid, and not an `InvalidInputException`. We want to do something like:
`D_ASSERT(VerifyMapInternal(map, count, sel) == MapInvalidReason::MAP_VALID);`

Done 👍
I turned `VerifyMap` into a static method of `Vector` that just asserts that it's valid.
`CheckMapValidity` now returns the enum, and I've added two functions that handle the errors: one in arrow_conversion.cpp and one in map.cpp.
Most of the error checks in the Arrow conversion should never be hit, because the only thing we don't share is the uniqueness requirement, but I added them for completeness.
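The split being discussed — one function that computes a validity reason, and separate call sites that decide which error to raise — can be sketched like this (a Python model with hypothetical names, not the actual C++):

```python
from enum import Enum

class MapInvalidReason(Enum):
    MAP_VALID = 0
    MAP_NULL_KEY_LIST = 1
    MAP_NULL_KEY = 2
    MAP_NOT_UNIQUE = 3

def check_map_validity(key_lists):
    """Return a reason instead of throwing, so each call site can pick
    the appropriate exception type (a user-facing input error in map.cpp
    or arrow_conversion.cpp, an internal assertion in Vector::Verify)."""
    for keys in key_lists:
        if keys is None:
            return MapInvalidReason.MAP_NULL_KEY_LIST
        if any(k is None for k in keys):
            return MapInvalidReason.MAP_NULL_KEY
        if len(set(keys)) != len(keys):
            return MapInvalidReason.MAP_NOT_UNIQUE
    return MapInvalidReason.MAP_VALID

def map_function_error(key_lists):
    """Call-site wrapper: turn the reason into a user-facing error."""
    reason = check_map_validity(key_lists)
    if reason == MapInvalidReason.MAP_NOT_UNIQUE:
        raise ValueError("Map keys have to be unique!")
    if reason != MapInvalidReason.MAP_VALID:
        raise ValueError("Map keys can not be NULL")
```

The design point is that the checker itself is error-agnostic, so the same logic can back both an assertion and an input-validation error.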
Mytherin left a comment:
Thanks for the updates - last comment, then it is good to go.
Thanks, now all that's left is to pray to the CI gods to be merciful.

Looks like they were. Great, thanks!
This PR adds a key-uniqueness requirement to the `map` function, as discussed in #3640. It uses the `list_unique` function to verify that all keys are unique, which is why it is also subject to the same limitations as `list_unique`: the key type has to be supported by the histogram aggregate function. That's why I've changed the exception type in `GetHistogramFunction` to `InvalidInputException` and made it slightly more descriptive. If the provided keys are not unique, an `InvalidInputException` is thrown and the operation is aborted.
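The invariant that `list_unique` is used to enforce — the number of distinct keys in each row must equal the number of keys supplied — can be modeled in a few lines (a Python sketch of the idea, not the actual implementation):

```python
def validate_map_keys(rows):
    """For each row's key list, the count of unique keys must match the
    key list's length; a mismatch means at least one duplicate key."""
    for keys in rows:
        if len(set(keys)) != len(keys):
            raise ValueError("Map keys have to be unique!")
```

For example, `validate_map_keys([[1, 2, 3], [5]])` passes, while `validate_map_keys([[3, 3]])` raises.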