Merged
Conversation
f811d92 to
a6d458c
Compare
1fa9364 to
1ffe9d3
Compare
1ffe9d3 to
9b8b6d9
Compare
locker
reviewed
Dec 17, 2024
Gumix
approved these changes
Dec 19, 2024
lenkis
approved these changes
Dec 20, 2024
a7c74f5 to
f79f8de
Compare
df29993 to
50e3905
Compare
50e3905 to
a82ba62
Compare
39b7402 to
cbea45d
Compare
locker
approved these changes
Feb 6, 2025
Gumix
approved these changes
Feb 6, 2025
d2ff7d2 to
8999b80
Compare
This update pulls the following commits: * Fix mp_compare_uint() * Make tests C++ * Bring test.h up to date with Tarantool's unit.h * Fix mp_read_int/double on suboptimal msgpack * Add NOINLINE to a couple of test functions Part of #9965 NO_DOC=bugfix NO_CHANGELOG=later
The memtx index hints were assuming that MP_INT is always negative. It is true that Tarantool always encodes non-negative values as MP_UINT, but the msgpack format doesn't require that. All the code must be ready to any valid msgpack. Part of #9965 NO_DOC=bugfix NO_CHANGELOG=later
Upsert operation normalizes its operations - subtracts the index base from the fieldnos. If the fieldno is negative, it is simply encoded back. If positive - the subtraction happens and then encoding back. The problem was that the code assumed MP_INT is always negative, and would blindly encode it without changes. The assumption wasn't right - MP_INT can be positive. Need to handle that. Part of #9965 NO_DOC=bugfix NO_CHANGELOG=later
Datetime interval object, when being decoded from msgpack, didn't consider MP_INT allowing positive values. In one places, where a number of any sign in int32_t was allowed, the MP_INT decoded value was only checked for INT32_MIN, assuming that max is 0. But in fact it had to be checked for INT32_MAX too, as it can be > 0. Part of #9965 NO_DOC=bugfix NO_CHANGELOG=later
rand() function isn't good. It returns just an 'int'. Not enough to properly randomize int64_t/uint64_t. It is going to be needed for the future commits to better cover the effects of big numbers on tuple hashing. Lets use mt19937 random generator from C++ standard library with uniform_int_distribution to cover numbers of any byte size evenly. Part of #9965 NO_DOC=internal NO_CHANGELOG=internal NO_TEST=it is a test
It is a step towards removing the global singletons completely. They make it super hard to customize any bench, because they all have to use the same globals right now. It would make problematic to introduce a bench for tuple hashing in the next commits. In scope of #9965 NO_DOC=internal NO_CHANGELOG=internal NO_TEST=it is a test
In the tuple bench there were a couple of places which were going to cause stack overflow, when their code was executed in a regular context, not at the singleton construction phase. Lets make them use the heap. In scope of #9965 NO_DOC=internal NO_CHANGELOG=internal NO_TEST=it is a test
They made it super hard to customize any bench, because all benches had to use the same globals. It wasn't easy, for example, specify a new kind of key def with a different set of fields. All the code done by those globals would have to be duplicated. But now they are more generic and reusable. In scope of #9965 NO_DOC=internal NO_CHANGELOG=internal NO_TEST=it is a test
key_part_def was initialized with zeros, which is wrong. It makes it have ON_CONFLICT_ACTION_NONE set (because it is zero), which isn't default. This led to some fields being nullable unintentionally. Better use the proper key_part_def_default to initialize those defs. In scope of #9965 NO_DOC=internal NO_CHANGELOG=internal NO_TEST=it is a test
The new bench should help to see how the following commits are changing the tuple hashing perf, when the hashing functions get fixed (they contain some bugs). In scope of #9965 NO_DOC=internal NO_CHANGELOG=internal NO_TEST=it is a test
Hash index over a number field was showing a confusing behaviour: same non-integer number stored as MP_FLOAT and MP_DOUBLE had different hashes. For example, a user could insert float 14.5 and would be unable to get the tuple by double 14.5 key. At the same time if the index was 'double', the scenario above worked fine. Also it works fine in a tree index. Which means the 'number' index behaviour was not only confusing but also inconsistent. Lets fix it and hash all non-integer values as doubles. Note that the hashing function uses mp_store_double() instead of mp_encode_double(). This is because all the doubles have the same header MP_DOUBLE. It is pointless to include it into the hash function. Besides, as the next commit will show, hash from an msgpack number without a header is quite faster to calculate. Part of #9965 NO_DOC=bugfix NO_CHANGELOG=later
tuple_bloom filter had a flag is_legacy. It existed because of Vinyl where 2 different types of bloom filters exist for backward compatibility. Unfortunately, soon there will be 3. One flag isn't enough. The commit turns it into an enum. In scope of #9965 NO_DOC=internal NO_CHANGELOG=internal NO_TEST=it is a test
If a hash index had <= 3 fields, all of types 'unsigned' and 'string', then its hashing functions were C++ templates, which in case of a single unsigned field would use a faster hashing function than MurMur. But the function would return a different hash. This led to a problem that the same MP_UINT value would have different hash values depending on its space field type. And this in turn would break indexes altering their single unsigned field's type to any other type, like number, double, scalar. Those hash templates don't really make a big difference. At least most of them. Only this single field case for 'unsigned' type was quite faster, but the code complexity is too high and leads to the bugs like the one described above. Lets just drop those templates. Hash index anyway is considered deprecated and isn't being evolved or recommended. Note that the tuple_hash function is still templated. But only to save on 'if's which it has quite a lot, checking the presence of optional parts and JSON paths. The hashing function is not affected. Part of #9965 NO_DOC=bugfix NO_CHANGELOG=later
Tuple and key hashing in tuple_hash.cc could return different hashes for the same integer encoded in different ways. In MessagePack it is possible to encode one integer in up to 9 ways, all being valid. Tarantool must either prohibit suboptimal integer encoding, or properly support it, but not ignore like it was before. This commit chose the support way. Tuple hashing code now repacks the integers in the optimal way before taking their hash. This is enough for memtx. For Vinyl the situation was a bit different - it stores hashes on disk, for the bloom filters. In order for the filters generated by the older versions of Tarantool to be usable, there is still support of incorrect hashing functions. Over time, after some compactions, the old filters are going to be washed away and replaced by the new ones automatically. The new hash functions not only aren't slower, but are even slightly faster. The reason is that the hashing algorithm seems to have a heavy penalty for all buffers whose size isn't a power of 2. The optimally encoded MessagePack integers usually take an uneven number of bytes: 1, 2, 3, 5, 9. Only numbers in the ranges [-128, -32) and (127, 255) take a buffer of an even size. The new hashing functions use the same hash algorithm, but re-encode the MesssagePack integers without their headers. Hence operating always on the buffers of sizes 1, 2, 4, 8 - smaller and only one of them is uneven. The speed up isn't large though. It can make visible effect on the indexes consisting of integer key parts. The observable speed increase is reproducible in tuple.perftest with `--benchmark_filter="tuple_tuple_hash"` argument, and was seen to be up to 11% on just tuple_hash() function alone. In a realistic case the increase might be much less visible when full overhead of a complete request is considered. Part of #9965 NO_DOC=bugfix NO_CHANGELOG=later
When 'double' field type was introduced, it would only allow MP_DOUBLE MessagePack values. This would work fine. The numbers were compared correctly, could be promoted to 'number' and 'scalar' types. Then the users were complaining that it is too hard to insert integer values into a 'double' column because, at least when come from the Lua land, the numbers with 0 fractional part are saved as MP_INT/MP_UINT. The commit 51af059 ("box: compare and hash msgpack value of double key field as double") allowed the users to insert MP_INT and MP_UINT into the 'double' field, but it kept the C-double comparison logic. Meaning that all the numbers would be cast to C-language doubles and compared like that, regardless of their MessagePack type. It would behave like if the numbers are all actually stored as MP_DOUBLE. This eventually led to problems, because the assumption of 'double' being included into 'scalar' and 'number' broke. The same values stored in a 'double' index and in a 'number' index would be sorted differently, because large integers, when cast to C-double, loose precision. In double-world UINT64_MAX == UINT64_MAX + 1, while it is obviously not so in number- and scalar-world. The index simply broke when it was altered to 'number' or 'scalar', because it doesn't get rebuilt, and with the new sorting rules it became sorted incorrectly. This led to all sorts of data inconsistencies, assertions and crashes. With vinyl it was particularly bad, because it can't even be fixed automatically. If a user ever had a vinyl 'double' index with large integers and then converted it to 'number' or 'scalar', then the index is broken. And it is undetectable from just looking at the field types. The unlucky users of vinyl with 'double'-type indexes are going to have to pay attention to the next release and rebuild their indexes manually. Closes #9965 Follow-up for #7483 @TarantoolBot document Title: `double` field type new behaviour `Double` field type now is identical to `number`. I.e. it can store integers and compares them as integers. * [Reference -> SQL reference -> SQL user guide](https://www.tarantool.io/en/doc/latest/reference/reference_sql/sql_user_guide/#sql-data-type-double): *DOUBLE values are numerics that do contain decimal points (for example 0.5) or are expressed with exponential notation (for example 5E-1)* -> now DOUBLE can have any numbers, including integers. It becomes identical to NUMBER. * [Platform -> Defining and manipulating data -> Data storage](https://www.tarantool.io/en/doc/latest/platform/ddl_dml/value_store/): *the storage type is MP_DOUBLE and the size of the encoded value is always 9 bytes. In Lua, fields of the double type can only contain non-integer numeric values and cdata values with double floating-point numbers* -> not so anymore. `double` field is now identical to `number` (it can store integers too, and compares them like integers).
8999b80 to
bcac6c1
Compare
Collaborator
Author
|
During rebase I did a change in the vinyl 9965 test. I turned off the vinyl cache, and noticed that in the old run files we won't be able to find a double key, if the search-key is float. And vice versa. This is matching the old behaviour I guess, which we intentionally keep for the old run-files. I caught this flaky test in static-build job, which somewhy had vinyl cache set to zero and managed to catch the problem. Now I made the cache zero in all 9965 tests. |
Collaborator
Author
|
I will backport it manually. |
This was referenced Feb 10, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Not including some problems of
MP_INTvsMP_UINTin SQL.Closes #9965.
Related to #10132, #10084.
Here is a related discussion: https://github.com/orgs/tarantool/discussions/11017.