CASSANDRA-18504: Added support for type VECTOR<type, dimension> #2310

dcapwell · 2023-05-05T23:11:16Z

No description provided.

src/java/org/apache/cassandra/cql3/Lists.java

src/java/org/apache/cassandra/utils/vint/VIntCoding.java

test/unit/org/apache/cassandra/cql3/CQLTester.java

dcapwell · 2023-05-05T23:25:28Z

test/unit/org/apache/cassandra/db/marshal/AbstractTypeTest.java

failing atm, few types remain to be added to generators...

java.lang.AssertionError: Uncovered types: org.apache.cassandra.db.marshal.LexicalUUIDType org.apache.cassandra.db.marshal.DynamicCompositeType.FixedValueComparator org.apache.cassandra.db.marshal.PartitionerDefinedOrder org.apache.cassandra.db.marshal.DateType org.apache.cassandra.db.marshal.CounterColumnType org.apache.cassandra.db.marshal.IntegerType org.apache.cassandra.db.marshal.DecimalType org.apache.cassandra.db.marshal.SimpleDateType org.apache.cassandra.db.marshal.TimeType org.apache.cassandra.db.marshal.DynamicCompositeType org.apache.cassandra.db.marshal.CompositeType org.apache.cassandra.db.marshal.LegacyTimeUUIDType org.apache.cassandra.db.marshal.TimeUUIDType

I assume that this comment is now invalid because it looks like you have added type support for these types. Can you confirm?

there are a small handful left to add, but those are not too hard to do. This comment isn't 100% accurate as some types were added but this test still fails waiting for 100% coverage

dcapwell · 2023-05-05T23:26:58Z

test/unit/org/apache/cassandra/db/marshal/AbstractTypeTest.java

I know there are only 2 versions and I filter 2 out... but this allows for new versions to get tested by default

dcapwell · 2023-05-05T23:27:37Z

test/unit/org/apache/cassandra/db/marshal/AbstractTypeTest.java

we are working with non-null data, so this can't be null

test/unit/org/apache/cassandra/db/marshal/AbstractTypeTest.java

src/java/org/apache/cassandra/db/marshal/VectorType.java

mike-tr-adamson · 2023-05-09T10:32:29Z

@dcapwell Is this ready for review or are you still actively working on it?

dcapwell · 2023-05-11T17:01:06Z

@mike-tr-adamson

Is this ready for review or are you still actively working on it?

yes, its ready for review... I make tweaks over time as I think about it, but that's mostly to add more tests; the core patch is stable

mike-tr-adamson · 2023-05-12T11:59:03Z

src/java/org/apache/cassandra/cql3/Vectors.java

We need to add a Marker extends AbstractMarker implementation to this class to handle bindings correctly. We then need to add to AbstractMarker.Raw.prepare to use this class:

else if (receiver.type instanceof VectorType) { return new Vectors.Marker(bindIndex, receiver); }

This allows us to correctly bind variables as vectors.

can you provide a CQL where this matters? In testing I see we become return new Constants.Marker(bindIndex, receiver);, which is fine to me as that validates with receiver.type.validate(value);

For list example, that produces a Value that split the list, so you have List<ByteBuffer>... since we are frozen, do we need this?

I added tests for select and IN clause... works with and without binds...

How are you binding values? If I try the following:

createTable("CREATE TABLE %s (pk int, val vector<float, 3>, PRIMARY KEY(pk))"); execute("INSERT INTO %s (pk, val) VALUES (0, ?)", new float[]{ 1.0f, 2.0f, 3.0f });

I get the following error:

java.lang.IllegalArgumentException: Unsupported value type (value is [F@389a5c0b) at org.apache.cassandra.cql3.CQLTester.typeFor(CQLTester.java:2433) at org.apache.cassandra.cql3.CQLTester.transformValues(CQLTester.java:2077) at org.apache.cassandra.cql3.CQLTester.executeFormattedQuery(CQLTester.java:1434) at org.apache.cassandra.cql3.CQLTester.execute(CQLTester.java:1413)

and if I try:

createTable("CREATE TABLE %s (pk int, val vector<float, 3>, PRIMARY KEY(pk))"); execute("INSERT INTO %s (pk, val) VALUES (0, ?)", Lists.newArrayList(1.0f, 2.0f, 3.0f));

I get:

org.apache.cassandra.exceptions.InvalidRequestException: Unexpected extraneous bytes after list value at org.apache.cassandra.cql3.Constants$Marker.bindAndGet(Constants.java:461) at org.apache.cassandra.cql3.Constants$Setter.execute(Constants.java:485) at org.apache.cassandra.cql3.statements.UpdateStatement.addUpdateForKey(UpdateStatement.java:92) at org.apache.cassandra.cql3.statements.ModificationStatement.addUpdates(ModificationStatement.java:799)

Ok, looking at this again, I can see that a vector method has been added to allow the usage of vectors in CQLTester but I'm still slightly confused about what the drivers will eventually accept for binding values. Do you have any thoughts about what this will be? As in what is the wire representation going to be?

But, in answer to your response, I don't think we need a Vector.Marker because we are dealing with an effectively frozen type.

Ok, looking at this again, I can see that a vector method has been added to allow the usage of vectors in CQLTester

CQLTester does not use the schema to figure out how to serialize bind values, it uses java types to figure that out; for this reason List can be confusing, so added a Vector type in CQLTester only for this work.

but I'm still slightly confused about what the drivers will eventually accept for binding values

I believe this is up to drivers... the CQL syntax matches list, so to me accepting the same types makes sense. Java driver could accept array, but that adds a lot of complexity to the driver I would think as you would need a serializer for every primitive as well.

I guess this all boils down to how driver works; if you know the schema then it doesn't matter too much, but if you do not know the schema (such as CQLTester) then you would need type support to figure this out... I feel like we have to know the schema as stuff like python would need this? Not familiar with that layer so can not say...

I feel the worst case scenario is to add a Vector type to each driver, similar to how tuples/udts work... I would personally find that more annoying than working with a List in java, but it does solve the type mapping to serializer if we don't leverage the schema.

As in what is the wire representation going to be?

There are 2 different protocols: fixed length, and variable length.

For fixed length types, the format is N sequence of bytes where those bytes are the serialized format of the type; below is a int example

int int int ... int

For variable length types, the format is the same as ListType, but without the list size prefix at the top, so its

( element_size: int element_bytes: byte[] )+

Ok, I'm happy with this. I have a minor concern about having different representations for fixed and variable types but I'm pretty certain that the drivers can handle this based on the schema.

I think the one driver edge case to worry about is ShortType... it "should" be fixed length, but for w/e reason it isn't, which means we have a int header that always has the value 2...

src/java/org/apache/cassandra/cql3/Vectors.java

src/java/org/apache/cassandra/db/marshal/VectorType.java

mike-tr-adamson · 2023-05-12T15:08:09Z

I'd like to put some questions here that are outside of the review and are mainly related to the scope of the ticket and what we are expecting to support at the end.

The tickets states that we are producing a fixed-length, frozen array with non-null values. Are we expecting these values to be primitive types? What I mean by this is could we expect something like INSERT INTO t (pk, vec) VALUES (0, [1.0, sum(1.0, 2.0), 3.0])? Meaning do we want to support function calls in the creation of the array?

If we disallow this (which would be my preference) we could simplify our type support greatly by removing the need for non-terminal support.

@dcapwell WDYT?

dcapwell · 2023-05-12T16:15:29Z

Are we expecting these values to be primitive types

Nope, this limitation was removed in the dev mailing list. The requirement is support for any type, including stuff like map and text. My stance is that the complexity for the type isn't more complex to have such a case, as a ShortType is variable length (don't know history on why... its fixed length but the type says its not...), so is IntegerType... so by allowing fixed and variable length types, we make sure that all primitive types work, and users can choose to use this type in other, non ML, use cases.

INSERT INTO t (pk, vec) VALUES (0, [1.0, sum(1.0, 2.0), 3.0])

I never remember how functions work in CQL... in most DBs I expect sum(1, 2) to be converted to 3 at binding time, but I think we defer these until execution time... ignoring this detail, conceptually yes, we should allow such behavior as long as the grammar allows (grammar doesn't allow functions in all places). If the grammar supports, I can work on a test to make sure this is handled.

I actually created a CQL fuzz tester for Accord, but been debating if I should add to tree or not (there is overlap with what Harry does... which is why I question)... That fuzz tester builds a AST of what is allowed in CQL for each type, so would be a perfect way to make sure this behavior works... I tend to prefer generic/property tests rather than single type tests... which is why this patch is so large <_<...

If we disallow this (which would be my preference) we could simplify our type support greatly by removing the need for non-terminal support.

My personal stance is that CQL is way too much special casing, so as a user I have a hard time knowing what to expect (I hate frozen vs multi-cell... they define their own semantics, so you have to know implementation details of the db...)... if the grammar allows functions in the value, then VectorType must allow as well...

Thanks @mike-tr-adamson for the review, ill make a few changes and let you know when I push

dcapwell · 2023-05-12T20:12:30Z

What I mean by this is could we expect something like INSERT INTO t (pk, vec) VALUES (0, [1.0, sum(1.0, 2.0), 3.0])? Meaning do we want to support function calls in the creation of the array?

I created tests for this, this works just fine. I can't use sum as that is an aggregate function, so switched to 1 + 2

dcapwell · 2023-05-12T20:12:44Z

@mike-tr-adamson believe I fixed or addressed all comments

src/java/org/apache/cassandra/cql3/Vectors.java

src/java/org/apache/cassandra/db/marshal/VectorType.java

Maxwell-Guo · 2023-05-15T06:15:12Z

src/java/org/apache/cassandra/cql3/CQL3Type.java

+     */
+    default ByteBuffer fromCQLLiteral(String literal)
+    {
+        return fromCQLLiteral("--dummy--", literal);


so why it is "--dummy--"?

because @mike-tr-adamson gave that example =).

I see, here have
already use this key word.
Besides ,what about make the "--dummy--" a final String in SchemaConstants.java with some descriptions, like
// some descriptions
public final String DUMMY = "--dummy--"; // SchemaConstants.java

because I think if we do not explain for the reason , then for every subsequent developer who may see this word, they will ask why.
@mike-tr-adamson WDYT?

I think my main concern is that this may not be valid for for all types. As it happens this method is only used in AbstractTypeTest. I think it would make sense to remove this method from here to the test so it isn't accidentally used in live code without clear understanding of the effects.

Maxwell-Guo · 2023-05-15T06:22:13Z

test/unit/org/apache/cassandra/cql3/CQLTester.java

                    Object actualValueDecoded = actualValue == null ? null : column.type.getSerializer().deserialize(actualValue);
                    if (!Objects.equal(expected != null ? expected[j] : null, actualValueDecoded))
+                    {
+                        //TODO confirm this isn't a bug...


//TODO confirm this isn't a bug... -----> // TODO confirm this isn't a bug...

why is this? At least in IntelliJ //TODO causes the comments under this to be flagged as part of the TODO, which is true for this case... If I switch to // TODO that goes away and only this line is flagged... Not actually sure that this is a feature in IntelliJ... I just noticed when I made this change...

I don't think we have a style guide comment for this, so prefer to keep

//TODO
// TODO
it seems both are all exists, so I think we can just keep it.

Maxwell-Guo · 2023-05-15T06:27:27Z

test/unit/org/apache/cassandra/cql3/validation/operations/CQLVectorTest.java

+        execute("INSERT INTO %s (pk, value) VALUES (0, [1, 1 + (int) ?])", 1);
+        test.run();
+    }
+


do we still need test cases for tables that have verctor type in both pk and non-pks ?

yes, they are different code paths; testing one doesn't mean you fully work, the other code path may do something unexpected.

Maxwell-Guo · 2023-05-15T06:36:52Z

test/unit/org/apache/cassandra/cql3/validation/operations/CQLVectorTest.java

+        assertRows(execute("SELECT * FROM %s WHERE pk IN ([1, 1 + (int) ?])", 1), row(list(1, 2)));
+
+        assertRows(execute("SELECT * FROM %s WHERE pk > [0, 0] AND pk < [1, 3] ALLOW FILTERING"), row(list(1, 2)));
+        assertRows(execute("SELECT * FROM %s WHERE token(pk) = token([1, 2])"), row(list(1, 2)));


do we need some more test cases when element type is not just int ,decimal,collection type or udt (if not suporrt now an exception is need )?

and for compare order , pk > [0, 0] and <[1, 3] which means (the first index element in array is big than 0 and less than 1 )& (the secondary index element is big than 0 and less than 3) , or not & but | ? I think we should declaring these relationships clearly

do we need some more test cases when element type is not just int ,decimal,collection type or udt (if not suporrt now an exception is need )?

See org.apache.cassandra.cql3.RandomSchemaTest

and for compare order , pk > [0, 0] and <[1, 3] which means (the first index element in array is big than 0 and less than 1 )& (the secondary index element is big than 0 and less than 3)

I wouldn't say that... from CQL point of view I am saying to compare specific values, so is similar to the case where pk is a int and do pk > 0 AND pk < 3. For the type, the ordering is similar to list and based off the elements... so the following would also match: [0, 1], and [1, 2], and [0, 10991038080]

I think we should declaring these relationships clearly

I feel this is very clear, what would you recommend to improve?

ok, after see the compareCustom of vertorType ,I got the really meaning of this test case.

mike-tr-adamson · 2023-05-15T10:05:46Z

src/antlr/Parser.g

-    @after {$value = new Lists.Literal(l);}
-    : '[' ( t1=term { l.add(t1); } ( ',' tn=term { l.add(tn); } )* )? ']' { $value = new Lists.Literal(l); }
+    @after {$value = new ArrayLiteral(l);}
+    : '[' ( t1=term { l.add(t1); } ( ',' tn=term { l.add(tn); } )* )? ']' { $value = new ArrayLiteral(l); }


I noticed this previously but isn't this a duplication? We either need @after {$value = new ArrayLiteral(l);} or we need { $value = new ArrayLiteral(l); } but I don't think we need both of them.

I agree, but parser scares me as we don't have a lot of edge case testing in-tree atm... so I strongly prefer not changing this in this patch

Fair enough, although, I have changed it locally and done a lot of testing on it without breaking anything. I'm happy to leave as is for the the time being and fix it later.

mike-tr-adamson · 2023-05-15T12:52:36Z

src/java/org/apache/cassandra/serializers/TypeSerializer.java

    public abstract Class<T> getType();

-    public String toCQLLiteral(ByteBuffer buffer)
+    protected String toCQLLiteralNonNull(ByteBuffer buffer)


nit: Maybe add @Nonnull to parameter?

made this to param and return for each new type

mike-tr-adamson · 2023-05-15T12:52:46Z

src/java/org/apache/cassandra/serializers/TypeSerializer.java

+
+    public boolean shouldQuoteCQL()
+    {
+        return true;


Why do we default to true here? If it's strings that need quoting then wouldn't it be better to default to false and override this in AbstractTextSerializer?

mostly because it looked like there were more types needing than those that exclude.

If it's strings that need quoting then wouldn't it be better to default to false and override this in AbstractTextSerializer?

Most types looked to be strings, but do not use AbstractTypeSerializer... one negative of this is that collections became strings (which does still work in CQL), so switching to false may help not quote when we don't need to... ill test out swapping

made the change, less types need to override if the default is false; just making sure to run serde many times to see if any case pops up

dcapwell · 2023-05-15T16:43:37Z

@mike-tr-adamson @Maxwell-Guo replied to or fixed all feedback

dcapwell · 2023-05-15T16:53:02Z

src/java/org/apache/cassandra/db/marshal/PartitionerDefinedOrder.java

    }

+    @Override
+    public <V> boolean isNull(V buffer, ValueAccessor<V> accessor)


NOTE: text and byte[] treat the empty set as non-null, so this class will convert those to null... I didn't want to really go down the rabbit hole to figure out the impact to this... but this isn't "correct" from the types point of view...

dcapwell · 2023-05-15T17:36:56Z

test/unit/org/apache/cassandra/cql3/CQL3TypeLiteralTest.java

- * Test functionality to re-create a CQL literal from its serialized representation.
- * This test uses some randomness to generate the values and nested structures (collections,tuples,UDTs).
- */
-public class CQL3TypeLiteralTest


deleted this test as its a duplicate of AbstractTypeTest.serde. This test actually showed we were doing the wrong thing, it built its own model of what CQL would accept and made sure types matched its model; but if you ask CQL to parse it will fail as the model wasn't correct... Rather than fixing, was easier to remove as the new serde test covers this completely.

dcapwell · 2023-05-15T17:39:20Z

test/unit/org/apache/cassandra/cql3/RandomSchemaTest.java

+            TypeGenBuilder nonEmptyNoDuration = AbstractTypeGenerators.builder()
+                                                                      .withoutEmpty()
+                                                                      .withUserTypeKeyspace(KEYSPACE)
+                                                                      .withoutPrimitive(DurationType.instance);


DurationType violates several properties and is banned in several places do to these bugs... so exclude from this test as this test makes sure types respect their given properties

dcapwell · 2023-05-15T17:40:04Z

test/unit/org/apache/cassandra/cql3/RandomSchemaTest.java

+        Gen<Boolean> nulls = SourceDSL.integers().between(1, 100).map(i -> i < 5);
+        qt().checkAssert(random -> {
+            TypeGenBuilder nonEmptyNoDuration = AbstractTypeGenerators.builder()
+                                                                      .withoutEmpty()


legacy 3.x -> 4.x type... yay COMPACT STORAGE. This type is allowed in some places and not all, so easier to block in the test than try to work around it

dcapwell · 2023-05-15T17:41:19Z

test/unit/org/apache/cassandra/cql3/RandomSchemaTest.java

+                                     .withPartitionColumnsCount(1)
+                                     .withPrimaryColumnTypeGen(new TypeGenBuilder(nonEmptyNoDuration).withMaxDepth(2).build())
+                                     .withClusteringColumnsBetween(1, 2)


the reason for 1 partition and 1-2 clustering is to keep the size small.. we have a limit of 64k and with larger sizes its easier to hit that....

…supported

…so conflicts may happen

…sts, this way they can add their own filters without impacting the others

…fused

dcapwell commented May 5, 2023

View reviewed changes

src/java/org/apache/cassandra/cql3/Lists.java Outdated Show resolved Hide resolved

dcapwell commented May 5, 2023

View reviewed changes

src/java/org/apache/cassandra/utils/vint/VIntCoding.java Outdated Show resolved Hide resolved

dcapwell commented May 5, 2023

View reviewed changes

test/unit/org/apache/cassandra/cql3/CQLTester.java Outdated Show resolved Hide resolved

dcapwell commented May 5, 2023

View reviewed changes

test/unit/org/apache/cassandra/db/marshal/AbstractTypeTest.java Outdated Show resolved Hide resolved

dcapwell commented May 6, 2023

View reviewed changes

src/java/org/apache/cassandra/db/marshal/VectorType.java Outdated Show resolved Hide resolved

dcapwell force-pushed the CASSANDRA-18504 branch from 1193667 to d2479ec Compare May 10, 2023 17:03

mike-tr-adamson reviewed May 12, 2023

View reviewed changes

src/java/org/apache/cassandra/cql3/Vectors.java Outdated Show resolved Hide resolved

mike-tr-adamson reviewed May 12, 2023

View reviewed changes

src/java/org/apache/cassandra/db/marshal/VectorType.java Outdated Show resolved Hide resolved

mike-tr-adamson reviewed May 12, 2023

View reviewed changes

src/java/org/apache/cassandra/db/marshal/VectorType.java Outdated Show resolved Hide resolved

dcapwell force-pushed the CASSANDRA-18504 branch from d39bbfc to fec98d7 Compare May 12, 2023 19:50

Maxwell-Guo reviewed May 15, 2023

View reviewed changes

mike-tr-adamson reviewed May 15, 2023

View reviewed changes

dcapwell commented May 15, 2023

View reviewed changes

dcapwell added 27 commits June 21, 2023 09:12

remove SSTableHeaderFix as its no longer needed in 5.0

e1fd3af

random schema now uses random memtable impl

75592af

random schema test now overrides sstable and randomly selects w/e is …

3ae020d

…supported

fixed ScrubToolTest

9276e06

fixed StandaloneScrubberTest to handle the fact fix header is gone

1b04991

fixed flakey test where double type converts nan/infiiity to null

57c4024

imports

824ac98

switch to isVector

56a916b

switch to isVector

0aaaf10

remove todo

7aca77c

cleanup

d88966a

remove todo that is no longer needed

aed1da9

float has the same issue as double for json

2db5aa1

when using java driver use by index apis as by name uses toLowerCase …

8eef7f8

…so conflicts may happen

moved to/from string and to/from literal from serde and into 2 new te…

3e86472

…sts, this way they can add their own filters without impacting the others

ordering now limits compose type to try to limit the size of the values

b49e740

comparable bytes also limits composiite

b2ad11e

news

bad1c88

feedback

5d2a3bd

feedback

ea8a886

feedback

fbc6d15

removed a lot of compiler warnings

4215faf

fix compiler error

213e904

added java JIRA link

7b8835e

higher

3ba2352

doc from Andres

d4dab5c

revert ci

5e5a921

dcapwell force-pushed the CASSANDRA-18504 branch from e137349 to 5e5a921 Compare June 21, 2023 16:22

move udt type name dedup to the test as the multiple builders get con…

0613328

…fused

smiklosovic closed this Jun 27, 2023

CASSANDRA-18504: Added support for type VECTOR<type, dimension> #2310

CASSANDRA-18504: Added support for type VECTOR<type, dimension> #2310

Uh oh!

Conversation

dcapwell commented May 5, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dcapwell May 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mike-tr-adamson commented May 9, 2023

Uh oh!

dcapwell commented May 11, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mike-tr-adamson commented May 12, 2023

Uh oh!

dcapwell commented May 12, 2023

Uh oh!

dcapwell commented May 12, 2023

Uh oh!

dcapwell commented May 12, 2023

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Maxwell-Guo May 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dcapwell May 5, 2023 •

edited

Loading

Maxwell-Guo May 16, 2023 •

edited

Loading

mike-tr-adamson May 15, 2023 •

edited

Loading