-
Notifications
You must be signed in to change notification settings - Fork 3.8k
CASSANDRA-18504: Added support for type VECTOR<type, dimension> #2310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
failing atm, few types remain to be added to generators...
java.lang.AssertionError: Uncovered types:
org.apache.cassandra.db.marshal.LexicalUUIDType
org.apache.cassandra.db.marshal.DynamicCompositeType.FixedValueComparator
org.apache.cassandra.db.marshal.PartitionerDefinedOrder
org.apache.cassandra.db.marshal.DateType
org.apache.cassandra.db.marshal.CounterColumnType
org.apache.cassandra.db.marshal.IntegerType
org.apache.cassandra.db.marshal.DecimalType
org.apache.cassandra.db.marshal.SimpleDateType
org.apache.cassandra.db.marshal.TimeType
org.apache.cassandra.db.marshal.DynamicCompositeType
org.apache.cassandra.db.marshal.CompositeType
org.apache.cassandra.db.marshal.LegacyTimeUUIDType
org.apache.cassandra.db.marshal.TimeUUIDType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that this comment is now invalid because it looks like you have added type support for these types. Can you confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are a small handful left to add, but those are not too hard to do. This comment isn't 100% accurate as some types were added but this test still fails waiting for 100% coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know there are only 2 versions and I filter 2 out... but this allows for new versions to get tested by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are working with non-null data, so this can't be null
test/unit/org/apache/cassandra/db/marshal/AbstractTypeTest.java
Outdated
Show resolved
Hide resolved
|
@dcapwell Is this ready for review or are you still actively working on it? |
yes, its ready for review... I make tweaks over time as I think about it, but that's mostly to add more tests; the core patch is stable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add a Marker extends AbstractMarker implementation to this class to handle bindings correctly. We then need to add to AbstractMarker.Raw.prepare to use this class:
else if (receiver.type instanceof VectorType)
{
return new Vectors.Marker(bindIndex, receiver);
}
This allows us to correctly bind variables as vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you provide a CQL where this matters? In testing I see we become return new Constants.Marker(bindIndex, receiver);, which is fine to me as that validates with receiver.type.validate(value);
For list example, that produces a Value that split the list, so you have List<ByteBuffer>... since we are frozen, do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added tests for select and IN clause... works with and without binds...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are you binding values? If I try the following:
createTable("CREATE TABLE %s (pk int, val vector<float, 3>, PRIMARY KEY(pk))");
execute("INSERT INTO %s (pk, val) VALUES (0, ?)", new float[]{ 1.0f, 2.0f, 3.0f });
I get the following error:
java.lang.IllegalArgumentException: Unsupported value type (value is [F@389a5c0b)
at org.apache.cassandra.cql3.CQLTester.typeFor(CQLTester.java:2433)
at org.apache.cassandra.cql3.CQLTester.transformValues(CQLTester.java:2077)
at org.apache.cassandra.cql3.CQLTester.executeFormattedQuery(CQLTester.java:1434)
at org.apache.cassandra.cql3.CQLTester.execute(CQLTester.java:1413)
and if I try:
createTable("CREATE TABLE %s (pk int, val vector<float, 3>, PRIMARY KEY(pk))");
execute("INSERT INTO %s (pk, val) VALUES (0, ?)", Lists.newArrayList(1.0f, 2.0f, 3.0f));
I get:
org.apache.cassandra.exceptions.InvalidRequestException: Unexpected extraneous bytes after list value
at org.apache.cassandra.cql3.Constants$Marker.bindAndGet(Constants.java:461)
at org.apache.cassandra.cql3.Constants$Setter.execute(Constants.java:485)
at org.apache.cassandra.cql3.statements.UpdateStatement.addUpdateForKey(UpdateStatement.java:92)
at org.apache.cassandra.cql3.statements.ModificationStatement.addUpdates(ModificationStatement.java:799)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, looking at this again, I can see that a vector method has been added to allow the usage of vectors in CQLTester but I'm still slightly confused about what the drivers will eventually accept for binding values. Do you have any thoughts about what this will be? As in what is the wire representation going to be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, in answer to your response, I don't think we need a Vector.Marker because we are dealing with an effectively frozen type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, looking at this again, I can see that a vector method has been added to allow the usage of vectors in CQLTester
CQLTester does not use the schema to figure out how to serialize bind values, it uses java types to figure that out; for this reason List can be confusing, so added a Vector type in CQLTester only for this work.
but I'm still slightly confused about what the drivers will eventually accept for binding values
I believe this is up to drivers... the CQL syntax matches list, so to me accepting the same types makes sense. Java driver could accept array, but that adds a lot of complexity to the driver I would think as you would need a serializer for every primitive as well.
I guess this all boils down to how driver works; if you know the schema then it doesn't matter too much, but if you do not know the schema (such as CQLTester) then you would need type support to figure this out... I feel like we have to know the schema as stuff like python would need this? Not familiar with that layer so can not say...
I feel the worst case scenario is to add a Vector type to each driver, similar to how tuples/udts work... I would personally find that more annoying than working with a List in java, but it does solve the type mapping to serializer if we don't leverage the schema.
As in what is the wire representation going to be?
There are 2 different protocols: fixed length, and variable length.
For fixed length types, the format is N sequence of bytes where those bytes are the serialized format of the type; below is a int example
int
int
int
...
int
For variable length types, the format is the same as ListType, but without the list size prefix at the top, so its
(
element_size: int
element_bytes: byte[]
)+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'm happy with this. I have a minor concern about having different representations for fixed and variable types but I'm pretty certain that the drivers can handle this based on the schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the one driver edge case to worry about is ShortType... it "should" be fixed length, but for w/e reason it isn't, which means we have a int header that always has the value 2...
|
I'd like to put some questions here that are outside of the review and are mainly related to the scope of the ticket and what we are expecting to support at the end. The tickets states that we are producing a fixed-length, frozen array with non-null values. Are we expecting these values to be primitive types? What I mean by this is could we expect something like If we disallow this (which would be my preference) we could simplify our type support greatly by removing the need for non-terminal support. @dcapwell WDYT? |
Nope, this limitation was removed in the dev mailing list. The requirement is support for any type, including stuff like
I never remember how functions work in CQL... in most DBs I expect I actually created a CQL fuzz tester for Accord, but been debating if I should add to tree or not (there is overlap with what Harry does... which is why I question)... That fuzz tester builds a AST of what is allowed in CQL for each type, so would be a perfect way to make sure this behavior works... I tend to prefer generic/property tests rather than single type tests... which is why this patch is so large <_<...
My personal stance is that CQL is way too much special casing, so as a user I have a hard time knowing what to expect (I hate frozen vs multi-cell... they define their own semantics, so you have to know implementation details of the db...)... if the grammar allows functions in the value, then Thanks @mike-tr-adamson for the review, ill make a few changes and let you know when I push |
I created tests for this, this works just fine. I can't use |
|
@mike-tr-adamson believe I fixed or addressed all comments |
| */ | ||
| default ByteBuffer fromCQLLiteral(String literal) | ||
| { | ||
| return fromCQLLiteral("--dummy--", literal); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so why it is "--dummy--"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because @mike-tr-adamson gave that example =).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, here have
already use this key word.
Besides ,what about make the "--dummy--" a final String in SchemaConstants.java with some descriptions, like
// some descriptions
public final String DUMMY = "--dummy--"; // SchemaConstants.java
because I think if we do not explain for the reason , then for every subsequent developer who may see this word, they will ask why.
@mike-tr-adamson WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my main concern is that this may not be valid for for all types. As it happens this method is only used in AbstractTypeTest. I think it would make sense to remove this method from here to the test so it isn't accidentally used in live code without clear understanding of the effects.
| Object actualValueDecoded = actualValue == null ? null : column.type.getSerializer().deserialize(actualValue); | ||
| if (!Objects.equal(expected != null ? expected[j] : null, actualValueDecoded)) | ||
| { | ||
| //TODO confirm this isn't a bug... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//TODO confirm this isn't a bug... -----> // TODO confirm this isn't a bug...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this? At least in IntelliJ //TODO causes the comments under this to be flagged as part of the TODO, which is true for this case... If I switch to // TODO that goes away and only this line is flagged... Not actually sure that this is a feature in IntelliJ... I just noticed when I made this change...
I don't think we have a style guide comment for this, so prefer to keep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| execute("INSERT INTO %s (pk, value) VALUES (0, [1, 1 + (int) ?])", 1); | ||
| test.run(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need test cases for tables that have verctor type in both pk and non-pks ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, they are different code paths; testing one doesn't mean you fully work, the other code path may do something unexpected.
| assertRows(execute("SELECT * FROM %s WHERE pk IN ([1, 1 + (int) ?])", 1), row(list(1, 2))); | ||
|
|
||
| assertRows(execute("SELECT * FROM %s WHERE pk > [0, 0] AND pk < [1, 3] ALLOW FILTERING"), row(list(1, 2))); | ||
| assertRows(execute("SELECT * FROM %s WHERE token(pk) = token([1, 2])"), row(list(1, 2))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need some more test cases when element type is not just int ,decimal,collection type or udt (if not suporrt now an exception is need )?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and for compare order , pk > [0, 0] and <[1, 3] which means (the first index element in array is big than 0 and less than 1 )& (the secondary index element is big than 0 and less than 3) , or not & but | ? I think we should declaring these relationships clearly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need some more test cases when element type is not just int ,decimal,collection type or udt (if not suporrt now an exception is need )?
See org.apache.cassandra.cql3.RandomSchemaTest
and for compare order , pk > [0, 0] and <[1, 3] which means (the first index element in array is big than 0 and less than 1 )& (the secondary index element is big than 0 and less than 3)
I wouldn't say that... from CQL point of view I am saying to compare specific values, so is similar to the case where pk is a int and do pk > 0 AND pk < 3. For the type, the ordering is similar to list and based off the elements... so the following would also match: [0, 1], and [1, 2], and [0, 10991038080]
I think we should declaring these relationships clearly
I feel this is very clear, what would you recommend to improve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, after see the compareCustom of vertorType ,I got the really meaning of this test case.
| @after {$value = new Lists.Literal(l);} | ||
| : '[' ( t1=term { l.add(t1); } ( ',' tn=term { l.add(tn); } )* )? ']' { $value = new Lists.Literal(l); } | ||
| @after {$value = new ArrayLiteral(l);} | ||
| : '[' ( t1=term { l.add(t1); } ( ',' tn=term { l.add(tn); } )* )? ']' { $value = new ArrayLiteral(l); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed this previously but isn't this a duplication? We either need @after {$value = new ArrayLiteral(l);} or we need { $value = new ArrayLiteral(l); } but I don't think we need both of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but parser scares me as we don't have a lot of edge case testing in-tree atm... so I strongly prefer not changing this in this patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, although, I have changed it locally and done a lot of testing on it without breaking anything. I'm happy to leave as is for the the time being and fix it later.
| public abstract Class<T> getType(); | ||
|
|
||
| public String toCQLLiteral(ByteBuffer buffer) | ||
| protected String toCQLLiteralNonNull(ByteBuffer buffer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe add @Nonnull to parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made this to param and return for each new type
|
|
||
| public boolean shouldQuoteCQL() | ||
| { | ||
| return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we default to true here? If it's strings that need quoting then wouldn't it be better to default to false and override this in AbstractTextSerializer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly because it looked like there were more types needing than those that exclude.
If it's strings that need quoting then wouldn't it be better to default to false and override this in AbstractTextSerializer?
Most types looked to be strings, but do not use AbstractTypeSerializer... one negative of this is that collections became strings (which does still work in CQL), so switching to false may help not quote when we don't need to... ill test out swapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made the change, less types need to override if the default is false; just making sure to run serde many times to see if any case pops up
|
@mike-tr-adamson @Maxwell-Guo replied to or fixed all feedback |
| } | ||
|
|
||
| @Override | ||
| public <V> boolean isNull(V buffer, ValueAccessor<V> accessor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: text and byte[] treat the empty set as non-null, so this class will convert those to null... I didn't want to really go down the rabbit hole to figure out the impact to this... but this isn't "correct" from the types point of view...
| * Test functionality to re-create a CQL literal from its serialized representation. | ||
| * This test uses some randomness to generate the values and nested structures (collections,tuples,UDTs). | ||
| */ | ||
| public class CQL3TypeLiteralTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted this test as its a duplicate of AbstractTypeTest.serde. This test actually showed we were doing the wrong thing, it built its own model of what CQL would accept and made sure types matched its model; but if you ask CQL to parse it will fail as the model wasn't correct... Rather than fixing, was easier to remove as the new serde test covers this completely.
| TypeGenBuilder nonEmptyNoDuration = AbstractTypeGenerators.builder() | ||
| .withoutEmpty() | ||
| .withUserTypeKeyspace(KEYSPACE) | ||
| .withoutPrimitive(DurationType.instance); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DurationType violates several properties and is banned in several places do to these bugs... so exclude from this test as this test makes sure types respect their given properties
| Gen<Boolean> nulls = SourceDSL.integers().between(1, 100).map(i -> i < 5); | ||
| qt().checkAssert(random -> { | ||
| TypeGenBuilder nonEmptyNoDuration = AbstractTypeGenerators.builder() | ||
| .withoutEmpty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
legacy 3.x -> 4.x type... yay COMPACT STORAGE. This type is allowed in some places and not all, so easier to block in the test than try to work around it
| .withPartitionColumnsCount(1) | ||
| .withPrimaryColumnTypeGen(new TypeGenBuilder(nonEmptyNoDuration).withMaxDepth(2).build()) | ||
| .withClusteringColumnsBetween(1, 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason for 1 partition and 1-2 clustering is to keep the size small.. we have a limit of 64k and with larger sizes its easier to hit that....
…so conflicts may happen
…sts, this way they can add their own filters without impacting the others
e137349 to
5e5a921
Compare
No description provided.