Skip to content

Conversation

@dcapwell
Copy link
Contributor

@dcapwell dcapwell commented May 5, 2023

No description provided.

Copy link
Contributor Author

@dcapwell dcapwell May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failing atm, few types remain to be added to generators...

java.lang.AssertionError: Uncovered types:
org.apache.cassandra.db.marshal.LexicalUUIDType
org.apache.cassandra.db.marshal.DynamicCompositeType.FixedValueComparator
org.apache.cassandra.db.marshal.PartitionerDefinedOrder
org.apache.cassandra.db.marshal.DateType
org.apache.cassandra.db.marshal.CounterColumnType
org.apache.cassandra.db.marshal.IntegerType
org.apache.cassandra.db.marshal.DecimalType
org.apache.cassandra.db.marshal.SimpleDateType
org.apache.cassandra.db.marshal.TimeType
org.apache.cassandra.db.marshal.DynamicCompositeType
org.apache.cassandra.db.marshal.CompositeType
org.apache.cassandra.db.marshal.LegacyTimeUUIDType
org.apache.cassandra.db.marshal.TimeUUIDType

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that this comment is now invalid because it looks like you have added type support for these types. Can you confirm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are a small handful left to add, but those are not too hard to do. This comment isn't 100% accurate as some types were added but this test still fails waiting for 100% coverage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know there are only 2 versions and I filter 2 out... but this allows for new versions to get tested by default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are working with non-null data, so this can't be null

@mike-tr-adamson
Copy link
Contributor

@dcapwell Is this ready for review or are you still actively working on it?

@dcapwell
Copy link
Contributor Author

@mike-tr-adamson

Is this ready for review or are you still actively working on it?

yes, its ready for review... I make tweaks over time as I think about it, but that's mostly to add more tests; the core patch is stable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add a Marker extends AbstractMarker implementation to this class to handle bindings correctly. We then need to add to AbstractMarker.Raw.prepare to use this class:

            else if (receiver.type instanceof VectorType)
            {
                return new Vectors.Marker(bindIndex, receiver);
            }

This allows us to correctly bind variables as vectors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you provide a CQL where this matters? In testing I see we become return new Constants.Marker(bindIndex, receiver);, which is fine to me as that validates with receiver.type.validate(value);

For list example, that produces a Value that split the list, so you have List<ByteBuffer>... since we are frozen, do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added tests for select and IN clause... works with and without binds...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are you binding values? If I try the following:

createTable("CREATE TABLE %s (pk int, val vector<float, 3>, PRIMARY KEY(pk))");
execute("INSERT INTO %s (pk, val) VALUES (0, ?)", new float[]{ 1.0f, 2.0f, 3.0f });

I get the following error:

java.lang.IllegalArgumentException: Unsupported value type (value is [F@389a5c0b)

	at org.apache.cassandra.cql3.CQLTester.typeFor(CQLTester.java:2433)
	at org.apache.cassandra.cql3.CQLTester.transformValues(CQLTester.java:2077)
	at org.apache.cassandra.cql3.CQLTester.executeFormattedQuery(CQLTester.java:1434)
	at org.apache.cassandra.cql3.CQLTester.execute(CQLTester.java:1413)

and if I try:

createTable("CREATE TABLE %s (pk int, val vector<float, 3>, PRIMARY KEY(pk))");
execute("INSERT INTO %s (pk, val) VALUES (0, ?)", Lists.newArrayList(1.0f, 2.0f, 3.0f));

I get:

org.apache.cassandra.exceptions.InvalidRequestException: Unexpected extraneous bytes after list value

	at org.apache.cassandra.cql3.Constants$Marker.bindAndGet(Constants.java:461)
	at org.apache.cassandra.cql3.Constants$Setter.execute(Constants.java:485)
	at org.apache.cassandra.cql3.statements.UpdateStatement.addUpdateForKey(UpdateStatement.java:92)
	at org.apache.cassandra.cql3.statements.ModificationStatement.addUpdates(ModificationStatement.java:799)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, looking at this again, I can see that a vector method has been added to allow the usage of vectors in CQLTester but I'm still slightly confused about what the drivers will eventually accept for binding values. Do you have any thoughts about what this will be? As in what is the wire representation going to be?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, in answer to your response, I don't think we need a Vector.Marker because we are dealing with an effectively frozen type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, looking at this again, I can see that a vector method has been added to allow the usage of vectors in CQLTester

CQLTester does not use the schema to figure out how to serialize bind values, it uses java types to figure that out; for this reason List can be confusing, so added a Vector type in CQLTester only for this work.

but I'm still slightly confused about what the drivers will eventually accept for binding values

I believe this is up to drivers... the CQL syntax matches list, so to me accepting the same types makes sense. Java driver could accept array, but that adds a lot of complexity to the driver I would think as you would need a serializer for every primitive as well.

I guess this all boils down to how driver works; if you know the schema then it doesn't matter too much, but if you do not know the schema (such as CQLTester) then you would need type support to figure this out... I feel like we have to know the schema as stuff like python would need this? Not familiar with that layer so can not say...

I feel the worst case scenario is to add a Vector type to each driver, similar to how tuples/udts work... I would personally find that more annoying than working with a List in java, but it does solve the type mapping to serializer if we don't leverage the schema.

As in what is the wire representation going to be?

There are 2 different protocols: fixed length, and variable length.

For fixed length types, the format is N sequence of bytes where those bytes are the serialized format of the type; below is a int example

int
int
int
...
int

For variable length types, the format is the same as ListType, but without the list size prefix at the top, so its

(
element_size: int
element_bytes: byte[]
)+

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'm happy with this. I have a minor concern about having different representations for fixed and variable types but I'm pretty certain that the drivers can handle this based on the schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the one driver edge case to worry about is ShortType... it "should" be fixed length, but for w/e reason it isn't, which means we have a int header that always has the value 2...

@mike-tr-adamson
Copy link
Contributor

I'd like to put some questions here that are outside of the review and are mainly related to the scope of the ticket and what we are expecting to support at the end.

The tickets states that we are producing a fixed-length, frozen array with non-null values. Are we expecting these values to be primitive types? What I mean by this is could we expect something like INSERT INTO t (pk, vec) VALUES (0, [1.0, sum(1.0, 2.0), 3.0])? Meaning do we want to support function calls in the creation of the array?

If we disallow this (which would be my preference) we could simplify our type support greatly by removing the need for non-terminal support.

@dcapwell WDYT?

@dcapwell
Copy link
Contributor Author

Are we expecting these values to be primitive types

Nope, this limitation was removed in the dev mailing list. The requirement is support for any type, including stuff like map and text. My stance is that the complexity for the type isn't more complex to have such a case, as a ShortType is variable length (don't know history on why... its fixed length but the type says its not...), so is IntegerType... so by allowing fixed and variable length types, we make sure that all primitive types work, and users can choose to use this type in other, non ML, use cases.

INSERT INTO t (pk, vec) VALUES (0, [1.0, sum(1.0, 2.0), 3.0])

I never remember how functions work in CQL... in most DBs I expect sum(1, 2) to be converted to 3 at binding time, but I think we defer these until execution time... ignoring this detail, conceptually yes, we should allow such behavior as long as the grammar allows (grammar doesn't allow functions in all places). If the grammar supports, I can work on a test to make sure this is handled.

I actually created a CQL fuzz tester for Accord, but been debating if I should add to tree or not (there is overlap with what Harry does... which is why I question)... That fuzz tester builds a AST of what is allowed in CQL for each type, so would be a perfect way to make sure this behavior works... I tend to prefer generic/property tests rather than single type tests... which is why this patch is so large <_<...

If we disallow this (which would be my preference) we could simplify our type support greatly by removing the need for non-terminal support.

My personal stance is that CQL is way too much special casing, so as a user I have a hard time knowing what to expect (I hate frozen vs multi-cell... they define their own semantics, so you have to know implementation details of the db...)... if the grammar allows functions in the value, then VectorType must allow as well...

Thanks @mike-tr-adamson for the review, ill make a few changes and let you know when I push

@dcapwell
Copy link
Contributor Author

What I mean by this is could we expect something like INSERT INTO t (pk, vec) VALUES (0, [1.0, sum(1.0, 2.0), 3.0])? Meaning do we want to support function calls in the creation of the array?

I created tests for this, this works just fine. I can't use sum as that is an aggregate function, so switched to 1 + 2

@dcapwell
Copy link
Contributor Author

@mike-tr-adamson believe I fixed or addressed all comments

*/
default ByteBuffer fromCQLLiteral(String literal)
{
return fromCQLLiteral("--dummy--", literal);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so why it is "--dummy--"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because @mike-tr-adamson gave that example =).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, here have
already use this key word.
Besides ,what about make the "--dummy--" a final String in SchemaConstants.java with some descriptions, like
// some descriptions
public final String DUMMY = "--dummy--"; // SchemaConstants.java

because I think if we do not explain for the reason , then for every subsequent developer who may see this word, they will ask why.
@mike-tr-adamson WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my main concern is that this may not be valid for for all types. As it happens this method is only used in AbstractTypeTest. I think it would make sense to remove this method from here to the test so it isn't accidentally used in live code without clear understanding of the effects.

Object actualValueDecoded = actualValue == null ? null : column.type.getSerializer().deserialize(actualValue);
if (!Objects.equal(expected != null ? expected[j] : null, actualValueDecoded))
{
//TODO confirm this isn't a bug...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//TODO confirm this isn't a bug... -----> // TODO confirm this isn't a bug...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this? At least in IntelliJ //TODO causes the comments under this to be flagged as part of the TODO, which is true for this case... If I switch to // TODO that goes away and only this line is flagged... Not actually sure that this is a feature in IntelliJ... I just noticed when I made this change...

I don't think we have a style guide comment for this, so prefer to keep

Copy link
Contributor

@Maxwell-Guo Maxwell-Guo May 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//TODO
// TODO
it seems both are all exists, so I think we can just keep it.

execute("INSERT INTO %s (pk, value) VALUES (0, [1, 1 + (int) ?])", 1);
test.run();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need test cases for tables that have verctor type in both pk and non-pks ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, they are different code paths; testing one doesn't mean you fully work, the other code path may do something unexpected.

assertRows(execute("SELECT * FROM %s WHERE pk IN ([1, 1 + (int) ?])", 1), row(list(1, 2)));

assertRows(execute("SELECT * FROM %s WHERE pk > [0, 0] AND pk < [1, 3] ALLOW FILTERING"), row(list(1, 2)));
assertRows(execute("SELECT * FROM %s WHERE token(pk) = token([1, 2])"), row(list(1, 2)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need some more test cases when element type is not just int ,decimal,collection type or udt (if not suporrt now an exception is need )?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and for compare order , pk > [0, 0] and <[1, 3] which means (the first index element in array is big than 0 and less than 1 )& (the secondary index element is big than 0 and less than 3) , or not & but | ? I think we should declaring these relationships clearly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need some more test cases when element type is not just int ,decimal,collection type or udt (if not suporrt now an exception is need )?

See org.apache.cassandra.cql3.RandomSchemaTest

and for compare order , pk > [0, 0] and <[1, 3] which means (the first index element in array is big than 0 and less than 1 )& (the secondary index element is big than 0 and less than 3)

I wouldn't say that... from CQL point of view I am saying to compare specific values, so is similar to the case where pk is a int and do pk > 0 AND pk < 3. For the type, the ordering is similar to list and based off the elements... so the following would also match: [0, 1], and [1, 2], and [0, 10991038080]

I think we should declaring these relationships clearly

I feel this is very clear, what would you recommend to improve?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, after see the compareCustom of vertorType ,I got the really meaning of this test case.

@after {$value = new Lists.Literal(l);}
: '[' ( t1=term { l.add(t1); } ( ',' tn=term { l.add(tn); } )* )? ']' { $value = new Lists.Literal(l); }
@after {$value = new ArrayLiteral(l);}
: '[' ( t1=term { l.add(t1); } ( ',' tn=term { l.add(tn); } )* )? ']' { $value = new ArrayLiteral(l); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this previously but isn't this a duplication? We either need @after {$value = new ArrayLiteral(l);} or we need { $value = new ArrayLiteral(l); } but I don't think we need both of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but parser scares me as we don't have a lot of edge case testing in-tree atm... so I strongly prefer not changing this in this patch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, although, I have changed it locally and done a lot of testing on it without breaking anything. I'm happy to leave as is for the the time being and fix it later.

public abstract Class<T> getType();

public String toCQLLiteral(ByteBuffer buffer)
protected String toCQLLiteralNonNull(ByteBuffer buffer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe add @Nonnull to parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made this to param and return for each new type


public boolean shouldQuoteCQL()
{
return true;
Copy link
Contributor

@mike-tr-adamson mike-tr-adamson May 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we default to true here? If it's strings that need quoting then wouldn't it be better to default to false and override this in AbstractTextSerializer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly because it looked like there were more types needing than those that exclude.

If it's strings that need quoting then wouldn't it be better to default to false and override this in AbstractTextSerializer?

Most types looked to be strings, but do not use AbstractTypeSerializer... one negative of this is that collections became strings (which does still work in CQL), so switching to false may help not quote when we don't need to... ill test out swapping

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the change, less types need to override if the default is false; just making sure to run serde many times to see if any case pops up

@dcapwell
Copy link
Contributor Author

@mike-tr-adamson @Maxwell-Guo replied to or fixed all feedback

}

@Override
public <V> boolean isNull(V buffer, ValueAccessor<V> accessor)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: text and byte[] treat the empty set as non-null, so this class will convert those to null... I didn't want to really go down the rabbit hole to figure out the impact to this... but this isn't "correct" from the types point of view...

* Test functionality to re-create a CQL literal from its serialized representation.
* This test uses some randomness to generate the values and nested structures (collections,tuples,UDTs).
*/
public class CQL3TypeLiteralTest
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted this test as its a duplicate of AbstractTypeTest.serde. This test actually showed we were doing the wrong thing, it built its own model of what CQL would accept and made sure types matched its model; but if you ask CQL to parse it will fail as the model wasn't correct... Rather than fixing, was easier to remove as the new serde test covers this completely.

TypeGenBuilder nonEmptyNoDuration = AbstractTypeGenerators.builder()
.withoutEmpty()
.withUserTypeKeyspace(KEYSPACE)
.withoutPrimitive(DurationType.instance);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DurationType violates several properties and is banned in several places do to these bugs... so exclude from this test as this test makes sure types respect their given properties

Gen<Boolean> nulls = SourceDSL.integers().between(1, 100).map(i -> i < 5);
qt().checkAssert(random -> {
TypeGenBuilder nonEmptyNoDuration = AbstractTypeGenerators.builder()
.withoutEmpty()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

legacy 3.x -> 4.x type... yay COMPACT STORAGE. This type is allowed in some places and not all, so easier to block in the test than try to work around it

Comment on lines 78 to 99
.withPartitionColumnsCount(1)
.withPrimaryColumnTypeGen(new TypeGenBuilder(nonEmptyNoDuration).withMaxDepth(2).build())
.withClusteringColumnsBetween(1, 2)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason for 1 partition and 1-2 clustering is to keep the size small.. we have a limit of 64k and with larger sizes its easier to hit that....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants