Update vector support to work done in JAVA-3061, add JSON codecs #480

absurdfarce · 2023-07-10T17:43:17Z

Update dsbulk to work with latest changes in JAVA-3061

absurdfarce · 2023-07-10T18:01:28Z

Local testing looks good:

mvn package -Prelease
bin/dsbulk load -url "./../vector_test_data_string.csv" -k test -t foo
bin/dsbulk unload -k test -t foo
bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"
truncate test.foo;
bin/dsbulk load -url "./../vector_test_data_json" -k test -t foo -c json
bin/dsbulk unload -k test -t foo -c json

Jenkins build currently failing since the relevant Java driver hasn't been published yet.

absurdfarce · 2023-07-10T18:05:33Z

pom.xml

    <reactor.version>2020.0.19</reactor.version>
    <config.version>1.4.2</config.version>
-    <netty.version>4.1.77.Final</netty.version>
+    <netty.version>4.1.94.Final</netty.version>


Netty upgrade required to match the new version in the Java driver (upgraded as of JAVA-3050). Without this we get weird mismatch errors on the tcnative usage.

absurdfarce · 2023-07-13T18:05:55Z

Following the example of what was outlined in the earlier PR:

Given the following table:

cqlsh> describe test.foo;

CREATE TABLE test.foo (
    i int PRIMARY KEY,
    j vector<float, 3>
);

CREATE CUSTOM INDEX ann_index ON test.foo (j) USING 'StorageAttachedIndex';
cqlsh> select * from test.foo;

 i | j
---+---

(0 rows)

With the changes in this PR we can now load and unload JSON data into this table:

$ cat ../vector_test_data_json/one.json 
{
    "i":1,
    "j":[8, 2.3, 58]
}
$ cat ../vector_test_data_json/two.json 
{
    "i":2,
    "j":[1.2, 3.4, 5.6]
}
$ cat ../vector_test_data_json/five.json 
{
    "i":5,
    "j":[23, 18, 3.9]
}
$ bin/dsbulk load -url "./../vector_test_data_json" -k test -t foo -c json
...
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    3 |      0 |     16 | 37.18 | 39.58 |  39.58 |    1.00
...
$ bin/dsbulk unload -k test -t foo -c json
...
{"i":5,"j":[23.0,18.0,3.9]}
{"i":1,"j":[8.0,2.3,58.0]}
{"i":2,"j":[1.2,3.4,5.6]}
total | failed | rows/s | p50ms | p99ms | p999ms
    3 |      0 |     14 |  2.58 |  2.87 |   2.87
...
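As a rough illustration of the round trip above, the sketch below turns a JSON array into a fixed-size list of floats and rejects arrays with too many or too few elements. It is not dsbulk's or the Java driver's codec: the `parse_vector` helper and its error message are hypothetical and exist only for this example.

```python
import json


def parse_vector(values, dimensions):
    """Convert a JSON array into a list of floats with exactly `dimensions` elements.

    Hypothetical helper for illustration only; not part of dsbulk or the Java driver.
    """
    if len(values) != dimensions:
        raise ValueError(
            f"expected {dimensions} elements for vector, got {len(values)}"
        )
    return [float(v) for v in values]


# Same content as one.json above; the column is vector<float, 3>.
record = json.loads('{"i": 1, "j": [8, 2.3, 58]}')
vector = parse_vector(record["j"], 3)

# Integers in the input come back as floats, as in the unload output above.
print(json.dumps({"i": record["i"], "j": vector}, separators=(",", ":")))
# prints {"i":1,"j":[8.0,2.3,58.0]}
```

Passing a two-element or four-element array to `parse_vector(..., 3)` raises `ValueError` instead of silently padding or truncating.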

Data on the server side matches up to what we'd expect:

cqlsh> select * from test.foo;

 i | j
---+-----------------
 5 |   [23, 18, 3.9]
 1 |    [8, 2.3, 58]
 2 | [1.2, 3.4, 5.6]

(3 rows)
cqlsh> select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1;

 j
-----------------
 [1.2, 3.4, 5.6]

(1 rows)
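To see why that row comes back, here is a brute-force check over the three stored vectors. It assumes cosine similarity, which is an assumption on my part since the thread does not say which similarity function the index uses; the real index performs an approximate search rather than this exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Rows from test.foo, keyed by the primary key i.
rows = {1: [8.0, 2.3, 58.0], 2: [1.2, 3.4, 5.6], 5: [23.0, 18.0, 3.9]}
query = [3.4, 7.8, 9.1]

# Exhaustive scan: pick the key whose vector is most similar to the query.
nearest_key = max(rows, key=lambda k: cosine_similarity(rows[k], query))
print(rows[nearest_key])  # prints [1.2, 3.4, 5.6]
```

Ranking by Euclidean distance instead picks the same row for this data set, so the result here does not hinge on the similarity assumption.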

msmygit · 2023-07-13T21:52:38Z

bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"

Shouldn't this be as below?

bin/dsbulk unload -query "select j from test.foo where j ORDER BY ann of [3.4, 7.8, 9.1] limit 1"

msmygit

I left a trivial question, but everything else LGTM

weideng1

LGTM

absurdfarce · 2023-07-14T06:34:02Z

Good find, @msmygit! I originally put this code together using what is now quite an old build of datastax/cassandra. I think the "order by" syntax was added after I did my original work, so you are quite right: what I had above is now out of date. Based on some experiments with cqlsh on a version built from source about an hour ago, it looks like the correct syntax is now the following:

select j from test.foo ORDER BY j ann of [3.4, 7.8, 9.1] limit 1;

The good news is that the CQL parser bundled with dsbulk appears to be fine with that syntax as well:

$ bin/dsbulk unload -query "select j from test.foo order by j ann of [3.4, 7.8, 9.1] limit 1"
Operation directory: /work/git/dsbulk/dist_test/dsbulk-1.11.0/logs/UNLOAD_20230714-061839-089434
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      5 |  9.40 |  9.44 |   9.44

Does all that hang together for you?
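For anyone consuming that CSV output downstream: the vector arrives as a single quoted field, so it needs a second parsing step. A minimal sketch, assuming the field is always a JSON-compatible array literal as in the output above (the `unloaded` string simply reproduces the two data lines shown there):

```python
import csv
import io
import json

# Header line and data line as printed by the unload above.
unloaded = 'j\n"[1.2, 3.4, 5.6]"\n'

# The csv module strips the quotes; json.loads then parses the array text.
reader = csv.DictReader(io.StringIO(unloaded))
vectors = [json.loads(row["j"]) for row in reader]
print(vectors)  # prints [[1.2, 3.4, 5.6]]
```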

Bret McGuire added 2 commits July 10, 2023 12:40

Functioning changes... ?

67ff050

Fix for issue discovered in testing

6291183

absurdfarce requested review from msmygit and weideng1 July 10, 2023 18:02

Bret McGuire added 2 commits July 12, 2023 23:37

Update to newly-released 4.17.0 Java driver

3a2d18c

Test for too many/too few elements for a given vector type should follow what we're doing in the Java driver's VectorCodec

448a5ee

absurdfarce linked an issue Jul 13, 2023 that may be closed by this pull request

Add support for loading/unloading vector type data #481

Closed

absurdfarce changed the title ~~JAVA-3061 Vector support, round2~~ Update vector support to work done in JAVA-3061, add JSON codecs Jul 13, 2023

absurdfarce added this to the 1.11.0 milestone Jul 13, 2023

msmygit approved these changes Jul 13, 2023

weideng1 approved these changes Jul 13, 2023

absurdfarce merged commit 03c3c13 into 1.x Jul 13, 2023

absurdfarce deleted the vector_support_round2 branch July 13, 2023 22:04