Skip to content

Conversation

@absurdfarce
Copy link
Collaborator

Update dsbulk to work with latest changes in JAVA-3061

@absurdfarce
Copy link
Collaborator Author

Local testing looks good:

mvn pacakge -Prelease
bin/dsbulk load -url "./../vector_test_data_string.csv" -k test -t foo
bin/dsbulk unload -k test -t foo
bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"
truncate test.foo;
bin/dsbulk load -url "./../vector_test_data_json" -k test -t foo -c json
bin/dsbulk unload -k test -t foo -c json

Jenkins build currently failing since the relevant Java driver hasn't been published yet.

@absurdfarce absurdfarce requested review from msmygit and weideng1 July 10, 2023 18:02
<reactor.version>2020.0.19</reactor.version>
<config.version>1.4.2</config.version>
<netty.version>4.1.77.Final</netty.version>
<netty.version>4.1.94.Final</netty.version>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Netty upgrade required to match the new version in the Java driver (upgraded as of JAVA-3050). Without this we get weird mismatch errors on the tcnative usage.

@absurdfarce absurdfarce linked an issue Jul 13, 2023 that may be closed by this pull request
@absurdfarce absurdfarce changed the title JAVA-3061 Vector support, round2 Update vector support to work done in JAVA-3061, add JSON codecs Jul 13, 2023
@absurdfarce
Copy link
Collaborator Author

Following the example of what was outlined in the earlier PR:

Given the following table:

cqlsh> describe test.foo;

CREATE TABLE test.foo (
    i int PRIMARY KEY,
    j vector<float, 3>
);

CREATE CUSTOM INDEX ann_index ON test.foo (j) USING 'StorageAttachedIndex';
cqlsh> select * from test.foo;

 i | j
---+---

(0 rows)

With the changes in this PR we can now load and unload JSON data into this table:

$ cat ../vector_test_data_json/one.json 
{
    "i":1,
    "j":[8, 2.3, 58]
}
$ cat ../vector_test_data_json/two.json 
{
    "i":2,
    "j":[1.2, 3.4, 5.6]
}
$ cat ../vector_test_data_json/five.json 
{
    "i":5,
    "j":[23, 18, 3.9]
}
$ bin/dsbulk load -url "./../vector_test_data_json" -k test -t foo -c json
...
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    3 |      0 |     16 | 37.18 | 39.58 |  39.58 |    1.00
...
$ bin/dsbulk unload -k test -t foo -c json
...
{"i":5,"j":[23.0,18.0,3.9]}
{"i":1,"j":[8.0,2.3,58.0]}
{"i":2,"j":[1.2,3.4,5.6]}
total | failed | rows/s | p50ms | p99ms | p999ms
    3 |      0 |     14 |  2.58 |  2.87 |   2.87
...

Data on the server side matches up to what we'd expect:

cqlsh> select * from test.foo;

 i | j
---+-----------------
 5 |   [23, 18, 3.9]
 1 |    [8, 2.3, 58]
 2 | [1.2, 3.4, 5.6]

(3 rows)
cqlsh> select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1;

 j
-----------------
 [1.2, 3.4, 5.6]

(1 rows)

@absurdfarce absurdfarce added this to the 1.11.0 milestone Jul 13, 2023
@msmygit
Copy link
Member

msmygit commented Jul 13, 2023

bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"

Shouldn't this be as below?

bin/dsbulk unload -query "select j from test.foo where j ORDER BY ann of [3.4, 7.8, 9.1] limit 1"

Copy link
Member

@msmygit msmygit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a trivial question, but everything else LGTM

Copy link
Collaborator

@weideng1 weideng1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@absurdfarce absurdfarce merged commit 03c3c13 into 1.x Jul 13, 2023
@absurdfarce absurdfarce deleted the vector_support_round2 branch July 13, 2023 22:04
@absurdfarce
Copy link
Collaborator Author

Good find @msmygit ! I originally put this code together using what is now a quite old build of datastax/cassandra. I think the "order by" syntax was added after I did my original work so you are quite right, what I had above is now out-of-date. Based on some experiments with cqlsh on a version built from source as of about an hour ago it looks like the correct syntax now is actually the following:

select j from test.foo ORDER BY j ann of [3.4, 7.8, 9.1] limit 1;

The good news is that the CQL parser contained by dsbulk appears to be fine with that syntax as well:

$ bin/dsbulk unload -query "select j from test.foo order by j ann of [3.4, 7.8, 9.1] limit 1"
Operation directory: /work/git/dsbulk/dist_test/dsbulk-1.11.0/logs/UNLOAD_20230714-061839-089434
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      5 |  9.40 |  9.44 |   9.44

Does all that hang together for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for loading/unloading vector type data

4 participants