Skip to content

Support uploading batches of ndjson vectors from a source file#4029

Closed
ndisidore wants to merge 1 commit intocloudflare:silverlock/wrangler-vectorizefrom
ndisidore:nathan/vectorize-insert-from-file
Closed

Support uploading batches of ndjson vectors from a source file#4029
ndisidore wants to merge 1 commit intocloudflare:silverlock/wrangler-vectorizefrom
ndisidore:nathan/vectorize-insert-from-file

Conversation

@ndisidore
Copy link
Contributor

@ndisidore ndisidore commented Sep 25, 2023

Extends the Vectorize wrangler support work to allow uploading vectors from a file source.

What this PR solves / how to test:

We want to improve the Vector Store onboarding for our customers by allowing them an easy way to source vectors. This vector files may be quite big (100s of MB - GB). This supports a streaming approach to reading and loading these files in an efficient manner.

@ndisidore ndisidore requested a review from a team as a code owner September 25, 2023 18:57
@changeset-bot
Copy link

changeset-bot bot commented Sep 25, 2023

⚠️ No Changeset found

Latest commit: bdf3f2e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@ndisidore ndisidore force-pushed the nathan/vectorize-insert-from-file branch from 5215e8c to 95e8ce8 Compare September 25, 2023 18:58
@github-actions
Copy link
Contributor

github-actions bot commented Sep 25, 2023

A wrangler prerelease is available for testing. You can install this latest build in your project with:

npm install --save-dev https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6305869698/npm-package-wrangler-4029

You can reference the automatically updated head of this PR with:

npm install --save-dev https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/prs/6305869698/npm-package-wrangler-4029

Or you can use npx with this latest build directly:

npx https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6305869698/npm-package-wrangler-4029 dev path/to/script.js
Additional artifacts:
npm install https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6305869698/npm-package-cloudflare-pages-shared-4029

Note that these links will no longer work once the GitHub Actions artifact expires.


wrangler@3.9.0 includes the following runtime dependencies:

Package Constraint Resolved
miniflare 3.20230922.0 3.20230922.0
workerd 1.20230922.0 1.20230922.0
workerd --version 1.20230922.0 2023-09-22

|

Please ensure constraints are pinned, and miniflare/workerd minor versions match.

vectors: {
type: "array",
file: {
describe: "A file containing line separated json (ndjson) vector objects",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
describe: "A file containing line separated json (ndjson) vector objects",
describe: "A file containing line separated Newline Delimited JSON (NDJSON) vector objects",

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to specify a maximum size in the help text too?

e.g.

A file containing line separated Newline Delimited JSON (NDJSON) vector objects (max lines: ${VECTORIZE_UPSERT_BATCH_SIZE})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be cool with this - it probably does make sense to make it our batch size (since that defeats the purpose of batching) but feel free to suggest something reasonable... maybe like 500k initially?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably 100k? I'd want to be sure that 5k * 20 loops works reliably?

@ndisidore ndisidore force-pushed the nathan/vectorize-insert-from-file branch 2 times, most recently from 719095f to 151e1be Compare September 25, 2023 20:21
async function* getBatchFromFile(file: FileHandle, batchSize = 3) {
let batch: string[] = [];
for await (const line of file.readLines()) {
if (batch.push(line) >= batchSize) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this does no validation here: it just takes the line as it is. We'll need to do verification on the backend anyway, so this should be okay. Also it appears there's no zod anywhere in wrangler evidently most folks don't schema validate client side here

config: Config,
indexName: string,
vectors: Array<VectorizeVector>
body: FormData
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to always assume multipart and file source. I can't really see a reality where customers specify vector definitions via a flag

}

const index = await insertIntoIndex(config, args.name, vectors);
// remove the ids - skip tracking these for bulk uploads since this could be in the 100s of thousands.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably the right move for this type of operation, but feel free to shout if you feel differently

"batch-size": {
describe:
"An array of one or more vectors in JSON format to insert into the index",
"Number of vector records to include when sending to the Cloudflare API.",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a max?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the backend should be the one worrying about this (since clients could edit these file values to whatever they want anyway) but I went ahead and put a soft limit in of 100k. We should definitely increase this in the future

@ndisidore ndisidore force-pushed the nathan/vectorize-insert-from-file branch from 151e1be to a2a33a4 Compare September 25, 2023 23:05
@ndisidore ndisidore force-pushed the nathan/vectorize-insert-from-file branch from a2a33a4 to bdf3f2e Compare September 25, 2023 23:08
@lrapoport-cf lrapoport-cf deleted the branch cloudflare:silverlock/wrangler-vectorize September 26, 2023 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants