Support uploading batches of ndjson vectors from a source file #4029
ndisidore wants to merge 1 commit into cloudflare:silverlock/wrangler-vectorize from
Conversation
Force-pushed from 5215e8c to 95e8ce8
A wrangler prerelease is available for testing. You can install this latest build in your project with:

npm install --save-dev https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6305869698/npm-package-wrangler-4029

You can reference the automatically updated head of this PR with:

npm install --save-dev https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/prs/6305869698/npm-package-wrangler-4029

Or you can use:

npx https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6305869698/npm-package-wrangler-4029 dev path/to/script.js

Additional artifacts:

npm install https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6305869698/npm-package-cloudflare-pages-shared-4029

Note that these links will no longer work once the GitHub Actions artifact expires.
Please ensure constraints are pinned, and
vectors: {
	type: "array",
file: {
	describe: "A file containing line separated json (ndjson) vector objects",
Suggested change:
- describe: "A file containing line separated json (ndjson) vector objects",
+ describe: "A file containing line separated Newline Delimited JSON (NDJSON) vector objects",
Do we want to specify a maximum size in the help text too?
e.g.
A file containing line separated Newline Delimited JSON (NDJSON) vector objects (max lines: ${VECTORIZE_UPSERT_BATCH_SIZE})
I'd be cool with this - it probably does make sense to make it larger than our batch size (since capping at the batch size defeats the purpose of batching), but feel free to suggest something reasonable... maybe like 500k initially?
Probably 100k? I'd want to be sure that 5k * 20 loops works reliably?
Force-pushed from 719095f to 151e1be
async function* getBatchFromFile(file: FileHandle, batchSize = 3) {
	let batch: string[] = [];
	for await (const line of file.readLines()) {
		if (batch.push(line) >= batchSize) {
Note this does no validation here: it just takes the line as it is. We'll need to do verification on the backend anyway, so this should be okay. Also, it appears there's no zod anywhere in wrangler; evidently most folks don't schema-validate client side here.
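For context, the truncated generator above can be completed along these lines. This is a hedged sketch, not the PR's actual implementation: it generalizes the `FileHandle` parameter to any (a)sync iterable of lines so it is testable without a file, and the `getBatchFromLines` name and remainder-flush behavior are assumptions.

```typescript
// Sketch of the batching generator: accumulate lines from any iterable
// (e.g. FileHandle.readLines()) and yield fixed-size batches, flushing
// the trailing partial batch at the end.
async function* getBatchFromLines(
	lines: AsyncIterable<string> | Iterable<string>,
	batchSize = 1000
): AsyncGenerator<string[]> {
	let batch: string[] = [];
	for await (const line of lines) {
		// Array.prototype.push returns the new length, so this single
		// expression both appends the line and checks whether the batch is full.
		if (batch.push(line) >= batchSize) {
			yield batch;
			batch = [];
		}
	}
	// Don't drop the remainder when the line count isn't a multiple of batchSize.
	if (batch.length > 0) yield batch;
}
```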
	config: Config,
	indexName: string,
-	vectors: Array<VectorizeVector>
+	body: FormData
Changed this to always assume multipart and a file source. I can't really see a reality where customers would specify vector definitions via a flag.
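A minimal sketch of what that multipart body might look like. The field name `"vectors"`, the filename, and the MIME type here are illustrative assumptions, not the values from the actual wrangler change:

```typescript
// Build a multipart body from one batch of ndjson lines: the lines are
// joined back into a single blob and attached as a file field.
function buildVectorBody(batch: string[]): FormData {
	const body = new FormData();
	body.append(
		"vectors", // assumed field name
		new Blob([batch.join("\n")], { type: "application/x-ndjson" }),
		"vectors.ndjson"
	);
	return body;
}
```

This keeps the upload path uniform: whether the source file has ten lines or a million, every request to the API carries one file-shaped batch.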
}

const index = await insertIntoIndex(config, args.name, vectors);
// remove the ids - skip tracking these for bulk uploads since this could be in the 100s of thousands.
I think this is probably the right move for this type of operation, but feel free to shout if you feel differently
"batch-size": {
	describe:
-		"An array of one or more vectors in JSON format to insert into the index",
+		"Number of vector records to include when sending to the Cloudflare API.",
I think the backend should be the one worrying about this (since clients could edit these file values to whatever they want anyway), but I went ahead and put in a soft limit of 100k. We should definitely increase this in the future.
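A client-side soft limit like the one described could be enforced while streaming, without buffering the whole file. This is a sketch under stated assumptions: the constant name and error message are made up for illustration, and only the 100k figure comes from the discussion above.

```typescript
// Assumed constant; the real wrangler constant name may differ.
const MAX_VECTOR_RECORDS = 100_000;

// Count records as they stream past and fail fast once the soft limit
// is exceeded, rather than reading the rest of a too-large file.
async function countAndValidate(
	lines: AsyncIterable<string> | Iterable<string>
): Promise<number> {
	let count = 0;
	for await (const _line of lines) {
		count++;
		if (count > MAX_VECTOR_RECORDS) {
			throw new Error(
				`Too many vector records: the soft limit is ${MAX_VECTOR_RECORDS} per upload.`
			);
		}
	}
	return count;
}
```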
Force-pushed from 151e1be to a2a33a4
Force-pushed from a2a33a4 to bdf3f2e
Extends the Vectorize wrangler support work to allow uploading vectors from a file source.

What this PR solves / how to test:

We want to improve the Vector Store onboarding for our customers by giving them an easy way to source vectors. These vector files may be quite large (hundreds of MB to GBs), so this PR takes a streaming approach to reading and uploading them efficiently.
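The streaming flow described above can be sketched end to end: open the ndjson file, read it line by line without loading it fully into memory, and upload fixed-size batches as they fill. This is a minimal illustration using Node's `fs/promises` `FileHandle.readLines()`; `uploadBatch` is a placeholder for the actual insert call, not wrangler's real helper.

```typescript
import { open } from "node:fs/promises";

// Stream an ndjson file and hand off batches of lines to an uploader.
// Returns the total number of (non-blank) records processed.
async function uploadVectorFile(
	path: string,
	uploadBatch: (lines: string[]) => Promise<void>,
	batchSize = 1000
): Promise<number> {
	const file = await open(path, "r");
	let batch: string[] = [];
	let total = 0;
	try {
		for await (const line of file.readLines()) {
			if (line.trim() === "") continue; // skip blank lines
			total++;
			if (batch.push(line) >= batchSize) {
				await uploadBatch(batch);
				batch = [];
			}
		}
		// Flush the final partial batch, if any.
		if (batch.length > 0) await uploadBatch(batch);
	} finally {
		await file.close();
	}
	return total;
}
```

Because each batch is awaited before the next one is read, memory stays bounded by roughly one batch regardless of file size.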