Support uploading batches of ndjson vectors from a source file#4034
Conversation
🦋 Changeset detectedLatest commit: a06767d The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
A wrangler prerelease is available for testing. You can install this latest build in your project with: npm install --save-dev https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6315244316/npm-package-wrangler-4034You can reference the automatically updated head of this PR with: npm install --save-dev https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/prs/6315244316/npm-package-wrangler-4034Or you can use npx https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6315244316/npm-package-wrangler-4034 dev path/to/script.jsAdditional artifacts:npm install https://prerelease-registry.devprod.cloudflare.dev/workers-sdk/runs/6315244316/npm-package-cloudflare-pages-shared-4034Note that these links will no longer work once the GitHub Actions artifact expires.
| Please ensure constraints are pinned, and |
.changeset/eleven-monkeys-tan.md
Outdated
| "wrangler": patch | ||
| --- | ||
|
|
||
| [Vectorize] Support uploading batches of ndjson vectors from a source file |
There was a problem hiding this comment.
Could this include an example?
| FormData.prototype.toString = mockFormDataToString; | ||
| export interface RestRequestWithFormData extends MockedRequest, RestRequest { | ||
| formData(): Promise<FormData>; | ||
| } | ||
| (MockedRequest.prototype as RestRequestWithFormData).formData = | ||
| mockFormDataFromString; | ||
|
|
||
| function mockFormDataToString(this: FormData) { | ||
| const entries = []; | ||
| for (const [key, value] of this.entries()) { | ||
| if (value instanceof Blob) { | ||
| const reader = new FileReaderSync(); | ||
| reader.readAsText(value); | ||
| const result = reader.result; | ||
| entries.push([key, result]); | ||
| } else { | ||
| entries.push([key, value]); | ||
| } | ||
| } | ||
| return JSON.stringify({ | ||
| __formdata: entries, | ||
| }); | ||
| } | ||
|
|
||
| async function mockFormDataFromString(this: MockedRequest): Promise<FormData> { | ||
| const { __formdata } = await this.json(); | ||
| expect(__formdata).toBeInstanceOf(Array); | ||
|
|
||
| const form = new FormData(); | ||
| for (const [key, value] of __formdata) { | ||
| form.set(key, value); | ||
| } | ||
| return form; | ||
| } |
There was a problem hiding this comment.
non-blocking: it would be nice to not have to duplicate this mocking here and in https://github.com/cloudflare/workers-sdk/blob/4692d92c531fde29aaefbf7c0e2c375ede3e3b1d/packages/wrangler/src/__tests__/deploy.test.ts#L9207-L9247
| // actually do any parsing - that will be handled on the backend | ||
| async function* getBatchFromFile(file: FileHandle, batchSize = 3) { | ||
| let batch: string[] = []; | ||
| for await (const line of file.readLines()) { |
There was a problem hiding this comment.
filehandle.readLines() was added in Node 18—Wrangler supports Node 16, and so this won't be usable, I don't think. https://nodejs.org/api/readline.html is probably the best alternative?
| let index: Optional<VectorizeVectorMutation, "ids"> | undefined; | ||
| for await (const batch of getBatchFromFile(file, args.batchSize)) { | ||
| const formData = new FormData(); | ||
| formData.append("vectors", new File([batch.join(`\n`)], "vectors.ndjson")); |
There was a problem hiding this comment.
This should probably include a mime type for the File?
| logger.warn( | ||
| `🚧 While Vectorize is in beta, we've limited uploads to 100k vectors` | ||
| ); | ||
| logger.warn( | ||
| `🚧 You may run this again with another batch to upload further` | ||
| ); |
There was a problem hiding this comment.
| logger.warn( | |
| `🚧 While Vectorize is in beta, we've limited uploads to 100k vectors` | |
| ); | |
| logger.warn( | |
| `🚧 You may run this again with another batch to upload further` | |
| ); | |
| logger.warn( | |
| `🚧 While Vectorize is in beta, we've limited uploads to 100k vectors. You may run this again with another batch to upload further` | |
| ); |
| const index = await insertIntoIndex(config, args.name, vectors); | ||
| // remove the ids - skip tracking these for bulk uploads since this could be in the 100s of thousands. | ||
| if (index) delete index.ids; | ||
| logger.log(JSON.stringify(index, null, 2)); |
There was a problem hiding this comment.
This command should have additional --json flag for outputting JSON, otherwise, it should output in a human readable format
| if (args.vectors) { | ||
| // Parse each vector into a Vector type | ||
| // Think about potential limits on args.vectors.length? | ||
| if (index.count > VECTORIZE_MAX_UPSERT_VECTOR_RECORDS) { |
There was a problem hiding this comment.
This limiting behaviour should be covered in tests
| const config = readConfig(args.config, args); | ||
| const file = await open(args.file); | ||
|
|
||
| let index: Optional<VectorizeVectorMutation, "ids"> | undefined; |
There was a problem hiding this comment.
Since the ids field isn't being used here, could this just be an incrementing integer count?
| method: "POST", | ||
| body: JSON.stringify(vectors), | ||
| headers: { | ||
| "Content-Type": "multipart/form-data", |
There was a problem hiding this comment.
This should be omitted, so the boundary can be generated and included in the content-type. cc @Skye-31 for spotting this!
There was a problem hiding this comment.
Interesting: we're sure about this? I check the incoming content type in my test in here and omitting this results in
Expected: "multipart/form-data"
Received: "text/plain;charset=UTF-8"
Which is not the correct type for a multipart request, but this may be an artifact of the test server
There was a problem hiding this comment.
That's an artifact of the MSW test setup, I think, which doesn't properly support FormData
There was a problem hiding this comment.
Does this PR work when manually testing against the API?
3376d2e to
bbc0e0d
Compare
bbc0e0d to
eade8a5
Compare
| type: "string", | ||
| choices: [ | ||
| "workers-ai/bge-small-en", | ||
| "@cf/baai/bge-small-en-v1.5", |
| }); | ||
| }); | ||
|
|
||
| FormData.prototype.toString = mockFormDataToString; |
There was a problem hiding this comment.
This will pollute the FormData prototype for all tests
This needs to be done in a beforeEach, after storing the original value, and then resetting to the original value in an afterEach
There was a problem hiding this comment.
Apparently this pattern already exists within wrangler. So no need to block this PR on this comment
| (MockedRequest.prototype as RestRequestWithFormData).formData = | ||
| mockFormDataFromString; |
There was a problem hiding this comment.
Same comment as above for prototype pollution across all tests
There was a problem hiding this comment.
Apparently this pattern already exists within wrangler. So no need to block this PR on this comment
| function mockFormDataToString(this: FormData) { | ||
| const entries = []; | ||
| for (const [key, value] of this.entries()) { | ||
| if (value instanceof Blob) { |
There was a problem hiding this comment.
The tests are failing because Blob is not a global prior to Node v18 nodejs/node#42220 (comment)
It needs to be imported from node:buffer
it is not a global prior to node v18
Codecov Report
@@ Coverage Diff @@
## main #4034 +/- ##
==========================================
+ Coverage 74.94% 75.08% +0.13%
==========================================
Files 207 208 +1
Lines 11826 11871 +45
Branches 3079 3085 +6
==========================================
+ Hits 8863 8913 +50
+ Misses 2963 2958 -5
|
Extends the Vectorize wrangler support work to allow uploading vectors from a file source.
Comments on the original PR here: #4029
What this PR solves / how to test:
We want to improve the Vector Store onboarding for our customers by allowing them an easy way to source vectors. This vector files may be quite big (100s of MB - GB). This supports a streaming approach to reading and loading these files in an efficient manner.