Adds new bit element_type for dense_vectors #110059

Merged
elasticsearchmachine merged 28 commits into elastic:main from benwtrent:feature/binary-vector-support on Jun 26, 2024

@benwtrent
Member

@benwtrent benwtrent commented Jun 21, 2024

This commit adds bit vector support by adding element_type: bit for vectors. This new element type works for indexed and non-indexed vectors, and with both the hnsw and flat index types. No quantization-based codec works with this element type, which is consistent with byte vectors.

bit vectors accept up to 32768 dimensions and expect vectors being indexed to be encoded either as a hexadecimal string or as a byte[] array in which each element represents 8 bits of the vector.
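
As a rough illustration of the two accepted encodings, the sketch below packs a list of bits into bytes and renders the result both as a hex string and as signed byte values. The helper names `pack_bits` and `to_signed_bytes` are made up for this example, and both the most-significant-bit-first ordering and the signed [-128, 127] range (mirroring a Java byte) are assumptions rather than something stated in this PR.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte.

    Assumption: the first bit of each group of 8 is the most significant bit.
    """
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be divisible by 8")
    out = bytearray()
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            if bit not in (0, 1):
                raise ValueError("bits must be 0 or 1")
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


def to_signed_bytes(packed):
    """View each byte as a signed value in [-128, 127], as a Java byte would hold it."""
    return [b - 256 if b > 127 else b for b in packed]


# A 24-dimension bit vector: 10110001, 00000000, 11111111
bits = [1, 0, 1, 1, 0, 0, 0, 1] + [0] * 8 + [1] * 8
packed = pack_bits(bits)
hex_form = packed.hex()
byte_form = to_signed_bytes(packed)
print(hex_form, byte_form)
```

Here 24 bits become 3 bytes: the hex form is `b100ff` and the signed byte form is `[-79, 0, -1]`.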

bit vectors support script usage and regular query usage. When indexed, all comparisons are xor and popcount summations (i.e., hamming distance), and the scores are transformed and normalized by the vector dimensions. Note that indexed bit vectors require l2_norm as the similarity.
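
A minimal sketch of the xor-and-popcount comparison on packed bytes follows. The function names are illustrative only. The normalization in `normalized_score` (fraction of matching bits) is an assumption about what "normalized by the vector dimensions" could look like; the description above does not spell out the exact formula.

```python
def hamming_distance(a, b):
    """Count differing bits between two equal-length byte strings via xor + popcount."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of bytes")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def normalized_score(a, b):
    """Hypothetical score in [0, 1]: the fraction of bits that match.

    Assumption: this is only one plausible normalization, not the confirmed one.
    """
    num_bits = len(a) * 8
    return (num_bits - hamming_distance(a, b)) / num_bits


query = bytes.fromhex("b100ff")
doc = bytes.fromhex("b000fe")
distance = hamming_distance(query, doc)
score = normalized_score(query, doc)
print(distance, score)
```

The two example vectors differ in exactly 2 of their 24 bits, so the distance is 2 and the assumed score is 22/24.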

For scripts, l1norm is the same as hamming distance and l2norm is sqrt(l1norm). dotProduct and cosineSimilarity are not supported.
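
The relationship between these functions can be checked with a small Python sketch. The functions below are plain Python stand-ins that borrow the script function names for readability; they are not the script API itself, and the most-significant-bit-first unpacking is an assumption.

```python
import math


def unpack_bits(packed):
    """Expand bytes into a list of 0/1 values (assumed most significant bit first)."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]


def l1norm(a, b):
    """Sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2norm(a, b):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Number of positions where the bits differ."""
    return sum(x ^ y for x, y in zip(a, b))


query = unpack_bits(bytes.fromhex("f0"))  # 11110000
doc = unpack_bits(bytes.fromhex("3c"))    # 00111100
print(l1norm(query, doc), hamming(query, doc), l2norm(query, doc))
```

For these two vectors four bits differ, so l1norm and hamming are both 4 and l2norm is 2.0.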

Note that the dimensions for this element_type must always be divisible by 8, and the byte[] vectors provided for indexing must have size dim/8, where each byte element represents 8 bits of the vector.
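
The size rule can be expressed as a small client-side check. This is only a sketch: `check_bit_vector` is a hypothetical helper, not part of Elasticsearch, and accepting signed byte values is an assumption.

```python
def check_bit_vector(dims, value):
    """Return the packed bytes if `value` fits a bit vector of `dims`, else raise ValueError.

    `value` is either a hex string or a list of byte values
    (signed or unsigned; accepting signed values is an assumption).
    """
    if dims % 8 != 0:
        raise ValueError("dims must be divisible by 8 for bit vectors")
    if isinstance(value, str):
        packed = bytes.fromhex(value)  # raises ValueError on invalid hex
    else:
        packed = bytes(v & 0xFF for v in value)
    if len(packed) != dims // 8:
        raise ValueError(
            "expected %d bytes for dims=%d, got %d" % (dims // 8, dims, len(packed))
        )
    return packed


# dims = 40 means exactly 5 bytes, i.e. 10 hex characters.
from_hex = check_bit_vector(40, "0102030405")
from_bytes = check_bit_vector(16, [-1, 127])
```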

closes: #48322

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 21, 2024
@elasticsearchmachine
Collaborator

Hi @benwtrent, I've created a changelog YAML for you. Note that since this PR is labelled release highlight, you need to update the changelog YAML to fill out the extended information sections.

@benwtrent
Member Author

I am planning on adding docs soonish

@benwtrent
Member Author

@mayya-sharipova just pushed:

  • I dropped support for cosine & dot product in the script
  • Adding docs about this and what l1, l2, and hamming are for bit vectors
  • Also, I noticed I didn't have tests & such for magnitude & bit vectors, added that and optimized the magnitude calculation.

@benwtrent benwtrent added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Jun 26, 2024
@benwtrent
Member Author

@elasticmachine update branch

@mayya-sharipova
Contributor

mayya-sharipova commented Jun 26, 2024

@benwtrent Thanks, the code LGTM.

One thing that concerns me is how we use the l2_norm similarity parameter when defining bit vectors:

"bit_vector": {
    "type": "dense_vector",
    "dims" : 40,
    "element_type" : "bit",
    "similarity": "l2_norm"
}

because we don't actually use l2_norm; we use l1_norm or hamming for scoring calculations, but we don't have those similarity metrics defined in the mapping.
Could it be confusing for users that we define "l2_norm" but never use it in score calculations? What do you think?

One suggestion I have is to NOT allow users to provide any "similarity" metric for bit vectors in mappings, and to say in the documentation that "similarity" is not defined for bit vectors. What do you think of this suggestion?

@benwtrent
Member Author

@mayya-sharipova

One thing that concerns me is how we use the l2_norm similarity parameter when defining bit vectors:
...
because we don't actually use l2_norm; we use l1_norm or hamming for scoring calculations, but we don't have those similarity metrics defined in the mapping.

So, our l2_norm calculation for all our other data types is just the squareDifference; we don't actually take the square root.

Let's take some bits as an example: 1011 and 1001.

l2_norm would be (1-1)^2 + (0-0)^2 + (1-0)^2 + (1-1)^2 = 0 + 0 + 1 + 0 = 1
hamming is then (1 xor 1) + (0 xor 0) + (1 xor 0) + (1 xor 1) = 0 + 0 + 1 + 0 = 1

For bit vectors, given how we already score things, l1_norm, l2_norm, and hamming are all the same.
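
Stated in general form, assuming every component is either 0 or 1 (the symbols $a_i$, $b_i$, and $d$ are notation introduced here for illustration, with $d$ the number of bits), each per-component term takes the same value under all three measures, so the sums agree:

```latex
\[
a_i, b_i \in \{0, 1\}
\;\Longrightarrow\;
|a_i - b_i| \;=\; (a_i - b_i)^2 \;=\; a_i \oplus b_i
\]
\[
\sum_{i=1}^{d} |a_i - b_i|
\;=\;
\sum_{i=1}^{d} (a_i - b_i)^2
\;=\;
\sum_{i=1}^{d} \left( a_i \oplus b_i \right)
\]
```

The left sum is the l1 distance, the middle sum is the squared difference used for l2_norm without the square root, and the right sum is the hamming distance.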

@benwtrent
Member Author

@mayya-sharipova also, if users don't provide any similarity, we default to l2_norm for bit. So, it's not strictly necessary at all.
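
For illustration, a mapping that relies on that default could be built as below. This is a sketch only: the field name `bit_vector` and `dims` of 40 are taken from the snippet earlier in this thread, while the example document and its hex value are hypothetical.

```python
import json

# Index body that omits "similarity" for a bit vector field,
# relying on the l2_norm default described above.
index_body = {
    "mappings": {
        "properties": {
            "bit_vector": {
                "type": "dense_vector",
                "dims": 40,
                "element_type": "bit",
            }
        }
    }
}

# Hypothetical document: 40 bits encoded as 5 bytes (10 hex characters).
document = {"bit_vector": "0102030405"}

print(json.dumps(index_body, indent=2))
print(json.dumps(document))
```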

@mayya-sharipova
Contributor

@benwtrent Thanks for the detailed explanation. I guess it is ok to keep l2_norm as default.

@benwtrent benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 26, 2024
@benwtrent
Member Author

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit 5add44d into elastic:main Jun 26, 2024
@benwtrent benwtrent deleted the feature/binary-vector-support branch June 26, 2024 18:49
benwtrent pushed a commit that referenced this pull request Oct 14, 2024
…114407)

**Description:**

This PR addresses the issue described in [#114402](#114402), where the `synthetic_source` feature does not correctly handle the `bit` type in `dense_vector` fields when `index` is set to `false`. The root cause of the issue was that the `bit` type was not properly accounted for, leading to an array that is 8 times the size of the actual `dims` value of the doc value. This mismatch causes an array out-of-bounds exception when reconstructing the document.

**Changes:**

- Adjusted the `synthetic_source` logic to correctly handle the `bit` type by ensuring the array size accounts for the 8x difference in dimensions.
- Added yaml test to cover the `bit` type scenario in `dense_vector` fields with `index` set to `false`.

**Related Issues:**

- Closes [#114402](#114402)
- Introduced in [#110059](#110059)
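
The sizing mismatch described in this commit message can be modeled with a toy Python sketch. It is not the Elasticsearch implementation: `stored_length`, `rebuild_vector`, and `rebuild_vector_naive` are hypothetical names, and the model only assumes that a bit vector of `dims` bits is stored as `dims / 8` bytes.

```python
def stored_length(dims, element_type):
    """Number of stored elements for a vector (toy model, hypothetical helper)."""
    if element_type == "bit":
        return dims // 8  # 8 bits are packed into each stored byte
    return dims


def rebuild_vector(stored, dims, element_type):
    """Read back exactly as many elements as were stored."""
    count = stored_length(dims, element_type)
    return [stored[i] for i in range(count)]


def rebuild_vector_naive(stored, dims):
    """Buggy variant: treats `dims` as the element count even for bit vectors."""
    return [stored[i] for i in range(dims)]


stored = bytes([1, 2, 3, 4, 5])  # a 40-bit vector stored as 5 bytes
fixed = rebuild_vector(stored, 40, "bit")

try:
    rebuild_vector_naive(stored, 40)
    naive_failed = False
except IndexError:
    naive_failed = True  # reading 40 elements from 5 bytes runs out of bounds
```

In this model the naive reader asks for 8 times as many elements as exist, which is the kind of out-of-bounds access the fix avoids.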
benwtrent pushed a commit to benwtrent/elasticsearch that referenced this pull request Oct 14, 2024
…lastic#114407)

benwtrent pushed a commit to benwtrent/elasticsearch that referenced this pull request Oct 14, 2024
…lastic#114407)

(cherry picked from commit 465c65c)
elasticsearchmachine pushed a commit that referenced this pull request Oct 14, 2024
…114407) (#114756)

Co-authored-by: Rassyan <yjkhngds@gmail.com>
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Oct 14, 2024
…lastic#114407)

davidkyle pushed a commit that referenced this pull request Oct 15, 2024
…114407)

elasticsearchmachine pushed a commit that referenced this pull request Oct 15, 2024
…ld (#114407) (#114759)

* Fix Synthetic Source Handling for `bit` Type in `dense_vector` Field (#114407)

(cherry picked from commit 465c65c)

* fixing backport of search capabilities

* fixing license header

* adding capabilities to RestSearchAction

* fixing backport

* spotless

* muting test for ccs

* adding capabilities to the ccs test runner

---------

Co-authored-by: Rassyan <yjkhngds@gmail.com>
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
…lastic#114407)

Labels

  • auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!)
  • cloud-deploy (Publish cloud docker image for Cloud-First-Testing)
  • >feature
  • release highlight
  • :Search Relevance/Vectors (Vector search)
  • Team:Search (Meta label for search team)
  • v8.15.0

Development

Successfully merging this pull request may close these issues.

Add a field type for high-dimensional bit vectors.

6 participants