[fix] Treat numpy string/object arrays as batches in encode/predict#3720
Merged
Pull request overview
This PR addresses a regression where 1D+ numpy arrays of strings were treated as a single input in encode()/predict(), leading to incorrect modality routing (e.g., being inferred as audio) and producing one embedding/score for the entire array rather than per-element outputs.
Changes:
- Refines `is_singular_input` heuristics to treat numpy string/object arrays as batched inputs (and handles CrossEncoder's 1D-pair vs 2D-batch distinction).
- Ensures numpy inputs are materialized via `.tolist()` where appropriate in `encode()`/`predict()`.
- Adds regression tests covering numpy string-array inputs for `SentenceTransformer`, `SparseEncoder`, and `CrossEncoder`.
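As a rough illustration of the `.tolist()` point, the snippet below compares the two common ways of turning a numpy string array into a Python list. It uses only plain numpy and does not call the library's code; the variable names are made up for the example, and the PR does not state its reason for choosing `.tolist()`, so the type difference shown here is one plausible motivation, not a confirmed one.

```python
import numpy as np

# Hypothetical inputs, for illustration only.
texts = np.array(["first sentence", "second sentence"])
pairs = np.array([["q1", "d1"], ["q2", "d2"]])

# list(...) iterates the array and keeps numpy scalar types.
via_iter = list(texts)

# .tolist() converts elements to plain Python objects, recursively.
as_list = texts.tolist()
pairs_as_list = pairs.tolist()

print(type(via_iter[0]))  # a numpy string scalar type
print(type(as_list[0]))   # the built-in str type
print(pairs_as_list)      # nested Python lists, one inner list per pair
```

For a 2D array, `.tolist()` yields nested lists, which matches the usual "list of pairs" shape passed to a cross-encoder.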
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `sentence_transformers/base/model.py` | Updates singular-vs-batch detection for numpy string/bytes/object arrays. |
| `sentence_transformers/sentence_transformer/model.py` | Uses `.tolist()` when materializing numpy inputs in `encode()`. |
| `sentence_transformers/sparse_encoder/model.py` | Uses `.tolist()` when materializing numpy inputs in `encode()`. |
| `sentence_transformers/cross_encoder/model.py` | Handles numpy singular-pair conversion in `predict()` and adds dtype-based singular detection. |
| `tests/base/test_model.py` | Adds regression tests for numpy string/bytes/object singular detection and encode behavior. |
| `tests/sparse_encoder/test_model.py` | Adds encode regression tests for numpy string arrays. |
| `tests/cross_encoder/test_model.py` | Adds numpy-array singular detection + predict regression tests. |
Resolves #3718
Hello!
Pull Request overview
- Treat numpy string/object arrays as batches in `encode`/`predict` instead of single inputs

Details

Passing a 1D numpy string array (e.g. the output of `np.unique(...)`) to `SentenceTransformer.encode`, `SparseEncoder.encode`, or `CrossEncoder.predict` was being misclassified as a single sample, so the model produced one bogus embedding for the entire array rather than one embedding per element. The reason is that `is_singular_input` used `not isinstance(inputs, (list, tuple, Column))` as its sole heuristic, which lumps every `np.ndarray` (including string arrays) into the singular branch. We can't just blanket-exclude `np.ndarray` either, since a numeric ndarray is a legitimate single input for the multimodal/audio path.

I narrowed the heuristic by inspecting `dtype.kind`: arrays with kind `"U"` or `"O"` and `ndim >= 1` are now treated as batches, while numeric arrays keep their existing singular interpretation. `CrossEncoder.is_singular_input` uses the same dtype check but with `ndim < 2` for the singular case, since a 1D string array there represents one `(query, document)` pair and a 2D array is a batch of pairs.
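
The dtype-based rule described above could be sketched roughly as follows. This is not the code from the PR: `is_text_batch` and `is_single_pair` are hypothetical helper names, they only cover the `np.ndarray` case, and the real `is_singular_input` methods also handle lists, tuples, and `Column` inputs.

```python
import numpy as np

TEXT_KINDS = ("U", "O")  # unicode-string and object dtypes, per the description


def is_text_batch(arr: np.ndarray) -> bool:
    """Hypothetical helper: encoder-side rule for numpy inputs.

    String/object arrays with at least one dimension count as batches;
    anything else (e.g. a numeric waveform) stays a single input.
    """
    return arr.dtype.kind in TEXT_KINDS and arr.ndim >= 1


def is_single_pair(arr: np.ndarray) -> bool:
    """Hypothetical helper: cross-encoder rule for numpy inputs.

    A string/object array with fewer than two dimensions is one
    (query, document) pair; a 2D array is a batch of pairs.
    """
    return arr.dtype.kind in TEXT_KINDS and arr.ndim < 2


texts = np.unique(["b", "a", "b"])           # 1D array of strings
audio = np.zeros(16000, dtype=np.float32)    # numeric array, e.g. a waveform
pair = np.array(["query", "document"])       # one pair
pairs = np.array([["q1", "d1"], ["q2", "d2"]])  # batch of pairs

print(is_text_batch(texts), is_text_batch(audio))
print(is_single_pair(pair), is_single_pair(pairs))
```

Checking `dtype.kind` rather than the array type keeps numeric arrays on their existing single-input path while still catching the `np.unique(...)` case from the linked issue.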