[fix] Treat numpy string/object arrays as batches in encode/predict #3720

Merged
tomaarsen merged 2 commits into huggingface:main from tomaarsen:fix/numpy_arrays
Apr 13, 2026

tomaarsen commented Apr 13, 2026

Member

Resolves #3718

Hello!

Pull Request overview

  • Treat 1D+ numpy string/object arrays as batches in encode/predict instead of single inputs
  • Add tests to show the above works

Details

A 1D numpy string array (e.g. the output of np.unique(...)) passed to SentenceTransformer.encode, SparseEncoder.encode, or CrossEncoder.predict was misclassified as a single sample, so the model produced one bogus embedding for the entire array rather than one embedding per element. The reason is that is_singular_input relied solely on the check not isinstance(inputs, (list, tuple, Column)), which places every np.ndarray (including string arrays) in the singular branch. We can't just blanket-exclude np.ndarray either, since a numeric ndarray is a legitimate single input for the multimodal/audio path.
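
The misclassification can be reproduced without loading a model. The following is a minimal sketch, assuming a simplified stand-in for the pre-fix check: old_is_singular_input is a hypothetical name, and the Column type from the original tuple is left out so that the snippet depends only on numpy.

```python
import numpy as np


def old_is_singular_input(inputs):
    # Hypothetical, simplified stand-in for the pre-fix heuristic.
    # The original check also listed Column; it is omitted here (assumption)
    # so that the sketch only needs numpy.
    return not isinstance(inputs, (list, tuple))


texts = np.unique(["banana", "apple", "banana"])  # 1D array, dtype.kind == "U"
waveform = np.zeros(16000, dtype=np.float32)      # numeric array, e.g. raw audio

print(old_is_singular_input(["apple", "banana"]))  # False: a list is a batch
print(old_is_singular_input(texts))                # True: string array seen as one sample
print(old_is_singular_input(waveform))             # True: numeric array is a single input
```

The isinstance check alone cannot tell the string array and the waveform apart, which is why a dtype-based check is needed.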

I narrowed the heuristic by inspecting dtype.kind: arrays with kind "U" or "O" and ndim >= 1 are now treated as batches, while numeric arrays keep their existing singular interpretation. CrossEncoder.is_singular_input uses the same dtype check but with ndim < 2 for the singular case, since a 1D string array there represents one (query, document) pair and a 2D array is a batch of pairs.
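
The rule described here can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: both function names are hypothetical, Column is again omitted, and the CrossEncoder helper only covers the string/object array case mentioned in the description.

```python
import numpy as np

STRING_KINDS = ("U", "O")  # unicode and object dtypes, as named in the PR description


def is_singular_input(inputs):
    # Hypothetical sketch of the narrowed encode-side heuristic.
    if isinstance(inputs, np.ndarray):
        if inputs.dtype.kind in STRING_KINDS and inputs.ndim >= 1:
            return False  # string/object array: treated as a batch
        return True  # numeric arrays (and 0-d arrays) stay singular
    # Column from the original check is omitted here (assumption).
    return not isinstance(inputs, (list, tuple))


def is_singular_pair_array(pairs):
    # Hypothetical sketch of the CrossEncoder rule, limited to string/object
    # arrays: a 1D array is one (query, document) pair, 2D is a batch of pairs.
    if pairs.dtype.kind not in STRING_KINDS:
        raise TypeError("this sketch only handles string/object arrays")
    return pairs.ndim < 2


texts = np.unique(["banana", "apple", "banana"])
waveform = np.zeros(16000, dtype=np.float32)
one_pair = np.array(["a query", "a document"])
many_pairs = np.array([["q1", "d1"], ["q2", "d2"]])

print(is_singular_input(texts))            # False
print(is_singular_input(waveform))         # True
print(is_singular_pair_array(one_pair))    # True
print(is_singular_pair_array(many_pairs))  # False
```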

  • Tom Aarsen

Copilot AI left a comment

Contributor


Pull request overview

This PR addresses a regression where 1D+ numpy arrays of strings were treated as a single input in encode()/predict(), leading to incorrect modality routing (e.g., being inferred as audio) and producing one embedding/score for the entire array rather than per-element outputs.

Changes:

  • Refines is_singular_input heuristics to treat numpy string/object arrays as batched inputs (and handles CrossEncoder’s 1D-pair vs 2D-batch distinction).
  • Ensures numpy inputs are materialized via .tolist() where appropriate in encode()/predict().
  • Adds regression tests covering numpy string-array inputs for SentenceTransformer, SparseEncoder, and CrossEncoder.
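
The review does not say why .tolist() is used rather than list(...). As a hedged illustration of the difference in plain numpy (not a statement about the library's internals), the sketch below shows that .tolist() converts elements to built-in Python types and unrolls every dimension, while list(...) keeps numpy scalars and only iterates over the first axis.

```python
import numpy as np

texts = np.array(["apple", "banana"])
pairs = np.array([["query 1", "doc 1"], ["query 2", "doc 2"]])

as_list = list(texts)          # elements remain numpy string scalars
materialized = texts.tolist()  # elements become built-in str

print(type(as_list[0]) is np.str_)   # True
print(type(materialized[0]) is str)  # True

# For a 2D array, list(...) yields 1D arrays, while .tolist() yields nested lists.
print(isinstance(list(pairs)[0], np.ndarray))  # True
print(pairs.tolist())  # [['query 1', 'doc 1'], ['query 2', 'doc 2']]
```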

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File | Description
sentence_transformers/base/model.py | Updates singular-vs-batch detection for numpy string/bytes/object arrays.
sentence_transformers/sentence_transformer/model.py | Uses .tolist() when materializing numpy inputs in encode().
sentence_transformers/sparse_encoder/model.py | Uses .tolist() when materializing numpy inputs in encode().
sentence_transformers/cross_encoder/model.py | Handles numpy singular-pair conversion in predict() and adds dtype-based singular detection.
tests/base/test_model.py | Adds regression tests for numpy string/bytes/object singular detection and encode behavior.
tests/sparse_encoder/test_model.py | Adds encode regression tests for numpy string arrays.
tests/cross_encoder/test_model.py | Adds numpy-array singular detection + predict regression tests.


Comment thread sentence_transformers/base/model.py Outdated
Comment thread sentence_transformers/cross_encoder/model.py Outdated
Comment thread sentence_transformers/cross_encoder/model.py Outdated

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.



Comment thread sentence_transformers/base/model.py
Comment thread sentence_transformers/cross_encoder/model.py
Comment thread tests/base/test_model.py
tomaarsen changed the title from "[fix] Treat numpy string/bytes/object arrays as batches in encode/predict" to "[fix] Treat numpy string/object arrays as batches in encode/predict" on Apr 13, 2026
tomaarsen merged commit c500af5 into huggingface:main on Apr 13, 2026
22 checks passed

Development

Successfully merging this pull request may close these issues.

Bug: infer_modality misclassifies 1D numpy string arrays as audio, raising ValueError: Modality 'audio' is not supported
