fix: Support scikit-learn 1.9 in ZeroShotClassification #4790
scikit-learn 1.9 raises "ValueError: Mix of label input types" when classification metrics receive string y_true with numeric y_pred. Zeroshot predictions are always integer indices into the candidate labels, so string dataset labels are now mapped to their candidate index before scoring. Unmappable string labels raise a clear error instead of silently scoring 0.0, which is what scikit-learn < 1.9 did. Removes the <1.9.0 pin introduced as a stopgap in embeddings-benchmark#4783. Fixes embeddings-benchmark#4784
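The mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `normalize_labels`, its signature, and the error wording are assumptions made for the example.

```python
from __future__ import annotations

from collections.abc import Sequence


def normalize_labels(
    labels: Sequence[int | str], candidate_labels: Sequence[str]
) -> list[int]:
    """Hypothetical stand-in for the normalization step (not the mteb code)."""
    # Map each candidate label string to its position in the candidate list.
    index_of = {name: i for i, name in enumerate(candidate_labels)}
    normalized: list[int] = []
    for label in labels:
        if isinstance(label, str):
            if label not in index_of:
                # Fail loudly instead of letting the metric silently score 0.0.
                raise ValueError(
                    f"Label {label!r} matches no candidate label; expected an "
                    f"integer index or one of {list(candidate_labels)}"
                )
            normalized.append(index_of[label])
        else:
            # Integer labels are already candidate indices and pass through.
            normalized.append(int(label))
    return normalized
```

With candidates `["cat", "dog"]`, the labels `["dog", "cat"]` become `[1, 0]`, integer labels are returned unchanged, and a label such as `"bird"` raises a `ValueError`.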
Can you run some tasks to verify that scores are matching?
Ran
Identical to full float precision, as expected: The RenderedSST2 value also matches the published result for this model in embeddings-benchmark/results (

Repro:

import mteb
model = mteb.get_model_meta("openai/clip-vit-base-patch32")
tasks = mteb.get_tasks(tasks=["RenderedSST2", "DTDZeroShot"])
results = mteb.evaluate(model, tasks, cache=None, co2_tracker=False)
print({r.task_name: r.scores["test"][0]["accuracy"] for r in results})
Note on the one red check: test-dockerfile also fails on main itself (pushes at 16:06 and 16:48 UTC today) and on the other open PRs, so it is the pre-existing #4292 flake, not this diff. Everything else is green, including the py3.11-3.14 jobs that run scikit-learn 1.9.0.
Fixes #4784
scikit-learn 1.9 raises on mixed-type inputs to classification metrics (scikit-learn#33086).
AbsTaskZeroShotClassification hits exactly that case: predictions are always integer indices into the candidate labels, while the dataset label column can contain strings. On main with scikit-learn 1.9.0, tests/test_abstasks/test_predictions.py::test_predictions[task2-expected2] fails with "ValueError: Mix of label input types".

The pre-1.9 behavior was arguably worse: accuracy_score(["label1", "label2"], [0, 1]) compared strings to ints, never matched, and silently reported 0.0 accuracy.

Changes
- AbsTaskZeroShotClassification._normalize_labels: integer labels pass through unchanged, string labels are mapped to their index in get_candidate_labels(), and strings that match no candidate raise a ValueError describing the label contract. For external task authors this is a behavior change: unmappable string labels now fail loudly on every scikit-learn version instead of silently scoring 0.0.
- Removed the scikit-learn<1.9.0 stopgap pin from "fix: Update lock and remove python limit fo pylate and colbert_engine" #4783 and regenerated the lock with uv lock --upgrade-package scikit-learn. Since scikit-learn 1.9 dropped Python 3.10, the lock forks to 1.9.0 on Python >= 3.11 and stays on 1.7.2 for 3.10, so the CI matrix exercises both sides.
- tests/test_abstasks/test_zeroshot_classification.py: unit tests for the passthrough, mapping, and error contract, plus an end-to-end regression test that evaluates mteb/baseline-random-encoder on the string-label mock and asserts accuracy == 1.0 (deterministic; the encoder seeds embeddings per input string).

Leaderboard impact
None. Predictions are untouched and label normalization is an identity for integer labels. All registered zeroshot tasks store labels as integer indices: most derive candidates from ClassLabel features, SciMMIR maps strings to ints in dataset_transform, and the remaining label columns were verified as int64 via the HF datasets server (mteb/esc50, mteb/SpeechCommandsZeroshotv0.01, mteb/urbansound8K, mteb/wds_imagenet1k, and others). The string path only ever fired for the test mocks.

Verification
- Reproduced the failing test on main with scikit-learn 1.9.0, then confirmed it passes with this branch (red/green on the same test).
- tests/test_abstasks (154 passed), tests/test_integrations + prompt validation (314 passed), tests/test_evaluators, tests/test_evaluate.py, tests/test_result_cache.py (87 passed) all green locally on scikit-learn 1.9.0.
- ruff format --check, ruff check, typos, and mypy mteb (mypy 2.1.0) clean; uv lock --check passes.
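For readers who want to see why the string-label case used to score 0.0 and why the regression test can assert an accuracy of 1.0, here is a small illustration that does not depend on scikit-learn. The `accuracy` helper is a plain-Python stand-in written for this example, not scikit-learn's accuracy_score, and the candidate list is assumed for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the label equals the prediction."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Assumed candidate labels; predictions are indices into this list.
candidates = ["label1", "label2"]
y_true = ["label1", "label2"]
y_pred = [0, 1]

# Comparing strings to ints never matches, so the raw score is 0.0.
raw_score = accuracy(y_true, y_pred)

# Mapping each string label to its candidate index makes the types agree.
mapped_true = [candidates.index(label) for label in y_true]
mapped_score = accuracy(mapped_true, y_pred)

print(raw_score, mapped_score)  # prints: 0.0 1.0
```

The predictions are identical in both calls; only the representation of the ground-truth labels changes, which is consistent with the claim that existing integer-labelled tasks are unaffected.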