fix: update revision for NightOwl-CodeEmbedding by Shun0212 · Pull Request #4799 · embeddings-benchmark/mteb

Shun0212 · 2026-06-11T23:37:15Z

This is a follow-up to #4791, updating the model revision for Shuu12121/NightOwl-CodeEmbedding.

After reviewing the CoIR data structure more carefully, I found that the hard-negative mining pool could have included documents from the CoIR test splits. The qrels are partitioned via Hugging Face dataset splits (train/test), but the corpus and queries are each provided as a single split, with the partition indicated in a partition column. I had initially built the pool without filtering on that column. No test queries or positive pairs were included in training; the issue was limited to candidate documents in the hard-negative pool.

To remove this potential overlap with the submitted MTEB evaluation tasks, I rebuilt the training data with test-split documents filtered out of the pool, then retrained the model with the same script and hyperparameters. No results from the previous checkpoint have been submitted to the results repository or the leaderboard. This PR updates the revision to point to the retrained checkpoint, and the results I plan to submit will be produced with this revision.
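A minimal sketch of the filtering step described above, assuming each corpus row carries a `partition` field whose value is `"test"` for test-split documents. The row layout, the `_id` and `text` fields, and the helper name `build_negative_pool` are hypothetical and are not taken from the actual training script.

```python
from typing import Dict, Iterable, List


def build_negative_pool(
    corpus_rows: Iterable[Dict[str, str]],
    excluded_partitions: Iterable[str] = ("test",),
) -> List[Dict[str, str]]:
    """Return the corpus rows that may be used as hard-negative candidates.

    Rows whose `partition` value is in `excluded_partitions` are dropped.
    Rows without a `partition` field are kept (an assumption of this sketch).
    """
    excluded = set(excluded_partitions)
    return [row for row in corpus_rows if row.get("partition") not in excluded]


# Toy corpus standing in for a single-split corpus with a partition column.
corpus = [
    {"_id": "d1", "text": "def add(a, b): return a + b", "partition": "train"},
    {"_id": "d2", "text": "def sub(a, b): return a - b", "partition": "test"},
    {"_id": "d3", "text": "def mul(a, b): return a * b", "partition": "train"},
]

pool = build_negative_pool(corpus)
pool_ids = [row["_id"] for row in pool]
print(pool_ids)
```

Filtering the pool before mining keeps test-split documents out of the candidate set entirely, so no later sampling step can reintroduce them.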

Score comparison: previous vs. retrained checkpoint (NDCG@10)

| Task | Split | Previous | Retrained | Δ |
| --- | --- | --- | --- | --- |
| AppsRetrieval | test | 0.36361 | 0.39177 | +0.02816 |
| COIRCodeSearchNetRetrieval | test | 0.84063 | 0.84264 | +0.00201 |
| CodeEditSearchRetrieval | train | 0.74720 | 0.74808 | +0.00088 |
| CodeFeedbackMT | test | 0.76277 | 0.76690 | +0.00413 |
| CodeFeedbackST | test | 0.85137 | 0.85207 | +0.00070 |
| CodeSearchNetCCRetrieval | test | 0.91646 | 0.91805 | +0.00159 |
| CodeSearchNetRetrieval | test | 0.89187 | 0.89239 | +0.00052 |
| CodeTransOceanContest | test | 0.74091 | 0.75953 | +0.01862 |
| CodeTransOceanDL | test | 0.35802 | 0.36057 | +0.00255 |
| CosQA | test | 0.41207 | 0.42810 | +0.01603 |
| StackOverflowQA | test | 0.86031 | 0.86608 | +0.00577 |
| SyntheticText2SQL | test | 0.68354 | 0.68266 | -0.00088 |
| Macro average (12 tasks) | | 0.70240 | 0.70907 | +0.00667 |
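The macro average row can be reproduced from the per-task scores with an unweighted mean. The sketch below copies the values from the table; the variable and function names are illustrative only.

```python
# (previous, retrained) NDCG@10 per task, copied from the table above.
scores = {
    "AppsRetrieval": (0.36361, 0.39177),
    "COIRCodeSearchNetRetrieval": (0.84063, 0.84264),
    "CodeEditSearchRetrieval": (0.74720, 0.74808),
    "CodeFeedbackMT": (0.76277, 0.76690),
    "CodeFeedbackST": (0.85137, 0.85207),
    "CodeSearchNetCCRetrieval": (0.91646, 0.91805),
    "CodeSearchNetRetrieval": (0.89187, 0.89239),
    "CodeTransOceanContest": (0.74091, 0.75953),
    "CodeTransOceanDL": (0.35802, 0.36057),
    "CosQA": (0.41207, 0.42810),
    "StackOverflowQA": (0.86031, 0.86608),
    "SyntheticText2SQL": (0.68354, 0.68266),
}


def macro_average(values):
    """Unweighted mean: every task counts equally, regardless of its size."""
    values = list(values)
    return sum(values) / len(values)


prev_avg = macro_average(prev for prev, _ in scores.values())
new_avg = macro_average(new for _, new in scores.values())
delta = new_avg - prev_avg

print(f"{prev_avg:.5f} {new_avg:.5f} {delta:+.5f}")
# prints: 0.70240 0.70907 +0.00667
```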

The retrained checkpoint scores slightly higher on 11 of the 12 tasks and on the macro average (+0.00667). This is consistent with the contamination having acted as false negatives (test-split documents being pushed away from training queries during contrastive training) rather than as leakage in the model's favor. Part of the delta may also be run-to-run variance, since the data change alters batch composition even under identical hyperparameters.
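To illustrate the false-negative mechanism, the sketch below computes an InfoNCE-style contrastive loss for a single query using toy similarity scores. The loss form, the temperature, and all scores are assumptions chosen for illustration; the PR does not state which loss the model was trained with.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive over all candidates."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Hypothetical cosine similarities between one training query and its candidates.
positive = 0.80
true_negatives = [0.30, 0.25]
false_negative = 0.78  # a relevant document mistakenly placed in the negative pool

clean_loss = info_nce(positive, true_negatives)
contaminated_loss = info_nce(positive, true_negatives + [false_negative])

print(clean_loss < contaminated_loss)
```

With the relevant document treated as a negative, the loss is larger, and minimizing it lowers that document's similarity to the query. This is the sense in which such a document is pushed away during training.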

Apologies for the follow-up update, and thank you for your review.

fix: update revision for nightowl code embedding model

9cdba1d

Samoed approved these changes Jun 12, 2026

Samoed merged commit 8e00d7e into embeddings-benchmark:main Jun 12, 2026
13 checks passed

Samoed merged 1 commit into embeddings-benchmark:main from Shun0212:nightowl-code-embedding-v2
