fix: update revision for NightOwl-CodeEmbedding #4799

Merged
Samoed merged 1 commit into embeddings-benchmark:main from Shun0212:nightowl-code-embedding-v2 on Jun 12, 2026

@Shun0212

Contributor

This is a follow-up to #4791, updating the model revision for Shuu12121/NightOwl-CodeEmbedding.

After reviewing the CoIR data structure more carefully, I found that the hard-negative mining pool could have included documents from the CoIR test splits. The qrels are partitioned via Hugging Face dataset splits (train/test), but the corpus and queries are each provided as a single split, with the partition indicated in a partition column. I had initially built the pool without filtering on that column. No test queries or positive pairs were included in training; the issue was limited to candidate documents in the hard-negative pool.

To remove this potential overlap with the submitted MTEB evaluation tasks, I rebuilt the training data with test-split documents filtered out of the pool, then retrained the model with the same script and hyperparameters. No results from the previous checkpoint have been submitted to the results repository or the leaderboard. This PR updates the revision to point to the retrained checkpoint, and the results I plan to submit will be produced with this revision.
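For illustration only, the filtering step described above might look like the following minimal sketch. It is not the script used to build the training data for this model: the function name `build_negative_pool`, the `_id` and `text` field names, and the literal partition values `"train"` and `"test"` are assumptions, and plain dictionaries stand in for the actual dataset rows.

```python
from typing import Iterable


def build_negative_pool(
    corpus: Iterable[dict],
    excluded_partitions: frozenset = frozenset({"test"}),
) -> list:
    """Return corpus rows that are safe to use as hard-negative candidates.

    Assumes each row carries a "partition" field (hypothetical values:
    "train" / "test"). Rows with no partition value are dropped too,
    which is a conservative choice made for this sketch.
    """
    pool = []
    for row in corpus:
        partition = row.get("partition")
        if partition is None or partition in excluded_partitions:
            continue
        pool.append(row)
    return pool


# Toy corpus: one test-partition document and one row with no partition.
corpus = [
    {"_id": "d1", "text": "def add(a, b): return a + b", "partition": "train"},
    {"_id": "d2", "text": "def sub(a, b): return a - b", "partition": "test"},
    {"_id": "d3", "text": "def mul(a, b): return a * b", "partition": "train"},
    {"_id": "d4", "text": "def div(a, b): return a / b"},
]

pool = build_negative_pool(corpus)
pool_ids = [row["_id"] for row in pool]
print(pool_ids)
```

Filtering before mining, rather than after, keeps test-partition documents out of the candidate ranking altogether, so they cannot displace legitimate training-partition negatives.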

Score comparison: previous vs. retrained checkpoint (NDCG@10)

| Task | Split | Previous | Retrained | Δ |
| --- | --- | --- | --- | --- |
| AppsRetrieval | test | 0.36361 | 0.39177 | +0.02816 |
| COIRCodeSearchNetRetrieval | test | 0.84063 | 0.84264 | +0.00201 |
| CodeEditSearchRetrieval | train | 0.74720 | 0.74808 | +0.00088 |
| CodeFeedbackMT | test | 0.76277 | 0.76690 | +0.00413 |
| CodeFeedbackST | test | 0.85137 | 0.85207 | +0.00070 |
| CodeSearchNetCCRetrieval | test | 0.91646 | 0.91805 | +0.00159 |
| CodeSearchNetRetrieval | test | 0.89187 | 0.89239 | +0.00052 |
| CodeTransOceanContest | test | 0.74091 | 0.75953 | +0.01862 |
| CodeTransOceanDL | test | 0.35802 | 0.36057 | +0.00255 |
| CosQA | test | 0.41207 | 0.42810 | +0.01603 |
| StackOverflowQA | test | 0.86031 | 0.86608 | +0.00577 |
| SyntheticText2SQL | test | 0.68354 | 0.68266 | -0.00088 |
| Macro average (12 tasks) | | 0.70240 | 0.70907 | +0.00667 |
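
As a quick sanity check, the macro averages can be recomputed from the per-task scores listed above. The snippet below is only a sketch for readers who want to verify the arithmetic; the dictionary layout and variable names are illustrative and not part of MTEB.

```python
# Per-task NDCG@10 as (previous, retrained), copied from the comparison above.
scores = {
    "AppsRetrieval": (0.36361, 0.39177),
    "COIRCodeSearchNetRetrieval": (0.84063, 0.84264),
    "CodeEditSearchRetrieval": (0.74720, 0.74808),
    "CodeFeedbackMT": (0.76277, 0.76690),
    "CodeFeedbackST": (0.85137, 0.85207),
    "CodeSearchNetCCRetrieval": (0.91646, 0.91805),
    "CodeSearchNetRetrieval": (0.89187, 0.89239),
    "CodeTransOceanContest": (0.74091, 0.75953),
    "CodeTransOceanDL": (0.35802, 0.36057),
    "CosQA": (0.41207, 0.42810),
    "StackOverflowQA": (0.86031, 0.86608),
    "SyntheticText2SQL": (0.68354, 0.68266),
}

# Macro average: unweighted mean over tasks.
prev_avg = sum(prev for prev, _ in scores.values()) / len(scores)
new_avg = sum(new for _, new in scores.values()) / len(scores)
delta = new_avg - prev_avg

print(f"previous={prev_avg:.5f} retrained={new_avg:.5f} delta={delta:+.5f}")
```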

The retrained checkpoint scores slightly higher. One possible explanation is that the contamination acted as false negatives (test-split documents being pushed away from training queries during contrastive training) rather than as leakage in the model's favor. Part of the delta may also be run-to-run variance, since the data change alters batch composition even under identical hyperparameters.
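
The false-negative effect mentioned above can be illustrated with a toy calculation. The sketch below assumes an InfoNCE-style contrastive loss with a temperature of 0.05; neither the loss nor the temperature is confirmed as the actual training setup in this PR, and the similarity values are made up. For such a loss, the gradient with respect to a negative's similarity is its softmax weight divided by the temperature, so a negative that is already close to the query is pushed away much harder than an unrelated one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def info_nce(pos_sim, neg_sims, temperature=0.05):
    """InfoNCE loss for one query: -log softmax weight of the positive."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    return -math.log(softmax(logits)[0])


def grad_wrt_negative(pos_sim, neg_sims, index, temperature=0.05):
    """d(loss)/d(neg_sims[index]) = softmax weight of that negative / temperature."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    return softmax(logits)[index + 1] / temperature


# Hypothetical similarities: one unrelated negative (0.1) and one
# "negative" that is actually relevant to the query (0.7).
pos_sim = 0.8
neg_sims = [0.1, 0.7]

grad_easy = grad_wrt_negative(pos_sim, neg_sims, index=0)
grad_false = grad_wrt_negative(pos_sim, neg_sims, index=1)
loss_with_false_negative = info_nce(pos_sim, neg_sims)
loss_without_false_negative = info_nce(pos_sim, neg_sims[:1])

print(grad_easy, grad_false)
print(loss_without_false_negative, loss_with_false_negative)
```

In this toy setting the relevant-but-mislabeled document receives by far the largest gradient, which is the mechanism by which a contaminated pool could lower, rather than raise, test scores.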

Apologies for the follow-up update, and thank you for your review.

@Samoed merged commit 8e00d7e into embeddings-benchmark:main on Jun 12, 2026
13 checks passed