fix: update revision for NightOwl-CodeEmbedding#4799
Merged
Samoed merged 1 commit intoJun 12, 2026
Conversation
Samoed
approved these changes
Jun 12, 2026
This is a follow-up to #4791, updating the model revision for Shuu12121/NightOwl-CodeEmbedding.
After reviewing the CoIR data structure more carefully, I found that the hard-negative mining pool could have included documents from the CoIR test splits: while the qrels are partitioned via Hugging Face dataset splits (train/test), the corpus and queries are provided as a single split with the partition indicated in a `partition` column, and I had initially built the pool without filtering on that column. No test queries or positive pairs were included in training; the issue was limited to candidate documents in the hard-negative pool.
To remove this potential overlap with the submitted MTEB evaluation tasks, I rebuilt the training data with test-split documents filtered out of the pool, then retrained the model with the same script and hyperparameters. No results from the previous checkpoint have been submitted to the results repository or the leaderboard. This PR updates the revision to point to the retrained checkpoint, and the results I plan to submit will be produced with this revision.
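For reference, the filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the actual training script: the `partition` column name comes from the description, while the `_id` and `text` fields, the `"train"`/`"test"` values, and the helper name `build_negative_pool` are assumptions made for the example.

```python
def build_negative_pool(corpus_rows, excluded_partitions=frozenset({"test"})):
    """Return the corpus rows that are eligible for hard-negative mining.

    Rows whose `partition` value is in `excluded_partitions` are dropped,
    so evaluation documents never enter the candidate pool.
    """
    return [
        row for row in corpus_rows
        if row["partition"] not in excluded_partitions
    ]


# Toy corpus in the assumed single-split layout: every document carries
# its partition as a column value instead of living in a separate split.
corpus = [
    {"_id": "d1", "text": "def add(a, b): return a + b", "partition": "train"},
    {"_id": "d2", "text": "def sub(a, b): return a - b", "partition": "test"},
    {"_id": "d3", "text": "def mul(a, b): return a * b", "partition": "train"},
]

pool = build_negative_pool(corpus)
pool_ids = [row["_id"] for row in pool]
```

In this toy example only the test-partition document is excluded. A stricter variant could pass a larger `excluded_partitions` set if other held-out partitions should also stay out of the pool.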
Score comparison: previous vs. retrained checkpoint (NDCG@10)
The retrained checkpoint scores slightly higher. This is consistent with the contamination having acted as false negatives (test-split documents being pushed away from training queries during contrastive training) rather than as leakage in the model's favor, although that is only one possible explanation. Part of the delta may also be run-to-run variance, since the data change alters batch composition even under identical hyperparameters.
Apologies for the follow-up update, and thank you for your review.