Increase filename boost #785

Merged
jtibshirani merged 2 commits into main from jtibs/filename-boost on May 28, 2024

jtibshirani commented May 28, 2024

When we introduced filename boosting in BM25, we set it to a very conservative
weight. This PR increases the weight from 2.0 to 5.0, which improves recall on
the golden queries evals and leaves the CodeSearchNet evals unchanged.

Relates to SPLF-88
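
To make the effect of the weight concrete, here is a minimal sketch of one way a filename boost can enter a BM25 score. It is not the project's actual implementation: the function names, the `k1` and `b` defaults, and the choice to fold the boost into the term frequency are all assumptions made for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Standard BM25 contribution of one query term to one document."""
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


def effective_tf(content_hits, filename_hits, filename_boost):
    # ASSUMPTION: each filename match counts as `filename_boost` content matches.
    return content_hits + filename_boost * filename_hits


def score_file(content_hits, filename_hits, filename_boost,
               doc_len=100, avg_doc_len=100, df=10, num_docs=1000):
    """Score one file for one query term (corpus statistics are made up)."""
    tf = effective_tf(content_hits, filename_hits, filename_boost)
    return bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs)


if __name__ == "__main__":
    for boost in (2.0, 5.0):
        # File A: one content match plus one filename match.
        file_a = score_file(content_hits=1, filename_hits=1, filename_boost=boost)
        # File B: three content matches, no filename match.
        file_b = score_file(content_hits=3, filename_hits=0, filename_boost=boost)
        print(f"boost={boost}: file A {file_a:.3f}, file B {file_b:.3f}")
```

Under these assumptions, a weight of 2.0 makes file A tie with file B, while a weight of 5.0 ranks file A higher. Files without a filename match score the same under either weight.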

cla-bot (bot) added the cla-signed label May 28, 2024

jtibshirani commented May 28, 2024

Golden queries evals (note the change in "Explain file", a class of queries that contain a filename)

Before

Breakdown by class:
Find symbol	7/10
Find string	2/2
Explain file	1/2
Explain concept	4/5
Check dependency	1/2
Find logic	31/43
Gather information	11/16
Changelog	0/2
Ownership	2/2
How-to	1/1
Foreign language	0/2
Long request	0/2

Combined recall	60/89

After

Breakdown by class:
Find symbol	7/10
Find string	2/2
Explain file	2/2
Explain concept	4/5
Check dependency	1/2
Find logic	31/43
Gather information	12/16
Changelog	0/2
Ownership	2/2
How-to	1/1
Foreign language	0/2
Long request	0/2

Combined recall	62/89
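
The two breakdowns can be cross-checked with a short script. This is only a sketch for checking the arithmetic: the dictionaries are transcribed from the Before and After tables above, and the helper names are hypothetical.

```python
# Per-class (hits, total) pairs transcribed from the Before/After tables.
BEFORE = {
    "Find symbol": (7, 10), "Find string": (2, 2), "Explain file": (1, 2),
    "Explain concept": (4, 5), "Check dependency": (1, 2), "Find logic": (31, 43),
    "Gather information": (11, 16), "Changelog": (0, 2), "Ownership": (2, 2),
    "How-to": (1, 1), "Foreign language": (0, 2), "Long request": (0, 2),
}
AFTER = {
    "Find symbol": (7, 10), "Find string": (2, 2), "Explain file": (2, 2),
    "Explain concept": (4, 5), "Check dependency": (1, 2), "Find logic": (31, 43),
    "Gather information": (12, 16), "Changelog": (0, 2), "Ownership": (2, 2),
    "How-to": (1, 1), "Foreign language": (0, 2), "Long request": (0, 2),
}


def combined_recall(breakdown):
    """Sum hits and totals across all query classes."""
    hits = sum(h for h, _ in breakdown.values())
    total = sum(t for _, t in breakdown.values())
    return hits, total


def changed_classes(before, after):
    """Return {class: change in hits} for classes whose hit count changed."""
    return {
        name: after[name][0] - before[name][0]
        for name in before
        if after[name][0] != before[name][0]
    }


if __name__ == "__main__":
    print(combined_recall(BEFORE))         # (60, 89)
    print(combined_recall(AFTER))          # (62, 89)
    print(changed_classes(BEFORE, AFTER))  # {'Explain file': 1, 'Gather information': 1}
```

The per-class sums match the stated combined recall in both tables, and the only classes that move are "Explain file" and "Gather information", each by one query.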

CodeSearchNet evals (results unchanged)

Before

Recall (files)	91/99
Recall (chunks)	74/99
Average chunk overlap	0.89

After

Recall (files)	91/99
Recall (chunks)	74/99
Average chunk overlap	0.89

jtibshirani requested a review from a team May 28, 2024 18:59
jtibshirani merged commit 640102a into main May 28, 2024
jtibshirani deleted the jtibs/filename-boost branch May 28, 2024 21:51

chenkc805 commented Jul 3, 2024

Just checking, did the filename boost actually decrease recall here? Because when I add up all the numerators, I get 64 instead of 60:

Breakdown by class:
Find symbol	9/10
Find string	2/2
Explain file	2/2
Explain concept	4/5
Check dependency	1/2
Find logic	31/43
Gather information	12/16
Changelog	0/2
Ownership	2/2
How-to	1/1
Foreign language	0/2
Long request	0/2

Combined recall	60/89
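
The mismatch in the quoted breakdown can be reproduced mechanically. The following is a small sketch of such a consistency check: the text is the quoted table with tabs written as `\t`, and `check_combined` is a hypothetical helper, not part of any eval tooling mentioned in this thread.

```python
# The quoted breakdown, one "class<TAB>hits/total" row per line.
QUOTED = (
    "Find symbol\t9/10\n"
    "Find string\t2/2\n"
    "Explain file\t2/2\n"
    "Explain concept\t4/5\n"
    "Check dependency\t1/2\n"
    "Find logic\t31/43\n"
    "Gather information\t12/16\n"
    "Changelog\t0/2\n"
    "Ownership\t2/2\n"
    "How-to\t1/1\n"
    "Foreign language\t0/2\n"
    "Long request\t0/2\n"
)


def check_combined(breakdown_text, stated):
    """Sum the per-class ratios and compare them with the stated (hits, total)."""
    hits = total = 0
    for line in breakdown_text.strip().splitlines():
        _name, ratio = line.rsplit("\t", 1)
        h, t = ratio.split("/")
        hits += int(h)
        total += int(t)
    return (hits, total), (hits, total) == stated


if __name__ == "__main__":
    summed, consistent = check_combined(QUOTED, stated=(60, 89))
    print(summed, consistent)  # (64, 89) False
```

The denominators add up to 89 as stated, but the numerators add up to 64, so this breakdown and its "Combined recall 60/89" line cannot both be right.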

jtibshirani commented

@chenkc805 good catch! I checked our eval snapshots for the change right before this one and confirmed that 60/89 is in fact right. I somehow made a copy-paste error in the breakdown.

I updated the comment above. I even called out how "explain file" improved, so we can be sure I was looking at the right numbers!
