context: detect if README is required#63199
Conversation
jtibshirani
left a comment
There was a problem hiding this comment.
The Sourcegraph repo has many README files and our ranking ranks other README files higher, for example if they contain the string "README", which leads to a vague answer.
It feels like we should consistently match the top-level README. Do you plan to follow-up with a fix? Maybe boosting exact matches on filename higher in Zoekt? Brainstorming ...
--- a/score.go
+++ b/score.go
@@ -118,8 +118,13 @@ func calculateTermFrequency(cands []*candidateMatch, df termDocumentFrequency) m
termFreqs := map[string]int{}
for _, cand := range cands {
term := string(cand.substrLowered)
+ length := uint32(len(cand.substrLowered))
if cand.fileName {
- termFreqs[term] += 5
+ if cand.byteOffset == 0 && cand.byteMatchSz == length {
+ termFreqs[term] += 5
+ } else {
+ termFreqs[term] += 2
+ }
} else {
termFreqs[term]++
}There was a problem hiding this comment.
Not a Marvin Gaye fan? :)
There was a problem hiding this comment.
It'd be really nice to also match queries containing only the repo name, like
- "what does batch-change-buddy/jtest do?"
- "what does go-ctags do?"
Do you think it'd be possible to pull out the repo names and also match against these? They would count as "project signifiers".
There was a problem hiding this comment.
I added a regex to match project names on a best-effort basis. I added a couple of tests to check corner cases.
There was a problem hiding this comment.
Couldn't we pull out the repo names being searched and match against those? They are available in the query string.
This approach could be noisy and would also miss things like "what does go-ctags do?"
There was a problem hiding this comment.
I guess I don't understand exactly what you mean with "Couldn't we pull out the repo names being searched and match against those?". I don't think we can reliably identify "go-ctags" as repository unless the user has explicitly selected it as context. We could check the DB but that is quite expensive.
I will remove the repo logic for now and follow-up once we chatted a bit more about this.
Yes, definitely. Your suggestion looks promising. I will take a look. I wonder whether we could add something that prioritizes shorter paths in general, not just exact matches. |
This adds a thin layer to detect entities in a query which are not explicity mentioned. As a first application, we detect if a README should probably be returned, following the approach in the client, see /vscode/src/chat/chat-view/context.ts. Test plan: added unit test
fdf33c2 to
1fdbf07
Compare
jtibshirani
left a comment
There was a problem hiding this comment.
Looks good to me! Only comment: I feel like we should solidify the repo name matching, or just remove it entirely.
I'm leaning towards removing it, then in the client we can make sure to rewrite things like "What does @batch-change-buddy/jtest do?" to "What does batch-change-buddy/jtest repo do?" I suggest that here: https://linear.app/sourcegraph/issue/SPLF-98/rewrite-mentioned-entities-to-natural-language
Cody context searches are always guaranteed to include |
This adds a thin layer to expand a query with keywords which are not explicitly mentioned. As a first application, we detect if a README should be returned. See /vscode/src/chat/chat-view/context.ts for how a similar logic is implemented in the client.
Test plan:
sourcegraph/sourcegraphandsourcegraph/zoektas selected context. For Zoekt we return/README.mdin the top 3, which leads to a good answer. The Sourcegraph repo has many README files and our ranking ranks other README files higher, for example if they contain the string "README", which leads to a vague answer.Addresses SPLF-94