Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

context: detect if README is required#63199

Merged
stefanhengl merged 5 commits into
mainfrom
sh/context/readme
Jun 14, 2024
Merged

context: detect if README is required#63199
stefanhengl merged 5 commits into
mainfrom
sh/context/readme

Conversation

@stefanhengl

@stefanhengl stefanhengl commented Jun 11, 2024

Copy link
Copy Markdown
Member

This adds a thin layer to expand a query with keywords which are not explicitly mentioned. As a first application, we detect if a README should be returned. See /vscode/src/chat/chat-view/context.ts for how a similar logic is implemented in the client.

Test plan:

  • new and updated unit tests
  • Cody web client: I tested the "what does this codebase do?" for sourcegraph/sourcegraph and sourcegraph/zoekt as selected context. For Zoekt we return /README.md in the top 3, which leads to a good answer. The Sourcegraph repo has many README files and our ranking ranks other README files higher, for example if they contain the string "README", which leads to a vague answer.

Addresses SPLF-94

@stefanhengl stefanhengl requested a review from jtibshirani June 11, 2024 11:13
@cla-bot cla-bot Bot added the cla-signed label Jun 11, 2024
@github-actions github-actions Bot added team/product-platform team/search-platform Issues owned by the search platform team labels Jun 11, 2024

@jtibshirani jtibshirani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Sourcegraph repo has many README files and our ranking ranks other README files higher, for example if they contain the string "README", which leads to a vague answer.

It feels like we should consistently match the top-level README. Do you plan to follow-up with a fix? Maybe boosting exact matches on filename higher in Zoekt? Brainstorming ...

--- a/score.go
+++ b/score.go
@@ -118,8 +118,13 @@ func calculateTermFrequency(cands []*candidateMatch, df termDocumentFrequency) m
        termFreqs := map[string]int{}
        for _, cand := range cands {
                term := string(cand.substrLowered)
+               length := uint32(len(cand.substrLowered))
                if cand.fileName {
-                       termFreqs[term] += 5
+                       if cand.byteOffset == 0 && cand.byteMatchSz == length {
+                               termFreqs[term] += 5
+                       } else {
+                               termFreqs[term] += 2
+                       }
                } else {
                        termFreqs[term]++
                }

Comment thread internal/search/job/jobutil/job_test.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a Marvin Gaye fan? :)

Comment thread internal/search/codycontext/query_parser.go Outdated
Comment thread internal/search/codycontext/query_parser.go Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be really nice to also match queries containing only the repo name, like

  • "what does batch-change-buddy/jtest do?"
  • "what does go-ctags do?"

Do you think it'd be possible to pull out the repo names and also match against these? They would count as "project signifiers".

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a regex to match project names on a best-effort basis. I added a couple of tests to check corner cases.

@jtibshirani jtibshirani Jun 13, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we pull out the repo names being searched and match against those? They are available in the query string.

This approach could be noisy and would also miss things like "what does go-ctags do?"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't understand exactly what you mean with "Couldn't we pull out the repo names being searched and match against those?". I don't think we can reliably identify "go-ctags" as repository unless the user has explicitly selected it as context. We could check the DB but that is quite expensive.

I will remove the repo logic for now and follow-up once we chatted a bit more about this.

@stefanhengl

stefanhengl commented Jun 12, 2024

Copy link
Copy Markdown
Member Author

It feels like we should consistently match the top-level README. Do you plan to follow-up with a fix?

Yes, definitely. Your suggestion looks promising. I will take a look. I wonder whether we could add something that prioritizes shorter paths in general, not just exact matches.

This adds a thin layer to detect entities in a query which are not
explicity mentioned. As a first application, we detect if a README
should probably be returned, following the approach in the client, see
/vscode/src/chat/chat-view/context.ts.

Test plan:
added unit test
@stefanhengl stefanhengl requested a review from jtibshirani June 13, 2024 08:31

@jtibshirani jtibshirani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Only comment: I feel like we should solidify the repo name matching, or just remove it entirely.

I'm leaning towards removing it, then in the client we can make sure to rewrite things like "What does @batch-change-buddy/jtest do?" to "What does batch-change-buddy/jtest repo do?" I suggest that here: https://linear.app/sourcegraph/issue/SPLF-98/rewrite-mentioned-entities-to-natural-language

@stefanhengl stefanhengl merged commit 92373c3 into main Jun 14, 2024
@stefanhengl stefanhengl deleted the sh/context/readme branch June 14, 2024 08:30
@jtibshirani

Copy link
Copy Markdown
Contributor

I guess I don't understand exactly what you mean with "Couldn't we pull out the repo names being searched and match against those?"

Cody context searches are always guaranteed to include repo filters, so I was thinking we could check for those and pull out the repo names. But I am +1 that we removed it, I like the "rewrite the initial query" option better (https://linear.app/sourcegraph/issue/SPLF-98/rewrite-mentioned-entities-to-natural-language).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

cla-signed team/product-platform team/search-platform Issues owned by the search platform team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants