(feat) add repo match search results aggregation by repo metadata#51248
Conversation
Bundle size report 📦
Look at the Statoscope report for a full comparison between the commits 789d428 and 8af39fc or learn more. Open explanation
|
6f21764 to
5674f5d
Compare
|
I think we might need to aggregate on value, and let them choose a key, to help them with the more common job to be done. e.g. What are the repo licenses across the company for this search? (e.g. choosing key I've designed up an idea for how that could work in https://github.com/sourcegraph/sourcegraph/pull/51392 |
5674f5d to
8c25b35
Compare
Thanks @toolmantim for the great feedback and preparing storybook example. During the last sync call with Ryan, we discussed to finilize this PR and merge it without repo meta picking UI, as it will take more effort to implement right now. The current architecture of aggregation panel doesn't support extra options/config, which will require lots of code and architecture changes. However, we put this repo meta key picking on the list to address it if we have time after we finish rest of the issues/requirements. |
8c25b35 to
a5536d1
Compare
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff bad2254...789d428.
|
sashaostrikov
left a comment
There was a problem hiding this comment.
Nice! Left 1 minor comment about naming
toolmantim
left a comment
There was a problem hiding this comment.
Overall it's looking good.
I just found one issue, which is I believe type:repo doesn't work but should be supported (same as we do for the other groups, i.e. type:file)
chwarwick
left a comment
There was a problem hiding this comment.
I understand why you did it, but I'd like to avoid leaking the aggregation mode into result counting logic, it opens the possibility to do that for many more reasons and I'd would like to see if we can compose solutions on top of the existing match types instead.
I'm also concerned about performance on instances with hundreds of thousands of repos and on dotcom. Is there a plan in place to ensure that performance is acceptable?
| repoIDs.Add(r.RepoName().ID) | ||
| } | ||
|
|
||
| res, err := r.db.Repos().List(r.ctx, database.ReposListOptions{IDs: repoIDs.Values()}) |
There was a problem hiding this comment.
I am concerned about the performance of doing this on dotcom. A batch of matches can contain thousands of results and I know querying the repo table is already an issue.
There was a problem hiding this comment.
But these thousand repos are read at once for each batch. Or is your concern memory usage related?
There was a problem hiding this comment.
I'm not worried so much about memory usage more so about long running queries on the repo table. This would issue a single query with a potentially thousands of repo ids. We've done similar things in insights for filtering repos out of an insight and had performance issues that came from passing a long list of parameters. I don't know this repo query well enough to know how it performs in that situation about you should make sure it handles it will.
There was a problem hiding this comment.
Is the need for this DB call to load the metadata keys / values?
There was a problem hiding this comment.
I thought yes, but I just looked and I see the metadata in the search result event stream, so it shouldn't be necessary.
I'm seeing matches comeback as
[
{
"type": "repo",
"repositoryID": 399,
"repository": "github.com/sourcegraph/sourcegraph",
"repoStars": 7857,
"repoLastFetched": "2023-05-16T15:41:12.441465Z",
"description": "Code Intelligence Platform",
"metadata": {
"code-intel": null,
"code-search": null,
"license": "multiple",
"open-source": "",
"team": "sourcegraph"
}
}
]There was a problem hiding this comment.
if that's the case, I don't fully understand why this lookup is happening. Could we just use the search results directly here @erzhtor?
There was a problem hiding this comment.
The metadata is populated lazily right at the search API boundary, so it's not on the search results that are seen by the aggregation code
There was a problem hiding this comment.
@chwarwick, can you please point out where did you check the stream? I'm getting following when checking here (where the aggregation is happening):
{
"Results": [
{
"Name": "github.com/sourcegraph/sourcegraph",
"ID": 7,
"Rev": "",
"DescriptionMatches": null,
"RepoNameMatches": [{ "start": [0, 0, 0], "end": [34, 0, 34] }]
}
],
"Stats": {
"IsLimitHit": false,
"Repos": null,
"Status": {},
"BackendsMissing": 0,
"ExcludedForks": 0,
"ExcludedArchived": 0
}
}There was a problem hiding this comment.
I see, you've mentioned search result stream. I think that part similarly does additional batch db query for repo metadata and appends to the eventMatch, https://sourcegraph.com/github.com/sourcegraph/sourcegraph@f3b65acc3e189c7b54586e7bd1d7eaa13dc77104/-/blob/cmd/frontend/internal/search/search.go?L697-713 (pointed out by @camdencheek). So I used the same approach here for aggregation event stream.
| } | ||
|
|
||
| func NewSearchResultsAggregatorWithContext(ctx context.Context, tabulator AggregationTabulator, countFunc AggregationCountFunc, db database.DB) SearchResultsAggregator { | ||
| func NewSearchResultsAggregatorWithContext(ctx context.Context, tabulator AggregationTabulator, countFunc AggregationCountFunc, db database.DB, mode types.SearchAggregationMode) SearchResultsAggregator { |
There was a problem hiding this comment.
I would like to avoid the AggregationMode from leaking into this aggregator. My initial goal here was that over time we would be able to reuse this for Code Insights that are persisted on the dashboard as it was previously a goal to be able to persist these aggregations to an Insights dashboard.
I think an alternative approach that I think is worth trying is to use the repoCount function and then swap out the Aggregator when you are in metadata mode. My understanding is that since it's select:repo the result is going to be a long list of repoID with a 1.
I think you could compose your own aggregator on top of the existing limitedAggregator that builds up the repo slice and then converts it to metadata, maybe flushing the list every N Add calls or when the user requests it to be sorted.
type LimitedAggregator interface {
Add(label string, count int32)
SortAggregate() []*Aggregate
OtherCounts() OtherCount
}There was a problem hiding this comment.
Thank you for the suggestion, will take a look!
chwarwick
left a comment
There was a problem hiding this comment.
After talking though the other options I'm good with these changes. If Insights revisits being able to convert an aggregation into a a persisted Insight on a dashboard this will require some additional work to support but that isn't currently planned.
add repo metadata default aggreation mode check
feat: handle repometa filter url for cases when value is present
Co-authored-by: Alex Ostrikov <alex.ostrikov@sourcegraph.com>
…po metadata to use "repository searches" instead of "repository match searches"
6023820 to
486c3d9
Compare
486c3d9 to
789d428
Compare
…1248) Co-authored-by: Alex Ostrikov <alex.ostrikov@sourcegraph.com>
Part of https://github.com/sourcegraph/pr-faqs/issues/96.
This PR introduces new aggregation mode over repository metadata. It works for
select:repoand shows number of repositories for a given key-value pair metadata.Test plan
sg startrepository-metadatafeature flag totrueselect:repoand check that aggregation panel had new "Repo metadata" optionScreenshots
Screen.Recording.2023-05-12.at.14.41.27.mov