Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

(feat) add repo match search results aggregation by repo metadata#51248

Merged
erzhtor merged 10 commits into
mainfrom
erzhtor/add-results-aggregation-by-repo-metadata
May 24, 2023
Merged

(feat) add repo match search results aggregation by repo metadata#51248
erzhtor merged 10 commits into
mainfrom
erzhtor/add-results-aggregation-by-repo-metadata

Conversation

@erzhtor

@erzhtor erzhtor commented Apr 28, 2023

Copy link
Copy Markdown
Contributor

Part of https://github.com/sourcegraph/pr-faqs/issues/96.

This PR introduces new aggregation mode over repository metadata. It works for select:repo and shows number of repositories for a given key-value pair metadata.

Test plan

  • sg start
  • Set repository-metadata feature flag to true
  • Use search with additional select:repo and check that aggregation panel had new "Repo metadata" option

Screenshots

Screen.Recording.2023-05-12.at.14.41.27.mov

@erzhtor erzhtor requested a review from camdencheek April 28, 2023 09:46
@erzhtor erzhtor self-assigned this Apr 28, 2023
@cla-bot cla-bot Bot added the cla-signed label Apr 28, 2023
@sourcegraph-buildkite

sourcegraph-buildkite commented Apr 28, 2023

Copy link
Copy Markdown
Collaborator

Bundle size report 📦

Initial size Total size Async size Modules
0.00% (+0.06 kb) 0.00% (+0.56 kb) 0.00% (+0.49 kb) 0.00% (0)

Look at the Statoscope report for a full comparison between the commits 789d428 and 8af39fc or learn more.

Open explanation
  • Initial size is the size of the initial bundle (the one that is loaded when you open the page)
  • Total size is the size of the initial bundle + all the async loaded chunks
  • Async size is the size of all the async loaded chunks
  • Modules is the number of modules in the initial bundle

@erzhtor erzhtor force-pushed the erzhtor/add-results-aggregation-by-repo-metadata branch from 6f21764 to 5674f5d Compare May 2, 2023 12:26
@toolmantim

toolmantim commented May 3, 2023

Copy link
Copy Markdown
Contributor

I think we might need to aggregate on value, and let them choose a key, to help them with the more common job to be done. e.g. What are the repo licenses across the company for this search? (e.g. choosing key license) Or, what is the department breakdown for these matching repositories (e.g. choosing key department)

I've designed up an idea for how that could work in https://github.com/sourcegraph/sourcegraph/pull/51392

@erzhtor erzhtor force-pushed the erzhtor/add-results-aggregation-by-repo-metadata branch from 5674f5d to 8c25b35 Compare May 11, 2023 09:16
@erzhtor

erzhtor commented May 11, 2023

Copy link
Copy Markdown
Contributor Author

I think we might need to aggregate on value, and let them choose a key, to help them with the more common job to be done. e.g. What are the repo licenses across the company for this search? (e.g. choosing key license) Or, what is the department breakdown for these matching repositories (e.g. choosing key department)

I've designed up an idea for how that could work in #51392

Thanks @toolmantim for the great feedback and preparing storybook example. During the last sync call with Ryan, we discussed to finilize this PR and merge it without repo meta picking UI, as it will take more effort to implement right now. The current architecture of aggregation panel doesn't support extra options/config, which will require lots of code and architecture changes.

However, we put this repo meta key picking on the list to address it if we have time after we finish rest of the issues/requirements.

@erzhtor erzhtor force-pushed the erzhtor/add-results-aggregation-by-repo-metadata branch from 8c25b35 to a5536d1 Compare May 12, 2023 08:24
@erzhtor erzhtor changed the title (draft) (feat) add repo match search results aggregation by repo metadata (feat) add repo match search results aggregation by repo metadata May 12, 2023
@erzhtor erzhtor requested review from a team May 12, 2023 08:46
@erzhtor erzhtor marked this pull request as ready for review May 12, 2023 08:47
@sourcegraph-bot

sourcegraph-bot commented May 12, 2023

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff bad2254...789d428.

Notify File(s)
@fkling client/web/src/search/results/components/aggregation/components/aggregation-mode-controls/AggregationModeControls.tsx
client/web/src/search/results/components/aggregation/hooks.ts
@limitedmage client/web/src/search/results/components/aggregation/components/aggregation-mode-controls/AggregationModeControls.tsx
client/web/src/search/results/components/aggregation/hooks.ts
@sourcegraph/code-insights-backend enterprise/internal/insights/aggregation/BUILD.bazel
enterprise/internal/insights/aggregation/aggregation.go
enterprise/internal/insights/aggregation/aggregation_test.go
enterprise/internal/insights/query/querybuilder/BUILD.bazel
enterprise/internal/insights/query/querybuilder/builder.go
enterprise/internal/insights/query/querybuilder/builder_test.go
enterprise/internal/insights/types/types.go

@sourcegraph-bot

sourcegraph-bot commented May 12, 2023

Copy link
Copy Markdown
Contributor

📖 Storybook live preview

@sashaostrikov sashaostrikov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Left 1 minor comment about naming

Comment thread enterprise/internal/insights/aggregation/aggregation.go Outdated
@erzhtor erzhtor requested review from ryphil and toolmantim May 12, 2023 09:31

@toolmantim toolmantim left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it's looking good.

I just found one issue, which is I believe type:repo doesn't work but should be supported (same as we do for the other groups, i.e. type:file)

Comment thread enterprise/cmd/frontend/internal/insights/resolvers/aggregates_resolvers.go Outdated

@chwarwick chwarwick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why you did it, but I'd like to avoid leaking the aggregation mode into result counting logic, it opens the possibility to do that for many more reasons and I'd would like to see if we can compose solutions on top of the existing match types instead.

I'm also concerned about performance on instances with hundreds of thousands of repos and on dotcom. Is there a plan in place to ensure that performance is acceptable?

repoIDs.Add(r.RepoName().ID)
}

res, err := r.db.Repos().List(r.ctx, database.ReposListOptions{IDs: repoIDs.Values()})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned about the performance of doing this on dotcom. A batch of matches can contain thousands of results and I know querying the repo table is already an issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But these thousand repos are read at once for each batch. Or is your concern memory usage related?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not worried so much about memory usage more so about long running queries on the repo table. This would issue a single query with a potentially thousands of repo ids. We've done similar things in insights for filtering repos out of an insight and had performance issues that came from passing a long list of parameters. I don't know this repo query well enough to know how it performs in that situation about you should make sure it handles it will.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the need for this DB call to load the metadata keys / values?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought yes, but I just looked and I see the metadata in the search result event stream, so it shouldn't be necessary.

I'm seeing matches comeback as

[
    {
        "type": "repo",
        "repositoryID": 399,
        "repository": "github.com/sourcegraph/sourcegraph",
        "repoStars": 7857,
        "repoLastFetched": "2023-05-16T15:41:12.441465Z",
        "description": "Code Intelligence Platform",
        "metadata": {
            "code-intel": null,
            "code-search": null,
            "license": "multiple",
            "open-source": "",
            "team": "sourcegraph"
        }
    }
]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that's the case, I don't fully understand why this lookup is happening. Could we just use the search results directly here @erzhtor?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metadata is populated lazily right at the search API boundary, so it's not on the search results that are seen by the aggregation code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chwarwick, can you please point out where did you check the stream? I'm getting following when checking here (where the aggregation is happening):

{
  "Results": [
    {
      "Name": "github.com/sourcegraph/sourcegraph",
      "ID": 7,
      "Rev": "",
      "DescriptionMatches": null,
      "RepoNameMatches": [{ "start": [0, 0, 0], "end": [34, 0, 34] }]
    }
  ],
  "Stats": {
    "IsLimitHit": false,
    "Repos": null,
    "Status": {},
    "BackendsMissing": 0,
    "ExcludedForks": 0,
    "ExcludedArchived": 0
  }
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you've mentioned search result stream. I think that part similarly does additional batch db query for repo metadata and appends to the eventMatch, https://sourcegraph.com/github.com/sourcegraph/sourcegraph@f3b65acc3e189c7b54586e7bd1d7eaa13dc77104/-/blob/cmd/frontend/internal/search/search.go?L697-713 (pointed out by @camdencheek). So I used the same approach here for aggregation event stream.

}

func NewSearchResultsAggregatorWithContext(ctx context.Context, tabulator AggregationTabulator, countFunc AggregationCountFunc, db database.DB) SearchResultsAggregator {
func NewSearchResultsAggregatorWithContext(ctx context.Context, tabulator AggregationTabulator, countFunc AggregationCountFunc, db database.DB, mode types.SearchAggregationMode) SearchResultsAggregator {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid the AggregationMode from leaking into this aggregator. My initial goal here was that over time we would be able to reuse this for Code Insights that are persisted on the dashboard as it was previously a goal to be able to persist these aggregations to an Insights dashboard.

I think an alternative approach that I think is worth trying is to use the repoCount function and then swap out the Aggregator when you are in metadata mode. My understanding is that since it's select:repo the result is going to be a long list of repoID with a 1.

I think you could compose your own aggregator on top of the existing limitedAggregator that builds up the repo slice and then converts it to metadata, maybe flushing the list every N Add calls or when the user requests it to be sorted.

type LimitedAggregator interface {
	Add(label string, count int32)
	SortAggregate() []*Aggregate
	OtherCounts() OtherCount
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion, will take a look!

@erzhtor erzhtor requested a review from chwarwick May 22, 2023 14:19

@chwarwick chwarwick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After talking though the other options I'm good with these changes. If Insights revisits being able to convert an aggregation into a a persisted Insight on a dashboard this will require some additional work to support but that isn't currently planned.

erzhtor and others added 3 commits May 23, 2023 19:52
Co-authored-by: Alex Ostrikov <alex.ostrikov@sourcegraph.com>
…po metadata to use "repository searches" instead of "repository match searches"
@erzhtor erzhtor force-pushed the erzhtor/add-results-aggregation-by-repo-metadata branch from 6023820 to 486c3d9 Compare May 23, 2023 18:04
@erzhtor erzhtor force-pushed the erzhtor/add-results-aggregation-by-repo-metadata branch from 486c3d9 to 789d428 Compare May 24, 2023 06:11
@erzhtor erzhtor merged commit b76236c into main May 24, 2023
@erzhtor erzhtor deleted the erzhtor/add-results-aggregation-by-repo-metadata branch May 24, 2023 07:10
ErikaRS pushed a commit that referenced this pull request Jun 22, 2023
…1248)

Co-authored-by: Alex Ostrikov <alex.ostrikov@sourcegraph.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants