insights: deprecate custom commit index by coury-clark · Pull Request #45644 · sourcegraph/sourcegraph-public-snapshot

coury-clark · 2022-12-13T21:08:30Z

This does a few things:

Deprecates the custom built commit indexer used by Code Insights in favor of querying gitserver directly. We have made a lot of progress in the backend infrastructure over the last year that really changes the landscape and means this isn't as necessary as it was.
Deprecate the Frames type. We don't really use this anymore, and it only leads to a lot of confusion when trying to reason about which point (to or from?) is useful. Instead we just pass a set of sample times which is how the application thinks about this anyway.
Fixes an issue where the generated sample times were getting truncated to the previous midnight UTC instead of something much closer to the created time. This has some long history behind, but for similar reasons above we don't need it anymore. As a happy note, this also fixes a problem where insights that extend beyond 1 year would only render 11 points on the chart instead of 12 because of this truncate mismatch.

The compression behavior is roughly the same, but overall is much simpler now that we don't have to maintain state about the freshness of the index. Additionally, we are able to reduce the overhead of gitserver calls by reusing the revision which is something we disabled due to instability in the index previously.

As a really great bonus, this means the commit compression is no longer limited to the time window in which we store commits, and can be used for many years back. It also means we don't have to deal with data retention for this index.

Test plan

I've included some unit tests for the compression behavior, and poked at the product quite a bit. Here is an example of a backfill plan on a repo that is compressed (go-multierror)

{"plan": {
  "Executions": [
    {
      "Revision": "72917a1559e17f38638ade54020ab372ba848d67",
      "RecordingTime": "2020-03-13T00:00:00Z",
      "SharedRecordings": null
    },
    {
      "Revision": "2004d9dba6b07a5b8d133209244f376680f9d472",
      "RecordingTime": "2020-06-13T00:00:00Z",
      "SharedRecordings": null
    },
    {
      "Revision": "0d28cf682dbe774898e42a3db11b7ce24b36751a",
      "RecordingTime": "2020-09-13T00:00:00Z",
      "SharedRecordings": null
    },
    {
      "Revision": "d51a2315c1223966f20802b44e436a5bcb9b6db2",
      "RecordingTime": "2020-12-13T00:00:00Z",
      "SharedRecordings": null
    },
    {
      "Revision": "9974e9ec57696378079ecc3accd3d6f29401b3a0",
      "RecordingTime": "2021-03-13T00:00:00Z",
      "SharedRecordings": [
        "2021-06-13T00:00:00Z",
        "2021-09-13T00:00:00Z",
        "2021-12-13T00:00:00Z",
        "2022-03-13T00:00:00Z",
        "2022-06-13T00:00:00Z",
        "2022-09-13T00:00:00Z",
        "2022-12-13T00:00:00Z"
      ]
    }
  ],
  "RecordCount": 5
}
}

# Conflicts: # enterprise/internal/insights/background/queryrunner/worker_test.go # enterprise/internal/insights/compression/worker.go

sourcegraph-bot · 2022-12-13T21:47:31Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff fbb2b8c...e6df7e7.

Notify	File(s)
@sourcegraph/code-insights-backend	enterprise/internal/insights/background/background.go enterprise/internal/insights/background/historical_enqueuer.go enterprise/internal/insights/background/queryrunner/worker.go enterprise/internal/insights/background/queryrunner/worker_test.go enterprise/internal/insights/compression/commits.go enterprise/internal/insights/compression/commits_test.go enterprise/internal/insights/compression/compression.go enterprise/internal/insights/compression/compression_test.go enterprise/internal/insights/compression/mocks_test.go enterprise/internal/insights/compression/worker.go enterprise/internal/insights/compression/worker_test.go enterprise/internal/insights/discovery/git_commits.go enterprise/internal/insights/pipeline/backfill.go enterprise/internal/insights/pipeline/backfill_test.go enterprise/internal/insights/query/capture_group_executor.go enterprise/internal/insights/query/streaming_query_executor.go enterprise/internal/insights/resolvers/insight_series_resolver.go enterprise/internal/insights/scheduler/backfill_state_inprogress_handler.go enterprise/internal/insights/scheduler/backfill_state_new_handler.go enterprise/internal/insights/timeseries/frames.go enterprise/internal/insights/timeseries/frames_test.go enterprise/internal/insights/types/types.go

leonore

just had a quick first pass through, is this still a wip or is it ready?

leonore · 2022-12-14T12:25:44Z

@@ -7,36 +7,27 @@ import (
 	"github.com/sourcegraph/sourcegraph/enterprise/internal/insights/types"


opened https://github.com/sourcegraph/sourcegraph/issues/45656 as a grab issue to get rid of the "frame" legacy wording not to pollute this PR too much

leonore · 2022-12-14T12:39:19Z

-	OldestIndexedAt *time.Time
-}
-
-const getCommitsInRangeStr = `


should we add an issue to remember to remove this table in the future

https://github.com/sourcegraph/sourcegraph/issues/45684 thanks!

coury-clark · 2022-12-14T14:07:45Z

@leonore this should be ready!

chwarwick · 2022-12-14T15:37:59Z

+func (g *gitserverFilter) Filter(ctx context.Context, sampleTimes []time.Time, name api.RepoName) BackfillPlan {
+	var nodes []QueryExecution
+	getCommit := func(to time.Time) (*gitdomain.Commit, bool, error) {
+		commits, err := g.commitFetcher.RecentCommits(ctx, name, to)


Won't this be most felt on large monorepos? My understanding was that git log always starts from the current point, so it would need to traverse though the history 12 times. We could probably get this down to a single call. I see there is a LogReverseEach. It doesn't take options so it might not have everything we'd need but in theory do a single call and in the call back mark the points where searches need to happen.

Yeah, it's a good callout. Unfortunately I don't have any data one way or the other, so let me poke at this for a few hours and see what I can find.

So the big question for me was this: is it worse to send all of the bytes for every commit through stdout and read them to do this in one pass, or let git iterate through the commit graph on it's own, but potentially have multiple calls?

I experimented on gigarepo (25GB) and here is what I found:

Running log over all commits and counting them takes about 40 seconds

bash-5.1# time git --no-pager log --pretty="%h %cd" --no-abbrev | wc -l 2174609 real 0m39.306s user 0m31.750s sys 0m21.508s

Grabbing arbitrary commits at a time is fast, but it turns out the commits are not in order because of the way the gigarepo is constructed, so this doesn't actually tell the entire story.

bash-5.1# time git --no-pager log --pretty="%h %cd" --before="2011-01-01 00:00:00" --no-abbrev --date-order -n1 3e1a8ada0e6ae6b61724c83313aeeba487a360e3 Fri Dec 31 20:16:11 2010 +0000 real 0m0.700s user 0m0.362s sys 0m0.336s

Next I figured if I put the date old enough git will have to iterate all of the commits internally anyway

bash-5.1# time git --no-pager log --pretty="%h %cd" --before="2000-01-01 00:00:00" --no-abbrev --date-order -n50 cc97d63d1c33b94883f859addd407afc32dbe9ac Fri Apr 1 02:03:04 1988 -0500 435c3c1ff9e422159950e8efcaedfb6bc8d18cba Fri Apr 1 02:02:04 1988 -0500 fa26916c3329c8d4e8d49cde59e5996affa8d7c4 Sun Jan 20 01:02:03 1974 -0400 7629aa8aed33b793b6e365ee09cea16e61e476d7 Tue Jul 18 19:05:45 1972 -0500 4f76c6804600e9aff8ebb9c90ca53b4db76079d3 Thu Jan 1 00:00:00 1970 +0000 4f161d5e893b5ad924c8873729f6ddf144d268d7 Thu Jan 1 00:00:00 1970 +0000 real 0m3.126s user 0m2.557s sys 0m0.567s

With this result, it would be probably worse to query 12 times than walk the input with this command. However, I remembered that commits store a reference to their parent, so it should be possible to walk old commits very fast if we have a reference commit:

bash-5.1# time git --no-pager log --pretty="%h %cd" --before="2000-01-01 00:00:00" --no-abbrev --date-order -n50 cc97d63d1c33b94883f859addd407afc32dbe9ac cc97d63d1c33b94883f859addd407afc32dbe9ac Fri Apr 1 02:03:04 1988 -0500 435c3c1ff9e422159950e8efcaedfb6bc8d18cba Fri Apr 1 02:02:04 1988 -0500 fa26916c3329c8d4e8d49cde59e5996affa8d7c4 Sun Jan 20 01:02:03 1974 -0400 7629aa8aed33b793b6e365ee09cea16e61e476d7 Tue Jul 18 19:05:45 1972 -0500 4f76c6804600e9aff8ebb9c90ca53b4db76079d3 Thu Jan 1 00:00:00 1970 +0000 4f161d5e893b5ad924c8873729f6ddf144d268d7 Thu Jan 1 00:00:00 1970 +0000 real 0m0.010s user 0m0.003s sys 0m0.007s

This is a very compelling result, and makes sense. We still walk the log only once as far back as we need, but we iteratively pass in previous commits as the starting point. In other words, walking backwards from the most recent time to the oldest, without any duplicate iterations.

Just to be sure my intuition was correct, I modified the code to do both and ran it locally:

[ { "headDuration": "19.614ms", "fromPrevDuration": "18.460916ms", "headRev": "46603095267dd73539b5e43ca6a01e77b7a268d2", "fromPrevRev": "46603095267dd73539b5e43ca6a01e77b7a268d2", "sampleTime": "2022-12-14 22:21:00 +0000 UTC" }, { "headDuration": "20.504625ms", "fromPrevDuration": "26.970166ms", "headRev": "2fe76e22736bfc4ea8e511163e195fe3bb476bf4", "fromPrevRev": "2fe76e22736bfc4ea8e511163e195fe3bb476bf4", "sampleTime": "2022-10-14 22:21:00 +0000 UTC" }, { "headDuration": "19.896083ms", "fromPrevDuration": "35.768ms", "headRev": "c8a60301e18c5c233e8480c9cc8856ec0a6febd2", "fromPrevRev": "c8a60301e18c5c233e8480c9cc8856ec0a6febd2", "sampleTime": "2022-08-14 22:21:00 +0000 UTC" }, { "headDuration": "21.6195ms", "fromPrevDuration": "37.160041ms", "headRev": "0c7922306cedfeefe24d1988474f11e594f1629b", "fromPrevRev": "0c7922306cedfeefe24d1988474f11e594f1629b", "sampleTime": "2022-06-14 22:21:00 +0000 UTC" }, { "headDuration": "22.621083ms", "fromPrevDuration": "39.965542ms", "headRev": "19c8381086d168bf47397deeca24c05a30688906", "fromPrevRev": "19c8381086d168bf47397deeca24c05a30688906", "sampleTime": "2022-04-14 22:21:00 +0000 UTC" }, { "headDuration": "22.966875ms", "fromPrevDuration": "40.502666ms", "headRev": "039fafff4ac078b83450b2409a34463fbe88a194", "fromPrevRev": "039fafff4ac078b83450b2409a34463fbe88a194", "sampleTime": "2022-02-14 22:21:00 +0000 UTC" }, { "headDuration": "22.924667ms", "fromPrevDuration": "37.380041ms", "headRev": "eef10580008e0cb5f8fae538d13069ae8986a1c2", "fromPrevRev": "eef10580008e0cb5f8fae538d13069ae8986a1c2", "sampleTime": "2021-12-14 22:21:00 +0000 UTC" }, { "headDuration": "23.187917ms", "fromPrevDuration": "35.018292ms", "headRev": "0cf298d3a6953cc0f56b19b0e7d0ea0eae70af5e", "fromPrevRev": "0cf298d3a6953cc0f56b19b0e7d0ea0eae70af5e", "sampleTime": "2021-10-14 22:21:00 +0000 UTC" }, { "headDuration": "27.402417ms", "fromPrevDuration": "38.130417ms", "headRev": "ebd6b4836245019a7ba5e4c06604bc5aa2b86e28", "fromPrevRev": "ebd6b4836245019a7ba5e4c06604bc5aa2b86e28", "sampleTime": "2021-08-14 22:21:00 +0000 UTC" }, { "headDuration": "24.457458ms", "fromPrevDuration": "36.767334ms", "headRev": "e999549a6dea0330c8ca7ca3e48e2cbd18e96273", "fromPrevRev": "e999549a6dea0330c8ca7ca3e48e2cbd18e96273", "sampleTime": "2021-06-14 22:21:00 +0000 UTC" }, { "headDuration": "24.706916ms", "fromPrevDuration": "41.805583ms", "headRev": "d527260f3e873446c4e9dfac14409c7136344041", "fromPrevRev": "d527260f3e873446c4e9dfac14409c7136344041", "sampleTime": "2021-04-14 22:21:00 +0000 UTC" }, { "headDuration": "28.353334ms", "fromPrevDuration": "38.7715ms", "headRev": "7a2878701dbc1db1e1f0fd4e1975b7bfb9c60812", "fromPrevRev": "7a2878701dbc1db1e1f0fd4e1975b7bfb9c60812", "sampleTime": "2021-02-14 22:21:00 +0000 UTC" } ]

Running against src/src locally it takes longer when passing in the revision, but I suspect there is some overhead in fetching the older commit object when there are few commits (26.6k on src/src).

I think this approach should solve this problem, and I'll add a timer to see exactly how long its taking.

PR with the above here https://github.com/sourcegraph/sourcegraph/pull/45683

leonore

makes sense but let's see how this fares in practice too

Co-authored-by: leo <leo.p@sourcegraph.com>

#45683) * pass in previous revision when fetching recent commits to reduce impact for large repos * fix test * lint * Update enterprise/internal/insights/compression/compression.go Co-authored-by: leo <leo.p@sourcegraph.com> Co-authored-by: leo <leo.p@sourcegraph.com>

coury-clark added 7 commits December 7, 2022 13:34

wip

5b27772

wip

0ef441f

wip

8b2ed47

removing frames type

9be2ec5

wip

6d31506

wip

f955773

wip

4619d90

cla-bot Bot added the cla-signed label Dec 13, 2022

coury-clark added 3 commits December 13, 2022 14:09

wip

10f8b47

wip

8d92698

wip

4c689c5

coury-clark requested a review from a team December 13, 2022 21:30

coury-clark marked this pull request as ready for review December 13, 2022 21:30

coury-clark self-assigned this Dec 13, 2022

coury-clark changed the title ~~Cclark/gitserver compression~~ insights: deprecate custom commit index Dec 13, 2022

coury-clark added 2 commits December 13, 2022 14:44

wip

f8d22a3

Merge branch 'main' into cclark/gitserver-compression

162c53a

# Conflicts: # enterprise/internal/insights/background/queryrunner/worker_test.go # enterprise/internal/insights/compression/worker.go

coury-clark added 6 commits December 13, 2022 14:53

wip

4a41171

wip

8ce7d2e

wip

1a23da4

lint

ab20dcb

changelog

147cfed

changelog

5f86043

leonore mentioned this pull request Dec 14, 2022

insights: replace frames legacy wording #45656

Closed

leonore reviewed Dec 14, 2022

View reviewed changes

chwarwick reviewed Dec 14, 2022

View reviewed changes

leonore approved these changes Dec 14, 2022

View reviewed changes

Comment thread enterprise/internal/insights/compression/compression.go

Comment thread enterprise/internal/insights/compression/compression.go

Comment thread enterprise/internal/insights/compression/compression.go

coury-clark mentioned this pull request Dec 15, 2022

pass in previous revision when fetching recent commits to reduce impa… #45683

Merged

coury-clark and others added 3 commits December 14, 2022 17:05

Update CHANGELOG.md

3a0d4e2

Co-authored-by: leo <leo.p@sourcegraph.com>

Merge branch 'main' into cclark/gitserver-compression

e6df7e7

coury-clark merged commit 7588981 into main Dec 15, 2022

coury-clark deleted the cclark/gitserver-compression branch December 15, 2022 15:54

		@@ -7,36 +7,27 @@ import (
		"github.com/sourcegraph/sourcegraph/enterprise/internal/insights/types"

Conversation

coury-clark commented Dec 13, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

sourcegraph-bot commented Dec 13, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leonore left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

leonore Dec 14, 2022

Choose a reason for hiding this comment

Uh oh!

coury-clark Dec 15, 2022

Choose a reason for hiding this comment

Uh oh!

leonore Dec 14, 2022

Choose a reason for hiding this comment

Uh oh!

coury-clark Dec 15, 2022

Choose a reason for hiding this comment

Uh oh!

coury-clark commented Dec 14, 2022

Uh oh!

chwarwick Dec 14, 2022

Choose a reason for hiding this comment

Uh oh!

coury-clark Dec 14, 2022

Choose a reason for hiding this comment

Uh oh!

coury-clark Dec 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coury-clark Dec 15, 2022

Choose a reason for hiding this comment

Uh oh!

leonore left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

coury-clark commented Dec 13, 2022 •

edited

Loading

sourcegraph-bot commented Dec 13, 2022 •

edited

Loading

coury-clark Dec 14, 2022 •

edited

Loading