insights: deprecate custom commit index#45644
Conversation
# Conflicts: # enterprise/internal/insights/background/queryrunner/worker_test.go # enterprise/internal/insights/compression/worker.go
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff fbb2b8c...e6df7e7.
|
leonore
left a comment
There was a problem hiding this comment.
just had a quick first pass through, is this still a wip or is it ready?
| @@ -7,36 +7,27 @@ import ( | |||
| "github.com/sourcegraph/sourcegraph/enterprise/internal/insights/types" | |||
There was a problem hiding this comment.
opened https://github.com/sourcegraph/sourcegraph/issues/45656 as a grab issue to get rid of the "frame" legacy wording not to pollute this PR too much
| OldestIndexedAt *time.Time | ||
| } | ||
|
|
||
| const getCommitsInRangeStr = ` |
There was a problem hiding this comment.
should we add an issue to remember to remove this table in the future
There was a problem hiding this comment.
|
@leonore this should be ready! |
| func (g *gitserverFilter) Filter(ctx context.Context, sampleTimes []time.Time, name api.RepoName) BackfillPlan { | ||
| var nodes []QueryExecution | ||
| getCommit := func(to time.Time) (*gitdomain.Commit, bool, error) { | ||
| commits, err := g.commitFetcher.RecentCommits(ctx, name, to) |
There was a problem hiding this comment.
Won't this be most felt on large monorepos? My understanding was that git log always starts from the current point, so it would need to traverse though the history 12 times. We could probably get this down to a single call. I see there is a LogReverseEach. It doesn't take options so it might not have everything we'd need but in theory do a single call and in the call back mark the points where searches need to happen.
There was a problem hiding this comment.
Yeah, it's a good callout. Unfortunately I don't have any data one way or the other, so let me poke at this for a few hours and see what I can find.
There was a problem hiding this comment.
So the big question for me was this: is it worse to send all of the bytes for every commit through stdout and read them to do this in one pass, or let git iterate through the commit graph on it's own, but potentially have multiple calls?
I experimented on gigarepo (25GB) and here is what I found:
Running log over all commits and counting them takes about 40 seconds
bash-5.1# time git --no-pager log --pretty="%h %cd" --no-abbrev | wc -l
2174609
real 0m39.306s
user 0m31.750s
sys 0m21.508s
Grabbing arbitrary commits at a time is fast, but it turns out the commits are not in order because of the way the gigarepo is constructed, so this doesn't actually tell the entire story.
bash-5.1# time git --no-pager log --pretty="%h %cd" --before="2011-01-01 00:00:00" --no-abbrev --date-order -n1
3e1a8ada0e6ae6b61724c83313aeeba487a360e3 Fri Dec 31 20:16:11 2010 +0000
real 0m0.700s
user 0m0.362s
sys 0m0.336s
Next I figured if I put the date old enough git will have to iterate all of the commits internally anyway
bash-5.1# time git --no-pager log --pretty="%h %cd" --before="2000-01-01 00:00:00" --no-abbrev --date-order -n50
cc97d63d1c33b94883f859addd407afc32dbe9ac Fri Apr 1 02:03:04 1988 -0500
435c3c1ff9e422159950e8efcaedfb6bc8d18cba Fri Apr 1 02:02:04 1988 -0500
fa26916c3329c8d4e8d49cde59e5996affa8d7c4 Sun Jan 20 01:02:03 1974 -0400
7629aa8aed33b793b6e365ee09cea16e61e476d7 Tue Jul 18 19:05:45 1972 -0500
4f76c6804600e9aff8ebb9c90ca53b4db76079d3 Thu Jan 1 00:00:00 1970 +0000
4f161d5e893b5ad924c8873729f6ddf144d268d7 Thu Jan 1 00:00:00 1970 +0000
real 0m3.126s
user 0m2.557s
sys 0m0.567s
With this result, it would be probably worse to query 12 times than walk the input with this command. However, I remembered that commits store a reference to their parent, so it should be possible to walk old commits very fast if we have a reference commit:
bash-5.1# time git --no-pager log --pretty="%h %cd" --before="2000-01-01 00:00:00" --no-abbrev --date-order -n50 cc97d63d1c33b94883f859addd407afc32dbe9ac
cc97d63d1c33b94883f859addd407afc32dbe9ac Fri Apr 1 02:03:04 1988 -0500
435c3c1ff9e422159950e8efcaedfb6bc8d18cba Fri Apr 1 02:02:04 1988 -0500
fa26916c3329c8d4e8d49cde59e5996affa8d7c4 Sun Jan 20 01:02:03 1974 -0400
7629aa8aed33b793b6e365ee09cea16e61e476d7 Tue Jul 18 19:05:45 1972 -0500
4f76c6804600e9aff8ebb9c90ca53b4db76079d3 Thu Jan 1 00:00:00 1970 +0000
4f161d5e893b5ad924c8873729f6ddf144d268d7 Thu Jan 1 00:00:00 1970 +0000
real 0m0.010s
user 0m0.003s
sys 0m0.007s
This is a very compelling result, and makes sense. We still walk the log only once as far back as we need, but we iteratively pass in previous commits as the starting point. In other words, walking backwards from the most recent time to the oldest, without any duplicate iterations.
Just to be sure my intuition was correct, I modified the code to do both and ran it locally:
[
{
"headDuration": "19.614ms",
"fromPrevDuration": "18.460916ms",
"headRev": "46603095267dd73539b5e43ca6a01e77b7a268d2",
"fromPrevRev": "46603095267dd73539b5e43ca6a01e77b7a268d2",
"sampleTime": "2022-12-14 22:21:00 +0000 UTC"
},
{
"headDuration": "20.504625ms",
"fromPrevDuration": "26.970166ms",
"headRev": "2fe76e22736bfc4ea8e511163e195fe3bb476bf4",
"fromPrevRev": "2fe76e22736bfc4ea8e511163e195fe3bb476bf4",
"sampleTime": "2022-10-14 22:21:00 +0000 UTC"
},
{
"headDuration": "19.896083ms",
"fromPrevDuration": "35.768ms",
"headRev": "c8a60301e18c5c233e8480c9cc8856ec0a6febd2",
"fromPrevRev": "c8a60301e18c5c233e8480c9cc8856ec0a6febd2",
"sampleTime": "2022-08-14 22:21:00 +0000 UTC"
},
{
"headDuration": "21.6195ms",
"fromPrevDuration": "37.160041ms",
"headRev": "0c7922306cedfeefe24d1988474f11e594f1629b",
"fromPrevRev": "0c7922306cedfeefe24d1988474f11e594f1629b",
"sampleTime": "2022-06-14 22:21:00 +0000 UTC"
},
{
"headDuration": "22.621083ms",
"fromPrevDuration": "39.965542ms",
"headRev": "19c8381086d168bf47397deeca24c05a30688906",
"fromPrevRev": "19c8381086d168bf47397deeca24c05a30688906",
"sampleTime": "2022-04-14 22:21:00 +0000 UTC"
},
{
"headDuration": "22.966875ms",
"fromPrevDuration": "40.502666ms",
"headRev": "039fafff4ac078b83450b2409a34463fbe88a194",
"fromPrevRev": "039fafff4ac078b83450b2409a34463fbe88a194",
"sampleTime": "2022-02-14 22:21:00 +0000 UTC"
},
{
"headDuration": "22.924667ms",
"fromPrevDuration": "37.380041ms",
"headRev": "eef10580008e0cb5f8fae538d13069ae8986a1c2",
"fromPrevRev": "eef10580008e0cb5f8fae538d13069ae8986a1c2",
"sampleTime": "2021-12-14 22:21:00 +0000 UTC"
},
{
"headDuration": "23.187917ms",
"fromPrevDuration": "35.018292ms",
"headRev": "0cf298d3a6953cc0f56b19b0e7d0ea0eae70af5e",
"fromPrevRev": "0cf298d3a6953cc0f56b19b0e7d0ea0eae70af5e",
"sampleTime": "2021-10-14 22:21:00 +0000 UTC"
},
{
"headDuration": "27.402417ms",
"fromPrevDuration": "38.130417ms",
"headRev": "ebd6b4836245019a7ba5e4c06604bc5aa2b86e28",
"fromPrevRev": "ebd6b4836245019a7ba5e4c06604bc5aa2b86e28",
"sampleTime": "2021-08-14 22:21:00 +0000 UTC"
},
{
"headDuration": "24.457458ms",
"fromPrevDuration": "36.767334ms",
"headRev": "e999549a6dea0330c8ca7ca3e48e2cbd18e96273",
"fromPrevRev": "e999549a6dea0330c8ca7ca3e48e2cbd18e96273",
"sampleTime": "2021-06-14 22:21:00 +0000 UTC"
},
{
"headDuration": "24.706916ms",
"fromPrevDuration": "41.805583ms",
"headRev": "d527260f3e873446c4e9dfac14409c7136344041",
"fromPrevRev": "d527260f3e873446c4e9dfac14409c7136344041",
"sampleTime": "2021-04-14 22:21:00 +0000 UTC"
},
{
"headDuration": "28.353334ms",
"fromPrevDuration": "38.7715ms",
"headRev": "7a2878701dbc1db1e1f0fd4e1975b7bfb9c60812",
"fromPrevRev": "7a2878701dbc1db1e1f0fd4e1975b7bfb9c60812",
"sampleTime": "2021-02-14 22:21:00 +0000 UTC"
}
]
Running against src/src locally it takes longer when passing in the revision, but I suspect there is some overhead in fetching the older commit object when there are few commits (26.6k on src/src).
I think this approach should solve this problem, and I'll add a timer to see exactly how long its taking.
There was a problem hiding this comment.
PR with the above here https://github.com/sourcegraph/sourcegraph/pull/45683
leonore
left a comment
There was a problem hiding this comment.
makes sense but let's see how this fares in practice too
Co-authored-by: leo <leo.p@sourcegraph.com>
#45683) * pass in previous revision when fetching recent commits to reduce impact for large repos * fix test * lint * Update enterprise/internal/insights/compression/compression.go Co-authored-by: leo <leo.p@sourcegraph.com> Co-authored-by: leo <leo.p@sourcegraph.com>
Closes #45184
This does a few things:
The compression behavior is roughly the same, but overall is much simpler now that we don't have to maintain state about the freshness of the index. Additionally, we are able to reduce the overhead of gitserver calls by reusing the revision which is something we disabled due to instability in the index previously.
As a really great bonus, this means the commit compression is no longer limited to the time window in which we store commits, and can be used for many years back. It also means we don't have to deal with data retention for this index.
Test plan
I've included some unit tests for the compression behavior, and poked at the product quite a bit. Here is an example of a backfill plan on a repo that is compressed (
go-multierror){"plan": { "Executions": [ { "Revision": "72917a1559e17f38638ade54020ab372ba848d67", "RecordingTime": "2020-03-13T00:00:00Z", "SharedRecordings": null }, { "Revision": "2004d9dba6b07a5b8d133209244f376680f9d472", "RecordingTime": "2020-06-13T00:00:00Z", "SharedRecordings": null }, { "Revision": "0d28cf682dbe774898e42a3db11b7ce24b36751a", "RecordingTime": "2020-09-13T00:00:00Z", "SharedRecordings": null }, { "Revision": "d51a2315c1223966f20802b44e436a5bcb9b6db2", "RecordingTime": "2020-12-13T00:00:00Z", "SharedRecordings": null }, { "Revision": "9974e9ec57696378079ecc3accd3d6f29401b3a0", "RecordingTime": "2021-03-13T00:00:00Z", "SharedRecordings": [ "2021-06-13T00:00:00Z", "2021-09-13T00:00:00Z", "2021-12-13T00:00:00Z", "2022-03-13T00:00:00Z", "2022-06-13T00:00:00Z", "2022-09-13T00:00:00Z", "2022-12-13T00:00:00Z" ] } ], "RecordCount": 5 } }