Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

insights: deprecate custom commit index#45644

Merged
coury-clark merged 21 commits into
mainfrom
cclark/gitserver-compression
Dec 15, 2022
Merged

insights: deprecate custom commit index#45644
coury-clark merged 21 commits into
mainfrom
cclark/gitserver-compression

Conversation

@coury-clark

@coury-clark coury-clark commented Dec 13, 2022

Copy link
Copy Markdown
Contributor

Closes #45184

This does a few things:

  1. Deprecates the custom built commit indexer used by Code Insights in favor of querying gitserver directly. We have made a lot of progress in the backend infrastructure over the last year that really changes the landscape and means this isn't as necessary as it was.
  2. Deprecate the Frames type. We don't really use this anymore, and it only leads to a lot of confusion when trying to reason about which point (to or from?) is useful. Instead we just pass a set of sample times which is how the application thinks about this anyway.
  3. Fixes an issue where the generated sample times were getting truncated to the previous midnight UTC instead of something much closer to the created time. This has some long history behind, but for similar reasons above we don't need it anymore. As a happy note, this also fixes a problem where insights that extend beyond 1 year would only render 11 points on the chart instead of 12 because of this truncate mismatch.

The compression behavior is roughly the same, but overall is much simpler now that we don't have to maintain state about the freshness of the index. Additionally, we are able to reduce the overhead of gitserver calls by reusing the revision which is something we disabled due to instability in the index previously.

As a really great bonus, this means the commit compression is no longer limited to the time window in which we store commits, and can be used for many years back. It also means we don't have to deal with data retention for this index.

Test plan

I've included some unit tests for the compression behavior, and poked at the product quite a bit. Here is an example of a backfill plan on a repo that is compressed (go-multierror)

{"plan": {
  "Executions": [
    {
      "Revision": "72917a1559e17f38638ade54020ab372ba848d67",
      "RecordingTime": "2020-03-13T00:00:00Z",
      "SharedRecordings": null
    },
    {
      "Revision": "2004d9dba6b07a5b8d133209244f376680f9d472",
      "RecordingTime": "2020-06-13T00:00:00Z",
      "SharedRecordings": null
    },
    {
      "Revision": "0d28cf682dbe774898e42a3db11b7ce24b36751a",
      "RecordingTime": "2020-09-13T00:00:00Z",
      "SharedRecordings": null
    },
    {
      "Revision": "d51a2315c1223966f20802b44e436a5bcb9b6db2",
      "RecordingTime": "2020-12-13T00:00:00Z",
      "SharedRecordings": null
    },
    {
      "Revision": "9974e9ec57696378079ecc3accd3d6f29401b3a0",
      "RecordingTime": "2021-03-13T00:00:00Z",
      "SharedRecordings": [
        "2021-06-13T00:00:00Z",
        "2021-09-13T00:00:00Z",
        "2021-12-13T00:00:00Z",
        "2022-03-13T00:00:00Z",
        "2022-06-13T00:00:00Z",
        "2022-09-13T00:00:00Z",
        "2022-12-13T00:00:00Z"
      ]
    }
  ],
  "RecordCount": 5
}
}

@cla-bot cla-bot Bot added the cla-signed label Dec 13, 2022
@coury-clark coury-clark requested a review from a team December 13, 2022 21:30
@coury-clark coury-clark marked this pull request as ready for review December 13, 2022 21:30
@coury-clark coury-clark self-assigned this Dec 13, 2022
@coury-clark coury-clark changed the title Cclark/gitserver compression insights: deprecate custom commit index Dec 13, 2022
# Conflicts:
#	enterprise/internal/insights/background/queryrunner/worker_test.go
#	enterprise/internal/insights/compression/worker.go
@sourcegraph-bot

sourcegraph-bot commented Dec 13, 2022

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff fbb2b8c...e6df7e7.

Notify File(s)
@sourcegraph/code-insights-backend enterprise/internal/insights/background/background.go
enterprise/internal/insights/background/historical_enqueuer.go
enterprise/internal/insights/background/queryrunner/worker.go
enterprise/internal/insights/background/queryrunner/worker_test.go
enterprise/internal/insights/compression/commits.go
enterprise/internal/insights/compression/commits_test.go
enterprise/internal/insights/compression/compression.go
enterprise/internal/insights/compression/compression_test.go
enterprise/internal/insights/compression/mocks_test.go
enterprise/internal/insights/compression/worker.go
enterprise/internal/insights/compression/worker_test.go
enterprise/internal/insights/discovery/git_commits.go
enterprise/internal/insights/pipeline/backfill.go
enterprise/internal/insights/pipeline/backfill_test.go
enterprise/internal/insights/query/capture_group_executor.go
enterprise/internal/insights/query/streaming_query_executor.go
enterprise/internal/insights/resolvers/insight_series_resolver.go
enterprise/internal/insights/scheduler/backfill_state_inprogress_handler.go
enterprise/internal/insights/scheduler/backfill_state_new_handler.go
enterprise/internal/insights/timeseries/frames.go
enterprise/internal/insights/timeseries/frames_test.go
enterprise/internal/insights/types/types.go

@leonore leonore left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just had a quick first pass through, is this still a wip or is it ready?

Comment thread CHANGELOG.md Outdated
@@ -7,36 +7,27 @@ import (
"github.com/sourcegraph/sourcegraph/enterprise/internal/insights/types"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opened https://github.com/sourcegraph/sourcegraph/issues/45656 as a grab issue to get rid of the "frame" legacy wording not to pollute this PR too much

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ty!

OldestIndexedAt *time.Time
}

const getCommitsInRangeStr = `

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add an issue to remember to remove this table in the future

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coury-clark

Copy link
Copy Markdown
Contributor Author

@leonore this should be ready!

func (g *gitserverFilter) Filter(ctx context.Context, sampleTimes []time.Time, name api.RepoName) BackfillPlan {
var nodes []QueryExecution
getCommit := func(to time.Time) (*gitdomain.Commit, bool, error) {
commits, err := g.commitFetcher.RecentCommits(ctx, name, to)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this be most felt on large monorepos? My understanding was that git log always starts from the current point, so it would need to traverse though the history 12 times. We could probably get this down to a single call. I see there is a LogReverseEach. It doesn't take options so it might not have everything we'd need but in theory do a single call and in the call back mark the points where searches need to happen.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a good callout. Unfortunately I don't have any data one way or the other, so let me poke at this for a few hours and see what I can find.

@coury-clark coury-clark Dec 14, 2022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the big question for me was this: is it worse to send all of the bytes for every commit through stdout and read them to do this in one pass, or let git iterate through the commit graph on it's own, but potentially have multiple calls?

I experimented on gigarepo (25GB) and here is what I found:

Running log over all commits and counting them takes about 40 seconds

bash-5.1# time git --no-pager log --pretty="%h %cd" --no-abbrev | wc -l
2174609

real    0m39.306s
user    0m31.750s
sys     0m21.508s

Grabbing arbitrary commits at a time is fast, but it turns out the commits are not in order because of the way the gigarepo is constructed, so this doesn't actually tell the entire story.

bash-5.1# time git --no-pager log --pretty="%h %cd" --before="2011-01-01 00:00:00" --no-abbrev --date-order -n1
3e1a8ada0e6ae6b61724c83313aeeba487a360e3 Fri Dec 31 20:16:11 2010 +0000

real    0m0.700s
user    0m0.362s
sys     0m0.336s

Next I figured if I put the date old enough git will have to iterate all of the commits internally anyway

bash-5.1# time git --no-pager log --pretty="%h %cd" --before="2000-01-01 00:00:00" --no-abbrev --date-order -n50
cc97d63d1c33b94883f859addd407afc32dbe9ac Fri Apr 1 02:03:04 1988 -0500
435c3c1ff9e422159950e8efcaedfb6bc8d18cba Fri Apr 1 02:02:04 1988 -0500
fa26916c3329c8d4e8d49cde59e5996affa8d7c4 Sun Jan 20 01:02:03 1974 -0400
7629aa8aed33b793b6e365ee09cea16e61e476d7 Tue Jul 18 19:05:45 1972 -0500
4f76c6804600e9aff8ebb9c90ca53b4db76079d3 Thu Jan 1 00:00:00 1970 +0000
4f161d5e893b5ad924c8873729f6ddf144d268d7 Thu Jan 1 00:00:00 1970 +0000

real    0m3.126s
user    0m2.557s
sys     0m0.567s

With this result, it would be probably worse to query 12 times than walk the input with this command. However, I remembered that commits store a reference to their parent, so it should be possible to walk old commits very fast if we have a reference commit:

bash-5.1# time git --no-pager log --pretty="%h %cd" --before="2000-01-01 00:00:00" --no-abbrev --date-order -n50 cc97d63d1c33b94883f859addd407afc32dbe9ac
cc97d63d1c33b94883f859addd407afc32dbe9ac Fri Apr 1 02:03:04 1988 -0500
435c3c1ff9e422159950e8efcaedfb6bc8d18cba Fri Apr 1 02:02:04 1988 -0500
fa26916c3329c8d4e8d49cde59e5996affa8d7c4 Sun Jan 20 01:02:03 1974 -0400
7629aa8aed33b793b6e365ee09cea16e61e476d7 Tue Jul 18 19:05:45 1972 -0500
4f76c6804600e9aff8ebb9c90ca53b4db76079d3 Thu Jan 1 00:00:00 1970 +0000
4f161d5e893b5ad924c8873729f6ddf144d268d7 Thu Jan 1 00:00:00 1970 +0000

real    0m0.010s
user    0m0.003s
sys     0m0.007s

This is a very compelling result, and makes sense. We still walk the log only once as far back as we need, but we iteratively pass in previous commits as the starting point. In other words, walking backwards from the most recent time to the oldest, without any duplicate iterations.

Just to be sure my intuition was correct, I modified the code to do both and ran it locally:

[
  {
    "headDuration": "19.614ms",
    "fromPrevDuration": "18.460916ms",
    "headRev": "46603095267dd73539b5e43ca6a01e77b7a268d2",
    "fromPrevRev": "46603095267dd73539b5e43ca6a01e77b7a268d2",
    "sampleTime": "2022-12-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "20.504625ms",
    "fromPrevDuration": "26.970166ms",
    "headRev": "2fe76e22736bfc4ea8e511163e195fe3bb476bf4",
    "fromPrevRev": "2fe76e22736bfc4ea8e511163e195fe3bb476bf4",
    "sampleTime": "2022-10-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "19.896083ms",
    "fromPrevDuration": "35.768ms",
    "headRev": "c8a60301e18c5c233e8480c9cc8856ec0a6febd2",
    "fromPrevRev": "c8a60301e18c5c233e8480c9cc8856ec0a6febd2",
    "sampleTime": "2022-08-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "21.6195ms",
    "fromPrevDuration": "37.160041ms",
    "headRev": "0c7922306cedfeefe24d1988474f11e594f1629b",
    "fromPrevRev": "0c7922306cedfeefe24d1988474f11e594f1629b",
    "sampleTime": "2022-06-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "22.621083ms",
    "fromPrevDuration": "39.965542ms",
    "headRev": "19c8381086d168bf47397deeca24c05a30688906",
    "fromPrevRev": "19c8381086d168bf47397deeca24c05a30688906",
    "sampleTime": "2022-04-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "22.966875ms",
    "fromPrevDuration": "40.502666ms",
    "headRev": "039fafff4ac078b83450b2409a34463fbe88a194",
    "fromPrevRev": "039fafff4ac078b83450b2409a34463fbe88a194",
    "sampleTime": "2022-02-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "22.924667ms",
    "fromPrevDuration": "37.380041ms",
    "headRev": "eef10580008e0cb5f8fae538d13069ae8986a1c2",
    "fromPrevRev": "eef10580008e0cb5f8fae538d13069ae8986a1c2",
    "sampleTime": "2021-12-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "23.187917ms",
    "fromPrevDuration": "35.018292ms",
    "headRev": "0cf298d3a6953cc0f56b19b0e7d0ea0eae70af5e",
    "fromPrevRev": "0cf298d3a6953cc0f56b19b0e7d0ea0eae70af5e",
    "sampleTime": "2021-10-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "27.402417ms",
    "fromPrevDuration": "38.130417ms",
    "headRev": "ebd6b4836245019a7ba5e4c06604bc5aa2b86e28",
    "fromPrevRev": "ebd6b4836245019a7ba5e4c06604bc5aa2b86e28",
    "sampleTime": "2021-08-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "24.457458ms",
    "fromPrevDuration": "36.767334ms",
    "headRev": "e999549a6dea0330c8ca7ca3e48e2cbd18e96273",
    "fromPrevRev": "e999549a6dea0330c8ca7ca3e48e2cbd18e96273",
    "sampleTime": "2021-06-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "24.706916ms",
    "fromPrevDuration": "41.805583ms",
    "headRev": "d527260f3e873446c4e9dfac14409c7136344041",
    "fromPrevRev": "d527260f3e873446c4e9dfac14409c7136344041",
    "sampleTime": "2021-04-14 22:21:00 +0000 UTC"
  },
  {
    "headDuration": "28.353334ms",
    "fromPrevDuration": "38.7715ms",
    "headRev": "7a2878701dbc1db1e1f0fd4e1975b7bfb9c60812",
    "fromPrevRev": "7a2878701dbc1db1e1f0fd4e1975b7bfb9c60812",
    "sampleTime": "2021-02-14 22:21:00 +0000 UTC"
  }
]

Running against src/src locally it takes longer when passing in the revision, but I suspect there is some overhead in fetching the older commit object when there are few commits (26.6k on src/src).

I think this approach should solve this problem, and I'll add a timer to see exactly how long its taking.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leonore leonore left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense but let's see how this fares in practice too

Comment thread enterprise/internal/insights/compression/compression.go
Comment thread enterprise/internal/insights/compression/compression.go
Comment thread enterprise/internal/insights/compression/compression.go
coury-clark and others added 3 commits December 14, 2022 17:05
Co-authored-by: leo <leo.p@sourcegraph.com>
#45683)

* pass in previous revision when fetching recent commits to reduce impact for large repos

* fix test

* lint

* Update enterprise/internal/insights/compression/compression.go

Co-authored-by: leo <leo.p@sourcegraph.com>

Co-authored-by: leo <leo.p@sourcegraph.com>
@coury-clark coury-clark merged commit 7588981 into main Dec 15, 2022
@coury-clark coury-clark deleted the cclark/gitserver-compression branch December 15, 2022 15:54
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

insights: missing commits indexed running locally

4 participants