Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

RFC: point-in-time searching#61513

Merged
camdencheek merged 30 commits into
mainfrom
cc/rev-at-time
Apr 12, 2024
Merged

RFC: point-in-time searching#61513
camdencheek merged 30 commits into
mainfrom
cc/rev-at-time

Conversation

@camdencheek

@camdencheek camdencheek commented Apr 1, 2024

Copy link
Copy Markdown
Member

Summary

This is a proposal to add the ability to do point-in-time searches. I'd like to add the ability to search a repo's state at a timestamp rather than at a specific commit, branch, or tag. This PR adds a rev:at.time(5 days ago, my-branch) predicate that allows a user to do just that.

Related discussion in Slack

Motivation

  • For Code Insights, this is useful for chart drilldown so we can generate searches that more closely represent the results that produced that data point (currently we use a diff search, which is not a very good proxy for a code insights data point).
  • For general code searching, it's often useful for exploratory searching to look at the state of a repo (or repos) at a given point in time. I often find myself doing something like HEAD~1000 hoping to find a commit that's approximately in the time period I'm interested in. On the command line, I can use git log --before, but that's not available outside of commit/diff search.
  • Customers often use search contexts to collect a set of tags together into "release" contexts, but that requires building tooling that generates these search contexts, and not all repos will necessarily have the same tagging conventions. This provides a zero-config way to get less precise, but functionally similar behavior.

Discussion

Syntax

None of the syntax is set in stone. Would really love feedback on this part. To stop myself from bikeshedding on the exact syntax, this PR is implemented with the simplest thing I could come up with: rev:ancestor.at(branch, 5 days ago), or rev:ancestor.at(5 days ago), which is equivalent to rev:ancestor.at(HEAD, 5 days ago). This PR implements the syntax as rev:at.time(5 days ago). This defaults to targeting HEAD, but can take an optional second argument to choose a different revision as the tip like rev:at.time(2021-01-01, cc/my-branch).

Currently, there is no user-facing way to add the revision to the repo filter because I couldn't find any syntax I liked for that. However, due to a quirk in how our query language is parsed, all rev: values need to be representable in the repo:reponame@rev position, so this super hackily json+base64 encodes the predicate struct. Note that a user will never actually see that, it's just an internal detail. I think this is acceptable for now, and if we decide on a good syntax it will be a non-breaking change.

Behavior

Because of force pushes, merge commits, and overriding commit dates, we cannot guarantee to yield the state of a branch at a given date. However, we can make a feature that 1) almost always matches user intent, and 2) is well-defined.

So, to define the behavior: RevAtTime returns the nearest (by generation number) ancestor commit commit in the linear history defined by --first-parent that has a committer timestamp (not author timestamp) before the requested timestamp. In the case of a tie in generation number (which can happen with non-linear commit histories introduced by merge commits), we return the commit with the greater timestamp.

Merges also make this a little complicated, so we might want to linearize the commit history with --first-parent.

Consider the following commit graph where e is a merge commit:

  a
 / \
b   |
|   |
c   d
 \ /
  e

If e is after the target date, but c and d are before, which commit should we yield as the target? As implemented, we’ll always select the newest one, but if the newest one is d, then we’ll skip b because it’s not an ancestor of d (even though it is an ancestor of another commit that matches the before filter).

Git commiter dates are not guaranteed to be monotonically increasing (they can be overridden with GIT_COMMIT_DATE), but in practice they usually are. Author dates are significantly more variable because commits are often authored before they are merged (think cherry picks). Since the goal of this feature is to find the state of a branch at a given timestamp, I think committer date is more appropriate than author date and this should not be configurable.

Performance

As implemented, this requires a gitserver query during the repo resolution step to resolve a date= rev into a commit hash. The git log command it runs is generally fast (<50ms for sourcegraph/sourcegraph), but its cost scales with the depth of the commit graph and the distance it has to traverse to find the first matching commit. I expect the cost of these queries will be dominated by the unindexed searches they generate (non-HEAD searches will generally be unindexed).

After some perf testing, to limit the cost of repeated queries of this type, I've added a small in-memory cache.

Test Plan

  • Added predicate parsing tests
  • Added git CLI tests
  • Added API integration tests
  • Manual testing of completions and hover docs

@cla-bot cla-bot Bot added the cla-signed label Apr 1, 2024
@camdencheek camdencheek force-pushed the cc/rev-at-time branch 4 times, most recently from 997aecb to 79cd0a0 Compare April 2, 2024 22:04

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx for implementing this as a proper gRPC call aligned with our current work to migrate all the things :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm as excited to get rid of Exec as you are 😄

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:dab:

Comment thread internal/gitserver/v1/gitserver.proto Outdated
Comment on lines 146 to 144

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for all these. buf format seems to have been out of date

@sqs

sqs commented Apr 2, 2024

Copy link
Copy Markdown
Member

This is cool! Should we support git rev-parse syntax like git show 'HEAD@{3 months ago}' here? rev:"{3 months ago}"? It looks weird but maybe that could be more standard for people familiar with that syntax.

@eseliger

eseliger commented Apr 2, 2024

Copy link
Copy Markdown
Member

One thing that is weird about the above ref syntax is that git uses reflog for determining these dates, not the commit/author date, which camden is using in this PR (and intuitively makes more sense IMO).

So the above syntax that you provided means "HEAD as it was on Sourcegraph 3 months ago".
Here's that for a freshly cloned repo:

git rev-parse 'HEAD@{3 days ago}'
warning: log for 'HEAD' only goes back to Wed, 3 Apr 2024 00:16:15 +0200
9761f2597af73fa2db530f0afc3c7258538c9dd2

@camdencheek

Copy link
Copy Markdown
Member Author

Yep what Erik said. I agree the syntax is more intuitive, but rev@{3 days ago} has meaning to git that is different than what we're doing here (and doesn't actually make much sense in the context of a headless repo), so I don't know that we want to reuse the syntax exactly. Maybe we should use that syntax as inspiration though and pick something closer to that, e.g. repo:myrepo@HEAD%{5 days} or something. Will noodle on that.

@keegancsmith keegancsmith requested a review from a team April 3, 2024 08:03

@keegancsmith keegancsmith left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I like this a lot and would use it fairly often.

    • I often want to start a search from like 2 years ago to check a hypothesis.
  • I like how it is introduced into our syntax.

  • how do we make it clear which timestamp we search?

    • I guess we only allow and show commit timestamp?
    • for repos that we convert to git (eg perforce) is the commit timestamp useful or is it actually the author timestamp?
  • timezones for non-relative time queries

    • relative is fine because 5 days ago is relative to time.Now.
    • but non-relative (absolute?) without TZ in it is likely parsed as UTC by the server (or your macbook's timezone in your devenv)
    • I suspect a user would want it to be the browser TZ.
    • This also then brings up questions about sharing links with those sort of timestamps in them.
  • how do we handle non-linear history?

    • I like the idea you propose of –first-parent. That feels like the more useful value for user intent.
    • I'm unsure about --date-order? It feels like we should stick to a cut vertex in the commit graph. I suppose there are people who really do create funky graphs in how they do dev, and in that case my intuition of cut vertex is wrong.
  • how do we handle non-monotonic commit or author times? EDIT: you address this and I forgot to update my notes about it.

  • rev:ancestor.at(my-branch, 5 days ago) how could this conflict with actual rev names?

  • concern around performance implications if people start using it in search contexts.

@bahrmichael

Copy link
Copy Markdown
Contributor

This looks really cool! I would probably abuse it for bisect searches (unless there's something better for this).

My only concern is around performance. You mentioned a fast 50ms query. Do you know if there's an upper bound to how long this could take, or could it become increasingly slower as repos get larger and commit histories longer? Mainly concerned about noisy neighbor problems, esp. related to automated queries (e.g. from code insights).

@camdencheek

Copy link
Copy Markdown
Member Author

timezones for non-relative time queries

Yeah, that's tricky. We run into this with before and after too. The server uses UTC, but users expect relative days to mean local time. There's some more discussion here, along with an idea to provide a hover with the resolved date (which I think is a good idea). I personally don't think we should interpret queries differently for different users just because of the sharability aspect.

It feels like we should stick to a cut vertex in the commit graph.

I would generally agree, but there's not really an easy way to do this without abandoning the git CLI entirely AFAIK. The exception to that being if we use --first-parent, the graph is linear, so every vertex is a cut vertex.

The downside of --first-parent is it skips the history of all merged branches. So for places that heavily use merge commits, the granularity of the search will be at the "PR" level rather than at the commit level, which can be quite meaningful for long-lived branches. I'm starting to be convinced that this is a worthwhile tradeoff though, and that the simplicity of --first-parent outweights the loss of granularity.

how could this conflict with actual rev names?

The predicate parser only looks for things of the shape rev:ancestor.at([^)]*). I expect it to be vanishingly unlikely that it would conflict, and in the case that it does, we support quoting values like rev:"ancestor.at(abc)" (at least on paper...I haven't actually tested this in a while).

concern around performance

My thought here is that perf cost will be dominated by actually doing the unindexed searches. That's part is no different than a search context that contains a bunch of unindexed revs. Comparatively, the only added cost is the commit resolution, which is non-zero, but not massive either.

As a data point, the chromium repo (largest I have locally with 1.2M commits) takes ~12 seconds to traverse its entire history. It takes 100ms with --before="1 year ago". That seems like an acceptable perf tradeoff to me for a somewhat niche feature.

And in the case perf does become a significant concern, the work the Source team has done to virtualize git calls means we could pretty easily add a cache or index to this specific lookup.

Comment thread cmd/gitserver/internal/git/gitcli/revattime.go Outdated
@jtibshirani

Copy link
Copy Markdown
Contributor

The proposed behavior makes sense to me. +1 to using --first-parent, this feels easiest to understand for users.

My thought here is that perf cost will be dominated by actually doing the unindexed searches.

Now that we use hybrid search by default (yay!) many unindexed searches are very fast. But as you say, there's a lot we can do here -- for example a simple cache of time expression -> hash could really help. Also I assume the command scales logarithmically with history length (not linear) so there's not a huge concern with large histories?

Personally I find the naming a bit surprising because I don't think about "git ancestors" in my day to day dev work :) Throwing out some other ideas:

  • rev:commit.at(...)
  • rev:at.time(...) -> this feels consistent with the format we use for others like repo:has.topic

@camdencheek

Copy link
Copy Markdown
Member Author

Also I assume the command scales logarithmically with history length (not linear) so there's not a huge concern with large histories?

It does not. Git cannot assume a binary search is safe because commit dates are not guaranteed to be monotonic. So it has to iterate commit-by-commit. (I'm also not even sure Git has the indexes that would be needed to do a binary search, but that's another matter entirely)

Personally I find the naming a bit surprising

Thank you! I also dislike the naming, so I really appreciate the suggestions.

For the examples you gave, do they still make sense to you for non-HEAD commits? I read rev:at.time(yesterday) to mean "HEAD at yesterday", which I think is a good default. But what about if I wanted to search the history of other branches? Does rev:at.time(my-branch, yesterday) make sense? Or maybe rev:at.time(yesterday, my-branch) to put the optional param last? Obviously, we can document this all, but an intuitive syntax is preferable 🙂

@camdencheek

camdencheek commented Apr 3, 2024

Copy link
Copy Markdown
Member Author

Or maybe we could even get funky and allow multiple predicates like rev:branch(my-branch):at.time(yesterday) or just rev:my-branch:at.time(yesterday)

Comment thread cmd/gitserver/internal/git/observability.go Outdated
@camdencheek

camdencheek commented Apr 3, 2024

Copy link
Copy Markdown
Member Author

the command scales logarithmically

To back up my claims, I measured a few data points on chromium/chromium locally. Looks pretty linear to me. Looks like we can go through ~1M commits/second

@mmanela

mmanela commented Apr 3, 2024

Copy link
Copy Markdown
Contributor

@camdencheek I like rev:at.time(yesterday, my-branch) with branch being optional.

@jtibshirani

Copy link
Copy Markdown
Contributor

It does not. Git cannot assume a binary search is safe because commit dates are not guaranteed to be monotonic.

Oof, I see! Then a cache feels really valuable, even in the initial implementation, as a way of reducing load and making sure follow-up queries are fast.

@camdencheek

Copy link
Copy Markdown
Member Author

Updates from feedback:

  • I added --first-parent to linearize the history and make it easier to explain
  • I updated the syntax to rev:at.time(yesterday) or rev:at.time(yesterday, my-branch). Along with that, I've renamed all the backend pieces to match the "rev at time" naming.
  • I added a simple LRU cache to the commit resolution to mitigate the cost of repeated, expensive git commands

@camdencheek

Copy link
Copy Markdown
Member Author

Since there seems to be general agreement on direction, I'll work towards productionizing this PR (add tests mostly). Feel free to add more feedback in the mean time, but I'll explicitly request a review once it's ready for full code review

Comment thread cmd/gitserver/internal/git/gitcli/revattime.go Outdated
@camdencheek camdencheek changed the base branch from main to cc/generate April 3, 2024 20:01
@camdencheek camdencheek requested a review from a team April 9, 2024 20:59
@camdencheek

Copy link
Copy Markdown
Member Author

Okay! I think this is ready for a full code review now. I've added tests and query completions/hovers since the last update.

@sourcegraph-bot

sourcegraph-bot commented Apr 9, 2024

Copy link
Copy Markdown
Contributor

📖 Storybook live preview

@bahrmichael bahrmichael left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blockers :)

Comment thread cmd/gitserver/internal/server_grpc.go Outdated
}

gs.svc.LogIfCorrupt(ctx, repoName, err)
// TODO: Better error checking.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you'll address this before merging.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this was actually just copied from another chunk of code that I did not write. Given I don't actually understand the intent behind the TODO, I'm gonna let the Source team own fixing that one 🙂

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha I should give up here at this point 😆 feel free to leave it in, I'll do a pass of this file once we're done with migration.

return RevisionSpecifier{RefGlob: spec[1:]}
return RevisionSpecifier{RefGlob: spec[1:]}, nil
} else if strings.HasPrefix(spec, revAtTimePrefix) {
aat, err := ParseRevAtTime(spec[4:])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does aat sand for?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops. That was a leftover from when this was called "AncestorAtTime". Will rename it to a more sane var name

@jtibshirani jtibshirani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! Just left some questions, and I'm going to stare at the git command another time before approving :)

Comment thread client/shared/src/search/query/filters.ts
// Human date
n := now()
if t, err := naturaldate.Parse(s, n); err == nil && t != n {
if t, err := naturaldate.Parse(s, n, naturaldate.WithDirection(naturaldate.Past)); err == nil && t != n {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain this change?

@camdencheek camdencheek Apr 11, 2024

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other place we use relative dates is before: and after:. The naturaldate library also supports relative dates in the future, but for the purpose of before, after, and rev:at.time(), a date in the future is very unlikely to be what the user actually intends. So it tells the naturaldate parser that the thing it's parsing should be something in the past. e.g. 5 days will be interpreted as "5 days ago" rather than "5 days from now". Since this function is specifically for git dates, it seems unlikely to me we'll ever want to target something in the future.

Not strictly related to this PR, so thanks for calling it out 🙂

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. So this will immediately change the behavior for before and after filters like "5 days"? That seems okay to me, but it's good to be aware of this.

if errors.Is(err, context.DeadlineExceeded) || errors.HasType(err, &gitdomain.BadCommitError{}) {
return nil, err
}
reportMissing(RepoRevSpecs{Repo: repo, Revs: []query.RevisionSpecifier{rev}})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking I understand -- if a time is too far in the past for some repo, we'll expose this as "missing" like we do for revisions that don't exist.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. An error will only be returned if the starting revision can't be found. e.g. if you do rev:at.time(yesterday, my-branch), and my-branch doesn't exist for a repo. Or the repo itself doesn't exist (which should be rare since we just listed it from the database).

In the case that there is no commit that exists at the requested time, that is a "no result" scenario and there is no error or "missing" notification.

So:

  • repo:sourcegraph/sourcegraph rev:at.time(yesterday, my-branch-that-does-not-exist) will yield an alert that my-branch-does-not-exist does not exist
  • repo:sourcegraph/sourcegraph rev:at.time(20 years ago) will just have no results.

I think this is reasonable behavior, but I'm certainly willing to be persuaded otherwise.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now I see the commit ID is empty in this case, and gitserver returns "ok" only if res.GetCommitSha() != "". It'd be nice to add documentation to the client interface about what "ok" means exactly.

I don't have a strong preference, but it feels cleaner to treat this as an error everywhere so we don't have the "no error, but not ok" state, which I always find a bit confusing.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

treat this as an error

Is this "treat it as an error" from a user perspective or a Go perspective? I can certainly modify the signature to return a special error type, but want to make sure I'm understanding correctly

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mainly thinking about the gitserver client, which has the signature

RevAtTime(ctx context.Context, repo api.RepoName, spec string, t time.Time) (api.CommitID, bool, error)

I don't have a strong feeling about the client behavior, and could see an argument either way (just returning no results vs. matching what we do for missing rev specs).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"no error, but not ok" state

Real optionals in Go sure would be nice 😄

@camdencheek camdencheek Apr 12, 2024

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed the return value to "found" for clarity (instead of "ok" which, though idiomatic, implies a "non-ok" state, which is weird when there's also a possible error).

I tried two other things:

  • Return a sentinel error value. I didn't love this because the nonexistence of a commit before the target time does not feel like an error to me. That's an expected outcome for some searches. Also, we'd need to assert on the error type, which causes us to lose some type safety.
  • Return an empty string as a sentinel that no such commit exists. I also didn't like this because it's easy to miss that the return value can be empty, and that should be handled specially. The nice thing about the third return value is that it forces the caller to do something with it.

RE: client behavior, I'm also not tied to the current behavior but going to go ahead and merge as-is, but certainly won't be offended if someone follows up changing it.

Comment thread cmd/gitserver/internal/git/gitcli/revattime_test.go

@jtibshirani jtibshirani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now pondered relative dates and --first-parent and this still looks good to me :)

@camdencheek camdencheek enabled auto-merge (squash) April 12, 2024 00:19

@eseliger eseliger left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for treading through the new gitserver API stuff and adding it in the new format!

I left a bunch of comments on the gitserver stuff, feel free to tell me to shut up and just merge it, just some suggestions on making sure we maintain a high quality API surface :)

Comment thread internal/gitserver/v1/gitserver.proto
Comment thread internal/gitserver/v1/gitserver.proto
Comment thread internal/gitserver/v1/gitserver.proto
Comment thread internal/gitserver/commands.go Outdated
Comment thread internal/gitserver/commands.go
Comment thread internal/gitserver/client.go
Comment thread cmd/gitserver/internal/git/gitcli/clibackend.go
Comment thread cmd/gitserver/internal/git/gitcli/revattime.go Outdated
Comment thread cmd/gitserver/internal/git/gitcli/revattime.go
Comment thread cmd/gitserver/internal/git/gitcli/revattime_test.go
@camdencheek camdencheek disabled auto-merge April 12, 2024 01:14
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants