RFC: point-in-time searching#61513
Conversation
997aecb to
79cd0a0
Compare
There was a problem hiding this comment.
thx for implementing this as a proper gRPC call aligned with our current work to migrate all the things :)
There was a problem hiding this comment.
I'm as excited to get rid of Exec as you are 😄
There was a problem hiding this comment.
Sorry for all these. buf format seems to have been out of date
|
This is cool! Should we support |
|
One thing that is weird about the above ref syntax is that git uses reflog for determining these dates, not the commit/author date, which camden is using in this PR (and intuitively makes more sense IMO). So the above syntax that you provided means "HEAD as it was on Sourcegraph 3 months ago". |
|
Yep what Erik said. I agree the syntax is more intuitive, but |
There was a problem hiding this comment.
-
I like this a lot and would use it fairly often.
- I often want to start a search from like 2 years ago to check a hypothesis.
-
I like how it is introduced into our syntax.
-
how do we make it clear which timestamp we search?
- I guess we only allow and show commit timestamp?
- for repos that we convert to git (eg perforce) is the commit timestamp useful or is it actually the author timestamp?
-
timezones for non-relative time queries
- relative is fine because 5 days ago is relative to time.Now.
- but non-relative (absolute?) without TZ in it is likely parsed as UTC by the server (or your macbook's timezone in your devenv)
- I suspect a user would want it to be the browser TZ.
- This also then brings up questions about sharing links with those sort of timestamps in them.
-
how do we handle non-linear history?
- I like the idea you propose of –first-parent. That feels like the more useful value for user intent.
- I'm unsure about
--date-order? It feels like we should stick to a cut vertex in the commit graph. I suppose there are people who really do create funky graphs in how they do dev, and in that case my intuition of cut vertex is wrong.
-
how do we handle non-monotonic commit or author times? EDIT: you address this and I forgot to update my notes about it.
-
rev:ancestor.at(my-branch, 5 days ago)how could this conflict with actual rev names? -
concern around performance implications if people start using it in search contexts.
|
This looks really cool! I would probably abuse it for bisect searches (unless there's something better for this). My only concern is around performance. You mentioned a fast 50ms query. Do you know if there's an upper bound to how long this could take, or could it become increasingly slower as repos get larger and commit histories longer? Mainly concerned about noisy neighbor problems, esp. related to automated queries (e.g. from code insights). |
Yeah, that's tricky. We run into this with
I would generally agree, but there's not really an easy way to do this without abandoning the git CLI entirely AFAIK. The exception to that being if we use The downside of
The predicate parser only looks for things of the shape
My thought here is that perf cost will be dominated by actually doing the unindexed searches. That's part is no different than a search context that contains a bunch of unindexed revs. Comparatively, the only added cost is the commit resolution, which is non-zero, but not massive either. As a data point, the And in the case perf does become a significant concern, the work the Source team has done to virtualize git calls means we could pretty easily add a cache or index to this specific lookup. |
|
The proposed behavior makes sense to me. +1 to using
Now that we use hybrid search by default (yay!) many unindexed searches are very fast. But as you say, there's a lot we can do here -- for example a simple cache of time expression -> hash could really help. Also I assume the command scales logarithmically with history length (not linear) so there's not a huge concern with large histories? Personally I find the naming a bit surprising because I don't think about "git ancestors" in my day to day dev work :) Throwing out some other ideas:
|
It does not. Git cannot assume a binary search is safe because commit dates are not guaranteed to be monotonic. So it has to iterate commit-by-commit. (I'm also not even sure Git has the indexes that would be needed to do a binary search, but that's another matter entirely)
Thank you! I also dislike the naming, so I really appreciate the suggestions. For the examples you gave, do they still make sense to you for non-HEAD commits? I read |
|
Or maybe we could even get funky and allow multiple predicates like |
To back up my claims, I measured a few data points on |
|
@camdencheek I like |
Oof, I see! Then a cache feels really valuable, even in the initial implementation, as a way of reducing load and making sure follow-up queries are fast. |
31f0ca3 to
e462803
Compare
|
Updates from feedback:
|
|
Since there seems to be general agreement on direction, I'll work towards productionizing this PR (add tests mostly). Feel free to add more feedback in the mean time, but I'll explicitly request a review once it's ready for full code review |
0138d87 to
b85377b
Compare
ca86fe6 to
407e978
Compare
7150994 to
e4b8e8a
Compare
|
Okay! I think this is ready for a full code review now. I've added tests and query completions/hovers since the last update. |
| } | ||
|
|
||
| gs.svc.LogIfCorrupt(ctx, repoName, err) | ||
| // TODO: Better error checking. |
There was a problem hiding this comment.
I assume you'll address this before merging.
There was a problem hiding this comment.
Oh, this was actually just copied from another chunk of code that I did not write. Given I don't actually understand the intent behind the TODO, I'm gonna let the Source team own fixing that one 🙂
There was a problem hiding this comment.
haha I should give up here at this point 😆 feel free to leave it in, I'll do a pass of this file once we're done with migration.
| return RevisionSpecifier{RefGlob: spec[1:]} | ||
| return RevisionSpecifier{RefGlob: spec[1:]}, nil | ||
| } else if strings.HasPrefix(spec, revAtTimePrefix) { | ||
| aat, err := ParseRevAtTime(spec[4:]) |
There was a problem hiding this comment.
What does aat sand for?
There was a problem hiding this comment.
Whoops. That was a leftover from when this was called "AncestorAtTime". Will rename it to a more sane var name
jtibshirani
left a comment
There was a problem hiding this comment.
This looks good to me! Just left some questions, and I'm going to stare at the git command another time before approving :)
| // Human date | ||
| n := now() | ||
| if t, err := naturaldate.Parse(s, n); err == nil && t != n { | ||
| if t, err := naturaldate.Parse(s, n, naturaldate.WithDirection(naturaldate.Past)); err == nil && t != n { |
There was a problem hiding this comment.
Could you explain this change?
There was a problem hiding this comment.
The other place we use relative dates is before: and after:. The naturaldate library also supports relative dates in the future, but for the purpose of before, after, and rev:at.time(), a date in the future is very unlikely to be what the user actually intends. So it tells the naturaldate parser that the thing it's parsing should be something in the past. e.g. 5 days will be interpreted as "5 days ago" rather than "5 days from now". Since this function is specifically for git dates, it seems unlikely to me we'll ever want to target something in the future.
Not strictly related to this PR, so thanks for calling it out 🙂
There was a problem hiding this comment.
Ah I see. So this will immediately change the behavior for before and after filters like "5 days"? That seems okay to me, but it's good to be aware of this.
| if errors.Is(err, context.DeadlineExceeded) || errors.HasType(err, &gitdomain.BadCommitError{}) { | ||
| return nil, err | ||
| } | ||
| reportMissing(RepoRevSpecs{Repo: repo, Revs: []query.RevisionSpecifier{rev}}) |
There was a problem hiding this comment.
Checking I understand -- if a time is too far in the past for some repo, we'll expose this as "missing" like we do for revisions that don't exist.
There was a problem hiding this comment.
Not quite. An error will only be returned if the starting revision can't be found. e.g. if you do rev:at.time(yesterday, my-branch), and my-branch doesn't exist for a repo. Or the repo itself doesn't exist (which should be rare since we just listed it from the database).
In the case that there is no commit that exists at the requested time, that is a "no result" scenario and there is no error or "missing" notification.
So:
repo:sourcegraph/sourcegraph rev:at.time(yesterday, my-branch-that-does-not-exist)will yield an alert thatmy-branch-does-not-existdoes not existrepo:sourcegraph/sourcegraph rev:at.time(20 years ago)will just have no results.
I think this is reasonable behavior, but I'm certainly willing to be persuaded otherwise.
There was a problem hiding this comment.
Ah, now I see the commit ID is empty in this case, and gitserver returns "ok" only if res.GetCommitSha() != "". It'd be nice to add documentation to the client interface about what "ok" means exactly.
I don't have a strong preference, but it feels cleaner to treat this as an error everywhere so we don't have the "no error, but not ok" state, which I always find a bit confusing.
There was a problem hiding this comment.
treat this as an error
Is this "treat it as an error" from a user perspective or a Go perspective? I can certainly modify the signature to return a special error type, but want to make sure I'm understanding correctly
There was a problem hiding this comment.
I was mainly thinking about the gitserver client, which has the signature
RevAtTime(ctx context.Context, repo api.RepoName, spec string, t time.Time) (api.CommitID, bool, error)
I don't have a strong feeling about the client behavior, and could see an argument either way (just returning no results vs. matching what we do for missing rev specs).
There was a problem hiding this comment.
"no error, but not ok" state
Real optionals in Go sure would be nice 😄
There was a problem hiding this comment.
I've renamed the return value to "found" for clarity (instead of "ok" which, though idiomatic, implies a "non-ok" state, which is weird when there's also a possible error).
I tried two other things:
- Return a sentinel error value. I didn't love this because the nonexistence of a commit before the target time does not feel like an error to me. That's an expected outcome for some searches. Also, we'd need to assert on the error type, which causes us to lose some type safety.
- Return an empty string as a sentinel that no such commit exists. I also didn't like this because it's easy to miss that the return value can be empty, and that should be handled specially. The nice thing about the third return value is that it forces the caller to do something with it.
RE: client behavior, I'm also not tied to the current behavior but going to go ahead and merge as-is, but certainly won't be offended if someone follows up changing it.
jtibshirani
left a comment
There was a problem hiding this comment.
I've now pondered relative dates and --first-parent and this still looks good to me :)
f439cda to
dfbfdf2
Compare
eseliger
left a comment
There was a problem hiding this comment.
Looks good, thanks for treading through the new gitserver API stuff and adding it in the new format!
I left a bunch of comments on the gitserver stuff, feel free to tell me to shut up and just merge it, just some suggestions on making sure we maintain a high quality API surface :)
Co-authored-by: Erik Seliger <erikseliger@me.com>
Co-authored-by: Erik Seliger <erikseliger@me.com>
move #61513 out of the 5.3.9104 release section
Summary
This is a proposal to add the ability to do point-in-time searches. I'd like to add the ability to search a repo's state at a timestamp rather than at a specific commit, branch, or tag. This PR adds a
rev:at.time(5 days ago, my-branch)predicate that allows a user to do just that.Related discussion in Slack
Motivation
HEAD~1000hoping to find a commit that's approximately in the time period I'm interested in. On the command line, I can usegit log --before, but that's not available outside of commit/diff search.Discussion
Syntax
None of the syntax is set in stone. Would really love feedback on this part.
To stop myself from bikeshedding on the exact syntax, this PR is implemented with the simplest thing I could come up with:This PR implements the syntax asrev:ancestor.at(branch, 5 days ago), orrev:ancestor.at(5 days ago), which is equivalent torev:ancestor.at(HEAD, 5 days ago).rev:at.time(5 days ago). This defaults to targeting HEAD, but can take an optional second argument to choose a different revision as the tip likerev:at.time(2021-01-01, cc/my-branch).Currently, there is no user-facing way to add the revision to the repo filter because I couldn't find any syntax I liked for that. However, due to a quirk in how our query language is parsed, all
rev:values need to be representable in therepo:reponame@revposition, so this super hackily json+base64 encodes the predicate struct. Note that a user will never actually see that, it's just an internal detail. I think this is acceptable for now, and if we decide on a good syntax it will be a non-breaking change.Behavior
Because of force pushes, merge commits, and overriding commit dates, we cannot guarantee to yield the state of a branch at a given date. However, we can make a feature that 1) almost always matches user intent, and 2) is well-defined.
So, to define the behavior:
RevAtTimereturns the nearest(by generation number) ancestor commitcommit in the linear history defined by--first-parentthat has a committer timestamp (not author timestamp) before the requested timestamp.In the case of a tie in generation number (which can happen with non-linear commit histories introduced by merge commits), we return the commit with the greater timestamp.Merges also make this a little complicated, so we might want to linearize the commit history with--first-parent.Consider the following commit graph where e is a merge commit:If e is after the target date, but c and d are before, which commit should we yield as the target? As implemented, we’ll always select the newest one, but if the newest one is d, then we’ll skip b because it’s not an ancestor of d (even though it is an ancestor of another commit that matches the before filter).Git commiter dates are not guaranteed to be monotonically increasing (they can be overridden with
GIT_COMMIT_DATE), but in practice they usually are. Author dates are significantly more variable because commits are often authored before they are merged (think cherry picks). Since the goal of this feature is to find the state of a branch at a given timestamp, I think committer date is more appropriate than author date and this should not be configurable.Performance
As implemented, this requires a gitserver query during the repo resolution step to resolve a
date=rev into a commit hash. Thegit logcommand it runs is generally fast (<50ms for sourcegraph/sourcegraph), but its cost scales with the depth of the commit graph and the distance it has to traverse to find the first matching commit. I expect the cost of these queries will be dominated by the unindexed searches they generate (non-HEAD searches will generally be unindexed).After some perf testing, to limit the cost of repeated queries of this type, I've added a small in-memory cache.
Test Plan