Modify gitserverClient.ListFiles to speed up embeddings#50410
Conversation
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff 01f87ca...76aed5d.
|
mrnugget
left a comment
There was a problem hiding this comment.
Leaving it up to @keegancsmith, @eseliger and @sourcegraph/repo-management to do final review
| // ListFiles returns a list of root-relative file paths matching the given | ||
| // pattern in a particular commit of a repository. | ||
| func (c *clientImplementor) ListFiles(ctx context.Context, checker authz.SubRepoPermissionChecker, repo api.RepoName, commit api.CommitID, pattern *regexp.Regexp) (_ []string, err error) { | ||
| cmd := c.gitCommand(repo, "ls-tree", "-z", "--name-only", "-r", string(commit), "--") |
There was a problem hiding this comment.
You're now using c.ReadDir which ends up using lsTreeUncached, so this gets called:
You're sure that the slight change in options doesn't make a difference?
There was a problem hiding this comment.
Yeah the --full-name always prints out the name as a path relative to the root of the git directory regardless of where you run it. --name-only makes it so the git ls-tree prints just the path (and nothing else), and uses the default behavior that prints the path relative to wherever the git ls-tree was run. Both of the iterations of these were run in the root directory of the repo so they print out the same result, but --full-name is actually probably the better option.
| for _, file := range files { | ||
| if file.IsDir() && !opts.IncludeDirs { | ||
| continue | ||
| } else if opts.MaxFileSizeBytes != nil && file.Size() > *opts.MaxFileSizeBytes { | ||
| continue | ||
| } else if opts.Pattern != nil && !opts.Pattern.MatchString(file.Name()) { | ||
| continue | ||
| } | ||
|
|
||
| filteredFiles = append(filteredFiles, file.Name()) | ||
| } | ||
|
|
||
| return filterPaths(ctx, checker, repo, matching) | ||
| return filterPaths(ctx, checker, repo, filteredFiles) |
There was a problem hiding this comment.
Since we're throwing away files that are bigger than size X and directories: would it be faster if we push that down even further to lsTreeUncached?
There was a problem hiding this comment.
I'm not sure if the speed would be much faster because there's no API call between ListFiles and lsTreeUncached, so you might save on a bit of memory by filtering it out a bit in lsTreeUncached. In which case I thought there wasn't a need to risk modifying lsTreeUncached which gets used more as opposed to just modifying ListFiles which is used only by the embeddings worker handler. But I don't feel super strongly about it, what do you think?
|
@varsanojidan still need a review on this? I know @eseliger had a related PR. Just wondering what the status is :) |
|
@keegancsmith I was chatting with @eseliger and he mentioned that his approach still needed a disk cache before it can be merged, but I'm not too sure what the status of that is. If its almost done then I can go ahead and close this, but if it might be a while, this could be a great speed boost currently. |
indradhanush
left a comment
There was a problem hiding this comment.
Let's get that boost in the meantime!
eseliger
left a comment
There was a problem hiding this comment.
LGTM, this has less drama around deployment than my PR so let's proceed with this one until we know if we can use the other one eventually :)
|
If we want this in the next patch release ( |
|
@novoselrok im happy to backport it, I just wasn't sure if this fit the criteria for something we would ship in a patch. |
I saw that the embeddings service would be calling gitserver a lot for file stat info, once for each file in a repo that's being embedded, felt like this could be sped up with one gitserver call per repo by modifying the ListFiles function. ## Test plan Unit tests pass.  ### Performance: I wrapped some timestamps around the code that fetches all the valid files with and without this change, tested on `sourcegraph/sourcegraph`. **This change specifically times how long it takes to get the list of valid files from gitserver, and NOT how long the entire embedding takes.** #### Before change ``` [ worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:17:12.931563 +0000 UTC m=+197.190809292 [ worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:19:38.835779 +0000 UTC m=+343.094161167 ``` Total: 2min 26s #### After change ``` [ worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.656261 +0000 UTC m=+73.195381626 [ worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.918023 +0000 UTC m=+73.457140793 ``` Total: 0.3s After the change it is **486.67 times** faster in fetching all of the valid files
|
@varsanojidan I'd say the ~500x speed up fits into the patch criteria 😄 I checked with the team here: https://sourcegraph.slack.com/archives/C052G9Y5Y8H/p1681408718158119 |
I saw that the embeddings service would be calling gitserver a lot for file stat info, once for each file in a repo that's being embedded, felt like this could be sped up with one gitserver call per repo by modifying the ListFiles function. ## Test plan Unit tests pass.  ### Performance: I wrapped some timestamps around the code that fetches all the valid files with and without this change, tested on `sourcegraph/sourcegraph`. **This change specifically times how long it takes to get the list of valid files from gitserver, and NOT how long the entire embedding takes.** #### Before change ``` [ worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:17:12.931563 +0000 UTC m=+197.190809292 [ worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:19:38.835779 +0000 UTC m=+343.094161167 ``` Total: 2min 26s #### After change ``` [ worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.656261 +0000 UTC m=+73.195381626 [ worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.918023 +0000 UTC m=+73.457140793 ``` Total: 0.3s After the change it is **486.67 times** faster in fetching all of the valid files (cherry picked from commit a3c6571)
I saw that the embeddings service would be calling gitserver a lot for file stat info, once for each file in a repo that's being embedded, felt like this could be sped up with one gitserver call per repo by modifying the ListFiles function. ## Test plan Unit tests pass.  ### Performance: I wrapped some timestamps around the code that fetches all the valid files with and without this change, tested on `sourcegraph/sourcegraph`. **This change specifically times how long it takes to get the list of valid files from gitserver, and NOT how long the entire embedding takes.** #### Before change ``` [ worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:17:12.931563 +0000 UTC m=+197.190809292 [ worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:19:38.835779 +0000 UTC m=+343.094161167 ``` Total: 2min 26s #### After change ``` [ worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.656261 +0000 UTC m=+73.195381626 [ worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.918023 +0000 UTC m=+73.457140793 ``` Total: 0.3s After the change it is **486.67 times** faster in fetching all of the valid files


I saw that the embeddings service would be calling gitserver a lot for file stat info, once for each file in a repo that's being embedded, felt like this could be sped up with one gitserver call per repo by modifying the ListFiles function.
Test plan
Unit tests pass.

Performance:
I wrapped some timestamps around the code that fetches all the valid files with and without this change, tested on
sourcegraph/sourcegraph.This change specifically times how long it takes to get the list of valid files from gitserver, and NOT how long the entire embedding takes.
Before change
Total: 2min 26s
After change
Total: 0.3s
After the change it is 486.67 times faster in fetching all of the valid files