Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Modify gitserverClient.ListFiles to speed up embeddings#50410

Merged
varsanojidan merged 9 commits into
mainfrom
iv/add-gitserver-client-method-for-embeddings
Apr 13, 2023
Merged

Modify gitserverClient.ListFiles to speed up embeddings#50410
varsanojidan merged 9 commits into
mainfrom
iv/add-gitserver-client-method-for-embeddings

Conversation

@varsanojidan

@varsanojidan varsanojidan commented Apr 5, 2023

Copy link
Copy Markdown
Contributor

I saw that the embeddings service would be calling gitserver a lot for file stat info, once for each file in a repo that's being embedded, felt like this could be sped up with one gitserver call per repo by modifying the ListFiles function.

Test plan

Unit tests pass.
image

Performance:

I wrapped some timestamps around the code that fetches all the valid files with and without this change, tested on sourcegraph/sourcegraph.

This change specifically times how long it takes to get the list of valid files from gitserver, and NOT how long the entire embedding takes.

Before change

[         worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:17:12.931563 +0000 UTC m=+197.190809292
[         worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:19:38.835779 +0000 UTC m=+343.094161167

Total: 2min 26s

After change

[         worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.656261 +0000 UTC m=+73.195381626
[         worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.918023 +0000 UTC m=+73.457140793

Total: 0.3s

After the change it is 486.67 times faster in fetching all of the valid files

@cla-bot cla-bot Bot added the cla-signed label Apr 5, 2023
@sourcegraph-bot

sourcegraph-bot commented Apr 5, 2023

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff 01f87ca...76aed5d.

Notify File(s)
@efritz enterprise/cmd/worker/internal/embeddings/repo/BUILD.bazel
enterprise/cmd/worker/internal/embeddings/repo/handler.go
@indradhanush internal/gitserver/BUILD.bazel
internal/gitserver/client.go
internal/gitserver/commands.go
internal/gitserver/commands_test.go
internal/gitserver/mocks_temp.go
internal/gitserver/test_utils.go
@sashaostrikov internal/gitserver/BUILD.bazel
internal/gitserver/client.go
internal/gitserver/commands.go
internal/gitserver/commands_test.go
internal/gitserver/mocks_temp.go
internal/gitserver/test_utils.go

Comment thread internal/gitserver/commands.go
@varsanojidan varsanojidan changed the title Modify gitserverClient.ListFiles for more efficient embeddings Modify gitserverClient.ListFiles to speed up embeddings Apr 5, 2023
@varsanojidan varsanojidan requested review from a team and novoselrok April 6, 2023 04:26
@novoselrok novoselrok requested a review from keegancsmith April 6, 2023 04:58
@mrnugget mrnugget requested a review from eseliger April 6, 2023 07:59

@mrnugget mrnugget left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving it up to @keegancsmith, @eseliger and @sourcegraph/repo-management to do final review

// ListFiles returns a list of root-relative file paths matching the given
// pattern in a particular commit of a repository.
func (c *clientImplementor) ListFiles(ctx context.Context, checker authz.SubRepoPermissionChecker, repo api.RepoName, commit api.CommitID, pattern *regexp.Regexp) (_ []string, err error) {
cmd := c.gitCommand(repo, "ls-tree", "-z", "--name-only", "-r", string(commit), "--")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're now using c.ReadDir which ends up using lsTreeUncached, so this gets called:

https://github.com/sourcegraph/sourcegraph/blob/cb34cb3bc8431ea0059105f56976847aed96bdd7/internal/gitserver/commands.go#L511-L517

You're sure that the slight change in options doesn't make a difference?

@varsanojidan varsanojidan Apr 6, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the --full-name always prints out the name as a path relative to the root of the git directory regardless of where you run it. --name-only makes it so the git ls-tree prints just the path (and nothing else), and uses the default behavior that prints the path relative to wherever the git ls-tree was run. Both of the iterations of these were run in the root directory of the repo so they print out the same result, but --full-name is actually probably the better option.

image
image

Comment on lines +1030 to +1042
for _, file := range files {
if file.IsDir() && !opts.IncludeDirs {
continue
} else if opts.MaxFileSizeBytes != nil && file.Size() > *opts.MaxFileSizeBytes {
continue
} else if opts.Pattern != nil && !opts.Pattern.MatchString(file.Name()) {
continue
}

filteredFiles = append(filteredFiles, file.Name())
}

return filterPaths(ctx, checker, repo, matching)
return filterPaths(ctx, checker, repo, filteredFiles)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're throwing away files that are bigger than size X and directories: would it be faster if we push that down even further to lsTreeUncached?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the speed would be much faster because there's no API call between ListFiles and lsTreeUncached, so you might save on a bit of memory by filtering it out a bit in lsTreeUncached. In which case I thought there wasn't a need to risk modifying lsTreeUncached which gets used more as opposed to just modifying ListFiles which is used only by the embeddings worker handler. But I don't feel super strongly about it, what do you think?

Comment thread enterprise/cmd/worker/internal/embeddings/repo/handler.go Outdated
Comment thread internal/gitserver/commands.go
Comment thread internal/gitserver/commands_test.go
@keegancsmith

Copy link
Copy Markdown
Member

@varsanojidan still need a review on this? I know @eseliger had a related PR. Just wondering what the status is :)

@varsanojidan

Copy link
Copy Markdown
Contributor Author

@keegancsmith I was chatting with @eseliger and he mentioned that his approach still needed a disk cache before it can be merged, but I'm not too sure what the status of that is. If its almost done then I can go ahead and close this, but if it might be a while, this could be a great speed boost currently.

@indradhanush indradhanush left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get that boost in the meantime!

@eseliger eseliger left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, this has less drama around deployment than my PR so let's proceed with this one until we know if we can use the other one eventually :)

Comment thread internal/gitserver/protocol/gitserver.go Outdated
Comment thread internal/gitserver/protocol/gitserver.go Outdated
@varsanojidan varsanojidan merged commit a3c6571 into main Apr 13, 2023
@varsanojidan varsanojidan deleted the iv/add-gitserver-client-method-for-embeddings branch April 13, 2023 16:15
@novoselrok

Copy link
Copy Markdown
Contributor

If we want this in the next patch release (5.0.2) we have to backport this PR as well.

@varsanojidan

Copy link
Copy Markdown
Contributor Author

@novoselrok im happy to backport it, I just wasn't sure if this fit the criteria for something we would ship in a patch.

almeidapaulooliveira pushed a commit that referenced this pull request Apr 13, 2023
I saw that the embeddings service would be calling gitserver a lot for
file stat info, once for each file in a repo that's being embedded, felt
like this could be sped up with one gitserver call per repo by modifying
the ListFiles function.

## Test plan
Unit tests pass.

![image](https://user-images.githubusercontent.com/20795805/230265838-98bd0dd4-25ba-47a7-9147-2d002ff40f95.png)

### Performance:

I wrapped some timestamps around the code that fetches all the valid
files with and without this change, tested on `sourcegraph/sourcegraph`.

**This change specifically times how long it takes to get the list of
valid files from gitserver, and NOT how long the entire embedding
takes.**

#### Before change
```
[         worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:17:12.931563 +0000 UTC m=+197.190809292
[         worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:19:38.835779 +0000 UTC m=+343.094161167
```
Total: 2min 26s

#### After change
```
[         worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.656261 +0000 UTC m=+73.195381626
[         worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.918023 +0000 UTC m=+73.457140793
```
Total: 0.3s


After the change it is **486.67 times** faster in fetching all of the
valid files
@novoselrok

Copy link
Copy Markdown
Contributor

@varsanojidan I'd say the ~500x speed up fits into the patch criteria 😄 I checked with the team here: https://sourcegraph.slack.com/archives/C052G9Y5Y8H/p1681408718158119

@sashaostrikov sashaostrikov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:chef_kiss:

github-actions Bot pushed a commit that referenced this pull request Apr 14, 2023
I saw that the embeddings service would be calling gitserver a lot for
file stat info, once for each file in a repo that's being embedded, felt
like this could be sped up with one gitserver call per repo by modifying
the ListFiles function.

## Test plan
Unit tests pass.

![image](https://user-images.githubusercontent.com/20795805/230265838-98bd0dd4-25ba-47a7-9147-2d002ff40f95.png)

### Performance:

I wrapped some timestamps around the code that fetches all the valid
files with and without this change, tested on `sourcegraph/sourcegraph`.

**This change specifically times how long it takes to get the list of
valid files from gitserver, and NOT how long the entire embedding
takes.**

#### Before change
```
[         worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:17:12.931563 +0000 UTC m=+197.190809292
[         worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:19:38.835779 +0000 UTC m=+343.094161167
```
Total: 2min 26s

#### After change
```
[         worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.656261 +0000 UTC m=+73.195381626
[         worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.918023 +0000 UTC m=+73.457140793
```
Total: 0.3s

After the change it is **486.67 times** faster in fetching all of the
valid files

(cherry picked from commit a3c6571)
cesrjimenez pushed a commit that referenced this pull request Apr 14, 2023
I saw that the embeddings service would be calling gitserver a lot for
file stat info, once for each file in a repo that's being embedded, felt
like this could be sped up with one gitserver call per repo by modifying
the ListFiles function.

## Test plan
Unit tests pass.

![image](https://user-images.githubusercontent.com/20795805/230265838-98bd0dd4-25ba-47a7-9147-2d002ff40f95.png)

### Performance:

I wrapped some timestamps around the code that fetches all the valid
files with and without this change, tested on `sourcegraph/sourcegraph`.

**This change specifically times how long it takes to get the list of
valid files from gitserver, and NOT how long the entire embedding
takes.**

#### Before change
```
[         worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:17:12.931563 +0000 UTC m=+197.190809292
[         worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:19:38.835779 +0000 UTC m=+343.094161167
```
Total: 2min 26s

#### After change
```
[         worker] Start: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.656261 +0000 UTC m=+73.195381626
[         worker] End: [github.com/sourcegraph/sourcegraph](http://github.com/sourcegraph/sourcegraph) 2023-04-06 03:10:01.918023 +0000 UTC m=+73.457140793
```
Total: 0.3s


After the change it is **486.67 times** faster in fetching all of the
valid files
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants