fix: De-dup and concurrent-ify file content requests & splitting#64169
Conversation
d8c5548 to
165bfdf
Compare
d4fe5ea to
33cf283
Compare
165bfdf to
ac31357
Compare
33cf283 to
413c4af
Compare
There was a problem hiding this comment.
I think the iterator (and mixing the grouping and fetching logic) makes this very hard to read.
Can't we split out the grouping logic as its own synchronous step and then iterate over neededContents with conc.Map?
Any worries that accumulating whole files for all usages before starting to slice the contents could be a problem? If we grouped usages by file before we should be able to resolve the lines as soon as we get the file content and deallocate the remaining file.
There was a problem hiding this comment.
I've simplified the iterator logic.
Can't we split out the grouping logic as its own synchronous step and then iterate over neededContents with conc.Map?
We can't iterate over a map using conc.Map, so we'd need to maintain a separate slice (I was initially going down that route). So it has its own downside where you have the state in two different places, so I chose to implement it this way.
Any worries that accumulating whole files for all usages before starting to slice the contents could be a problem?
I don't think so, no, unless we get a lot of large files. Most files should be a few KB, multiply that by 1K tops you get a few MB at most. We can pass in the lines of interest if needed. If we do that, doing a 2-pass approach would be simpler using conc.Map.
There was a problem hiding this comment.
I gave it a quick try. Shouldn't this do what we need?
func NewPreciseUsageResolvers(ctx context.Context, gitserverClient gitserver.Client, usages []shared.UploadUsage) ([]resolverstubs.UsageResolver, error) {
uniquePaths := genslices.Uniq(genslices.Map(usages, func(usage shared.UploadUsage) types.RepoCommitPath {
return types.RepoCommitPath{
Repo: usage.Upload.RepositoryName,
Commit: usage.TargetCommit,
Path: usage.Path.RawValue(),
}
}))
contentMap := map[types.RepoCommitPath][]string{}
contentMapLock := sync.Mutex{}
conciter.Iterator[types.RepoCommitPath]{MaxGoroutines: 4}.ForEach(uniquePaths, func(rcp *types.RepoCommitPath) {
byteContents, err := gitresolvers.GetFullContents(ctx, gitserverClient,
api.RepoName(rcp.Repo), api.CommitID(rcp.Commit), core.NewRepoRelPathUnchecked(rcp.Path))
if err != nil {
return
}
contentMapLock.Lock()
contentMap[*rcp] = strings.Split(string(byteContents), "\n")
contentMapLock.Unlock()
})
out := make([]resolverstubs.UsageResolver, 0, len(usages))
for _, usage := range usages {
contents, ok := contentMap[types.RepoCommitPath{
Repo: usage.Upload.RepositoryName,
Commit: usage.TargetCommit,
Path: usage.Path.RawValue(),
}]
if !ok {
continue
}
...
}There was a problem hiding this comment.
Sure, but (1) that needlessly does two passes with genslices.Uniq(genslices.Map and (2) it's tied to the implementation of constructing precise resolvers.
Honestly, at this point, I feel like there isn't a big difference except what I've written is a bit more low-level/imperative, whereas your implementation is more functional.
That said, I will change the code because I don't think we should spend more time debating this.
befd877 to
fe5531e
Compare
fe5531e to
0355402
Compare
This speeds up precise find usages from about 600ms to 400ms
for a test workload involving 16 precise usages across 5 files.
(Specifically for the
codenav.Servicetype)Fixes GRAPH-781
Stacked on top of:
I have some more thoughts on speeding this up further, but
I'll likely look into that after adding tests.
Test plan
Manually tested. Will add automated tests in a follow-up PR.