Indexing: respect indexing buffer limit #686

Merged
jtibshirani merged 1 commit into main from jtibs/drop-content
Nov 10, 2023
jtibshirani commented Nov 10, 2023

Contributor

When indexing documents, we buffer them until we reach the shard size
limit (100MB), then flush the shard. If we decide to skip a document because
it's a binary file, then (naturally) we don't count its content size towards
the shard limit. But we still buffered the full document, content included.
So with a large number of binary files, we could easily blow past the 100MB
limit and run into memory issues.

This change simply clears `Content` whenever `SkipReason` is set. The
invariant: a buffered document should only ever have `SkipReason` or `Content`,
not both.
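
To make the invariant concrete, here is a minimal sketch of a buffer that clears content before buffering a skipped document. The `Document` and `Buffer` types, the `Flushes` counter, and the byte limit are simplified stand-ins invented for illustration, not Zoekt's actual builder code.

```go
package main

// Document is a simplified stand-in for the indexer's document type
// (hypothetical; the real type has more fields).
type Document struct {
	Name       string
	Content    []byte
	SkipReason string
}

// Buffer is a hypothetical shard buffer that flushes once the buffered
// content reaches limit bytes.
type Buffer struct {
	limit   int
	size    int
	docs    []Document
	Flushes int
}

// Add enforces the invariant before buffering: a document carries either
// a SkipReason or Content, never both.
func (b *Buffer) Add(doc Document) {
	if doc.SkipReason != "" {
		doc.Content = nil // release the bytes; only the metadata is kept
	}
	b.docs = append(b.docs, doc)
	b.size += len(doc.Content)
	if b.size >= b.limit {
		b.Flush()
	}
}

// Flush drops the buffered documents, standing in for writing a shard.
func (b *Buffer) Flush() {
	b.docs = nil
	b.size = 0
	b.Flushes++
}

// BufferedBytes reports how many content bytes are actually held in memory.
func (b *Buffer) BufferedBytes() int {
	total := 0
	for _, d := range b.docs {
		total += len(d.Content)
	}
	return total
}
```

Because the content is cleared before the size is added, the bytes counted towards the limit and the bytes held in memory stay equal, so skipped documents can no longer grow the buffer past the limit.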

Update: this also fixes a bug where we still ran ctags even after we had
identified a file as binary and marked it to be skipped. Now, we avoid running
ctags in these cases.

@jtibshirani

Contributor Author

I started digging into this after I noticed indexserver memory spikes on S2:

Screenshot 2023-11-09 at 5 22 43 PM

I then correlated these with a large perforce repo that is consistently failing to index:

19:23:40.318963	 .  4440	... error: command [zoekt-git-index -submodules=false -incremental -branches HEAD -language_map c_sharp:scip,go:scip,python:scip,scala:scip,typescript:scip,kotlin:scip,ruby:scip,javascript:scip,rust:scip,zig:scip -file_limit 1048576 -parallelism 8 -index /data/index -require_ctags -large_file **/fixtures.json /data/index/.indexserver.tmp/perforce-sgdev-org%2Fdevx-80k-files.git] failed: signal: killed OUT: 2023/11/09 19:22:30 attempting to index 86024 total files
19:23:40.318966	 .     3	... state: fail

The repo contents are auto-generated and contain a large number of non-source files. With this fix, Zoekt no longer chokes on indexing the repo.

jtibshirani marked this pull request as ready for review November 10, 2023 01:27

keegancsmith left a comment

Member

Nice find. What is surprising though is we avoid reading large files from git. However, it is possible those files end up being excluded for other reasons afterwards. Or we just build up a very large number of files that are excluded (e.g. a third_party or vendor dir). In other words, this is still a great change.

Given that we sometimes set SkipReason before looking at a document's content, maybe we should update Builder.Add to skip most of the work if SkipReason is already set?

zoekt/gitindex/index.go

Lines 557 to 572 in db067d1

if blob.Size > int64(opts.BuildOptions.SizeMax) && !opts.BuildOptions.IgnoreSizeMax(keyFullPath) {
	if err := builder.Add(zoekt.Document{
		SkipReason:        fmt.Sprintf("file size %d exceeds maximum size %d", blob.Size, opts.BuildOptions.SizeMax),
		Name:              keyFullPath,
		Branches:          brs,
		SubRepositoryPath: key.SubRepoPath,
	}); err != nil {
		return err
	}
	continue
}
contents, err := blobContents(blob)
if err != nil {
	return err
}
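
The suggestion to short-circuit `Builder.Add` could look roughly like the sketch below. It assumes a simplified `Builder` with a hypothetical `checkContent` helper and a `ContentChecks` counter added purely to make the skipped work observable; none of these are the real Zoekt signatures.

```go
package main

import (
	"bytes"
	"fmt"
)

// Document is a simplified stand-in for zoekt.Document (hypothetical).
type Document struct {
	Name       string
	Content    []byte
	SkipReason string
}

// Builder is a hypothetical, stripped-down builder.
type Builder struct {
	sizeMax       int
	todo          []Document
	ContentChecks int // counts how often content was inspected
}

// Add inspects content only when the caller has not already decided to
// skip the document, then clears Content for every skipped document.
func (b *Builder) Add(doc Document) {
	if doc.SkipReason == "" {
		b.ContentChecks++
		doc.SkipReason = b.checkContent(doc.Content)
	}
	if doc.SkipReason != "" {
		doc.Content = nil
	}
	b.todo = append(b.todo, doc)
}

// checkContent returns a skip reason, or "" if the content is indexable.
// The NUL-byte test is a crude placeholder for a real binary check.
func (b *Builder) checkContent(content []byte) string {
	if len(content) > b.sizeMax {
		return fmt.Sprintf("file size %d exceeds maximum size %d", len(content), b.sizeMax)
	}
	if bytes.IndexByte(content, 0) >= 0 {
		return "binary content"
	}
	return ""
}
```

With this shape, a caller that sets `SkipReason` upfront (as the snippet above does for oversized blobs) pays for no content inspection at all.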

@keegancsmith

Member

I just had a realisation, and I think this PR will fix it. In our ctagsAddSymbolsParserMap code we never check for a non-empty skip reason before parsing. That means all those skipped files will still get jammed into ctags! Maybe you can also update that code path to skip parsing if Content is empty?
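
A guard along these lines might look like the following sketch. The `addSymbols` function and the `parseFunc` type are hypothetical stand-ins for the ctags code path, not the real `ctagsAddSymbolsParserMap` signature.

```go
package main

// Document is a simplified stand-in for zoekt.Document (hypothetical).
type Document struct {
	Name       string
	Content    []byte
	SkipReason string
	Symbols    []string
}

// parseFunc stands in for a ctags invocation (hypothetical signature).
type parseFunc func(name string, content []byte) ([]string, error)

// addSymbols runs the parser only on documents that will be indexed.
// Skipped documents and documents without content never reach the parser.
func addSymbols(docs []*Document, parse parseFunc) error {
	for _, doc := range docs {
		if doc.SkipReason != "" || len(doc.Content) == 0 {
			continue
		}
		syms, err := parse(doc.Name, doc.Content)
		if err != nil {
			return err
		}
		doc.Symbols = syms
	}
	return nil
}
```

Checking both fields is redundant once the invariant holds, but it keeps the guard safe if some caller sets `SkipReason` without clearing `Content`.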

jtibshirani commented Nov 10, 2023

Contributor Author

> What is surprising though is we avoid reading large files from git. However, it is possible those files end up being excluded for other reasons afterwards.

Indeed, what happened with this perforce depot is that all the files were relatively small (so we didn't skip them upfront for being too large). But they were not recognized as source by zoekt.CheckText, so they didn't contribute to the calculated buffer size. This is probably not super common, but I guess it can happen with test repos that contain a lot of auto-generated content.
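
The gap between "counted" and "held" bytes described here can be sketched as follows. The `isText` helper is a deliberately crude placeholder that only looks for NUL bytes and is not how `zoekt.CheckText` works; `bufferStats` and its `clearSkipped` flag are likewise invented for illustration.

```go
package main

import "bytes"

// isText is a crude stand-in for a text check: it only rejects content
// that contains a NUL byte.
func isText(content []byte) bool {
	return bytes.IndexByte(content, 0) < 0
}

// bufferStats returns the bytes counted towards the shard limit and the
// bytes actually held in memory. With clearSkipped set to false it
// models the behaviour before the fix.
func bufferStats(files [][]byte, sizeMax int, clearSkipped bool) (counted, held int) {
	for _, content := range files {
		if len(content) > sizeMax {
			continue // rejected upfront; the content is never read
		}
		if isText(content) {
			counted += len(content)
			held += len(content)
			continue
		}
		// Small non-text file: never counted, but held unless cleared.
		if !clearSkipped {
			held += len(content)
		}
	}
	return counted, held
}
```

In this model, many small non-text files leave `counted` untouched while `held` keeps growing when `clearSkipped` is false, which matches the memory spikes described above.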

> That means all those skipped files will still get jammed into ctags! Maybe you can also update that code path to skip parsing if Content is empty?

This is a good point! I'll merge this now to fix the issues on S2, but then follow up with a refactor + more complete fix.

jtibshirani merged commit 2355607 into main Nov 10, 2023
jtibshirani deleted the jtibs/drop-content branch November 10, 2023 16:18
jtibshirani added a commit that referenced this pull request Nov 16, 2023