Skip to content

Exclude files in search.largeFiles#480

Merged
gl-srgr merged 10 commits into
mainfrom
garyl/negate_pattern_largeFile
Dec 8, 2022
Merged

Exclude files in search.largeFiles#480
gl-srgr merged 10 commits into
mainfrom
garyl/negate_pattern_largeFile

Conversation

@gl-srgr

@gl-srgr gl-srgr commented Nov 14, 2022

Copy link
Copy Markdown
Contributor

The site config setting search.largeFiles should allow file exclusions so that known large files can be excluded from the index.

This PR introduces extra steps in builder for handling interpretation of largeFiles patterns. Although doublestar still handles parsing for us we now handle additional cases:

  1. a pattern is prefixed by !: We use this character to define negation. Any file matching the pattern following the ! is excluded from the index, regardless if another pattern in largeFiles is a positive match on the file. Replaced with behavior that matches gitignore more closely. The current behavior is now to take the latest entry in the pattern list, and it will take precedence / override preceding entries in the pattern list. We iterate in reverse and return first pattern that matches, regardless of if it is an exclude pattern or not. Note that doublestar in a more recent version uses negate meta characters ^ and ! but only for character classes.

  2. a pattern escapes an exclamation mark \!: No interpretation of the ! so no negation is applied, and ensure our doublestar patternMatch includes the ! (and no escape \).

GH Issue: https://github.com/sourcegraph/sourcegraph/issues/40688

Comment thread build/builder.go Outdated
Comment thread build/builder.go Outdated
// IgnoreSizeMax determines whether the max size should be ignored.
func (o *Options) IgnoreSizeMax(name string) bool {
matched := false
for _, pattern := range o.LargeFiles {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could bucket patterns to distinguish negate/exclude patterns from include patterns. Then we could visit each and every exclude pattern before any include pattern, just to ensure a file is not disallowed by the former. I don't think this optimization would save us much though.

@gl-srgr gl-srgr requested a review from a team November 14, 2022 07:06
Comment thread build/builder.go Outdated
}

// if negated then strip prefix meta characters representing negated filter pattern
if strings.Index(pattern, negate) == 0 {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is better to use strings.HasPrefix here? I looked at the implementations of both and HasPrefix seems to have less overhead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I'll replace with HasPrefix

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this function would be relatively straightforward using HasPrefix. You essentially only need to deal with strings.HasPrefix(pattern, "!") and strings.HasPrefix(pattern, "\\!")

Comment thread build/builder.go Outdated

@keegancsmith keegancsmith left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of using ! and \! LGTM. Some inline comments.

FYI here is the documentation from gitignore about ! which I think we should try be consistent with. However, we always explore all files so we don't need to be consistent with the performance note about directories. https://git-scm.com/docs/gitignore#_pattern_format

An optional prefix "!" which negates the pattern; any matching file excluded by a previous pattern will become included again. It is not possible to re-include a file if a parent directory of that file is excluded. Git doesn’t list excluded directories for performance reasons, so any patterns on contained files have no effect, no matter where they are defined. Put a backslash ("\") in front of the first "!" for patterns that begin with a literal "!", for example, "\!important!.txt".

Comment thread build/builder.go Outdated
}

// if negated then strip prefix meta characters representing negated filter pattern
if strings.Index(pattern, negate) == 0 {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this function would be relatively straightforward using HasPrefix. You essentially only need to deal with strings.HasPrefix(pattern, "!") and strings.HasPrefix(pattern, "\\!")

Comment thread build/builder.go
@gl-srgr

gl-srgr commented Dec 7, 2022

Copy link
Copy Markdown
Contributor Author

You can hold off on reviewing my recent push until I finish the sourcegraph/sourcegraph PR

@gl-srgr

gl-srgr commented Dec 7, 2022

Copy link
Copy Markdown
Contributor Author

https://github.com/sourcegraph/sourcegraph/pull/45318 PR for sourcegraph/sourcegraph to update filter for searches and add test cases.

Comment thread build/e2e_test.go
@gl-srgr

gl-srgr commented Dec 8, 2022

Copy link
Copy Markdown
Contributor Author

@keegancsmith btw do I need to do any updates for these version consts?

https://sourcegraph.com/github.com/sourcegraph/zoekt@4f8bcd5bed7e28f36a217809e2a5090c073178bd/-/blob/toc.go?L31:7&popover=pinned

My line of thinking is that for the specific case of file names prefixed with literal ! which match a pattern in the largeFiles allow list which also has prefix of !: pre-upgrade they will be included in the allow list, but after upgrade the file names will not match any pattern in largeFiles since we'll strip the ! prefix from the pattern, and the large files should not be indexed (until the ! gets escaped which we assume it's not currently). If the repository remains in a steady state across the upgrade then would the file potentially never be removed from the index? The sourcegraph/sourcegraph PR will take effect immediately on disallowing (although that search code path on sourcegraph/sourcegraph is a bit unfamiliar to me atm, I'm gonna look at it further)

@keegancsmith keegancsmith left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw do I need to do any updates for these version consts?

Nah, I think the likelyhood of a pattern prefixed with ! are so low it isn't worth the tradeoff on causing reindexing. Even then, we will eventually converge assuming the repo has commits to it. All in all, I think it is fine and I'd consider noting that in the changelog entry.

@gl-srgr gl-srgr merged commit 16127e9 into main Dec 8, 2022
@gl-srgr gl-srgr deleted the garyl/negate_pattern_largeFile branch December 8, 2022 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants