Search backend: remove plan expansion of `file:contains.content()` by camdencheek · Pull Request #39501 · sourcegraph/sourcegraph-public-snapshot

camdencheek · 2022-07-27T14:33:18Z

This changes the evaluation of the file:contains.content() predicate so that it is no longer expanded ahead of time. Expansion rarely worked correctly in the past: the number of files we expanded into was enormous, and each one had to be scoped in the query by a repo, which produced extremely complex OR queries that would overflow the stack when we tried to process them.

There are two cases where we support file:contains.content():

Text search
Diff search

Text search is implemented very efficiently. For a user input like file:contains(abc) def, we execute the search as if the user had searched for abc AND def, then we filter out the ranges that matched abc (but keep any that matched def). This lets us take full advantage of our existing, efficient AND/OR machinery.
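The text search strategy described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `File`, `FileMatch`, `searchAnd`, `dropPredicateRanges`, and `fileContainsSearch` are hypothetical names, files are held in memory, and a range is attributed to the predicate when the predicate's regexp matches the whole of the range's text.

```go
package main

import (
	"fmt"
	"regexp"
)

// File is a hypothetical in-memory stand-in for a searchable file.
type File struct{ Path, Content string }

// Range is a byte range [Start, End) of one match within a file.
type Range struct{ Start, End int }

// FileMatch is a simplified search result: one file plus its matched ranges.
type FileMatch struct {
	Path, Content string
	Ranges        []Range
}

// searchAnd mimics "a AND b": a file is returned only if every regexp
// matches it, and the ranges of all regexps are reported together.
func searchAnd(files []File, res ...*regexp.Regexp) []FileMatch {
	var out []FileMatch
	for _, f := range files {
		var ranges []Range
		all := true
		for _, re := range res {
			locs := re.FindAllStringIndex(f.Content, -1)
			if len(locs) == 0 {
				all = false
				break
			}
			for _, loc := range locs {
				ranges = append(ranges, Range{Start: loc[0], End: loc[1]})
			}
		}
		if all {
			out = append(out, FileMatch{Path: f.Path, Content: f.Content, Ranges: ranges})
		}
	}
	return out
}

// dropPredicateRanges removes ranges whose text is matched in full by the
// predicate regexp (an assumption of this sketch), keeping all other ranges.
func dropPredicateRanges(m FileMatch, predicate *regexp.Regexp) FileMatch {
	kept := make([]Range, 0, len(m.Ranges))
	for _, r := range m.Ranges {
		text := m.Content[r.Start:r.End]
		loc := predicate.FindStringIndex(text)
		if loc != nil && loc[0] == 0 && loc[1] == len(text) {
			continue // this range came from the predicate; hide it
		}
		kept = append(kept, r)
	}
	m.Ranges = kept
	return m
}

// fileContainsSearch evaluates `file:contains(predicate) pattern`.
func fileContainsSearch(files []File, predicate, pattern string) ([]FileMatch, error) {
	predRe, err := regexp.Compile(predicate)
	if err != nil {
		return nil, fmt.Errorf("compile predicate: %w", err)
	}
	patRe, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("compile pattern: %w", err)
	}
	matches := searchAnd(files, predRe, patRe)
	for i := range matches {
		matches[i] = dropPredicateRanges(matches[i], predRe)
	}
	return matches, nil
}

func main() {
	files := []File{
		{Path: "a.go", Content: "abc then def"},
		{Path: "b.go", Content: "only def"},
		{Path: "c.go", Content: "only abc"},
	}
	matches, err := fileContainsSearch(files, "abc", "def")
	if err != nil {
		panic(err)
	}
	for _, m := range matches {
		fmt.Printf("%s %v\n", m.Path, m.Ranges)
	}
}
```

Only a.go satisfies both patterns, and only its def range survives the filter. A real implementation would need a more careful way to attribute ranges when the two patterns can match the same text.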

Diff search is implemented in an extremely inefficient manner. For each result that comes through, we execute an unindexed search on the files matched in the diff at the matched commit and ensure that they contain all the patterns specified by the file:contains.content() predicate. This is slow, but diff search is also slow, and I expect that the file:contains.content() feature for diff search is hardly used, if at all, so I think it's fine. I don't want to put the effort into supporting this natively in diff search right now.
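The diff search post-filter can be pictured with a sketch like the one below. It is an illustration under assumptions, not the PR's implementation: `CommitMatch`, `ReadFileFunc`, and `filterByFileContents` are hypothetical names, the file reader stands in for the unindexed search at the matched commit, and the sketch assumes that a commit is kept with only those changed files that contain every pattern (and dropped when none do).

```go
package main

import (
	"fmt"
	"regexp"
)

// CommitMatch is a simplified diff search result: a commit and the paths
// of the files its diff touched.
type CommitMatch struct {
	Commit string
	Paths  []string
}

// ReadFileFunc is a hypothetical stand-in for reading a file's content
// at a given commit.
type ReadFileFunc func(commit, path string) (string, error)

// filterByFileContents keeps, for each commit, only the changed files whose
// content at that commit matches every pattern. Commits with no such files
// are dropped.
func filterByFileContents(matches []CommitMatch, patterns []string, read ReadFileFunc) ([]CommitMatch, error) {
	res := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("compile %q: %w", p, err)
		}
		res = append(res, re)
	}

	var kept []CommitMatch
	for _, m := range matches {
		var paths []string
		for _, path := range m.Paths {
			// One lookup per file per result: this is the slow part.
			content, err := read(m.Commit, path)
			if err != nil {
				return nil, err
			}
			if containsAll(content, res) {
				paths = append(paths, path)
			}
		}
		if len(paths) > 0 {
			kept = append(kept, CommitMatch{Commit: m.Commit, Paths: paths})
		}
	}
	return kept, nil
}

// containsAll reports whether every regexp matches somewhere in content.
func containsAll(content string, res []*regexp.Regexp) bool {
	for _, re := range res {
		if !re.MatchString(content) {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical history: after_success only exists in the newer commit.
	contents := map[string]string{
		"old:.travis.yml": "script: make test",
		"new:.travis.yml": "script: make test\nafter_success: deploy",
	}
	read := func(commit, path string) (string, error) {
		return contents[commit+":"+path], nil
	}
	matches := []CommitMatch{
		{Commit: "old", Paths: []string{".travis.yml"}},
		{Commit: "new", Paths: []string{".travis.yml"}},
	}
	kept, err := filterByFileContents(matches, []string{"after_success"}, read)
	if err != nil {
		panic(err)
	}
	fmt.Println(kept)
}
```

Because the content is checked at the commit being searched rather than at HEAD, only the commit whose version of the file contains the pattern is kept.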

Stacked on https://github.com/sourcegraph/sourcegraph/pull/39383

This is the last predicate that used the query expansion machinery, so I will remove that in the next PR.

Test plan

Added tests, backend integration test changes reflect changes in behavior.

* refactor and adding tests
* update schema for new policy
* remove PasswordPolicy from default dev config
* add more tests, fix bugs
* Add PasswordPolicy changes to frontend
* add PasswordPolicy jsoncontext for frontend
* add PasswordPolicy to schema
* create security helper functions, code re-use
* dedup validatePassword functionality
* remove duplicate auth.passwordPolicy
* always use minPasswordLen
* fix build, add types, fix tests
* add experimental passwordpolicy back to schema
* refactor, dedup password requirement check
* add more tests
* refactor, run linters
* add deprecation notices
* always return a GenericPasswordPolicy
* remove conf.go move to general conf
* remove interface type
* deprecate PasswordPolicy
* fix tests, refactor
* run pretty
* serialize json as frontend expects
* Apply suggestions from code review
  Co-authored-by: Thorsten Ball <mrnugget@gmail.com>
* Apply suggestions from code review
  Co-authored-by: Thorsten Ball <mrnugget@gmail.com>
* change test to be table driven

Co-authored-by: Thorsten Ball <mrnugget@gmail.com>


…n server-side execution (#38921)

* add migration
* add sql for migration
* add sql to squashed.sql
* specify batch change id in createBatchSpecFromRaw
* pass batch change in frontend
* add comment
* arrange methods properly
* remove trailing lines
* revert arrangement
* update
* fix failing test
* fix failing tests
* add test for resolver
* add test for service
* add test for service
* more test for service
* set field name to bigint
* remove comment
* update the update query
* remove confusing comment
* rename variables
* rename BastchChangeID variable
* add test for unauthorised user creating a batch spec from raw
* update db schema
* remove trailing line
* remove comment
* prettier things


For permission testing I wrote this command to conveniently switch between users. It is like the existing testproxy, but more convenient to use and targeted at auth testing rather than HTTP-header testing.

I'm unsure how to document this further so people are aware of it. Alternatively, it might also be useful to spin it up by default in our enterprise env. I'll leave both of those for future PRs. For now I will advertise it in #dev-chat and #dev-experience.

Here is the example output to give you a feel for what it does:

    $ go run ./dev/internal/cmd/auth-proxy-http-header
    https://docs.sourcegraph.com/admin/auth#http-authentication-proxies
    "auth.providers": [
      {
        "type": "http-header",
        "usernameHeader": "X-Forwarded-User",
        "emailHeader": "X-Forwarded-Email"
      }
    ]
    Visit http://127.0.0.1:10810 for keegan keegan@sourcegraph.com
    Visit http://127.0.0.1:10811 for user1 keegan+user1@sourcegraph.com
    Visit http://127.0.0.1:10812 for user2 keegan+user2@sourcegraph.com
    Visit http://127.0.0.1:10813 for user3 keegan+user3@sourcegraph.com
    Visit http://127.0.0.1:10814 for user4 keegan+user4@sourcegraph.com
    Visit http://127.0.0.1:10815 for user5 keegan+user5@sourcegraph.com

Test Plan: Ran locally


…information (#39250)

This PR adds support for reading the [GitHub style code owner specs](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners) from repositories to infer code ownership information automatically.

To do that, we add a new dependency: https://github.com/hmarr/codeowners (MIT license so I don't see any legal issues).

Here's a bullet point recap of what changes we can find here:

- I've added a new `codeownership.Ruleset` type that currently is a proxy to the `codeowners.Ruleset` (from the new dependency). This allows us to expose a nice API but also will make it easier to later add other ways to get the code ownership mapping: E.g. we may want to fetch data rules from our postgres database instead.
- We have a new method to create a ruleset based on querying the gitserver for a specific repo at a specific revision. This will download a static number of blobs that are [allowed code owner locations](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners#codeowners-file-location). (The plugin also [supports GitLab style code owners](hmarr/codeowners@f72d282) so I've added the [`.gitlab` path](https://docs.gitlab.com/ee/user/project/code_owners.html#set-up-code-owners).)
- In the filter job, we create a new `map` so that we only have to create the `codeowners.Ruleset` for a specific repo once per search. So if one repo returns 100 results, we only have one ruleset.
- When a code ownership mapping was found, we test the result paths against the mapping. If the owners contain the owner we need to include, we keep the result, otherwise we drop it.
- I added a new test for the filter job.
- I also migrated the filter job test to use `autogold` while at it.
- I added a feature flag to gate out the work.
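The per-search ruleset cache and owner filter from the recap above might look roughly like this. It is a sketch under assumptions rather than the PR's code: `Ruleset` here is a hypothetical stand-in that matches by path prefix (the real `codeowners.Ruleset` uses CODEOWNERS patterns), `LoadFunc` stands in for fetching the file from gitserver, and dropping results from repos without a ruleset is an assumption of this sketch. It is also not safe for concurrent use.

```go
package main

import (
	"fmt"
	"strings"
)

// rule maps a path prefix to owners (hypothetical, simplified matching).
type rule struct {
	prefix string
	owners []string
}

// Ruleset is a hypothetical stand-in for a parsed code owners file.
type Ruleset struct{ rules []rule }

// Owners returns the owners of the last rule whose prefix matches path.
func (rs *Ruleset) Owners(path string) []string {
	var owners []string
	for _, r := range rs.rules {
		if strings.HasPrefix(path, r.prefix) {
			owners = r.owners // later rules take precedence
		}
	}
	return owners
}

// LoadFunc is a hypothetical loader that builds a ruleset for a repo.
// It may return nil when the repo has no code owners file.
type LoadFunc func(repo string) (*Ruleset, error)

// Result is a simplified search result.
type Result struct{ Repo, Path string }

// filterByOwner keeps results owned by owner, loading each repo's ruleset
// at most once per call.
func filterByOwner(results []Result, owner string, load LoadFunc) ([]Result, error) {
	cache := map[string]*Ruleset{}
	var kept []Result
	for _, r := range results {
		rs, ok := cache[r.Repo]
		if !ok {
			var err error
			if rs, err = load(r.Repo); err != nil {
				return nil, err
			}
			cache[r.Repo] = rs // nil is cached too, so we never reload
		}
		if rs == nil {
			continue // assumption: no ruleset means no known owner
		}
		for _, o := range rs.Owners(r.Path) {
			if o == owner {
				kept = append(kept, r)
				break
			}
		}
	}
	return kept, nil
}

func main() {
	loads := 0
	load := func(repo string) (*Ruleset, error) {
		loads++
		return &Ruleset{rules: []rule{
			{prefix: "", owners: []string{"@everyone"}},
			{prefix: "docs/", owners: []string{"@docs-team"}},
		}}, nil
	}
	results := []Result{
		{Repo: "example/repo", Path: "docs/index.md"},
		{Repo: "example/repo", Path: "main.go"},
		{Repo: "example/repo", Path: "docs/api.md"},
	}
	kept, err := filterByOwner(results, "@docs-team", load)
	if err != nil {
		panic(err)
	}
	fmt.Println(kept, "loads:", loads)
}
```

Keeping the cache local to one call mirrors the "once per search" scope described in the recap: nothing is shared across searches, so there is no invalidation to worry about.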


This PR makes two functional improvements to the RepoInfo client method and a semantic one.

The functional improvements are:

1. Limit length of map to max number of shards. For reasons now unknown, we were making an unnecessarily high number of requests because the `shards` map was being created as a factor of the number of repos. It should only ever be the total number of gitserver shards currently running.
2. Do not hide response body for failed HTTP reqs. Currently, if the API server returns anything but a 200 OK, we return the HTTP status but swallow the HTTP body. The body contains important information that helps in understanding the error, which is not possible as of now.

The semantic improvement is:

1. Add comment and rename shard -> repoInfoReq. I spent an unnecessarily large amount of time trying to "fix" this code for what I thought was not the optimal way of collecting and sending the API requests, until I had the aha moment and finally understood why we're doing it this way. As a result, I added a comment to clarify this and renamed shard to repoInfoReq, as shard doesn't look like the right name here and confused me further when I was reading this code.
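The two functional improvements can be illustrated with a short sketch. Everything here is hypothetical and written for illustration: `addrForRepo` uses an FNV hash as a stand-in for however gitserver actually assigns repos to shards, `groupByShard` shows why the map never needs more entries than there are shards, and `checkResponse` shows one way to keep a bounded part of the response body in the error.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"io"
	"net/http"
	"strings"
)

// addrForRepo picks a shard address for a repo. The hashing scheme is an
// assumption of this sketch. addrs must be non-empty.
func addrForRepo(repo string, addrs []string) string {
	h := fnv.New32a()
	h.Write([]byte(repo))
	return addrs[int(h.Sum32()%uint32(len(addrs)))]
}

// groupByShard batches repo names by shard address, so one request per
// shard is enough. The map is sized by shard count, not repo count.
func groupByShard(repos, addrs []string) map[string][]string {
	groups := make(map[string][]string, len(addrs))
	for _, repo := range repos {
		addr := addrForRepo(repo, addrs)
		groups[addr] = append(groups[addr], repo)
	}
	return groups
}

// checkResponse returns an error that includes up to 1 KiB of the body
// when the status is not 200 OK, instead of reporting the status alone.
func checkResponse(resp *http.Response) error {
	if resp.StatusCode == http.StatusOK {
		return nil
	}
	body, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<10))
	return fmt.Errorf("unexpected status %s: %s", resp.Status, strings.TrimSpace(string(body)))
}

func main() {
	addrs := []string{"gitserver-0:3178", "gitserver-1:3178"}
	repos := []string{"github.com/a/a", "github.com/b/b", "github.com/c/c"}
	for addr, names := range groupByShard(repos, addrs) {
		fmt.Println(addr, names)
	}

	resp := &http.Response{
		Status:     "500 Internal Server Error",
		StatusCode: http.StatusInternalServerError,
		Body:       io.NopCloser(strings.NewReader("repo not cloned\n")),
	}
	fmt.Println(checkResponse(resp))
}
```

Limiting how much of the body is read keeps a misbehaving server from inflating error messages, while still surfacing the part that usually explains the failure.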

… database (#39483)

Previously every test case ran in the same test function, so each subsequent test case used a database whose state had been changed by all the previous test cases. As all the test cases deal with similar tables in the database, this could lead to unexpected behaviour or false positives. That possibility is now eliminated, as every test runs against a clean database.

phabricator: Simply repo create

We no longer need to perform a remote call from repo-updater in order to create a Phabricator repo; we can simply call the DB directly. This removes the need for an internal handler in frontend to talk to the DB, as well as the client code to call this handler.

camdencheek · 2022-07-27T18:09:52Z

Note: this is a change in behavior. Previously, we would search for files containing after_success on the HEAD commit, which would return .travis.yml, then we would search for all diffs on that file. Now, we search for diffs where the file contains after_success in the commit that is being searched. after_success was added in the last commit that touched .travis.yml, which is why this now returns 1 result rather than 10.


rvantonder

Nice built-in treatment for text search :-) I played around locally, all looks great

sourcegraph-bot · 2022-07-27T19:31:39Z

Codenotify: Notifying subscribers in OWNERS files for diff ee0908d...fb601f6.

Notify	File(s)
@mrnugget	dev/sg/internal/analytics/analytics.go dev/sg/internal/analytics/context.go dev/sg/internal/analytics/tracer.go dev/sg/internal/check/runner.go dev/sg/sg_migration.go
@sourcegraph/dev-experience	dev/sg/internal/analytics/analytics.go dev/sg/internal/analytics/context.go dev/sg/internal/analytics/tracer.go dev/sg/internal/check/runner.go dev/sg/sg_migration.go enterprise/dev/ci/internal/ci/operations.go


sourcegraph-bot · 2022-07-27T20:02:44Z

Not notifying subscribers because the number of notifying subscribers (20) has exceeded the threshold (15).

evict and others added 30 commits July 26, 2022 18:04

usage-data: push scraped events to pubsub if telemetry job is enabled (#39389)

70b8f57

dev/sg/check: fix Runner tracing (#39197)

1da629b

insights: determine if query targets a single repo (#39382)

0f438a4

migration: Add stitch utilities (#36319)

bdb58fd

log: use log import instead of otfields (#39464)

af8079c

Fix missing button in code nav policy config in small browser window (#39115)

80ed728

executors: Tweak wording on config page (#39111)

76d8e6e

docs: Mention enabling the IAM API for GCP executors (#39110)

a00950c

rockskip: Update CHANGELOG for ROCKSKIP_MIN_REPO_SIZE_MB (#38314)

7a7ed77

Bump src-cli to 3.42.2 (#39466)

994c2bc

This prevents serving 3.42.0 by default, which is broken for the preview/apply workflows.

docs: Update code nav images (#39109)

1a09932

alerts: Do not alert on long transactions in codeintel-db (#36619)

477f4cf

oobmigration: Add bounds to ScheduleMigrationInterrupts (#39458)

7701884

oobmigration: Add version parser (#39457)

8ccd14a

migration: Return leaves when stitching definitions (#39459)

bcd0b0f

notebooks: improvements to the read-only mode (#39407)

cd1391b

Create simple page for lockfile index (#39427)

0bc0766

oobmigration: derive subscription account number and extract license key fields (#39344)

b4bb2d2

ci: wrap yarn into a retrying loop (#39454)

09ee1f3

frontend: Remove unused resolve-revision handler (#39451)

c550fa9

Instrumentation showed it is not being used anymore.

Move git-extras functionality to the core workflow (#39133)

a2a661f

codeintel: Validate package names before insertion. (#39444)

d8516e8

dump lsif_configuration_policies (#39486)

84264dc

lib/group: fix race condition in test (#39340)

124dc27

camdencheek commented Jul 27, 2022

View reviewed changes

camdencheek mentioned this pull request Jul 27, 2022

Search predicates: remove expansion machinery #39520

Merged

insights: graphql query schema to support insights based on a search query (#39518)

3b7722c

camdencheek requested review from rvantonder and tbliu98 July 27, 2022 18:35

Backend: add a flaky test exception (#39521)

96931a2

add a flaky test exception

rvantonder approved these changes Jul 27, 2022

View reviewed changes

Comment thread internal/search/job/jobutil/filter_file_contains.go Outdated

Search backend: add structured diff to CommitMatch result (#39383)

cecd1da

An error occurred while trying to automatically change base from cc/structured-diffs to main July 27, 2022 19:24

An error occurred while trying to automatically change base from cc/structured-diffs to main July 27, 2022 19:25

camdencheek added 10 commits July 27, 2022 13:29

move file:contains.content predicate evaluation to job

9ae53e1

simplify matchers

94b5444

rename for clarity

bd1c7ce

add case sensitivity

47ce33f

wip

fa0eaa6

wip

dbd7592

wip

914f05b

add tests for diff

12f5f64

update gql tests

c8bd7d5

update comment

2740885

camdencheek force-pushed the backend-integration/cc/file-contains-predicate branch from 57c6355 to 2740885 Compare July 27, 2022 19:29

Update internal/search/job/jobutil/filter_file_contains.go

fb601f6

Co-authored-by: Rijnard van Tonder <rvantonder@gmail.com>

camdencheek merged commit d420562 into cc/structured-diffs Jul 27, 2022

camdencheek mentioned this pull request Jul 27, 2022

Search backend: re-open #39501 against main #39526

Merged

camdencheek mentioned this pull request Aug 4, 2022

Streamline existing search predicates #38367

Closed


camdencheek merged 58 commits into cc/structured-diffs from backend-integration/cc/file-contains-predicate


20 participants
