Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Search backend: remove plan expansion of file:contains.content()#39501

Merged
camdencheek merged 58 commits into
cc/structured-diffsfrom
backend-integration/cc/file-contains-predicate
Jul 27, 2022
Merged

Search backend: remove plan expansion of file:contains.content()#39501
camdencheek merged 58 commits into
cc/structured-diffsfrom
backend-integration/cc/file-contains-predicate

Conversation

@camdencheek

@camdencheek camdencheek commented Jul 27, 2022

Copy link
Copy Markdown
Member

This modifies the evaluation of the file:contains.content() predicate to no longer expand ahead of time. There are very few cases where this would work correctly in the past because the number of files we expanded into was enormous, and they all had to be scoped in the query by a repo, which made for extremely complex OR queries that would just cause stack overflows when we tried to process them.

There are two cases where we support file:contains.content():

  1. Text search
  2. Diff search

Text search is implemented in a very efficient manner. Basically, for a user input like file:contains(abc) def, we execute the search as if the user searched for abc and def, then we filter out the ranges that matched abc (but keep any that match def). This lets us to take full advantage of our existing, efficient AND/OR machinery.

Diff search is implemented in an extremely inefficient manner. For each result that comes through, we execute an unindexed search on the files matched in the diff at the matched commit and ensure that they contain all the patterns specified by the file:contains.content() predicate. This is slow, but diff search is also slow, and I expect that the file:contains.content() feature for diff search is hardly used, if at all, so I think it's fine. I don't want to put the effort into supporting this natively in diff search right now.

Stacked on https://github.com/sourcegraph/sourcegraph/pull/39383

This is the last predicate that used the query expansion machinery, so I will remove that in the next PR.

Test plan

Added tests, backend integration test changes reflect changes in behavior.

evict and others added 30 commits July 26, 2022 18:04
* refactor and adding tests

* update schema for new policy

* remove PasswordPolicy from default dev config

* add more tests, fix bugs

* Add PasswordPolicy changes to frontend

* add PasswordPolicy jsoncontext for frontend

* add PasswordPolicy to schema

* create security helper functions, code re-use

* dedup validatePassword functionality

* remove duplicate auth.passwordPolicy

* always use minPasswordLen

* fix build, add types, fix tests

* add experimental passwordpolicy back to schema

* refactor, dedup password requirement check

* add more tests

* refactor, run linters

* add deprecation notices

* always return a GenericPasswordPolicy

* remove conf.go move to general conf

* remove interface type

* deprecate PasswordPolicy

* fix tests, refactor

* run pretty

* serialize json as frontend expects

* Apply suggestions from code review

Co-authored-by: Thorsten Ball <mrnugget@gmail.com>

* Apply suggestions from code review

Co-authored-by: Thorsten Ball <mrnugget@gmail.com>

* change test to be table driven

Co-authored-by: Thorsten Ball <mrnugget@gmail.com>
…n server-side execution (#38921)

* add migration

* add sql for migration

* add sql to squashed.sql

* specify batch change id in createBatchSpecFromRaw

* pass batch change in frontend

* add comment

* arrange methods properly

* remove trailing lines

* revert arrngement

* update

* fix failing test

* fix failing tests

* add test for resolver

* add test for service

* add test for service

* more test for service

* set field name to bigint

* remove comment

* update the update query

* remove confusing comment

* rename variables

* rename BastchChangeID variable

* add test for unauthorised user creating a batch spec from raw

* update db schema

* remove trailing line

* remove comment

* prettier things
This prevents serving 3.42.0 by default, which is broken for the preview/apply workflows.
For permission testing I wrote this command to very conveniently switch
between users. This is kind of like the already existing testproxy,
except much more convenient to use as well as being targetted for auth
testing rather than http-header testing.

I'm unsure of how to document this further so people are aware of it.
Alternatively I think it is also useful to maybe spin up by default in
our enterprise env. I'll leave both of those for future PRs. For now I
will advertise in #dev-chat and #dev-experience.

Here is the example output to give you a feel for what it does:

  $ go run ./dev/internal/cmd/auth-proxy-http-header
  https://docs.sourcegraph.com/admin/auth#http-authentication-proxies

    "auth.providers": [
      {
        "type": "http-header",
        "usernameHeader": "X-Forwarded-User",
        "emailHeader": "X-Forwarded-Email"
      }
    ]

  Visit http://127.0.0.1:10810 for keegan keegan@sourcegraph.com
  Visit http://127.0.0.1:10811 for user1 keegan+user1@sourcegraph.com
  Visit http://127.0.0.1:10812 for user2 keegan+user2@sourcegraph.com
  Visit http://127.0.0.1:10813 for user3 keegan+user3@sourcegraph.com
  Visit http://127.0.0.1:10814 for user4 keegan+user4@sourcegraph.com
  Visit http://127.0.0.1:10815 for user5 keegan+user5@sourcegraph.com

Test Plan: Ran locally
…information (#39250)

This PR adds support for reading the [GitHub style code owner
specs](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners)
from repositories to infer code ownership information automatically. To
do that, we add a new dependency: https://github.com/hmarr/codeowners
(MIT license so I don't see any legal issues).

Here's a bullet point recap of what changes we can find here:

- I've added a new `codeownership.Ruleset` type that currently is a
  proxy to the `codeowners.Ruleset` (from the new dependency). This
  allows us to expose a nice API but also will make it easier to later
  add other ways to get the code ownership mapping: E.g. we may want to
  fetch data rules from our postgres database instead. 
- We have a new method to create a ruleset based on querying the
  gitserver for a specific repo at a specific revision. This will
  download a static number of blobs that are [allowed code owner
  locations](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners#codeowners-file-location)
  (The plugin also [supports GitLab style code
  owners](hmarr/codeowners@f72d282)
  so I've added the [`.gitlab`
  path](https://docs.gitlab.com/ee/user/project/code_owners.html#set-up-code-owners)) 
- In the filter job, we create a new `map` so that we only have to
  create the `codeowners.Ruleset` for a specific repo once per search.
  So if one repo returns 100 results, we only have one ruleset. 
- When a code ownership mapping was found, we test the result paths
  against the mapping. If the owners contain the owner we need to
  include, we keep the result, otherwise we drop it.
- I added a new test for the filter job.
- I also migrated the filter job test to use `autogold` while at it.  
- I added a feature flag to gate out the work.
Instrumentation showed it is not being used anymore.
This PR makes two functional improvements to the RepoInfo client
method and a semantic one.

The functional improvements are:

1. Limit length of map to max number of shards

For reasons unknown now, we were making an unnecesarily high number of
requests since the `shards` map being created as a factor of number of
repos. But it should only ever be the total number of shards of
gitserver currently running.

2. Do not hide response body for failed HTTP reqs

Currently, if the API server returns anythign but a 200 OK, we return
the HTTP status, but swallow the HTTP body. This will contain
important information that will be helpful to understand the error,
which is not possible as of now.

The semantic improvement is:

1. Add comment and rename shard -> repoInfoReq

I spent an unnecessarily high amount of time trying to "fix" this code
for what I thought was not the optimal way of collecting and sending
the API requests, until I had the aha moment and finally understood
why we're doing it this way.

As a result, added a comment to clarify this as well as renamed shard
to repoInfoReq as it doesn't look like that's the right name here and
confused me somewhat more when I was reading this code.
… database (#39483)

Previously every test case was run in the same test function which led to every consequent test case use the database with some state changed from every previous test case. As all the test cases deal with similar tables in the database, this can lead to unexpected behaviour or false positives, which possibility is now eliminated as every test runs against the clean database.
phabricator: Simply repo create

We no longer need to perform a remote call from repo-updater in order to
create a Phabricator repo, we can simply call the DB directly. This
removes the need for an internal handler in frontend to talk to the DB
as well as the client code to call this handler.
Comment thread dev/gqltest/search_test.go Outdated
Comment on lines 1289 to 1291

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is a change in behavior. Previously, we would search for files containing after_success on the HEAD commit, which would return .travis.yml, then we would search for all diffs on that file. Now, we search for diffs where the file contains after_success in the commit that is being searched. after_success was added in the last commit that touched .travis.yml, which is why this now returns 1 result rather than 10.

@rvantonder rvantonder left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice built-in treatment for text search :-) I played around locally, all looks great

Comment thread internal/search/job/jobutil/filter_file_contains.go Outdated
An error occurred while trying to automatically change base from cc/structured-diffs to main July 27, 2022 19:24
An error occurred while trying to automatically change base from cc/structured-diffs to main July 27, 2022 19:25
An error occurred while trying to automatically change base from cc/structured-diffs to main July 27, 2022 19:25
An error occurred while trying to automatically change base from cc/structured-diffs to main July 27, 2022 19:25
@camdencheek camdencheek force-pushed the backend-integration/cc/file-contains-predicate branch from 57c6355 to 2740885 Compare July 27, 2022 19:29
@sourcegraph-bot

sourcegraph-bot commented Jul 27, 2022

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in OWNERS files for diff ee0908d...fb601f6.

Notify File(s)
@mrnugget dev/sg/internal/analytics/analytics.go
dev/sg/internal/analytics/context.go
dev/sg/internal/analytics/tracer.go
dev/sg/internal/check/runner.go
dev/sg/sg_migration.go
@sourcegraph/dev-experience dev/sg/internal/analytics/analytics.go
dev/sg/internal/analytics/context.go
dev/sg/internal/analytics/tracer.go
dev/sg/internal/check/runner.go
dev/sg/sg_migration.go
enterprise/dev/ci/internal/ci/operations.go

Co-authored-by: Rijnard van Tonder <rvantonder@gmail.com>
@camdencheek camdencheek merged commit d420562 into cc/structured-diffs Jul 27, 2022
@sourcegraph-bot

Copy link
Copy Markdown
Contributor

Not notifying subscribers because the number of notifying subscribers (20) has exceeded the threshold (15).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.