Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Search: add repo:has.topic() predicate#48875

Merged
camdencheek merged 19 commits into
mainfrom
cc/repo-has-topic
Mar 9, 2023
Merged

Search: add repo:has.topic() predicate#48875
camdencheek merged 19 commits into
mainfrom
cc/repo-has-topic

Conversation

@camdencheek

@camdencheek camdencheek commented Mar 7, 2023

Copy link
Copy Markdown
Member

This adds first-class support for searching repos by GitHub topic.

I added automated ingestion of GitHub topics in https://github.com/sourcegraph/sourcegraph/pull/47635. This PR add support for searching repos using this information.

Overview of this PR:

  • Add a GIN index to the metadata stored in Postgres. This enables fast queries on topics even at the scale of sourcegraph.com (I did some manual perf testing to ensure this index can be used).
  • Add the repo:has.topic() predicate, which exposes this via the search language
  • Add hovers, completions, and documentation

This is an alternative to the planned implementation where we ingest topics as KVPs then query them with repo:has.tag(topic). I decided to switch course for a few reasons:

  1. It's been brought up that repo:has.tag() is confusing because it suggests git tags. repo:has.topic() has a meaning that is more obvious.
  2. Keeping the repo_kvp table consistent with the repo table is surprisingly difficult, and there were a lot of edge cases like conflicting keys that would be difficult to solve in a maintainable way.
  3. It keeps repo syncing and search unentangled from each other and greatly simplifies the implementation complexity.

Test plan

I added a handful of unit tests, database tests, and integration tests to exercise the new behavior. I manually tested that the added index works well on sourcegraph.com.

@cla-bot cla-bot Bot added the cla-signed label Mar 7, 2023
@camdencheek camdencheek changed the title Cc/repo has topic Search: add repo:has.topic() predicate Mar 7, 2023
@sg-e2e-regression-test-bob

sg-e2e-regression-test-bob commented Mar 8, 2023

Copy link
Copy Markdown

Bundle size report 📦

Initial size Total size Async size Modules
0.01% (+0.33 kb) 0.00% (+0.64 kb) 0.00% (+0.31 kb) 0.00% (0)

Look at the Statoscope report for a full comparison between the commits c7c225f and 46e0f62 or learn more.

Open explanation
  • Initial size is the size of the initial bundle (the one that is loaded when you open the page)
  • Total size is the size of the initial bundle + all the async loaded chunks
  • Async size is the size of all the async loaded chunks
  • Modules is the number of modules in the initial bundle

@camdencheek camdencheek changed the base branch from main to garyl/fix_app_local_resolver March 8, 2023 00:13
Base automatically changed from garyl/fix_app_local_resolver to main March 8, 2023 00:17
@camdencheek

camdencheek commented Mar 8, 2023

Copy link
Copy Markdown
Member Author

I will be out for most of tomorrow, and backend integration tests aren't yet passing, but I wanted to get this open for review ASAP given branch cut is looming. I will fix the integration tests once I get back.

@camdencheek camdencheek marked this pull request as ready for review March 8, 2023 01:26
@sourcegraph-bot

sourcegraph-bot commented Mar 8, 2023

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff 46e0f62...c7c225f.

Notify File(s)
@eseliger internal/database/repos.go
internal/database/repos_test.go
@fkling client/shared/src/search/query/completion.test.ts
client/shared/src/search/query/completion.test.ts
client/shared/src/search/query/decoratedToken.test.ts
client/shared/src/search/query/decoratedToken.test.ts
client/shared/src/search/query/decoratedToken.ts
client/shared/src/search/query/decoratedToken.ts
client/shared/src/search/query/hover.test.ts
client/shared/src/search/query/hover.test.ts
client/shared/src/search/query/hover.ts
client/shared/src/search/query/hover.ts
client/shared/src/search/query/predicates.test.ts
client/shared/src/search/query/predicates.test.ts
client/shared/src/search/query/predicates.ts
client/shared/src/search/query/predicates.ts
@keegancsmith internal/search/job/jobutil/job.go
internal/search/job/jobutil/job_test.go
internal/search/query/predicate.go
internal/search/query/predicate_test.go
internal/search/query/types.go
internal/search/repos/repos.go
internal/search/types.go
@sourcegraph/delivery doc/admin/repo/metadata.md
@unknwon dev/gqltest/search_test.go

@jtibshirani jtibshirani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me, but I still need to review the database logic. To check -- do you want to ship this with the upcoming release (so we should be speedy with reviews!)

One overall question: am I understanding right that we'll now have both has.tag and has.topic predicates? This feels a little confusing, will we eventually remove has.tag?

Comment thread doc/code_search/reference/language.md Outdated
@camdencheek

Copy link
Copy Markdown
Member Author

do you want to ship this with the upcoming release

Yes! We have quite a few customers asking for this, so I'd really like to not make them wait until June 🙂

we'll now have both has.tag and has.topic predicates?

Yes, that is correct. repo:has.tag() is still for arbitrary user-added key-value-pairs (it's just a key-value pair with a null value). repo:has.topic() is specifically for topics as defined by the codehost. We do not support adding topics in Sourcegraph.

I agree this is a little confusing. I would love to do a little bit of consolidation of our query language, and the kvp predicates could use some careful design.

@camdencheek

Copy link
Copy Markdown
Member Author

Fixed backend integration tests. The problem is I added topics to the repos in github.com/sgtest, but we actually use ghe.sgdev.org/sgtest for integration tests 🤦

@camdencheek

Copy link
Copy Markdown
Member Author

I still need to review the database logic

I added a few comments to the code in an attempt to help clarify what's going on there

@jtibshirani jtibshirani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I gave my best shot at a review, but I'm still getting up to speed on our SQL standards and approach to database migrations.

I left a couple questions but didn't see anything blocking.

Comment thread doc/code_search/reference/queries.md Outdated
// the repo's metadata. This is designed to work with the
// idx_repo_github_topics index.
//
// We use the unusual `jsonb_build_array` and `jsonb_build_object`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we ever use prepared statements to optimize common queries and prevent SQL injections?

@camdencheek camdencheek Mar 9, 2023

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two parts to this answer:

For query optimization, our Postgres driver automatically prepares and caches prepared statements. I haven't actually looked into how well that cache is utilized -- this would be worth looking into. For something like listing repos though, explicitly preparing all variations of the query would likely become untenable given the number of different options we support (all of which change the shape of the query, making the prepared statement unusable).

For SQL injections, sqlf.Sprintf() does actually protect against this. It is not the same as fmt.Sprintf(). It maintains the separation of query text and query variables which we then pass separately to Postgres, which handles the rest.

My comment here was just about not being able to construct a JSONB object safely with the string syntax. SELECT jsonb_build_object('Name', $1::text); is valid, SELECT '{"Name": "$1"}'::jsonb; is not. I didn't want to do something like q := "SELECT '{\"Name": \"" + topic +" \"}'::jsonb" in the application code because that's super vulnerable.

@@ -0,0 +1,4 @@
-- Undo the changes made in the up migration

DROP INDEX IF EXISTS idx_repo_github_topics;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of testing -- am I understanding right that CI automatically runs basic checks against migrations (like for idempotency, running them before starting unit tests, etc.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! We test that all migrations run, they are idempotent, and they can be re-applied after migrating down. All unit tests run against a fully migrated database. I don't know of any good ways to test "does this index apply to my query" other than doing a manual test on a realistic instance and ensuring the EXPLAIN output looks right -- would be interested in solutions here 🙂

Comment thread doc/code_search/reference/language.md
@camdencheek camdencheek merged commit 03462c8 into main Mar 9, 2023
@camdencheek camdencheek deleted the cc/repo-has-topic branch March 9, 2023 22:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants