Make repos sortable by size#44360
Conversation
|
@mrnugget Question: Do you think making |
|
This works now: query Repositories {
repositories(orderBy: SIZE, descending: true) {
totalCount,
nodes {
name,
diskSizeBytes
}
}
} |
Bundle size report 📦
Look at the Statoscope report for a full comparison between the commits e504976 and cc7bd73 or learn more. Open explanation
|
Where in that query do you specify to sort by size? I only see |
There was a problem hiding this comment.
I don't think you'll need to expose JoinGitserverRepos if you change this check here to check for opt.OrderBy == RepoListSize. I think that'd be cleaner.
|
The results in your screenshot are not ordered by size. Easy to spot by looking at number of digits but also like this: |
|
@mrnugget Shoot, not sure how I messed it up, but missed Fixed it in the description and message above, and here is the correct screenshot: https://share.cleanshot.com/IL7G1q |
|
Also works with the UI: https://share.cleanshot.com/ZapzF1 with a two-liner change. We can basically ship this 😉 |
|
Hey, @mrnugget, I did an experiment for the UI, which turned out to be quicker and nicer than I'd expected. I think it's quite nice overall, could be an easy win if we go with this: https://www.loom.com/share/536b50a954bb49659cf696552f494193 Extra goodies:
Known caveats:
|
54c7b59 to
158f0b4
Compare
There was a problem hiding this comment.
Do we have an index on that column? If we sort by it, should we have one?
There was a problem hiding this comment.
I've just checked it, and no, we don't have an index on it yet. I haven't looked into the performance aspect yet, but great point! I've now put it on my list to add the index as part of this PR. :)
There was a problem hiding this comment.
can we get query plans for this from the dotcom database (dev-readonly account is in 1pass)? That table is very critically congested sadly
There was a problem hiding this comment.
This whole file is basically a smart copy of FilterControl.tsx. I generally don't favor copy&paste programming, but when there are two of a thing (rather than three, four, ...), it's always a tricky call to decide how much to generalize.
In this case, I went with duplication because:
- This is an old component that we may deprecate, so, honestly, I didn't want to invest a lot of time in it
- Ordering and filtration are two different concepts, not the copy of the same thing. So if retained, they may evolve in different ways. So generalizing them might actually be a step backward.
The same reasoning applies to a few other changes in this PR.
e43092d to
ddf19b0
Compare
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff cc7bd73...e504976.
|
eseliger
left a comment
There was a problem hiding this comment.
Nice work. Query plan please! :)
There was a problem hiding this comment.
can we get query plans for this from the dotcom database (dev-readonly account is in 1pass)? That table is very critically congested sadly
| @@ -0,0 +1,2 @@ | |||
| CREATE INDEX IF NOT EXISTS gitserver_repo_size_bytes | |||
There was a problem hiding this comment.
Indexes should usually be created concurrently
| @@ -0,0 +1,2 @@ | |||
| CREATE INDEX IF NOT EXISTS gitserver_repo_size_bytes | |||
| ON gitserver_repos USING btree (repo_size_bytes) | |||
There was a problem hiding this comment.
| ON gitserver_repos USING btree (repo_size_bytes) | |
| ON gitserver_repos USING btree (repo_size_bytes); |
| $orderBy: RepositoryOrderBy | ||
| $descending: Boolean |
There was a problem hiding this comment.
I know this already exists on main, but want to leave some thoughts on this pattern (maybe these thoughts are more for @philipp-spiess and @courier-new as they're working on standardized pagination etc):
I think this pattern encourages to set one value or the other, but not necessarily both, which feels undesirable.
Idea that would enforce that none or both values are set, and also makes ascending/descending more expressive (otherwise, what's the default here? descending: false => ascending?):
type Query {
repositories(order: RepositoryOrderInput): RepositoryConnection!
}
enum OrderDirection {
ASCENDING
DESCENDING
}
input RepositoryOrderInput {
orderBy: RepositoryOrderBy!
direction: OrderDirection!
}4d2a29c to
ef12fe2
Compare
|
@eseliger and I tested the performance of the new index. Huge difference, so thanks, @philipp-spiess for making sure we didn't miss this! We tested with two queryies. 🥁 The contestants: Just a "sort by size, desc"SELECT repo.id,repo.name,repo.private,repo.external_id,repo.external_service_type,repo.external_service_id,repo.uri,repo.description,repo.fork,repo.archived,repo.stars,repo.created_at,repo.updated_at,repo.deleted_at,repo.metadata,repo.blocked,(SELECT json_object_agg(key, value) FROM repo_kvps WHERE repo_kvps.repo_id = repo.id),
(
SELECT
json_agg(
json_build_object(
'CloneURL', esr.clone_url,
'ID', esr.external_service_id,
'Kind', LOWER(svcs.kind)
)
)
FROM external_service_repos AS esr
JOIN external_services AS svcs ON esr.external_service_id = svcs.id
WHERE
esr.repo_id = repo.id
AND
svcs.deleted_at IS NULL
)
FROM repo
JOIN gitserver_repos gr ON gr.repo_id = repo.id
WHERE
repo.deleted_at IS NULL AND repo.blocked IS NULL AND (TRUE) -- Populates "queryConds"
AND
(
(
-- Bypass authz
TRUE
)
) -- Populates "authzConds"
ORDER BY gr.repo_size_bytes DESC LIMIT 21 OFFSET 0Sorting by size, plus filtering for "%github%"SELECT repo.id,repo.name,repo.private,repo.external_id,repo.external_service_type,repo.external_service_id,repo.uri,repo.description,repo.fork,repo.archived,repo.stars,repo.created_at,repo.updated_at,repo.deleted_at,repo.metadata,repo.blocked,(SELECT json_object_agg(key, value) FROM repo_kvps WHERE repo_kvps.repo_id = repo.id),
(
SELECT
json_agg(
json_build_object(
'CloneURL', esr.clone_url,
'ID', esr.external_service_id,
'Kind', LOWER(svcs.kind)
)
)
FROM external_service_repos AS esr
JOIN external_services AS svcs ON esr.external_service_id = svcs.id
WHERE
esr.repo_id = repo.id
AND
svcs.deleted_at IS NULL
)
FROM repo
JOIN gitserver_repos gr ON gr.repo_id = repo.id
WHERE
repo.deleted_at IS NULL AND repo.blocked IS NULL AND (lower(name) LIKE "%github%") -- Populates "queryConds"
AND
(
(
-- Bypass authz
TRUE
)
) -- Populates "authzConds"
ORDER BY gr.repo_size_bytes DESC LIMIT 21 OFFSET 0👎 And the results, unindexed: First query, unindexed: 32829.716 msSecond query, unindexed: 34892.949 ms👍 Indexed now: First query, indexed: 3.288 msSecond query, indexed: 6.568 ms``` QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=1851.06..2301.96 rows=21 width=700) (actual time=5.623..6.331 rows=21 loops=1) -> Result (cost=1851.06..16086.60 rows=663 width=700) (actual time=5.622..6.325 rows=21 loops=1) -> Sort (cost=1851.06..1852.72 rows=663 width=636) (actual time=5.515..5.524 rows=21 loops=1) Sort Key: gr.repo_size_bytes DESC Sort Method: top-N heapsort Memory: 55kB -> Nested Loop (cost=1.11..1833.18 rows=663 width=636) (actual time=0.047..5.090 rows=406 loops=1) -> Index Scan using erik_test_remove6 on repo (cost=0.56..2.78 rows=660 width=628) (actual time=0.030..1.752 rows=406 loops=1) Index Cond: ((lower((name)::text) >= 'github.com/sourceg'::text) AND (lower((name)::text) < 'github.com/sourceh'::text)) Filter: (lower((name)::text) ~~ 'github.com/sourceg%'::text) -> Index Scan using gitserver_repos_pkey on gitserver_repos gr (cost=0.56..2.77 rows=1 width=12) (actual time=0.007..0.007 rows=1 loops=406) Index Cond: (repo_id = repo.id) SubPlan 1 -> Aggregate (cost=1.70..1.71 rows=1 width=32) (actual time=0.010..0.010 rows=1 loops=21) -> Index Only Scan using repo_kvps_pkey on repo_kvps (cost=0.43..1.67 rows=8 width=42) (actual time=0.009..0.009 rows=0 loops=21) Index Cond: (repo_id = repo.id) Heap Fetches: 0 SubPlan 2 -> Aggregate (cost=19.74..19.75 rows=1 width=32) (actual time=0.025..0.025 rows=1 loops=21) -> Nested Loop (cost=16.23..19.73 rows=1 width=58) (actual time=0.017..0.019 rows=1 loops=21) -> Index Scan using external_service_repos_repo_id_external_service_id_unique on external_service_repos esr (cost=0.43..2.65 rows=1 width=51) (actual time=0.009..0.010 rows=1 loops=21) Index Cond: (repo_id = repo.id) -> Bitmap Heap Scan on external_services svcs (cost=15.80..16.91 rows=1 width=15) (actual time=0.005..0.005 rows=1 loops=24) Recheck Cond: (id = esr.external_service_id) Filter: (deleted_at IS NULL) Heap Blocks: exact=24 -> Bitmap Index Scan on external_services_pkey (cost=0.00..15.80 rows=1 width=0) (actual time=0.004..0.004 rows=1 loops=24) Index Cond: (id = esr.external_service_id) Planning Time: 2.881 ms Execution Time: 6.568 ms (29 rows) ```Seems like a win :) 🍾 |
ef12fe2 to
6ac9bb2
Compare
Not sure if this is right
6ac9bb2 to
ebea946
Compare
|
|
||
| interface Props extends RouteComponentProps<{}>, TelemetryProps {} | ||
|
|
||
| const ORDERING_OPTIONS: OrderedConnectionOrderingOption[] = [ |
There was a problem hiding this comment.
Other than that it doesn't support multi-column sort (which might now be a requirement for this page) this LOOKS GOOD.
| return ( | ||
| <div className={styles.orderControl}> | ||
| {orderingOptions.map(orderingOption => { | ||
| if (orderingOption.type === 'radio') { |
There was a problem hiding this comment.
I think we can remove this branch and add back only if we really need it (not in advance to "be safe")
| label: string | ||
|
|
||
| /** "radio" or "select" */ | ||
| type: string |
There was a problem hiding this comment.
😱
| type: string | |
| type: "radio" | "select" |
but I think we can drop "radio"?
|
OMG this PR just got so much lighter. After going through all the hassle of adding a new way to order items, I realized that the filter part accepted arrays of items! When doing the change @philipp-spiess suggested, I went through a bunch of "find usages" searches, when this usage of the component caught my eye, where the filter is used for ordering. Turns out it was possible all along, and hardly any front-end changes were needed. 🤦 As much as I've got attached to my beautiful code, the front-end part of it is plain redundant (the original solution is pretty much pixel perfectly the same as my additional solution), so I've reverted the whole thing. If CI passes, I'm quite confident to merge this despite any Chromium issues because this is basically a backend-only change now. |
| map(searchParams => new URLSearchParams(searchParams)), | ||
| map(searchParams => getFilterFromURL(searchParams, this.props.filters)), | ||
| map((searchParams: URLSearchParams) => ({ | ||
| newFilterValues: getFilterFromURL(searchParams, this.props.filters), |
There was a problem hiding this comment.
why does this have to be an object now?
There was a problem hiding this comment.
It doesn't. Reverted that, too.
Summary: This is an extension of the
FilteredConnectioncomponent on the front end to allow for setting an order of the items, plus a tiny back-end change.This is how the new UI looks:
Goodies about the UI change:
FilteredConnectioncomponent.REPOSITORY_CREATED_AT) can be added in five minutes as an extra option if that adds value (plus it can be further extended with other fields quite easily)Test plan
gitserver_repos.repo_size_bytescolumn, so the sorting should be performant even in the case of 500k+ repos. (thanks to @philipp-spiess for bringing this up, and @eseliger for his help!)Also tested the API with this query (results):