This repository was archived by the owner on Sep 30, 2024. It is now read-only.

gitserver: Improve mapping of paths to repository #56712

@eseliger


Today, gitserver loosely associates a repo on disk with a repo in the database by turning the repo name into a directory path on disk, and vice versa.

For example, for a repo called github.com/sourcegraph/sourcegraph, the path will be /data/repos/github.com/sourcegraph/sourcegraph, and when starting from the path /data/repos/github.com/sourcegraph/sourcegraph, the repo will be found with WHERE repo.name = 'github.com/sourcegraph/sourcegraph'.
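To make the current mapping concrete, here is a minimal Go sketch of the two directions. The helper names `repoDirFromName` and `repoNameFromDir` are made up for illustration and are not gitserver's actual functions; the sketch only assumes the layout described above, where the clone directory is the repos root joined with the repo name.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// reposDir is the root directory used in the example above.
const reposDir = "/data/repos"

// repoDirFromName maps a repo name to its clone directory.
// Hypothetical helper: the real gitserver code may differ in details.
func repoDirFromName(root, name string) string {
	return filepath.Join(root, filepath.FromSlash(name))
}

// repoNameFromDir is the reverse mapping: it derives the repo name
// from a clone directory by making the path relative to the root.
func repoNameFromDir(root, dir string) (string, error) {
	rel, err := filepath.Rel(root, dir)
	if err != nil {
		return "", err
	}
	// Reject paths that escape the repos root.
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("%q is not inside %q", dir, root)
	}
	return filepath.ToSlash(rel), nil
}

func main() {
	dir := repoDirFromName(reposDir, "github.com/sourcegraph/sourcegraph")
	fmt.Println(dir)

	name, err := repoNameFromDir(reposDir, dir)
	if err != nil {
		panic(err)
	}
	// The name is the only key available for the database lookup.
	fmt.Printf("WHERE repo.name = '%s'\n", name)
}
```

Both directions depend only on the name, which is why everything below about renames and deletions becomes a problem.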

So far so simple, but name is not actually a unique identifier in the Sourcegraph repo table. It can also change over time (when a repo is renamed).
So what happens when a repo is renamed on the code host? TODO figure this out.
When a repo is deleted on the code host, its name in the database is changed to DELETED-<transaction_timestamp>-<name>, which makes the mapping harder. We can still find the path on disk by stripping the DELETED-<transaction_timestamp>- prefix, but we cannot go the other way, from the repo path on disk to the deleted version of the name, because we cannot know the transaction_timestamp.
Last, when a repo called github.com/sourcegraph/sourcegraph is deleted, and a NEW repo (different external_id) with the same name is created on the code host, the DB looks like the following:

  • name: github.com/sourcegraph/sourcegraph, external_id: 234
  • name: DELETED-123456789-github.com/sourcegraph/sourcegraph, external_id: 123

Because the logic above turns a repo name into a path by stripping the DELETED-<ts>- prefix, two repo records now point to the same directory.
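The collision can be reproduced with a small Go sketch. Everything here is illustrative: the `repo` struct, the `undeletedName` and `repoDir` helpers, and the regular expression are assumptions (in particular, the sketch assumes the timestamp consists only of digits and dots), not the actual gitserver code.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// deletedPrefix matches the DELETED-<transaction_timestamp>- prefix.
// Assumption: the timestamp consists of digits and dots only.
var deletedPrefix = regexp.MustCompile(`^DELETED-[0-9.]+-`)

// repo is a simplified stand-in for a row in the repo table.
type repo struct {
	Name       string
	ExternalID string
}

// undeletedName strips the soft-delete prefix, if present.
func undeletedName(name string) string {
	return deletedPrefix.ReplaceAllString(name, "")
}

// repoDir derives the clone directory from the (undeleted) name.
func repoDir(root string, r repo) string {
	return filepath.Join(root, filepath.FromSlash(undeletedName(r.Name)))
}

func main() {
	repos := []repo{
		{Name: "github.com/sourcegraph/sourcegraph", ExternalID: "234"},
		{Name: "DELETED-123456789-github.com/sourcegraph/sourcegraph", ExternalID: "123"},
	}

	// Group records by the directory they resolve to.
	byDir := map[string][]repo{}
	for _, r := range repos {
		dir := repoDir("/data/repos", r)
		byDir[dir] = append(byDir[dir], r)
	}

	for dir, rs := range byDir {
		if len(rs) > 1 {
			fmt.Printf("%s is claimed by %d repo records\n", dir, len(rs))
		}
	}
}
```

Note that the reverse direction is not even expressible here: nothing on disk records which of the two records the directory belongs to.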

We should find a stronger mapping that is not based on the name, for all the reasons above.
Ideas that we had before:

  • Store a repo_dir field on the gitserver_repos entry, so we can reverse-lookup path->repo and can create another directory for the external_id: 234 repo from the duplicate case above
  • Store the repo.id field in the clone directory
  • Migrate the gitserver filesystem to a new structure that includes repo IDs: keep one large directory that contains only ID-named repo directories, like /data/repos/by-id/<id>, and have /data/repos/github.com/sourcegraph/sourcegraph symlink to it, for simpler debugging
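As a rough illustration of the last idea, the following Go sketch creates an ID-based directory and a name-based symlink pointing at it. The helper names `byIDDir` and `linkNameToID` are hypothetical, the repo ID 234 is just a placeholder, and the sketch ignores concerns a real migration would need to handle (existing clones, renames, concurrent access).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// byIDDir returns the ID-based clone directory, e.g. <root>/by-id/<id>.
func byIDDir(root string, id int) string {
	return filepath.Join(root, "by-id", strconv.Itoa(id))
}

// linkNameToID creates the ID-based directory and a symlink at
// <root>/<name> that points to it, so the name-based path keeps
// working for debugging. Hypothetical helper for illustration only.
func linkNameToID(root, name string, id int) error {
	target := byIDDir(root, id)
	if err := os.MkdirAll(target, 0o755); err != nil {
		return err
	}
	link := filepath.Join(root, filepath.FromSlash(name))
	if err := os.MkdirAll(filepath.Dir(link), 0o755); err != nil {
		return err
	}
	// If a symlink already exists, accept it only if it points at target.
	if existing, err := os.Readlink(link); err == nil {
		if existing == target {
			return nil
		}
		return fmt.Errorf("%s already links to %s", link, existing)
	}
	return os.Symlink(target, link)
}

func main() {
	// Use a temporary root so the sketch does not touch /data/repos.
	root, err := os.MkdirTemp("", "repos")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	if err := linkNameToID(root, "github.com/sourcegraph/sourcegraph", 234); err != nil {
		panic(err)
	}
	target, err := os.Readlink(filepath.Join(root, "github.com", "sourcegraph", "sourcegraph"))
	if err != nil {
		panic(err)
	}
	fmt.Println("symlink resolves to the by-id directory:", target == byIDDir(root, 234))
}
```

With such a layout, the path->repo lookup could read the ID from the directory name instead of relying on repo.name, and a deleted repo and its same-named successor would get distinct directories.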

/cc @sourcegraph/source

Metadata

Labels

team/source: Tickets under the purview of Source - the one Source to graph it all
