Today, gitserver loosely associates a repo on disk with a repo in the database by turning the repo name into a directory on disk, and vice versa.
For example, for a repo called github.com/sourcegraph/sourcegraph, the path will be /data/repos/github.com/sourcegraph/sourcegraph, and when looking at the path /data/repos/github.com/sourcegraph/sourcegraph, the repo will be found by querying WHERE repo.name = 'github.com/sourcegraph/sourcegraph'.
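As a minimal sketch of that two-way mapping (the helper names `repoDirFromName` and `repoNameFromDir` are made up for illustration and are not gitserver's actual functions; the `/data/repos` root is taken from the example above):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// reposDir is the root directory used in the example above.
const reposDir = "/data/repos"

// repoDirFromName maps a repo name to its clone directory.
// Hypothetical helper, only meant to illustrate the name-based mapping.
func repoDirFromName(name string) string {
	return filepath.Join(reposDir, filepath.FromSlash(name))
}

// repoNameFromDir maps a clone directory back to a repo name.
// It returns false when dir is not located below reposDir.
func repoNameFromDir(dir string) (string, bool) {
	rel, err := filepath.Rel(reposDir, dir)
	if err != nil || rel == "." || rel == ".." ||
		strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", false
	}
	return filepath.ToSlash(rel), true
}

func main() {
	dir := repoDirFromName("github.com/sourcegraph/sourcegraph")
	fmt.Println(dir)

	name, ok := repoNameFromDir(dir)
	// name is what would go into WHERE repo.name = $1.
	fmt.Println(name, ok)
}
```

The round trip only works as long as the name is stable and unique, which is exactly the assumption questioned below.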
So far so simple, but name is not actually a unique identifier in the Sourcegraph repo table. It can also change over time, when a repo is renamed.
So what happens when a repo is renamed on the code host? TODO figure this out.
When a repo is deleted on the code host, its name in the database is turned into DELETED-<transaction_timestamp>-<name>, which makes the mapping harder. We can still find the path on disk by stripping the DELETED-<transaction_timestamp>- prefix, but we cannot go the other way, from the repo path on disk to the deleted version of the name, because we cannot know the transaction_timestamp.
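A rough sketch of why this mapping is one-way. The regular expression is an assumption about the timestamp format (digits, optionally with a fractional part), and `undeletedName` is a hypothetical helper, not an existing gitserver function:

```go
package main

import (
	"fmt"
	"regexp"
)

// deletedPrefix matches the DELETED-<transaction_timestamp>- prefix.
// The exact timestamp format is an assumption made for this sketch.
var deletedPrefix = regexp.MustCompile(`^DELETED-[0-9]+(\.[0-9]+)?-`)

// undeletedName strips the deleted prefix, if present, so the result can
// be used to derive the path on disk.
func undeletedName(name string) string {
	return deletedPrefix.ReplaceAllString(name, "")
}

func main() {
	// Forward direction: DB name -> name used for the path on disk.
	fmt.Println(undeletedName("DELETED-123456789-github.com/sourcegraph/sourcegraph"))

	// Names without the prefix pass through unchanged.
	fmt.Println(undeletedName("github.com/sourcegraph/sourcegraph"))

	// There is no inverse function: starting from the path alone, the
	// transaction timestamp is unknown, so the DELETED-... name cannot be
	// rebuilt without consulting the database.
}
```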
Last, when a repo called github.com/sourcegraph/sourcegraph is deleted, and a NEW repo (different external_id) with the same name is created on the code host, the DB looks like the following:
name: github.com/sourcegraph/sourcegraph external_id: 234
name: DELETED-123456789-github.com/sourcegraph/sourcegraph external_id: 123
With the logic above, which turns a repo name into a path by stripping the DELETED-<ts>- prefix, two repo records now point to the same directory.
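The collision can be reproduced with the two records above. This is only an illustration: `repoRecord`, `dirFor`, and `collisions` are hypothetical names, and the prefix pattern is the same assumed format (digits, optionally fractional) rather than a confirmed one:

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// repoRecord holds the two columns shown in the example above.
type repoRecord struct {
	Name       string
	ExternalID string
}

// deletedPrefix is an assumed pattern for DELETED-<transaction_timestamp>-.
var deletedPrefix = regexp.MustCompile(`^DELETED-[0-9]+(\.[0-9]+)?-`)

// dirFor applies the name-based mapping: strip the deleted prefix, then
// join the remaining name onto the repos root.
func dirFor(r repoRecord) string {
	return filepath.Join("/data/repos", deletedPrefix.ReplaceAllString(r.Name, ""))
}

// collisions returns every directory claimed by more than one record,
// together with the external IDs of the records that claim it.
func collisions(records []repoRecord) map[string][]string {
	byDir := map[string][]string{}
	for _, r := range records {
		d := dirFor(r)
		byDir[d] = append(byDir[d], r.ExternalID)
	}
	for d, ids := range byDir {
		if len(ids) < 2 {
			delete(byDir, d)
		}
	}
	return byDir
}

func main() {
	records := []repoRecord{
		{Name: "github.com/sourcegraph/sourcegraph", ExternalID: "234"},
		{Name: "DELETED-123456789-github.com/sourcegraph/sourcegraph", ExternalID: "123"},
	}
	for dir, ids := range collisions(records) {
		fmt.Println(dir, "is claimed by external_ids", ids)
	}
}
```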
We should find a stronger mapping that is not based on the name, for all the reasons above.
Ideas that we had before:
- Store a repo_dir field on the gitserver_repos entry, so we can reverse-lookup path->repo and can create another directory for the external_id: 234 repo from the duplicate case above
- Store the repo.id field in the clone directory
- Migrate the gitserver filesystem to a new structure that includes repo IDs: keep one large directory that contains only numerically named repo directories, like /data/repos/by-id/<id>, and have /data/repos/github.com/sourcegraph/sourcegraph symlink to the matching ID directory for simpler debugging
/cc @sourcegraph/source