Add a new ``references`` builder by chrisjsewell · Pull Request #12190 · sphinx-doc/sphinx

chrisjsewell · 2024-03-23T18:48:02Z

This PR adds a new builder, references, which builds a single references.json. This file provides a mapping for almost* all targets available to reference in the project, including:

Internal domain objects, generated within the current project
External domain objects, loaded from the objects.inv configured via intersphinx_mapping (when using the sphinx.ext.intersphinx extension)

* I say almost because this assumes the objects returned from domain.get_objects account for the majority of referenceable items in a project; there are currently some notable exceptions, such as the math domain not returning any (but that is for another PR to fix)

This partially addresses #12152, to allow for a clear way for users to understand:

What targets are available for them to reference
How to reference these targets

Crucially, the references.json includes the mapping of object type to role names (this can be one-to-many),
since a role name is required for the reference syntax, not the object type.
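
To make the shape concrete, here is a rough sketch of how one slice of references.json might look once loaded. The key names (`items`, `roles`, `type`, `document`) are taken from the diff hunks quoted in the review comments; the sample entry, the exact nesting of each item, and the `role_names` helper are assumptions for illustration, not the builder's actual output.

```python
from __future__ import annotations

# Hypothetical sample, NOT real builder output. Assumed layout:
# domain -> object type -> {'roles': [...], 'items': {name: {...}}}
references = {
    'py': {
        'class': {
            'roles': ['class', 'exc', 'obj'],
            'items': {
                'mypkg.Widget': {'type': 'local', 'document': '/docs/api.rst'},
            },
        },
    },
}


def role_names(data: dict, domain: str, object_type: str) -> list[str]:
    """Return the role names usable for an object type (can be several)."""
    return data.get(domain, {}).get(object_type, {}).get('roles', [])


print(role_names(references, 'py', 'class'))
```

The point of carrying `roles` next to `items` is that a reader who only knows the object type (`class`) can still find a role that works in the reference syntax.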

I would also envisage that other tools (like a VS Code extension) could utilise this, to provide things like auto-completions and "jump to target/references"

Some considerations:

I feel a builder is really the only way to do this comprehensively; having a standalone CLI (like the current python -m sphinx.ext.intersphinx) can only get you so far, and then you will have to start re-implementing features of a normal sphinx build (like reading configuration, etc)
Perhaps in a follow-up PR I could introduce a complementary CLI that reads the references.json and allows users to quickly generate references. Something like sphinx-ref find 're.Match' returning :class:`~re.Match` (i.e. https://github.com/orgs/sphinx-doc/discussions/12152#discussioncomment-8862652)
There are cases where an object type has no matching role names, this PR is not addressing that (although I want to eventually)
As I mention in Make intersphinx (a.k.a. external references) more user friendly #12152, it would be ideal for this to include, not just the document path where a local target is defined, but also the line number (if available). But this is not within the scope of this PR
Creating a singular references.json is probably the simplest way to do this. But, it could get rather large, for a large project, or one with lots of intersphinx mappings.
Is this ok, or do we think another format would be better, like one JSON file per domain / object type, or even something like an sqlite database file?
The other thing not included in this PR is any addition to the documentation; I could do this here or in a follow-up PR
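
As a rough illustration of the lookup a `sphinx-ref find` style tool would perform, the sketch below searches an already loaded mapping for an exact target name and emits one candidate reference per role. The `find` function, the sample data, and the domain-qualified output form are hypothetical; the PR does not define such a CLI, and its example output uses the shorter :class:`~re.Match` form.

```python
from __future__ import annotations


def find(data: dict, name: str) -> list[str]:
    """Return candidate role references for an exact target name."""
    results = []
    for domain, object_types in data.items():
        for otype in object_types.values():
            if name not in otype.get('items', {}):
                continue
            # One object type can map to several roles; emit each of them.
            for role in otype.get('roles', []):
                results.append(f':{domain}:{role}:`{name}`')
    return results


# Hypothetical data in the assumed references.json layout.
sample = {'py': {'class': {'roles': ['class', 'obj'], 'items': {'re.Match': {}}}}}
print(find(sample, 're.Match'))
```

An object type with no matching role names produces no candidates here, which mirrors the limitation noted in the considerations above.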

chrisjsewell · 2024-03-23T18:50:03Z

(cc also @webknjaz, as I can't add you as a reviewer)

picnixz · 2024-03-23T18:53:07Z

(test failure is likely because of a side effect)

chrisjsewell · 2024-03-23T18:55:43Z

> (test failure is likely because of a side effect)

yeh hmm, works locally (when calling the singular test), but perhaps I can't "piggy-back" on the existing test-basic folder

chrisjsewell · 2024-03-23T18:57:34Z

anyway, whilst I fix that, interested to hear your thoughts

picnixz

A bit of comments (I'll be less available from now)

picnixz · 2024-03-23T18:54:28Z

sphinx/builders/references.py

For new files like that, could we have an explicit __all__ (empty by default if possible, since we don't really know what should be public or not)?

picnixz · 2024-03-23T18:56:02Z

sphinx/builders/references.py

+
+class LocalReference(TypedDict, total=False):
+    type: Literal['local']
+    document: str


In general, we use document for something else, I think. Would it be possible to use docname instead? (I haven't checked yet, but is it a full path or not?) If so, I'd suggest path instead of filepath, because document is generally the document node.

Fair, it's the full path as that's more helpful for users, so yeh, could change to filepath

picnixz · 2024-03-23T18:58:35Z

sphinx/builders/references.py

+from __future__ import annotations
+
+import json
+from os import path


Could you use os.path instead?

(Actually, it's essentially to reduce the possibility of having a variable shadowing the import)
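
The shadowing concern can be illustrated with a small sketch. The `output_file` function is hypothetical and only stands in for any function that takes a parameter named `path`.

```python
import os.path


def output_file(path: str) -> str:
    # The parameter is named ``path``. With ``from os import path`` this
    # name would shadow the module inside the function, and a call such as
    # ``path.join(...)`` would fail on the str. ``os.path`` stays usable.
    return os.path.join(path, 'references.json')


print(output_file('_build'))
```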

picnixz · 2024-03-23T19:01:08Z

sphinx/builders/references.py

+                    data[domainname][otype_name] = {'items': {}}
+                    if otype := domain.object_types.get(otype_name):
+                        data[domainname][otype_name]['roles'] = list(otype.roles)
+                data.setdefault(domainname, {}).setdefault(otype_name, {})['items'].setdefault(


Suggested change

-data.setdefault(domainname, {}).setdefault(otype_name, {})['items'].setdefault(
+data[domainname].setdefault(otype_name, {})['items'].setdefault(

Or better: use intermediate variables here (because it's a bit hard to parse).
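
A minimal sketch of the intermediate-variable form follows. The hunk truncates what is actually passed to the final `setdefault`, so `item` and the `add_item` wrapper are placeholders; this version is also slightly more defensive than the original, in that it creates the `items` dict if it is missing.

```python
from __future__ import annotations


def add_item(data: dict, domainname: str, otype_name: str, name: str, item: dict) -> None:
    """Store ``item`` under data[domain][object type]['items'][name]."""
    domain_data = data.setdefault(domainname, {})
    otype_data = domain_data.setdefault(otype_name, {'items': {}})
    items = otype_data.setdefault('items', {})
    # setdefault keeps an existing entry, so the first registration wins.
    items.setdefault(name, item)


data: dict = {}
add_item(data, 'py', 'class', 'mypkg.Widget', {'type': 'local'})
add_item(data, 'py', 'class', 'mypkg.Widget', {'type': 'remote'})
print(data)
```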

picnixz · 2024-03-23T19:02:13Z

sphinx/builders/references.py

+                        'url': url,
+                    }
+                    # only add dispname if it is set and not the same as name
+                    if not (dispname == name or not dispname or dispname == '-'):


same as above
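
Whatever "same as above" is meant to cover, the condition in this hunk can also be flattened with De Morgan's law. The sketch below is an illustrative equivalent, not something proposed in the review; both function names are hypothetical.

```python
def original_check(name: str, dispname: str) -> bool:
    """The condition exactly as written in the hunk."""
    return not (dispname == name or not dispname or dispname == '-')


def flattened_check(name: str, dispname: str) -> bool:
    """Same truth table: dispname is set, and differs from name and '-'."""
    return bool(dispname) and dispname not in (name, '-')


for candidate in ('', '-', 'mypkg.Widget', 'Widget'):
    assert original_check('mypkg.Widget', candidate) == flattened_check('mypkg.Widget', candidate)
```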

picnixz · 2024-03-23T19:02:19Z

sphinx/builders/references.py

+                            otype := local_domain.object_types.get(otype_name)
+                        ):
+                            data[domainname][otype_name]['roles'] = list(otype.roles)
+                    data.setdefault(domainname, {}).setdefault(otype_name, {})[


picnixz · 2024-03-23T19:03:15Z

tests/test_builders/test_build_references.py

+    # test the content of the reference file
+    content = (app.outdir / 'references.json').read_text('utf-8')
+    data = json.loads(content)
+    assert data == {


Ok, this one might need a factory for the sake of readability.
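
One way to read the factory suggestion is to build the expected structure from small helpers, so the assertion is not one large literal. The helper names and the sample entry below are hypothetical; only the key names come from the diff hunks quoted in this review.

```python
from __future__ import annotations


def local_item(document: str, **extra: str) -> dict:
    """Expected entry for a target defined in the current project."""
    return {'type': 'local', 'document': document, **extra}


def object_type(roles: list[str], items: dict[str, dict]) -> dict:
    """Expected entry for one object type: its roles plus its items."""
    return {'roles': roles, 'items': items}


# Hypothetical expected data; the real test would compare this to the
# parsed references.json.
expected = {
    'std': {
        'label': object_type(
            roles=['ref', 'keyword'],
            items={'genindex': local_item('genindex', dispname='Index')},
        ),
    },
}
print(expected)
```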

picnixz · 2024-03-23T19:03:22Z

tests/test_builders/test_build_references.py

+    # test the content of the reference file
+    content = (app.outdir / 'references.json').read_text('utf-8')
+    data = json.loads(content)
+    assert data == {


picnixz · 2024-03-23T19:06:20Z

> yeh hmm works locally (when calling the singular test), but I perhaps I can't "piggy-back" on the existing test-basic folder

If you are worried about that, use srcdir=os.urandom(16).hex() in the sphinx marker. It's a way to isolate your test so that you don't have weird surprises (well, you could still have surprises, but you would have to be VERY unlucky (or lucky, if you were an adversary targeting AES-128)).
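
For reference, the expression produces a 32-character hexadecimal name (128 random bits). The marker usage is shown only as a comment because it needs pytest and Sphinx's test fixtures; the builder name and testroot in it are taken from this PR's discussion, and the test function name is made up.

```python
import os

# 16 random bytes -> 32 hex characters, so two tests are extremely
# unlikely to end up sharing a source directory name.
unique_srcdir = os.urandom(16).hex()
print(unique_srcdir)

# Intended use (not executed here):
#
# @pytest.mark.sphinx('references', testroot='basic',
#                     srcdir=os.urandom(16).hex())
# def test_build_references(app): ...
```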

chrisjsewell · 2024-03-23T20:12:20Z

> A bit of comments (I'll be less available from now)

Thanks for the review @picnixz, but perhaps I could nudge you for some quick general feedback on the concept 😅
Do you agree that this is a "good" thing to introduce? Any thoughts on the references.json format?

bskinn · 2024-03-24T00:09:33Z

Read through the new references.py builder. I'm weak on some of the technical details and Sphinx internals, so I can't speak strongly there.

But, here are some other thoughts.

Reaction to the 'generate a complete local & remote references list' idea --- +0.25.

It might be helpful having all targets, local or intersphinx, in one artifact? But after thinking about it, I don't think it's very important to me, personally---and, it seems to me the bigger problem is the object-type lossiness of the current v2 objects.inv format. (Or, at least the way in which Sphinx currently builds to that format.)

I think I would rather have better/more accurate information about the targets in my intersphinx-referenced docsets---which would require a new inventory format, as best I figure---than a list of all local and remote references, where the info I get on the remote references in that all-in-one artifact requires as much work to transform into a working cross-reference as the info I can get out of sphobjinv does.

If I'm trying to reference something in another project, I know which project it is, and I don't mind pointing a single-docset tool at that project's docs. (And, there's a good chance I might prefer that single-docset tool if I don't have to mess with an intermediate data file as part of the process.) The 'all in one place' aspect of this may have a broad appeal, but it's less important to me, personally.

Reaction to the layout of `references.json` --- overall +0.5 or so, with thoughts/caveats.

For automated ingestion of reference data, this schema seems great. 👍

Coming from a sphobjinv-biased perspective, my primary use case is, "I have this thing X that I want to cross-reference; how do I do that?"

So, from a data mapping perspective, what I want to be able to do with the output of this is to walk from [object name] -> [object reference].

I like the sound of the sphinx-ref find ... tool you proposed, but what happens if it doesn't do the search I want?

The current semantics of references.json are exactly backward for manual REPL exploration: it'll take a beefy, nested list comprehension to search through it for target names.

That said, using the right tool -- jsonpath-ng, say -- probably would make that search relatively straightforward. (Though, it would be more complex if the JSON gets broken up into multiple files.)
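
Without reaching for an extra library, one option is to invert the mapping once so that later searches are flat. This is a sketch under the assumed domain -> object type -> items layout; `index_by_name` and the sample data are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def index_by_name(data: dict) -> dict[str, list[tuple[str, str]]]:
    """Invert domain -> object type -> items into name -> [(domain, type)]."""
    index: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for domain, object_types in data.items():
        for otype_name, otype in object_types.items():
            for name in otype.get('items', {}):
                index[name].append((domain, otype_name))
    return dict(index)


sample = {
    'py': {'class': {'items': {'re.Match': {}}}, 'module': {'items': {'re': {}}}},
    'std': {'label': {'items': {'re': {}}}},
}
by_name = index_by_name(sample)
# Substring search is now a single flat comprehension.
print(sorted(name for name in by_name if 'Mat' in name))
```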

Choice of `references.<ext>` Format

If there's eventually a sphinx-ref find, I don't think the format of the output matters too much. As long as it's a standard, open format, anybody who wants to can interface with it. Format thoughts:

JSON would probably be the simplest format
- Likely the easiest for manual exploration
- Though the filesize question is real for large docsets, especially given that references.json would include all transitive references to intersphinx projects
  - All targets from the entire Python docs included in every references.json built...
SQLite does seem like a good option: it gives a more compact file, and sqlite3 is in the stdlib
- Manual exploration would be considerably more cumbersome, though
- The schema would take some figuring out -- performance isn't a huge issue
  - One giant table, with domain and object_type columns?
  - One table per domain, with object_type columns? (Probably best?)
  - One table per domain/object_type combo? (Probably way too many tables)
Maybe tinydb? SQLite-like, but document database
- Not in the stdlib, so it'd be a dependency both for Sphinx and for anyone trying to read it independently
- But it fits the hierarchical data shape better, and it would be easier for manual exploration
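
To give a feel for the "one giant table" SQLite variant, here is a small in-memory sketch using the stdlib `sqlite3` module. The table name, column names, and sample rows are all assumptions; nothing in the PR fixes a schema.

```python
import sqlite3

# Hypothetical schema: a single table with domain and object_type columns.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE refs ('
    ' domain TEXT NOT NULL, object_type TEXT NOT NULL,'
    ' name TEXT NOT NULL, location TEXT NOT NULL)'
)
# An index on name supports the "I have X, how do I reference it?" lookup.
conn.execute('CREATE INDEX refs_name ON refs (name)')

rows = [
    ('py', 'class', 're.Match', 'https://example.invalid/re.html#re.Match'),
    ('py', 'module', 're', 'https://example.invalid/re.html#module-re'),
]
conn.executemany('INSERT INTO refs VALUES (?, ?, ?, ?)', rows)

found = conn.execute(
    'SELECT domain, object_type FROM refs WHERE name = ?', ('re.Match',)
).fetchall()
print(found)
```

With this layout the name lookup is a single indexed query, at the cost of repeating the domain and object type on every row.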

picnixz · 2024-03-24T00:46:12Z

I'll comment tomorrow (for this one, I need a bit of sleep)

jakobandersen · 2024-03-24T11:30:46Z

A core problem is the use of domain.get_objects(). As alluded to in https://github.com/orgs/sphinx-doc/discussions/12152#discussioncomment-8877586 there is an inherent problem in intersphinx in that it assumes it knows how to write and read declared entities from each domain. The reading was mostly delegated to the domains, but the writing has not been yet.
Essentially I think we should figure out this delegation, including a new inventory/references format, before building more on top of the old problematic formats.

Currently get_objects() is used for only two purposes, as far as I can see: creating the index and creating inventories. The former is fine, as the fullname and dispname are only used for display purposes.
For inventories the fullname needs to encode all information about the entity in a string, so it can be loaded in again. This is not convenient for languages like C++ where the scoping information can be rather complex.
If I'm not mistaken, then this references builder is very similar to the inventory generation in its use of get_objects().

picnixz · 2024-03-24T15:04:25Z

Since we are talking about a new Intersphinx format, I would like you to also think about how to serialize the entries in the inventories, especially concerning #11932. After reading Jakob's argument, I also think that domains should be responsible for serializing their intersphinx part however they see fit. It could also solve multiple issues that I did not necessarily find when implementing #11932; it would be better if each domain knew how to properly represent its own references in intersphinx.

In addition, we could change the format of a specific domain (e.g., if there are bugs) without affecting the format of other domains. I suggest using the same approach as for ELF where there is a header section containing the location of each program section. Then each domain would serialize its own intersphinx inventory and intersphinx would only be responsible for merging the parts together. Then, each domain would deserialize its dedicated section and recover its references mapping.

The references builder you are suggesting would be responsible for normalizing each domain's output into a more human-readable format. In the JSON output, you would include "human-readable" entries plus an offset and buffer size pointing to the serialized data in the objects.inv binary file. That way, you can use it to recover a single referenceable entity, and use it in a standalone manner as well.
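
The header-plus-sections idea can be sketched in a few lines. Everything here is hypothetical (the length-prefixed JSON header, the function names, the sample payloads); it only illustrates how a reader could pull out one domain's bytes by offset and size without parsing the other sections.

```python
from __future__ import annotations

import json
import struct


def pack(sections: dict[str, bytes]) -> bytes:
    """Header (length-prefixed JSON table of offsets/sizes) + raw sections."""
    table, body, offset = {}, b'', 0
    for domain, blob in sections.items():
        table[domain] = {'offset': offset, 'size': len(blob)}
        body += blob
        offset += len(blob)
    header = json.dumps(table).encode('utf-8')
    # 4-byte big-endian header length, then the header, then the sections.
    return struct.pack('>I', len(header)) + header + body


def read_section(blob: bytes, domain: str) -> bytes:
    """Return only the bytes that belong to one domain."""
    (header_len,) = struct.unpack_from('>I', blob, 0)
    table = json.loads(blob[4:4 + header_len])
    entry = table[domain]
    start = 4 + header_len + entry['offset']
    return blob[start:start + entry['size']]


packed = pack({'py': b'py-specific bytes', 'cpp': b'cpp-specific bytes'})
print(read_section(packed, 'cpp'))
```

Each domain would then be free to change how its own section is encoded without affecting the others, which is the property argued for above.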

chrisjsewell · 2024-03-25T15:02:16Z

> Essentially I think we should figure out this delegation, including a new inventory/references format, before building more on top of the old problematic formats.

> Since we are talking about a new Intersphinx format

See https://github.com/orgs/sphinx-doc/discussions/12204

✨ Add references builder

cbb6189

chrisjsewell requested review from danieleades, jakobandersen, jayaddison, jdillard and picnixz March 23, 2024 18:49

Add test-build-refs-basic

9283732

picnixz reviewed Mar 23, 2024

This was referenced Mar 28, 2024

Add hyper role, to render hyperlinks with style tech-writing/sphinx-design-elements#71

Merged

Sphinx: Add inventory decoder for Sphinx pyveci/pueblo#73

Closed

AA-Turner changed the title ~~✨ Add references builder~~ Add a new references builder Apr 9, 2024

AA-Turner added the builder label Apr 9, 2024

AA-Turner force-pushed the master branch from 4447021 to d95dafa Compare February 16, 2025 02:56

amotl mentioned this pull request Feb 26, 2025

Intersphinx Registry tech-writing/linksmith#17

Open


5 participants
