Store uids' declarations instead of node locations by voodoos · Pull Request #11782 · ocaml/ocaml

voodoos · 2022-12-02T14:04:00Z

The compiler registers UIDs' location during typing.
These are stored in a table named uid_to_loc_tbl in Env.

This is used by Merlin's locate feature to perform jump to definition:

First locate reduces the shape of the identifier and gets its definition's UID.
Then Merlin queries the uid-to-loc table (which is dumped in the cmt files) to get the location of the definition.

However, right now, these locations are not the one we could expect: the compiler registers the location of the definition's (or declaration's) node instead of the location of its name.

That means that when locating the definition of a type t Merlin will jump to |type t = int instead of type |t = int.

This is not a big issue for the locate command since the user does land near the expected location, but it will be much more problematic when performing occurrences queries based on shapes. If the user wants to rename an identifier after performing such a query we need to return the names/labels loc. and not the nodes'.

@Octachron do you think this bug fix can make it to 5.0 ? I am also willing to prepare a backport for an hypothetical 4.14.1 release.
cc @gasche @trefis

gasche · 2022-12-02T14:27:25Z

I would like to be convinced that this is the approach we want before moving fast in this direction. In theory our support for shapes in the compiler is general enough that tools can reuse it for various purposes. But now we are finding that the information we store is not informative enough for the second need of shapes, and you need to change it.

Is your suggestion (storing another location) the right approach? Will the third use of shapes need something different? Should we consider, for example, storing the parsetree of the definitions, instead of just its location? This would allow to recover the location of the binder for example. But do we run into storage-size issues?

voodoos · 2022-12-02T16:27:10Z

Thanks for the quick reply! Your concerns are valid ones.

The usages we identified so far are the following:

jump to definition
list occurrences across the whole project
rename all occurrences of an identifier

In my opinion, correct behavior for 1 is to jump to the beginning of the name in the definition. In that case both 1 and 2 need the same info: the location of the name. 1 need only the starting position but 2 requires both the start and the end to highlight the occurrences.

Renaming is more subtle, with tricky cases to handle like in punning. But here we only consider the declarations/definitions of exposed identifiers, and I don't see a case where we would need more info than the location of the name. Do you have one in mind ?

And of course there is the possibility of other usages of this table. I went through the feature list of the LSP protocol for inspiration but I cannot think of on right now... Most complex transformations happen in the active buffer, of which Merlin has complete control, and not project-wide.

Storing a fragment of the parse tree would effectively be more flexible and maybe cover unforeseen use-cases. If you feel like we should do that (and if the added weight is reasonable), I can try it.

Octachron · 2022-12-02T17:11:25Z

Concerning 5.0 and 4.14.2, it is getting too late in the release cycle for integrating design changes
that are not clearly bug fixes.
Note however that this doesn't close the door to an update in 4.14.2 (since the 4.14 branch is expected to keep having minor releases until OCaml 5 is stabilized).

sidkshatriya · 2022-12-03T21:36:46Z

One of the big complaints for newbies in OCaml is the tooling. In recent years with dune, ocaml-lsp etc. the story has improved significantly.

However, we're still waiting for very important features like occurrences search across the whole project and not only in a single file. The open world assumption of OCaml makes these features especially useful. In Rust/Haskell we can simply ask what data structures satisfy an interface (because the user explicitly mentions that a data structure implements a trait/typeclass). With OCaml there needs to be a lot of searching across the codebase for "where something has come from" or "where else is this used".

I am just a user who is a bit tired of grepping across an OCaml database. Occurrences search supported by merlin/ocamllsp would be a very powerful feature and if some adjustment need to be made to shapes, I request it be considered.

I do understand it is very close to the release of 5.0 -- so I guess if it is not possible, then it is not possible.

TL;DR -- I hope that occurrences and similar features can arrive in our editors soon!

voodoos · 2022-12-05T11:55:22Z

@sidkshatriya, having project-wide occurrences requires an indexing step where we traverse the typetrees of every modules in the project. We can re-discover the declarations / definitions at that point and get the exact locations we need. It would be cleaner to do it as proposed in that PR but I do not think this is a blocking matter :-)

typing/typemod.ml

gasche · 2022-12-05T14:36:20Z

My gut feeling on this PR (but I would welcome other opinions):

the proposed change is pretty small and probably fine as a short-term fix to make Merlin's life easier
storing the full declaration is probably a better medium/long term approach, but it requires a bit more work, in particular we have to experiment to see whether it changes the build artifact sizes (I would hope that these typedtree fragments are already stored somewhere in the .cmt and will be shared).

If someone is going to do the work of experimenting, it is probably @voodoos.

voodoos · 2023-01-17T14:33:30Z

In the last two commits I store fragments of the typedtree in the "uid to loc" tables. (this changes the cmt files format so I guess it requires a magic number update)
I checked the total CMT size on Irmin's codebase:

4.14.1: 247M
4.14.1+fragments: 276M

That's a 12% increase.

voodoos · 2023-02-13T10:31:15Z

I pushed changes that appear to preserve sharing. On Irmin's codebase, the total cmt size is 369M with OCaml 5.0 and 370M with these changes. The main changes are:

In the first version I felt compelled to use Types nodes for some uids and that was wrong. The new datatype for "fragments" only contains Typedtree nodes. I believe this was the main cause for the cmt size increase in the first iteration of this PR.
As an attempt to preserve sharing (but this may not be needed after 1) I took advantage of the Tast_mapper that summaries environments before writing the cmt file to also gather the uids. This decouples uid gathering from the typing. Since the mapping was done anyway for env cleaning, it should not be more costly than the "interleaved" solution. This approach also paves the way to gather shapes of values in the same mapping as it would be useful for the "project-wide occurrences" tooling feature (option 3 in [occurrences] Store modules' shapes in env sumaries #11983).

gasche · 2023-02-13T11:55:56Z

file_formats/cmt_format.ml

+      str_items = List.map (sub.structure_item sub) str_items;
+      str_final_env = sub.env sub str_final_env;
+      str_type;
+    });


Why do you need the signature and structure operations here? From a distance it looks like they only map without doing anything else special, so it should not be necessary to write them explicitly. If they are subtly doing something different from the default case that is important, they would deserve a comment.

They actually do something: they update the environment in the closure of the mapper for the nested Env.find call to succeed. Since we are only interested in struct/sig level definitions / declarations we only need the "final" envs.

Another option would have been to map on the struct_item / sig_item cases to get more local environments. Not sure if it would be cleaner ?

Ah, so you need an iterator with a view of the "current" typing environment. Maybe this can be made available globally instead of making it a specific change locally, but I'm not familiar with the typedtree-iterators stuff so I don't know offhand which API to suggest.

I guess we could provide a small wrapper over the default iterators ?
but maybe maybe it is a bit out-of-scope for this PR...

I think it would be much better to not use environments in this code. It looks like you are just using it to get the uids for definitions from the typedtree. Those should really already be stored in the typedtree as part of those definitions.

Part of the point of uids is to let people examine the binding structure of the tree without needing to mess around with environments.

typing/typemod.ml

gasche · 2023-02-13T12:03:06Z

I only looked at a part of the commits, but I like the new approach. I suggested the separate-mapper approach because I think that it gives overall nicer code. typemod.ml is fairly complex already and I would rather move as much concerns to other place as possible; the proposed change does just that for "construction of the Uid table".

gasche · 2023-02-13T13:20:48Z

Currently you map to summarize the environments and you also map at the same time to register the uids. Could the second pass be an iter only instead? (Wouldn't that make the code simpler?)

voodoos · 2023-02-13T14:00:59Z

Currently you map to summarize the environments and you also map at the same time to register the uids. Could the second pass be an iter only instead? (Wouldn't that make the code simpler?)

The gathering pass could be an iter and thus slightly simpler. (Note that it would need to be the first pass, not the second, since the complete environment is required.) Also having the same iterator perform the env-cleanup and the gathering does make the code a bit more convoluted. I chose to merge them to spare the cost of a second iteration, but I don't actually now what would be the correct performance / readability ratio here...

gasche · 2023-02-13T14:15:06Z

I would be surprised if you could measure a noticeable performance cost for an extra iter on the typedtree.

voodoos · 2023-02-14T12:49:59Z

I think it would be much better to not use environments in this code. It looks like you are just using it to get the uids for definitions from the typedtree. Those should really already be stored in the typedtree as part of those definitions.

I gave that idea a try in the last two commits. It is mostly straightforward to add the info to the Typedtree except for the value_binding case for which I am unsure of the correct way to proceed. The Uid are associated to pattern variables but it is not clear to me which Typedtree node should receive the additional info. (maybe Tpat_var and Tpat_alias)

lpw25 · 2023-02-14T16:13:04Z

Yeah, I was thinking Tpat_var and Tpat_alias.

gasche

The new commits seem to go in an interesting direction suggested by @lpw25, but there are aspects that I feel strange -- in particular, the calls to Env.find_* to find uids at declaration-checking time feel strange to me. We may need @lpw25 to review this instead of me.

I would still be curious to see the registration process defined as an iterator rather than a mapper, and possibly done after environment trimming -- this should now be possible -- to ensure that we never capture environments.

Can we articulate the design of the whole thing, before and after the PR?

Before PR:

the uid of each source declaration is stored in the Types declaration
those Types declaration are reachable from the environment
the uid are "registered" into a global table, bound to the declaration location, during type-checking (about at the time where they are created)

After the PR:

the uid of each source declaration is stored in the Types declaration
each Types declaration is included in the corresponding Typedtree declaration
this lets us register each uid, along with its Typedtree declaration, by iterating on the typedtree without tracking environments

Naive questions, mostly for @lpw25 I guess:

Why is it better to include more things in the Typedtree data rather than getting it through the environment? (I'm assuming that for each typedtree declaration it is reasonably easy to get its "final environment" from the typedtree itself.)
If we can recover the Types declaration from the Typedtree declaration (because we added the Types declaration as a new field, or because we look it up from the environment), why do we have so much duplication of fields on both sides? For example after the PR module_declaration is:

(* in types.mli *)
and module_declaration =
  {
    md_type: module_type;
    md_attributes: Parsetree.attributes;
    md_loc: Location.t;
    md_uid: Uid.t;
  }

(* in typedtree.mli *)
and module_declaration =
    {
     md_id: Ident.t option;
     md_name: string option loc;
     md_decl: Types.module_declaration option;
     md_presence: module_presence;
     md_type: module_type;
     md_attributes: attribute list;
     md_loc: Location.t;
    }

the fields loc and attributes from Typedtree seem redundant, why do we have them in the first place? (In fact it looks like Ident.t and module_presence could also be added in Types.module_declaration and thus possibly removed from Typedtree.module_declaration.)

typing/typemod.ml

file_formats/cmt_format.ml

voodoos · 2023-02-15T09:26:53Z

there are aspects that I feel strange -- in particular, the calls to Env.find_* to find uids at declaration-checking time feel strange to me.

Right, it was really a first instinctive draft, it should possible to always trickle down the declarations.

I would still be curious to see the registration process defined as an iterator rather than a mapper, and possibly done after environment trimming -- this should now be possible -- to ensure that we never capture environments.

Yes, I agree, it should simplify the code a lot.

About the before/after in your comment @gasche, the declarations are already in the typedtree in some cases, so I would rephrase the second item like this:

some of those Types declaration are directly included in the corresponding Typedtree declaration while others need to be reached from the environment

lpw25 · 2023-02-15T14:32:56Z

Why is it better to include more things in the Typedtree data rather than getting it through the environment?

Managing environments correctly is unnecessary burden to place on consumers of cmt files. See, for example, this code in merlin which was getting the environments wrong and leading to jump-to-definition failing.

It seems particularly ridiculous to need to use the environment to go from a binding to a Uid.t. They are supposed to be unique identifiers for those bindings. Really there should be a Uid.t on every binding site and every usage site in the typedtree.

why do we have so much duplication of fields on both sides?

In general, I think that the duplication comes from using the same type for both the "type"s in the typed tree and for the types in the environment and signatures. We store additional information on the types in the environment, e.g. locations, so that we can give better error messages. (We also unnecessarily store attributes on them, rather than handling attributes properly by translating them as part of typing the definition.).

In the specific case you point out, I think it would be reasonable to just put the Uid.t in the typedtree directly rather than pointing to a whole Types.module_declaration. More generally I'm not sure what the right solution is. Perhaps splitting the additional information like locations out from e.g. Types.module_declaration and inlining it into the places where those appear: e.g. the Sig_module constructor and the module_data type in Env. It might be more effort than it is worth to avoid the duplication though.

gasche · 2023-02-24T09:46:35Z

I realize that I have more silly questions about this, apologies. The Uid.t stuff is out of my memory and I have trouble paging the explanations back in.

Isn't Ident.t already supposed to be a unique identifier for each name? Why do we have both Ident.t and Uid.t? Where is this explained/documented in the code? (Is the idea that "declarations" and "definitions" of the same Ident.t are different bindings and get different Uids?)
Obviously the typedtree stores information about the environment in which each node was interpreted and, for bindings, the "output environment" after the binding is declared. If the problem is that this information is structured in a way that is hard to use, could we provide iterators on the typedtree that pass the correct environment(s) when visiting each typedtree node? (A more general version of what @voodoos had been hacking in this PR.) Wouldn't there be other potential benefits than just tracking Uids?

trefis · 2023-02-24T10:20:15Z

I realize that I have more silly questions about this, apologies. The Uid.t stuff is out of my memory and I have trouble paging the explanations back in.
* Isn't `Ident.t` already supposed to be a unique identifier for each name? Why do we have both `Ident.t` and `Uid.t`? Where is this explained/documented in the code? (Is the idea that "declarations" and "definitions" of the same `Ident.t` are different bindings and get different Uids?)

You might want to refer back to this conversation. It might not completely / exactly answer your question, but it'll be a good start.

lpw25 · 2023-02-24T14:18:39Z

Isn't Ident.t already supposed to be a unique identifier for each name?

A Uid.t is a unqiue identifier of a binding in the source code of the program. An Ident.t is a bound name. These are very different concepts. For example, it makes sense to substitute an Ident.t for another path or expression. As another example, consider a path Foo.Bar.baz, whilst Foo is an Ident.t it does not make sense to ask for the identifier of baz, but it does make sense to ask for the place in the source code where baz was declared -- and that has a corresponding Uid.t.

So Uid.t is the right abstraction to use when implementing jump-to-declaration, since it represents the destination of the jump. It is also the right abstraction to use for implementing warnings about unused bindings, since they answer the question: if I removed this binding from the source code would it still all compile.

Here we are talking about the typedtree: which is a representation of the source code of the program, and asking if it should include Uid.ts, which are the identifiers of positions in that source code. Phrased that way it should be clear that it obviously should.

voodoos · 2023-02-28T17:46:31Z

I made multiple changes in the latest commits, the main ones are:

f0dc803 replace whole Types declarations added to the Typedtree by this PR by uids only. This align with both having easy access to Uids as stressed by @lpw25 and concerns about duplicating info and cluttering the tree raised by @gasche. A further step in that direction would be to do the same with other Types part embedded in the Typedtree (probably not in the scope of that PR).
af63b79 I separated the env-cleaning mapper and the decl-gathering iterator as suggested by @gasche.

voodoos · 2023-03-01T10:36:10Z

The fact that uids changed in one test 52081b7 make me wonder about the correctness of the changes I made to add uids to pattern variables nodes in a540124.

I think this part need more careful review by some uid-expert 🙂

voodoos · 2023-03-07T10:14:06Z

After discussing it offline I realized that iterating after the fact as we do in that PR remove the uid_to_decl table from the environment which was useful for Merlin when performing locate queries in the current buffer.

Now Merlin will have to iterate on the typedtree to answer such queries for the current buffer, and I wonder what would be the performance impact of this additional iteration. It is not a big deal when generatign cmts, but it might be for the expected low-latency of Merlin queries ?

Also, maybe we could expose the iterator so that Merlin can reuse it directly ?

…d.ml

trefis · 2023-04-12T11:58:28Z

Now Merlin will have to iterate on the typedtree to answer such queries for the current buffer, and I wonder what would be the performance impact of this additional iteration. It is not a big deal when generatign cmts, but it might be for the expected low-latency of Merlin queries ?

Fair question. Perhaps merlin could always do this iteration after its typing phase, and cache the result?
The cost shouldn't be too noticeable compared to type-checking time (assuming you can make it incremental / restartable, just like type-checking currently is in merlin), and there'd be no extra latency on queries hitting the cache.

voodoos · 2023-08-28T13:40:19Z

Superseded by #12508

voodoos changed the title ~~Attribute locs of names not of whole nodes to uids~~ Attribute locations of names to uids instead of whole nodes Dec 2, 2022

gasche reviewed Dec 5, 2022

View reviewed changes

typing/typemod.ml Outdated Show resolved Hide resolved

Octachron added this to the 5.1 milestone Jan 13, 2023

lpw25 self-assigned this Jan 13, 2023

voodoos force-pushed the store-uid-names-loc branch from ef47b07 to de7ebd4 Compare February 10, 2023 14:48

gasche reviewed Feb 13, 2023

View reviewed changes

typing/typemod.ml Outdated Show resolved Hide resolved

gasche reviewed Feb 14, 2023

View reviewed changes

typing/typemod.ml Outdated Show resolved Hide resolved

typing/typemod.ml Outdated Show resolved Hide resolved

typing/typemod.ml Outdated Show resolved Hide resolved

typing/typemod.ml Show resolved Hide resolved

file_formats/cmt_format.ml Outdated Show resolved Hide resolved

voodoos added 20 commits March 22, 2023 11:53

Fix dune build on macos

36e885d

Store typedtree nodes in uid_to_loc table

0455c04

Only keep the env summaries in the fragments

26fd2dd

Remove register_uid from typemode

32076c2

Use the tastmapper to gather uids

3cec4e2

Refactor

fec0dc1

Check for row_name to prevent duplicates and clean comments in typemo…

ef4c109

…d.ml

"Simplify" and comment the structure and signature

6efdd9a

Store Types.module_decl in typedtree module bindings

5a2f02e

Add types.decl to rypedtree mod_decl and modtyp_decl

97f551c

Store uid in Tpat_var and Tpat_alias

a540124

Remove the need for env in the mapper

35d9954

Store Uid in the Typedtree instead of whole types declarations

4cb609e

Make Uid not optional for mod_decl and mod_bind

419e3a2

Split the env mapper and the decl iterator

a6bd88c

Rename uid_fragment type

75ba9eb

Rename uid_to_loc -> uid_to_decl

31fb379

Bump magic numbers

544f4f7

Promote Uid change in test

52081b7

Update bootstrap after bumping magick number

4f6334e

voodoos force-pushed the store-uid-names-loc branch from 3a92cbd to 4f6334e Compare March 22, 2023 16:34

voodoos changed the title ~~Attribute locations of names to uids instead of whole nodes~~ Store uids' declarations instead of node locations Mar 28, 2023

voodoos mentioned this pull request Mar 28, 2023

[occurrences] Store index in cmt #12142

Closed

gasche modified the milestones: 5.1, 5.2 Apr 21, 2023

tonowak mentioned this pull request Aug 11, 2023

Manually applied changes from PR #11782 oxcaml/oxcaml#1732

Merged

voodoos mentioned this pull request Aug 28, 2023

[shapes] Add support for project-wide occurrences #12508

Merged

voodoos closed this Aug 28, 2023

Conversation

voodoos commented Dec 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gasche commented Dec 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

voodoos commented Dec 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Octachron commented Dec 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sidkshatriya commented Dec 3, 2022

Uh oh!

voodoos commented Dec 5, 2022

Uh oh!

Uh oh!

gasche commented Dec 5, 2022

Uh oh!

voodoos commented Jan 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

voodoos commented Feb 13, 2023

Uh oh!

gasche Feb 13, 2023

Choose a reason for hiding this comment

Uh oh!

voodoos Feb 13, 2023

Choose a reason for hiding this comment

Uh oh!

gasche Feb 13, 2023

Choose a reason for hiding this comment

Uh oh!

voodoos Feb 13, 2023

Choose a reason for hiding this comment

Uh oh!

lpw25 Feb 13, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gasche commented Feb 13, 2023

Uh oh!

gasche commented Feb 13, 2023

Uh oh!

voodoos commented Feb 13, 2023

Uh oh!

gasche commented Feb 13, 2023

Uh oh!

voodoos commented Feb 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lpw25 commented Feb 14, 2023

Uh oh!

gasche left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

voodoos commented Feb 15, 2023

Uh oh!

lpw25 commented Feb 15, 2023

Uh oh!

gasche commented Feb 24, 2023

Uh oh!

trefis commented Feb 24, 2023

Uh oh!

lpw25 commented Feb 24, 2023

Uh oh!

voodoos commented Feb 28, 2023

Uh oh!

voodoos commented Mar 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

voodoos commented Mar 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

voodoos commented Dec 2, 2022 •

edited

Loading

gasche commented Dec 2, 2022 •

edited

Loading

voodoos commented Dec 2, 2022 •

edited

Loading

Octachron commented Dec 2, 2022 •

edited

Loading

voodoos commented Jan 17, 2023 •

edited

Loading

voodoos commented Feb 14, 2023 •

edited

Loading

voodoos commented Mar 1, 2023 •

edited

Loading

voodoos commented Mar 7, 2023 •

edited

Loading