Skip to content

Store uids' declarations instead of node locations#11782

Closed
voodoos wants to merge 20 commits intoocaml:trunkfrom
voodoos:store-uid-names-loc
Closed

Store uids' declarations instead of node locations#11782
voodoos wants to merge 20 commits intoocaml:trunkfrom
voodoos:store-uid-names-loc

Conversation

@voodoos
Copy link
Copy Markdown
Contributor

@voodoos voodoos commented Dec 2, 2022

The compiler registers UIDs' location during typing.
These are stored in a table named uid_to_loc_tbl in Env.

This is used by Merlin's locate feature to perform jump to definition:

  1. First locate reduces the shape of the identifier and gets its definition's UID.
  2. Then Merlin queries the uid-to-loc table (which is dumped in the cmt files) to get the location of the definition.

However, right now, these locations are not the one we could expect: the compiler registers the location of the definition's (or declaration's) node instead of the location of its name.

That means that when locating the definition of a type t Merlin will jump to |type t = int instead of type |t = int.

This is not a big issue for the locate command since the user does land near the expected location, but it will be much more problematic when performing occurrences queries based on shapes. If the user wants to rename an identifier after performing such a query we need to return the names/labels loc. and not the nodes'.

@Octachron do you think this bug fix can make it to 5.0 ? I am also willing to prepare a backport for an hypothetical 4.14.1 release.
cc @gasche @trefis

@voodoos voodoos changed the title Attribute locs of names not of whole nodes to uids Attribute locations of names to uids instead of whole nodes Dec 2, 2022
@gasche
Copy link
Copy Markdown
Member

gasche commented Dec 2, 2022

I would like to be convinced that this is the approach we want before moving fast in this direction. In theory our support for shapes in the compiler is general enough that tools can reuse it for various purposes. But now we are finding that the information we store is not informative enough for the second need of shapes, and you need to change it.

Is your suggestion (storing another location) the right approach? Will the third use of shapes need something different? Should we consider, for example, storing the parsetree of the definitions, instead of just its location? This would allow to recover the location of the binder for example. But do we run into storage-size issues?

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Dec 2, 2022

Thanks for the quick reply! Your concerns are valid ones.

The usages we identified so far are the following:

  1. jump to definition
  2. list occurrences across the whole project
  3. rename all occurrences of an identifier

In my opinion, correct behavior for 1 is to jump to the beginning of the name in the definition. In that case both 1 and 2 need the same info: the location of the name. 1 need only the starting position but 2 requires both the start and the end to highlight the occurrences.

Renaming is more subtle, with tricky cases to handle like in punning. But here we only consider the declarations/definitions of exposed identifiers, and I don't see a case where we would need more info than the location of the name. Do you have one in mind ?

And of course there is the possibility of other usages of this table. I went through the feature list of the LSP protocol for inspiration but I cannot think of on right now... Most complex transformations happen in the active buffer, of which Merlin has complete control, and not project-wide.

Storing a fragment of the parse tree would effectively be more flexible and maybe cover unforeseen use-cases. If you feel like we should do that (and if the added weight is reasonable), I can try it.

@Octachron
Copy link
Copy Markdown
Member

Octachron commented Dec 2, 2022

Concerning 5.0 and 4.14.2, it is getting too late in the release cycle for integrating design changes
that are not clearly bug fixes.
Note however that this doesn't close the door to an update in 4.14.2 (since the 4.14 branch is expected to keep having minor releases until OCaml 5 is stabilized).

@sidkshatriya
Copy link
Copy Markdown
Contributor

One of the big complaints for newbies in OCaml is the tooling. In recent years with dune, ocaml-lsp etc. the story has improved significantly.

However, we're still waiting for very important features like occurrences search across the whole project and not only in a single file. The open world assumption of OCaml makes these features especially useful. In Rust/Haskell we can simply ask what data structures satisfy an interface (because the user explicitly mentions that a data structure implements a trait/typeclass). With OCaml there needs to be a lot of searching across the codebase for "where something has come from" or "where else is this used".

I am just a user who is a bit tired of grepping across an OCaml database. Occurrences search supported by merlin/ocamllsp would be a very powerful feature and if some adjustment need to be made to shapes, I request it be considered.

I do understand it is very close to the release of 5.0 -- so I guess if it is not possible, then it is not possible.

TL;DR -- I hope that occurrences and similar features can arrive in our editors soon!

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Dec 5, 2022

@sidkshatriya, having project-wide occurrences requires an indexing step where we traverse the typetrees of every modules in the project. We can re-discover the declarations / definitions at that point and get the exact locations we need. It would be cleaner to do it as proposed in that PR but I do not think this is a blocking matter :-)

@gasche
Copy link
Copy Markdown
Member

gasche commented Dec 5, 2022

My gut feeling on this PR (but I would welcome other opinions):

  • the proposed change is pretty small and probably fine as a short-term fix to make Merlin's life easier
  • storing the full declaration is probably a better medium/long term approach, but it requires a bit more work, in particular we have to experiment to see whether it changes the build artifact sizes (I would hope that these typedtree fragments are already stored somewhere in the .cmt and will be shared).

If someone is going to do the work of experimenting, it is probably @voodoos.

@Octachron Octachron added this to the 5.1 milestone Jan 13, 2023
@lpw25 lpw25 self-assigned this Jan 13, 2023
@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Jan 17, 2023

In the last two commits I store fragments of the typedtree in the "uid to loc" tables. (this changes the cmt files format so I guess it requires a magic number update)
I checked the total CMT size on Irmin's codebase:

4.14.1: 247M
4.14.1+fragments: 276M

That's a 12% increase.

@voodoos voodoos force-pushed the store-uid-names-loc branch from ef47b07 to de7ebd4 Compare February 10, 2023 14:48
@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Feb 13, 2023

I pushed changes that appear to preserve sharing. On Irmin's codebase, the total cmt size is 369M with OCaml 5.0 and 370M with these changes. The main changes are:

  1. In the first version I felt compelled to use Types nodes for some uids and that was wrong. The new datatype for "fragments" only contains Typedtree nodes. I believe this was the main cause for the cmt size increase in the first iteration of this PR.

  2. As an attempt to preserve sharing (but this may not be needed after 1) I took advantage of the Tast_mapper that summaries environments before writing the cmt file to also gather the uids. This decouples uid gathering from the typing. Since the mapping was done anyway for env cleaning, it should not be more costly than the "interleaved" solution. This approach also paves the way to gather shapes of values in the same mapping as it would be useful for the "project-wide occurrences" tooling feature (option 3 in [occurrences] Store modules' shapes in env sumaries #11983).

str_items = List.map (sub.structure_item sub) str_items;
str_final_env = sub.env sub str_final_env;
str_type;
});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need the signature and structure operations here? From a distance it looks like they only map without doing anything else special, so it should not be necessary to write them explicitly. If they are subtly doing something different from the default case that is important, they would deserve a comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They actually do something: they update the environment in the closure of the mapper for the nested Env.find call to succeed. Since we are only interested in struct/sig level definitions / declarations we only need the "final" envs.

Another option would have been to map on the struct_item / sig_item cases to get more local environments. Not sure if it would be cleaner ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so you need an iterator with a view of the "current" typing environment. Maybe this can be made available globally instead of making it a specific change locally, but I'm not familiar with the typedtree-iterators stuff so I don't know offhand which API to suggest.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could provide a small wrapper over the default iterators ?
but maybe maybe it is a bit out-of-scope for this PR...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be much better to not use environments in this code. It looks like you are just using it to get the uids for definitions from the typedtree. Those should really already be stored in the typedtree as part of those definitions.

Part of the point of uids is to let people examine the binding structure of the tree without needing to mess around with environments.

@gasche
Copy link
Copy Markdown
Member

gasche commented Feb 13, 2023

I only looked at a part of the commits, but I like the new approach. I suggested the separate-mapper approach because I think that it gives overall nicer code. typemod.ml is fairly complex already and I would rather move as much concerns to other place as possible; the proposed change does just that for "construction of the Uid table".

@gasche
Copy link
Copy Markdown
Member

gasche commented Feb 13, 2023

Currently you map to summarize the environments and you also map at the same time to register the uids. Could the second pass be an iter only instead? (Wouldn't that make the code simpler?)

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Feb 13, 2023

Currently you map to summarize the environments and you also map at the same time to register the uids. Could the second pass be an iter only instead? (Wouldn't that make the code simpler?)

The gathering pass could be an iter and thus slightly simpler. (Note that it would need to be the first pass, not the second, since the complete environment is required.) Also having the same iterator perform the env-cleanup and the gathering does make the code a bit more convoluted. I chose to merge them to spare the cost of a second iteration, but I don't actually now what would be the correct performance / readability ratio here...

@gasche
Copy link
Copy Markdown
Member

gasche commented Feb 13, 2023

I would be surprised if you could measure a noticeable performance cost for an extra iter on the typedtree.

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Feb 14, 2023

I think it would be much better to not use environments in this code. It looks like you are just using it to get the uids for definitions from the typedtree. Those should really already be stored in the typedtree as part of those definitions.

I gave that idea a try in the last two commits. It is mostly straightforward to add the info to the Typedtree except for the value_binding case for which I am unsure of the correct way to proceed. The Uid are associated to pattern variables but it is not clear to me which Typedtree node should receive the additional info. (maybe Tpat_var and Tpat_alias)

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Feb 14, 2023

Yeah, I was thinking Tpat_var and Tpat_alias.

Copy link
Copy Markdown
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new commits seem to go in an interesting direction suggested by @lpw25, but there are aspects that I feel strange -- in particular, the calls to Env.find_* to find uids at declaration-checking time feel strange to me. We may need @lpw25 to review this instead of me.

I would still be curious to see the registration process defined as an iterator rather than a mapper, and possibly done after environment trimming -- this should now be possible -- to ensure that we never capture environments.

Can we articulate the design of the whole thing, before and after the PR?

Before PR:

  • the uid of each source declaration is stored in the Types declaration
  • those Types declaration are reachable from the environment
  • the uid are "registered" into a global table, bound to the declaration location, during type-checking (about at the time where they are created)

After the PR:

  • the uid of each source declaration is stored in the Types declaration
  • each Types declaration is included in the corresponding Typedtree declaration
  • this lets us register each uid, along with its Typedtree declaration, by iterating on the typedtree without tracking environments

Naive questions, mostly for @lpw25 I guess:

  1. Why is it better to include more things in the Typedtree data rather than getting it through the environment? (I'm assuming that for each typedtree declaration it is reasonably easy to get its "final environment" from the typedtree itself.)

  2. If we can recover the Types declaration from the Typedtree declaration (because we added the Types declaration as a new field, or because we look it up from the environment), why do we have so much duplication of fields on both sides? For example after the PR module_declaration is:

(* in types.mli *)
and module_declaration =
  {
    md_type: module_type;
    md_attributes: Parsetree.attributes;
    md_loc: Location.t;
    md_uid: Uid.t;
  }

(* in typedtree.mli *)
and module_declaration =
    {
     md_id: Ident.t option;
     md_name: string option loc;
     md_decl: Types.module_declaration option;
     md_presence: module_presence;
     md_type: module_type;
     md_attributes: attribute list;
     md_loc: Location.t;
    }

the fields loc and attributes from Typedtree seem redundant, why do we have them in the first place? (In fact it looks like Ident.t and module_presence could also be added in Types.module_declaration and thus possibly removed from Typedtree.module_declaration.)

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Feb 15, 2023

there are aspects that I feel strange -- in particular, the calls to Env.find_* to find uids at declaration-checking time feel strange to me.

Right, it was really a first instinctive draft, it should possible to always trickle down the declarations.

I would still be curious to see the registration process defined as an iterator rather than a mapper, and possibly done after environment trimming -- this should now be possible -- to ensure that we never capture environments.

Yes, I agree, it should simplify the code a lot.

About the before/after in your comment @gasche, the declarations are already in the typedtree in some cases, so I would rephrase the second item like this:

  • some of those Types declaration are directly included in the corresponding Typedtree declaration while others need to be reached from the environment

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Feb 15, 2023

Why is it better to include more things in the Typedtree data rather than getting it through the environment?

Managing environments correctly is unnecessary burden to place on consumers of cmt files. See, for example, this code in merlin which was getting the environments wrong and leading to jump-to-definition failing.

It seems particularly ridiculous to need to use the environment to go from a binding to a Uid.t. They are supposed to be unique identifiers for those bindings. Really there should be a Uid.t on every binding site and every usage site in the typedtree.

why do we have so much duplication of fields on both sides?

In general, I think that the duplication comes from using the same type for both the "type"s in the typed tree and for the types in the environment and signatures. We store additional information on the types in the environment, e.g. locations, so that we can give better error messages. (We also unnecessarily store attributes on them, rather than handling attributes properly by translating them as part of typing the definition.).

In the specific case you point out, I think it would be reasonable to just put the Uid.t in the typedtree directly rather than pointing to a whole Types.module_declaration. More generally I'm not sure what the right solution is. Perhaps splitting the additional information like locations out from e.g. Types.module_declaration and inlining it into the places where those appear: e.g. the Sig_module constructor and the module_data type in Env. It might be more effort than it is worth to avoid the duplication though.

@gasche
Copy link
Copy Markdown
Member

gasche commented Feb 24, 2023

I realize that I have more silly questions about this, apologies. The Uid.t stuff is out of my memory and I have trouble paging the explanations back in.

  • Isn't Ident.t already supposed to be a unique identifier for each name? Why do we have both Ident.t and Uid.t? Where is this explained/documented in the code? (Is the idea that "declarations" and "definitions" of the same Ident.t are different bindings and get different Uids?)
  • Obviously the typedtree stores information about the environment in which each node was interpreted and, for bindings, the "output environment" after the binding is declared. If the problem is that this information is structured in a way that is hard to use, could we provide iterators on the typedtree that pass the correct environment(s) when visiting each typedtree node? (A more general version of what @voodoos had been hacking in this PR.) Wouldn't there be other potential benefits than just tracking Uids?

@trefis
Copy link
Copy Markdown
Contributor

trefis commented Feb 24, 2023

I realize that I have more silly questions about this, apologies. The Uid.t stuff is out of my memory and I have trouble paging the explanations back in.

* Isn't `Ident.t` already supposed to be a unique identifier for each name? Why do we have both `Ident.t` and `Uid.t`? Where is this explained/documented in the code? (Is the idea that "declarations" and "definitions" of the same `Ident.t` are different bindings and get different Uids?)

You might want to refer back to this conversation. It might not completely / exactly answer your question, but it'll be a good start.

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Feb 24, 2023

Isn't Ident.t already supposed to be a unique identifier for each name?

A Uid.t is a unqiue identifier of a binding in the source code of the program. An Ident.t is a bound name. These are very different concepts. For example, it makes sense to substitute an Ident.t for another path or expression. As another example, consider a path Foo.Bar.baz, whilst Foo is an Ident.t it does not make sense to ask for the identifier of baz, but it does make sense to ask for the place in the source code where baz was declared -- and that has a corresponding Uid.t.

So Uid.t is the right abstraction to use when implementing jump-to-declaration, since it represents the destination of the jump. It is also the right abstraction to use for implementing warnings about unused bindings, since they answer the question: if I removed this binding from the source code would it still all compile.

Here we are talking about the typedtree: which is a representation of the source code of the program, and asking if it should include Uid.ts, which are the identifiers of positions in that source code. Phrased that way it should be clear that it obviously should.

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Feb 28, 2023

I made multiple changes in the latest commits, the main ones are:

  • f0dc803 replace whole Types declarations added to the Typedtree by this PR by uids only. This align with both having easy access to Uids as stressed by @lpw25 and concerns about duplicating info and cluttering the tree raised by @gasche. A further step in that direction would be to do the same with other Types part embedded in the Typedtree (probably not in the scope of that PR).
  • af63b79 I separated the env-cleaning mapper and the decl-gathering iterator as suggested by @gasche.

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Mar 1, 2023

The fact that uids changed in one test 52081b7 make me wonder about the correctness of the changes I made to add uids to pattern variables nodes in a540124.

I think this part need more careful review by some uid-expert 🙂

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Mar 7, 2023

After discussing it offline I realized that iterating after the fact as we do in that PR remove the uid_to_decl table from the environment which was useful for Merlin when performing locate queries in the current buffer.

Now Merlin will have to iterate on the typedtree to answer such queries for the current buffer, and I wonder what would be the performance impact of this additional iteration. It is not a big deal when generatign cmts, but it might be for the expected low-latency of Merlin queries ?

Also, maybe we could expose the iterator so that Merlin can reuse it directly ?

@voodoos voodoos force-pushed the store-uid-names-loc branch from 3a92cbd to 4f6334e Compare March 22, 2023 16:34
@voodoos voodoos changed the title Attribute locations of names to uids instead of whole nodes Store uids' declarations instead of node locations Mar 28, 2023
@trefis
Copy link
Copy Markdown
Contributor

trefis commented Apr 12, 2023

Now Merlin will have to iterate on the typedtree to answer such queries for the current buffer, and I wonder what would be the performance impact of this additional iteration. It is not a big deal when generatign cmts, but it might be for the expected low-latency of Merlin queries ?

Fair question. Perhaps merlin could always do this iteration after its typing phase, and cache the result?
The cost shouldn't be too noticeable compared to type-checking time (assuming you can make it incremental / restartable, just like type-checking currently is in merlin), and there'd be no extra latency on queries hitting the cache.

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Aug 28, 2023

Superseded by #12508

@voodoos voodoos closed this Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants