Store uids' declarations instead of node locations#11782
Store uids' declarations instead of node locations#11782voodoos wants to merge 20 commits intoocaml:trunkfrom
Conversation
|
I would like to be convinced that this is the approach we want before moving fast in this direction. In theory our support for shapes in the compiler is general enough that tools can reuse it for various purposes. But now we are finding that the information we store is not informative enough for the second need of shapes, and you need to change it. Is your suggestion (storing another location) the right approach? Will the third use of shapes need something different? Should we consider, for example, storing the parsetree of the definitions, instead of just its location? This would allow to recover the location of the binder for example. But do we run into storage-size issues? |
|
Thanks for the quick reply! Your concerns are valid ones. The usages we identified so far are the following:
In my opinion, correct behavior for 1 is to jump to the beginning of the name in the definition. In that case both 1 and 2 need the same info: the location of the name. 1 need only the starting position but 2 requires both the start and the end to highlight the occurrences. Renaming is more subtle, with tricky cases to handle like in punning. But here we only consider the declarations/definitions of exposed identifiers, and I don't see a case where we would need more info than the location of the name. Do you have one in mind ? And of course there is the possibility of other usages of this table. I went through the feature list of the LSP protocol for inspiration but I cannot think of on right now... Most complex transformations happen in the active buffer, of which Merlin has complete control, and not project-wide. Storing a fragment of the parse tree would effectively be more flexible and maybe cover unforeseen use-cases. If you feel like we should do that (and if the added weight is reasonable), I can try it. |
|
Concerning 5.0 and 4.14.2, it is getting too late in the release cycle for integrating design changes |
|
One of the big complaints for newbies in OCaml is the tooling. In recent years with dune, ocaml-lsp etc. the story has improved significantly. However, we're still waiting for very important features like occurrences search across the whole project and not only in a single file. The open world assumption of OCaml makes these features especially useful. In Rust/Haskell we can simply ask what data structures satisfy an interface (because the user explicitly mentions that a data structure implements a trait/typeclass). With OCaml there needs to be a lot of searching across the codebase for "where something has come from" or "where else is this used". I am just a user who is a bit tired of grepping across an OCaml database. Occurrences search supported by merlin/ocamllsp would be a very powerful feature and if some adjustment need to be made to shapes, I request it be considered. I do understand it is very close to the release of 5.0 -- so I guess if it is not possible, then it is not possible. TL;DR -- I hope that occurrences and similar features can arrive in our editors soon! |
|
@sidkshatriya, having project-wide occurrences requires an indexing step where we traverse the typetrees of every modules in the project. We can re-discover the declarations / definitions at that point and get the exact locations we need. It would be cleaner to do it as proposed in that PR but I do not think this is a blocking matter :-) |
|
My gut feeling on this PR (but I would welcome other opinions):
If someone is going to do the work of experimenting, it is probably @voodoos. |
|
In the last two commits I store fragments of the typedtree in the "uid to loc" tables. (this changes the cmt files format so I guess it requires a magic number update) That's a 12% increase. |
ef47b07 to
de7ebd4
Compare
|
I pushed changes that appear to preserve sharing. On Irmin's codebase, the total cmt size is
|
file_formats/cmt_format.ml
Outdated
| str_items = List.map (sub.structure_item sub) str_items; | ||
| str_final_env = sub.env sub str_final_env; | ||
| str_type; | ||
| }); |
There was a problem hiding this comment.
Why do you need the signature and structure operations here? From a distance it looks like they only map without doing anything else special, so it should not be necessary to write them explicitly. If they are subtly doing something different from the default case that is important, they would deserve a comment.
There was a problem hiding this comment.
They actually do something: they update the environment in the closure of the mapper for the nested Env.find call to succeed. Since we are only interested in struct/sig level definitions / declarations we only need the "final" envs.
Another option would have been to map on the struct_item / sig_item cases to get more local environments. Not sure if it would be cleaner ?
There was a problem hiding this comment.
Ah, so you need an iterator with a view of the "current" typing environment. Maybe this can be made available globally instead of making it a specific change locally, but I'm not familiar with the typedtree-iterators stuff so I don't know offhand which API to suggest.
There was a problem hiding this comment.
I guess we could provide a small wrapper over the default iterators ?
but maybe maybe it is a bit out-of-scope for this PR...
There was a problem hiding this comment.
I think it would be much better to not use environments in this code. It looks like you are just using it to get the uids for definitions from the typedtree. Those should really already be stored in the typedtree as part of those definitions.
Part of the point of uids is to let people examine the binding structure of the tree without needing to mess around with environments.
|
I only looked at a part of the commits, but I like the new approach. I suggested the separate-mapper approach because I think that it gives overall nicer code. typemod.ml is fairly complex already and I would rather move as much concerns to other place as possible; the proposed change does just that for "construction of the Uid table". |
|
Currently you |
The gathering pass could be an |
|
I would be surprised if you could measure a noticeable performance cost for an extra |
I gave that idea a try in the last two commits. It is mostly straightforward to add the info to the |
|
Yeah, I was thinking |
gasche
left a comment
There was a problem hiding this comment.
The new commits seem to go in an interesting direction suggested by @lpw25, but there are aspects that I feel strange -- in particular, the calls to Env.find_* to find uids at declaration-checking time feel strange to me. We may need @lpw25 to review this instead of me.
I would still be curious to see the registration process defined as an iterator rather than a mapper, and possibly done after environment trimming -- this should now be possible -- to ensure that we never capture environments.
Can we articulate the design of the whole thing, before and after the PR?
Before PR:
- the uid of each source declaration is stored in the Types declaration
- those Types declaration are reachable from the environment
- the uid are "registered" into a global table, bound to the declaration location, during type-checking (about at the time where they are created)
After the PR:
- the uid of each source declaration is stored in the Types declaration
- each Types declaration is included in the corresponding Typedtree declaration
- this lets us register each uid, along with its Typedtree declaration, by iterating on the typedtree without tracking environments
Naive questions, mostly for @lpw25 I guess:
-
Why is it better to include more things in the Typedtree data rather than getting it through the environment? (I'm assuming that for each typedtree declaration it is reasonably easy to get its "final environment" from the typedtree itself.)
-
If we can recover the Types declaration from the Typedtree declaration (because we added the Types declaration as a new field, or because we look it up from the environment), why do we have so much duplication of fields on both sides? For example after the PR
module_declarationis:
(* in types.mli *)
and module_declaration =
{
md_type: module_type;
md_attributes: Parsetree.attributes;
md_loc: Location.t;
md_uid: Uid.t;
}
(* in typedtree.mli *)
and module_declaration =
{
md_id: Ident.t option;
md_name: string option loc;
md_decl: Types.module_declaration option;
md_presence: module_presence;
md_type: module_type;
md_attributes: attribute list;
md_loc: Location.t;
}the fields loc and attributes from Typedtree seem redundant, why do we have them in the first place? (In fact it looks like Ident.t and module_presence could also be added in Types.module_declaration and thus possibly removed from Typedtree.module_declaration.)
Right, it was really a first instinctive draft, it should possible to always trickle down the declarations.
Yes, I agree, it should simplify the code a lot. About the before/after in your comment @gasche, the declarations are already in the typedtree in some cases, so I would rephrase the second item like this:
|
Managing environments correctly is unnecessary burden to place on consumers of cmt files. See, for example, this code in merlin which was getting the environments wrong and leading to jump-to-definition failing. It seems particularly ridiculous to need to use the environment to go from a binding to a
In general, I think that the duplication comes from using the same type for both the "type"s in the typed tree and for the types in the environment and signatures. We store additional information on the types in the environment, e.g. locations, so that we can give better error messages. (We also unnecessarily store attributes on them, rather than handling attributes properly by translating them as part of typing the definition.). In the specific case you point out, I think it would be reasonable to just put the |
|
I realize that I have more silly questions about this, apologies. The Uid.t stuff is out of my memory and I have trouble paging the explanations back in.
|
You might want to refer back to this conversation. It might not completely / exactly answer your question, but it'll be a good start. |
A So Here we are talking about the typedtree: which is a representation of the source code of the program, and asking if it should include |
|
I made multiple changes in the latest commits, the main ones are:
|
|
After discussing it offline I realized that iterating after the fact as we do in that PR remove the Now Merlin will have to iterate on the typedtree to answer such queries for the current buffer, and I wonder what would be the performance impact of this additional iteration. It is not a big deal when generatign cmts, but it might be for the expected low-latency of Merlin queries ? Also, maybe we could expose the iterator so that Merlin can reuse it directly ? |
3a92cbd to
4f6334e
Compare
Fair question. Perhaps merlin could always do this iteration after its typing phase, and cache the result? |
|
Superseded by #12508 |
The compiler registers UIDs' location during typing.
These are stored in a table named
uid_to_loc_tblinEnv.This is used by Merlin's
locatefeature to perform jump to definition:locatereduces the shape of the identifier and gets its definition's UID.However, right now, these locations are not the one we could expect: the compiler registers the location of the definition's (or declaration's) node instead of the location of its name.
That means that when locating the definition of a type
tMerlin will jump to|type t = intinstead oftype |t = int.This is not a big issue for the locate command since the user does land near the expected location, but it will be much more problematic when performing occurrences queries based on shapes. If the user wants to rename an identifier after performing such a query we need to return the names/labels loc. and not the nodes'.
@Octachron do you think this bug fix can make it to 5.0 ? I am also willing to prepare a backport for an hypothetical
4.14.1release.cc @gasche @trefis