Skip to content

[shapes] Add support for project-wide occurrences#12508

Merged
Octachron merged 59 commits intoocaml:trunkfrom
voodoos:store-index-in-cmt-trunk
Jan 22, 2024
Merged

[shapes] Add support for project-wide occurrences#12508
Octachron merged 59 commits intoocaml:trunkfrom
voodoos:store-index-in-cmt-trunk

Conversation

@voodoos
Copy link
Copy Markdown
Contributor

@voodoos voodoos commented Aug 28, 2023

This PR implements required changes to provide project-wide-occurrences in OCaml projects.

It is structured in 4 parts:

  1. Systematic annotation of the Typedtree with uids. This was discussed in PR#11782. It was already partially the case but some uids required to index declarations were missing. This is done in commit d57a68c.
  2. Introduction of a number of modifications and improvements to the Shapes' representation and reduction. This is done in commits starting with the Shapes: tag. This was also taken as an opportunity to add some documentation to the mli of the Shape module.
  3. Moving the indexation of declarations, previously interleaved with typing, to a separate iteration performed before writing a cmt file. This requires the changes made in 1. and makes PR Store uids' declarations instead of node locations #11782 obsolete. This is done in commit fda848b.
  4. Finally, we introduce an additional iteration on the typedtree that lists usages of values, types, modules, constructors and labels in the cmt file associated with their locally-reduced shape.

Grouping all these changes together in the same branch was more convenient to iterate on the design, but if this feels too much for one PR I can split it.

List of the shape changes

  • 1866d74 introduces "weak" reduction and sharing of the memoization tables (which should be safe at the level of the compilation unit where idents are unique). It was first proposed and discussed in PR #11020. @gasche I did not yet implement your suggestion. (We can keep that discussion in Shapes: Add an option to share memo tables accros reductions #11020, but it was more convenient to replicate the change to have this PR self-contained.)
  • 1ee1e17 fixes an issue where reduced functor applications would not have the correct uids.
  • d9ee586 equips the shapes with a new constructor that enable the tracing of module aliases. When reducing, these alias markers are stripped when in head position.
  • 1ee1e17 adds labels and constructors to the shapes. This is required to locate them and get their occurrences.
  • 1e6049f mark shapes whose reduction was incomplete (for example, when hitting a first class module). Previously an approximated uid was silently returned, this is good enough for jumping to "definition" but wrong for occurrences.

About the indexation

Indexing values (or types, etc) requires reducing the shape of every usage of a value to obtain its definition. This process requires access to the full typing environment which contains all intermediary shapes. Since only a stripped down
environment is stored in the CMT file this processing cannot be done by an external tool with only the Typedtree and environment stored in the cmt files.

This was discussed in PR #11983 and we decided to perform (partial) reduction of the shape associated with each usage of a value in the compiler and store the result in the cmt file. Shapes that relies on external compilation units are not reduced completely to preserve separate compilation.

Project-wide occurrences

With the changes proposed in this PR, one should be able to build an index of all values in a project by following these steps:

  • Build the project with -bin-annot.
  • Read every cmt file and finish the reduction of the shapes in the cmt_usages_index lists to get the definitions corresponding to the listed values. This step might require loading other cmt files.
  • Group every usages of the same definition to get the index of a cmt file.
  • Merge all these local-indexes to get the project's index.

Size increase and performance discussion

It is expected that:

  • The additional typedtree iterations should have a negligible impact on build duration.
  • Reducing all the shapes of every occurrences could be expensive but this should be largely alleviated by reusing the same memo table for each compilation unit.
  • Adding the index of usage to the cmt should not take too much additional
    space due to sharing with the already-present typedtree.

I will prepare some benchmarks next week to verify these assertions.

Copy link
Copy Markdown
Contributor

@trefis trefis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a first look, mostly it looks good.
I left a few comments here and there, and I'm leaving some of the review work for others (sharing is caring).

Looking forward to seeing those benchmarks you mentioned.

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Sep 7, 2023

I made some benchmarks:

Compiler

Version Command Mean [s] Min [s] Max [s] Relative
trunk make world.opt --jobs 10 75.260 ± 0.336 74.995 76.087 1.00
trunk+index make world.opt --jobs 10 75.881 ± 0.131 75.725 76.100 1.00

I ran the benchmark multiple times and duration was always a bit longer with these changes, although it's mostly in the error margin.

Version cmt size
trunk 42M
trunk+index 47M

That's a 11% increase in size.

Irmin 3.7

I backported the changed to 5.1+trunk to be able to test them on real-world projects.
Irmin is know for having complex shapes due to its intricate use of functors.

Version Command Mean [s] Min [s] Max [s] Relative
5.1 dune build @install 19.070 ± 0.303 18.824 19.677 1.00
5.1+index dune build @install 18.866 ± 0.266 18.549 19.513 1.00

Did that benchmark multiple times, no perceptible build time difference.

Version cmt size
5.1 48M
5.1+index 49M

A 2% increase in size.

Size of the opam switches' cmt files

The opam switches have all Irmin's dependencies installed.

Package list alcotest
alcotest-lwt
angstrom
arp
asn1-combinators
astring
awa
awa-mirage
base
base-bigarray
base-bytes
base-domains
base-nnp
base-threads
base-unix
base64
bentov
bheap
bigarray-compat
bigstringaf
bos
ca-certs
ca-certs-nss
carton
carton-git
carton-lwt
cf
cf-lwt
checkseum
cmdliner
cohttp
cohttp-lwt
cohttp-lwt-unix
conduit
conduit-lwt
conduit-lwt-unix
conf-gmp
conf-gmp-powm-sec
conf-gnuplot
conf-libffi
conf-pkg-config
cppo
crunch
csexp
cstruct
cstruct-lwt
cstruct-unix
ctypes
ctypes-foreign
decompress
digestif
dispatch
dns
dns-client
dns-client-lwt
dns-client-mirage
domain-name
duff
dune
dune-configurator
duration
either
emile
encore
eqaf
ethernet
faraday
fmt
fpath
fsevents
fsevents-lwt
functoria-runtime
git
git-mirage
git-paf
git-unix
gmap
graphql
graphql-cohttp
graphql-lwt
graphql_parser
h2
happy-eyeballs
happy-eyeballs-lwt
happy-eyeballs-mirage
hex
hkdf
hpack
httpaf
hxd
index
integers
ipaddr
ipaddr-cstruct
ipaddr-sexp
irmin-watcher
jsonm
ke
logs
lru
lwt
lwt-dllist
macaddr
macaddr-cstruct
magic-mime
menhir
menhirLib
menhirSdk
metrics
metrics-unix
mimic
mimic-happy-eyeballs
mirage-clock
mirage-clock-unix
mirage-crypto
mirage-crypto-ec
mirage-crypto-pk
mirage-crypto-rng
mirage-crypto-rng-lwt
mirage-flow
mirage-kv
mirage-net
mirage-random
mirage-runtime
mirage-time
mirage-unix
mtime
notty
num
ocaml
ocaml-compiler-libs
ocaml-config
ocaml-syntax-shims
ocaml-variants
ocamlbuild
ocamlfind
ocamlgraph
ocplib-endian
optint
paf
parsexp
pbkdf
pecu
ppx_derivers
ppx_deriving
ppx_repr
ppx_sexp_conv
ppxlib
printbox
printbox-text
progress
psq
ptime
randomconv
re
repr
result
rresult
rusage
semaphore-compat
seq
sexplib
sexplib0
stdlib-shims
stringext
tcpip
terminal
tezos-base58
tls
tls-lwt
tls-mirage
topkg
uri
uri-sexp
uucp
uuidm
uutf
vector
webmachine
x509
yaml
yojson
zarith
Version cmt size switch size
5.1+trunk 85M 669M
5.1+index 94M 679M

A 10.5% size increase of the cmt files.
A 1.5% size increase of the switch size.

Discussion

I expect the additional size to come from the index' list of shapes and locations. Partially reduced shapes might not appear elsewhere in the cmt files and the list can be quite large so there is some administrative cost to it. (However the locs themselves should be entirely shared with the typedtree's ones.)

Irmin's cmts might already contain more of the shape information that is added by the index, explaining the smaller size increase.

All around, the 10.5% size increase is probably a good idea of the new cmt sizes after this PR.

A 10% artifact size increase is not unreasonable when one expect its project to be indexed for full occurrences search.

The total switch since increase is much less (1.5%), but most of the time we won't need this additional information. (it might be useful for opam-scale analysis, but not for individual projects). Maybe we should introduce a flag to trigger the indexation, so that the new indexing information would not be a part of the installed libraries but could be easily enabled (and probably by default) by the various build systems for user projects?

(cc @lpw25)

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Sep 18, 2023

We discussed the performances with @trefis. We decided to introduce (in 456bf83) a flag to enable indexation: even if the overhead is reasonable, it is wasteful to build and store the index for every installed package. The flag make it simple for build systems to actually enable indexation of projects in developpement mode and not in release mode.

| Some Class_type -> (in_printing_env @@ Env.find_cltype path).clty_loc
| Some (Extension_constructor|Value) | None -> Location.none
| Some (Extension_constructor|Value|Constructor|Label) | None ->
Location.none
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is the correct behaviour (though I guess this new code is unused, so it's going to be hard to check).
Ping @Octachron ? 🙂

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "correct" behavior for item kinds whose paths cannot appear in type error messages (and thus we are never trying to disambiguate).

@gasche
Copy link
Copy Markdown
Member

gasche commented Oct 5, 2023

I had a meeting on Thursday September 28th with @voodoos, @trefis and @Ekdohibs to discuss this PR. @voodoos presented the design, the new documentation on shapes (thanks!), explained the various iterations and the current state. @Ekdohibs explained some bugs left over in the implementation of deep reduction (in particular in the corner case of terms whose reduction is aborted without reaching a normal form, due to fuel limit) and proposed various ways to fix the idea of "weak reductions" and improve the system. We reviewed the interface / documentation, and we decided to simplify the "No Fuel Left" case by removing the not-quite-reduced term in this case, simplifying @Ekdohibs' implementation.

My current understanding:

  • I am okay with the current design. (We are exactly in the "worst case" scenario where each new shape-related feature actually requires changes to the compiler codebase, instead of being a flexible abstraction to offer once and for all. But the collaboration dynamics are good and I trust that the proposed design has a reasonable amount of domain logic in the compiler.)
  • Sometimes still need to do a review pass on the implementation. This person may end up being me.

(A minor point: the "remove weak_reduce" commit is entirely based, unless I am missing something, on original non-trivial ideas from @Ekdohibs, and her implementation in Ekdohibs@2af1bfb . In this case I believe that it would be better to leave @Ekdohibs as author of the commit, possibly with a co-authored-by tag if you made non-trivial changes on top of the commit, instead of having @voodoos marked as the main author of 9393609 . You can use indicate any author when you make a commit with git commit --author=...., and cherry-picking/rebasing should be default preserve the authorship information.)

@voodoos voodoos force-pushed the store-index-in-cmt-trunk branch from 9393609 to 033f075 Compare October 5, 2023 16:24
@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Oct 17, 2023

An update as I been working mostly offline on this: I am updating Merlin to actually use this PR's changes to answer locate and occurrences queries. This is a way to check that nothing is missing / wrong with the PR with regards to its intended use-case. That led me to make a few changes and fixes that I will add to the PR before cleaning and rebasing it for another round of review.

@voodoos voodoos force-pushed the store-index-in-cmt-trunk branch 10 times, most recently from c23ab14 to 787a4bf Compare November 8, 2023 15:31
@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Nov 8, 2023

After discussing with @trefis we decided to reshape types and constructors that are now struct that can contain constructors or labels. This allows a correct model of inline records in types, type extensions and exceptions.

These changes are done in d759cfd (@trefis , if you want to have a look: I had to handle more cases than we first thought of for type extensions and exceptions).

I also rebased the PR on latest trunk and squashed all the changes into meaningful commits that should be reviewable one by one.

I tried to please the CI as much as possible but:

  • I added some long URLs in commentaries which break the 80 columns rule. What should I do about that ?
  • I tried running make depend but it seems like it was outdated before this PR, and promoting the results appear to break the CI. Is that expected ? For now I only promoted the changes related to this PR.
  • I guess some of the CI breaks because this PR requires a magic_number bump. Should I do it ? (@Octachron)

@gasche you mentioned that you might review at some point, if that is still the case I think the PR is now ready for that :-)

@voodoos voodoos force-pushed the store-index-in-cmt-trunk branch from 787a4bf to eb2d7e1 Compare November 8, 2023 15:43
@Octachron
Copy link
Copy Markdown
Member

For the check-typo issue, I would propose to simply disable the column width counting for lines that contain "https://".

Copy link
Copy Markdown
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the change introducing constructor and labels in shapes (good idea!) and I had a few drive-by comments.

@voodoos voodoos force-pushed the store-index-in-cmt-trunk branch from d14e449 to bbdc93b Compare November 13, 2023 12:52
@voodoos voodoos force-pushed the store-index-in-cmt-trunk branch from 84b5885 to 6d2e9d5 Compare January 15, 2024 12:58
voodoos added a commit to voodoos/ocaml that referenced this pull request Jan 15, 2024
@voodoos voodoos force-pushed the store-index-in-cmt-trunk branch from 6d2e9d5 to e71de27 Compare January 15, 2024 13:00
Copy link
Copy Markdown
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments on the changelog entry.

Changes Outdated
locations. Add an option -bin-annot-occurrences to the driver which triggers
the indexing of values, modules, types, labels and constructors usages and
store that index to the cmt files. Document shapes, make the interface more
precise and share memoization tables to improve reduction perofrmance.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
precise and share memoization tables to improve reduction perofrmance.
precise and share memoization tables to improve reduction performance.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The attribution part below is fine, but the description of the change below is too low-level: the main readers are people interested in understanding new features of an OCaml release. To record development choices we use nice commit messages.

I propose the following:

Add compiler-side support for project-wide occurrences in Merlin,
by generating index tables of all identifier occurrences.
This extra data in .cmt files is only added when the new flag
  -bin-annot-occurrences
is passed.

"-bin-annot", Arg.Unit f, " Save typedtree in <filename>.cmt"

let mk_store_occurrences f =
"-bin-annot-occurrences", Arg.Unit f,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if the name of the option (as a variable) matched the name of the option (as a flag): mk_binannot_occurences for example.

| Texp_send _
| Texp_letmodule _ | Texp_letexception _ | Texp_assert _ | Texp_lazy _
| Texp_object _ | Texp_pack _ | Texp_letop _ | Texp_unreachable
| Texp_extension_constructor _ | Texp_open _ -> ());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is small missing occurrence here: Texp_extension_constructor represents [%extension_constructor Not_found] and thus may hide an occurrence of an extension constructor.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I just pushed a test showing the missing occurrence and a fix.

@Octachron Octachron merged commit a682d51 into ocaml:trunk Jan 22, 2024
@gasche
Copy link
Copy Markdown
Member

gasche commented Jan 22, 2024

Congratulations everyone for the merge :-)

Octachron added a commit that referenced this pull request Jan 22, 2024
[shapes] Add support for project-wide occurrences

(cherry picked from commit a682d51)
@Octachron
Copy link
Copy Markdown
Member

Cherry-picked to 5.2 in 02b3970 . Thanks !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants