Skip to content

[occurrences] Store index in cmt#12142

Closed
voodoos wants to merge 58 commits intoocaml:trunkfrom
voodoos:store-index-in-cmt
Closed

[occurrences] Store index in cmt#12142
voodoos wants to merge 58 commits intoocaml:trunkfrom
voodoos:store-index-in-cmt

Conversation

@voodoos
Copy link
Copy Markdown
Contributor

@voodoos voodoos commented Mar 28, 2023

This is a prototype for option 3 in #11983 (comment)

Context:

When the shapes were introduced into the compiler it was primarily with the goal of supporting jump-to-definition. For this purpose, visible shapes of compilation units where added to the cmt files and Merlin use these when reducing a value's shape to find its definition.

However, when indexing the code-base to provide project-wide occurrences having only the shape of the top-level of the compilation unit is not sufficient: one have to iterate over values in the cmt's Typedtree and reduce their shapes and that reduction require access to intermediate modules' shapes.

The quick-fix proposed in #11983 of storing full shape information in the cmt file is expensive in term of space.

Another approach is taken in that pr: to have the compiler index values by their locally-reduced shape directly in the cmt file. Two optimizations are performed:

I need to rebase that PR to 500 to evaluate its cost in terms of cmt size and compilation time, I will un-draft it when I have some results. This PR also includes changes from #11782, only the last four commits are new.

cc @lpw25

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Apr 11, 2023

So I cherry picked these commit to the 4.14 branch to have meaningful benchmarks.

As usual I used the Irmin codebase as example.

Version dune build @all total cmt size
4.14 40.047s 330M
4.14+index 39.796s 378M
4.14+index ensure sharing 40.423s 360M

This seems to indicate that there is no perceptible build time increase by performing the indexing.
However the cmt size increase (as it was to be expected) is noticiable.

Should I introduce a flag to enable indexation ? And maybe another flag that don't store the typedtree in the cmt files ?

@voodoos voodoos force-pushed the store-index-in-cmt branch from 0d55900 to 5143818 Compare April 11, 2023 20:28
@voodoos voodoos marked this pull request as ready for review April 11, 2023 20:29
@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Apr 12, 2023

@lpw25 I was able to fix the sharing issue (there was a very silly mistake in my previous attempt). I updated the results in the previous comment.

On the whole Irmin codebase (65K lines in ml files), the indexed data adds 30M (from 330M to 360M), that is a 10% increase.

But the index is incomplete right now, the only nodes I consider right now are:

  • Texp_ident
  • Ttyp_constr
  • Tmod_ident

Do you have in mind what other nodes would be important to consider for this first version ?
The main drawback of having the compiler do the indexing is that we are less flexibility in what we can add/remove.

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Apr 12, 2023

I guess the potential candidates are every occurrence of Longident.t in the parsetree. I suppose it depends on which sorts of items we intend to support "find all uses". For instance, if you want to support it for record fields then you'll need to include all the places with Longident.ts that correspond to those.

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Apr 12, 2023

I guess the potential candidates are every occurrence of Longident.t in the parsetree. I suppose it depends on which sorts of items we intend to support "find all uses". For instance, if you want to support it for record fields then you'll need to include all the places with Longident.ts that correspond to those.

Right, I am not sure were we should put the cursor for the first iteration, I will make a list and start filling cases.
I will also add some tests.

What do you think about making the indexation opt-in with a flag ?
10% growth of all installed cmts is still a lot of wasted space if we don't care about indexing the whole switch.

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Apr 12, 2023

I think it's worth avoiding a flag if possible. Before add one I think it would be worth investigating the sources of the size increase -- in case there are simple things that can be done to improve things. It would also be worth benchmarking it on a different project -- my understanding is that Irmin has particularly large shapes for some reason.

@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Apr 12, 2023

I would expect the size increase to be directly related to the shapes we store. This does mean that Irmin should be more impacted than other projects.

Since I have switches with all Irmin dependencies I can simply check the sizes of the cmts in these switches:

Version switch with all irmin deps cmt size
4.14 371M
4.14+index ensure sharing 381M

It looks quite good ! The switch contains 207 packages, including Irmin itself.

Full package list alcotest alcotest-lwt angstrom arp asn1-combinators astring awa awa-mirage base base-bigarray base-bytes base-threads base-unix base64 bentov bheap bigarray-compat bigstringaf bisect_ppx bos ca-certs ca-certs-nss carton carton-git carton-lwt cf cf-lwt checkseum cmdliner cohttp cohttp-lwt cohttp-lwt-unix conduit conduit-lwt conduit-lwt-unix conf-gmp conf-gmp-powm-sec conf-gnuplot conf-libffi conf-pkg-config cppo crunch csexp cstruct cstruct-lwt cstruct-sexp cstruct-unix ctypes ctypes-foreign decompress digestif dispatch dns dns-client dns-client-lwt dns-client-mirage domain-name duff dune dune-configurator duration either emile encore eqaf ethernet faraday fmt fpath fsevents fsevents-lwt functoria-runtime git git-mirage git-paf git-unix gmap graphql graphql-cohttp graphql-lwt graphql_parser h2 happy-eyeballs happy-eyeballs-lwt happy-eyeballs-mirage hex hkdf hpack httpaf hxd index integers ipaddr ipaddr-cstruct ipaddr-sexp irmin irmin-bench irmin-chunk irmin-cli irmin-containers irmin-fs irmin-git irmin-graphql irmin-http irmin-mirage irmin-mirage-git irmin-mirage-graphql irmin-pack irmin-test irmin-tezos irmin-tezos-utils irmin-watcher jsonm ke libirmin logs lru lwt lwt-dllist macaddr macaddr-cstruct magic-mime menhir menhirLib menhirSdk metrics metrics-unix mimic mimic-happy-eyeballs mirage-clock mirage-clock-unix mirage-crypto mirage-crypto-ec mirage-crypto-pk mirage-crypto-rng mirage-crypto-rng-lwt mirage-flow mirage-kv mirage-net mirage-random mirage-runtime mirage-time mirage-unix mtime notty num ocaml ocaml-base-compiler ocaml-compiler-libs ocaml-config ocaml-options-vanilla ocaml-syntax-shims ocamlbuild ocamlfind ocamlgraph ocplib-endian optint paf parsexp pbkdf pecu ppx_cstruct ppx_derivers ppx_deriving ppx_irmin ppx_repr ppx_sexp_conv ppxlib printbox printbox-text progress psq ptime qcheck-alcotest qcheck-core randomconv re repr result rresult rusage semaphore-compat seq sexplib sexplib0 stdlib-shims stringext tcpip terminal tezos-base58 tls tls-lwt tls-mirage topkg uri uri-sexp uucp uuidm uutf vector webmachine x509 yaml yojson zarith

@voodoos voodoos force-pushed the store-index-in-cmt branch from 495f2bd to 3438624 Compare April 24, 2023 20:08
@voodoos voodoos force-pushed the store-index-in-cmt branch from 3438624 to bbf671a Compare May 25, 2023 09:54
@voodoos
Copy link
Copy Markdown
Contributor Author

voodoos commented Aug 28, 2023

Superseded by #12508

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants