compress cmt/cmti files#23

Closed
c-cube wants to merge 2 commits into ocaml:master from c-cube:rfc-zip-cmt

Conversation

@c-cube
Contributor

@c-cube c-cube commented Mar 20, 2021

@c-cube c-cube changed the title propose to compress cmt/cmti files compress cmt/cmti files Mar 20, 2021
@dbuenzli
Contributor

Maybe worth investigating alternate compression formats like LZ4, a few benchmarks can be found on this page.

@c-cube
Contributor Author

c-cube commented Mar 20, 2021

At very first glance it seems to compress less well than gzip (on the command line). For base it goes from 40M to 30M rather than 27M, for example.

@dbuenzli
Contributor

At very first glance it seems to compress less well than gzip (on the command line). For base it goes from 40M to 30M rather than 27M, for example.

Yes, according to the graphs on the page I mentioned, that was expected. I was suggesting looking at the issue not only from a disk-size perspective but also with the (de)compression time overhead in mind. The compiler has to produce these files and lots of tools consume them, so one should be mindful of these timings too.

@dinosaure

A potential side benefit would be to be able to add zlib to the standard library,
which could really use such a pervasive format. Of course it can remain a compiler-only
dependency if preferred.

Just to forward some private discussions: this RFC can be useful, but it should only extend the compiler, and the inflation/deflation support should not go beyond the compiler area. In other words, it should not be an extension of the standard library or the runtime to support inflation/deflation (at least, otherlibs can be a good place for the compression/decompression implementation). Such a constraint is in the spirit of MirageOS, where we want the smallest runtime we can and delegate third-party pieces of the operating system to libraries (as we did for TCP/IP, zlib or DNS).

@c-cube
Contributor Author

c-cube commented Mar 21, 2021

@dbuenzli this is for .cmt and .cmti which, afaik, are not consumed by the compiler but only by other tools such as merlin (and odoc?). I'm proposing zlib because it's a pervasive format which is reasonably fast and compresses reasonably well, with a tiny C library supported absolutely everywhere.

@dinosaure yes, it could just live in otherlibs as we discussed. I do personally think OCaml should ship with something as crucial as zlib, but it doesn't have to be in the stdlib itself; it can be like num or str or unix. I'm just interested in the size reduction of opam switches here.

@dbuenzli
Contributor

@dbuenzli this is for .cmt and .cmti which, afaik, are not consumed by the compiler but only by other tools such as merlin (and odoc?).

Precisely. Especially for a tool like merlin, which is interactive, decompression latency is important. Compression latency is also important because, as a matter of fact, it will slow down your builds.

@gasche
Member

gasche commented Mar 21, 2021

Do Merlin or the OCaml LSP-server implementation use .cmt and .cmti files nowadays? A good share of the information can be found in .cmi with -keep-locs (which has been the default for a while).

@dbuenzli
Contributor

Do Merlin or the OCaml LSP-server implementation use .cmt and .cmti files nowadays?

I remember a discussion where @let-def or @trefis told me they intended to. I don't know if that happened, but that file mentions them.

A good share of the information can be found in .cmi with -keep-locs (which has been the default for a while).

Btw. this may also suggest other investigations, complementary to this proposal, to reduce the size of OCaml installs (I'm also a bit annoyed by these multi-GB opam switches).

  • Do .cmi files still store documentation comments, or have they fully migrated to .cmt[i] files? That is, how much info is duplicated between .cmi and .cmt[i] files?
  • (Wild guess) Could maybe a little dose of sharing reduce the size of these files?

@dra27
Member

dra27 commented Mar 22, 2021

  • (Wild guess) Could maybe a little dose of sharing reduce the size of these files ?

I was wondering about this too. I think @garrigue has mentioned before the idea of .cmia files, which, if there is a saving to be made with sharing, might be the vehicle for doing so?

@c-cube
Contributor Author

c-cube commented Mar 23, 2021

Using a script by Adrien Nader (@adrien-n) on an opam switch built around containers, here is a comparison between several compression methods and parameters (columns are size, compression time, and decompression time, for all the files of a given type in an opam switch):

cma: 101 files
            xz --lzma2=preset=6,dict=32M:    21948684 [32.820] [1.840]
                     xz --lzma2=preset=6:    21964448 [32.540] [1.880]
            xz --lzma2=preset=4,dict=32M:    22659468 [25.770] [1.890]
            xz --lzma2=preset=1,dict=32M:    25514528 [12.350] [2.060]
            xz --lzma2=preset=0,dict=32M:    25963040 [9.280] [2.030]
                                zstd -19:    25945170 [34.930] [0.170]
                                 zstd -9:    28772853 [4.260] [0.150]
                                    zstd:    31443168 [0.660] [0.160]
                                    gzip:    34354019 [5.570] [0.730]
                               lz4 -m -9:    41029515 [3.320] [0.090]
                                  lz4 -m:    47359243 [0.250] [0.080]
                                    none:    94899924 [0.070] [0.040]

cmi: 1428 files
            xz --lzma2=preset=6,dict=32M:     5716060 [12.100] [0.450]
                     xz --lzma2=preset=6:     5716060 [5.960] [0.470]
            xz --lzma2=preset=4,dict=32M:     5756544 [11.560] [0.460]
            xz --lzma2=preset=1,dict=32M:     6015180 [11.490] [0.480]
            xz --lzma2=preset=0,dict=32M:     6036316 [9.130] [0.470]
                                zstd -19:     6420086 [4.790] [0.040]
                                 zstd -9:     6670191 [0.630] [0.040]
                                    zstd:     7047310 [0.200] [0.040]
                                    gzip:     7593523 [0.940] [0.150]
                               lz4 -m -9:     9354444 [0.440] [0.020]
                                  lz4 -m:     9949697 [0.080] [0.020]
                                    none:    17634033 [0.050] [0.010]

cmt: 896 files
            xz --lzma2=preset=6,dict=32M:    45246608 [47.400] [3.780]
                     xz --lzma2=preset=6:    45245840 [42.120] [3.750]
            xz --lzma2=preset=4,dict=32M:    45796392 [43.660] [3.900]
            xz --lzma2=preset=1,dict=32M:    49281820 [19.880] [4.320]
            xz --lzma2=preset=0,dict=32M:    49952044 [18.930] [4.400]
                                zstd -19:    51337744 [45.630] [0.350]
                                 zstd -9:    56066638 [6.670] [0.300]
                                    zstd:    60692390 [1.370] [0.280]
                                    gzip:    63791112 [10.860] [1.240]
                               lz4 -m -9:    78283742 [5.530] [0.130]
                                  lz4 -m:    89351358 [0.480] [0.140]
                                    none:   158492662 [0.120] [0.060]

cmxs: 85 files
            xz --lzma2=preset=6,dict=32M:     6155252 [10.160] [0.500]
                     xz --lzma2=preset=6:     6155224 [9.750] [0.530]
            xz --lzma2=preset=4,dict=32M:     6396704 [7.610] [0.510]
            xz --lzma2=preset=1,dict=32M:     6694688 [2.510] [0.520]
            xz --lzma2=preset=0,dict=32M:     6891968 [2.140] [0.570]
                                zstd -19:     6532108 [10.450] [0.050]
                                 zstd -9:     7008768 [0.870] [0.040]
                                    zstd:     7663126 [0.200] [0.050]
                                    gzip:     9024983 [1.480] [0.200]
                               lz4 -m -9:    10933285 [1.280] [0.040]
                                  lz4 -m:    13618987 [0.090] [0.030]
                                    none:    32005328 [0.030] [0.020]

cmti: 686 files
            xz --lzma2=preset=6,dict=32M:     9574768 [13.790] [0.850]
                     xz --lzma2=preset=6:     9574840 [10.120] [0.810]
            xz --lzma2=preset=4,dict=32M:     9716448 [11.750] [0.840]
            xz --lzma2=preset=1,dict=32M:    10321132 [7.050] [0.850]
            xz --lzma2=preset=0,dict=32M:    10453372 [6.120] [0.840]
                                zstd -19:    10439785 [13.310] [0.070]
                                 zstd -9:    11237727 [1.080] [0.060]
                                    zstd:    12061416 [0.300] [0.070]
                                    gzip:    14377914 [1.950] [0.310]
                               lz4 -m -9:    17544602 [0.940] [0.040]
                                  lz4 -m:    19611563 [0.130] [0.040]
                                    none:    38107476 [0.050] [0.020]

cmx: 1435 files
            xz --lzma2=preset=6,dict=32M:     3924376 [11.540] [0.300]
                     xz --lzma2=preset=6:     3924376 [4.280] [0.300]
            xz --lzma2=preset=4,dict=32M:     3960064 [11.290] [0.310]
            xz --lzma2=preset=1,dict=32M:     4034728 [15.010] [0.300]
            xz --lzma2=preset=0,dict=32M:     4057616 [8.820] [0.300]
                                zstd -19:     3963000 [3.480] [0.020]
                                 zstd -9:     4098046 [0.360] [0.030]
                                    zstd:     4258308 [0.140] [0.020]
                                    gzip:     4375358 [0.360] [0.090]
                               lz4 -m -9:     5006699 [0.180] [0.010]
                                  lz4 -m:     5430647 [0.060] [0.020]
                                    none:    10255674 [0.050] [0.000]

@dbuenzli
Contributor

Thanks for the numbers.

Regarding merlin (re)loads, I don't know the details – it's magic after all – but it is good to keep in mind the 0.1s duration. Below that duration, users have the feeling the system is reacting instantaneously (reference).

@xavierleroy

Precisely, especially for a tool like merlin which is for interactive behaviour decompression latency is important.

True. But if the file is cold (not cached in memory already) and needs to be read from HDD or SSD, the reduction in loading time coming from the reduction in size far outweighs the extra time spent decompressing. For warm files (already cached), the effect is less dramatic, but I would still expect a win from fast decompressors like lz4 or zstd.

@c-cube
Contributor Author

c-cube commented Mar 23, 2021

Zstd is extremely fast, and it's a C library under a BSD license. @Drup concluded on IRC that it's probably the best choice and I now tend to agree, if adding bindings to compiler-libs is deemed acceptable.

(Anecdotally, I started using zstd on my machine to compress various things last year, and it's impressive. Arch Linux uses it by default to compress packages, too.)

@xavierleroy

if adding bindings to compiler-libs is deemed acceptable.

Old Unix guy here: what's wrong with piping through an external compressor / decompressor? Yes, I know: Windows, Mirage, impoverished environments, etc. But if it's data for Merlin or Typerex, should we really care about impoverished environments?

@c-cube
Contributor Author

c-cube commented Mar 23, 2021

Interesting idea. At first my gut reaction is that shelling out to subprocesses is fragile and slow (especially when you produce the file and have to wait for the subprocess to terminate, to avoid the situation where the subprocess dies before it could write all the compressed data).

However, if we arranged things so that the compiler (or merlin) knows how to shell out to a subprocess to read foo.cmt.zst (which works pretty well), then the compression becomes optional and can even be delegated to the build system (e.g. dune) as an optimization. It could even do the compression only on _build/install files or equivalent.
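For illustration, here is a minimal OCaml sketch of that subprocess approach, assuming an external decompressor on PATH; the helper name and the filter commands are mine, not anything existing in the compiler:

```ocaml
(* Hypothetical sketch: read a file through an external filter process,
   e.g. "zstd -dcq" for foo.cmt.zst. Requires the unix library. *)
let with_filter (filter_cmd : string) (file : string) (f : in_channel -> 'a) : 'a =
  let cmd = Printf.sprintf "%s %s" filter_cmd (Filename.quote file) in
  let ic = Unix.open_process_in cmd in
  Fun.protect
    ~finally:(fun () -> ignore (Unix.close_process_in ic))
    (fun () -> f ic)

(* A consumer such as merlin could then read the artefact as usual, e.g.:
   let cmt = with_filter "zstd -dcq" "foo.cmt.zst" input_value *)
```

One consequence of this shape is that the choice of compressor stays outside the reader: swapping zstd for xz or gzip only changes the command string.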

@Drup
Collaborator

Drup commented Mar 23, 2021

Old Unix guy here: what's wrong with piping through an external compressor / decompressor? Yes, I know: Windows, Mirage, impoverished environments, etc. But if it's data for Merlin or Typerex, should we really care about impoverished environments?

Not to debate the general argument, but while the effects are pretty dramatic on .cmt[i], compression also cuts the size of .cmi to a third. .cmi files can be pretty big for heavily functorized libraries, and unmarshalling is a significant time cost in the typechecker. It's probable we would actually gain speed by compressing, so I would really like to apply it there too, and then we care about "impoverished environments" again.

It's probably good for a proof of concept though.

@Armael
Member

Armael commented Mar 23, 2021

FWIW here are the numbers on my side with a run of Adrien's script on a fairly big switch (including coq, z3, and a bunch of ocaml libraries) (non-flambda compiler, laptop CPU, SSD).

Compressors are sorted roughly from more compression to less compression.
I/O time will be negligible.
Output reads as follows:

                    compressor --options: output_size_bytes [compression time] [decompression time]

cma: 1041 files
            xz --lzma2=preset=6,dict=32M:   193387368 [440.510] [13.520]
                     xz --lzma2=preset=6:   193541192 [343.370] [16.360]
            xz --lzma2=preset=4,dict=32M:   198332740 [394.620] [17.880]
            xz --lzma2=preset=0,dict=32M:   240399564 [115.770] [26.330]
                                zstd -19:   218306140 [407.320] [1.950]
                                 zstd -9:   238888304 [58.400] [1.440]
                                    zstd:   261211752 [6.310] [1.150]
                                    gzip:   318835108 [64.990] [5.510]
                              lz4 -12 -m:   373168726 [227.400] [0.770]
                               lz4 -4 -m:   379392758 [16.030] [0.770]
                                  lz4 -m:   443611233 [2.460] [0.790]
                                    none:   813454888 [0.430] [0.360]

cmi: 9636 files
            xz --lzma2=preset=6,dict=32M:    46274044 [96.390] [3.180]
                     xz --lzma2=preset=6:    46274028 [42.340] [3.080]
            xz --lzma2=preset=4,dict=32M:    46707308 [70.540] [3.110]
            xz --lzma2=preset=0,dict=32M:    48953468 [54.670] [3.840]
                                zstd -19:    51352058 [45.930] [0.330]
                                 zstd -9:    53244980 [5.000] [0.290]
                                    zstd:    56176110 [1.450] [0.290]
                                    gzip:    63061791 [8.380] [1.300]
                              lz4 -12 -m:    75780047 [63.300] [0.130]
                               lz4 -4 -m:    76055200 [2.100] [0.130]
                                  lz4 -m:    82122524 [0.580] [0.120]
                                    none:   151644718 [0.230] [0.070]

cmt: 5317 files
            xz --lzma2=preset=6,dict=32M:   305986692 [466.600] [34.980]
                     xz --lzma2=preset=6:   306184436 [406.160] [25.800]
            xz --lzma2=preset=4,dict=32M:   312271020 [314.430] [22.260]
            xz --lzma2=preset=0,dict=32M:   342124372 [133.870] [50.270]
                                zstd -19:   324290743 [425.790] [2.350]
                                 zstd -9:   352043822 [41.170] [1.760]
                                    zstd:   391247020 [9.290] [2.060]
                                    gzip:   464623656 [85.530] [10.910]
                              lz4 -12 -m:   565300572 [458.560] [1.840]
                               lz4 -4 -m:   569148135 [35.670] [1.500]
                                  lz4 -m:   651687222 [6.040] [1.900]
                                    none:  1246616151 [3.850] [0.950]

cmxs: 2017 files
            xz --lzma2=preset=6,dict=32M:    93270096 [252.170] [8.300]
                     xz --lzma2=preset=6:    93380468 [189.620] [8.500]
            xz --lzma2=preset=4,dict=32M:    97564320 [138.520] [7.160]
            xz --lzma2=preset=0,dict=32M:   104118948 [32.170] [6.960]
                                zstd -19:   100044146 [184.610] [0.820]
                                 zstd -9:   107110286 [15.800] [0.540]
                                    zstd:   117719854 [2.780] [0.720]
                                    gzip:   133797625 [23.740] [3.410]
                              lz4 -12 -m:   170831005 [188.960] [0.290]
                               lz4 -4 -m:   176171287 [5.660] [0.320]
                                  lz4 -m:   206422270 [1.020] [0.340]
                                    none:   518599816 [0.240] [0.210]

cmti: 3518 files
            xz --lzma2=preset=6,dict=32M:    56778428 [79.380] [4.040]
                     xz --lzma2=preset=6:    56778216 [60.640] [3.950]
            xz --lzma2=preset=4,dict=32M:    57771668 [65.210] [4.000]
            xz --lzma2=preset=0,dict=32M:    61916232 [32.620] [4.120]
                                zstd -19:    61614945 [66.660] [0.330]
                                 zstd -9:    65857345 [5.680] [0.320]
                                    zstd:    71172746 [1.290] [0.280]
                                    gzip:    87491640 [9.810] [1.450]
                              lz4 -12 -m:   104237171 [77.900] [0.190]
                               lz4 -4 -m:   104731512 [3.280] [0.160]
                                  lz4 -m:   118095704 [0.660] [0.160]
                                    none:   234457651 [0.160] [0.100]

cmx: 9710 files
            xz --lzma2=preset=6,dict=32M:    30925620 [60.370] [2.030]
                     xz --lzma2=preset=6:    30925620 [25.880] [2.020]
            xz --lzma2=preset=4,dict=32M:    31421016 [65.470] [2.180]
            xz --lzma2=preset=0,dict=32M:    32050768 [53.000] [2.020]
                                zstd -19:    30831684 [20.870] [0.150]
                                 zstd -9:    32063255 [2.240] [0.130]
                                    zstd:    33451828 [0.780] [0.150]
                                    gzip:    35153665 [2.560] [0.530]
                              lz4 -12 -m:    40200728 [4.730] [0.050]
                               lz4 -4 -m:    40412021 [0.820] [0.050]
                                  lz4 -m:    43594663 [0.280] [0.050]
                                    none:    78156067 [0.160] [0.030]

@xavierleroy

compression also cuts size of .cmi to a third.

I see a factor 2.5 to 2.75 with zstd. This said, you have a point that .cmi files are loaded over and over again during compilations.

@alainfrisch
Contributor

This 14-year-old message might be relevant to the current discussion: https://sympa.inria.fr/sympa/arc/caml-list/2006-06/msg00102.html

In a nutshell, OCaml's marshal format is not very friendly to usual compression algorithms, due to its use of relative object offsets. It is perhaps worth doing something like the above if we are going to compress build artefacts.

@adrien-n

This is the script that @c-cube mentioned: https://gist.githubusercontent.com/adrien-n/8a68016fdb9dccb3e28073c9fd274974/raw/2608610d78b6b05b07503639c048d67fa5ceeb5c/.sh

It will use your current opam switch.

A few notes about the benchmarks:

  • compressors have been chosen to showcase a wide range of possibilities and trade-offs (which is why lz4 -12 is used even though it's terrible),
  • there are quite a few of them, including slow ones, and the run can take very long on large switches (look at @Armael's times for instance, even though I've removed a couple of compressors in the version above, but only a couple),
  • some compressors have been left off because they performed worse on all aspects (compression ratio, compression speed, decompression speed); this is the case for bzip2 for instance (*),
  • gzip is listed mainly because it's the de-facto standard not because it's very interesting,
  • small (< 100 bytes) files are excluded but there are many of them (1319 such .cm* files out of 1680 but 14MB out of 140MB in my default opam switch),
  • no multi-threading has been used.

(*) generally speaking, bzip2 compresses slowly, decompresses at half the compression speed and has not-so-good compression except on some inputs (such as text iirc)

Small files have been excluded because it's less interesting to compress them and they can skew benchmarks by giving too much importance to repeated syscalls such as open/read/close/... Skipping them is obviously another bias, but the goal was to showcase compression performance, not syscall overhead. Initially I skipped them because I thought that decompression tests would require creating a file for each, but it turns out all compressors happily decompress the concatenation of their own output (I really don't know how that's useful in practice).

And a word about using separate executables: I think there can be a noticeable performance impact, especially for repeated file accesses and even more so on Windows; but more importantly, it would probably require either a flag day or new extensions such as .cmaz, .cmxsz and so on.

@adrien-n

About Marshal's output format, I also seem to remember that it's already somewhat compressed or that it already tries to save some space but I'm not familiar with it. No idea if that's actually the case or if my brain made that up though.

@dra27
Member

dra27 commented Mar 23, 2021

I’m not sure there’d be a need for new file extensions (indeed, I think it seriously complicates things). All these compression formats have magic numbers - having the compiler recognise them and use that as a trigger to pipe the result through a decompression API/process costs very little.
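For reference, those magic numbers are easy to sniff. A small sketch, where the type and function names are illustrative but the magic bytes are the real gzip/xz/zstd ones:

```ocaml
type compression = Gzip | Xz | Zstd | Uncompressed

(* Peek at the first bytes of a channel to decide whether the payload
   should be routed through a decompressor, then rewind the channel. *)
let detect (ic : in_channel) : compression =
  let buf = Bytes.create 6 in
  let n = input ic buf 0 6 in
  seek_in ic 0;
  let has prefix =
    n >= String.length prefix
    && String.sub (Bytes.to_string buf) 0 (String.length prefix) = prefix
  in
  if has "\x1f\x8b" then Gzip                (* gzip: 1F 8B *)
  else if has "\xfd7zXZ\x00" then Xz         (* xz: FD '7' 'z' 'X' 'Z' 00 *)
  else if has "\x28\xb5\x2f\xfd" then Zstd   (* zstd: 28 B5 2F FD *)
  else Uncompressed
```

Anything not matching a known compressor magic would be treated as today's uncompressed format, which keeps backward compatibility for free.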

@adrien-n

Aren't the current format(s) flexible enough to hold compressed data?

Anyway, it's probably good to start a quick experiment to get a rough idea of the actual impact of such a change (all files compressed, no backward compat, external process probably).

@adrien-n

I've started experimenting on this topic. I've chosen to do so with an external process for (de)compression quite unsurprisingly. This currently only covers cmi and cmt files because they're separated under file-formats/ and nothing seemed to use Linear_format during my compilations (all through opam).

Compilation speed seemed affected slightly, maybe around 3% of the total opam steps, and therefore possibly more relative to the actual compilation. During the opam build, cmi files were accessed around 10k times (IIRC there are slightly fewer than 1k of them). Not all cmi files were compressed, which I think is due to the use of the bootstrap compiler for some of them. Having to handle both is an issue, but it can be handled (well, currently I simply try the zstd codepath and fall back to the regular codepath on any error).

It is not possible to compress only parts of the files or do anything remotely "precise" when using command-line tools (at least zstd but almost certainly others too). Zstd for instance tries to read as much as possible from its input (or, at least, 4K chunks) and having non-zstd data somewhere would make it error out without indication about the location of the non-zstd data.

Since there is no way in the standard library to spawn a process and read from its standard output through an in_channel (nor write to its standard input using an out_channel), it does not seem possible to use processes as filters. I've therefore used temporary files which causes additional I/O and requires some additional care (and probably rules out using /tmp and the like for the temporary dir due to security concerns like TOCTOU). This will be an issue if you want to run analysis tools without the ability to write to the filesystem (read-only filesystems and security policies).

Because there are compression gains for every .cm* file type, I've also looked at compressing cma files (that was an early idea but while experimenting it was easier to begin with cmi and cmt files). An important difference is that they can also be used at runtime and that makes the problem with temporary files worse.

Doing it with C bindings would solve several of the issues above because (de)compression could be transparent in channels. It would however increase the coupling with the corresponding compression library and I expect that might have to be done in the runtime; therefore I expect that some people will want to not have compression at all (or library-less methods). There would also be questions related to the API and whether or not to expose it (I guess the answer is to not expose it).

@edwintorok

Perhaps calling the zstd compression functions (when available) directly from the C runtime would be the simplest?
E.g. modifying caml_(really)_{put,get}block and adding an appropriate header/magic/flag to identify that zstd compression is used, based on additional extern_flags in the Marshal to request compressed objects and adjusting headers appropriately (which might have to be written out uncompressed).

Another possibility might be to provide the needed hooks in in_channel and out_channel to install a custom OCaml filter function (and reject seek requests as if you opened a pipe), e.g. build a filtered in_channel out of a regular one.
Then once zstd is available it can be installed as a filter in ocamlc/ocamlopt, while keeping everything simpler during bootstrapping or mirage unikernels/cross-compilation where zstd could simply be configured to not be available and the compiler wouldn't attempt to do any compression.
As an added benefit users could then install their own filters on channels too, which might be a nice feature to have.

FWIW binutils can currently use zlib to compress debug sections, but there are patches already committed to add zstd compressed debug sections: https://maskray.me/blog/2022-09-09-zstd-compressed-debug-sections. It doesn't spawn a separate process, but links the library, so zstd seems like a good choice here.
(BTW if the goal is to reduce size of .opam then ocamlopt could probably start taking advantage of GCC's -gz, and clang's equivalent when available independently of this RFC)

Perhaps compression could be done at install time (e.g. by dune / opam-installer, etc.) and keep uncompressed ones during edit-compile cycles. Or in the dev profile of dune don't compress and in the release mode do compress.
That would still keep the installed .opam size smaller, while not adding unnecessary compression/decompression overhead during development.

Then the only place that would need to read compressed .cmt/.cmti files (and thus have to decompress it) would be merlin/ocaml-lsp/odoc, and it is probably a lot easier to add dependencies on a zstd library there than in the compiler itself (it could also use a different extension to more easily distinguish, e.g. .cmt.zst).

A similar approach might also work for .cma/.cmxa/.cmi files where they could be compressed upon installation (adding a .zst to their extension), and decompressed as needed during the build by the build system, caching the uncompressed file (e.g. by dune). Although of course if the stdlib is compressed this way then build systems that are unaware of compressed .cm* files wouldn't work, so this is not a general solution.

@xavierleroy

This RFC was mentioned again during the recent discussions on the size of OCaml installations and my PRs that reduce the size of executables. I agree that other compilation artefacts (cmt, cmti, cmi, maybe even cmo) are quite large too, and would benefit from compression.

Right now, I'm thinking of integrating compression directly within the marshaler (output_value / input_value), with optional compression during marshalling and automatic decompression during unmarshalling. Integration at the level of I/O channels can also be considered.

One thing that would help developing this RFC further is measurements of compression and decompression times, to make sure we do not increase compilation times too much. In particular, .cmi files are read many times, so decompression should be fast.

@adrien-n

adrien-n commented Feb 7, 2023

I think it's better to decide upon criteria early, because discussions of compression algorithms often get passionate. It's impossible to come up with a perfect benchmark, and that's why I used 2%: it's a good margin of error (I'm not even sure the standard deviation would be lower than that if the benchmark covers the compiler and some libraries). I didn't even mention memory use so far.

Note that I mentioned xz for a few reasons: first, as a reasonable upper bound for compression (i.e. the last step before zpaq) and second, because it is easy to tweak, which makes it easy to learn more about a given dataset. I know zstd can be tuned but I don't know how to do it properly, especially on the command line, while xz has --lzma2=preset=0,dict=8M to combine low CPU with higher/average memory usage. But again, the goal was to learn about the files' compressibility, rather than about the compressors.

I've actually done some more quick tests this morning.
First, it seems there is no redundancy across cmt files: compressing their concatenation gives barely better compression than compressing them independently (< 10% improvement).
Second, xz's BCJ filters for x86/ppc/ia64/arm/armthumb/sparc code didn't change the compression ratio; I wondered if they could pick up some of the aforementioned relative addresses, and either they don't because the layouts are too different, or there's nothing to gain on top of current compression algorithms. I'm not knowledgeable enough about marshal's output and about assembly to tell; testing was quick, that's it.

@dinosaure Would compression of some of these files be completely pointless for Mirage, or could it benefit from that? I'm asking because it seems very easy to make it optional (assuming it's a flag for Marshal), and a compiler could be made to not use it at all. By the way, in my tests, LZO fared pretty well overall.

@alainfrisch
Contributor

Xavier's reminder about ocaml/ocaml#4056 should be considered before comparing/benchmarking actual compression algorithms; making the marshal format more compression-friendly could vastly change the numbers.

Also, I really think there is an opportunity of improving the storage size by recreating more sharing before marshaling, as ocp-cmicomp did; and since this would reduce redundancy quite a lot, this might again change significantly the gains of various "generic compressors" (and compared to low-level compression, there is also an opportunity to speed up loading the file, and even reduce memory usage of tools that read those files -- this is perhaps less important for .cmt than for .cmi, though).
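As a toy illustration of that point (the names are mine, not ocp-cmicomp's API): Marshal preserves physical sharing, so replacing structurally equal values by a single representative before marshaling directly shrinks the output:

```ocaml
(* Naive (non-optimal) hash-consing pass: map each value to one physical
   representative per structural-equality class. *)
let share (xs : 'a list) : 'a list =
  let tbl = Hashtbl.create 16 in
  List.map
    (fun x ->
      match Hashtbl.find_opt tbl x with
      | Some repr -> repr                 (* reuse the representative *)
      | None -> Hashtbl.add tbl x x; x)
    xs

let marshal_size v = String.length (Marshal.to_string v [])

let () =
  (* 1000 freshly allocated pairs, but only 3 distinct structural values:
     after sharing, the marshaler emits back-references instead of copies *)
  let xs = List.init 1000 (fun i -> (string_of_int (i mod 3), i mod 3)) in
  Printf.printf "unshared: %d bytes, shared: %d bytes\n"
    (marshal_size xs) (marshal_size (share xs))
```

The same mechanism is what makes unmarshalling cheaper too: shared sub-terms are read and allocated once.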

@dinosaure

@dinosaure Would compression of some of these files be completely pointless for MIrage or could it benefit from that? I'm asking because it seems very easy to make it optional (assuming it's a flag for Marshal) and a compiler could be made to not use it at all. By the way, in my tests, LZO fared pretty well overall.

From our perspective, it is much more about what you want to integrate into the compiler (e.g. an inflate/deflate implementation). For our purposes, we must recompile the entire compiler with a specific toolchain in order to statically link everything into our microkernel. For instance, if the plan is to use zstd, on our side we must compile zstd with our toolchain (our nolibc, with our link script, etc.) and, for obvious reasons, it will be tricky.

On the other hand, if a standalone implementation is the plan, then as long as that implementation is not fancy and uses basic C idioms, it will not be a problem for us. The question is more about what you want to integrate, and how. Globally, I agree with the RFC 👍.

@xavierleroy

Dear @dinosaure : do you have zlib already compiled in your Mirage environment? I'm sure other OCaml packages use it. (Two examples from my own code: Cryptokit and Camlzip.) If so, is it OK for the OCaml runtime system to link with zlib? Could it actually simplify your "mirage/decompress" library by allowing it to use the actual zlib code? Just trying to understand what the heck you're doing in Mirage. Thanks.

@xavierleroy

Before I get flamed to death: the reason I mentioned zlib is not because it performs better than alternatives (it doesn't), but because it has a clean API for streaming compression / decompression, while the corresponding APIs in lz4 and zstd make no sense to me. Now I'll get flamed to death :-)
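
For readers unfamiliar with it, zlib's streaming model is: feed chunks as they arrive, collect compressed output incrementally, and flush at the end. A minimal round-trip sketch using Python's zlib binding, which mirrors the C API's deflate/inflate loop:

```python
import zlib

chunks = [b"chunk-%03d " % i for i in range(100)]

# Streaming compression: feed pieces incrementally, flush at the end.
comp = zlib.compressobj(level=6)
compressed = b"".join(comp.compress(c) for c in chunks) + comp.flush()

# Streaming decompression mirrors the structure.
decomp = zlib.decompressobj()
restored = decomp.decompress(compressed) + decomp.flush()
print(len(compressed), len(restored))
```

Neither side needs the whole input in memory at once, which is what makes the API pleasant for writers/readers of marshaled data.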

@xavierleroy

Also, I really think there is an opportunity of improving the storage size by recreating more sharing before marshaling, as ocp-cmicomp did

A posteriori re-sharing is very expensive. Is it worth doing?

@alainfrisch
Contributor

A posteriori re-sharing is very expensive. Is it worth doing?

Hard to know! Maximal sharing might be too expensive, but it might be possible to identify specific kinds of sub-terms for which a naive (non-optimal) hash-consing strategy could be beneficial. Also, the extra cost is paid only when generating the artefact, while all uses benefit from more sharing (less data to unmarshal, and smaller memory usage); I suspect that doing it for .cmi files read by the compiler might be worthwhile, while for .cmt files it's less clear.

Anyway, before spending time on such an approach, it would be interesting to understand which kinds of values explain most of the size. Changing the representation in the compiler itself might be another option. For instance, how many (physically different) Location.t values do we have in a typical .cmt file? We could certainly find a more compact representation, optimizing for frequent cases (e.g. where all pos_fname are the same), encoding the loc_ghost flag in a tag (-> inline record), and using an encoding for Lexing.position that would lead to smaller integers (which result in smaller encodings by the marshaler).

More importantly, I suspect that Env.t values take a lot of space in .cmt files. The summary itself is quite compact, but an Env.t record itself, even with most fields being empty, is quite big. Defining it as type t = Full of { values: .... } | Summary of {summary: summary; local_constraints; ...; flags: int} (or even encoding the flag -- I believe there is only one -- in the constructor) would make the data structure smaller in .cmt files. Btw, making Env.t more "stratified", so that fields that don't change frequently are grouped into an inner record, could be beneficial for memory usage during type-checking. This also reminds me of #9039, which is another axis for improving sharing of environments within .cmt files.

Some other redundancy that could be eliminated: in the vast majority of cases, the location of the second component of Texp_ident of Path.t * Longident.t loc * Types.value_description (which must be quite frequent in typical ASTs) will be the same as the corresponding exp_loc. Is there an actual point in keeping it in the Typedtree (which introduces some extra wrapping)?

Texp_apply (also quite frequent) seems a bit wasteful in terms of memory representation; a simple improvement would be to represent it as Texp_apply of expression * arg_label array * expression array with the two arrays of the same length, with hash-consing on the arg_label array (there will only be a small number of them!), and a fixed dummy expression for the rare None case of the expression. Of course, this makes the code that manipulates it a bit less nice!

I'm not saying we should necessarily do many of such "data structure optimizations", but perhaps a small number of well-chosen ones could bring a significant improvement (to both size of .cmt file and RAM usage during compilation).

@jberdine

jberdine commented Feb 7, 2023

A posteriori re-sharing is very expensive. Is it worth doing?

As a data point, until relatively recently Infer used an a posteriori sharing introduction pass on the path to writing Marshaled values that would be read many times, in parallel. It was indeed expensive, but worth doing for a while due to the memory consumption reduction in the many parallel processes that read the values. Currently it is better to use a form of hash consing that is much faster but not as effective at reducing memory consumption.

So I guess I would say that if you are really up against the wall regarding memory consumption, it might be worth doing a posteriori re-sharing. But it is a very use-case-specific call, so I would not consider this data point to be in favor of a blanket re-sharing implementation.

FWIW in the case of Infer, it is typing environments that dominate the savings (or not) of sharing (or not).

@alainfrisch
Contributor

One data point: replacing Longident.t loc by just Longident.t under Texp_ident reduces the size of typecore.cmt by 2%.

@alainfrisch
Contributor

And, if we also hashcons that Longident.t, we reach a 2.78% size reduction.

@alainfrisch
Contributor

alainfrisch commented Feb 7, 2023

Also hashconsing the Path.t argument of Texp_ident -> 3.5% size reduction.

And 4.6% by also dropping the string loc argument of Tpat_var (the string must be Ident.name id and the location ppat_loc).

@alainfrisch
Contributor

And close to 10% by also "stratifying" the Env.t type, i.e. putting most fields under a common sub-record. This quick experiment can be found here https://github.com/alainfrisch/ocaml/commits/afrisch_hashcons

@edwintorok

Xavier's reminder about ocaml/ocaml#4056 should be considered before comparing/benchmarking actual compression algorithms; making the marshal format more compression-friendly could vastly change the numbers.

Parquet has an interesting idea: store the data of a column together, which may improve compression (all data of the same type is very near each other, which allows the compression algorithms/dictionaries/etc. to tune themselves to each column's data type).
I think by the time we hit Marshal most type information is lost, but one could still try to consider "a column" based on the block's tag, and for tuples/records based on the block's size.
One could also consider nesting, e.g. serializing all the keys (of an assoc list, map, hashtbl) in sorted order and then serializing the values.

Similarly, 'bzip2' uses the Burrows-Wheeler transform, which makes the data more compressible by "sorting" it (although that is quite slow).
There are probably places in the .cm* files where the order of elements doesn't actually matter (they are logically a set or a map), and using a different ordering might improve compressibility (e.g. hash tables, or association lists).

I'm not suggesting that Parquet (or BWT) should be used in the compiler itself, but they might provide a quick way to validate the idea and see how compressible the data would be if this "columnar storage" idea were used, and, if beneficial, find a way to implement it in a simplified form in 'extern.c' without relying on these external libraries (doing breadth-first instead of depth-first traversal of values, and binning by size as a memory allocator would).
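
A tiny synthetic illustration of the columnar idea, with Python's zlib standing in for any LZ-family compressor (the record layout is made up): interleaving an incompressible field with a repetitive one leaves per-record match overhead that grouping by column avoids.

```python
import random
import zlib

random.seed(0)
n = 1000
# One incompressible 16-byte field per record, plus one repetitive field.
ids = [bytes(random.getrandbits(8) for _ in range(16)) for _ in range(n)]
tag = b"status=OK;"

row_major = b"".join(i + tag for i in ids)   # interleaved records
columnar = b"".join(ids) + tag * n           # same bytes, grouped by column

r = len(zlib.compress(row_major, 9))
c = len(zlib.compress(columnar, 9))
print(r, c)
```

Both inputs contain exactly the same bytes; only their order differs, yet the columnar layout compresses better because the repetitive column collapses into a few long matches.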

A quick look at two of the largest .cmt files I have at hand (parser.cmt and github_j.cmt) shows there are a lot of strings being repeated (and even though a compressor should be able to find and deduplicate them, it may not if they're too far apart: compressors typically have a limit on match distance), so hashconsing these (and their surrounding structures) may be beneficial:

strings 5.0.0/lib/github-data/github_j.cmt|sort|uniq -c|sort -n|tail
    624 &Buffer(add_char
    840 .Atdgen_runtime&Oj_run+read_string
   1275 &Yojson$Safe*read_space
   1465 $None
   1567 (is_first
   1993 @@@@
   2143 $Some
   2251 #len
   5696 &String*unsafe_get
  10234 #pos
strings parsing/parser.cmt|sort|uniq -c|sort -n|tail
   1845 )_startpos
   2280 -_startpos__1_
   2401 2parsing/parser.mly
   2999 '_endpos
   3352 5CamlinternalMenhirLib+EngineTypes$next
   3352 5CamlinternalMenhirLib+EngineTypes$semv
   3352 5CamlinternalMenhirLib+EngineTypes%state
   3352 5CamlinternalMenhirLib+EngineTypes&startp
   4252 5CamlinternalMenhirLib+EngineTypes$endp
   4254 -_menhir_stack

. For instance, if the plan is to use zstd, on our side, we must compile zstd with our toolchain (our nolibc, with our link-script, etc.) and, for obvious reasons, it will be tricky

About zstd / lz4 in freestanding environments, their authors have thought about it, see: https://github.com/lz4/lz4/blob/dev/lib/lz4.h#L100-L110. lz4 requires only three functions from the runtime (memcpy/memmove/memset) and doesn't even need a malloc, so it is probably the easiest to integrate.

zstd needs a little bit more from the runtime, namely a few memory-allocation functions: https://github.com/facebook/zstd/blob/329169189c248696c197ef9b88eb97c315bfb831/lib/common/zstd_deps.h, and there is a script to remove other dependencies: https://github.com/facebook/zstd/blob/f302ad8811643c428c4e3498e28f53a0578020d3/contrib/freestanding_lib/freestanding.py, which can be used to build something suitable to run inside the Linux kernel: https://github.com/facebook/zstd/blob/f302ad8811643c428c4e3498e28f53a0578020d3/contrib/linux-kernel/Makefile.

@dinosaure

Dear @dinosaure : do you have zlib already compiled in your Mirage environment? I'm sure other OCaml packages use it. (Two examples from my own code: Cryptokit and Camlzip.) If so, is it OK for the OCaml runtime system to link with zlib? Could it actually simplify your "mirage/decompress" library by allowing it to use the actual zlib code? Just trying to understand what the heck you're doing in Mirage. Thanks.

The purpose of decompress was to replace zlib with something written in OCaml 🙂. I think, with the recent changes in MirageOS 4, we would be able to use zlib directly (but decompress has evolved over several years now and we still think it's a good solution - especially for js_of_ocaml). I mentioned decompress just to give you an OCaml implementation of zlib, if you want to be inspired.

Before I get flamed to death: the reason I mentioned zlib is not because it performs better than alternatives (it doesn't), but because it has a clean API for streaming compression / decompression, while the corresponding APIs in lz4 and zstd make no sense to me. Now I'll get flamed to death :-)

I agree with you that the API provided by zlib is much more interesting than lz4's.

As @edwintorok mentioned, we would still like to be able to compile the OCaml distribution in a freestanding environment (again, with our nolibc, our openlibm and our linker script). I know that some implementations of RFC 1951/DEFLATE exist; they are easier to integrate without big requirements on the libc used, and as long as the code is pretty standalone, we are happy. My intervention was more about: OK, this feature seems cool, but be aware that we should not integrate something "big" like zstd.

More specifically, our real goal is to compile libasmrun.a with our freestanding environment. However, as it stands and in practice, we need to compile the OCaml compiler (for many reasons outside the scope of this RFC), which complicates (a bit) our freestanding environment. From my perspective, the integration of zlib/zstd/lz4 will further complicate our environment if we don't pay attention to what we integrate, but plenty of solutions exist between implementing a basic compressor and using zstd as the compressor. That's my only requirement 🙂.

@alainfrisch
Contributor

alainfrisch commented Feb 8, 2023

A quick look at one of the largest .cmt I have at hand (parser.cmt and github_j.cmt) shows there are a lot of strings being repeated

That's a typical case where a simple hash-consing of values, on the OCaml side, could be beneficial. This could even be done at the level of the lexer/parser, so that multiple occurrences of the same identifier are physically shared (also in the Parsetree, which would reduce the size of the data structure passed to external ppxes).

But: even if a string is shared, references to that string from multiple places in the Typedtree will result in different binary encodings. This is because references to previously-seen objects are made using relative offsets, not absolute ones, and this is what #4056 addresses.
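
The effect of #4056 can be illustrated with a toy encoding (Python, and a made-up stream format, not Marshal's actual one): if each reference to a shared object is written as its distance back from the current position, every occurrence is a different number, whereas an absolute identifier repeats verbatim and compresses away.

```python
import struct
import zlib

obj_pos = 50
positions = list(range(100, 100000, 97))  # where the references occur

# Relative encoding: each reference stores the distance to the object,
# so every occurrence is a different 32-bit value.
relative = b"".join(struct.pack("<I", p - obj_pos) for p in positions)

# Absolute encoding: every reference stores the same identifier.
absolute = struct.pack("<I", obj_pos) * len(positions)

rel_c = len(zlib.compress(relative))
abs_c = len(zlib.compress(absolute))
print(rel_c, abs_c)
```

The absolute stream collapses to almost nothing, while the relative stream, despite its regular structure, keeps zlib busy with ever-changing byte patterns.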

@xavierleroy

I made a POC for marshaling with compression: ocaml/ocaml#12006

@alainfrisch
Contributor

I made an experiment replacing all Location.t values with a fixed one, and the impact on the size of .cmt files is quite huge: typecore.cmt goes from 272kB to 206kB; typecore.cmi goes from 18kB to 14kB. This suggests that tweaking the representation of Location might be worthwhile. Unfortunately, the type is currently a concrete one, and changing its representation might break some third-party tools.

@gasche
Member

gasche commented Feb 26, 2023

I experimented with pre-trained dictionaries for compression of .cmi files using zstd, and the results are disappointing: the resulting files are slightly smaller, at the cost of higher total time.

$ cd .opam
$ mkdir no-dict with-dict

# working set: .cmi of my 4.14 switch
$ find 4.14.0/ -name '*.cmi' | xargs du -bch | tail -n 1
46M	total

# pre-train a dictionary on the .cmi of size <=16Kio of my 4.12 switch
$ zstd --train $(find 4.12.0/lib -name '*.cmi' -size -16k) -o cmi.dict
Trying 5 different sets of parameters                                          
k=537                                                                          
d=8
f=20
steps=4
split=75
accel=1
Save dictionary of size 112640 into file cmi.dict 

# compress without a pre-trained dictionary
$ time zstd --output-dir-mirror no-dict $(find 4.14.0/lib -name '*.cmi')
2413 files compressed :41.42%   (  45.1 MiB =>   18.7 MiB)

real	0m0.755s
user	0m0.539s
sys	0m0.212s

# compress with a pre-trained dictionary
$ time zstd --output-dir-mirror with-dict -D cmi.dict $(find 4.14.0/lib -name '*.cmi')
2413 files compressed :39.95%   (  45.1 MiB =>   18.0 MiB)

real	0m0.990s
user	0m0.760s
sys	0m0.224s

So: from 18.7 MiB to 18.0 MiB, but from 0.54s of user time to 0.76s.

@xavierleroy

What happens if you compress only the small .cmi files? That's where you should see the biggest relative space savings.

Other than that, if you want slightly smaller files at the cost of longer compression times, you can just increase the compression level of ZSTD, no need for a fancy pre-trained dictionary :-)
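
The level/size trade-off is easy to see with any compressor; the sketch below uses Python's zlib rather than zstd (the corpus is synthetic): higher levels search harder for matches, paying time for (usually) smaller output.

```python
import time
import zlib

# A moderately repetitive synthetic corpus.
data = b"".join(b"record %04d: some moderately repetitive payload\n" % (i % 97)
                for i in range(5000))

sizes = {}
for level in (1, 6, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    sizes[level] = len(out)
    print("level %d: %6d bytes, %.4fs" % (level, len(out), dt))
```

On real .cmi/.cmt payloads the curve flattens quickly, which is why cranking the level is a simpler knob than maintaining a dictionary.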

@gasche
Member

gasche commented Feb 26, 2023

The results on small files are similarly inconclusive or worse.

Note: whether or not we use a dictionary, it is interesting (edit: expected) that we get better (edit: worse) compression ratios on smaller files.

$ zstd -f --output-dir-mirror no-dict $(find 4.14.0/lib -name '*.cmi' -size -2k)
353 files compressed :75.41%   (   172 KiB =>    130 KiB)
$ zstd -f --output-dir-mirror no-dict $(find 4.14.0/lib -name '*.cmi' -size -4k)
930 files compressed :66.30%   (  1.27 MiB =>    865 KiB)
$ zstd -f --output-dir-mirror no-dict $(find 4.14.0/lib -name '*.cmi' -size -8k)
1431 files compressed :59.69%   (  3.52 MiB =>   2.10 MiB)
$ zstd -f --output-dir-mirror no-dict $(find 4.14.0/lib -name '*.cmi' -size -16k)
1859 files compressed :55.65%   (  8.06 MiB =>   4.49 MiB)
$ zstd -f --output-dir-mirror no-dict $(find 4.14.0/lib -name '*.cmi' -size -32k)
2128 files compressed :52.29%   (  13.3 MiB =>   6.93 MiB)
$ zstd -f --output-dir-mirror no-dict $(find 4.14.0/lib -name '*.cmi' -size -64k)
2258 files compressed :49.34%   (  18.5 MiB =>   9.14 MiB)
$ zstd -f --output-dir-mirror no-dict $(find 4.14.0/lib -name '*.cmi' -size -128k)
2342 files compressed :46.58%   (  25.7 MiB =>   12.0 MiB)
$ zstd -f --output-dir-mirror no-dict $(find 4.14.0/lib -name '*.cmi' -size -256k)
2385 files compressed :44.55%   (  32.8 MiB =>   14.6 MiB)
$ zstd -f --output-dir-mirror no-dict $(find 4.14.0/lib -name '*.cmi')
2413 files compressed :41.42%   (  45.1 MiB =>   18.7 MiB)

@xavierleroy

we get better compression ratios on smaller files.

I think you're misreading the percentages. They are "size compressed / size uncompressed", so the lower the percentage the more compression.

@edwintorok

edwintorok commented Feb 26, 2023

I wouldn't worry about pre-training dictionaries: they won't benefit large files (according to the manpage, training uses just the first 128KiB), and for small files the savings would be minimal (I've done some similar experiments with similarly disappointing results). And the disadvantage is that you now have to somehow ship this pre-trained dictionary with the compiler, and if you ever change it you won't be able to read back old marshaled data.
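
For context on what a pre-trained dictionary buys: zlib (like zstd) can seed compression with shared out-of-band bytes, which mainly helps tiny payloads that resemble the dictionary. A synthetic sketch (the "dictionary" and payload below are made up), and note that the shipping/versioning hassle described above applies regardless:

```python
import zlib

# Hypothetical dictionary: boilerplate common to many small payloads.
zdict = b"val compare : t -> t -> int\nval equal : t -> t -> bool\n" * 4
payload = b"val compare : key -> key -> int\nval equal : key -> key -> bool\n"

plain = zlib.compress(payload, 9)

# Compress the payload against the preset dictionary.
c = zlib.compressobj(9, zdict=zdict)
with_dict = c.compress(payload) + c.flush()

# Decompression must supply the exact same dictionary bytes.
d = zlib.decompressobj(zdict=zdict)
restored = d.decompress(with_dict) + d.flush()
print(len(payload), len(plain), len(with_dict))
```

The catch is visible in the last step: a reader without the byte-identical dictionary cannot decode the stream at all, which is exactly the compatibility trap mentioned above.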

If you really want to save space on small .cmis, then a much more effective approach is to store them all in a bigger cmi archive and compress that.
E.g. if we apply zstd to all stdlib*.cmi individually we get:
E.g. if we apply zstd to all stdlib*.cmi individually we get:

for i in stdlib*.cmi; do zstd $i; done
du -csh stdlib*.zst
[...]
4.0K    stdlib__Unit.cmi.zst
4.0K    stdlib__Weak.cmi.zst
328K total
ar q stdlib.cmi.a stdlib*.cmi
zstd stdlib.cmi.a 
stdlib.cmi.a         : 18.08%   (   894 KiB =>    162 KiB, stdlib.cmi.a.zst)

So that will go from 328KiB (compressing each file individually, but note that according to 'du' the smallest files can't take up less than 4K) to 162KiB.

Although this might introduce some unnecessary complexity in the toolchain as a whole, and accessing individual .cmi-s might be slower now since you might have to uncompress everything before it to reach it... so I don't think it is worth going down this route.
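
The effect is easy to reproduce with any compressor (Python's zlib here; the "files" are synthetic stand-ins for small .cmi-like payloads sharing boilerplate): compressing the concatenation exploits cross-file redundancy that per-file compression cannot see, and avoids the fixed per-stream overhead.

```python
import zlib

# 50 small synthetic "interface files" sharing common boilerplate.
files = [b"module M%02d : sig\n  type t\n  val compare : t -> t -> int\nend\n" % i
         for i in range(50)]

individual = sum(len(zlib.compress(f, 9)) for f in files)
archived = len(zlib.compress(b"".join(files), 9))
print(individual, archived)
```

As in the ar-then-zstd experiment above, the archive compresses to a fraction of the per-file total, at the cost of losing cheap random access to one member.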

@xavierleroy

you now have to somehow ship this pretrained dictionary with the compiler and if you change it you won't be able to read back old marshaled data

This sounds like a big hassle indeed. Let's forget about pre-trained dictionaries, then.

@xavierleroy

note that according to 'du' the smallest files can't take up less than 4K

du --apparent-size will use the same file sizes that ls -l displays, i.e. the lengths in bytes. Without this option, du tells you about disk usage, which differs for various reasons (rounding up to whole blocks; files with holes; filesystem-level compression; etc).

@gasche
Member

gasche commented Feb 27, 2023

I went ahead and merged ocaml/ocaml#12006, which provides:

  • a new marshalling flag to enable zstandard compression on the fly (provided the runtime environment was built with libzstd available)
  • its use in the compiler codebase to compress cmi, cmt, cmti files, and the marshalled parts of cmo/cma files as well.

I'm now going to close this RFC, which definitely succeeded in gathering discussion and exploring options (along with @lefessant's blog post as a recent motivator to revisit this question), and is now basically "done". Thanks to everyone involved.

(There are more things to explore in terms of tweaking the internal compiler datatypes to get more compact representation with and without compression, as restarted by @alainfrisch.)
