Maybe worth investigating alternate compression formats like LZ4; a few benchmarks can be found on this page.
|
At first glance it seems to compress less well than
Yes, according to the graphs on the page I mentioned, that was expected. I was suggesting looking at the issue not only from a disk-size perspective but also considering the (de)compression time overhead. The compiler has to produce these files and lots of tools consume them, so one should be mindful of these timings too.
Just to forward some private discussions: this RFC can be useful, but it should only extend the compiler, and the inflation/deflation support should not extend beyond the compiler area. In other words, it should not be an extension of the standard library or the runtime to support inflation/deflation (at least,
|
@dbuenzli this is for .cmt and .cmti files which, afaik, are not consumed by the compiler but only by other tools such as merlin (and odoc?). I'm proposing zlib because it's a pervasive format which is reasonably fast and compresses reasonably well, with a tiny C library supported absolutely everywhere. @dinosaure yes, it could just be an otherlibs as we discussed. I personally think OCaml should ship with something as crucial as zlib, but it doesn't have to be in the stdlib itself; it can be like
Precisely. Especially for a tool like merlin, which is interactive, decompression latency is important. Compression latency is also important because, as a matter of fact, it will slow down your builds.
|
Do Merlin or the OCaml LSP-server implementation use |
I remember a discussion where @let-def or @trefis told me they intended to. I don't know if that happened, but that file mentions them.
Btw. this may also suggest other investigations, complementary to this proposal, to reduce the size of
|
I was wondering about this too. I think @garrigue has mentioned before the idea of .cmia files, which, if there is a saving to be made with sharing, might be the vehicle for doing so? |
|
Using a script by Adrien Nader (@adrien-n), on an opam switch built around containers, this is a comparison between several compression methods and parameters (columns are size, compression time, decompression time, for all the files of a given type in an opam switch): |
|
Thanks for the numbers. Regarding merlin (re)loads, I don't know the details – it's magic after all – but it is good to keep the 0.1s threshold in mind: below that duration, users feel the system reacts instantaneously (reference).
True. But if the file is cold (not cached in memory already) and needs to be read from HDD or SSD, the reduction in loading time coming from the reduction in size far outweighs the extra time spent decompressing. For warm files (already cached), the effect is less dramatic, but I would still expect a win from fast decompressors like lz4 or zstd. |
|
Zstd is extremely fast, and it's a C library under a BSD license. @Drup concluded on IRC that it's probably the best choice, and I now tend to agree, if adding bindings to compiler-libs is deemed acceptable. (Anecdotally, I started using zstd on my machine to compress various things last year, and it's impressive. Archlinux uses it by default to compress packages, too.)
Old Unix guy here: what's wrong with piping through an external compressor / decompressor? Yes, I know: Windows, Mirage, impoverished environments, etc. But if it's data for Merlin or Typerex, should we really care about impoverished environments? |
|
Interesting idea. My gut reaction is that shelling out to subprocesses is fragile and slow (especially when you produce the file and have to wait for the subprocess to terminate, to avoid the situation where the subprocess dies before it could write all the compressed data). However, if we arranged things so that the compiler (or merlin) knows how to shell out to a subprocess to read
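A minimal sketch of this subprocess idea, using the `unix` library (not the stdlib proper): read a file through an external filter command. The default command assumes a `zstd` binary on PATH, and `read_via_filter` is a hypothetical name; this is only an illustration of the approach, not the actual implementation discussed.

```ocaml
(* Sketch: read a file through an external decompressor subprocess.
   Assumes the `zstd` binary is on PATH; names are hypothetical. *)

let read_via_filter ?(cmd = "zstd -dcq") file =
  let ic =
    Unix.open_process_in
      (Printf.sprintf "%s < %s" cmd (Filename.quote file))
  in
  let data = In_channel.input_all ic in    (* OCaml >= 4.14 *)
  match Unix.close_process_in ic with
  | Unix.WEXITED 0 -> data
  | _ -> failwith ("decompression failed for " ^ file)
```

Substituting `~cmd:"cat"` returns the raw contents, which is a cheap way to exercise the plumbing without any compressor installed.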
Not to debate the general argument, but while the effects are pretty dramatic on It's probably good for a proof of concept though. |
|
FWIW here are the numbers on my side with a run of Adrien's script on a fairly big switch (including coq, z3, and a bunch of ocaml libraries) (non-flambda compiler, laptop CPU, SSD). |
I see a factor of 2.5 to 2.75 with zstd. That said, you have a point that .cmi files are loaded over and over again during compilation.
|
This 14-year-old message might be relevant to the current discussion: https://sympa.inria.fr/sympa/arc/caml-list/2006-06/msg00102.html In a nutshell, OCaml's marshal format is not very friendly to usual compression algorithms, due to its use of relative object offsets. It is perhaps worth doing something like the above if we are going to compress build artefacts.
|
This is the script that @c-cube mentioned: https://gist.githubusercontent.com/adrien-n/8a68016fdb9dccb3e28073c9fd274974/raw/2608610d78b6b05b07503639c048d67fa5ceeb5c/.sh It will use your current opam switch. A few notes about the benchmarks:
(*) Generally speaking, bzip2 compresses slowly, decompresses at half the compression speed, and has not-so-good compression except on some inputs (such as text, iirc).

Small files have been excluded because it's less interesting to compress them and they can skew benchmarks by giving too much importance to repeated syscalls such as open/read/close/... Skipping them is obviously another bias, but the goal was to showcase compression performance, not syscall overhead. Initially I skipped them because I thought that decompression tests would require creating a file for each, but it turns out all compressors happily decompress the concatenation of their own output (I really don't know how that's useful in practice).

And a word about using separate executables: I think there can be a noticeable performance impact, especially for repeated file accesses and even more so on Windows, but more importantly, it would probably require either a flag day or new extensions such as .cmaz, .cmxsz and so on.
|
About Marshal's output format: I also seem to remember that it's already somewhat compressed, or that it already tries to save some space, but I'm not familiar with it. No idea if that's actually the case or if my brain made that up, though.
|
I’m not sure there’d be a need for new file extensions (indeed, I think they would seriously complicate things). All these compression formats have magic numbers; having the compiler recognise them and use that as a trigger to pipe the result through a decompression API/process costs very little.
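The magic-number dispatch described here could be sketched as follows. The magic bytes are the real ones for each format (zstd's frame magic is 0xFD2FB528 little-endian, gzip's is 1F 8B, xz's is FD 37 7A 58 5A 00); the `detect` helper itself is hypothetical.

```ocaml
(* Sketch: recognise a compression format from the first bytes of a
   file, as suggested above.  [detect] is a hypothetical helper. *)

type format = Zstd | Gzip | Xz | Plain

let detect (header : string) : format =
  let has p =
    String.length header >= String.length p
    && String.sub header 0 (String.length p) = p
  in
  if has "\x28\xb5\x2f\xfd" then Zstd      (* zstd frame magic *)
  else if has "\x1f\x8b" then Gzip         (* gzip magic *)
  else if has "\xfd7zXZ\x00" then Xz       (* xz magic *)
  else Plain                               (* assume uncompressed *)
```

A reader would sniff the first few bytes of a .cmi/.cmt file, then either hand the channel to the normal loader or route it through a decompressor.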
|
Aren't the current format(s) flexible enough to hold compressed data? Anyway, it's probably good to start a quick experiment to get a rough idea of the actual impact of such a change (all files compressed, no backward compat, external process probably). |
|
I've started experimenting on this topic. I've chosen to do so with an external process for (de)compression, quite unsurprisingly. This currently only covers cmi and cmt files because they're separated under file-formats/ and nothing seemed to use Linear_format during my compilations (all through opam).

Compilation speed seemed slightly affected, maybe around 3% of the total opam steps, and therefore possibly more relative to the actual compilation. During the opam build, cmi files were accessed around 10k times (IIRC there are slightly fewer than 1k of them). Not all cmi files were compressed, which I think is due to the use of the bootstrap compiler for some of them. Having to handle both is an issue, but it can be handled (well, currently I simply try the zstd code path with _ -> regular code path).

It is not possible to compress only parts of the files or do anything remotely "precise" when using command-line tools (at least zstd, but almost certainly others too). Zstd for instance tries to read as much as possible from its input (or, at least, 4K chunks), and having non-zstd data somewhere would make it error out without any indication of the location of the non-zstd data.

Since there is no way in the standard library to spawn a process and read from its standard output through an in_channel (nor write to its standard input using an out_channel), it does not seem possible to use processes as filters. I've therefore used temporary files, which causes additional I/O and requires some additional care (and probably rules out using /tmp and the like for the temporary dir due to security concerns like TOCTOU). This will be an issue if you want to run analysis tools without the ability to write to the filesystem (read-only filesystems and security policies).

Because there are compression gains for every .cm* file type, I've also looked at compressing cma files (that was an early idea, but while experimenting it was easier to begin with cmi and cmt files).
An important difference is that they can also be used at runtime, and that makes the problem with temporary files worse. Doing it with C bindings would solve several of the issues above because (de)compression could be transparent in channels. It would however increase the coupling with the corresponding compression library, and I expect that it might have to be done in the runtime; therefore I expect that some people will want no compression at all (or library-less methods). There would also be questions related to the API and whether or not to expose it (I guess the answer is not to expose it).
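The temporary-file workaround described above could be sketched like this, with the filter command as a parameter (`zstd -q` is an assumed default; `filter_file` and `filter_string` are hypothetical names). Any shell filter can be substituted, which also makes the plumbing testable without a compressor installed.

```ocaml
(* Sketch of the temporary-file approach: run an external filter with
   stdin/stdout redirected to files.  Names and the default command
   are hypothetical. *)

let filter_file ?(cmd = "zstd -q") ~input ~output () =
  let c =
    Printf.sprintf "%s < %s > %s"
      cmd (Filename.quote input) (Filename.quote output)
  in
  if Sys.command c <> 0 then failwith ("filter failed: " ^ c)

(* Pass a string through the filter via temporary files.  As noted
   above, the temporary directory needs care (a world-writable /tmp
   raises TOCTOU-style concerns). *)
let filter_string ?cmd data =
  let tin = Filename.temp_file "ocaml_filter" ".in"
  and tout = Filename.temp_file "ocaml_filter" ".out" in
  Fun.protect
    ~finally:(fun () -> Sys.remove tin; Sys.remove tout)
    (fun () ->
       Out_channel.with_open_bin tin
         (fun oc -> output_string oc data);
       filter_file ?cmd ~input:tin ~output:tout ();
       In_channel.with_open_bin tout In_channel.input_all)
```

The extra write/read per file is exactly the additional I/O cost mentioned above, which C bindings operating directly on channels would avoid.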
|
Perhaps calling the
Another possibility might be to provide the needed hooks in
FWIW
Perhaps compression could be done at install time (e.g. by
Then the only place that would need to read compressed .cmt/.cmti files (and thus have to decompress them) would be
A similar approach might also work for
|
This RFC was mentioned again during the recent discussions on the size of OCaml installations and my PRs that reduce the size of executables. I agree that other compilation artefacts (cmt, cmti, cmi, maybe even cmo) are quite large too, and would benefit from compression. Right now, I'm thinking of integrating compression directly within the marshaler (output_value / input_value), with optional compression during marshalling and automatic decompression during unmarshalling. Integration at the level of I/O channels can also be considered. One thing that would help develop this RFC further is measurements of compression and decompression times, to make sure we do not increase compilation times too much. In particular, .cmi files are read many times, so decompression should be fast.
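The shape such a marshaler-level API could take might be sketched as below. The `compress`/`decompress` pair is an identity placeholder standing in for real zstd or zlib bindings; all names here are hypothetical, not the actual proposal.

```ocaml
(* Sketch: compression wrapped around Marshal, as proposed.  The
   codec below is an identity placeholder standing in for real
   compression bindings (hypothetical). *)

let compress (s : string) : string = s      (* placeholder codec *)
let decompress (s : string) : string = s

let to_string_compressed v =
  compress (Marshal.to_string v [])

let from_string_compressed s =
  Marshal.from_string (decompress s) 0
```

With automatic decompression, `from_string_compressed` would sniff whether the payload is compressed (e.g. via a magic number) before decoding, so readers keep working on both old and new artefacts.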
|
I think it's better to decide upon criteria early because discussions of compression algorithms are often impassioned. It's impossible to come up with a perfect benchmark and that's why I used 2%: it's a good margin of error (I'm not even sure the standard deviation would be lower than that if the benchmark covers the compiler and some libraries). I didn't even mention memory use so far.

Note that I mentioned xz for a few reasons: first, as a reasonable upper bound for compression (i.e. the last step before zpaq), and second, because it is easy to tweak, which makes it easy to learn more about a given dataset. I know zstd can be tuned but I don't know how to do it properly, especially on the command line, while xz has --lzma2=preset=0,dict=8M to combine low CPU usage with higher/average memory usage. But again, the goal was to learn about the files' compressibility rather than about compressors. I've actually done some more quick tests this morning.

@dinosaure Would compression of some of these files be completely pointless for Mirage, or could it benefit from it? I'm asking because it seems very easy to make it optional (assuming it's a flag for Marshal) and a compiler could be made to not use it at all. By the way, in my tests, LZO fared pretty well overall.
|
Xavier's reminder about ocaml/ocaml#4056 should be considered before comparing/benchmarking actual compression algorithms; making the marshal format more compression-friendly could vastly change the numbers. Also, I really think there is an opportunity to improve the storage size by recreating more sharing before marshaling, as ocp-cmicomp did; and since this would reduce redundancy quite a lot, it might again significantly change the gains of various "generic compressors" (and compared to low-level compression, there is also an opportunity to speed up loading the file, and even to reduce the memory usage of tools that read those files -- this is perhaps less important for .cmt than for .cmi, though).
From our perspective, the question is much more about what you want to integrate into the compiler (e.g. an inflate/deflate implementation). For our purpose, we must recompile the entire compiler with a specific toolchain to then be able to statically link everything into our microkernel. For instance, if the plan is to use

On the other side, if a standalone implementation is the plan, as long as this implementation is not fancy and uses basic C idioms, it will not be a problem for us. The question is more about what you want to integrate and how. And globally, I agree with the RFC 👍.
|
Dear @dinosaure : do you have |
|
Before I get flamed to death: the reason I mentioned |
A posteriori re-sharing is very expensive. Is it worth doing? |
Hard to know! Maximal sharing might be too expensive, but it might be possible to identify specific kinds of sub-terms for which a naive (non-optimal) hash-consing strategy could be beneficial. Also, the extra cost is paid only when generating the artefact, and all uses benefit from more sharing (less data to unmarshal, and smaller memory usage); I suspect that doing it for .cmi files read by the compiler might be worthwhile, while for .cmt files it's less clear.

Anyway, before spending time on such an approach, it would be interesting to understand which kinds of values explain most of the size. Changing the representation in the compiler itself might be another option. For instance, how many (physically different)

More importantly, I suspect that

Some other redundancy that could be eliminated: in the vast majority of cases, the location of the second component of
I'm not saying we should necessarily do many of such "data structure optimizations", but perhaps a small number of well-chosen ones could bring a significant improvement (to both size of .cmt file and RAM usage during compilation). |
As a data point, until relatively recently Infer used an a posteriori sharing-introduction pass on the path to writing

So I guess I would say that if you are really up against the wall regarding memory consumption, it might be worth doing a posteriori re-sharing. But it is a very use-case-specific call, so I would not consider this data point to be in favor of a blanket re-sharing implementation. FWIW, in the case of Infer, it is typing environments that dominate the savings (or not) of sharing (or not).
|
One data point: replacing
|
And, if we also hashcons that Longident.t, we reach a 2.78% size reduction. |
|
Also hashconsing the Path.t argument of Texp_ident -> 3.5% size reduction. And 4.6% by also dropping the |
|
And close to 10% by also "stratifying" the Env.t type, i.e. putting most fields under a common sub-record. This quick experiment can be found here https://github.com/alainfrisch/ocaml/commits/afrisch_hashcons |
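The "stratification" idea can be illustrated with a toy sketch: group fields that rarely change under a common sub-record, so that derived values share it physically (and Marshal, which preserves sharing, emits it only once). The types below are invented for illustration; they are not the real Env.t layout.

```ocaml
(* Sketch of "stratifying" a record type: rarely-changing fields go
   into a shared sub-record.  Types here are invented, not Env.t. *)

type rarely_changed = { globals : string list; flags : bool }
type env = { common : rarely_changed; locals : string list }

(* A functional update copies only the outer record; the [common]
   sub-record stays physically shared between old and new values. *)
let add_local env name = { env with locals = name :: env.locals }
```

Because successive environments share one `common` block, both the in-memory footprint and the marshaled size of a collection of environments shrink, which is the effect reported above.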
Parquet has an interesting idea: store the data of a column together, which may improve compression (all data of the same type is very near each other, which allows the compression algorithms/dictionaries/etc. to tune themselves to each column's data type). Similarly, 'bzip2' uses the Burrows-Wheeler transform, which makes the data more compressible by "sorting" it (although that is quite slow). I'm not suggesting that parquet (or BWT) should be used in the compiler itself, but they might provide a quick way to validate the idea and see how compressible the data would be if this "columnar storage" idea were used, and if beneficial, find a way to implement it in a simplified form in 'extern.c' without relying on these external libraries (doing breadth-first instead of depth-first traversal of values and binning by size as a memory allocator would). A quick look at some of the largest .cmt files I have at hand (parser.cmt and github_j.cmt) shows there are a lot of repeated strings (and even though a compressor should be able to find and optimize them, it may not if they're too far apart; compressors typically have a limit on match distance), so hashconsing these (and their surrounding structures) may be beneficial:
About zstd/lz4 in freestanding environments: their authors have thought about it, see https://github.com/lz4/lz4/blob/dev/lib/lz4.h#L100-L110: lz4 requires only 3 functions from the runtime (memcpy/memmove/memset) and doesn't even need a malloc, so it is probably the easiest to integrate. Zstd needs a little bit more from the runtime, a few memory allocation functions (https://github.com/facebook/zstd/blob/329169189c248696c197ef9b88eb97c315bfb831/lib/common/zstd_deps.h), and there is a script to remove other dependencies (https://github.com/facebook/zstd/blob/f302ad8811643c428c4e3498e28f53a0578020d3/contrib/freestanding_lib/freestanding.py), which can be used to build something suitable to run inside the Linux kernel (https://github.com/facebook/zstd/blob/f302ad8811643c428c4e3498e28f53a0578020d3/contrib/linux-kernel/Makefile).
The purpose of
I agree with you that the API provided by

As @edwintorok mentioned, we would mainly like to still be able to compile the OCaml distribution in a freestanding environment (again, with our

More specifically, our real goal is to compile
That's a typical case where a simple hash-consing of values, on the OCaml side, could be beneficial. This could even be done at the level of the lexer/parser, so that multiple occurrences of the same identifier are physically shared (also in the Parsetree, which would reduce the size of the data structure passed to external ppxs). But: even if a string is shared, references to that string from multiple places in the Typedtree will result in different binary encodings. This is because references to previously-seen objects are done using relative offsets, not absolute ones, and this is what #4056 addresses.
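The simple hash-consing mentioned here can be sketched in a few lines: a table maps each string to a canonical copy, so structurally equal identifiers become physically equal, and Marshal (which detects physical sharing with its default flags) emits one payload plus back-references. The `share` helper is a hypothetical name.

```ocaml
(* Sketch: naive hash-consing of strings, so equal identifiers are
   physically shared.  [share] is a hypothetical helper. *)

let share : (string, string) Hashtbl.t -> string -> string =
  fun tbl s ->
  match Hashtbl.find_opt tbl s with
  | Some s' -> s'                  (* reuse the canonical copy *)
  | None -> Hashtbl.add tbl s s; s

let () =
  let tbl = Hashtbl.create 16 in
  (* Build two structurally equal but physically distinct strings. *)
  let a = share tbl (String.concat "" ["Texp"; "_ident"])
  and b = share tbl (String.concat "" ["Te"; "xp_ident"]) in
  assert (a == b)                  (* physically equal after sharing *)
```

As the comment above notes, this shrinks the in-memory value, but with the current marshal format the back-references from different places still encode differently due to relative offsets.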
|
I made a POC for marshaling with compression: ocaml/ocaml#12006 |
|
I made an experiment replacing all |
|
I experimented with pre-trained dictionaries for compression of

So: from 18.7MiB to 18.0MiB, but from 0.54s of user time to 0.76s.
|
What happens if you compress only the small .cmi files? That's where you should see the biggest relative space savings. Other than that, if you want slightly smaller files at the cost of longer compression times, you can just increase the compression level of ZSTD, no need for a fancy pre-trained dictionary :-) |
|
The results on small files are similarly inconclusive or worse. Note: whether or not a dictionary is used, it is
I think you're misreading the percentages. They are "size compressed / size uncompressed", so the lower the percentage the more compression. |
|
I wouldn't worry about pre-training dictionaries: they won't benefit large files (according to the manpage, training uses just the first 128KiB), and for small files the savings would be minimal (I've done some similar experiments with similarly disappointing results). And the disadvantage is that you now have to somehow ship this pre-trained dictionary with the compiler, and if you change it you won't be able to read back old marshaled data. If you really want to save space on small .cmi files, a much more effective way is to store them all in a bigger cmi archive: that would go from 328KiB (compressing each file individually, but note that according to 'du' the smallest files can't take up less than 4K) to 162KiB. Although this might introduce some unnecessary complexity in the toolchain as a whole, and accessing individual .cmi-s might be slower since you might have to uncompress everything before them to reach them... so I don't think it is worth going down this route.
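The archive idea could look like the following made-up length-prefixed entry format (the layout is invented purely for illustration; the resulting file would then be compressed as one stream, which is where the quoted savings come from).

```ocaml
(* Sketch: a made-up length-prefixed archive layout for bundling many
   small .cmi payloads into one file, to be compressed as a whole. *)

(* Each entry: "<name-length> <data-length>\n" then name then data. *)
let write_entry oc ~name ~data =
  Printf.fprintf oc "%d %d\n%s%s"
    (String.length name) (String.length data) name data

let read_entry ic =
  let line = input_line ic in
  Scanf.sscanf line "%d %d" (fun nlen dlen ->
      let name = really_input_string ic nlen in
      let data = really_input_string ic dlen in
      (name, data))
```

The downside mentioned above is visible in this layout: to reach the n-th entry you must read (and, once compressed, decompress) everything before it.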
This sounds like a big hassle indeed. Let's forget about pre-trained dictionaries, then. |
|
|
I went ahead and merged ocaml/ocaml#12006, which provides:
I'm moving to close this RFC, which definitely succeeded in gathering descriptions and exploring options (along with @lefessant's blog post as a recent motivator to revisit this question), and is now basically "done". Thanks to everyone involved. (There are more things to explore in terms of tweaking the internal compiler datatypes to get more compact representations, with and without compression, as restarted by @alainfrisch.)