
Workaround #12238 by using large lexing buffers #12396

Closed

gasche wants to merge 1 commit into ocaml:trunk from gasche:larger-lexing-buffer


Conversation

@gasche
Member

@gasche gasche commented Jul 20, 2023

Some of our error printing styles print the source code at the location of the error. This source is found either in the lexing buffer, when available (that is, when it has not been discarded to make room for more source code), or else by trying to re-open the source file. Re-opening the source file is not so reliable; in particular, it fails in the presence of preprocessor directives (the user-facing locations we have do not necessarily refer to real locations in the input file).

We propose to work around this issue by simply using large lexing buffers by default, so that the vast majority of programs keep the whole source input in the lexing buffer and the unreliable fallback is rarely used.

@gasche
Member Author

gasche commented Jul 20, 2023

Note: it would be nice to also ensure that we don't print nonsensical code for larger source files -- we could either disable the fallback of re-reading the source file entirely, or disable it only when lexer directives are used. This is considerably more work than the current workaround, so I think a separate PR would be better.

@xavierleroy
Contributor

You know you're getting desperate when you increase the size of buffers to make a bug go away :-)

If you really want to go this way, I'd suggest reading the source code into a string (In_channel.input_all) and then running the lexer over this string (Lexing.from_string). At least the memory usage will be proportional to the size of the source code, instead of being absurdly high for tiny source files.

(For what it's worth: after much back and forth, the CompCert compiler also settled on reading the preprocessed C source code into a string before lexing and parsing, although I can't remember all the reasons why; but there were several.)
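A minimal sketch of the suggestion above, assuming OCaml 4.14+ for In_channel.input_all; the helper name lexbuf_of_file is hypothetical, not from the compiler:

```ocaml
(* Read the whole source file into a string, then lex from that string.
   Memory usage is proportional to the file size, and the lexer never
   discards earlier input, so error printing can always find it. *)
let lexbuf_of_file file =
  let source = In_channel.with_open_bin file In_channel.input_all in
  let lexbuf = Lexing.from_string source in
  (* Make reported positions refer to the real file name. *)
  Lexing.set_filename lexbuf file;
  lexbuf
```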

@gasche
Member Author

gasche commented Jul 20, 2023

I also considered this approach, but I was somewhat worried about the idea of unbounded memory usage for files that are larger than 256 KiB. In the compiler we have a menhir-generated parser.ml file clocking in at 2.4 MiB. Do we want a cutoff based on the file size, or are we happy with using a lot of memory on very large files? (As discussed in #12238 (comment), very large files also produce large build artifacts, so I guess their live memory usage is high anyway, but the build artifacts seem to remain smaller than the source code in practice.)
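One possible shape for the size cutoff being asked about here, assuming OCaml 4.14+ for In_channel; the function name and the 256 KiB default are illustrative, not from any actual patch:

```ocaml
(* Slurp the whole file only when it is below a size threshold; return
   None for larger files so the caller can fall back to incremental
   (fixed-buffer) lexing instead of holding everything in memory. *)
let read_source_if_small ?(limit = 256 * 1024) file =
  In_channel.with_open_bin file (fun ic ->
    if In_channel.length ic <= Int64.of_int limit
    then Some (In_channel.input_all ic)
    else None)
```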

@alainfrisch
Contributor

Do we want a cutoff based on the file size, or are we happy with using a lot of memory on very large files?

My intuition is that the compiler will need much more than X of RAM to process a file of size X (especially on 64-bit platforms); the extra cost of keeping the full buffer in RAM seems definitely OK to me.

@yallop
Member

yallop commented Jul 20, 2023

My intuition is that the compiler will need much more than X of RAM to process a file of size X

Yes, this is certainly true in general. Out of curiosity, I added code to the compiler to print out the file size and typedtree size (based on Obj.reachable_words) during compilation. A few examples, picked at random:

option.ml: file is 2104 bytes; typedtree is 530968 bytes (252x larger)
lambda/translcore.ml: file is 47477 bytes; typedtree is 4906824 bytes (103x larger)
typing/includemod.ml: file is 46043 bytes; typedtree is 5687624 bytes (124x larger)

It's pretty clear that reading the whole file into memory for lexing is unlikely to cause memory usage problems (except on improbably large files consisting mostly of whitespace and comments, I suppose).
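A hedged reconstruction of the measurement described above: Obj.reachable_words counts the heap words (including block headers) reachable from a value, which can be converted to bytes using the platform word size. The helper name heap_bytes is hypothetical.

```ocaml
(* Approximate heap footprint, in bytes, of an OCaml value.
   Obj.reachable_words returns a word count; Sys.word_size is in bits,
   so dividing by 8 gives the number of bytes per word. *)
let heap_bytes v = Obj.reachable_words (Obj.repr v) * (Sys.word_size / 8)
```

Applying something like this to the typedtree of a compilation unit and dividing by the file size on disk would yield ratios of the kind quoted above.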

@gasche
Member Author

gasche commented Jul 20, 2023

Thanks! I will try to write a follow-up PR that does this.

@gasche
Member Author

gasche commented Jul 21, 2023

In #12403 I propose to read the whole source file in one go before lexing.
