
Workaround #12238 by using large lexing buffers #12396

Closed

gasche wants to merge 1 commit into ocaml:trunk from gasche:larger-lexing-buffer


Conversation

@gasche
Member

@gasche gasche commented Jul 20, 2023

Some of our error printing styles print the source code at the location of the error. This source is found either in the lexing buffer, when available (that is, when it has not been discarded to make room for more source code), or else by trying to re-open the source file. Re-opening the source file is not so reliable; in particular, it fails in the presence of preprocessor directives (the user-facing locations we have do not necessarily refer to real locations in the input file).

We propose to work around this issue by simply using large lexing buffers by default, so that the vast majority of programs keep the whole source input in the lexing buffer and the unreliable fallback is rarely used.

@gasche
Member Author

gasche commented Jul 20, 2023

Note: it would be nice to also ensure that we don't print nonsensical code for larger source files -- we could either disable the fallback of re-reading the source file entirely, or disable it only when lexer directives are used. This is considerably more work than the current workaround, so I think a separate PR would be better.

@xavierleroy
Contributor

You know you're getting desperate when you increase the size of buffers to make a bug go away :-)

If you really want to go this way, I'd suggest reading the source code into a string (In_channel.input_all) and then running the lexer over this string (Lexing.from_string). At least the memory usage will be proportional to the size of the source code, instead of being absurdly high for tiny source files.

(For what it's worth: after much back and forth, the CompCert compiler also settled on reading the preprocessed C source code into a string before lexing and parsing, although I can't remember all the reasons why; but there were several.)
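A minimal sketch of the suggestion above, assuming OCaml 4.14+ for In_channel.input_all; the helper name lexbuf_of_file is hypothetical, not from the compiler:

```ocaml
(* Read the whole source file into a string, then lex from that string.
   Memory usage is proportional to the file size, and the lexer never
   discards earlier input, so error printing can always find it. *)
let lexbuf_of_file file =
  let source = In_channel.with_open_bin file In_channel.input_all in
  let lexbuf = Lexing.from_string source in
  (* Make reported positions refer to the real file name. *)
  Lexing.set_filename lexbuf file;
  lexbuf
```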

@gasche
Member Author

gasche commented Jul 20, 2023

I also considered this approach, but I was somewhat worried about the idea of unbounded memory usage for files that are larger than 256 KiB. In the compiler we have a menhir-generated parser.ml file clocking in at 2.4 MiB. Do we want a cutoff based on the file size, or are we happy with using a lot of memory on very large files? (As discussed in #12238 (comment), very large files also produce large build artifacts, so I guess their live memory usage is high anyway, but the build artifacts seem to remain smaller than the source code in practice.)
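One possible shape for the size cutoff being asked about here, assuming OCaml 4.14+ for In_channel; the function name and the 256 KiB default are illustrative, not from any actual patch:

```ocaml
(* Slurp the whole file only when it is below a size threshold; return
   None for larger files so the caller can fall back to incremental
   (fixed-buffer) lexing instead of holding everything in memory. *)
let read_source_if_small ?(limit = 256 * 1024) file =
  In_channel.with_open_bin file (fun ic ->
    if In_channel.length ic <= Int64.of_int limit
    then Some (In_channel.input_all ic)
    else None)
```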

@alainfrisch
Contributor

Do we want a cutoff based on the file size, or are we happy with using a lot of memory on very large files?

My intuition is that the compiler will need much more than X of RAM to process a file of size X (especially on 64-bit platforms); the extra cost of keeping the full buffer in RAM seems definitely OK to me.

@yallop
Member

yallop commented Jul 20, 2023

My intuition is that the compiler will need much more than X of RAM to process a file of size X

Yes, this is certainly true in general. Out of curiosity, I added code to the compiler to print out the file size and typedtree size (based on Obj.reachable_words) during compilation. A few examples, picked at random:

option.ml: file is 2104 bytes; typedtree is 530968 bytes (252x larger)
lambda/translcore.ml: file is 47477 bytes; typedtree is 4906824 bytes (103x larger)
typing/includemod.ml: file is 46043 bytes; typedtree is 5687624 bytes (124x larger)

It's pretty clear that reading the whole file into memory for lexing is unlikely to cause memory usage problems (except on improbably large files consisting mostly of whitespace and comments, I suppose).
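A hedged reconstruction of the measurement described above: Obj.reachable_words counts the heap words (including block headers) reachable from a value, which can be converted to bytes using the platform word size. The helper name heap_bytes is hypothetical.

```ocaml
(* Approximate heap footprint, in bytes, of an OCaml value.
   Obj.reachable_words returns a word count; Sys.word_size is in bits,
   so dividing by 8 gives the number of bytes per word. *)
let heap_bytes v = Obj.reachable_words (Obj.repr v) * (Sys.word_size / 8)
```

Applying something like this to the typedtree of a compilation unit and dividing by the file size on disk would yield ratios of the kind quoted above.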

@gasche
Member Author

gasche commented Jul 20, 2023

Thanks! I will try to write a follow-up PR that does this.

@gasche
Member Author

gasche commented Jul 21, 2023

In #12403 I propose to read the whole source file in one go before lexing.
