Workaround #12238 by using large lexing buffers #12396
gasche wants to merge 1 commit into ocaml:trunk
Conversation
Some of our error printing styles print the source code at the location of the error. This source is found either in the lexing buffer when available (it has not been discarded to make room for more source code), or else by trying to re-open the source file. Re-opening the source file is not so reliable; in particular, it fails in the presence of preprocessor directives (the user-facing locations we have do not necessarily refer to real locations in the input file). We propose to work around this issue by simply using large lexing buffers by default, so that the vast majority of programs keep the whole source input in the lexing buffer and the unreliable fallback is rarely used.
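To illustrate the mechanism described above, here is a minimal sketch, not the compiler's actual code: `source_at` is a hypothetical helper that tries to recover source text at an absolute position from a `Lexing.lexbuf`, returning `None` when that region of the buffer has already been discarded (which is when the unreliable re-open-the-file fallback would kick in).

```ocaml
(* Sketch (hypothetical helper, not the compiler's API): recover [len]
   bytes of source starting at absolute position [pos] from the lexing
   buffer, if that region has not yet been discarded to make room for
   more input. *)
let source_at (lexbuf : Lexing.lexbuf) ~pos ~len =
  let open Lexing in
  (* [lex_abs_pos] is the absolute position of the start of the buffer *)
  let start = pos - lexbuf.lex_abs_pos in
  if start >= 0 && start + len <= lexbuf.lex_buffer_len then
    Some (Bytes.sub_string lexbuf.lex_buffer start len)
  else
    None  (* discarded: caller must fall back to re-opening the file *)
```

With a large enough buffer, most source files fit entirely, so a lookup like this almost always succeeds and the fallback is rarely exercised.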
Note: it would be nice to also ensure that we don't print nonsensical code for larger source files -- we could either disable the fallback of re-reading the source completely, or try to disable it conditionally when lexer directives are used. This is significantly more work than the current workaround, so I think a separate PR would be better.
You know you're getting desperate when you increase the size of buffers to make a bug go away :-) If you really want to go this way, I'd suggest reading the source code into a string. (For what it's worth: after much back and forth, the CompCert compiler also settled on reading the preprocessed C source code into a string before lexing and parsing, although I can't remember all the reasons why; but there were several.)
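A minimal sketch of the alternative suggested here, assuming hypothetical helpers `read_file` and `lexbuf_of_file` (neither is compiler code): slurp the whole source file into a string, then lex from that string, so the full source stays available for error printing.

```ocaml
(* Sketch (hypothetical helpers): read the whole file into a string
   before lexing, instead of streaming it through a bounded buffer. *)
let read_file path =
  let ic = open_in_bin path in
  Fun.protect ~finally:(fun () -> close_in ic)
    (fun () -> really_input_string ic (in_channel_length ic))

let lexbuf_of_file path =
  let source = read_file path in
  let lexbuf = Lexing.from_string source in
  (* make positions report the file name rather than the empty string *)
  Lexing.set_filename lexbuf path;
  (source, lexbuf)
```

Keeping the returned `source` string alongside the lexbuf means error printers can always slice the exact text of any location, with no fallback needed.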
I also considered this approach, but I was somewhat worried about the idea of unbounded memory usage for files that are larger than 256 KiB. In the compiler we have a menhir-generated parser.ml file clocking in at 2.4 MiB. Do we want a cutoff based on the file size, or are we happy with using a lot of memory on very large files? (As discussed in #12238 (comment), very large files also produce large build artifacts, so I guess their live memory usage is high anyway, but the build artifacts seem to remain smaller than the source code in practice.)
My intuition is that the compiler will need much more than X of RAM to process a file of size X (especially so on 64-bit platforms); the extra cost of keeping the full buffer in RAM seems definitely ok to me.
Yes, this is certainly true in general. Out of curiosity, I added code to the compiler to print out the file size and typedtree size. It's pretty clear that reading the whole file into memory for lexing is unlikely to cause memory usage problems (except on improbably large files consisting mostly of whitespace and comments, I suppose).
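The measurement code used in this experiment is not shown in the thread; the following is only a sketch of how such a comparison could be done with the standard library alone, using `Obj.reachable_words` (available since OCaml 4.04) to approximate a value's heap footprint. Both helpers are hypothetical names.

```ocaml
(* Sketch (hypothetical helpers, not the code used in the experiment):
   compare a file's size on disk with the heap footprint of an
   in-memory value such as a typedtree. *)
let file_size path =
  let ic = open_in_bin path in
  Fun.protect ~finally:(fun () -> close_in ic)
    (fun () -> in_channel_length ic)

let heap_bytes (value : 'a) =
  (* [Obj.reachable_words] counts the heap words reachable from a
     value, including block headers; convert words to bytes. *)
  Obj.reachable_words (Obj.repr value) * (Sys.word_size / 8)
```

Comparing `file_size path` against `heap_bytes typedtree` for a range of modules would give roughly the numbers discussed above.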
Thanks! I will try to write a follow-up PR that does this. |
In #12403 I propose to read the whole source file in one go before lexing. |