Deprecate the Out_of_memory exception #8628
…f-memory conditions.
|
Cc @mshinwell, @damiendoligez |
|
I may sound like a broken record, but would it be possible to change |
|
As I said in #1926 (review), there are in fact several kinds of out-of-memory and some of them are rather easy to foresee and to handle. I'm not convinced that it's a good idea to turn those into fatal errors. I fully agree with @dbuenzli's suggestion to change the |
The issue with OOM is that it can trigger in a place which is completely independent of the place where the memory leak is. So, I would say that being able to handle some OOM conditions but not others does not make much more sense than the current situation where the GC unreliably raises an OOM or a fatal error.
Can do. But this is largely independent of this PR (there are other kinds of fatal errors, which could use |
I wouldn't say so, with this PR it becomes quite important not to exit with A lot of CLI-based tools (e.g. cmdliner-based ones) have a catch-all handler and exit with a special code in case of an uncaught exception. Prior to this PR, the cases @damiendoligez mentions, if unhandled, would turn into this exit code. After this PR they will turn into 2, which is pretty bad since these cases can easily happen due to programming errors.
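To make the catch-all pattern concrete, here is a minimal sketch. The exit code 125 and the helper name run_with_catch_all are assumptions made for this illustration; they do not come from cmdliner or from this PR. An exception that reaches the handler is mapped to the special code, whereas a fatal error raised by the runtime would end the process before the handler can run.

```ocaml
(* Hypothetical exit code reserved for uncaught exceptions; the value
   125 is an assumption made for this sketch. *)
let exit_uncaught = 125

(* Run [main] and map its outcome to an exit code. Any exception that
   reaches this handler becomes [exit_uncaught]. A fatal error in the
   runtime never reaches the handler: the process ends with whatever
   code the runtime picks. *)
let run_with_catch_all (main : unit -> int) : int =
  try main () with
  | e ->
    prerr_endline ("uncaught exception: " ^ Printexc.to_string e);
    exit_uncaught

let () =
  (* Simulate an out-of-memory condition reported as an exception. *)
  let code = run_with_catch_all (fun () -> raise Out_of_memory) in
  assert (code = exit_uncaught)
```

A real program would pass the resulting code to exit; the sketch only computes it so that the mapping is easy to check.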
His second message in this discussion actually advocates for it. |
|
I have just submitted PR #8630, which replaces The remaining questions are:
1- Do we want to stop raising the OOM exception asynchronously when the GC can no longer allocate? Nobody has objected here, but I know that @xavierleroy was not particularly in favor when this was discussed in #1926.
2- Do we want to entirely deprecate |
|
Another thought for the mix is whether it should be compulsory for the program to abort following out of memory - can we not instead insist on tearing down the runtime? I'm thinking of the case of a C program which wraps OCaml and which might want to do something different if OCaml runs out of memory. One possibility would be to keep |
|
Rewording @dra27's thought: when you use the OCaml system as a library you don't expect your library to either So isn't what @dra27 suggests something that should happen for all instances of |
|
@dbuenzli - indeed, it could (should) be solved that way in general. But I was wondering given that at present you can (I think) "catch" the |
|
Fun fact: we have been using the #1926 patch in prod for months now, and we were actually discussing yesterday the possibility of switching back from abort to exit or SIGKILL. But it only makes sense in the context of large server-side applications. Basically, if you have a 200GB process that aborts and dumps core, then your FS is full and you have two problems... The point is, very different environments require different handling, and for this reason I think @dbuenzli's proposal makes a lot of sense. |
|
@jorisgio : but in that case why wouldn't you simply deactivate coredumps? If your FS is too small for storing the core dumps, then there is no point in keeping them at all... |
As someone who cares about high-assurance software, fatal, unrecoverable errors still make me sad. I believe that even mortal sins such as running out of memory are amenable to redemption, in the form of a well-placed exception handler. Redemption doesn't always work, but when it does it's nice. You're taking a different viewpoint, namely that sinners should core dump and burn in hell. That doesn't feel right to me. |
Think of toplevel interactive use. You really want jokers (or students) who enter to be greeted with a core dump? |
|
@xavierleroy, to be fair if you care about high-assurance software you also need to care and design for fatal, irrecoverable errors as they do exist anyways. I personally don't have a strong opinion at the moment on whether we should get rid of that exception but as suggested by @jhjourdan above in his point 2. keeping it at obvious allocation points ( However as we saw in #2118 these asynchronous exceptions ( So I'm wondering whether these errors would not be better served by some form of OCaml user-definable continuation trap to which the current backtrace is given and which allows restarting the program from there. In the particular toplevel case this would allow trying to reinvoke the read-eval-print loop on out of memory and/or stack overflow errors.
You may say that this does not change much w.r.t. a catch-all global handler. However, it allows one to clearly distinguish the error handling mechanisms for runtime system errors and user program errors. Somehow currently these two are fused together and user programs constantly need to deal with the idea that the runtime system may fail (as manifested by the folk knowledge that you should never write a catch-all exception handler) but with absolutely no gain in doing so since nothing really reliable can be done when these happen. |
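For readers unfamiliar with the folk rule about catch-all handlers, this is roughly what it looks like in user code. It is only a sketch: the names is_runtime_error and try_user are hypothetical, and the choice of which exceptions count as runtime errors is an assumption made for the illustration.

```ocaml
(* Exceptions raised by the runtime system rather than by user code.
   Which exceptions belong here is an assumption of this sketch. *)
let is_runtime_error = function
  | Out_of_memory | Stack_overflow -> true
  | _ -> false

(* Evaluate [f ()], turning user-level exceptions into [Error] while
   letting runtime errors propagate to an outer handler. *)
let try_user (f : unit -> 'a) : ('a, exn) result =
  match f () with
  | v -> Ok v
  | exception e when is_runtime_error e -> raise e
  | exception e -> Error e
```

Every handler in a program has to repeat this kind of filtering, which is the fusion of runtime errors and user errors that the comment above objects to.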
|
I'm aware that these "asynchronous" exceptions are not perfect and have some issues of their own. I'd be happy to learn about possible alternatives to handle out-of-memory conditions. But replacing those exceptions by fatal errors is throwing the baby out with the bathwater. |
|
Also, before reproducing it here, please see the extensive discussion for #1926, especially #1926 (comment) on how Rust does it. |
|
I want to preface this by saying that it seems that everybody agrees that the current situation is not very good, that the problem of offering good support for handling of Out_of_memory is a very difficult one, and I think it was probably without a good solution until Rust came along with its exception-safety model. I agree with @jhjourdan that one should not keep features with which one cannot write reliable programs at all, but I am more overtly optimistic than @xavierleroy that the current situation can lead with some efforts to an
For the purpose of resource management it is important to distinguish asynchronous exceptions (e.g. Sys.Break or a possible thread-killing exception) from synchronous-but-unexpected ones such as Out_of_memory, as the possible solutions are slightly different. For Out_of_memory, it's easier in certain aspects and harder in others (but not impossible in theory).
In my understanding, it's: either at the toplevel, or with no guarantees about the state of accessed mutable data. That's because at the toplevel it's easy to check that nothing invalid escapes. In fact, another example where this is reliable is at the bottom-level: if Out_of_memory is thrown reliably, then it is possible to implement e.g. More generally, whenever you can guarantee that nothing invalid escapes, then you are allowed to catch Out_of_memory (but only with a catch-all, see discussion at #2118). This is difficult to ensure: think of a program that may be interrupted in the middle of writing to memory that is shared between threads (even as part of a safe abstraction).
The discussion at #1926 shows that catching Out_of_memory for a sub-component is desirable for some important use-cases; there is even more empirical evidence coming from open discussions in the Rust community. Rust's exception-safety model pushes further this idea of isolating components at boundaries, as opposed to the C++-style exception-safety model where the user has to ensure the validity of invariants at all points in time (which is already very inconvenient for expected exceptions). Rust offers language support for it: “poison” RAII guards are used to propagate the information of failure to other users of the shared data and solve the problem I just mentioned (so resource management features actually help for isolation), and an This exception-safety model could be adapted for OCaml's “serious” exceptions. Then, the case of Out_of_memory is even more specific and a bit harder in functional programming, due to the issues of small allocations and non-locality, as discussed in #1926.
Given that there's a long road ahead if one wants to offer satisfactory support for Out_of_memory, my advice is to decide what would be the ideal design, then decide on the best trajectory to reach that design, and accept that one has an imperfect solution in the meantime (still better than nothing).
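The idea of poisoning can be sketched in OCaml as follows. This is a simplified, single-threaded analogue written for illustration only: the names guarded, make, with_guard and Poisoned are hypothetical, and the sketch involves no lock, unlike Rust's mutex guards.

```ocaml
(* Raised when accessing data whose last update failed midway. *)
exception Poisoned

(* Shared data plus a flag recording whether an access raised. *)
type 'a guarded = { value : 'a; mutable poisoned : bool }

let make v = { value = v; poisoned = false }

(* Apply [f] to the protected value. If [f] raises, mark the cell as
   poisoned and re-raise; every later access fails with [Poisoned],
   so a possibly broken invariant cannot be observed silently. *)
let with_guard (g : 'a guarded) (f : 'a -> 'b) : 'b =
  if g.poisoned then raise Poisoned;
  match f g.value with
  | r -> r
  | exception e -> g.poisoned <- true; raise e
```

For instance, if an update to a guarded table is interrupted by Out_of_memory, the next user of that table gets Poisoned instead of reading a half-written structure.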
With this imperfect situation in mind, the proposal at #1926 to optionally abort on out of memory sounds sensible to me. Historically, Rust's model came as an evolution of Erlang's “let it fail” model for writing reliable distributed systems, which Rust brought to shared-memory concurrency thanks to these two new concepts. Joe Armstrong realised that in order to build reliable scalable systems, one has to incorporate the possibility of unexpected failure. Sadly, Joe Armstrong passed away last Saturday. In memory of his life and achievements, one can listen to or read his panel discussion with Tony Hoare and Carl Hewitt, his Erlang paper in CACM, and his PhD thesis.
Indeed, this is still in the same spirit, but an exception still allows you to clean up resources with unwind-protect, and a global continuation trap is non-compositional, so it is not very good language support for the "let it fail" philosophy.
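As an illustration of the unwind-protect point, here is a minimal sketch using Fun.protect from the standard library (available since OCaml 4.08). The helper with_resource and its labelled arguments are hypothetical names chosen for this example.

```ocaml
(* Run [work] on a resource that is released even if [work] raises a
   serious exception such as [Out_of_memory]. *)
let with_resource ~(acquire : unit -> 'r) ~(release : 'r -> unit)
    (work : 'r -> 'a) : 'a =
  let r = acquire () in
  Fun.protect ~finally:(fun () -> release r) (fun () -> work r)

let () =
  let released = ref false in
  (match
     with_resource
       ~acquire:(fun () -> ())
       ~release:(fun () -> released := true)
       (fun () -> raise Out_of_memory)
   with
   | () -> ()
   | exception Out_of_memory -> ());
  (* The cleanup ran although the work was interrupted. *)
  assert !released
```

Because the cleanup is attached where the resource is acquired, such wrappers nest and compose, which a single global trap cannot offer.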
I agree with these two different but important problems:
These concern more generally serious exceptions, rather than just Out_of_memory.
The discussion in #1926 identifies two distinct issues and solutions:
Again, this is less to say “we should do this” than to say “I agree with the diagnosis and here's what it would take, to the best of my knowledge”.
These comments seem to argue in favour of abort in the name of “let it fail”, but their solution seems to relegate the question of language support to the “meta”. To me, they seem to paint a programming language which is not scalable. |
|
The suggestion that the ideal solution involves evolving OCaml to have a notion of isolated task seems strange to me. We have a notion of isolated task: the process. From that perspective the current proposal seems to line up correctly with the position that out of memory conditions should only be detected for isolated tasks. |
|
It is indeed possible that OCaml multicore will have a reliable concurrency and parallelism model centred around small OCaml processes which communicate via message passing. Leo and I seem to agree on this: before deciding the fate of |
|
Since we are far from a consensus on this question, and since my solution seems to be even less consensual, I am closing this PR. |
|
It sounds like you are abandoning the idea of sanitizing OOM. From online and offline discussions it seemed to me that a consensus was likely to be reached around the approach in #1926 (at least for a starter). The latter was closed because it was "superseded" but I do not see any more PR to sanitize OOM. Are there still plans to explore the latter approach? If not, should we reopen #1926, or open a different PR based on it? |
To be honest, what I feel here is that people don't agree on what to do, and I am not motivated to lead the debate for finding the right thing to do, since I am personally not impacted by this somewhat weird behavior. But if you feel comfortable with leading such a debate and finally write a PR which implements what will come out of this, that would be great! |
|
I think we're almost there, but it looks more complicated than it is because two questions have been mixed in these discussions:
We all agree that the first question is complicated. I described what I think is one solution. Besides the design issues, I do not minimize the engineering efforts required. I hear people who say that backing out of a failed minor collection is hard to implement. I'll be happy to continue discussing it with people who are interested. I just think it is premature to address it. To begin with, a prerequisite is to agree on how to reconcile unexpected exceptions with safe resource management, a proposal about which is only just being written. In a while maybe, after multicore is successful, resource management issues will be more prevalent, and more interested people will be willing to throw resources at gracefully recovering from OOM. The second concern has been raised in #1926 by an industrial user, and it looks simpler to address. I think the question also covers many of your good points. I will continue the discussion on that other PR, focusing on that specific concern. |
|
I might have been too quick to close #1926; feel free to reopen it. I had the feeling consensus was being reached on this one, but apparently I was wrong. For what it's worth, we have been using a patched OCaml since then that allows custom handling of OOM. I do not know if we are the only ones affected by this problem at this point and if adding a user-defined callback to tweak the behavior is useful enough. |
This PR proposes to replace, in the runtime, the use of the Out_of_memory exception with the use of a fatal error, which ends the program.
The rationale for this change is the following:
- Out_of_memory exceptions can occur in an asynchronous manner, which makes them particularly difficult to handle in a reliable manner.
- Replacing Out_of_memory with a fatal error makes it easier to handle out-of-memory conditions in the runtime. Instead of propagating errors until reaching a place where it is safe to raise the exception, we can usually show a fatal error message and exit the program immediately.
This change will also make #847 simpler by simplifying the alloc_shr API.