
Deprecate the Out_of_memory exception #8628

Closed
jhjourdan wants to merge 5 commits into ocaml:trunk from jhjourdan:no_oom

Conversation

@jhjourdan
Contributor

This PR proposes to replace, in the runtime, the use of the Out_of_memory exception with a fatal error, which terminates the program.

The rationale for this change is the following:

  • In the current state, an out-of-memory condition is not guaranteed to generate this exception. If it occurs during a minor collection, a fatal error occurs anyway.
  • Out_of_memory exceptions can occur in an asynchronous manner, which makes them particularly difficult to handle reliably.
  • Getting rid of Out_of_memory makes it easier to handle out-of-memory conditions in the runtime. Instead of propagating errors until reaching a place where it is safe to raise the exception, we can usually print a fatal error message and exit the program immediately.
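The unreliability described in the first bullet can be seen in the common pattern of guarding a large allocation; a minimal, hypothetical sketch (`try_alloc` is not a stdlib function):

```ocaml
(* Sketch of the pattern the PR argues against: catching Out_of_memory
   around a single large allocation. The handler runs only when the
   runtime is able to raise the exception; an allocation failure during
   a minor collection is a fatal error regardless of this handler. *)
let try_alloc n : Bytes.t option =
  try Some (Bytes.create n) with Out_of_memory -> None
```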

This change will also make #847 simpler by simplifying the alloc_shr API.

@jhjourdan
Contributor Author

Cc @mshinwell , @damiendoligez

@dbuenzli
Contributor

I may sound like a broken record, but caml_fatal_error exits with code 2. This leads to confusion over exit-code semantics when user programs use Stdlib.exit with well-defined exit codes.

Would it be possible to change caml_fatal_error to use abort instead (i.e. the process terminates with SIGABRT)? Or at least to use a well-defined high exit number?

@nojb
Contributor

nojb commented Apr 19, 2019

Related to #1926 and #7185.

@damiendoligez
Member

As I said in #1926 (review), there are in fact several kinds of out-of-memory and some of them are rather easy to foresee and to handle. I'm not convinced that it's a good idea to turn those into fatal errors.

I fully agree with @dbuenzli's suggestion to change the exit(2) into abort().

@jhjourdan
Contributor Author

As I said in #1926 (review), there are in fact several kinds of out-of-memory and some of them are rather easy to foresee and to handle. I'm not convinced that it's a good idea to turn those into fatal errors.

The issue with OOM is that it can trigger in a place which is completely independent of the place where the memory leak is. So, I would say that being able to handle some OOM conditions but not others does not make much more sense than the current situation, where the GC unreliably raises either an OOM or a fatal error.

I fully agree with @dbuenzli's suggestion to change the exit(2) into abort().

Can do. But this is largely independent of this PR (there are other kinds of fatal errors, which could use abort or exit), and @mshinwell already advocated for not using abort in #1926. I personally don't care.

@dbuenzli
Contributor

Can do. But this is largely independent of this PR (there are other kinds of fatal errors, which could use abort or exit),

I wouldn't say so, with this PR it becomes quite important not to exit with 2.

A lot of CLI-based programs (e.g. cmdliner-based ones) have a catch-all handler and exit with a special code in case of an uncaught exception.

Prior to this PR, the cases @damiendoligez mentions, if unhandled, would turn into this exit code. After this PR they will turn into 2, which is pretty bad, since these cases can easily happen due to programming errors.

@mshinwell already advocated for not using abort in #1926. I personally don't care.

His second message in this discussion actually advocates for it.

@jhjourdan
Contributor Author

jhjourdan commented Apr 19, 2019

I have just submitted PR #8630, which replaces exit(2) with abort().

The remaining questions are:

1- Do we want to stop raising the OOM exception asynchronously when the GC can no longer allocate? Nobody has objected here, but I know that @xavierleroy was not particularly in favor when discussed in #1926.

2- Do we want to entirely deprecate Out_of_memory and stop raising it even in functions such as Array.make, Gc.set, ...? I personally think that these cases cannot be handled reliably anyway, and that making this distinction makes the behavior less predictable. A successful call to, e.g., Array.make can fill the heap and make subsequent allocations impossible, hence producing a fatal error (even if we were trying to catch Out_of_memory). Conversely, a failing call to Array.make can reveal that the heap is already full.

@dra27
Member

dra27 commented Apr 20, 2019

Another thought for the mix is whether it should be compulsory for the program to abort following out of memory. Can we not instead insist on tearing down the runtime? I'm thinking of the case of a C program which wraps OCaml and might want to do something different if OCaml runs out of memory.

One possibility would be to keep caml_raise_out_of_memory and have it invoke a function via a pointer which can be overridden in C and simply document that caml_raise_out_of_memory now does something entirely uncatchable in OCaml-land.

@dbuenzli
Contributor

Rewording @dra27's thought: when you use the OCaml system as a library, you don't expect that library to either exit or abort.

So isn't what @dra27 suggests something that should happen for all instances of caml_fatal_error? That is, this PR could be kept as is, and caml_fatal_error would do what it does now, except that the final abort could be user-redefined. Something should then also be done for the runtime-initiated exits that might remain after #8630.
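An OCaml-level sketch of the user-redefinable-handler idea (all names here are hypothetical; the real mechanism would be a C function pointer in the runtime): the final abort becomes a default that an embedder can replace.

```ocaml
(* Hypothetical sketch: a fatal-error hook with an abort-like default
   that an embedder can override. This models at the OCaml level what
   would really be a C function pointer in the runtime. *)
let fatal_error_hook : (string -> unit) ref =
  ref (fun msg -> prerr_endline ("Fatal error: " ^ msg); exit 2)

let set_fatal_error_hook f = fatal_error_hook := f

let fatal_error msg = !fatal_error_hook msg
```

A C program embedding OCaml could then install its own handler at startup instead of letting the runtime kill the process.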

@dra27
Member

dra27 commented Apr 20, 2019

@dbuenzli - indeed, it could (should) be solved that way in general. But given that at present you can (I think) "catch" the Out_of_memory exception from C, I was wondering whether both paths would want to remain, i.e. whether one should be able to distinguish out of memory from other kinds of "fatal" error.

@jorisgio

Fun fact: we have been using the #1926 patch in prod for months now, and we were actually discussing yesterday the possibility of switching back from abort to exit or SIGKILL.

But it only makes sense in the context of a large server-side application. Basically, if you have a 200GB process that aborts and core-dumps, then your FS is full and you have two problems...

The point is, there are very different environments requiring different handling, and for this reason I think @dbuenzli's proposal makes a lot of sense.

@jhjourdan
Contributor Author

@jorisgio: but in that case, why wouldn't you simply deactivate core dumps? If your FS is too small to store the core dumps, then there is no point in keeping them at all...

@xavierleroy
Contributor

xavierleroy commented Apr 21, 2019

I know that @xavierleroy was not particularly in favor when discussed in #1926.

As someone who cares about high-assurance software, fatal, unrecoverable errors still make me sad.

I believe that even mortal sins such as running out of memory are amenable to redemption, in the form of a well-placed exception handler. Redemption doesn't always work, but when it does it's nice. You're taking a different viewpoint, namely that sinners should core dump and burn in hell. That doesn't feel right to me.

@xavierleroy
Contributor

Do we want to entirely deprecate Out_of_memory and stop raising it even in function such as Array.make, Gc.set, ... ?

Think of toplevel interactive use. You really want jokers (or students) who enter

Array.make 1_000_000_000_000 42;;

to be greeted with a core dump?

@dbuenzli
Contributor

@xavierleroy, to be fair, if you care about high-assurance software you also need to care about and design for fatal, irrecoverable errors, as they exist anyway.

I personally don't have a strong opinion at the moment on whether we should get rid of that exception, but as suggested by @jhjourdan above in his point 2, keeping it at obvious allocation points (Array.make, etc.) does not even allow accurate detection of programming errors like the one you mention in the toplevel; it does, however, allow one to try to recover from them.

However, as we saw in #2118, these asynchronous exceptions (Out_of_memory, Stack_overflow) do not allow resources to be handled correctly, and in the end are only meant to be caught at the toplevel for a tentative recovery, with no guarantees about the state of the system when they happen.

So I'm wondering whether these errors would not be better served by some form of OCaml user-definable continuation trap, to which the current backtrace is given and which allows restarting the program from there. In the particular case of the toplevel, this would allow re-invoking the read-eval-print loop on out-of-memory and/or stack-overflow errors.

You may say that this does not change much w.r.t. a catch-all global handler. However, it allows one to clearly distinguish the error-handling mechanisms for runtime-system errors from those for user-program errors. Currently these two are fused together, and user programs constantly need to deal with the idea that the runtime system may fail (as manifested by the folk knowledge that you should never write a catch-all exception handler), but with absolutely no gain in doing so, since nothing really reliable can be done when these failures happen.
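A minimal sketch of such a trap at the REPL level (illustrative only, with hypothetical `read`/`eval`/`print` callbacks): on a serious exception, the loop reports and restarts from a known-good point instead of propagating.

```ocaml
(* Illustrative sketch: a read-eval-print loop that survives "serious"
   exceptions by restarting its iteration rather than propagating them. *)
let rec loop ~read ~eval ~print =
  match read () with
  | None -> ()
  | Some input ->
    (try print (eval input)
     with Out_of_memory | Stack_overflow ->
       prerr_endline "serious error; restarting the loop");
    loop ~read ~eval ~print
```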

@xavierleroy
Contributor

I'm aware that these "asynchronous" exceptions are not perfect and have some issues of their own. I'd be happy to learn about possible alternatives for handling out-of-memory conditions. But replacing those exceptions by fatal errors is throwing the baby out with the bathwater.

@xavierleroy
Contributor

Also, before reproducing it here, please see the extensive discussion for #1926, especially #1926 (comment) on how Rust does it.

@gadmm
Contributor

gadmm commented Apr 22, 2019

I want to preface this by saying that it seems that everybody agrees that the current situation is not very good, that the problem of offering good support for handling Out_of_memory is a very difficult one, and I think it probably had no good solution until Rust came along with its exception-safety model. I agree with @jhjourdan that one should not keep features with which one cannot write reliable programs at all, but I am more optimistic than @xavierleroy that the current situation can, with some effort, lead to an Out_of_memory exception that is consistently reliable. Also, I am only stating what I think are facts; it is still up to the OCaml community to decide on the scope of their language, and if below I disregard issues such as the fatal error during minor collection, that is only to offer a theoretical viewpoint, not to minimise the engineering effort required.

However as we saw in #2118 these asynchronous exceptions (Out_of_memory, Stack_overflow) do not allow resources to be handled correctly

For the purpose of resource management it is important to distinguish asynchronous exceptions (e.g. Sys.Break or a possible thread-killing exception) from synchronous-but-unexpected ones such as Out_of_memory, as the possible solutions are slightly different. For Out_of_memory, it's easier in certain aspects and harder in others (but not impossible in theory).

and in the end are only meant to be caught at the toplevel for a tentative recovery with no guarantees about the state of the system when they happen.

In my understanding, it's: either at the toplevel, or with no guarantees about the state of accessed mutable data. That's because at the toplevel it's easy to check that nothing invalid escapes. In fact, another example where this is reliable is at the bottom level: if Out_of_memory is thrown reliably, then it is possible to reliably implement, e.g., Array.make_opt (which returns an option) from Array.make (which raises).
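A hedged sketch of this "bottom-level" point (`array_make_opt` is hypothetical, not a stdlib function): if `Array.make` raises Out_of_memory reliably, an option-returning wrapper is trivially safe, because nothing invalid can escape.

```ocaml
(* Sketch: an option-returning variant built on a reliably-raising
   Array.make. Either the array was fully allocated and returned,
   or the caller sees None; no partially-built value escapes. *)
let array_make_opt n x =
  match Array.make n x with
  | a -> Some a
  | exception Out_of_memory -> None
```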

More generally, whenever you can guarantee that nothing invalid escapes, you are allowed to catch Out_of_memory (but only with a catch-all; see the discussion at #2118). This is difficult to ensure: think of a program that may be interrupted in the middle of writing to memory shared between threads (even as part of a safe abstraction). The discussion at #1926 shows that catching Out_of_memory for a sub-component is desirable for some important use-cases; there is even more empirical evidence coming from open discussions in the Rust community.

Rust's exception-safety model pushes this idea of isolating components at boundaries further, as opposed to the C++-style exception-safety model, where the user has to ensure the validity of invariants at all points in time (which is already very inconvenient even for expected exceptions). Rust offers language support for it: “poison” RAII guards are used to propagate the information of failure to other users of the shared data and solve the problem I just mentioned (so resource-management features actually help with isolation), and an UnwindSafe trait (still a bit experimental, IIUC) is used to recognise which closures it is safe to catch exceptions (panics) from; before that, it was already safe to catch exceptions at task (thread) boundaries.

This exception-safety model could be adapted for OCaml's “serious” exceptions. Then, the case of Out_of_memory is even more specific and a bit harder in functional programming, due to the issues of small allocations and non-locality, as discussed in #1926. Given that there's a long road ahead if one wants to offer satisfactory support for Out_of_memory, my advice is to decide what the ideal design would be, then decide on the best trajectory to reach that design, and accept that one has an imperfect solution in the meantime (still better than nothing). With this imperfect situation in mind, the proposal at #1926 to optionally abort on out of memory sounds sensible to me.

Historically, Rust's model came as an evolution of Erlang's “let it fail” model for writing reliable distributed systems, which Rust brought to shared-memory concurrency thanks to these two new concepts. Joe Armstrong realised that in order to build reliable scalable systems, one has to incorporate the possibility of unexpected failure.

Sadly, Joe Armstrong passed away last Saturday. In memory of his life and achievements, one can listen to or read his panel discussion with Tony Hoare and Carl Hewitt, his Erlang paper in CACM, and his PhD thesis.

So I'm wondering whether these errors would not be better served by some form of OCaml user-definable continuation trap, to which the current backtrace is given and which allows restarting the program from there. In the particular case of the toplevel, this would allow re-invoking the read-eval-print loop on out-of-memory and/or stack-overflow errors.

You may say that this does not change much w.r.t. a catch-all global handler.

Indeed, this is still in the same spirit, but an exception still allows you to clean up resources with unwind-protect, whereas a global continuation trap is non-compositional, so it is not very good language support for the "let it fail" philosophy.

However, it allows one to clearly distinguish the error-handling mechanisms for runtime-system errors from those for user-program errors. Currently these two are fused together, and user programs constantly need to deal with the idea that the runtime system may fail (as manifested by the folk knowledge that you should never write a catch-all exception handler), but with absolutely no gain in doing so, since nothing really reliable can be done when these failures happen.

I agree with these two different but important problems:

  • One needs to more clearly (and officially) distinguish catchable exceptions from “serious” exceptions. With the introduction of Fun.protect, the distinction is already more-or-less official since it documents that one should not observe the difference between, say, Out_of_memory and Finally_raised.

  • One needs a more flexible way to handle serious exceptions. It is not clear to me that Rust-style resource management is necessary to implement poisoning; one could already do that with unwind-protect wrappers, say for locking a mutex. I wonder how much having a clearer stance on them would help drive the evolution of language and library support.

These concern more generally serious exceptions, rather than just Out_of_memory.
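A rough sketch of poisoning via a plain unwind-protect wrapper, in the spirit of the second point above (all names are illustrative, and a simple flag stands in for a real `Mutex.t`-based critical section):

```ocaml
(* Illustrative poisoning guard: a serious exception escaping the
   critical section marks the shared data as suspect, and later users
   are refused. A real implementation would pair this with a mutex. *)
exception Poisoned

type 'a guarded = { mutable poisoned : bool; mutable value : 'a }

let make v = { poisoned = false; value = v }

let with_guard g f =
  if g.poisoned then raise Poisoned;
  match f g.value with
  | r -> r
  | exception e ->
    g.poisoned <- true;  (* propagate the failure to other users *)
    raise e
```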

2- Do we want to entirely deprecate Out_of_memory and stop raising it even in functions such as Array.make, Gc.set, ...? I personally think that these cases cannot be handled reliably anyway, and that making this distinction makes the behavior less predictable. A successful call to, e.g., Array.make can fill the heap and make subsequent allocations impossible, hence producing a fatal error (even if we were trying to catch Out_of_memory). Conversely, a failing call to Array.make can reveal that the heap is already full.

The discussion in #1926 identifies two distinct issues and solutions:

  • One should design the feature with use-cases in mind; there will always be Out_of_memory situations that are not recoverable (e.g. due to outside pressure, but then it's a problem for the OS). But the actual desired use-cases describe recoverable situations. In practice, this means that one can assume some cooperation from the programmer to ensure that the Out_of_memory goes well (it's going to free up some space eventually; if it happens again in the unwind-protect handler then they are ok with that...).

  • The non-locality of the “out of memory” situation, which you mention. Assuming cooperation from the programmer, I sketched in #1926 (“Add a runtime flag to call abort(3) instead of raising Out_of_memory”) a design where they register components as being able to receive Out_of_memory, and whenever Out_of_memory occurs, the GC looks for a thread that accepts it. It's inspired by task-based isolation again, and also by the current treatment of signals. Notice that in the case where no handler has been registered, this proposal degenerates to calling some abort function, as suggested in the current PR.

Again, this is less to say “we should do this” than to say “I agree with the diagnosis and here's what it would take, to the best of my knowledge”.

[several other comments that argue in favour of abort in the name of “let it fail”]

These comments seem to argue in favour of abort in the name of “let it fail”, but their solution seems to relegate the question of language support to the “meta”. To me, they seem to paint a programming language which is not scalable.

@lpw25
Contributor

lpw25 commented Apr 23, 2019

The suggestion that the ideal solution involves evolving OCaml to have a notion of isolated task seems strange to me. We have a notion of isolated task: the process. From that perspective the current proposal seems to line up correctly with the position that out of memory conditions should only be detected for isolated tasks.

@gadmm
Contributor

gadmm commented Apr 25, 2019

It is indeed possible that OCaml multicore will have a reliable concurrency and parallelism model centred around small OCaml processes which communicate via message passing.

Leo and I seem to agree on this: before deciding the fate of Out_of_memory, one first has to agree on the direction in which to evolve the currently loosely-defined exception-safety model for serious exceptions.

@jhjourdan
Contributor Author

Since we are far from a consensus on this question, and since my solution seems to be even less consensual, I am closing this PR.

@jhjourdan jhjourdan closed this Sep 5, 2019
@gadmm
Contributor

gadmm commented Sep 5, 2019

It sounds like you are abandoning the idea of sanitizing OOM. From online and offline discussions, it seemed to me that a consensus was likely to be reached around the approach in #1926 (at least as a starting point). The latter was closed because it was "superseded", but I do not see any newer PR to sanitize OOM. Are there still plans to explore that approach? If not, should we reopen #1926, or open a different PR based on it?

@jhjourdan jhjourdan deleted the no_oom branch September 5, 2019 15:17
@jhjourdan
Contributor Author

It sounds like you are abandoning the idea of sanitizing OOM. From online and offline discussions, it seemed to me that a consensus was likely to be reached around the approach in #1926 (at least as a starting point). The latter was closed because it was "superseded", but I do not see any newer PR to sanitize OOM. Are there still plans to explore that approach? If not, should we reopen #1926, or open a different PR based on it?

To be honest, what I feel here is that people don't agree on what to do, and I am not motivated to lead the debate to find the right thing to do, since I am personally not impacted by this somewhat weird behavior.

But if you feel comfortable leading such a debate and finally writing a PR which implements whatever comes out of it, that would be great!

@gadmm
Contributor

gadmm commented Sep 12, 2019

I think we're almost there, but it looks more complicated than it is because two questions have been mixed in these discussions:

  1. Is it possible to reliably handle OOM conditions, and if so how?
  2. How can we improve the current OOM situation for people for whom it is harmful?

We all agree that the first question is complicated. I described what I think is one solution. Beyond the design issues, I do not minimize the engineering effort required. I hear people who say that backing out of a failed minor collection is hard to implement. I'll be happy to continue discussing it with people who are interested; I just think it is premature to address it. To begin with, a prerequisite is to agree on how to reconcile unexpected exceptions with safe resource management, a proposal for which is only now being written. In a while, maybe, after multicore is successful, resource-management issues will be more prevalent, and more interested people will be willing to throw resources at gracefully recovering from OOM.

The second concern has been raised in #1926 by an industrial user, and it looks simpler to address. I think the question also covers many of your good points. I will continue the discussion on that other PR, focusing on that specific concern.

@jorisgio

I might have been too quick to close #1926; feel free to reopen it. I had the feeling consensus was being reached on this one, but apparently I was wrong. For what it's worth, we have been using a patched OCaml since then, which allows custom handling of OOM. I do not know whether we are the only ones affected by this problem at this point, and whether adding a user-defined callback to tweak the behavior is useful enough.

@gadmm gadmm mentioned this pull request Apr 6, 2020