Add a runtime flag to call abort(3) instead of raising Out_of_memory by jorisgio · Pull Request #1926 · ocaml/ocaml

jorisgio · 2018-07-24T10:19:27Z

Currently, if allocation fails, runtime raises Out_of_memory exception. This exception is almost impossible to handle properly as it can be raised from virtually any allocation point, and will be caught by any catch all clauses.
This pr adds an OCAMLRUNPARAM option to turn this exception into an abort call.

Alternatives:

most malloc implementation support a similar flags, but this makes it harder to get a consistent ocaml behavior accross malloc implementations/huge_page support, systems...
If a flag will be added, i wonder if it should be more generic and extend the behavior to other hard failures exceptions like Assert_failure or Match_failure which exhibit the same problem.

Concerns

Some library features are well contained and it might be worth keeping the out_of_memory behavior for them, for instance bigarray creation or marshal.

Status

this PR is wip but i'd like to know if this approach looks right before adding tests etc.

mshinwell · 2018-07-24T10:34:22Z

I think we have previously discussed permanently changing this exception into a fatal error. I'm not sure it should be calling abort, though, which may produce a core dump.

nojb · 2018-07-24T10:39:30Z

Without looking at the details, I think it is a good idea to revisit Out_of_memory, as there is no way of handling it safely as far as I know.

If a flag will be added, i wonder if it should be more generic and extend the behavior to other hard failures exceptions like Assert_failure or Match_failure which exhibit the same problem.

I don't think so - these exceptions signal programmer error, are perfectly well defined, and they can be handled in the normal flow of a program, so they should not abort the program run.

pmetzger · 2018-07-24T14:11:02Z

I'm not sure it should be calling abort, though, which may produce a core dump.

It's often harder to debug a weird error without a core dump to look at...

alainfrisch · 2018-07-24T14:45:05Z

I don't think so - these exceptions signal programmer error, are perfectly well defined, and they can be handled in the normal flow of a program, so they should not abort the program run.

I agree. One could have a similar discussion for Stack_overflow and "asynchronous" exceptions (raised from signal handlers or finalizers), although for asynchronous exceptions, one should be able to register a global handler as an OCaml function (which could decide to exit the application, or just to log the error and continue).

xavierleroy · 2018-07-24T16:36:07Z

I'm not convinced by the motivations. Catch-all exceptions should never be used except for finalization (followed by a reraise) or in the entry point of a program for hardening purposes (log then restart). Turning a run-time exception into an abort() is just making the program unable to handle the exceptional condition. Something is lost. What is gained?

ygrek · 2018-07-24T17:46:57Z

I strongly support this.
Also I think it makes sense to allow override exn-vs-abort choice for Stack_overflow, Assert_failure and maybe Match_failure, each separately. Or allow setup user-handlers for this conditions (which by default will raise exceptions).
Different programs have different requiremets/expectations for this and it makes sense to allow choice.
I would claim that no code out there that didn't plan specifically for Out_of_memory exception is able to handle it. Same for Stack_overflow. Requiring all code to be written without catchalls, including third-party dependencies - is not always practical..

stedolan · 2018-07-25T08:41:25Z

Turning a run-time exception into an abort() is just making the program unable to handle the exceptional condition. Something is lost. What is gained?

Programs are already unable to handle allocation failure as Out_of_memory is not raised reliably, because it is always a fatal error for allocation to fail during minor GC. I would much prefer to see allocation failure result in a consistent fatal error than the current situation (an unpredictable choice between a fatal error and an exception depending on internal GC state).

I don't think this option should be added. Instead, out of memory should always be a fatal error and Out_of_memory should be deleted.

alainfrisch · 2018-07-25T09:02:28Z

Related discussion on https://caml.inria.fr/mantis/view.php?id=7185 , about avoiding the fatal error case.
And the also #852.

jorisgio · 2018-07-25T09:30:33Z

What is gained?

Crash consistency.

Catch-all exceptions should never be used except for finalization (followed by a reraise) or in the entry point of a program for hardening purposes (log then restart).

In an ideal world, yes, but in practice it's hard to control all dependencies and I believe it makes sense to have catch all clauses in "safe" points to log and restart only part of the program.
A good example is a server/RPC service: even if something unpredictable happens become some downstream changes fires now an unplanned exception (something that the type system would not catch), it is most of the time safe enough to continue serving other requests and log and abort only a single request, especially if restarting the whole service is costly.

The main problem is that most exceptions signal local errors. In concurrent or multi thread programs, Out_of_memory is special because it is not necessarily raised close to the cause of the error.
In the server example, aborting the request will probably not fix the out of memory issue.

Another problem is that catching Out_of_memory gives you a false sense of security, and i've encountered cases which were hard to understand simply from the trace : if you are lucky, a gc will manage to release enough memory to unwind all the stack, run all finalization code between reraise, and exit.
But not always. Since OOM can be raise from almost any place, it can be reraised from inside the handlers without running them and "skipping".

Basically i agree that it's not strictly necessary to have the option to abort in the runtime, and we discussed another possibility of writing a ppx to rewrite catchall clauses into catchall-except OOM and abort, but having the option to directly abort on oom would be much easier.

Instead, out of memory should always be a fatal error and Out_of_memory should be deleted.

Maybe but i can still see where OOM can be useful : for instance when allocating a large array/bigarray and having this allocation fail, it's contained and easy to catch enough for programmer to try handling it.

dbuenzli · 2018-07-25T10:16:10Z

What I dislike about calling abort (and caml_fatal_error exiting with 2) is that you can no longer distinguish runtime error conditions from application defined conditions which can make diagnostic more painful.

damiendoligez · 2018-09-13T14:55:39Z

I agree with @stedolan:

I don't think this option should be added. Instead, out of memory should always be a fatal error and Out_of_memory should be deleted.

MPR#7185 is an interesting discussion but it looks like a lot of work (and potential bugs) for very little benefit.

stedolan · 2018-10-02T15:16:53Z

@jorisgio

Instead, out of memory should always be a fatal error and Out_of_memory should be deleted.

Maybe but i can still see where OOM can be useful : for instance when allocating a large array/bigarray and having this allocation fail, it's contained and easy to catch enough for programmer to try handling it.

This does sound useful, but I think that the Out_of_memory exception is a poor way to express it (being asynchronous and unreliable). I still think we should remove it, but to support your use-case it might be worth adding Array.try_create : int -> 'a -> 'a array option (and similarly for Bigarray).

damiendoligez

I've marked one line in memory.c where raising the exception is dependent on the GC's internal state. This is the one that makes Out_of_memory unreliable, and which should be changed to a fatal error.

Elsewhere, I think it does make sense to raise Out_of_memory and it's easy to recover from it:

when creating a huge array, bigarray, or bytes
when changing the minor heap size
when marshalling or unmarshalling a huge value

So we should keep the exception in these cases.

At any rate, I don't think it's a good idea to introduce a run-time option to control this behavior. @jorisgio would you be willing to simplify your PR accordingly?

damiendoligez · 2018-11-06T14:36:41Z

runtime/extern.c

-  free_extern_output();
-  caml_raise_out_of_memory();
-}
-


This change is not correct. If you're in "raise OOM" mode, then you need to replay the trail and free the output before you can raise the exception.

damiendoligez · 2018-11-06T14:39:48Z

runtime/memory.c

+  if (caml_init_out_of_memory_mode == 0) {
+    caml_raise_out_of_memory();
+  } else {
+    abort();


You should print an error message before aborting, like we do everywhere else when we use abort.

damiendoligez · 2018-11-06T14:48:27Z

runtime/memory.c

@@ -498,7 +507,7 @@ static inline value caml_alloc_shr_aux (mlsize_t wosize, tag_t tag,
      else if (caml_in_minor_collection)


This is the line that makes Out_of_memory unreliable. You should remove the test and always call caml_fatal_error in this case.

jorisgio · 2018-11-09T07:19:04Z

Thanks. I've simplified to the one line changed you suggested, i'm going to test in the wild too. I'm not totally convinced it's a good idea to raise Out_of_memory while allocating a large string/array, but i believe your suggestion is a good middle ground and is indeed the biggest issue so i'm happy with that solution.

The other change with caml_fatal_error is that it will not abort() but exit() and this seems to be the consensus. I'd rather have a coredump to debug this sort of issue, but i guess i'm heavily biased towards servers and network services development where the dev is also the user.

mshinwell · 2018-11-09T10:08:19Z

Hmm, I think maybe caml_fatal_error should always call abort. I've seen a lot of situations (not specifically arising from this function, but similar cases) where the lack of core dumps has seriously hindered debugging. I would hazard a guess that most users don't have them enabled by default anyway.

dbuenzli · 2018-11-09T10:31:30Z

Hmm, I think maybe caml_fatal_error should always call abort.

IIUC abort entails that the process terminates via a signal (waitpid on such a process would return WSIGNALED). I would be all for it: for now caml_fatal_error exits with 2 which is confusing if your cli program uses exit codes.

stedolan · 2018-11-13T15:35:13Z

Elsewhere, I think it does make sense to raise Out_of_memory and it's easy to recover from it:

when creating a huge array, bigarray, or bytes

when changing the minor heap size

when marshalling or unmarshalling a huge value

So we should keep the exception in these cases.

FWIW, I agree completely with this. This makes Out_of_memory an ordinary exception that can be raised from a handful of functions (e.g. Array.make, Bigarray.Array1.create, Gc.set, etc.), along the lines of Invalid_argument or Unix_error. My objection is only to the asynchronous Out_of_memory that can be raised from anywhere.

DemiMarie · 2018-11-17T15:00:42Z

Some programs, such as libvirt, depend on handling out-of-memory situations. Libvirt, in particular, includes OCaml code that would be broken by this change ― it catches Out_of_memory and recovers from it. Libvirt includes a daemon that is long-running and is in a memory-constrained environment. Embedded systems are a similar situation.

lpw25 · 2018-11-17T23:06:05Z

@DemiMarie Could you provide a pointer to the code in libvirt which does this?

gadmm · 2018-11-19T06:27:18Z

I think it is a good idea to revisit Out_of_memory, as there is no way of handling it safely as far as I know.

I would claim that no code out there that didn't plan specifically for Out_of_memory exception is able to handle it.

MPR#7185 is an interesting discussion but it looks like a lot of work (and potential bugs) for very little benefit.

It is interesting to read about the handling of OOM in other languages, and in particular the reasons why Rust has decided go from aborting to giving an option to raise an exception (https://github.com/rust-lang/rfcs/blob/master/text/2116-alloc-me-maybe.md). This RFC and the two linked discussions are worth reading (as well as several earlier threads linked from there; it's a bit long, but in any case I read it for you). They refute several misconceptions about the difficulty of handling OOM safely, and they contain several convincing examples of use-cases.

A recurring example of use-case is that of servers (mentioned also by @jorisgio), where one is ready to have a request be aborted to let others continue; this is made safe thanks to the isolation of requests. The embedded use-case is also mentioned, but for this they prefer Array.try_create-style functions and to let OOM be handled explicitly everywhere (but mention it is not a full solution). (Besides, the above comments mention handling OOM, but even without proper handling, the OOM exception lets you clean up, which sounds quite reasonable to ask for.)

In Rust, the exception-safety model is the one @xavierleroy alluded to:

Catch-all exceptions should never be used except for finalization (followed by a reraise) or in the entry point of a program for hardening purposes (log then restart).

In other words, it asks for resource safety (unwind-protect-style) but frees the programmer from the task of ensuring the consistency of local invariants in the face of exceptions: instead, whoever catches the exception has to make sure that no invalid data is accessed. The Rust community shows a way towards making this model compositional and correct by construction (which is quite notable in itself) thanks to a notion of isolation of a task (https://doc.rust-lang.org/std/panic/trait.UnwindSafe.html). This is key to making the Rust approach safe, and I believe worth taking inspiration from.

Here compositional means that (by analogy) one can write catch-alls in the middle of a program, not really anywhere like a normal try, but at the beginning of an isolated task (like @jorisgio and @ygrek said they would like to be allowed to do).

Now in OCaml, an additional difficulty is that it is harder to control allocations. Let us focus on the use-cases where OOM is caused by trying to allocate too much (as opposed to being low on memory due to outside pressure). A difference is that in functional programming, it is more likely that a large allocation consists in a lot of small allocations. This is the situation in the use-case that @mlasson describes in MPR#7185. In contrast, the current proposed solution only caters to situations where OOM is caused by a possibly large allocation (this is also mentioned as a limitation by the Rust RFC for the non-raising half of their proposal).

Moreover:

The main problem is that most exceptions signal local errors. In concurrent or multi thread programs, Out_of_memory is special because it is not necessarily raised close to the cause of the error.

I think this is an important point. Out_of_memory is not a local condition, and the culprit could be another thread. Taking this into account, given the known use cases, and given the use-cases for opting-out of the Out_of_memory exception, I propose the following design for handling OOM events, inspired by signal handling:

Give the possibility to register a thread as accepting OOM exceptions (e.g. with a scoped combinator).
When OOM occurs, look if the current thread accepts OOM, if it does, raise OOM in the current thread.
Otherwise, look if another thread accepts the OOM exception; if one does, restart it with an OOM exception behaving as a genuine asynchronous exception.
If none accepts OOM, abort.

The first step would also be responsible for allocating the landing pad evoked by Jacques-Henri as a solution for MPR#7185.

xavierleroy · 2019-04-22T07:47:33Z

Nine months later, this PR is coming back in radicalized form as #8628 ("Deprecate the Out_of_memory exception"). Where are we with this more moderate PR?

gadmm · 2019-04-22T23:46:42Z

I hope that I did not derail this PR by thinking out loud about an ideal design (that no programming language currently comes close to implement). ~~With a currently imperfect treatment of Out_of_memory, it makes sense to offer an option to turn it off and to abort (or dbuenzli-abort) instead.~~ Given that the design admits flat-out aborting as a special case, providing an option to turn off Out_of_memory and abort (or dbuenzli-abort) instead brings us a bit closer to it. See also #8628 (comment) for another discussion.
(Edit: remove non-scientific claim)

jorisgio · 2019-08-01T00:55:28Z

this was superseded by #8630

gadmm · 2019-09-13T13:39:02Z

I suggest we have a second look at the idea of adding a new runparam that causes fatal_error on OOM.

I tried to understand the effects of the current patch that proposes to always stop with fatal_error in caml_alloc_shr. Before the patch, it is guaranteed that the fatal_error never happens when trying to allocate blocks of size > Max_young_wosize. To me the fact that it achieves this by branching on the internal state of the GC is not relevant, because this makes the current behaviour consistent in that sense. After the current patch, it becomes inconsistent whether large allocations fail with fatal_error or by raising Out_of_memory. At cursory look, fatal_error becomes the default, including for instance in arrays, bytes and string allocations, and third-party libraries calling caml_alloc functions. This would remove one reliable appearance of Out_of_memory, after an accidental or obviously too large single allocation. I agree that it makes the Out_of_memory exception look like a synchronous exception, but I do not think that Out_of_memory coming out of thin air is the issue here.

A new runparam to turn off the Out_of_memory exception would cater the needs of developers who just want to assume that memory allocation does not fail. I argue that such a switch is justified because you would want it even if the Out_of_memory exception was properly implemented. Indeed, even then, graceful handling of memory exhaustion is not automatic and is likely to require additional care from the developer. But in many situations, developers are in a model where it is assumed that allocations never fail (witness overcommitting).

Having a way to choose whether to immediately abort on OOM has precedents in Rust, C++, and some malloc implementations. I refer to the experience of Rust developers (https://github.com/rust-lang/rfcs/blob/master/text/2116-alloc-me-maybe.md):

Early on in Rust's history we made a policy decision [...] to abort. This is because most developers aren't prepared to handle it, or interested. [...] We call this approach infallible collection allocation, because the developer model is that allocations just don't fail.

For OCaml, it is desirable to leave the current behaviour to raise Out_of_memory as a default, e.g. for the case of the REPL user. It fixes the current issue without requiring a breaking change. It does not make the Out_of_memory exception reliable, which is a problem left open.

@jorisgio, do you like the idea and do you want to reopen the PR and update your former patch? Or I can do it myself in a new PR if nobody is against.

jorisgio · 2019-09-13T14:04:25Z

Personally i got hooked it by this proposal of having a global handler that by default raises abort but a user can register a callback to customize behavior. It looks like rust lang item feature, which is nice to use. The good thing about it is that you can log custom info about heap and program specific data before aborting.
But i think runtime flag is ok enoug i guess.

gadmm · 2019-09-13T14:41:03Z

Since #8630 has been merged, it is understood that fatal_error already includes such a hook. It remains to call fatal_error on out_of_memory.

Note, given that fatal_error_hook is called during a failed minor gc, it should probably be documented that the hook should not call the ocaml runtime.

jorisgio force-pushed the abort_on_oom branch from d74d756 to 1002842 Compare July 24, 2018 13:33

dbuenzli mentioned this pull request Oct 11, 2018

Implement Systhreads in terms of domains ocaml-multicore/ocaml-multicore#100

Closed

damiendoligez requested changes Nov 6, 2018

View reviewed changes

runtime: raise a fatal error when allocation fail during GC

b19c879

jorisgio force-pushed the abort_on_oom branch from 1002842 to b19c879 Compare November 9, 2018 07:14

gadmm mentioned this pull request Nov 30, 2018

protect: treat as an error the case where two exceptions are being raised #2118

Merged

nojb mentioned this pull request Apr 19, 2019

Deprecate the Out_of_memory exception #8628

Closed

jorisgio closed this Aug 1, 2019

jorisgio deleted the abort_on_oom branch August 1, 2019 00:55

		@@ -498,7 +507,7 @@ static inline value caml_alloc_shr_aux (mlsize_t wosize, tag_t tag,
		else if (caml_in_minor_collection)

Conversation

jorisgio commented Jul 24, 2018

Alternatives:

Concerns

Status

Uh oh!

mshinwell commented Jul 24, 2018

Uh oh!

nojb commented Jul 24, 2018

Uh oh!

pmetzger commented Jul 24, 2018

Uh oh!

alainfrisch commented Jul 24, 2018

Uh oh!

xavierleroy commented Jul 24, 2018

Uh oh!

ygrek commented Jul 24, 2018

Uh oh!

stedolan commented Jul 25, 2018

Uh oh!

alainfrisch commented Jul 25, 2018

Uh oh!

jorisgio commented Jul 25, 2018

Uh oh!

dbuenzli commented Jul 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

damiendoligez commented Sep 13, 2018

Uh oh!

stedolan commented Oct 2, 2018

Uh oh!

damiendoligez left a comment

Choose a reason for hiding this comment

Uh oh!

damiendoligez Nov 6, 2018

Choose a reason for hiding this comment

Uh oh!

damiendoligez Nov 6, 2018

Choose a reason for hiding this comment

Uh oh!

damiendoligez Nov 6, 2018

Choose a reason for hiding this comment

Uh oh!

jorisgio commented Nov 9, 2018

Uh oh!

mshinwell commented Nov 9, 2018

Uh oh!

dbuenzli commented Nov 9, 2018

Uh oh!

stedolan commented Nov 13, 2018

Uh oh!

DemiMarie commented Nov 17, 2018

Uh oh!

lpw25 commented Nov 17, 2018

Uh oh!

gadmm commented Nov 19, 2018

Uh oh!

xavierleroy commented Apr 22, 2019

Uh oh!

gadmm commented Apr 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jorisgio commented Aug 1, 2019

Uh oh!

gadmm commented Sep 13, 2019

Uh oh!

jorisgio commented Sep 13, 2019

Uh oh!

gadmm commented Sep 13, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

13 participants

dbuenzli commented Jul 25, 2018 •

edited

Loading

gadmm commented Apr 22, 2019 •

edited

Loading