Skip to content

Add a runtime flag to call abort(3) instead of raising Out_of_memory#1926

Closed
jorisgio wants to merge 1 commit intoocaml:trunkfrom
jorisgio:abort_on_oom
Closed

Add a runtime flag to call abort(3) instead of raising Out_of_memory#1926
jorisgio wants to merge 1 commit intoocaml:trunkfrom
jorisgio:abort_on_oom

Conversation

@jorisgio
Copy link
Copy Markdown

Currently, if allocation fails, runtime raises Out_of_memory exception. This exception is almost impossible to handle properly as it can be raised from virtually any allocation point, and will be caught by any catch all clauses.
This pr adds an OCAMLRUNPARAM option to turn this exception into an abort call.

Alternatives:

  • most malloc implementation support a similar flags, but this makes it harder to get a consistent ocaml behavior accross malloc implementations/huge_page support, systems...
  • If a flag will be added, i wonder if it should be more generic and extend the behavior to other hard failures exceptions like Assert_failure or Match_failure which exhibit the same problem.

Concerns

Some library features are well contained and it might be worth keeping the out_of_memory behavior for them, for instance bigarray creation or marshal.

Status

this PR is wip but i'd like to know if this approach looks right before adding tests etc.

@mshinwell
Copy link
Copy Markdown
Contributor

I think we have previously discussed permanently changing this exception into a fatal error. I'm not sure it should be calling abort, though, which may produce a core dump.

@nojb
Copy link
Copy Markdown
Contributor

nojb commented Jul 24, 2018

Without looking at the details, I think it is a good idea to revisit Out_of_memory, as there is no way of handling it safely as far as I know.

If a flag will be added, i wonder if it should be more generic and extend the behavior to other hard failures exceptions like Assert_failure or Match_failure which exhibit the same problem.

I don't think so - these exceptions signal programmer error, are perfectly well defined, and they can be handled in the normal flow of a program, so they should not abort the program run.

@pmetzger
Copy link
Copy Markdown
Member

I'm not sure it should be calling abort, though, which may produce a core dump.

It's often harder to debug a weird error without a core dump to look at...

@alainfrisch
Copy link
Copy Markdown
Contributor

I don't think so - these exceptions signal programmer error, are perfectly well defined, and they can be handled in the normal flow of a program, so they should not abort the program run.

I agree. One could have a similar discussion for Stack_overflow and "asynchronous" exceptions (raised from signal handlers or finalizers), although for asynchronous exceptions, one should be able to register a global handler as an OCaml function (which could decide to exit the application, or just to log the error and continue).

@xavierleroy
Copy link
Copy Markdown
Contributor

I'm not convinced by the motivations. Catch-all exceptions should never be used except for finalization (followed by a reraise) or in the entry point of a program for hardening purposes (log then restart). Turning a run-time exception into an abort() is just making the program unable to handle the exceptional condition. Something is lost. What is gained?

@ygrek
Copy link
Copy Markdown
Contributor

ygrek commented Jul 24, 2018

I strongly support this.
Also I think it makes sense to allow override exn-vs-abort choice for Stack_overflow, Assert_failure and maybe Match_failure, each separately. Or allow setup user-handlers for this conditions (which by default will raise exceptions).
Different programs have different requiremets/expectations for this and it makes sense to allow choice.
I would claim that no code out there that didn't plan specifically for Out_of_memory exception is able to handle it. Same for Stack_overflow. Requiring all code to be written without catchalls, including third-party dependencies - is not always practical..

@stedolan
Copy link
Copy Markdown
Contributor

Turning a run-time exception into an abort() is just making the program unable to handle the exceptional condition. Something is lost. What is gained?

Programs are already unable to handle allocation failure as Out_of_memory is not raised reliably, because it is always a fatal error for allocation to fail during minor GC. I would much prefer to see allocation failure result in a consistent fatal error than the current situation (an unpredictable choice between a fatal error and an exception depending on internal GC state).

I don't think this option should be added. Instead, out of memory should always be a fatal error and Out_of_memory should be deleted.

@alainfrisch
Copy link
Copy Markdown
Contributor

Related discussion on https://caml.inria.fr/mantis/view.php?id=7185 , about avoiding the fatal error case.
And the also #852.

@jorisgio
Copy link
Copy Markdown
Author

What is gained?

Crash consistency.

Catch-all exceptions should never be used except for finalization (followed by a reraise) or in the entry point of a program for hardening purposes (log then restart).

In an ideal world, yes, but in practice it's hard to control all dependencies and I believe it makes sense to have catch all clauses in "safe" points to log and restart only part of the program.
A good example is a server/RPC service: even if something unpredictable happens become some downstream changes fires now an unplanned exception (something that the type system would not catch), it is most of the time safe enough to continue serving other requests and log and abort only a single request, especially if restarting the whole service is costly.

The main problem is that most exceptions signal local errors. In concurrent or multi thread programs, Out_of_memory is special because it is not necessarily raised close to the cause of the error.
In the server example, aborting the request will probably not fix the out of memory issue.

Another problem is that catching Out_of_memory gives you a false sense of security, and i've encountered cases which were hard to understand simply from the trace : if you are lucky, a gc will manage to release enough memory to unwind all the stack, run all finalization code between reraise, and exit.
But not always. Since OOM can be raise from almost any place, it can be reraised from inside the handlers without running them and "skipping".

Basically i agree that it's not strictly necessary to have the option to abort in the runtime, and we discussed another possibility of writing a ppx to rewrite catchall clauses into catchall-except OOM and abort, but having the option to directly abort on oom would be much easier.

Instead, out of memory should always be a fatal error and Out_of_memory should be deleted.

Maybe but i can still see where OOM can be useful : for instance when allocating a large array/bigarray and having this allocation fail, it's contained and easy to catch enough for programmer to try handling it.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jul 25, 2018

What I dislike about calling abort (and caml_fatal_error exiting with 2) is that you can no longer distinguish runtime error conditions from application defined conditions which can make diagnostic more painful.

@damiendoligez
Copy link
Copy Markdown
Member

I agree with @stedolan:

I don't think this option should be added. Instead, out of memory should always be a fatal error and Out_of_memory should be deleted.

MPR#7185 is an interesting discussion but it looks like a lot of work (and potential bugs) for very little benefit.

@stedolan
Copy link
Copy Markdown
Contributor

stedolan commented Oct 2, 2018

@jorisgio

Instead, out of memory should always be a fatal error and Out_of_memory should be deleted.

Maybe but i can still see where OOM can be useful : for instance when allocating a large array/bigarray and having this allocation fail, it's contained and easy to catch enough for programmer to try handling it.

This does sound useful, but I think that the Out_of_memory exception is a poor way to express it (being asynchronous and unreliable). I still think we should remove it, but to support your use-case it might be worth adding Array.try_create : int -> 'a -> 'a array option (and similarly for Bigarray).

Copy link
Copy Markdown
Member

@damiendoligez damiendoligez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've marked one line in memory.c where raising the exception is dependent on the GC's internal state. This is the one that makes Out_of_memory unreliable, and which should be changed to a fatal error.

Elsewhere, I think it does make sense to raise Out_of_memory and it's easy to recover from it:

  • when creating a huge array, bigarray, or bytes
  • when changing the minor heap size
  • when marshalling or unmarshalling a huge value

So we should keep the exception in these cases.

At any rate, I don't think it's a good idea to introduce a run-time option to control this behavior. @jorisgio would you be willing to simplify your PR accordingly?

free_extern_output();
caml_raise_out_of_memory();
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not correct. If you're in "raise OOM" mode, then you need to replay the trail and free the output before you can raise the exception.

runtime/memory.c Outdated
if (caml_init_out_of_memory_mode == 0) {
caml_raise_out_of_memory();
} else {
abort();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should print an error message before aborting, like we do everywhere else when we use abort.

runtime/memory.c Outdated
@@ -498,7 +507,7 @@ static inline value caml_alloc_shr_aux (mlsize_t wosize, tag_t tag,
else if (caml_in_minor_collection)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the line that makes Out_of_memory unreliable. You should remove the test and always call caml_fatal_error in this case.

@jorisgio
Copy link
Copy Markdown
Author

jorisgio commented Nov 9, 2018

Thanks. I've simplified to the one line changed you suggested, i'm going to test in the wild too. I'm not totally convinced it's a good idea to raise Out_of_memory while allocating a large string/array, but i believe your suggestion is a good middle ground and is indeed the biggest issue so i'm happy with that solution.

The other change with caml_fatal_error is that it will not abort() but exit() and this seems to be the consensus. I'd rather have a coredump to debug this sort of issue, but i guess i'm heavily biased towards servers and network services development where the dev is also the user.

@mshinwell
Copy link
Copy Markdown
Contributor

Hmm, I think maybe caml_fatal_error should always call abort. I've seen a lot of situations (not specifically arising from this function, but similar cases) where the lack of core dumps has seriously hindered debugging. I would hazard a guess that most users don't have them enabled by default anyway.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Nov 9, 2018

Hmm, I think maybe caml_fatal_error should always call abort.

IIUC abort entails that the process terminates via a signal (waitpid on such a process would return WSIGNALED). I would be all for it: for now caml_fatal_error exits with 2 which is confusing if your cli program uses exit codes.

@stedolan
Copy link
Copy Markdown
Contributor

Elsewhere, I think it does make sense to raise Out_of_memory and it's easy to recover from it:

  • when creating a huge array, bigarray, or bytes
  • when changing the minor heap size
  • when marshalling or unmarshalling a huge value

So we should keep the exception in these cases.

FWIW, I agree completely with this. This makes Out_of_memory an ordinary exception that can be raised from a handful of functions (e.g. Array.make, Bigarray.Array1.create, Gc.set, etc.), along the lines of Invalid_argument or Unix_error. My objection is only to the asynchronous Out_of_memory that can be raised from anywhere.

@DemiMarie
Copy link
Copy Markdown
Contributor

Some programs, such as libvirt, depend on handling out-of-memory situations. Libvirt, in particular, includes OCaml code that would be broken by this change ― it catches Out_of_memory and recovers from it. Libvirt includes a daemon that is long-running and is in a memory-constrained environment. Embedded systems are a similar situation.

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Nov 17, 2018

@DemiMarie Could you provide a pointer to the code in libvirt which does this?

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Nov 19, 2018

I think it is a good idea to revisit Out_of_memory, as there is no way of handling it safely as far as I know.

I would claim that no code out there that didn't plan specifically for Out_of_memory exception is able to handle it.

MPR#7185 is an interesting discussion but it looks like a lot of work (and potential bugs) for very little benefit.

It is interesting to read about the handling of OOM in other languages, and in particular the reasons why Rust has decided go from aborting to giving an option to raise an exception (https://github.com/rust-lang/rfcs/blob/master/text/2116-alloc-me-maybe.md). This RFC and the two linked discussions are worth reading (as well as several earlier threads linked from there; it's a bit long, but in any case I read it for you). They refute several misconceptions about the difficulty of handling OOM safely, and they contain several convincing examples of use-cases.

A recurring example of use-case is that of servers (mentioned also by @jorisgio), where one is ready to have a request be aborted to let others continue; this is made safe thanks to the isolation of requests. The embedded use-case is also mentioned, but for this they prefer Array.try_create-style functions and to let OOM be handled explicitly everywhere (but mention it is not a full solution). (Besides, the above comments mention handling OOM, but even without proper handling, the OOM exception lets you clean up, which sounds quite reasonable to ask for.)

In Rust, the exception-safety model is the one @xavierleroy alluded to:

Catch-all exceptions should never be used except for finalization (followed by a reraise) or in the entry point of a program for hardening purposes (log then restart).

In other words, it asks for resource safety (unwind-protect-style) but frees the programmer from the task of ensuring the consistency of local invariants in the face of exceptions: instead, whoever catches the exception has to make sure that no invalid data is accessed. The Rust community shows a way towards making this model compositional and correct by construction (which is quite notable in itself) thanks to a notion of isolation of a task (https://doc.rust-lang.org/std/panic/trait.UnwindSafe.html). This is key to making the Rust approach safe, and I believe worth taking inspiration from.

Here compositional means that (by analogy) one can write catch-alls in the middle of a program, not really anywhere like a normal try, but at the beginning of an isolated task (like @jorisgio and @ygrek said they would like to be allowed to do).

Now in OCaml, an additional difficulty is that it is harder to control allocations. Let us focus on the use-cases where OOM is caused by trying to allocate too much (as opposed to being low on memory due to outside pressure). A difference is that in functional programming, it is more likely that a large allocation consists in a lot of small allocations. This is the situation in the use-case that @mlasson describes in MPR#7185. In contrast, the current proposed solution only caters to situations where OOM is caused by a possibly large allocation (this is also mentioned as a limitation by the Rust RFC for the non-raising half of their proposal).

Moreover:

The main problem is that most exceptions signal local errors. In concurrent or multi thread programs, Out_of_memory is special because it is not necessarily raised close to the cause of the error.

I think this is an important point. Out_of_memory is not a local condition, and the culprit could be another thread. Taking this into account, given the known use cases, and given the use-cases for opting-out of the Out_of_memory exception, I propose the following design for handling OOM events, inspired by signal handling:

  • Give the possibility to register a thread as accepting OOM exceptions (e.g. with a scoped combinator).
  • When OOM occurs, look if the current thread accepts OOM, if it does, raise OOM in the current thread.
  • Otherwise, look if another thread accepts the OOM exception; if one does, restart it with an OOM exception behaving as a genuine asynchronous exception.
  • If none accepts OOM, abort.

The first step would also be responsible for allocating the landing pad evoked by Jacques-Henri as a solution for MPR#7185.

@xavierleroy
Copy link
Copy Markdown
Contributor

Nine months later, this PR is coming back in radicalized form as #8628 ("Deprecate the Out_of_memory exception"). Where are we with this more moderate PR?

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Apr 22, 2019

I hope that I did not derail this PR by thinking out loud about an ideal design (that no programming language currently comes close to implement). With a currently imperfect treatment of Out_of_memory, it makes sense to offer an option to turn it off and to abort (or dbuenzli-abort) instead. Given that the design admits flat-out aborting as a special case, providing an option to turn off Out_of_memory and abort (or dbuenzli-abort) instead brings us a bit closer to it. See also #8628 (comment) for another discussion.
(Edit: remove non-scientific claim)

@jorisgio
Copy link
Copy Markdown
Author

jorisgio commented Aug 1, 2019

this was superseded by #8630

@jorisgio jorisgio closed this Aug 1, 2019
@jorisgio jorisgio deleted the abort_on_oom branch August 1, 2019 00:55
@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Sep 13, 2019

I suggest we have a second look at the idea of adding a new runparam that causes fatal_error on OOM.

I tried to understand the effects of the current patch that proposes to always stop with fatal_error in caml_alloc_shr. Before the patch, it is guaranteed that the fatal_error never happens when trying to allocate blocks of size > Max_young_wosize. To me the fact that it achieves this by branching on the internal state of the GC is not relevant, because this makes the current behaviour consistent in that sense. After the current patch, it becomes inconsistent whether large allocations fail with fatal_error or by raising Out_of_memory. At cursory look, fatal_error becomes the default, including for instance in arrays, bytes and string allocations, and third-party libraries calling caml_alloc functions. This would remove one reliable appearance of Out_of_memory, after an accidental or obviously too large single allocation. I agree that it makes the Out_of_memory exception look like a synchronous exception, but I do not think that Out_of_memory coming out of thin air is the issue here.

A new runparam to turn off the Out_of_memory exception would cater the needs of developers who just want to assume that memory allocation does not fail. I argue that such a switch is justified because you would want it even if the Out_of_memory exception was properly implemented. Indeed, even then, graceful handling of memory exhaustion is not automatic and is likely to require additional care from the developer. But in many situations, developers are in a model where it is assumed that allocations never fail (witness overcommitting).

Having a way to choose whether to immediately abort on OOM has precedents in Rust, C++, and some malloc implementations. I refer to the experience of Rust developers (https://github.com/rust-lang/rfcs/blob/master/text/2116-alloc-me-maybe.md):

Early on in Rust's history we made a policy decision [...] to abort. This is because most developers aren't prepared to handle it, or interested. [...] We call this approach infallible collection allocation, because the developer model is that allocations just don't fail.

For OCaml, it is desirable to leave the current behaviour to raise Out_of_memory as a default, e.g. for the case of the REPL user. It fixes the current issue without requiring a breaking change. It does not make the Out_of_memory exception reliable, which is a problem left open.

@jorisgio, do you like the idea and do you want to reopen the PR and update your former patch? Or I can do it myself in a new PR if nobody is against.

@jorisgio
Copy link
Copy Markdown
Author

Personally i got hooked it by this proposal of having a global handler that by default raises abort but a user can register a callback to customize behavior. It looks like rust lang item feature, which is nice to use. The good thing about it is that you can log custom info about heap and program specific data before aborting.
But i think runtime flag is ok enoug i guess.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Sep 13, 2019

Since #8630 has been merged, it is understood that fatal_error already includes such a hook. It remains to call fatal_error on out_of_memory.

Note, given that fatal_error_hook is called during a failed minor gc, it should probably be documented that the hook should not call the ocaml runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.