Skip to content

Caml_fatal_user_error#8963

Open
gadmm wants to merge 3 commits intoocaml:trunkfrom
gadmm:caml_fatal_user_error
Open

Caml_fatal_user_error#8963
gadmm wants to merge 3 commits intoocaml:trunkfrom
gadmm:caml_fatal_user_error

Conversation

@gadmm
Copy link
Copy Markdown
Contributor

@gadmm gadmm commented Sep 22, 2019

This introduces a new kind of fatal error, caml_fatal_user_error, denoting unrecoverable error in user code, which by default calls exit(2) and is customisable with a dbünzli-hook. This is inspired by the guideline from #8630 (review). According to it, caml_fatal_error, which now calls abort, should only be used for internal runtime failures. I needed a fatal error which denotes an error in user code at #8962. Then I noticed that some actual uses of caml_fatal_error would be better as caml_fatal_user_error (mainly related to out of memory conditions). Overall, I do not have strong opinions on this patch. On the principle of a new function for error in user code, I think it corresponds to current usage in the compiler codebase (some calls to exit(2) are factored and now customisable), and to what people want according to what I read on this github. On the changing of some fatal_error into fatal_user_error, this is subjective, and I don't care that much. It is open to unlimited bikeshedding.

If the principle of the patch is accepted, then it is good to have it in 4.10 to avoid changing between abort/exit semantics back and forth.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Sep 23, 2019

@dbuenzli, does it correspond to what you want?

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Sep 24, 2019

I'm not exactly sure I understand the distinction you are trying to make and/or if that is really needed. Maybe listing the cases for caml_fatal_error and caml_fatal_user_error and try to imagine what can be done in each of the cases could help.

For example the comment by @xavierleroy's from 8630 you linked to says:

Another instance, detailed below, is when the C code tries to raise an OCaml exception and the OCaml code isn't initialized yet. This case was carefully reported like an uncaught exception, and you turned it into an abort(). This is debatable.

As far as I'm concerned this is Invalid_argument, i.e. it's a C library programming error and I'm not sure I care about having the distinction you are making here for that.

@dbuenzli, does it correspond to what you want?

From a high-level perspective what I want is the ability when programming from C, to have enough hooks to be able to treat the OCaml system as a library, including if the OCaml system fails in some way, so that I can shutdown and restart it rather than having it aborting or exiting under my feets.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Sep 24, 2019

Thanks for having a look.

I'm not exactly sure I understand the distinction you are trying to make and/or if that is really needed.

The distinction is not mine, it is current OCaml's. When switching fatal_error to abort by default, some calls to exit have explicitly been excluded from the change. When trying to understand why, one finds the above guideline distinguishing between "internal runtime errors" and "errors in user code". This patch is not about whether there should be such a distinction or not, but whether to give the user the possibility to customise remaining calls to exit.

Maybe listing the cases for caml_fatal_error and caml_fatal_user_error and try to imaging what can be done in each of the cases could help.

I am not sure this would help clarify, because the main purpose of the patch is not to change this distinction (and I hope saying so clarifies), and I do not have strong opinions on which particular error falls in which category.

But if the principle of the patch (first commit) is accepted, there is indeed the opportunity to decide that some calls to fatal_error should actually be fatal_user_error (second commit). This is optional and secondary, but if deemed desirable then it is good to do it for 4.10 to avoid switching back and forth between abort and exit. Let's first discuss the principle (first commit).

For example the comment by @xavierleroy's from 8630 you linked to says:

Another instance, detailed below, is when the C code tries to raise an OCaml exception and the OCaml code isn't initialized yet. This case was carefully reported like an uncaught exception, and you turned it into an abort(). This is debatable.

As far as I'm concerned this is Invalid_argument, i.e. it's a C library programming error and I'm not sure I care about having the distinction you are making here for that.

I am not sure this is the best example indeed. Have a look for instance at calls to fatal_user_error introduced in the first commit (e.g. calls to exit which are being factored), or in the second commit (e.g. out_of_memory which arguably come from bugs in user code).

@dbuenzli, does it correspond to what you want?

From a high-level perspective what I want is the ability when programming from C, to have enough hooks to be able to treat the OCaml system as a library, including if the OCaml system fails in some way, so that I can shutdown and restart it rather than having it aborting or exiting under my feets.

This patch works towards this goal, by making remaining calls to exit(2) in the runtime customisable, and ensuring that I won't have to introduce further naked calls to exit(2) in the future. My feeling around this patch is that I'll need you to champion it/help me make it right, otherwise it might take a while before we (I) revisit this idea again.

Hope this helps.

@dbuenzli
Copy link
Copy Markdown
Contributor

My feeling around this patch is that I'll need you to champion it/help me make it right, otherwise it might take a while before we (I) revisit this idea again.

No problem. I'll try to have a deeper look some time until next monday.

@dbuenzli
Copy link
Copy Markdown
Contributor

I had a closer look at this and I remain unconvinced that a new error path and hook should be introduced for this. I still find the distinction between user and runtime error to be rather contentious e.g. labelling an out of memory error like here:

fatal_user_error "out of memory in uncaught exception handler"

as a user error is dubious to me.

I think that if it is thought to be important to be able to distinguish between fatal errors, then we should rather have a precise account of the actual error condition using an error code rather than have multiple hooks with a contentious notion ("user" error).

In other words I would rather call for a change in the signature of caml_fatal_error to have a precise descriptive error code + an optional msg so that the fatal error hook can detect particular error conditions.

This would allow for example to 1) funnel the current exit(127) through the default caml_fatal_error and 2) avoid repeating the string "out of memory" all over the code base each time that's the fatal one by having a dedicated error code for it.

This would then allow to reach the following design goal:

The runtime system always either exits normally, by explicit user program choice, via caml_sys_exit (which could have its own user defined hook) or abnormally via caml_fatal_error.

P.S. (I noticed along the way there are also a few stray abort calls in the runtime system).

@gadmm gadmm force-pushed the caml_fatal_user_error branch from ba6bf3d to 5dce3bf Compare September 29, 2019 14:39
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Sep 29, 2019

Thank you again for the feedback. A couple of points:

  • I do not dispute that the distinction is a bit subjective; and there are other issues with the example you have shown which is not representative (however a different argument given on the other thread is that producing a core dump on out of memory situations is a bad default setting). It seems to me that you want to change the current practice of reporting fatal errors, into something more elaborate. If you have a concrete proposal for that, go for it! This patch does not try to be revolutionary. I am open to discussing which error should abort or exit by default in a second time.

  • The suggestion that one could call caml_sys_exit makes me think that I have poorly conveyed one premiss: the function caml_fatal_user_error should not touch the ocaml runtime and heap (caml_sys_exit invokes Pervasives.do_at_exit). We are past the point of recovery. If we had access to the runtime, we would do whatever we can to raise an exception, which is the appropriate thing to do rather than calling caml_sys_exit. In a sense, caml_sys_exit already has hooks in the form of the uncaught exception handler and the at_exit callbacks. Fatal_user_error is meant for after all this has failed (and exit is preferable to abort).

  • Further looking at calls to caml_sys_exit(2) in the runtime, I find that none of them make sense. They should be replaced either with an exception, or a call to fatal_user_error. Furthermore, the Unix.sys_exit function is not going to be reliable for resource-conscious ocaml programs.

  • If I am not mistaken, exit(127) only happens when called from the command line, so it does not go against the "ocaml as a library" point of view, it should be fine to leave it as is.

  • For the remaining aborts, I am not sure that spacetime and clambda_checks count, do they? Otherwise the only remaining abort is in caml_failed_assert, which is exclusively called in CAMLassert, which only runs in debug mode. I think this is fine.

One thing I can propose is to merge the two fatal hooks into one and have it take an error enum as an argument. Only the wrappers would differ. However, I still do not see the need for more than two wrappers caml_fatal_error and caml_fatal_user_error (the first one has to remain for backwards-compatibility so we have at least 2 wrappers).

@dbuenzli
Copy link
Copy Markdown
Contributor

  • It seems to me that you want to change the current practice of reporting fatal errors, into something more elaborate. If you have a concrete proposal for that, go for it! This patch does not try to be revolutionary. I am open to discussing which error should abort or exit by default in a second time.

Being revolutionary or not is besides the point, this PR proposes something new to solve an exisiting problem, and I don't think this new thing is appropriate. I'm only suggesting another way to go about these things.

  • The suggestion that one could call caml_sys_exit makes me think that I have poorly conveyed one premiss: the function caml_fatal_user_error should not touch the ocaml runtime and heap (caml_sys_exit invokes Pervasives.do_at_exit). We are past the point of recovery.

caml_sys_exit is what gets called when a user calls exit I'm not suggesting any runtime system error should call that. I'm precisely suggesting only the user program should call caml_sys_exit via Stdlib.exit.

What I wrote is that as a C observer of the OCaml runtime system I should only ever witness a termination of the runtime system either via caml_sys_exit (the user program exited normally and so does the runtime system) or caml_fatal_error (abnormal runtime termination).

  • For the remaining aborts, I am not sure that spacetime and clambda_checks count, do they? Otherwise the only remaining abort is in caml_failed_assert, which is exclusively called in CAMLassert, which only runs in debug mode. I think this is fine.

I don't really see what's wrong with simply calling caml_fatal_error on these. Again when I look the problem from C I want to be sure I get a say on how the runtime exits.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Oct 3, 2019

I agree with many things you wrote. Let us focus on the topic of the current PR: avoiding naked exit(2)s.

I would disagree with changing the current PR into any of the following:

  1. A proposal to change remaining calls to exit(2) into calls to abort()
  2. A proposal to introduce a fine-grained categorisation of kinds of user errors

IIUC, we disagree over 2. My concern is:

  • I believe that we are not going to rely on error codes that much programmatically
  • I want to avoid over-engineering
  • The current codebase informally already distinguishes two kinds of fatal error (exit(127) do not count, and remaining naked abort() and caml_sys_exit(2) can be the topic of a different PR/bug report.)

I mean, I do not exclude volunteering for it, but I would need to hear strong support for your idea before I start coding anything. In addition, I am concerned that this is a bit of work and that the deadline for this idea is OCaml 4.10.

With this in mind, I see two solutions where there are two fatal error functions and:

  1. two separate hooks
  2. one hook that receives an error code (with two possible values) as argument.

Another thing to think about is whether although imperfect for you, this can be a step in the good direction, that can be improved later.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Oct 3, 2019

@gadmm I don't really know what to say, I'm kind of repeating myself at that point.

I have the impression we should not multiply the hooks for abnormal termination and I do not find the notion of "user" error to be meaningful or precise.

When people will develop the runtime system and need to terminate the system they will constantly have to ask themselves is this a caml_fatal_error or caml_fatal_user_error ? I suspect this is not going to end up that well. An API that allows them to indicate it is this error condition (uncaught exception, out of memory, etc.) will be much less contentious and allow the hook to make better decisions about what should be done about it. I don't find this to be overly engineered.

Maybe a core dev should chime in with his own view on this.

(Also personally I rather have a good solution in say 4.11 than rush something in 4.10 to have it changed later)

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Oct 3, 2019

Of course, I mention 4.10 only because there will be fewer options after the current mechanism reaches users.

I would rather see the number of fatal error situations decrease rather than increase, that's why I do not see this dilemma arising. I also suspect that the categorisation of errors would be less clear-cut than you suggest. The mechanism for handling serious errors programmatically is exceptions. The hope with the approach of merely consolidating the current practice was to avoid discussing the semantics of fatal errors, about which I do not have opinions.

For now I do not see this PR converging (but that's ok).

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jan 16, 2020

I come back to this PR to decide what to do with it. One proposed alternative is now ruled out due to the freeze of 4.10 (and the signature of its caml_fatal_error_hook): have a single hook taking an argument denoting the kind of error.

I have a couple of questions to decide:

  1. Is there an agreement that the following goal is a good one? “When programming from C, [I would like] to have enough hooks to be able to treat the OCaml system as a library, including if the OCaml system fails in some way, so that I can shutdown and restart it rather than having it aborting or exiting under my feets.”

Without waiting for an answer, I will assume it is "yes" because this sounds sensible to me.

  1. Is there somebody actively thinking about a different design and working on an alternative implementation? (e.g. @dbuenzli)

If not I keep pushing for the current solution because I sympathize with the goal 1) and the current PR is the most straightforward one to achieve it, possibly with amendments if you wish. It is better to have a solution rather than hopefully wishing for an ideal solution which never comes.

One interest for the current patch is that it reverts the default behaviour on fatal out of memory situations to exiting instead of aborting, and I have read comments from at least two people here who reported that doing a core dump on a out of memory situation is a very bad idea.

@gadmm gadmm mentioned this pull request Jan 21, 2021
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants