Memprof: wrap uncaught exceptions before re-raising them by gadmm · Pull Request #9267 · ocaml/ocaml

gadmm · 2020-01-27T22:24:31Z

Before this patch, a Memprof callback can potentially raise any exception, including for instance Not_found out of thin air. This means that when doing try ... with Not_found ->..., this might not
always mean what the programmer intends. This is a well-known theoretical bug already affecting the raising behaviour of finalisers (#8873).

In general, any exception raised by Memprof is an asynchronous exception subject to the discipline of ML interrupts: one must not discriminate interrupts, and an interrupt must always be re-raised promptly except at isolation boundaries (e.g. at toplevel for hardening purposes).

This patch avoids this sort of bugs, and it encourages and facilitates following the discipline of interrupts for catching exceptions arising from Memprof callbacks, by wrapping any uncaught exception inside a new exception Memprof_raised.

Note that some code in the wild exists of the form try ... with Sys.Break -> .... Such code does not follow the interrupt discipline as it discriminates on interrupts. With this patch this practice becomes incompatible with Memprof (similarly to what it is already with Fun.protect). Users of such code who want to use Memprof must fix it to properly catch all exceptions, and can record the cancellation
state into a boolean flag inside the signal handler of SIGINT for the current thread.

Like for Fun.Finally_raised, choosing a particular form of enforcing a discipline for interrupts should be subject to careful discussion. In the case of this PR, the behaviour of Memprof callbacks and exceptions arising from them is documented as experimental still.

(cc @stedolan, @jhjourdan.)

jhjourdan · 2020-01-29T09:35:11Z

I do understand what you are trying to fix with this PR. However, even though your patch indeed fixes the case where a (bogus) memprof callback do not handle a synchronous exception (such as Not_found), it breaks the case where a program uses e.g., Sys.break to interrupt a program when Ctrl-C (or another asynchronous event) occurs. You seem to imply that this is a bad idea, and that program should use polling on a Boolean variable instead, and never catch these exceptions explicitly.

Doesn't this mean that these exceptions cannot be used directly by the OCaml program and should only be interpreted by the end user? If we take this decision, then Memprof should probably not propagate exceptions.

Moreover, you write: "As a general rule, this exception should not be caught except as part of a catch-all handler around isolated boundaries of your program.". Could you clarify what you mean by "isolated boundaries" and what kind of catch-all handler you do have in mind?

Last but not least: Memprof is supposed to be a tool to analyze the behavior of existing OCaml program. This is probably not a good idea to ask these program to change their behavior wrt. Sys.Break before using the tool...

stedolan · 2020-01-29T14:44:06Z

it breaks the case where a program uses e.g., Sys.break to interrupt a program when Ctrl-C (or another asynchronous event) occurs

I don't understand this. How does the behaviour of such a program change with this patch? I can't see where anything different would happen than what happens at the moment.

jhjourdan · 2020-01-29T16:12:59Z

it breaks the case where a program uses e.g., Sys.break to interrupt a program when Ctrl-C (or another asynchronous event) occurs

I don't understand this. How does the behaviour of such a program change with this patch? I can't see where anything different would happen than what happens at the moment.

If Ctrl-C is pressed during the memprof callback, then Sys.Break will be wrapped and the client code will not be able to use try ... with Sys.Break -> ... for catching it.

stedolan · 2020-01-29T17:23:05Z

Of course, thanks for the explanation.

gadmm · 2020-03-24T06:00:43Z

You are right that there is a lot more explaining to do. Sorry for the delay.

Making the issue more concrete

I do understand what you are trying to fix with this PR. However, even though your patch indeed fixes the case where a (bogus) memprof callback do not handle a synchronous exception (such as Not_found)

Nevertheless, let me make the issue more concrete.

Currently and since forever, Scanf internals rely on catching the Failure exception. And many other places in the stdlib catch it, and many places raises it. This code obviously assumes that the exception caught is the one expected. I have read a lot of finalisers' code from opam, and concluded that finalisers in the wild are complicated enough that I could not tell for sure that a bug could not resulting in Failure being raised unintentionally in a finaliser and thus interfering with these trywiths. (I have already seen Assert_failure being raised purposefully in a finaliser, and the documentation is not clear about Failure being catchable vs. Assert_failure being unexpected.) So, I believe that programmers in practice write complicated-enough finalisers to expose themselves to this sort of bug. As it stands, the language can only let you know about it at runtime, usually as an uncaught exception. Even if you believe that this only fixes secondary-consequences of primary bugs, such bugs should not be transformed into cosmic-ray-like corruption of the result of an unrelated function.

Now while I think the wrapping of exceptions would in a vacuum be a solution for finalisers, I did not push for it at #8873 because the scale of the breaking change was too large with no realistic transition path, and it might end up solved in a different manner for multicore-specific reasons. However we can avoid this issue with Memprof from the start.

When I discussed the semantics of exceptions arising from Memprof callbacks, it was under the assumption that something like that would be necessary.

What isolated boundaries means

Moreover, you write: "As a general rule, this exception should not be caught except as part of a catch-all handler around isolated boundaries of your program.". Could you clarify what you mean by "isolated boundaries"

OCaml has de facto two ways of ensuring the consistency of mutable state should an exception arise.

With the one, each piece of code is responsible for managing the consequences of being interrupted by an exception and in particular the consistency of mutable state (see Abraham's work on exception-safety in the C++ standard library which defines various kinds of exception-safety requirements). This model is suitable for remediable errors (Failure as used in the stdlib) and jumps for instance. With this model, when manipulating state, the programmer has to be aware of all the exceptions that might arise.

With the other, it is the person who catches an exception who is responsible for ensuring that no internal state (which may be corrupted) escapes from the scrutinee of the try clause, whereas each piece of code remains responsible for the consistency of the external state, which is done by using resource-management patterns (cf. Rust's panics and the UnwindSafe trait). This model of reasoning therefore determines by definition a boundary between what is internal and external to the scrutinee, and ensures safety through its isolation : any state shared with the exterior must be mediated with resource-management patterns. This model is suitable for exceptions that the user cannot expect, such as asynchronous exceptions and bugs. It makes it simpler for the programmer (they do not have to reason about the unexpected), and shifts the burden onto the person who catches the exception without re-raising it. For instance, this is what is done in Coq to safely interrupt running functions (including, I have been told, the use of a mechanism similar to poisoning like in Rust).

The exception safety model should be clarified separately

These are collected facts about how one deals with exceptions, including this notion of isolated boundary, which seems to come naturally when programming with asynchronous exceptions. Now, I have been involved in many discussions on this github showing that OCaml's exception-safety model ought to be clarified. I think this is an important thing for the evolution of the language, but it is a much bigger task that should be done separately. So I preferred to be succinct in the documentation. One option to clarify it could be to already agree on some common terminology, for instance, but not go too much into the details.

Moreover, you can already expect that it will be necessary as part of this clarification to assign to each exception defined one of the two exception-safety model (e.g. Failure, Not_found, Exit are always treated in the first way, asynchronous exceptions are always in the second). So, the language itself has to respect the distinction and not transform arbitrary exceptions into asynchronous exceptions. This is the second motivation for the current change, which ensures that memprof callbacks always raise the same exception.

The importance of using a catch-all

and what kind of catch-all handler you do have in mind?

In addition to isolation, there is another requirement for safely handling unexpected exceptions. It directly follows from the "double raise" issue: an exception might arise while handling another exception before re-raising (cf. Fun.protect). The direct consequence is that an exception from the second category can hide behind another exception, so catching an unexpected exception by its name is not sufficient, and the only sensible thing to do is to catch all exceptions at the boundary (try ... with _ -> ...). (In the clarification of the exception-safety model, I have a way to fix this problem but the solution is far away, see at the end.) In particular, unexpected exceptions cannot carry information.

Coq might get away with this problem by pretending that there is only one unexpected exception Sys.Break. Standard ML (way before OCaml) solved this by saying that there is indeed only one exception in this category, Interrupt, which can mean all of Ctrl+C, Out_of_memory, etc. (see again in particular the discipline Isabelle/ML has acquired for handling interrupts). With Fun.protect, when wrapping the exception raised in the finally inside Finally_raised, it was argued that forcing unexpected exceptions to only be caught with a catch-all (e.g. not being able to discriminate Sys.Break from Fun.Finally_raised Sys.Break) was a bit like having a single exception that does not carry information, while it remains backwards-compatible. This is what I mean when I write that Memprof_raised should only be caught with a catch-all handler.

Allowing interrupts even if they do not carry information

You seem to imply that this is a bad idea, and that program should use polling on a Boolean variable instead, and never catch these exceptions explicitly. Doesn't this mean that these exceptions cannot be used directly by the OCaml program and should only be interpreted by the end user? If we take this decision, then Memprof should probably not propagate exceptions.

The boolean flag is not meant for active polling, but to indicate that the program is in a certain interrupted state, and set before raising an asynchronous exception. An asynchronous exception can be used to interrupt the program without active polling from the interrupted code, but just cannot reliably carry information (e.g. "interrupted by the user"). One obvious solution is to use such a flag, and is only needed if the program relies on this information.

Alternatives to catching Sys.Break

it breaks the case where a program uses e.g., Sys.break to interrupt a program when Ctrl-C (or another asynchronous event) occurs

Last but not least: Memprof is supposed to be a tool to analyze the behavior of existing OCaml program. This is probably not a good idea to ask these program to change their behavior wrt. Sys.Break before using the tool...

You are correct point out that Coq and similar programs cannot just naively turn Memprof on with this patch and expect Ctrl+C to work as it does usually. I recommended to move away from the style of handling asynchronous exceptions with try ... with Sys.Break -> ..., but this is the same change they have to do if they started using Fun.protect, which will give rise to Fun.Finally_raised Sys.Break exceptions (as it was discussed in length at the time). This derives from fundamental limitations of interrupts. And in general it is not the only place where exceptions get wrapped currently, see e.g. Dynlink.Error.

I gave a general solution, but note also that the casual Memprof user can consider other options for this specific problem: 1) masking (once it will be available), 2) if they insist on using fragile patterns and want a quick & dirty solution temporarily, they can do try ... with Sys.Break | Gc.Memprof_raised Sys.Break -> .... There's something for everyone!

Evolving a language

I think I propose the correct strategy: do not break existing programs, but those who want to opt into the new feature are encouraged to convert to the new discipline. First, like Stephen wrote, this change does not break currently-working programs in the literal sense, given that Memprof is still experimental. This would be different if, say, we wanted to change the raising behaviour from finalisers (too much legacy code). (And it makes it urgent to fix the issue, before the behaviour fossilises and becomes a de facto standard.) Second, there is a realistic evolution path through which programs can gradually implement the new discipline already required by other parts of the stdlib.

I am worried of the implications of rejecting this strategy on the basis that it breaks programs. First, as far as async exceptions are concerned, not every programming pattern that you can come up with is meaningful and one currently has to depend on heavy unwritten programmer discipline. I have sometimes heard “these programs deserve to be broken” to describe such a situation, which is a bit glib but I think it reflects a sound way of reasoning about PL evolution: one aims to preserve meaningful idioms, not the language in the literal sense which can include programming patterns not amenable to reasoning. I fear that argumenting that this design breaks future programs on the basis that users are entitled to their fragile programming patterns, is a line of reasoning that could make it impossible to evolve the language to fix its current issues any further. (On the other extreme it was suggested that async exceptions are so bad that they should be removed from the language; as you know I want to achieve a more reasonable solution.)

Alternative designs

You might wonder if there could be an alternative design where some exceptions are treated specially like in Isabelle/ML, e.g. instead of:

try f x with e -> raise (Memprof_raised e)

one would do:

try f x with e -> raise (if is_interrupt e then e else Memprof_raised e)

where is_interrupt would classify exceptions in the second category. The problem is that this is not enough to get rid of the double-raise issue. There are two solutions:

Further enforce that there is only one Interrupt like in Isabelle/ML (e.g. define Sys.Break, Out_of_memory, Fun.Finally_raised, Memprof_raised, Assert_failure... to all be equal to a unique Interrupt exception). But this is a wildy-breaking change, and the path I propose with a documented programmer discipline that one should use a catch-all in the end gives the same result without breaking programs.
Skip code that might raise a second exception while handling an interrupt before re-raising it. The release of resources would still happen because we already know that they must not fail and they must mask asynchronous exceptions. The upside is that you earn the possibility of discriminating interrupts as people try to do now, but in a reliable way. The downside is that the machinery to treat release functions specially is not there yet, and it needs to be coupled with a language feature which is expressive enough that we can mandate that all resource management goes through it. So this solution is still far, far away (and if OCaml becomes mature enough one day to consider it, for backwards-compatibility reasons it would most likely be though a new mechanism programmers would transition to gradually). And the possiblity of having a better solution one day encourages me towards a lightweight discipline to fix today's problems.

Again, sorry for the delay, you were right to request these clarifications on the design, and please do not hesitate to ask if you think a point is missing or incomplete.

gadmm · 2020-03-26T21:18:16Z

Sorry for the noise. I pushed a bogus commit due to a rebase mistake.

Before this patch, a Memprof callback can potentially raise any exception, including for instance Not_found out of thin air. This means that when doing "try ... with Not_found ->...", this might not always mean what the programmer intends. This is a well-known theoretical bug affecting the raising behaviour of finalisers (ocaml#8873). In general, any exception raised by Memprof is an asynchronous exception subject to the discipline of ML interrupts (see e.g. http://isabelle.in.tum.de/dist/library/Doc/Implementation/ML.html): one must not discriminate interrupts, and an interrupt must always be re-raised promptly except at isolation boundaries. This patch avoids this sort of bugs, and it encourages and facilitates following the discipline of interrupts for catching exceptions arising from Memprof callbacks. Note that some code in the wild exists of the form `try ... with Sys.Break -> ...`. Such code does not follow the interrupt discipline as it discriminates on interrupts. With this patch this practice becomes incompatible with Memprof (similarly to what it is already with Fun.protect). Users of such code who want to use Memprof must fix it to properly catch all exceptions, and can record the cancellation state into a boolean flag inside the signal handler of SIGINT for the current thread.

gadmm force-pushed the memprof_raised branch 4 times, most recently from f6ed7a9 to 4ae4d52 Compare March 26, 2020 21:17

This was referenced Mar 31, 2020

Gc.Memprof.start: take a record instead of 5 optional parameters #9254

Merged

Memprof new api #8920

Merged

gadmm mentioned this pull request Apr 20, 2020

Memprof: fatal error if thread is stopped from a callback. #9458

Merged

stedolan self-assigned this Apr 22, 2020

gadmm added 2 commits September 22, 2020 16:32

[cosmetic] Fix documentation

4086fb8

gadmm force-pushed the memprof_raised branch from 4ae4d52 to 4086fb8 Compare September 22, 2020 14:39

gadmm mentioned this pull request Jan 19, 2023

statmemprof is absent in OCaml 5.0 #11911

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memprof: wrap uncaught exceptions before re-raising them#9267

Memprof: wrap uncaught exceptions before re-raising them#9267
gadmm wants to merge 2 commits intoocaml:trunkfrom
gadmm:memprof_raised

gadmm commented Jan 27, 2020

Uh oh!

jhjourdan commented Jan 29, 2020

Uh oh!

stedolan commented Jan 29, 2020

Uh oh!

jhjourdan commented Jan 29, 2020 •

edited

Loading

Uh oh!

stedolan commented Jan 29, 2020

Uh oh!

gadmm commented Mar 24, 2020 •

edited

Loading

Uh oh!

gadmm commented Mar 26, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

gadmm commented Jan 27, 2020

Uh oh!

jhjourdan commented Jan 29, 2020

Uh oh!

stedolan commented Jan 29, 2020

Uh oh!

jhjourdan commented Jan 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stedolan commented Jan 29, 2020

Uh oh!

gadmm commented Mar 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Making the issue more concrete

What isolated boundaries means

The exception safety model should be clarified separately

The importance of using a catch-all

Allowing interrupts even if they do not carry information

Alternatives to catching Sys.Break

Evolving a language

Alternative designs

Uh oh!

gadmm commented Mar 26, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jhjourdan commented Jan 29, 2020 •

edited

Loading

gadmm commented Mar 24, 2020 •

edited

Loading