Skip to content

caml_ml_open_descriptor_out memory leak and file corrupt at_exit #12300

@edwintorok

Description

@edwintorok

Self-contained testcase reproduced on OCaml 4.14.1 (but should also reproduce on 4.08.1):

let ignored = ref 0
let ignore_exn f = try f () with Out_of_memory as e -> raise e | _ -> incr ignored

let wrap ~finally f =
  match f () with
  | () ->
      finally ()
  | exception e ->
      let bt = Printexc.get_raw_backtrace () in
      ignore_exn finally;
      Printexc.raise_with_backtrace e bt

let leaktest fd =
  (* not allowed to close 'fd' here, because it is not the channel that opened it *)
  let ch = Unix.out_channel_of_descr fd in
  wrap
    ~finally:(fun () -> flush ch )
      (* flush can raise here, which leaves unflushed data, causing a leak *)
    (fun () -> output_string ch "socketpairchannel")

let doleak = true

let consume = Thread.create @@ fun fd ->
  let b = Bytes.make 65535 ' ' in
  while Unix.read fd b 0 (Bytes.length b) > 0 do
    ()
  done;
  Unix.close fd

let () =
  let fdsend, fdrecv = Unix.socketpair Unix.PF_UNIX Unix.SOCK_STREAM 0 in
  let (_:Sys.signal_behavior) = Sys.signal Sys.sigpipe Sys.Signal_ignore in
  if doleak then
    Unix.close fdrecv
  else
    consume fdrecv |> ignore;
  (* this will make sending immediately raise *)
  for _ = 1 to 10000 do
    let () = ignore_exn @@ fun () -> leaktest fdsend in
    ()
  done ;
  Unix.close fdsend ;
  (* trigger the corruption bug at exit, now reuse these FD numbers for something completely different *)
  let test_open i =
    Unix.openfile
      (Printf.sprintf "leaktest%d.txt" i)
      [Unix.O_RDWR; Unix.O_TRUNC; Unix.O_CREAT]
      0o755
  in
  let _fd1, _fd2 = test_open 1, test_open 2 in
  (* ok we leaked 2 file descriptors here, but the OCaml runtime shouldn't corrupt the data that is there by flushing another channel's data into it,
     and there might be other reasons why these FDs are open at exit, e.g. some other global channel/etc.
   *)
()
$ ocamlfind ocamlopt leak.ml -o leak -package unix,threads.posix -linkpkg -thread
$ (ulimit -v 123456; ./leak)
Fatal error: exception Out of memory
$ ./leak
$ ls -l leak*
ls -l leak*.txt
-rwxr-xr-x 1 edvint edvint      0 Jun 12 15:28 leaktest1.txt
-rwxr-xr-x 1 edvint edvint 170000 Jun 12 15:28 leaktest2.txt

There are 2 bugs here:

  • './leak' is using a lot of memory (when run under ulimit it gets an OOM exception),
  • when it doesn't OOM and finishes then 'leaktest2.txt' has some data written to it, but looking at the source code we never write data to that file (we just open it).

(Bear with me until the 'Analysis' section below on why I'm reporting 2 bugs in the same issue: they are caused by the same piece of code)

(You can also run the program without ulimit, increase the 'for' loop limit and watch the process eat gigabytes of memory in 'top')

Analysis

Each 'struct channel' allocated when opening a channel (from a fd) contains a 64k buffer:

#define IO_BUFFER_SIZE 65536
char buff[IO_BUFFER_SIZE]

I think the leak is caused by this code in the finalizer of the channels:

if (chan->max == NULL && chan->curr != chan->buff){
    /*
      This is an unclosed out channel (chan->max == NULL) with a
      non-empty buffer: keep it around so the OCaml [at_exit] function
      gets a chance to flush it.  We would want to simply flush the
      channel now, but (i) flushing can raise exceptions, and (ii) it
      is potentially a blocking operation.  Both are forbidden in a
      finalization function.

      Refs:
      http://caml.inria.fr/mantis/view.php?id=6902
      https://github.com/ocaml/ocaml/pull/210
    */
    if (chan->name && caml_runtime_warnings_active())
      fprintf(stderr,
              "[ocaml] (moreover, it has unflushed data)\n"
              );

This is a 64KiB/channel leak, which can happen e.g. if the underlying file descriptor returns an error on flush,
such that flush never succeeds (e.g. it is a socket and the other end has closed the connection).
This can be particularly problematic for a long running daemon using sockets (e.g. a webserver).

The comment says that the reason the channel is kept alive is such that the data may be flushed at exit.
However by that time the underlying file descriptor may have been closed if the channel was constructed by the Unix module for example, and the 'fd' is not owned by the channel itself.
In fact it could be worse than closed: some unrelated file descriptor may have reused its number by the time the process exits, at which point it will flush one channel's data into some other file descriptor's file.... (potentially corrupting whatever data was meant to be in that other file).
Thus this is not merely an OOM bug, it is also a correctness bug.

Suggested fix

  • Remove this 'flush on exit' behaviour for channels which got created by the Unix module: the channel does not have full control of the file descriptor and it may have been closed. Could potentially cause unexpected behaviour for applications that rely on this (unsafe) behaviour

  • Remove this behaviour for channels that are based on Unix sockets (whether the channel fully manages the file descriptor or not): retaining it to be flushed on exit may not work

  • Change 'caml_ml_flush' to set a flag such that the finalizer may discard the channel if flushing was attempted but failed.

The latter would fix the immediate bug reported here, although it might still leave the possibility open that if someone forgets to call 'flush' then 'at_exit' the runtime may write data into the wrong file...

I'm happy to help with creating a patch for these, but would be happy to hear thoughts on how else this could be fixed.

Background

We've been trying to track a leak in a long running OCaml daemon for the past 3 months that only happened in a production environment, and caused that particular OCaml daemon to end up using gigabytes of memory after a while.
Nothing suspicious as far as OCaml heap usage goes, but using 'jemalloc' we got some statistics from the C side that showed that caml_ml_open_descriptor_out -> caml_open_descriptor_in -> caml_stat_alloc was leaking about 800 MiB+ of memory.

The caller was this OCaml code:

let with_out_channel_output fd f =
  let oc = Unix.out_channel_of_descr fd in
  finally
    (fun () ->
      let output = Xmlm.make_output (`Channel oc) in
      f output
    )
    (fun () -> flush oc)

We looked at this code and concluded that because we always call flush then the finalizer should always free the memory and the above codepath should not be entered. The breakthrough came this morning when we realized that these are sockets, and that flushing may indeed fail if the other end has already closed the connection, and with that information I've been able to write the above testcase that immediately reproduced the problem.

There really isn't much this code could do better: it cannot close the channel because that would close the underlying Unix file descriptor which this function does not own (it got passed as a parameter), and there isn't anything other than 'flush' it could call to 'clear' the pending data buffer AFAIK either.

We could refactor the code to use channels instead of Unix file descriptors, such that this function gets a channel passed in (that it can 'close_noerr' on error), but the OCaml runtime leak should still be fixed.

Looking closer at the reason for the OCaml runtime leak (attempting to flush the channel 'at_exit') that is unsafe in general: the underlying Unix file descriptor could've been closed and reused by something else at which point 'at_exit' will "corrupt" this other file by writing some other channel's data into it.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions