Skip to content

Restore basic functionality to the bytecode debugger#11065

Merged
xavierleroy merged 17 commits intoocaml:trunkfrom
damiendoligez:fix-bytecode-debugger
Jul 1, 2022
Merged

Restore basic functionality to the bytecode debugger#11065
xavierleroy merged 17 commits intoocaml:trunkfrom
damiendoligez:fix-bytecode-debugger

Conversation

@damiendoligez
Copy link
Copy Markdown
Member

@damiendoligez damiendoligez commented Feb 25, 2022

This is work in progress. I still need to figure out what must be done to handle backtraces in the presence of fibers.
[edit: now it's ready]

@OlivierNicole
Copy link
Copy Markdown
Contributor

I was also looking at this, but you are more advanced. If I can help with reviews or testing, let me know.

@xavierleroy
Copy link
Copy Markdown
Contributor

This looks rather good to me. Why don't you remove the Draft status so that it can get reviewed properly?

@xavierleroy
Copy link
Copy Markdown
Contributor

Ping. Why don't you remove the Draft status so that it can get reviewed properly?

@damiendoligez damiendoligez force-pushed the fix-bytecode-debugger branch 2 times, most recently from d9f30d9 to 3a4746d Compare March 23, 2022 10:10
@damiendoligez
Copy link
Copy Markdown
Member Author

Stack backtrace is now working. The next instruction still behaves a little strange around fibers, but the debugger is already usable. I'll do another PR for next but I think it's less urgent.

@damiendoligez damiendoligez marked this pull request as ready for review March 23, 2022 11:45
xavierleroy
xavierleroy previously approved these changes May 17, 2022
Copy link
Copy Markdown
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks a lot for this work.

I was able to run the demo program for effects from the OCaml 5 manual. It's fun to step through all these continuations...

I added a commit that stops the debuggee cleanly if it spawns a domain. This way we can still debug up to the point where a domain is created. This is like #10594.

@fabbing
Copy link
Copy Markdown
Contributor

fabbing commented May 17, 2022

A bit of code may have been lost during the PR and this should probably be added because the trap barrier doesn't seem to work anymore:

diff --git a/runtime/debugger.c b/runtime/debugger.c
index 187f681135..a220915784 100644
--- a/runtime/debugger.c
+++ b/runtime/debugger.c
@@ -560,6 +560,8 @@ void caml_debugger(enum event_kind event, value param)
       caml_flush(dbg_out);
       break;
     case REQ_SET_TRAP_BARRIER:
+      i = caml_getword(dbg_in);
+      CAMLassert(i == frame_block_number(frame_block));
       i = caml_getword(dbg_in);
       Caml_state->trap_barrier_off = -i;
       break;

I noticed the problem, thanks to @OlivierNicole, while trying to review this PR (I'm very new to bytecode and ocamldebug).

However, I don't think it's enough and trap_barrier_off should probably be a tuple { block, offset }.
Even a tuple may still cause issue if we switch to a different fiber at the same depth (block), in this case raising an exception could trigger a trap barrier into the wrong fiber.

@damiendoligez damiendoligez marked this pull request as draft May 19, 2022 08:47
@damiendoligez
Copy link
Copy Markdown
Member Author

damiendoligez commented May 19, 2022

@fabbing Thanks for finding this problem. This breaks the finish debugger command. I'm working on a solution.

@damiendoligez damiendoligez force-pushed the fix-bytecode-debugger branch from 0a3e028 to dd23dd0 Compare May 19, 2022 15:51
@fabbing
Copy link
Copy Markdown
Contributor

fabbing commented May 23, 2022

This is better than the "depth" in terms of number of fibers but will run into troubles if the fiber is reallocated since the struct stack_info* will change (see caml_try_realloc_stack from runtime/fiber.c).
As far as I know, there is currently no way to identify a fiber.
I was able to have a quick discussion on this topic with @kayceesrk and it seemed reasonable to add an unique identifier for each fiber. I'm working on a PR for this, I hope to have it ready soon.

@damiendoligez
Copy link
Copy Markdown
Member Author

Adapted to use fiber IDs instead of addresses.
Another possibility would be to have caml_try_realloc_stack update Caml_state->trap_barrier_block as needed.

@fabbing
Copy link
Copy Markdown
Contributor

fabbing commented Jun 9, 2022

I've resumed reviewing the PR and it looked good to me.
But I noticed that the tool-debugger tests from the testsuite where still skipped and tried to re-enable them again. All but one pass. The dynlink test fails with Fatal error: Fatal error during unlock: Operation not permitted.

It's not straightforward to debug such test as many fork are involved in the process.
I was able to pinpoint the issue at channel_mutex_unlock_default while attempting to create a checkpoint (debugger.c:485).
I don't understand why this only happens in this test and keep investigating!

@gasche
Copy link
Copy Markdown
Member

gasche commented Jun 9, 2022

With the OCaml 5 runtime you cannot safely call fork from C and then execute any OCaml code on the child that would release the runtime lock, you have to do a special dance first (and meditate on the sins of doing anything else than running an exec* function). The error I encountered in this situation is similar to yours, maybe look at #11111 to see if it is related; in your stead I would look at uses of fork in runtime/debugger.c.

i = fork();
Lock (dbg_in);
Lock (dbg_out);
caml_acquire_domain_lock ();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing this code with caml_atfork_default in domain.c, maybe you need to call caml_reset_domain_lock before re-acquiring the lock?

@fabbing
Copy link
Copy Markdown
Contributor

fabbing commented Jun 15, 2022

Thanks a lot for the hint @gasche. I had a look in that direction but it turns out to be something different:

This test raises an exception (Not_found from Sys.getenv) which calls Unlock_exn / channel_mutex_unlock_exn_default and since dbg_out channel was locked last, causes it to be unlocked.
Later, when trying to create a checkpoint, an attempt is made to unlock dbg_out before forking which fails since it's no longer locked.

A not very graceful solution is to set last_channel_locked to NULL to avoid this:

--- a/runtime/caml/io.h
+++ b/runtime/caml/io.h
@@ -115,6 +115,8 @@ CAMLextern struct channel * caml_all_opened_channels;
 #define Flush_if_unbuffered(channel) \
   if (channel->flags & CHANNEL_FLAG_UNBUFFERED) caml_flush(channel)
 
+CAMLextern void caml_channel_forget_last(void);
+
 /* Conversion between file_offset and int64_t */
 
 #define Val_file_offset(fofs) caml_copy_int64(fofs)
diff --git a/runtime/debugger.c b/runtime/debugger.c
index 861c37e3bb..448098e2fb 100644
--- a/runtime/debugger.c
+++ b/runtime/debugger.c
@@ -143,6 +143,7 @@ static void open_connection(void)
      private to this code and never used concurrently. */
   Lock (dbg_in);
   Lock (dbg_out);
+  caml_channel_forget_last();
 
   if (!caml_debugger_in_use) caml_putword(dbg_out, -1); /* first connection */
 #ifdef _WIN32
@@ -479,6 +480,7 @@ void caml_debugger(enum event_kind event, value param)
       i = fork();
       Lock (dbg_in);
       Lock (dbg_out);
+      caml_channel_forget_last();
       caml_acquire_domain_lock ();
       if (i == 0) {
         close_connection();     /* Close parent connection. */
diff --git a/runtime/io.c b/runtime/io.c
index 78bfea6d8a..fae4d6239e 100644
--- a/runtime/io.c
+++ b/runtime/io.c
@@ -106,6 +106,11 @@ CAMLexport void (*caml_channel_mutex_unlock_exn) (void)
 /* List of opened channels */
 CAMLexport struct channel * caml_all_opened_channels = NULL;
 
+CAMLexport void caml_channel_forget_last(void)
+{
+  last_channel_locked = NULL;
+}
+
 /* Basic functions over type struct channel *.
    These functions can be called directly from C.
    No locking is performed. */

Thanks again to @OlivierNicole for his help in understanding channel behaviour.

Unfortunately, this dynlink test still fails because unable to place a breakpoint in Plugin:

Can't find any event there.

Investigation continues!

@damiendoligez damiendoligez marked this pull request as ready for review June 15, 2022 13:56
@damiendoligez damiendoligez dismissed xavierleroy’s stale review June 15, 2022 13:56

Significant changes were added since this review, and the PR is not ready yet.

@damiendoligez damiendoligez force-pushed the fix-bytecode-debugger branch from 2eba56d to 8bb3526 Compare June 15, 2022 15:17
@damiendoligez
Copy link
Copy Markdown
Member Author

Can't find any event there.

Fixed by 3b34535, let's see if the testsuite passes now.

@fabbing
Copy link
Copy Markdown
Contributor

fabbing commented Jun 21, 2022

Do we also want a "symmetrical" behavior with resume that would trigger a trap like perform?

--- a/runtime/interp.c
+++ b/runtime/interp.c
@@ -1284,6 +1284,7 @@ value caml_interprete(code_t prog, asize_t prog_size)
       goto do_resume;
 
 do_resume: {
+      check_trap_barrier_for_effect(domain_state);
       struct stack_info* stk = Ptr_val(accu);
       if (stk == NULL) {
          accu = Field(caml_global_data, CONTINUATION_ALREADY_TAKEN_EXN);

Either way it looks good to me!

@xavierleroy
Copy link
Copy Markdown
Contributor

can we get an official review ?

Yes, but please leave me 1-2 weeks to work on it. I'm thinking about a radical solution to the locking problems that have plagued this PR. Stay tuned.

@xavierleroy
Copy link
Copy Markdown
Contributor

Here is a preview of the "radical solution" I propose: damiendoligez#7

Ideally, I should make a separate PR with the basic changes, have it merged, and then you would rebase this PR on top of this work. But first I'd like you (@damiendoligez , @fabbing , ...) to have a quick look at the approach and comment.

damiendoligez and others added 17 commits July 1, 2022 09:07
messages right and install it so the debugger can step through it.
Otherwise the debugger gets all confused.

It is still possible to debug the program up to the point it spawns a
new domain.  This is the same solution we used in ocaml#10594 for programs
that create threads.
Patch written by @fabbing, slightly adapted by Damien Doligez.
They are used in a single-threaded manner.
Locking is no longer necessary since ocaml#11356.

Also remove the recently-added caml_channel_reset_last_locked() function
which is no longer needed.
@xavierleroy xavierleroy force-pushed the fix-bytecode-debugger branch from 3b34535 to 10dcb5f Compare July 1, 2022 07:30
@xavierleroy
Copy link
Copy Markdown
Contributor

I took the liberty to rebase on trunk and hand-merge damiendoligez#7. I'm running a CI precheck test.

@damiendoligez
Copy link
Copy Markdown
Member Author

precheck is at https://ci.inria.fr/ocaml/job/precheck/726/

Copy link
Copy Markdown
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test results are as positive as they can be, and the whole PR is in good shape as far as I can see, so let's merge! Thanks @damiendoligez for all the work.

@xavierleroy xavierleroy merged commit 4e8f384 into ocaml:trunk Jul 1, 2022
xavierleroy pushed a commit that referenced this pull request Jul 1, 2022
* debugger: the main program is now code fragment number 3 instead of 0
* debugger: update to handle stack backtrace in the presence of fibers
* Fix embedded file name (and line number) for stdlib.ml.in to get error
messages right and install it so the debugger can step through it.
* Stop cleanly if a program being debugged spawns a domain.  It is still possible to debug the program up to the point it spawns a new domain.  This is the same solution we used in #10594 for programs that create threads.
* fix the trap barrier and make it work with algebraic effects (aka fibers)
* use Caml_inline instead of inline
* debugger: use fiber id instead of address
* re-enable debugger tests
* re-enable dynlink debugging

(cherry picked from commit 4e8f384)
@xavierleroy
Copy link
Copy Markdown
Contributor

Cherry-picked in 5.0: b4a61f5

@damiendoligez damiendoligez deleted the fix-bytecode-debugger branch July 1, 2022 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants