Skip to content

Conversation

@DvdGiessen
Copy link
Contributor

@DvdGiessen DvdGiessen commented Mar 13, 2024

When a fatal error occurs it's important to know which precise version it occurred on in order to be able to decode the crash dump information such as the backtrace.

I noticed it today in #14005 (comment), and quickly prototyped a small improvement.

By wrapping around the built-in IDF panic handler we can print some extra information whenever a fatal error occurs. The message link to a new wiki page which contains additional information on how to debug ESP32 issues links to the bug reporting issue template.

Example of current output:

A fatal error occurred. The crash dump printed below may be used to help
determine what caused it. If you are not already running the most recent
version of MicroPython, consider upgrading. New versions often fix bugs.

To learn more about how to debug and/or report this crash visit the wiki
page at: https://github.com/micropython/micropython/wiki/ESP32-debugging

MPY version : v1.23.0-preview.213.gfbf668cf3d.dirty on 2024-03-13
IDF version : v5.2.1
Machine     : GlobalLink 520 Bx with ESP32S3

Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

... (snip)

(Another suggestion is we could perhaps mention the SHA256 hash of each ESP32 image ELF file on the website. The IDF prints the "ELF file SHA256" at the end of each fatal error message, making it easy to look up the corresponding ELF file.)

@andrewleech
Copy link
Contributor

This does look truly helpful, thanks. Do you know a convenient way to trigger a panic at the moment to test it?

@DvdGiessen
Copy link
Contributor Author

DvdGiessen commented Mar 22, 2024

One I know of is to make use of the fact that the PPP implementation can call into MicroPython code in its own task without first setting up a proper thread state or holding the GIL (see my notes in #12729 (comment)). So if we give it a zero-sized BytesIO, then make it write to it, that will trigger an reallocation of the BytesIO buffer which will fail and crash:

from network import PPP
from io import BytesIO
ppp = PPP(BytesIO(0))
ppp.active(True)
ppp.connect()

@andrewleech
Copy link
Contributor

I tested this on my esp32c6 and it worked as intended.

I would suggest though rather than the message saying to raise an issue, it might be better to print a link to a (new) wiki page about these crashes, that could then describe how to raise an issue (if needed) and could even include the details on how to decode the dump.

@dpgeorge
Copy link
Member

Thanks, this looks great!

But I guess it doesn't work well on esp32-s2/s3, which have native USB. Because the USB disappears when there's a crash.

I would suggest though rather than the message saying to raise an issue, it might be better to print a link to a (new) wiki page about these crashes

Yes, I agree. Let's create a wiki page to get a URL and use that in the message.

@DvdGiessen
Copy link
Contributor Author

DvdGiessen commented Mar 27, 2024

But I guess it doesn't work well on esp32-s2/s3, which have native USB. Because the USB disappears when there's a crash.

That might indeed be a limitation depending on how the chip is connected, but that's not necessarily always the case for these chips. For example: I've tested this on a ESP32-S3 based board which uses the native USB serial/JTAG console (CONFIG_ESP_CONSOLE_USB_SERIAL_JTAG=y) and there it works fine (the chip resets but doesn't lose the USB connection while doing so).

it might be better to print a link to a (new) wiki page about these crashes

Agreed, that does indeed seem like a better solution. We can replace that second part of the message with something like this?

To learn more about how to debug and/or report this crash visit the wiki
page at: https://github.com/micropython/micropython/wiki/ESP32-debugging

@DvdGiessen DvdGiessen force-pushed the esp32_panic_message branch from c541f6d to 79b4b3a Compare April 17, 2024 12:46
@DvdGiessen
Copy link
Contributor Author

I've just written a basic wiki page on debugging crashes on the ESP32, and updated the panic message in this MR accordingly.

https://github.com/micropython/micropython/wiki/ESP32-debugging

@DvdGiessen DvdGiessen force-pushed the esp32_panic_message branch from 79b4b3a to f8373de Compare May 15, 2024 08:37
@walklinewang
Copy link

from network import PPP
from io import BytesIO
ppp = PPP(BytesIO(0))
ppp.active(True)
ppp.connect()

I tested this code on my esp32c3 board (idf5.0.4) with this pr code, it's crashed, but there's no extra message showed, I wonder why?

@DvdGiessen
Copy link
Contributor Author

it's crashed, but there's no extra message showed

Did the default ESP-IDF panic handler show up, or did it just reboot? The default panic handler output usually starts with Guru Meditation Error followed by some debug information, and the final line is Rebooting.... (See the documentation for v5.0.4 on the C3 here and the code here.)

If that didn't show, then you're likely losing some output, perhaps as mentioned above due to the USB connection going away before the error output has been received. That is not something this specific PR aims to change.

When a fatal error occurs it's important to know which precise version it
occurred on in order to be able to decode the crash dump information such
as the backtrace.

By wrapping around the built-in IDF panic handler we can print some extra
information whenever a fatal error occurs.  The message links to a new wiki
page which contains additional information on how to debug ESP32 issues,
and links to the bug reporting issue template.

Signed-off-by: Daniël van de Giessen <daniel@dvdgiessen.nl>
@dpgeorge dpgeorge force-pushed the esp32_panic_message branch from f8373de to c10a74b Compare May 16, 2024 02:27
@dpgeorge dpgeorge merged commit c10a74b into micropython:master May 16, 2024
@dpgeorge
Copy link
Member

Thanks for updating. Now merged.

@walklinewang
Copy link

or did it just reboot?

Yes, just like a normally hard reset.

I tested another board that connected to PC by using USB CDC mode, and this time the extra message and Guru Meditation Error showed up.

@andrewleech
Copy link
Contributor

Hi @DvdGiessen did you by any chance see a full fatal error output on any of the new risc-v chips? I ran into one yesterday on my C6 and got your new header show upv perfectly, so it was the perfect opportunity to follow the link to the wiki instructions and follow them.

I even thought then, hey it'd be handy to add a backtrace command to the esp32 Makefile that will handle the correct decode command for you (will try to finish it off still) but then ran into a hiccup.

No backtrace line.

Seems the fatal error output for these is slightly different? Or maybe IDF 5.2.2 is a bit different? The docs for C3 do look a bit different in this section: https://docs.espressif.com/projects/esp-idf/en/v5.2.2/esp32c3/api-guides/fatal-errors.html#register-dump-and-backtrace

Where I expected to see a backtrace in the printout there was a big block of stack trace which presumably has the same information but it was formatted differently so not sure if/how it can be get into the address decoder tool.

I ended up running make monitor to get the repl in IDF monitor and trigger the failure so it could decode it for me. I guess I can probably look at its source to see how it does it...

TLDR I think the wiki page might need updating for risc-v chips but not sure exactly how yet.

@DvdGiessen
Copy link
Contributor Author

DvdGiessen commented Jul 29, 2024

@andrewleech Indeed, I wrote the wiki page mostly based on the output of the Xtensa-based chips. For the RISC-V baseed chips there isn't a straight up backtrace printed, and I think instead idf.py monitor just decodes the MEPC address.

To be more precise, the monitor doesn't seem to know what it is decoding. I think it looks for things that look like addresses in the output and then prints a line for those, without actually knowing something is a backtrace, register dump, etc. The package implementing it is esp_idf_panic_decoder, and it's used through the handle_possible_pc_address_in_line method.

So for RISC-V based chips, the wiki should probably indeed explain to look at the MEPC (Machine Exception Program Counter) register. That value can be passed to addr2line to find the source code line where the crash happened. To get a full backtrace, attaching a debugger as described at the bottom of the page is probably the easiest way?

EDIT: For reference, the output on a random C3 board I had around:

>>> from network import PPP
>>> from io import BytesIO
>>> ppp = PPP(BytesIO(0))
>>> ppp.active(True)
True
>>> ppp.connect()

A fatal error occurred. The crash dump printed below may be used to help
determine what caused it. If you are not already running the most recent
version of MicroPython, consider upgrading. New versions often fix bugs.

To learn more about how to debug and/or report this crash visit the wiki
page at: https://github.com/micropython/micropython/wiki/ESP32-debugging

MPY version : v1.23.0-preview.360.g3069af0ff on 2024-05-01
IDF version : v5.2.1
Machine     : <some custom board> with ESP32C3

Guru Meditation Error: Core  0 panic'ed (Load access fault). Exception was unhandled.

Core  0 register dump:
MEPC    : 0x4200755c  RA      : 0x4200755c  SP      : 0x3fcdaea0  GP      : 0x3fc95a00
TP      : 0x3fc9ffe8  T0      : 0x00000000  T1      : 0x40390508  T2      : 0x05dc01cc
S0/FP   : 0x3fcccb50  S1      : 0x00000040  A0      : 0x00000000  A1      : 0x00000054
A2      : 0x00000001  A3      : 0x3fcdaf5c  A4      : 0x3c14928c  A5      : 0x3fc9e000
A6      : 0x00000000  A7      : 0x00000000  S2      : 0x00000001  S3      : 0x3fcdcec4
S4      : 0x00000000  S5      : 0x3fcccae0  S6      : 0x4203b364  S7      : 0xffffffff
S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000
T3      : 0x00000000  T4      : 0x00000000  T5      : 0xbca09d23  T6      : 0x00000000
MSTATUS : 0x00001881  MTVEC   : 0x40380001  MCAUSE  : 0x00000005  MTVAL   : 0x00000008
MHARTID : 0x00000000

Stack memory:
3fcdaea0: 0x00000000 0x00000004 0x3fcdaf30 0x3fcdcec4 0x0000002d 0x3fcccaf0 0x00000040 0x42007908
3fcdaec0: 0x00000004 0x00000003 0x00000040 0x42014060 0xc0380000 0x0000002c 0x3fcccaf0 0x42014224
3fcdaee0: 0x3fc9f908 0x0000002d 0x3fcccae0 0x4203b3ee 0x3fcdb308 0x3fcdaf9c 0x00000004 0x000005fc
3fcdaf00: 0xc0380000 0x00000000 0x00000000 0x3fcdcec4 0x00000000 0x3fcdaf5c 0x0000002d 0x420138d0
3fcdaf20: 0x00000004 0x00000000 0x000005ec 0x00000000 0x00000004 0x00000003 0x3fcdb308 0x3fcdaf9e
3fcdaf40: 0x3fcdb2a4 0x3fcdceb4 0x00000000 0x4202848e 0x00000000 0x00000000 0x3fcdb308 0x00000000
3fcdaf60: 0x3fcdb2a4 0x3fcdceb4 0x00000000 0x4208c53a 0x00000004 0x00000003 0x3fcdc8dc 0x0000c8dc
3fcdaf80: 0x3fcdb2a4 0x3fcdceb4 0x3fcdc8b4 0x4208c6dc 0xda5e2262 0x9d835fa9 0x00000000 0x19d6e962
3fcdafa0: 0x14cce763 0x00000003 0x3fca6e58 0x00000002 0x3fcdb308 0x3fcdc8b4 0x3fcdb398 0x4209579c
3fcdafc0: 0x00000000 0x00000000 0x00000000 0x00000010 0x3fc9e000 0xffffffff 0x3fcdb398 0x4209581a
3fcdafe0: 0x3fc9e000 0xffffffff 0x00000000 0x4208bbf0 0x00000004 0x00000003 0x3fca6e58 0x3fcdb308
3fcdb000: 0x3fc9e000 0xffffffff 0x3fca6e30 0x4207c88a 0x00000000 0x00000000 0x00000000 0x3fca6e30
3fcdb020: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcdb040: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5
3fcdb060: 0xa5a5a5a5 0x00000158 0x3fcdaf10 0x017db5aa 0x3fc9855c 0x3fc9855c 0x3fcdb068 0x3fc98554
3fcdb080: 0x00000007 0x3fcda1b8 0x3fcda1b8 0x3fcdb068 0x00000000 0x00000012 0x3fcda264 0x00546974
3fcdb0a0: 0x00000000 0x00000000 0x00000000 0x3fcdb060 0x00000012 0x00000000 0x00000000 0x00000000
3fcdb0c0: 0x00000000 0x00000000 0x00000000 0x3fc9fc38 0x3fc9fca0 0x3fc9fd08 0x00000000 0x00000000
3fcdb0e0: 0x00000001 0x00000000 0x00000000 0x00000000 0x42002e76 0x00000000 0x00000000 0x00000000
3fcdb100: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcdb120: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcdb140: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcdb160: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcdb180: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcdb1a0: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcdb1c0: 0x00000054 0x3fcdb1c4 0x3fcdb1c4 0x3fcdb1c4 0x3fcdb1c4 0x00000000 0x3fcdb1dc 0xffffffff
3fcdb1e0: 0x3fcdb1dc 0x3fcdb1dc 0x00000000 0x3fcdb1f0 0xffffffff 0x3fcdb1f0 0x3fcdb1f0 0x00000001
3fcdb200: 0x00000001 0x00000000 0x3600ffff 0x00000000 0x99534d22 0x3fcdb1bc 0x0000000c 0x3fcdb22c
3fcdb220: 0x3fcc0014 0x3fcdb214 0x00000054 0x3fcdb22c 0x3fcdb22c 0x3fcdb22c 0x3fcdb22c 0x00000000
3fcdb240: 0x3fcdb244 0xffffffff 0x3fcdb244 0x3fcdb244 0x00000000 0x3fcdb258 0xffffffff 0x3fcdb258
3fcdb260: 0x3fcdb258 0x00000000 0x00000001 0x00000000 0x3f00ffff 0x00000000 0x00000000 0x3fcdb224
3fcdb280: 0x0000000c 0x3fcdb294 0x3fcc0014 0x3fcdb27c 0x0000000c 0x00000003 0x3fcdb21c 0x00000000

ELF file SHA256: ee7b3e0bb

Rebooting...

So we have memory and registers, but not a backtrace. I think for now just documenting that you can either look at the MEPC or attach a debugger to get a more complete backtrace is best, and then we can expand upon that later if we figure out additional things that might help.

@projectgus
Copy link
Contributor

projectgus commented Jul 29, 2024

Thanks for all your work on this, @DvdGiessen! Even though I wrote the first version of idf_monitor many years ago, I even learned something new reading your wiki page.

For the RISC-V baseed chips there isn't a straight up backtrace printed, and I think instead idf.py monitor just decodes the MEPC address.

Actually idf.py monitor can do a bit more than this. Here's the backtrace from ESP32-C3 if I insert an abort() into the IDF hello_world:

Hello world!

abort() was called at PC 0x42007655 on core 0
0x42007655: app_main at /home/gus/ry/george/esp-idf-v5/examples/get-started/hello_world/main/hello_world_main.c:40

Stack dump detected
Core  0 register dump:
MEPC    : 0x403806d0  RA      : 0x40383a70  SP      : 0x3fc8fc50  GP      : 0x3fc8b600  
0x403806d0: panic_abort at /home/gus/ry/george/esp-idf-v5/components/esp_system/panic.c:466
0x40383a70: __ubsan_include at /home/gus/ry/george/esp-idf-v5/components/esp_system/ubsan.c:313

TP      : 0x3fc87230  T0      : 0x37363534  T1      : 0x7271706f  T2      : 0x33323130  
S0/FP   : 0x00000004  S1      : 0x3fc8fcb4  A0      : 0x3fc8fc7c  A1      : 0x3fc8fcb2  
A2      : 0x00000000  A3      : 0x3fc8fca9  A4      : 0x00000001  A5      : 0x3fc8d000  
A6      : 0x7a797877  A7      : 0x76757473  S2      : 0x00000000  S3      : 0x00000000  
S4      : 0x00000000  S5      : 0x00000000  S6      : 0x00000000  S7      : 0x00000000  
S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000  
T3      : 0x6e6d6c6b  T4      : 0x6a696867  T5      : 0x66656463  T6      : 0x62613938  
MSTATUS : 0x00001881  MTVEC   : 0x40380001  MCAUSE  : 0x00000007  MTVAL   : 0x00000000  
0x40380001: _vector_table at ??:?

MHARTID : 0x00000000  


Backtrace:


panic_abort (details=details@entry=0x3fc8fc7c "abort() was called at PC 0x42007655 on core 0") at /home/gus/ry/george/esp-idf-v5/components/esp_system/panic.c:466
466         *((volatile int *) 0) = 0; // NOLINT(clang-analyzer-core.NullDereference) should be an invalid operation on targets
#0  panic_abort (details=details@entry=0x3fc8fc7c "abort() was called at PC 0x42007655 on core 0") at /home/gus/ry/george/esp-idf-v5/components/esp_system/panic.c:466
#1  0x40383a70 in esp_system_abort (details=details@entry=0x3fc8fc7c "abort() was called at PC 0x42007655 on core 0") at /home/gus/ry/george/esp-idf-v5/components/esp_system/port/esp_system_chip.c:93
#2  0x4038961a in abort () at /home/gus/ry/george/esp-idf-v5/components/newlib/abort.c:38
#3  0x42007658 in app_main () at /home/gus/ry/george/esp-idf-v5/examples/get-started/hello_world/main/hello_world_main.c:40
#4  0x42014b8a in main_task (args=<error reading variable: value has been optimized out>) at /home/gus/ry/george/esp-idf-v5/components/freertos/app_startup.c:208
#5  0x40384686 in vPortTaskWrapper (pxCode=<optimized out>, pvParameters=<optimized out>) at /home/gus/ry/george/esp-idf-v5/components/freertos/FreeRTOS-Kernel/portable/riscv/port.c:255
ELF file SHA256: 22ad941aa

The monitor actually recognises the block of stack memory, reprocesses it, and dumps it back out as a gdb backtrace.

I think the relevant code is here:
https://github.com/espressif/esp-idf-monitor/blob/master/esp_idf_monitor/base/serial_handler.py#L218

I assume this is the same on C6, but I don't actually have a C6 at hand yet to test (on its way).

I'm not sure what the best MicroPython-friendly solution here is. The easiest one is to run idf.py -B build-generic_C6 monitor and make it crash in the monitor, but that's obviously a bit limiting.

I think the best would be to build a little command line decoder tool that someone can paste a stack trace into, that calls the idf_panic_handler functions internally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants