Skip to content

Indeterministic hard fault in _mutex_lock(), with nRF52 SoftDevice #10122

@pekkanikander

Description

@pekkanikander

Description

With #9473, with the updated support for nRF52 SoftDevice, the way interrupts are re-enabled in _mutex_unlock seems to cause indeterministic hard faults.

Steps to reproduce the issue

The problem is hard to reproduce. We saw it a few weeks ago when working on our own nRF52 board with relatively heavy I/O and BlueTooth load. Applying the change proposed below increased the stability a lot, but not completely. (There seems to be a thread deadlock issue triggered in _xtimer_tsleep, also related to _mutex_lock, but that is not clear enough to me even to describe.)

I will try to reproduce this problem with nRF52 DK so that others can debug this too, but I think it is better to document this now and start some discussion already before that. Hence this issue.

Expected results

RIOT should not cause hard faults.

Actual results

Indeterministic crashes, with hard fault triggered on line 62 in mutex.c:
https://github.com/RIOT-OS/RIOT/blob/master/core/mutex.c#L62

Versions

arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]

Suggested fix

Change the order in which interrupts are enabled again and higher priority threads are allowed to run:

   --- a/core/mutex.c
   +++ b/core/mutex.c
   @@ -58,8 +58,8 @@ int _mutex_lock(mutex_t *mutex, int blocking)
            else {
                thread_add_to_list(&mutex->queue, me);
            }
   -        irq_restore(irqstate);
            thread_yield_higher();
   +        irq_restore(irqstate);
            /* We were woken up by scheduler. Waker removed us from queue.
             * We have the mutex now. */
            return 1;

Discussion

For Cortex-M series, thread_yield_higher() simply triggers the PendSV interrupt:
https://github.com/RIOT-OS/RIOT/blob/master/cpu/cortexm_common/thread_arch.c#L291

In RIOT, if a thread called _mutex_lock, the PendSV interrupt will take place immediately, since by default all interrupts are on the same priority level, see #3511

Unless I understand something completely wrong, with the current version, at the window between when the IRQs are enabled and when isr_pendsv is triggered, there is a time window where other interrupts may be served. This may cause an interrupt routing to change the state of this mutex. At that point, the scheduler has not switched away from the calling thread's context, but is status of the thread is already changed to STATUS_MUTEX_BLOCKED.

Now, if someone triggers SVC, e.g. calls cpu_switch_context_exit triggering isr_svc, the scheduler will be entered while the blocking thread's context hasn't been saved yet. I have no idea if that ever happens. But note that Nordic uses the SVC interrupt heavily, but they should eat them all at their trampoline, I think.

If the order of the two lines is changed, as suggested above, it has no functional effects under Cortex-M. However, the timing of when interrupts are run and when the scheduler is run will change. As a side effect, this will fix the hard faults on nRF52 when SoftDevice is used, for reasons that I don't understand.

I am not familiar with the other supported CPU architectures, and therefore I am opening this as an issue, as changing the order of the lines may have adverse effects on other CPU architectures.

Metadata

Metadata

Assignees

Labels

Area: BLEArea: Bluetooth Low Energy supportArea: coreArea: RIOT kernel. Handle PRs marked with this with care!Area: pkgArea: External package portsPlatform: ARMPlatform: This PR/issue effects ARM-based platformsType: bugThe issue reports a bug / The PR fixes a bug (including spelling errors)

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions