Indeterministic hard fault in _mutex_lock(), with nRF52 SoftDevice

#### Description

With https://github.com/RIOT-OS/RIOT/pull/9473, with the updated support for nRF52 SoftDevice, the way interrupts are re-enabled in `_mutex_unlock` seems to cause indeterministic hard faults.

#### Steps to reproduce the issue

The problem is hard to reproduce.  We saw it a few weeks ago when working on our own nRF52 board with relatively heavy I/O and BlueTooth load.  Applying the change proposed below increased the stability a lot, but not completely.  (There seems to be a thread deadlock issue triggered in `_xtimer_tsleep`, also related to `_mutex_lock`, but that is not clear enough to me even to describe.)

I will try to reproduce this problem with nRF52 DK so that others can debug this too, but I think it is better to document this now and start some discussion already before that.  Hence this issue.

#### Expected results

RIOT should not cause hard faults.

#### Actual results

Indeterministic crashes, with hard fault triggered on line 62 in `mutex.c`:
https://github.com/RIOT-OS/RIOT/blob/master/core/mutex.c#L62

#### Versions

arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]

#### Suggested fix

Change the order in which interrupts are enabled again and higher priority threads are allowed to run:

```
   --- a/core/mutex.c
   +++ b/core/mutex.c
   @@ -58,8 +58,8 @@ int _mutex_lock(mutex_t *mutex, int blocking)
            else {
                thread_add_to_list(&mutex->queue, me);
            }
   -        irq_restore(irqstate);
            thread_yield_higher();
   +        irq_restore(irqstate);
            /* We were woken up by scheduler. Waker removed us from queue.
             * We have the mutex now. */
            return 1;
```

#### Discussion

For Cortex-M series, `thread_yield_higher()` simply triggers the `PendSV` interrupt:
https://github.com/RIOT-OS/RIOT/blob/master/cpu/cortexm_common/thread_arch.c#L291

In RIOT, if a thread called `_mutex_lock`, the PendSV interrupt will take place immediately, since by default all interrupts are on the same priority level, see https://github.com/RIOT-OS/RIOT/pull/3511

Unless I understand something completely wrong, with the current version, at the window between when the IRQs are enabled and when `isr_pendsv` is triggered, there is a time window where other interrupts may be served.  This may cause an interrupt routing to change the state of this mutex.  At that point, the scheduler has not switched away from the calling thread's context, but is status of the thread is already changed to `STATUS_MUTEX_BLOCKED`.  

Now, if someone triggers SVC, e.g. calls `cpu_switch_context_exit` triggering `isr_svc`, the scheduler will be entered while the blocking thread's context hasn't been saved yet.  I have no idea if that ever happens.  But note that Nordic uses the SVC interrupt heavily, but they should eat them all at their trampoline, I think.

If the order of the two lines is changed, as suggested above, it has no functional effects under Cortex-M.  However, the timing of when interrupts are run and when the scheduler is run will change.  As a side effect, this will fix the hard faults on nRF52 when SoftDevice is used, for reasons that I don't understand.

I am not familiar with the other supported CPU architectures, and therefore I am opening this as an issue, as changing the order of the lines may have adverse effects on other CPU architectures.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Indeterministic hard fault in _mutex_lock(), with nRF52 SoftDevice #10122

Description

Steps to reproduce the issue

Expected results

Actual results

Versions

Suggested fix

Discussion

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Indeterministic hard fault in _mutex_lock(), with nRF52 SoftDevice #10122

Description

Description

Steps to reproduce the issue

Expected results

Actual results

Versions

Suggested fix

Discussion

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions