-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Description
While running our usual batch of E2E tests on the just released v0.34-rc4, it seems that the newly introduced tx batching (commit ffe2742 part of #5321) seems to introduce a possible nil pointer dereference which causes the node to crash.
It seems to occurr in the txs function on line 275:
Lines 274 to 295 in 7eb4e5c
| for { | |
| memTx := next.Value.(*mempoolTx) | |
| if _, ok := memTx.senders.Load(peerID); !ok { | |
| // If current batch + this tx size is greater than max => return. | |
| batchMsg := protomem.Message{ | |
| Sum: &protomem.Message_Txs{ | |
| Txs: &protomem.Txs{Txs: append(batch, memTx.tx)}, | |
| }, | |
| } | |
| if batchMsg.Size() > memR.config.MaxBatchBytes { | |
| return batch | |
| } | |
| batch = append(batch, memTx.tx) | |
| } | |
| if next.Next() == nil { | |
| return batch | |
| } | |
| next = next.Next() | |
| } |
Did not investigate this much yet but on first glance it could be a concurrency issue -- at the end of the loop you have this check:
Lines 291 to 294 in 7eb4e5c
| if next.Next() == nil { | |
| return batch | |
| } | |
| next = next.Next() |
But each call to Next() seems to take a read lock:
tendermint/libs/clist/clist.go
Lines 115 to 121 in 7eb4e5c
| // Nonblocking, may return nil if at the end. | |
| func (e *CElement) Next() *CElement { | |
| e.mtx.RLock() | |
| val := e.next | |
| e.mtx.RUnlock() | |
| return val | |
| } |
While you do check that Next() is non-nil, you invoke Next() again when assigning to next. And for this second call (I assume that the internal mutex is there because e.next can change concurrently) you don't check whether that is nil. I would suggest that you only call Next() once, check the value for nil and then set next to that value without calling Next() the second time.
Of course the root cause could also be something else as I didn't really investigate this further.