-
Notifications
You must be signed in to change notification settings - Fork 1.8k
A potential dead lock of BSC client which result in block syncing halt #58
Description
System information
Geth version: 1.9.13
OS & Version: Linux
Expected behaviour
Client sync block normal
Actual behaviour
Client failed to sync block.
Steps to reproduce the behaviour
First change the code to make it easier to reproduce:
func (sub *Subscription) Unsubscribe() {
sub.unsubOnce.Do(func() {
uninstallLoop:
for {
select {
case sub.es.uninstall <- sub.f:
// newly added line
time.Sleep(3 * time.Second)
break uninstallLoop
case ...
}
}
<-sub.Err()
})
}
Do request for multi times:
curl -X POST --data '{"jsonrpc":"2.0","method":"eth_newPendingTransactionFilter","params":[],"id":73}'
Backtrace
Here is the pprof of go routine:
goroutine.txt
According to the pprof, the code is blocked on /server/bsc/event/feed.go:170.
1 @ 0x447b10 0x4573db 0x457eda 0x4b5630 0x6e90a4 0xcf0a5f 0xcf3800 0xcf2544 0xf68a60 0xf3aba5 0x475d61
# 0x457ed9 reflect.rselect+0x389 /usr/lib/golang/src/runtime/select.go:542
# 0x4b562f reflect.Select+0x16f /usr/lib/golang/src/reflect/value.go:2229
# 0x6e90a3 github.com/ethereum/go-ethereum/event.(*Feed).Send+0x593 /server/bsc/event/feed.go:170
# 0xcf0a5e github.com/ethereum/go-ethereum/core.(*BlockChain).writeBlockWithState+0xe2e /server/bsc/core/blockchain.go:1470
# 0xcf37ff github.com/ethereum/go-ethereum/core.(*BlockChain).insertChain+0x118f /server/bsc/core/blockchain.go:1757
- The subscription may consume the channel so that block above
select.
As we checked, newNewPendingTransactionFiltercalls are pending on fetchingfiltersMulock.
# 0x4580a6 sync.runtime_SemacquireMutex+0x46 /usr/lib/golang/src/runtime/sema.go:71
# 0x49408b sync.(*Mutex).lockSlow+0xfb /usr/lib/golang/src/sync/mutex.go:138
# 0xf13256 sync.(*Mutex).Lock+0x296 /usr/lib/golang/src/sync/mutex.go:81
# 0xf130e1 github.com/ethereum/go-ethereum/eth/filters.(*PublicFilterAPI).NewPendingTransactionFilter.func1+0x121 /server/bsc/eth/filters/api.go:119
- The routine that holds
filtersMulock istimeoutLoop:
# 0xf142b3 github.com/ethereum/go-ethereum/eth/filters.(*Subscription).Unsubscribe.func1+0xf3 /server/bsc/eth/filters/filter_system.go:164
# 0x4944d2 sync.(*Once).doSlow+0xe2 /usr/lib/golang/src/sync/once.go:66
# 0xf0f704 sync.(*Once).Do+0x64 /usr/lib/golang/src/sync/once.go:57
# 0xf0f6bd github.com/ethereum/go-ethereum/eth/filters.(*Subscription).Unsubscribe+0x1d /server/bsc/eth/filters/filter_system.go:157
# 0xf09d47 github.com/ethereum/go-ethereum/eth/filters.(*PublicFilterAPI).timeoutLoop+0xf7 /server/bsc/eth/filters/api.go:88
The code is:
func (sub *Subscription) Unsubscribe() {
sub.unsubOnce.Do(func() {
uninstallLoop:
for {
// write uninstall request and consume logs/hashes. This prevents
// the eventLoop broadcast method to deadlock when writing to the
// filter event channel while the subscription loop is waiting for
// this method to return (and thus not reading these events).
select {
case sub.es.uninstall <- sub.f:
break uninstallLoop
case <-sub.f.logs:
case <-sub.f.hashes:
case <-sub.f.headers:
}
}
// wait for filter to be uninstalled in work loop before returning
// this ensures that the manager won't use the event channel which
// will probably be closed by the client asap after this method returns.
<-sub.Err()
})
}
Obvious it failed to do sub.es.uninstall <- sub.f.
eventLoopis responsible for consuming thesub.es.uninstallchannel.
The routine is busy doinghandleTxsEvent:
# 0xf1128b github.com/ethereum/go-ethereum/eth/filters.(*EventSystem).handleTxsEvent+0x20b /server/bsc/eth/filters/filter_system.go:350
# 0xf12f90 github.com/ethereum/go-ethereum/eth/filters.(*EventSystem).eventLoop+0xa60 /server/bsc/eth/filters/filter_system.go:461
It will push hashes to f.hashes channel, and now it is blocking by that channel.
func (es *EventSystem) handleTxsEvent(filters filterIndex, ev core.NewTxsEvent) {
hashes := make([]common.Hash, 0, len(ev.Txs))
for _, tx := range ev.Txs {
hashes = append(hashes, tx.Hash())
}
for _, f := range filters[PendingTransactionsSubscription] {
f.hashes <- hashes
}
}
- What happened to
f.hasheschannel?
That is because, in the above step 3, theUnsubscribemay not consume all thehashesand break out so that step4 stucked.
The deadlock dependency is:
Routine A:NewPendingTransactionFilter want to lock filtersMu to consume hashes;
Routine B: eventLoop is waiting Routine A to consume hashes so that it can push new hash to channel.
Routine C: Unsubscribe is holding lock filtersMu, but it is waiting for Routine B to consume uninstall channel.
