Skip to content

A potential dead lock of BSC client which result in block syncing halt #58

@unclezoro

Description

@unclezoro

System information

Geth version: 1.9.13
OS & Version: Linux

Expected behaviour

Client sync block normal

Actual behaviour

Client failed to sync block.

Steps to reproduce the behaviour

First change the code to make it easier to reproduce:

func (sub *Subscription) Unsubscribe() {
	sub.unsubOnce.Do(func() {
	uninstallLoop:
		for {
			select {
			case sub.es.uninstall <- sub.f:
                                // newly added line
				time.Sleep(3 * time.Second)
				break uninstallLoop
                        case ...
			}
		}
		<-sub.Err()
	})
}

Do request for multi times:

curl -X POST --data '{"jsonrpc":"2.0","method":"eth_newPendingTransactionFilter","params":[],"id":73}'

Backtrace

Here is the pprof of go routine:
goroutine.txt

  1. The lock here is not released:
    image

According to the pprof, the code is blocked on /server/bsc/event/feed.go:170.

1 @ 0x447b10 0x4573db 0x457eda 0x4b5630 0x6e90a4 0xcf0a5f 0xcf3800 0xcf2544 0xf68a60 0xf3aba5 0x475d61
#       0x457ed9        reflect.rselect+0x389                                                           /usr/lib/golang/src/runtime/select.go:542
#       0x4b562f        reflect.Select+0x16f                                                            /usr/lib/golang/src/reflect/value.go:2229
#       0x6e90a3        github.com/ethereum/go-ethereum/event.(*Feed).Send+0x593                        /server/bsc/event/feed.go:170
#       0xcf0a5e        github.com/ethereum/go-ethereum/core.(*BlockChain).writeBlockWithState+0xe2e    /server/bsc/core/blockchain.go:1470
#       0xcf37ff        github.com/ethereum/go-ethereum/core.(*BlockChain).insertChain+0x118f           /server/bsc/core/blockchain.go:1757
  1. The subscription may consume the channel so that block above select.
    As we checked, new NewPendingTransactionFilter calls are pending on fetching filtersMu lock.
#	0x4580a6	sync.runtime_SemacquireMutex+0x46									/usr/lib/golang/src/runtime/sema.go:71
#	0x49408b	sync.(*Mutex).lockSlow+0xfb										/usr/lib/golang/src/sync/mutex.go:138
#	0xf13256	sync.(*Mutex).Lock+0x296										/usr/lib/golang/src/sync/mutex.go:81
#	0xf130e1	github.com/ethereum/go-ethereum/eth/filters.(*PublicFilterAPI).NewPendingTransactionFilter.func1+0x121	/server/bsc/eth/filters/api.go:119
  1. The routine that holds filtersMu lock is timeoutLoop:
#	0xf142b3	github.com/ethereum/go-ethereum/eth/filters.(*Subscription).Unsubscribe.func1+0xf3	/server/bsc/eth/filters/filter_system.go:164
#	0x4944d2	sync.(*Once).doSlow+0xe2								/usr/lib/golang/src/sync/once.go:66
#	0xf0f704	sync.(*Once).Do+0x64									/usr/lib/golang/src/sync/once.go:57
#	0xf0f6bd	github.com/ethereum/go-ethereum/eth/filters.(*Subscription).Unsubscribe+0x1d		/server/bsc/eth/filters/filter_system.go:157
#	0xf09d47	github.com/ethereum/go-ethereum/eth/filters.(*PublicFilterAPI).timeoutLoop+0xf7		/server/bsc/eth/filters/api.go:88

The code is:

func (sub *Subscription) Unsubscribe() {
	sub.unsubOnce.Do(func() {
	uninstallLoop:
		for {
			// write uninstall request and consume logs/hashes. This prevents
			// the eventLoop broadcast method to deadlock when writing to the
			// filter event channel while the subscription loop is waiting for
			// this method to return (and thus not reading these events).
			select {
			case sub.es.uninstall <- sub.f:
				break uninstallLoop
			case <-sub.f.logs:
			case <-sub.f.hashes:
			case <-sub.f.headers:
			}
		}

		// wait for filter to be uninstalled in work loop before returning
		// this ensures that the manager won't use the event channel which
		// will probably be closed by the client asap after this method returns.
		<-sub.Err()
	})
}

Obvious it failed to do sub.es.uninstall <- sub.f.

  1. eventLoop is responsible for consuming the sub.es.uninstall channel.
    The routine is busy doing handleTxsEvent:
#	0xf1128b	github.com/ethereum/go-ethereum/eth/filters.(*EventSystem).handleTxsEvent+0x20b	/server/bsc/eth/filters/filter_system.go:350
#	0xf12f90	github.com/ethereum/go-ethereum/eth/filters.(*EventSystem).eventLoop+0xa60	/server/bsc/eth/filters/filter_system.go:461

It will push hashes to f.hashes channel, and now it is blocking by that channel.

func (es *EventSystem) handleTxsEvent(filters filterIndex, ev core.NewTxsEvent) {
	hashes := make([]common.Hash, 0, len(ev.Txs))
	for _, tx := range ev.Txs {
		hashes = append(hashes, tx.Hash())
	}
	for _, f := range filters[PendingTransactionsSubscription] {
		f.hashes <- hashes
	}
}
  1. What happened to f.hashes channel?
    That is because, in the above step 3, the Unsubscribe may not consume all the hashes and break out so that step4 stucked.

The deadlock dependency is:

Routine A:NewPendingTransactionFilter want to lock filtersMu to consume hashes;
Routine B: eventLoop is waiting Routine A to consume hashes so that it can push new hash to channel.
Routine C: Unsubscribe is holding lock filtersMu, but it is waiting for Routine B to consume uninstall channel.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions