Skip to content

p2p: Fix error logging for connection stop#3824

Merged
melekes merged 11 commits intomasterfrom
fix-p2p-connection-stop
Jul 25, 2019
Merged

p2p: Fix error logging for connection stop#3824
melekes merged 11 commits intomasterfrom
fix-p2p-connection-stop

Conversation

@melekes
Copy link
Contributor

@melekes melekes commented Jul 22, 2019

Original author and PR: @defunctzombie #3644


This changeset fixes two types of false-positive errors occurring during
connection shutdown.

The first occurs when the process invokes FlushStop() or Stop() on a
connection. While the previous behavior did properly wait for the sendRoutine
to finish, it did not notify the recvRoutine that the connection was shutting
down. This would cause the recvRouting to receive and error when reading and
log this error. The changeset fixes this by notifying the recvRoutine that
the connection is shutting down.

The second occurs when the connection is terminated (gracefully) by the other side.
The recvRoutine would get an EOF error during the read, log it, and stop the connection
with an error. The changeset detects EOF and gracefully shuts down the connection.

I am not intimately familiar with idiomatic go practices so if there is a better approach for performing the signaling I did please let me know and I will update the PR. I used a channel buffer size of 1 to ensure the signal is received and non-blocking messaging to avoid breaking existing logic.

Roman Shtylman and others added 4 commits June 3, 2019 16:53
This changeset fixes two types of false-positive errors occurring during
connection shutdown.

The first occurs when the process invokes FlushStop() or Stop() on a
connection. While the previous behavior did properly wait for the sendRoutine
to finish, it did not notify the recvRoutine that the connection was shutting
down. This would cause the recvRouting to receive and error when reading and
log this error. The changeset fixes this by notifying the recvRoutine that
the connection is shutting down.

The second occurs when the connection is terminated (gracefully) by the other side.
The recvRoutine would get an EOF error during the read, log it, and stop the connection
with an error. The changeset detects EOF and gracefully shuts down the connection.
@melekes
Copy link
Contributor Author

melekes commented Jul 22, 2019

The p2p tests have been running for 20 min now. The node(s) must be blocked on select or smth.

/cc @defunctzombie

@defunctzombie
Copy link
Contributor

@melekes did the tests run successfully on your local machine?

@melekes
Copy link
Contributor Author

melekes commented Jul 23, 2019

I haven't, but I'm pretty sure if I do it, I will get the same result.

Restarted the tests => test_p2p is running forever

@melekes
Copy link
Contributor Author

melekes commented Jul 23, 2019

starting test client container with CMD=test/p2p/fast_sync/check_peer.sh 1
Other peer is on height 14 with state "0800000000000000"

Waiting for peer 1 to catch up

... 0

... 0

... 0

@melekes melekes force-pushed the fix-p2p-connection-stop branch from 96fd88a to 6ff48b6 Compare July 23, 2019 14:40
@melekes
Copy link
Contributor Author

melekes commented Jul 23, 2019

@defunctzombie I think we have to call stopForError because otherwise peer will stay in the peer set. see 6ff48b6 for details

Note this mostly reverts the second fix cause the user is going to see "Stopping peer for error" error when err == io.EOF

@codecov-io
Copy link

codecov-io commented Jul 23, 2019

Codecov Report

Merging #3824 into master will decrease coverage by 0.11%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##           master    #3824      +/-   ##
==========================================
- Coverage   66.07%   65.95%   -0.12%     
==========================================
  Files         234      234              
  Lines       20242    20255      +13     
==========================================
- Hits        13374    13360      -14     
- Misses       5802     5823      +21     
- Partials     1066     1072       +6
Impacted Files Coverage Δ
p2p/conn/connection.go 80.71% <92.85%> (+0.13%) ⬆️
privval/signer_validator_endpoint.go 75.55% <0%> (-10%) ⬇️
libs/events/events.go 93.2% <0%> (-4.86%) ⬇️
privval/signer_remote.go 78% <0%> (-4%) ⬇️
privval/signer_service_endpoint.go 83.63% <0%> (-1.82%) ⬇️
p2p/pex/pex_reactor.go 83.13% <0%> (-1.17%) ⬇️
consensus/reactor.go 70.92% <0%> (-0.71%) ⬇️
consensus/state.go 79.92% <0%> (+0.11%) ⬆️
blockchain/v0/pool.go 80.26% <0%> (+0.32%) ⬆️
consensus/ticker.go 95.83% <0%> (+4.16%) ⬆️

@brapse
Copy link
Contributor

brapse commented Jul 25, 2019

In the following snippet, isn't it the goal/conclusion to break out of the loop?

// stopServices was invoked and we are shutting down
// receiving is excpected to fail since we will close the connection
select {
case <-c.quitRecvRoutine:
break FOR_LOOP
default:
}
if c.IsRunning() {
if err == io.EOF {
c.Logger.Info("Connection is closed @ recvRoutine (likely by the other side)", "conn", c)
} else {
c.Logger.Error("Connection failed @ recvRoutine (reading byte)", "conn", c, "err", err)
}
c.stopForError(err)
}
break FOR_LOOP

If so then wouldn't it be possible to remove the channel checking select as the result is to break out of the for loop on line 582?

One problem which might cause live locks seems to be the call to stopServices might cleanup up the tickers or it might not depending on the contents of the two channels. It would be preferable here to remove any variance to the cleanup process to remove any cases in which stopService is called but is unable to cleanup the channels, tickers and any other resources which would create a zombie process 💀

Would it be possible here to use the lifecycle management facilities provided BaseService?

// Stop implements Service by calling OnStop (if defined) and closing quit
// channel. An error will be returned if the service is already stopped.
func (bs *BaseService) Stop() error {
if atomic.CompareAndSwapUint32(&bs.stopped, 0, 1) {
if atomic.LoadUint32(&bs.started) == 0 {
bs.Logger.Error(fmt.Sprintf("Not stopping %v -- have not been started yet", bs.name), "impl", bs.impl)
// revert flag
atomic.StoreUint32(&bs.stopped, 0)
return ErrNotStarted
}
bs.Logger.Info(fmt.Sprintf("Stopping %v", bs.name), "impl", bs.impl)
bs.impl.OnStop()
close(bs.quit)
return nil
}
bs.Logger.Debug(fmt.Sprintf("Stopping %v (ignoring: already stopped)", bs.name), "impl", bs.impl)
return ErrAlreadyStopped
}
// OnStop implements Service by doing nothing.
// NOTE: Do not put anything in here,
// that way users don't need to call BaseService.OnStop()
func (bs *BaseService) OnStop() {}

This way we could potentially unify lifecycle management and the call to IsRunning on line 574 would have the semantics we want and eliminate any need to additional channels. 🤔

@melekes
Copy link
Contributor Author

melekes commented Jul 25, 2019

If so then wouldn't it be possible to remove the channel checking select as the result is to break out of the for loop on line 582?

we can, but then we'll log the error - this PR fixes that

Would it be possible here to use the lifecycle management facilities provided BaseService?

Ideally, yes. But since FlushStop exists and it does not call Stop(), we are forced to use

  1. mutex to coordinate between Stop and FlushStop
  2. channels (not the Quit() standard channel) to signal exiting

@melekes melekes self-assigned this Jul 25, 2019
@brapse
Copy link
Contributor

brapse commented Jul 25, 2019

Ah thanks that makes sense. I would agree that the lifecycle management cocnern is out scope so i made a follow up issue #3833. With the non-determinism fixed I think we are good to go 👍

Copy link
Contributor

@brapse brapse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤸‍♂

@melekes melekes merged commit af77077 into master Jul 25, 2019
@melekes melekes deleted the fix-p2p-connection-stop branch July 25, 2019 11:06
jackzampolin pushed a commit that referenced this pull request Aug 1, 2019
* p2p: fix false-positive error logging when stopping connections

This changeset fixes two types of false-positive errors occurring during
connection shutdown.

The first occurs when the process invokes FlushStop() or Stop() on a
connection. While the previous behavior did properly wait for the sendRoutine
to finish, it did not notify the recvRoutine that the connection was shutting
down. This would cause the recvRouting to receive and error when reading and
log this error. The changeset fixes this by notifying the recvRoutine that
the connection is shutting down.

The second occurs when the connection is terminated (gracefully) by the other side.
The recvRoutine would get an EOF error during the read, log it, and stop the connection
with an error. The changeset detects EOF and gracefully shuts down the connection.

* bring back the comment about flushing

* add changelog entry

* listen for quitRecvRoutine too

* we have to call stopForError

Otherwise peer won't be removed from the peer set and maybe readded
later.
tessr pushed a commit that referenced this pull request Nov 19, 2019
This commit contains commit messages from the 52 commits from Tendermint 0.32.0 to 0.32.7. This is a result of creating releases from our security advisories, rather than merging these advisories back into the main repo before creating releases. In the future, we will adopt a git workflow that will reduce these commits to only the commits that make up RC2 for (for example) Tendermint 0.32.8.

* docs: fix consensus spec formatting (#3804)

* abci/server: recover from app panics in socket server (#3809)

fixes #3800

* abci/client: fix DATA RACE in gRPC client (#3798)

* Remove go func {}()

closes #357

- Remove go func(){}() that caused race condiditon

- To reproduce
	- add -race in make file to `install_abci`
	- Remove `CGO_ENABLED=0` & add -race to `install`

Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>

* remove -race

* fix data race

also, reorder callbacks similarly to socket client

* docs: "Writing a built-in Tendermint Core application in Go" guide (#3608)

* docs: go built-in guide

* fix package imports, add badger db, simplify Query

* newTendermint function

* working example

* finish the first guide

* add one more note

* add the second Golang guide - external ABCI app

* fix typos

* libs: Remove db from tendermint in favor of tendermint/tm-cmn (#3811)

* Remove db from tendemrint in favor of tendermint/tm-cmn

- remove db from `libs`
- update dependancy, there have been no breaking changes in the updated deps
	- https://github.com/grpc/grpc-go/releases
	- https://github.com/golang/protobuf/releases

Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>

* changelog add

* gofmt

* more gofmt

*  docs: add A TOC to the Readme.md of ADR Section (#3820)

* ADR TOC in readme.md

* Added A TOC to the Readme.md of ADR Section

- Added table of contents to the Readme of the architecture section.
	- Easier to traverse and when you know what is there.
	- If the Adr's become viewable online it would help guide the user

Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>

* add tm-cmn to subprojects

* normalize word

* rpc: make max_body_bytes and max_header_bytes configurable (#3818)

* rpc: make max_body_bytes and max_header_bytes configurable

* update changelog pending

* p2p/conn: Add Bufferpool (#3664)

* use byte buffer pool to decreass allocs

* wrap to put buffer in defer

* wapper defer

* add dependency

* remove Gopkg,*

* add change log

* rpc: /broadcast_evidence (#3481)

* implement broadcast_duplicate_vote endpoint

* fix test_cover

* address comments

* address comments

* Update abci/example/kvstore/persistent_kvstore.go

Co-Authored-By: mossid <torecursedivine@gmail.com>

* Update rpc/client/main_test.go

Co-Authored-By: mossid <torecursedivine@gmail.com>

* address comments in progress

* reformat the code

* make linter happy

* make tests pass

* replace BroadcastDuplicateVote with BroadcastEvidence

* fix test

* fix endpoint name

* improve doc

* fix TestBroadcastEvidenceDuplicateVote

* Update rpc/core/evidence.go

Co-Authored-By: Thane Thomson <connect@thanethomson.com>

* add changelog entry

* fix TestBroadcastEvidenceDuplicateVote

* mempool: make max_msg_bytes configurable (#3826)

* mempool: make max_msg_bytes configurable

* apply suggestions from code review

* update changelog pending

* apply suggestions from code review again

* rpc: return err if page is incorrect (less than 0 or greater than tot… (#3825)

* rpc: return err if page is incorrect (less than 0 or greater than total pages)

Fixes #3813

* fix rpc_test

* blockchain: Reorg reactor (#3561)

* go routines in blockchain reactor

* Added reference to the go routine diagram

* Initial commit

* cleanup

* Undo testing_logger change, committed by mistake

* Fix the test loggers

* pulled some fsm code into pool.go

* added pool tests

* changes to the design

added block requests under peer

moved the request trigger in the reactor poolRoutine, triggered now by a ticker

in general moved everything required for making block requests smarter in the poolRoutine

added a simple map of heights to keep track of what will need to be requested next

added a few more tests

* send errors to FSM in a different channel than blocks

send errors (RemovePeer) from switch on a different channel than the
one receiving blocks
renamed channels
added more pool tests

* more pool tests

* lint errors

* more tests

* more tests

* switch fast sync to new implementation

* fixed data race in tests

* cleanup

* finished fsm tests

* address golangci comments :)

* address golangci comments :)

* Added timeout on next block needed to advance

* updating docs and cleanup

* fix issue in test from previous cleanup

* cleanup

* Added termination scenarios, tests and more cleanup

* small fixes to adr, comments and cleanup

* Fix bug in sendRequest()

If we tried to send a request to a peer not present in the switch, a
missing continue statement caused the request to be blackholed in a peer
that was removed and never retried.

While this bug was manifesting, the reactor kept asking for other
blocks that would be stored and never consumed. Added the number of
unconsumed blocks in the math for requesting blocks ahead of current
processing height so eventually there will be no more blocks requested
until the already received ones are consumed.

* remove bpPeer's didTimeout field

* Use distinct err codes for peer timeout and FSM timeouts

* Don't allow peers to update with lower height

* review comments from Ethan and Zarko

* some cleanup, renaming, comments

* Move block execution in separate goroutine

* Remove pool's numPending

* review comments

* fix lint, remove old blockchain reactor and duplicates in fsm tests

* small reorg around peer after review comments

* add the reactor spec

* verify block only once

* review comments

* change to int for max number of pending requests

* cleanup and godoc

* Add configuration flag fast sync version

* golangci fixes

* fix config template

* move both reactor versions under blockchain

* cleanup, golint, renaming stuff

* updated documentation, fixed more golint warnings

* integrate with behavior package

* sync with master

* gofmt

* add changelog_pending entry

* move to improvments

* suggestion to changelog entry

* Renamed wire.go to codec.go (#3827)

* Renamed wire.go to codec.go

- Wire was the previous name of amino
- Codec describes the file better than `wire` & `amino`

Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>

* ide error

* rename amino.go to codec.go

* docs: add guides to docs (#3830)

* add staticcheck linting (#3828)

cleanup to add linter

    grpc change:
        https://godoc.org/google.golang.org/grpc#WithContextDialer
        https://godoc.org/google.golang.org/grpc#WithDialer
        grpc/grpc-go#2627
    prometheous change:
        due to UninstrumentedHandler, being deprecated in the future
    empty branch = empty if or else statement
        didn't delete them entirely but commented
        couldn't find a reason to have them
    could not replicate the issue #3406
        but if want to keep it commented then we should comment out the if statement as well

* types: move MakeVote / MakeBlock functions (#3819)

to the types package

Paritally Fixes #3584

* p2p: Fix error logging for connection stop (#3824)

* p2p: fix false-positive error logging when stopping connections

This changeset fixes two types of false-positive errors occurring during
connection shutdown.

The first occurs when the process invokes FlushStop() or Stop() on a
connection. While the previous behavior did properly wait for the sendRoutine
to finish, it did not notify the recvRoutine that the connection was shutting
down. This would cause the recvRouting to receive and error when reading and
log this error. The changeset fixes this by notifying the recvRoutine that
the connection is shutting down.

The second occurs when the connection is terminated (gracefully) by the other side.
The recvRoutine would get an EOF error during the read, log it, and stop the connection
with an error. The changeset detects EOF and gracefully shuts down the connection.

* bring back the comment about flushing

* add changelog entry

* listen for quitRecvRoutine too

* we have to call stopForError

Otherwise peer won't be removed from the peer set and maybe readded
later.

* p2p: Do not write 'Couldn't connect to any seeds' if there are no seeds (#3834)

* Do not write 'Couldn't connect to any seeds' if there are no seeds

* changelog

* remove privValUpgrade

* Fix typo in changelog

* Update CHANGELOG_PENDING.md

Co-Authored-By: Marko <marbar3778@yahoo.com>

I'm setting up all peers dynamically by calling dial_peers, so p2p.seeds in configs is empty, and I'm seeing error log a lot in logs.

* docs: add a footer to guides (#3835)

* docs: "Writing a Tendermint Core application in Kotlin (gRPC)" guide (#3838)

* add abci grpc kotlin guide

* Update docs/guides/kotlin.md

Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com>

* Update docs/guides/kotlin.md

Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com>

* Update docs/guides/kotlin.md

Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com>

* Update kotlin.md

*  node: allow replacing existing p2p.Reactor(s)  (#3846)

* node: allow replacing existing p2p.Reactor(s)

using [`CustomReactors`
option](https://godoc.org/github.com/tendermint/tendermint/node#CustomReactors).
Warning: beware of accidental name clashes. Here is the list of existing
reactors: MEMPOOL, BLOCKCHAIN, CONSENSUS, EVIDENCE, PEX.

* check the absence of "CUSTOM" prefix

* merge 2 tests

* add doc.go to node package

* gocritic (1/2) (#3836)

    Add gocritic as a linter

    The linting is not complete, but should i complete in this PR or in a following.

    23 files have been touched so it may be better to do in a following PR


Commits:

* Add gocritic to linting

- Added gocritic to linting

Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>

* gocritic

* pr comments

* remove switch in cmdBatch

* tm-cmn to tm-db (#3850)

* tm-cmn to tm-db

* go.mod changes

* go.mod changes

* more go.mod

* fix tm-db

* ci fix, pending change

* version tmdb (#3854)

* txindexer: Refactor Tx Search Aggregation (#3851)

- Replace the previous intersect call, which was called at each query condition, with a map intersection.
- Replace fmt.Sprintf with string()

closes: #3076

Benchmarks

```
Old
goos: darwin
goarch: amd64
pkg: github.com/tendermint/tendermint/state/txindex/kv
BenchmarkTxSearch-4   	     200	 103641206 ns/op	 7998416 B/op	   71171 allocs/op
PASS
ok  	github.com/tendermint/tendermint/state/txindex/kv	26.019s

New
goos: darwin
goarch: amd64
pkg: github.com/tendermint/tendermint/state/txindex/kv
BenchmarkTxSearch-4   	    1000	  38615024 ns/op	13515226 B/op	  166460 allocs/op
PASS
ok  	github.com/tendermint/tendermint/state/txindex/kv	53.618s
```

~62% performance improvement

Commits:

* Refactor tx search

* Add pending changelog entry

* Add tx search benchmarking

* remove intermediate hashes list

also reset timer in BenchmarkTxSearch
and fix other benchmark

* fix import

* Add test cases

* Fix searching

* Replace fmt.Sprintf with string

* Update state/txindex/kv/kv.go

Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com>

* Rename params

* Cleanup

* Check error in benchmarks

* release for v0.32.2

* Merge PR #3860: Update log v0.32.2

* changelog updates

* pr comments

* Fix for panic in signature verification if a peer sends a nil public key.

* update version.go

* Changelog update

* Update CHANGELOG.md

Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com>

* update changelog

* p2p: only allow ed25519 pubkeys when connecting

also, recover from any possible failures in acceptPeers

Refs #4030

* update changelog and bump version to v0.32.6

* set the date to today

* cs: panic only when WAL#WriteSync fails

- modify WAL#Write and WAL#WriteSync to return an error

* types: validate Part#Proof

add ValidateBasic to crypto/merkle/SimpleProof

* cs: limit max bit array size and block parts count

* cs: test new limits

* cs: only assert important stuff

* update changelog and bump version to 0.32.7

* fixes after Ethan's review

* align max wal msg and max consensus msg sizes

* fix tests

* fix test

* Rc2 v0.32.8

Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>

* move issue to big fix
iKapitonau pushed a commit to scrtlabs/tendermint that referenced this pull request Sep 30, 2024
…#3819) (tendermint#3824)

---

#### PR checklist

- [ ] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
<hr>This is an automatic backport of pull request tendermint#3819 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
Co-authored-by: hvanz <hernan.vanzetto@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants