This repository was archived by the owner on Oct 28, 2021. It is now read-only.

[WIP] Aleth now waits for 2 secs after sending disconnect packet #5650 (#5658)

Closed
twinstar26 wants to merge 1 commit into ethereum:master from twinstar26:master

Conversation

@twinstar26 (Contributor) commented Jul 8, 2019

Fixes #5650

@halfalicious (Contributor):

You're getting a build error:

[100/253] Building CXX object libp2p/CMakeFiles/p2p.dir/Session.cpp.o
FAILED: libp2p/CMakeFiles/p2p.dir/Session.cpp.o 
/usr/bin/ccache /usr/bin/clang++-6.0  -DETH_FATDB -I/home/builder/project -isystem /home/builder/.hunter/_Base/3511e5f/e2d20c6/0d35112/Install/include -I. -isystem deps/include -O3 -DNDEBUG -fPIC   -fstack-protector-strong -Wimplicit-fallthrough -Wall -Wextra -Werror -Wno-unknown-pragmas -fvisibility=hidden -std=c++11 -MD -MT libp2p/CMakeFiles/p2p.dir/Session.cpp.o -MF libp2p/CMakeFiles/p2p.dir/Session.cpp.o.d -o libp2p/CMakeFiles/p2p.dir/Session.cpp.o -c /home/builder/project/libp2p/Session.cpp
/home/builder/project/libp2p/Session.cpp:263:49: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
                    disconnectTimer.async_wait([this](boost::system::error_code const& ) {});
                                                ^
1 error generated.

@halfalicious (Contributor) commented Jul 9, 2019

@twinstar26 Can you please add a changelog entry? Please see item 2 here for details: https://github.com/ethereum/aleth/blob/master/CONTRIBUTING.md

@halfalicious (Contributor) commented Jul 9, 2019

@twinstar26 Can you please update the PR description to reference the issue you're fixing (so that the issue will be automatically updated with a link to the PR and closed when the PR is closed)? Be sure to include a hash char (#) before the issue ID.


void Session::send(bytes&& _msg)
{
if(m_dropped)
Contributor:

(Nit) Formatting: can you please format your changes using git clang-format, as described in item 3 here: https://github.com/ethereum/aleth/blob/master/CONTRIBUTING.md

{
bytes const* out = nullptr;

bool disconnectImminent = false;
@halfalicious (Contributor) commented Jul 9, 2019:

This can be simplified to:

bool disconnectImminent = (writeQueue[0][0] == DisconnectPacket);

Also, since the type of the disconnectImminent value is obvious, we can make disconnectImminent auto (and const, since we're not going to modify its value after initially setting it).

However, you can remove the disconnectImminent check completely, since @gumb0 brought up a good way to simplify the changes: start the timer in Session::disconnect after the sealAndSend call. The resulting wait won't be exactly 2 seconds after the disconnect packet is actually sent out, but it should be close enough.

m_writeQueue.pop_front();
if(m_writeQueue.empty() && disconnectImminent)
{
ba::io_context ioc;
Contributor:

I think you should re-use the Host's io_context (Host::m_ioContext) - one way to do this would be to create the timer in Host::startPeerSession and pass it into the Session ctor. @gumb0 - thoughts?

Member:

Yes, the timer should use the Host's io_context.

Not sure about passing the timer to the constructor. Session has a pointer to Host; maybe it's better to introduce a public method in Host that creates a timer? This way we won't create it until it's really needed in disconnect.

Contributor (Author):

Alright!!

@halfalicious (Contributor) commented Jul 9, 2019:

> Yes, timer should better use io_context of Host.
>
> Not sure about passing timer to constructor. Session has a pointer to Host, maybe better to introduce a public method in Host that will create a timer? This way we won't create it until it's really needed in disconnect.

@gumb0: Good idea; this way we can also easily create timers in other classes if needed.

ba::steady_timer disconnectTimer(ioc);
disconnectTimer.expires_from_now(chrono::seconds(2));
disconnectTimer.async_wait([this](boost::system::error_code const& ) {});
ioc.run();
Contributor:

You don't need to explicitly call run(); Host already does this (see Host::doWork).

ba::io_context ioc;
ba::steady_timer disconnectTimer(ioc);
disconnectTimer.expires_from_now(chrono::seconds(2));
disconnectTimer.async_wait([this](boost::system::error_code const& ) {});
@halfalicious (Contributor) commented Jul 9, 2019:

You'll want to close the socket in the timer handler (since we want to close the socket after the timer has expired). Your code currently kicks off the timer wait but then immediately closes the socket (if it's already open).

Note that the timer wait is asynchronous, not synchronous, which means the async_wait call returns immediately. The way this works behind the scenes, at a high level: the wait operation is kicked off, and once the wait has completed the handler is placed into a completion-handler queue. The handlers in the queue are executed on whichever thread called io_context::run, which in this case is the network thread. Note that the io_context::run call blocks until all asynchronous operations have completed and their associated handlers have been executed.
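The completion-handler model described above can be illustrated with a toy, standard-library-only queue (a sketch of the idea only, not Boost.Asio; `ToyContext` and `asyncOp` are invented names):

```cpp
#include <deque>
#include <functional>

// Toy model of the completion-handler queue drained by io_context::run:
// async operations return immediately and only enqueue their handler;
// run() executes the queued handlers later, on whichever thread calls it.
struct ToyContext
{
    std::deque<std::function<void()>> handlers;

    // Analogue of async_wait: returns immediately, the handler runs later.
    void asyncOp(std::function<void()> handler)
    {
        handlers.push_back(std::move(handler));
    }

    // Analogue of io_context::run: loops until the queue is drained,
    // returning the number of handlers executed.
    int run()
    {
        int executed = 0;
        while (!handlers.empty())
        {
            auto h = std::move(handlers.front());
            handlers.pop_front();
            h();
            ++executed;
        }
        return executed;
    }
};
```

This is the shape of the bug discussed here: code that calls asyncOp and then immediately closes the socket acts before run() ever executes the handler.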

Contributor (Author):

Oh that's what the anomaly was in my testing before and after changes. So stupid of me.

Contributor:

> Oh that's what the anomaly was in my testing before and after changes. So stupid of me.

@twinstar26 No worries, accidents happen 😄

@twinstar26 (Contributor, Author):

Okay, so I will be shifting the timer into Session::disconnect, and we will wait for 2 secs. We wouldn't care whether the packet has been sent or not: just wait for 2 secs and let the async writing happen (in the worst case the timer may run out before the write happens, and we close the socket in the middle of the operation). But I think we should make some kind of provision to know for sure that the operation has completed. The async is really cool, but before closing we should make sure everything is fine, so that we can have a good night's sleep. :-)

Also, you mentioned that we should use Host::io_context. I did think about that at first, but then I realized that we would be sharing it with other code. If some code is using it for its async ops and we use it at the same time, will that cause problems?

@twinstar26 twinstar26 changed the title Aleth now waits for 2 secs after sending disconnect packet Aleth now waits for 2 secs after sending disconnect packet (#5650) Jul 9, 2019
@twinstar26 twinstar26 changed the title Aleth now waits for 2 secs after sending disconnect packet (#5650) Aleth now waits for 2 secs after sending disconnect packet #5650 Jul 9, 2019
@twinstar26 (Contributor, Author):

> Also you mentioned that we should use Host::io_context. I did think about that at first but then I realized that we will be sharing it with something and if so will it cause problems like some code is using it for its async ops and we are using it in the middle of it??

> Distinct objects: Safe.
> Shared objects: Safe, with the specific exceptions of the restart() and notify_fork() functions. Calling restart() while there are unfinished run(), run_one(), run_for(), run_until(), poll() or poll_one() calls results in undefined behaviour. The notify_fork() function should not be called while any io_context function, or any function on an I/O object that is associated with the io_context, is being called in another thread.

From the official docs. And:

aleth/libp2p/Host.cpp

Lines 206 to 214 in 007d89a

void Host::doneWorking()
{
// Return early if we have no capabilities since there's nothing to do. We've already stopped
// the io service and cleared the node table timers which means discovery is no longer running.
if (!haveCapabilities())
return;
// reset io_context (allows manually polling network, below)
m_ioContext.restart();

Host does call restart(), so I think we are better off with a new io_context. Let me know.

@gumb0 (Member) commented Jul 9, 2019

@twinstar26 This restart() is called only when run() call is finished, so it shouldn't be of any concern for your solution.

We should reuse Host::m_ioContext, because this way we can be sure that all handlers (including your new timer handler) are called from the same thread, and thread-safety problems are avoided.
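The reasoning above can be sketched with a minimal, standard-library-only handler queue (hypothetical names; not Boost.Asio): handlers may be posted from any thread, but they all execute on the single thread that calls run(), so per-session state touched only from handlers needs no extra locking.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Toy stand-in for a shared io_context: post() is thread-safe,
// run() drains the queue on the calling thread only.
struct SharedContext
{
    std::mutex m;
    std::deque<std::function<void()>> q;

    // May be called from any thread.
    void post(std::function<void()> h)
    {
        std::lock_guard<std::mutex> lk(m);
        q.push_back(std::move(h));
    }

    // Executes every queued handler on the calling thread
    // (the "network thread" in aleth's case).
    void run()
    {
        for (;;)
        {
            std::function<void()> h;
            {
                std::lock_guard<std::mutex> lk(m);
                if (q.empty())
                    return;
                h = std::move(q.front());
                q.pop_front();
            }
            h();
        }
    }
};

// Returns true iff handlers posted from different threads all ran on
// the thread that called run().
inline bool handlersRunOnRunnersThread()
{
    SharedContext ctx;
    std::thread::id ranOn1, ranOn2;
    std::thread poster([&] {
        ctx.post([&] { ranOn1 = std::this_thread::get_id(); });
    });
    poster.join();  // the handler is queued, not yet executed
    ctx.post([&] { ranOn2 = std::this_thread::get_id(); });
    ctx.run();
    return ranOn1 == std::this_thread::get_id() &&
           ranOn2 == std::this_thread::get_id();
}
```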

@twinstar26 (Contributor, Author):

@gumb0
So I have created Host::disconnectTimer(), which is public, and I want to call Session::drop() inside the handler of timer.async_wait(). Session::drop() is private, so I cannot call it on a Session object.
The options seem to be: make drop() public, make it protected and inherit, use friend, or some other approach unknown to me.
Which one should I go with?

@halfalicious (Contributor) commented Jul 11, 2019

> @gumb0
> So I have created Host::disconnectTimer() which is public and I want to call Session::drop() inside handler of timer.async_wait(). Session::drop() is private so I cannot call it using object of Session.
> So options maybe to convert drop() to public, convert drop() to protected and inherit, use friend, other concept unknown to me.
> So which one to go by??

@twinstar26 I would name the Host function which returns a new timer something more general e.g. Host::createTimer() since the returned timer doesn't have to be used for managing disconnects.

Also, Session::drop() being private shouldn't be an issue; you should be able to call it from within the timer handler inside Session::disconnect, e.g. (note that this is just example code and might not compile, but you should get the general idea):

auto self = shared_from_this();
auto disconnectTimer = m_host->createTimer();
disconnectTimer->expires_from_now(std::chrono::seconds(2));
disconnectTimer->async_wait([self, this, disconnectTimer](boost::system::error_code const& /* _ec */) {
    this->drop();
});

You can also take a look at Host::scheduleCapabilityWorkLoop for an example of how we pass handlers to timers.

@twinstar26 (Contributor, Author):

@gumb0 @halfalicious
When I use m_ioContext as the io_context, I am not getting a 2 sec wait and the socket is getting closed. But if I use another io_context object with run() in Host::Disconnect(), I get the desired result.
As @halfalicious pointed out, Host::doWork() is where m_ioContext.run() is called, so it should work out, but for some reason it doesn't.
Also, for every timer start the socket closes twice, and I still haven't found a reason for this.

This is my log when using m_ioContext; the socket closes immediately and doesn't wait for 2 secs:
Test Case "host":
timer starting
//////////socket is closed due to reason 8 //////////
//////////socket is closed due to reason 0 //////////
Test Case "saveNodes":
timer starting
timer starting
timer starting
timer starting
//////////socket is closed due to reason 0 //////////
timer starting
//////////socket is closed due to reason 0 //////////
//////////socket is closed due to reason 8 //////////
//////////socket is closed due to reason 0 //////////
//////////socket is closed due to reason 0 //////////
//////////socket is closed due to reason 8 //////////
//////////socket is closed due to reason 0 //////////
//////////socket is closed due to reason 8 //////////
//////////socket is closed due to reason 8 //////////
//////////socket is closed due to reason 8 //////////
timer starting
//////////socket is closed due to reason 8 //////////
//////////socket is closed due to reason 0 //////////
timer starting
//////////socket is closed due to reason 8 //////////
//////////socket is closed due to reason 0 //////////
timer starting
//////////socket is closed due to reason 8 //////////
//////////socket is closed due to reason 0 //////////
timer starting
//////////socket is closed due to reason 8 //////////
//////////socket is closed due to reason 0 //////////

@halfalicious (Contributor) commented Jul 16, 2019

> @gumb0 @halfalicious
> When I am using m_ioContext as io_context, I am not getting a 2 sec wait and socket is getting closed. But if I use other io_context object with run() in Host::Disconnect() I am getting desired result. [...]

Can you please push your latest changes so one of us can take a look and try to repro your issue locally? (I took a look at your branch and it doesn't have the changes you're referring to)

Take a look at this prototype I threw together, it only closes each socket once and waits 2 seconds after disconnect: 74b04c9

@twinstar26 (Contributor, Author) commented Jul 17, 2019

@halfalicious
So I confirmed that our logic (your prototype with corrections) was indeed correct. I also confirmed that async_wait's handler was never executed during the unit tests (actually the handler was exiting with an exception stating "Operation failed", which is boost::asio::error::operation_aborted and is due to the timer being cancelled). Now it makes sense why we thought our solution was not working: Host::doWork() has run(), but because the m_ioContext object is restarted/stopped in a certain way, the queued async_wait handler was never executed.

A possible solution might be to create another method inside Host which will run m_ioContext's handlers without interfering with Host::doWork().
But we have a problem, as "Run network. Not thread-safe; to be called only by worker." is the comment left by the creator of doWork().
@halfalicious @gumb0 please confirm my findings and the possible solution that I have suggested. Also, if you have a better approach, let me know.

My branch is https://github.com/twinstar26/aleth/tree/aleth-test

@twinstar26 (Contributor, Author) commented Jul 18, 2019

log_wait_handler_not_executed.txt

In these logs there is no `"closing of socket due to " << _reason` output, which I am printing just before this line:
https://github.com/twinstar26/aleth/blob/2610662b476398c561cdee1d68010f4fbfa67cf2/libp2p/Session.cpp#L319

@codecov-io commented Jul 18, 2019

Codecov Report

Merging #5658 into master will decrease coverage by 0.06%.
The diff coverage is 76.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5658      +/-   ##
==========================================
- Coverage   62.97%   62.91%   -0.07%     
==========================================
  Files         350      349       -1     
  Lines       29991    29734     -257     
  Branches     3361     3347      -14     
==========================================
- Hits        18886    18706     -180     
+ Misses       9885     9819      -66     
+ Partials     1220     1209      -11

@twinstar26 twinstar26 changed the title Aleth now waits for 2 secs after sending disconnect packet #5650 [WIP] Aleth now waits for 2 secs after sending disconnect packet #5650 Jul 18, 2019
else
drop(_reason);
});
m_server->pollAsyncTimerHandler();
Member:

What happens if you don't do this?

Contributor (Author):

@gumb0
Generally we poll when the function whose handler was queued is being called multiple times (in a recursion or a loop), so that any handlers ready to run will start executing.
The original motivation was that I knew the unit tests were calling disconnect multiple times, which you later confirmed: they create and destroy the host multiple times.

But I realized that there are various parts of the code that call disconnect in a loop, e.g.:

aleth/libp2p/Host.cpp

Lines 247 to 263 in 4946ed6

// disconnect peers
for (unsigned n = 0;; n = 0)
{
DEV_RECURSIVE_GUARDED(x_sessions)
for (auto i: m_sessions)
if (auto p = i.second.lock())
if (p->isConnected())
{
p->disconnect(ClientQuit);
n++;
}
if (!n)
break;
// poll so that peers send out disconnect packets
m_ioContext.poll();
}

So I kept pollAsyncTimerHandler public in Host; its definition is here:
https://github.com/twinstar26/aleth/blob/e83913941ace68e2b41597ddae7fa5ba8e6c3b6c/libp2p/Host.h#L266-L269

Contributor (Author):

Also, I realize there is a bug: we should be calling drop(DisconnectReason) in here too, right?
https://github.com/twinstar26/aleth/blob/e83913941ace68e2b41597ddae7fa5ba8e6c3b6c/libp2p/Session.cpp#L312-L314
So that if the timer fails for some reason, we still issue a drop. What do you think, @gumb0?

Contributor:

I actually don't think you need to call drop() there... I think the handler will only ever be called with a boost error code if the timer is cancelled, but you never call cancel on the timer.

@twinstar26 (Contributor, Author) commented Jul 20, 2019

ToDo list

  1. Add m_dropped checks
  2. Add a changelog entry
  3. Fix the deprecated steady_timer function usage

@halfalicious (Contributor):

> ToDo list
>
> 1. halfalicious suggested to replace socket->isConnected() with m_dropped (as to prevent wasting resources e.g. continuing to sync with a client that's disconnecting from it) but that's causing multiple unit-tests to fail. They state that `check host.peerCount() == c_peers has failed` (need to research)

This should be !m_dropped - I tested this locally with my prototype and all of the libp2p tests pass (testeth -t libp2p/*)

@halfalicious (Contributor) commented Jul 24, 2019

I don't think you need to check the error code in the timer handler; you can just check whether the socket is open and, if so, close it. There's no risk of use-after-free, since the handler only accesses Session state and the Session instance is guaranteed to still be alive: the handler captures shared_from_this.

@halfalicious (Contributor):

Good call on the timer handlers not being executed when the timers are cancelled (on shutdown). I don't think we care about this when Aleth is shutting down, since the sockets that are left open will be reclaimed when the process exits, but I do think we care about it when running tests, since the same process is used to run all tests.

I think the best way to ensure that the handlers are executed is to modify Host::doneWorking, rather than polling the io context from within Session, since the role of doneWorking is to clean up Host resources on network shutdown. I think we can just move the n++ above the "if (p->isConnected())" check here: https://github.com/twinstar26/aleth/blob/e83913941ace68e2b41597ddae7fa5ba8e6c3b6c/libp2p/Host.cpp#L247-L257

This would prevent the disconnect-peers loop from exiting until all of the Session weak pointers have expired, which happens after all of the disconnect timer handlers execute (the timer handler takes a Session shared_ptr, which is what keeps a Session alive; all other Session pointers, e.g. the ones in Host, are weak_ptrs, so they don't count toward the Session reference count).
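The lifetime argument above can be sketched with a minimal stand-in for Session (hypothetical names; the real class lives in libp2p): a handler that captures shared_from_this() keeps the object alive after every other strong reference is gone, and only once the handler itself is destroyed do the weak_ptrs held elsewhere expire.

```cpp
#include <functional>
#include <memory>

// Minimal stand-in for aleth's Session: the timer handler captures a
// shared_ptr to the session via shared_from_this().
struct Session : std::enable_shared_from_this<Session>
{
    bool dropped = false;

    // Builds the timer handler; the captured shared_ptr becomes the
    // only strong reference once the caller releases its own.
    std::function<void()> makeTimerHandler()
    {
        auto self = shared_from_this();
        return [self] { self->dropped = true; };
    }
};

// Returns true iff the pending handler kept the Session alive until it
// ran, after which the Session was freed along with the handler.
inline bool handlerKeepsSessionAlive()
{
    std::function<void()> pending;
    std::weak_ptr<Session> hostView;  // the "Host" only holds weak_ptrs
    {
        auto s = std::make_shared<Session>();
        hostView = s;
        pending = s->makeTimerHandler();
    }  // last explicit shared_ptr released here
    bool aliveBeforeRun = !hostView.expired();  // handler still holds it
    pending();                                  // "timer fires"
    pending = nullptr;                          // handler destroyed
    bool goneAfterRun = hostView.expired();     // session freed with it
    return aliveBeforeRun && goneAfterRun;
}
```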

I've tested this change locally, and the handlers are executed in the relevant network tests. The downside is that some of the network tests will take a little longer to run, since they wait 2 seconds for the disconnect timers to expire, but there aren't many tests which do this, so I think that's okay (@gumb0: please correct me if I'm wrong here).

@gumb0 (Member) commented Jul 24, 2019

@halfalicious Would this result in a 2-second (or more) wait at aleth exit?

I don't think it's really necessary to wait at exit. And the sockets will be closed in the Session destructor anyway, so in tests it should already be fine.

@halfalicious (Contributor):

> @halfalicious Would this result in 2 seconds (or more) wait at aleth exit?
>
> I don't think it's very necessary to wait at exit. And sockets will be anyway closed in Session destructor, so in tests it should be fine already.

@gumb0 Yes, it would. Thanks for calling out that the sockets are closed in the dtor. @twinstar26, they are closed here:

Session::~Session()

We also poll after calling disconnect in Host::doneWorking which will ensure that the disconnect packets are actually sent out on shutdown. As such, I don't think any modifications to Host::doneWorking are necessary.

@twinstar26 (Contributor, Author):

Thanks for the reviews and suggestions, @gumb0 and @halfalicious!

So to conclude:

  1. No need for the m_server->pollAsyncTimerHandler() polling function.
  2. No need to modify Host::doneWorking or to call Host::doneWorking inside disconnect.
  3. No need for the error-code check inside the timer handler.
  4. Implement the todo list (#5658 (comment)).

After these changes aleth will not wait for 2 secs during tests, but the actual operation of aleth is quite different, and the tests don't really simulate that.

@halfalicious (Contributor):

@twinstar26 Just checking in to see if you're still working on this?

@twinstar26 (Contributor, Author):

> @twinstar26 Just checking in to see if you're still working on this?

@halfalicious Sorry! Yes, I am still working on it; I was busy with college. I will push a commit in 3-4 hours from now.

@halfalicious (Contributor):

> @twinstar26 Just checking in to see if you're still working on this?
>
> @halfalicious . Sorry! Yes I am working. I was busy with my college stuff. I will push a commit in 3-4 hours from now.

No worries, just wanted to see if you were still interested in wrapping this up 😄 I've reviewed your new changes and left comments in the new PR.

@gumb0 (Member) commented Jul 31, 2019

Replaced by #5707

@gumb0 gumb0 closed this Jul 31, 2019


Successfully merging this pull request may close these issues.

Aleth should wait 2 seconds to close socket after sending disconnect to peer

4 participants