-
Notifications
You must be signed in to change notification settings - Fork 38.7k
http: Release work queue after event base finish #19033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
http: Release work queue after event base finish #19033
Conversation
|
@MarcoFalke can't find the issue where this was discussed 😞 |
|
Thanks! I was filtering for your issues... |
|
Concept ACK, assuming that it does what we do in the python code. Haven't looked at the code here. self.network_event_loop.call_soon_threadsafe(self.network_event_loop.stop) |
4c0a617 to
c8c1f24
Compare
|
setting env variable: QT_QPA_PLATFORM=windows ->same warning also on windows 10 & msvc the first test exits silent becoz |
@tarboss 38e2be4be00914e8be7ae9c7215985c9b481b0ae fixes it right? |
|
at least that warning ********* Start testing of AppTests ********* E:\win10dev\bc\bitcoin\build_msvc\x64\Debug\test_bitcoin-qt.exe (process 11012) exited with code 3. |
c8c1f24 to
1d39b7c
Compare
|
This might result in a connection reset by peer or something like that because one request job can be enqueued but not handled (no response sent). |
|
ACK fec477712f9b37dc16b61a54e037073974633f67 |
|
@laanwj see #19033 (comment). I can drop the commit, again. |
|
I prefer a new pull as well. The gui test thing doesn't need backport |
1d39b7c to
fec4777
Compare
|
Done, and rebased. |
hebasto
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK fec477712f9b37dc16b61a54e037073974633f67, I have not tested the code, but I have reviewed it and it looks OK, I agree it can be merged.
|
Testing with diff: diff --git a/src/httpserver.cpp b/src/httpserver.cpp
index 5e78fd1d71..fe3a3a8aab 100644
--- a/src/httpserver.cpp
+++ b/src/httpserver.cpp
@@ -264,6 +264,7 @@ static void http_request_cb(struct evhttp_request* req, void* arg)
// Dispatch to worker thread
if (i != iend) {
std::unique_ptr<HTTPWorkItem> item(new HTTPWorkItem(std::move(hreq), path, i->handler));
+ UninterruptibleSleep(std::chrono::milliseconds{55});
assert(workQueue);
if (workQueue->Enqueue(item.get()))
item.release(); /* if true, queue took ownership */Before: After: |
|
Thanks for these instructions @MarcoFalke. |
93eefaa to
d54596a
Compare
|
@MarcoFalke updated, see #19033 (comment). |
|
Ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm able to reproduce the problem consistently by adding the artificial delay suggested by @MarcoFalke and then starting a constant stream of RPC requests (of any kind):
$ while src/bitcoin-cli listbanned ; do :; done
and then stopping bitcoind with either control-c or the stop RPC.
bitcoind: httpserver.cpp:271: void http_request_cb(evhttp_request*, void*): Assertion `workQueue' failed.
Aborted (core dumped)
I verified that with this PR, this doesn't happen. Nice PR! My suggestions are optional, although guarding the LogPrintf is probably worth doing.
ACK 4e353cb
| { | ||
| LOCK(cs); | ||
| if (queue.size() >= maxDepth) { | ||
| if (!running || queue.size() >= maxDepth) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a necessary change, but because of this it, the following can appear in the debug.log:
2020-12-14T19:03:41Z WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting
which is harmless, but a bit confusing and definitely misleading. I suggest adding this guard around that LogPrintf:
if (workQueue->Enqueue(item.get()))
item.release(); /* if true, queue took ownership */
else {
if (!ShutdownRequested()) {
LogPrintf("WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting\n");
}
item->req->WriteReply(HTTP_INTERNAL_SERVER_ERROR, "Work queue depth exceeded");
(I did manually test that change.) Or you could change Enqueue() to return more than just a bool to indicate the reason the request was rejected, but it's probably not worth it. Or, you could check ShutdownRequested() before calling Enqueue() and there would be no need to check running in Enqueue. But I think your change is best because the WorkQueue object knows if it's running or not, makes sense to take advantage of that knowledge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well in that case the RPC error is also misleading. I'd like to avoid more calls to ShutdownRequested. Maybe just reword the error to something along "work queue full or shutdown in progress".
| while (running && queue.empty()) | ||
| cond.wait(lock); | ||
| if (!running) | ||
| if (!running && queue.empty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good change! This ensures that the queue is empty before exiting the thread, so there are no leftover items in the work queue when the destructor runs; without this, I got the following to appear in debug.log during testing:
2020-12-14T19:11:40Z ~HTTPRequest: Unhandled request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I would have to test to make sure but I think without this the client wouldn't get a response. I also think that very last response would now have connection: close header, see HTTPRequest::WriteReply - another case I'd like to remove the call to ShutdownRequested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I tested this, and without the changes to Enqueue and Run, bitcoin-cli prints the following:
error: Could not connect to the server 127.0.0.1:8332 (error code 1 - "EOF reached")
Make sure the bitcoind server is running and that you are connecting to the correct RPC port.
(I think this is the client not getting a response.)
With your changes to Enqueue and Run:
error: couldn't parse reply from server
This is a slight improvement, I guess. This error seems familiar... If I cherry-pick the 2-line change from #18335 then we get:
error: Server response: Work queue depth exceeded
It's the same wrong error message that I mentioned for the LogPrintf 😆. I guess that PR should change to say something like work queue depth exceeded or client or shutdown in progress, as you suggested. (Nothing for this needed by your PR here.)
| if (workQueue) { | ||
| delete workQueue; | ||
| workQueue = nullptr; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that this is the correct fix, because this now occurs after the call to g_thread_http.join() (just above, line 468). This join ensures that no new items can be enqueued.
Suggestion, you may want to keep all of the workQueue shutdown code together and move it to just after the call to g_thread_http.join(), that is, after line 469 (with the current line numbering, in other words just before if (eventHTTP)), instead of having it split between two places (beginning at line 453 above, and here). I don't think there's any reason those worker threads can't live a little longer (and there's no harm that I can think of). This would follow the pattern of tearing down the higher-level stuff before the lower-level stuff (not doing that is the source of this bug in the first place!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can't be 😄 this is the goal of this PR - destroy the work queue after event base finish. After g_thread_http.join() there can be stuff pending on the event base, and if you delete the workQueue before the event base finishes then all pending workItems will be destroyed (because they are unique_ptr) and then it's only problems from here on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay, I didn't understand about the event base; so maybe we could do all the shutdown (deconstruct) of the workQueue down here, where you now have the delete. Would this work?
suggestion
void StopHTTPServer()
{
LogPrint(BCLog::HTTP, "Stopping HTTP server\n");
// Unlisten sockets, these are what make the event loop running, which means
// that after this and all connections are closed the event loop will quit.
for (evhttp_bound_socket *socket : boundSockets) {
evhttp_del_accept_socket(eventHTTP, socket);
}
boundSockets.clear();
if (eventBase) {
LogPrint(BCLog::HTTP, "Waiting for HTTP event thread to exit\n");
if (g_thread_http.joinable()) g_thread_http.join();
}
if (eventHTTP) {
evhttp_free(eventHTTP);
eventHTTP = nullptr;
}
if (eventBase) {
event_base_free(eventBase);
eventBase = nullptr;
}
if (workQueue) {
LogPrint(BCLog::HTTP, "Waiting for HTTP worker threads to exit\n");
for (auto& thread: g_thread_http_workers) {
thread.join();
}
g_thread_http_workers.clear();
delete workQueue;
workQueue = nullptr;
}
LogPrint(BCLog::HTTP, "Stopped HTTP server\n");
}
(That seems slightly simpler and clearer to me, but the way you have it now is good.) If you make changes, I don't mind taking another look (and testing) and quickly giving another ack.
|
Thanks for reviewing @LarryRuane! I'll hold on your suggestion to keep your and @fjahr ACK. I also think the error and log messages can be improved in a follow to avoid dragging this. In any case I hope my above comments are clear to you, otherwise let me know. @hebasto mind to test again? It would be great to push this early on the release cycle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 4e353cb, tested (rebased on top of master 9313c4e) on Linux Mint 20.1 (x86_64) using MarcoFalke's patch, including different -rpcthreads/-rpcworkqueue cases. The bug is fixed. The code is correct.
Agree, that the message "error: Server response: Work queue depth exceeded" should be reworded as it covers two different cases now.
Edit (laanwj): removed the @ in the ACK line
…mbers 34b04ee refactor: Add TSA annotations to the WorkQueue class members (Hennadii Stepanov) Pull request description: Noted while reviewing #19033, and hoping this will not conflict with it :) ACKs for top commit: promag: Code review ACK 34b04ee. Tree-SHA512: 4c15729acd95223263c19bc0dd64b9e7960872b48edee6eee97a5d0c2b99b8838185ac3a2ccd5bee992cb3a12498633427fe9919be5a12da9949fcf69a6275a0
…lass members 34b04ee refactor: Add TSA annotations to the WorkQueue class members (Hennadii Stepanov) Pull request description: Noted while reviewing bitcoin#19033, and hoping this will not conflict with it :) ACKs for top commit: promag: Code review ACK 34b04ee. Tree-SHA512: 4c15729acd95223263c19bc0dd64b9e7960872b48edee6eee97a5d0c2b99b8838185ac3a2ccd5bee992cb3a12498633427fe9919be5a12da9949fcf69a6275a0
|
ACK 4e353cb Reviewed the code and it makes sense. However I was unable to reproduce the race condition so I cannot test this. |
|
(re-)ACK 4e353cb |
4e353cb http: Release work queue after event base finish (João Barbosa) Pull request description: This fixes a race between `http_request_cb` and `StopHTTPServer` where the work queue is used after release. Fixes bitcoin#18856. ACKs for top commit: fjahr: Code review ACK 4e353cb achow101: ACK 4e353cb LarryRuane: ACK 4e353cb hebasto: ACK 4e353cb, tested (rebased on top of master 9313c4e) on Linux Mint 20.1 (x86_64) using MarcoFalke's [patch](bitcoin#19033 (comment)), including different `-rpcthreads`/`-rpcworkqueue` cases. The bug is fixed. The code is correct. Tree-SHA512: 185d2a9744d0d5134d782bf321ac9958ba17b11a5b3d70b4897c8243e6b146dfd3f23c57aef8e10ae9484374120b64389c1949a9cf0a21dccc47ffc934c20930
…lass members 34b04ee refactor: Add TSA annotations to the WorkQueue class members (Hennadii Stepanov) Pull request description: Noted while reviewing bitcoin#19033, and hoping this will not conflict with it :) ACKs for top commit: promag: Code review ACK 34b04ee. Tree-SHA512: 4c15729acd95223263c19bc0dd64b9e7960872b48edee6eee97a5d0c2b99b8838185ac3a2ccd5bee992cb3a12498633427fe9919be5a12da9949fcf69a6275a0
4e353cb http: Release work queue after event base finish (João Barbosa) Pull request description: This fixes a race between `http_request_cb` and `StopHTTPServer` where the work queue is used after release. Fixes bitcoin#18856. ACKs for top commit: fjahr: Code review ACK 4e353cb achow101: ACK 4e353cb LarryRuane: ACK 4e353cb hebasto: ACK 4e353cb, tested (rebased on top of master 9313c4e) on Linux Mint 20.1 (x86_64) using MarcoFalke's [patch](bitcoin#19033 (comment)), including different `-rpcthreads`/`-rpcworkqueue` cases. The bug is fixed. The code is correct. Tree-SHA512: 185d2a9744d0d5134d782bf321ac9958ba17b11a5b3d70b4897c8243e6b146dfd3f23c57aef8e10ae9484374120b64389c1949a9cf0a21dccc47ffc934c20930
…lass members 34b04ee refactor: Add TSA annotations to the WorkQueue class members (Hennadii Stepanov) Pull request description: Noted while reviewing bitcoin#19033, and hoping this will not conflict with it :) ACKs for top commit: promag: Code review ACK 34b04ee. Tree-SHA512: 4c15729acd95223263c19bc0dd64b9e7960872b48edee6eee97a5d0c2b99b8838185ac3a2ccd5bee992cb3a12498633427fe9919be5a12da9949fcf69a6275a0
4e353cb http: Release work queue after event base finish (João Barbosa) Pull request description: This fixes a race between `http_request_cb` and `StopHTTPServer` where the work queue is used after release. Fixes bitcoin#18856. ACKs for top commit: fjahr: Code review ACK 4e353cb achow101: ACK 4e353cb LarryRuane: ACK 4e353cb hebasto: ACK 4e353cb, tested (rebased on top of master 9313c4e) on Linux Mint 20.1 (x86_64) using MarcoFalke's [patch](bitcoin#19033 (comment)), including different `-rpcthreads`/`-rpcworkqueue` cases. The bug is fixed. The code is correct. Tree-SHA512: 185d2a9744d0d5134d782bf321ac9958ba17b11a5b3d70b4897c8243e6b146dfd3f23c57aef8e10ae9484374120b64389c1949a9cf0a21dccc47ffc934c20930
This fixes a race between
http_request_cbandStopHTTPServerwherethe work queue is used after release.
Fixes #18856.