delete has no more data after the key #53886
Conversation
💊 CI failures summary and remediations — as of commit 08e0964 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 This comment was automatically generated by Dr. CI.
Codecov Report
@@ Coverage Diff @@
## master #53886 +/- ##
==========================================
- Coverage 77.30% 77.30% -0.01%
==========================================
Files 1888 1888
Lines 183589 183589
==========================================
- Hits 141925 141917 -8
- Misses 41664 41672 +8
H-Huang
left a comment
Awesome fix and good catch @t-vi! I am a bit wary about adding that new test case test_delkey_perf since testing a perf change using multiprocessing can introduce flakiness / false signals in our CI.
Since TcpStore is used internally for storing membership information for nodes in distributed training, deletes are not frequent and time taken is small when compared to the actual training process.
I would keep this part if you are able to add an assert when comparing the duration and it is consistent. I don't think we need the tests below
Oh, sorry, I didn't mean to have that in there. This part doesn't actually check anything useful related to the fix.
looks like this wasn't updated
facebook-github-bot
left a comment
@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I think the part of the test you are concerned about is the part we want to test, which only happens when you actually have multiple processes. Note that the timing bound is exceedingly generous, so I would be optimistic that it's not too flaky. Of course, if we had a better place to put a basic perf test, that would be preferable. The main alternative I see is to have no test at all, if we conclude that testing timings across multiple processes is too flaky.
Under the hood, TCPStore is single-threaded. There is a thread on the master which sequentially accepts query types (e.g. SET, GET, DELETE) and handles them, which is why testing the perf under multiprocessing does not add much benefit compared to a single process. You are correct that the timing bound is pretty generous, though 🙂
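To illustrate the point above, here is a minimal sketch of a single-threaded key-value server loop that handles SET/GET/DELETE queries strictly one at a time. The function and query shapes are invented for illustration, not TCPStore's actual implementation:

```python
def run_store_loop(queries):
    """Process (op, key[, value]) queries sequentially against one dict,
    the way a single server thread would drain its query stream."""
    data = {}
    results = []
    for op, key, *value in queries:
        if op == "SET":
            data[key] = value[0]
            results.append(None)
        elif op == "GET":
            results.append(data.get(key))
        elif op == "DELETE":
            # report whether the key existed, like delete_key's return value
            results.append(data.pop(key, None) is not None)
    return results
```

Because every query is serialized through this one loop, driving it from many client processes exercises the same code path as driving it from one.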
The bug was in the communication, not the backend.
Ah, good point, my misunderstanding. Code LGTM then! If you remove the extra part under
The TCPStore delete key implementation inadvertently set the "moreData" flag when sending the key, even though it was in fact the last message. Thank you, @PetrochukM, for the reproducing example. Fixes pytorch#53872
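For intuition about why a wrongly set "more data" flag hangs the peer, here is a hedged sketch of a length-prefixed framing scheme with such a flag. The frame layout, flag value, and function names are invented for illustration and are not TCPStore's actual wire format:

```python
import struct

MORE_DATA = 1  # hypothetical flag value, not the real protocol constant

def encode_frame(payload: bytes, more_data: bool) -> bytes:
    """One frame: 1 flag byte + 4-byte big-endian length + payload."""
    flag = MORE_DATA if more_data else 0
    return struct.pack(">BI", flag, len(payload)) + payload

def decode_frames(buf: bytes):
    """Collect frames until one arrives with the flag cleared."""
    messages, offset = [], 0
    while True:
        flag, length = struct.unpack_from(">BI", buf, offset)
        offset += 5
        messages.append(buf[offset:offset + length])
        offset += length
        # If the sender wrongly sets MORE_DATA on the final frame (as in
        # this bug), a blocking receiver would keep waiting for a frame
        # that never comes instead of returning here.
        if flag != MORE_DATA:
            return messages
```

A correct sender clears the flag on the last frame; the delete-key path here sent its final frame with the flag still set.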
failure on macOS (https://app.circleci.com/pipelines/github/pytorch/pytorch/284617/workflows/812c7265-5c1b-4ecf-becf-fb119cd226b8/jobs/11504465). I think this is intermittent; I didn't see it in the earlier run.
That must be the flakiness. :/ Do we have hope that it is only macOS, or do we think this type of test is just touchy?
rohan-varma
left a comment
Thanks for the quick and great fix! In general I'm pretty unsure about adding a unit test that validates performance, as unit testing is mostly about correctness and we generally have benchmarks for performance. I'll defer the discussion to folks who've owned/worked on Store more, though.
    processes = []
    num_processes = world_size
    for i in range(num_processes):
        p = mp.Process(target=self._test_delkey_perf_worker, args=(i, addr, port, world_size, messages))
It would be preferable for all distributed tests needing multiprocessing to inherit from MultiProcessTestCase, as that class has had significant work to ensure proper error handling, failure reporting, etc. If we do add this test, could we add a class that inherits from MultiProcessTestCase and runs this?
Yeah, I thought copying the test setup from within the existing case would be reasonably safe, but I lack the expertise. I do agree that this isn't the right thing to do here — and the present failure demonstrates it.
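As a rough illustration of what a multiprocess test harness has to manage, here is a stdlib-only sketch of spawning per-rank workers and propagating their results back to the parent. This is a simplified, hypothetical pattern, not MultiProcessTestCase itself (which additionally handles timeouts, error reporting, and exit-code checks):

```python
import multiprocessing as mp

def _worker(rank, conn):
    """Hypothetical worker: run per-rank checks and report back via a pipe."""
    try:
        assert rank >= 0  # stand-in for the real per-process assertions
        conn.send((rank, "ok"))
    except Exception as exc:
        # propagate the failure description instead of dying silently
        conn.send((rank, repr(exc)))
    finally:
        conn.close()

def run_workers(world_size):
    """Spawn one process per rank, collect each rank's result, then join."""
    results = []
    for rank in range(world_size):
        parent_conn, child_conn = mp.Pipe()
        p = mp.Process(target=_worker, args=(rank, child_conn))
        p.start()
        results.append(parent_conn.recv())
        p.join(timeout=30)
    return results
```

Inheriting from MultiProcessTestCase gives all of this (and more, such as per-rank failure reporting and exit-code conventions) without reimplementing it in each test.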
    store.delete_key(k)
    dur_delete = (time.perf_counter() - start) * 1000

    if dur_delete > 5 * max(dur_wait, dur_store):
This seems quite prone to flakiness, at the least could we make it a much more generous threshold?
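One way to make such a check less flaky is to widen the multiplier well past anything a correct implementation would hit on a loaded CI machine. A sketch, with illustrative names and a factor chosen only as an example:

```python
import time

def time_op_ms(fn, *args):
    """Return fn's wall-clock duration in milliseconds."""
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) * 1000

# A much more generous factor than 5x (e.g. 50x) only fires on pathological
# slowdowns like the hang this PR fixes, not on ordinary CI jitter.
GENEROUS_FACTOR = 50

def looks_pathological(dur_ms, baseline_ms):
    """Flag a duration only when it dwarfs the baseline operation."""
    return dur_ms > GENEROUS_FACTOR * max(baseline_ms, 1e-3)
```

The floor on the baseline avoids false positives when the comparison operation itself is near-instant.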
    )
    sys.exit(MultiProcessTestCase.TEST_ERROR_EXIT_CODE)

    time.sleep(0.1)
Generally we try to avoid time.sleep() in tests as it has previously been the source of a lot of nondeterminism and flakiness.
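The usual replacement for a fixed sleep is explicit synchronization, so the waiting side blocks exactly until the other side signals readiness. A small sketch with threads (the same pattern applies with multiprocessing primitives); names and the payload value are illustrative:

```python
import threading

def coordinated_pair():
    """Consumer waits on an Event set by the producer instead of sleeping
    for a guessed duration, which is both faster and deterministic."""
    ready = threading.Event()
    box = {}

    def producer():
        box["value"] = 42  # stand-in for real setup work
        ready.set()        # signal: data is available

    def consumer():
        # A timeout avoids hanging forever if the producer dies.
        assert ready.wait(timeout=30)
        return box["value"]

    t = threading.Thread(target=producer)
    t.start()
    result = consumer()
    t.join()
    return result
```

Unlike `time.sleep(0.1)`, this cannot race: the consumer never proceeds before the producer has finished, and never waits longer than necessary.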
I agree with Rohan's comments. Regarding whether macOS is the only platform hitting these flaky tests: we have seen multiprocessing cause issues in multiple different CIs, and we are currently investigating them. Adding to our C++ tests (TCPStoreTest) may also be more stable, but that would involve a bit more work. For now, I think we should just leave out the perf test and get the fix itself checked in.
OK, based on the feedback, I'll remove the test for now. If we have a place for benchmarks, I'd be glad to add it there in a separate PR.
facebook-github-bot
left a comment
@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: The TCPStore delete key implementation inadvertently set the "moreData" flag when sending the key, even though it was in fact the last message. Thank you, PetrochukM, for the reproducing example, which was instrumental in developing the fix (and is the blueprint for the test case). Fixes pytorch#53872 Pull Request resolved: pytorch#53886 Reviewed By: jbschlosser Differential Revision: D27011846 Pulled By: H-Huang fbshipit-source-id: 5c460d1e4d095a8bc267bf63613b556856ced3e8
The TCPStore delete key implementation inadvertently set the "moreData" flag when sending the key, even though it was in fact the last message.
Thank you, @PetrochukM, for the reproducing example, which was instrumental in developing the fix (and is the blueprint for the test case).
Fixes #53872