Skip to content

delete has no more data after the key#53886

Closed
t-vi wants to merge 2 commits intopytorch:masterfrom
t-vi:tcpstore_del_perf
Closed

delete has no more data after the key#53886
t-vi wants to merge 2 commits intopytorch:masterfrom
t-vi:tcpstore_del_perf

Conversation

@t-vi
Copy link
Copy Markdown
Collaborator

@t-vi t-vi commented Mar 12, 2021

The tcpstore delete key implementation inadvertendly set "moreData" when sending the key when it was in fact the last message.

Thank you, @PetrochukM, for the reproducing example which was instrumental in developing the fix (and is the blueprint for the test case).

Fixes #53872

@facebook-github-bot facebook-github-bot added oncall: distributed Add this issue/PR to distributed oncall triage queue cla signed labels Mar 12, 2021
@facebook-github-bot
Copy link
Copy Markdown
Contributor

facebook-github-bot commented Mar 12, 2021

💊 CI failures summary and remediations

As of commit 08e0964 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@t-vi t-vi force-pushed the tcpstore_del_perf branch from 61b4a9f to c37714d Compare March 12, 2021 09:30
@t-vi t-vi force-pushed the tcpstore_del_perf branch 2 times, most recently from 3076c34 to 3fcb5eb Compare March 12, 2021 10:09
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 12, 2021

Codecov Report

Merging #53886 (08e0964) into master (1772e26) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #53886      +/-   ##
==========================================
- Coverage   77.30%   77.30%   -0.01%     
==========================================
  Files        1888     1888              
  Lines      183589   183589              
==========================================
- Hits       141925   141917       -8     
- Misses      41664    41672       +8     

Copy link
Copy Markdown
Contributor

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome fix and good catch @t-vi! I am a bit wary about adding that new test case test_delkey_perf since testing a perf change using multiprocessing can introduce flakiness / false signals in our CI.

Since TcpStore is used internally for storing membership information for nodes in distributed training, deletes are not frequent and time taken is small when compared to the actual training process.

Comment thread test/distributed/test_c10d.py Outdated
Comment on lines 343 to 351
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep this part if you are able to add an assert when comparing the duration and it is consistent. I don't think we need the tests below

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I didn't mean to have that in there. This part doesn't actually check anything useful related to the fix.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this wasn't updated

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@t-vi
Copy link
Copy Markdown
Collaborator Author

t-vi commented Mar 12, 2021

I think the part of the test you are concerned about is the part we want to test which only happens when you actually have multiple processes. Note that the timing bound is exceedingly generous, so I would be optimistic that it's not too flaky. Of course, if we had a better place to put a basic perf test, it'd be better. The main alternative I'd see is not have a test if we conclude that testing timings in multi-process is too flaky.

@H-Huang
Copy link
Copy Markdown
Contributor

H-Huang commented Mar 12, 2021

Under the hood, TCPStore is single threaded. There is a thread on the master which sequentially accepts query types (e.g. SET, GET, DELETE) and handles them, which is why testing the perf under multiprocessing does not have that much added benefits compared to a single process. You are correct about the timing bound it is pretty generous though 🙂

@t-vi
Copy link
Copy Markdown
Collaborator Author

t-vi commented Mar 12, 2021

The bug was in the communication, not the backend.

@H-Huang
Copy link
Copy Markdown
Contributor

H-Huang commented Mar 12, 2021

Ah good point, my misunderstanding. Code looks LGTM to me then! If you remove the extra part under _test_numkeys_delkeys I will merge it ASAP

The tcpstore delete key implementation inadvertendly set "moreData"
when sending the key when it was in fact the last message.

Thank you, @PetrochukM for the reproducing example.

Fixes pytorch#53872
@H-Huang
Copy link
Copy Markdown
Contributor

H-Huang commented Mar 12, 2021

failure in mac os (https://app.circleci.com/pipelines/github/pytorch/pytorch/284617/workflows/812c7265-5c1b-4ecf-becf-fb119cd226b8/jobs/11504465). I think this is intermittent, I didn't see this in the earlier run

@t-vi
Copy link
Copy Markdown
Collaborator Author

t-vi commented Mar 12, 2021

That must be the flakyness. :/ Do we have hope that it is only mac or do we think this type of test is just touchy?

Copy link
Copy Markdown
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick and great fix! In general I'm pretty unsure about adding a unittest that validates performance, as unittesting is mostly about correctness and we generally have benchmarks for performance. I'll defer the discussion to folks who've owned/worked on Store more, though.

Comment thread test/distributed/test_c10d.py Outdated
processes = []
num_processes = world_size
for i in range(num_processes):
p = mp.Process(target=self._test_delkey_perf_worker, args=(i, addr, port, world_size, messages))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be preferable for all distributed tests needing multiprocessing to inherit from MultiProcessTestCase, as that class has had significant work to ensure proper error handling, failure reporting, etc. If we do add this test, could we add a class that inherits from MultiProcessTestCase and runs this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so I thought copying the test from within the case would be reasonably safe but I lack the expertise. I do agree that this isn't the right thing to do here - and it's demonstrated by the present failure.

Comment thread test/distributed/test_c10d.py Outdated
store.delete_key(k)
dur_delete = (time.perf_counter() - start) * 1000

if dur_delete > 5 * max(dur_wait, dur_store):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems quite prone to flakiness, at the least could we make it a much more generous threshold?

Comment thread test/distributed/test_c10d.py Outdated
)
sys.exit(MultiProcessTestCase.TEST_ERROR_EXIT_CODE)

time.sleep(0.1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we try to avoid time.sleep() in tests as it has previously been the source of a lot of nondeterminism and flakiness.

@H-Huang
Copy link
Copy Markdown
Contributor

H-Huang commented Mar 12, 2021

I agree with Rohan's comments. Regarding whether mac is the only tests that we've seen hit these flaky test, there have been multiple different CIs that we have seen multiprocessing cause issues with and we are also currently investigating them. Adding to our cpp tests (TCPStoreTest) may also be more stable, but that would involve a bit more work. For now, I think we should just leave out the perf test, and get the fix itself checked in.

@t-vi
Copy link
Copy Markdown
Collaborator Author

t-vi commented Mar 13, 2021

OK, so based on the feedback, I'll remove the test for now. If we have a place for benchmark, I'd be glad to add it there in a separate PR.

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@H-Huang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@H-Huang merged this pull request in 8734e88.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
The tcpstore delete key implementation inadvertendly set "moreData" when sending the key when it was in fact the last message.

Thank you, PetrochukM, for the reproducing example which was instrumental in developing the fix (and is the blueprint for the test case).

Fixes pytorch#53872

Pull Request resolved: pytorch#53886

Reviewed By: jbschlosser

Differential Revision: D27011846

Pulled By: H-Huang

fbshipit-source-id: 5c460d1e4d095a8bc267bf63613b556856ced3e8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TCPStore.delete is 150x to 600x slower than TCPStore.set

5 participants