Kill the dummy TaskOutput when task.get_step() #10739
Closed
xush6528 wants to merge 2 commits into pytorch:master from xush6528:export-D9413150
Summary:
Assume the checkpoint manager was initialized with a **session** passed in.
- On initialization, the checkpoint manager loads the checkpoint into the session if one exists.
- Later, each time the `.save(...)` method is called, the session is saved to a checkpoint file.
- The checkpoint manager can run the `checkpoint_save_task` in the session immediately.
- **checkpoint_init_task?** The method `CheckpointManager.init()`, which creates the `checkpoint_init_task` (group), will be removed to simplify the API.
- In the old framework, `checkpoint_init_task` runs after init_group and does the following:
  - epoch_num == 0: Collect all blobs. In the future, blob collection will be done together with saving, right before the save.
  - epoch_num > 0: Load all blobs from checkpoint files.
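The load-on-initialization and save-on-demand behavior described in the list above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToySession`, `ToyCheckpointManager`, `round_trip`, and the JSON file format are hypothetical names and choices made for this example, not the Caffe2 `checkpoint.py` API.

```python
import json
import os
import tempfile


class ToySession:
    """Hypothetical stand-in for a session: just a dict of named blobs."""

    def __init__(self):
        self.blobs = {}


class ToyCheckpointManager:
    """Hypothetical manager bound to one session for its whole lifetime."""

    def __init__(self, session, path):
        self.session = session
        self.path = path
        # On initialization, load the checkpoint into the session if one exists.
        if os.path.exists(path):
            with open(path) as f:
                self.session.blobs.update(json.load(f))

    def save(self):
        # Collect the blobs right before saving, then write them out.
        snapshot = dict(self.session.blobs)
        with open(self.path, "w") as f:
            json.dump(snapshot, f)
        return sorted(snapshot)


def round_trip():
    """Save from one session, then restore into a fresh one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.json")
        old = ToySession()
        manager = ToyCheckpointManager(old, path)  # nothing to load yet
        old.blobs["weights"] = [1.0, 2.0]
        old.blobs["epoch"] = 3
        manager.save()
        new = ToySession()
        ToyCheckpointManager(new, path)  # loads on initialization
        return old.blobs, new.blobs
```

In this toy model there is no separate init step: constructing the manager is enough to restore state, which mirrors the stated goal of dropping `CheckpointManager.init()`.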
Steps to save.
caffe2/caffe2/python/checkpoint.py
caffe2/caffe2/python/checkpoint_test.py
Differential Revision: D9408812
fbshipit-source-id: 2f663446298ad8e795864402f34d9d9c75e82e79
Summary: Pull Request resolved: #10739

I wanted to assert that the blobs in the workspace after loading a checkpoint are exactly the same as the blobs in the workspace before saving to a checkpoint. But I found that calling `task.get_step()` adds a dummy task output blob, `task:output/ConstIntFill:0`, along with a dummy net, "task:output". This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

A ZMQ socket can't send an empty list. As a result, if the Task on the Worker had no output, the master would never stop waiting and would hang forever. So a dummy TaskOutput was added in `task.get_step()` to work around it.

After thinking it over, I think this should be fixed, because TaskOutput is at the user layer. The issue shouldn't have been solved by adding a TaskOutput. Instead, we should move the creation of the placeholder blob to a deeper layer, and remove the placeholder blob from the workspace afterwards to avoid polluting the user workspace. After this change, the workaround becomes fully transparent, with no side effects for users.

Differential Revision: D9413150
fbshipit-source-id: 49221745351880115b66a13973eb8584c2bd6b32
facebook-github-bot pushed a commit that referenced this pull request Aug 30, 2018
Summary: Pull Request resolved: #11048 Pull Request resolved: #10739

I wanted to assert that the blobs in the workspace of the new session after loading a checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint. But I found that calling `task.get_step()` adds a dummy task output blob, `task:output/ConstIntFill:0`, along with a dummy net, `task:output`. See https://fburl.com/937lf2yk. This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack. The reason for it is that a ZMQ socket can't send an empty blob list. As a result, if the Task on the Worker had no output, the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.

TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, where it pollutes user workspaces. Instead, we should move the creation of the dummy blob to a deeper layer, and remove the dummy blob from the workspace afterwards. After this change, the workaround becomes fully transparent, with no side effects for users.

Reviewed By: mraway
Differential Revision: D9566744
fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary: Pull Request resolved: pytorch#10739

I wanted to assert that the blobs in the workspace of the new session after loading a checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint. But I found that calling `task.get_step()` adds a dummy task output blob, `task:output/ConstIntFill:0`, along with a dummy net, `task:output`. See https://fburl.com/937lf2yk. This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack. The reason for it is that a ZMQ socket can't send an empty blob list. As a result, if the Task on the Worker had no output, the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.

TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, where it pollutes user workspaces. Instead, we should move the creation of the dummy blob to a deeper layer, and remove the dummy blob from the workspace afterwards. After this change, the workaround becomes fully transparent, with no side effects for users.

Reviewed By: mraway
Differential Revision: D9413150
fbshipit-source-id: 51aaf3201e26570b4fcf5738e9b9aa17c58777ac
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary: Pull Request resolved: pytorch#11048 Pull Request resolved: pytorch#10739

I wanted to assert that the blobs in the workspace of the new session after loading a checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint. But I found that calling `task.get_step()` adds a dummy task output blob, `task:output/ConstIntFill:0`, along with a dummy net, `task:output`. See https://fburl.com/937lf2yk. This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack. The reason for it is that a ZMQ socket can't send an empty blob list. As a result, if the Task on the Worker had no output, the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.

TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, where it pollutes user workspaces. Instead, we should move the creation of the dummy blob to a deeper layer, and remove the dummy blob from the workspace afterwards. After this change, the workaround becomes fully transparent, with no side effects for users.

Reviewed By: mraway
Differential Revision: D9566744
fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
Summary:
I wanted to assert that the blobs in the workspace after loading a checkpoint are exactly the same as the blobs in the workspace before saving to a checkpoint.
But I found that calling `task.get_step()` adds a dummy task output blob, `task:output/ConstIntFill:0`, along with a dummy net, "task:output".
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

A ZMQ socket can't send an empty list.
As a result, if the Task on the Worker had no output, the master would never stop waiting and would hang forever.
So a dummy TaskOutput was added in `task.get_step()` to work around it.

After thinking it over, I think this should be fixed, because TaskOutput is at the user layer.
The issue shouldn't have been solved by adding a TaskOutput.
Instead, we should move the creation of the placeholder blob to a deeper layer,
and remove the placeholder blob from the workspace afterwards to avoid polluting the user workspace.
After this change, the workaround becomes fully transparent, with no side effects for users.

Differential Revision: D9413150
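
The proposed fix, creating the placeholder in a deeper layer and removing it afterwards, can be illustrated with a minimal toy sketch. It assumes a workspace modeled as a plain dict and a message modeled as a list of (name, value) pairs. `PLACEHOLDER` and `ship_outputs` are hypothetical names used only for this example; they are not part of Caffe2 or ZMQ, and this is not the code in this PR.

```python
PLACEHOLDER = "__placeholder_output__"  # hypothetical name, not a real Caffe2 blob


def ship_outputs(workspace, output_names):
    """Build a non-empty message from a task's outputs.

    If the task has no outputs, a placeholder blob is created just long
    enough to build the message and is removed before returning, so the
    caller's workspace is left unchanged.
    """
    names = list(output_names)
    added_placeholder = False
    if not names:
        # Transport-level workaround: the message must not be empty.
        workspace[PLACEHOLDER] = 0
        names = [PLACEHOLDER]
        added_placeholder = True
    try:
        message = [(name, workspace[name]) for name in names]
    finally:
        if added_placeholder:
            # Clean up so the user-visible blob set stays exactly the same.
            del workspace[PLACEHOLDER]
    return message
```

Because the placeholder never outlives the call, a test can compare the workspace blob set before and after with a strict equality assertion, which is the check the description says was previously not possible.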