
Kill the dummy TaskOutput when task.get_step() #10739

Closed
xush6528 wants to merge 2 commits into pytorch:master from xush6528:export-D9413150

Conversation

@xush6528
Contributor

Summary:
I wanted to assert that the blobs in the workspace of the new session after loading a checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net, `task:output`, is also added along with it.

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

A ZMQ socket can't send an empty list.
As a result, if the Task on the worker had no output,
the master would never stop waiting and would hang forever.

So a dummy TaskOutput was added in `task.get_step()` to work around this.

After thinking about it, I believe this should be fixed.

TaskOutput is at the user layer; the issue shouldn't have been solved by adding a TaskOutput.

Instead, we should move the creation of the placeholder blob to a deeper layer,
and remove the placeholder blob from the workspace afterwards to avoid polluting the user workspace.
After this change, the workaround becomes totally transparent, with no side effects for users.
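The fix described above can be sketched generically. This is an illustrative mock, not the actual Caffe2 API: `Workspace`, `compile_task`, `cleanup_task`, and the placeholder blob name are all invented for the example.

```python
# Illustrative sketch of the pattern described above: instead of adding a
# dummy TaskOutput at the user layer, the placeholder blob is created at a
# deeper (compilation) layer and removed from the workspace afterwards,
# so the user never observes it. All names here are hypothetical.

PLACEHOLDER = "__task_placeholder__"  # hypothetical internal blob name

class Workspace:
    """Minimal stand-in for a blob workspace."""
    def __init__(self):
        self.blobs = {}

def compile_task(task_outputs, workspace):
    """Deeper-layer step: guarantee a non-empty output list for transport."""
    outputs = list(task_outputs)
    added_placeholder = False
    if not outputs:
        # ZMQ-style transports cannot ship an empty list, so add a filler.
        workspace.blobs[PLACEHOLDER] = 0
        outputs.append(PLACEHOLDER)
        added_placeholder = True
    return outputs, added_placeholder

def cleanup_task(workspace, added_placeholder):
    """Remove the filler so the user workspace is left unpolluted."""
    if added_placeholder:
        del workspace.blobs[PLACEHOLDER]

ws = Workspace()
outs, added = compile_task([], ws)   # user declared no outputs
assert outs == [PLACEHOLDER]         # transport sees a non-empty list
cleanup_task(ws, added)
assert ws.blobs == {}                # workspace is clean afterwards
```

With the cleanup step, the user-visible workspace contains exactly the blobs the user created, so an "Equal" assertion over workspaces becomes possible.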

Differential Revision: D9413150

Summary:
The checkpoint manager is initialized with a **session** passed in.

- On initialization, the checkpoint manager loads the checkpoint to session if there is one.
- Later, each time the `.save(...)` method is called, it saves the session to a checkpoint file.

- The checkpoint manager can immediately run the `checkpoint_save_task` in the session.

- **checkpoint_init_task?** The `CheckpointManager.init()` method, which makes the `checkpoint_init_task` (group), will be removed to simplify the API.
    - In the old framework, `checkpoint_init_task` runs after `init_group` and does the following:
       - epoch_num == 0: collect all blobs. In the future, blob collection will be done together with save, right before saving.
       - epoch_num > 0: load all blobs from checkpoint files.
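The load-on-init / save-on-call lifecycle described above can be sketched minimally. Class and method names are invented for illustration and the "session" is modeled as a plain dict; this is not the Caffe2 `checkpoint.py` API.

```python
# Minimal sketch of the checkpoint-manager lifecycle described above:
# load an existing checkpoint into the session on initialization, and
# persist the session each time .save() is called. All names are
# hypothetical; a JSON file stands in for a real checkpoint format.
import json
import os
import tempfile

class CheckpointManager:
    def __init__(self, session, path):
        self.session = session  # dict of blob name -> value
        self.path = path
        # On initialization, load the checkpoint into the session if one exists.
        if os.path.exists(path):
            with open(path) as f:
                self.session.update(json.load(f))

    def save(self):
        # Each call persists the current session to the checkpoint file.
        with open(self.path, "w") as f:
            json.dump(self.session, f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
old_session = {"epoch": 3, "w": [0.1, 0.2]}
CheckpointManager(old_session, path).save()  # save the old session

new_session = {}
CheckpointManager(new_session, path)         # new session loads on init
assert new_session == old_session            # blobs match exactly
```

The final assertion is exactly the "Equal" check the summary says should hold once the dummy-output blob no longer pollutes the workspace.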

Steps to save.

caffe2/caffe2/python/checkpoint.py

caffe2/caffe2/python/checkpoint_test.py

Differential Revision: D9408812

fbshipit-source-id: 2f663446298ad8e795864402f34d9d9c75e82e79
Summary:
Pull Request resolved: #10739

I wanted to assert that the blobs in the workspace of the new session after loading a checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net, `task:output`, is also added along with it.

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

A ZMQ socket can't send an empty list.
As a result, if the Task on the worker had no output,
the master would never stop waiting and would hang forever.

So a dummy TaskOutput was added in `task.get_step()` to work around this.

After thinking about it, I believe this should be fixed.

TaskOutput is at the user layer; the issue shouldn't have been solved by adding a TaskOutput.

Instead, we should move the creation of the placeholder blob to a deeper layer,
and remove the placeholder blob from the workspace afterwards to avoid polluting the user workspace.
After this change, the workaround becomes totally transparent, with no side effects for users.

Differential Revision: D9413150

fbshipit-source-id: 49221745351880115b66a13973eb8584c2bd6b32
facebook-github-bot pushed a commit that referenced this pull request Aug 30, 2018
Summary:
Pull Request resolved: #11048

Pull Request resolved: #10739

I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net, `task:output`, is also added along with it. See https://fburl.com/937lf2yk

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason for it is that a ZMQ socket can't send an empty blob list.
As a result, if the Task on the worker had no output,
the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
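The hang can be illustrated generically with a queue standing in for the ZMQ socket: the master blocks waiting for the worker's output list, so a worker with zero outputs must still send something, here a dummy entry that the master strips. The `worker`/`master` functions and the dummy name are invented for this sketch.

```python
# Generic sketch of the hang described above. A queue stands in for the
# ZMQ socket purely for illustration; all names are hypothetical.
import queue
import threading

DUMMY = "dummy_output"

def worker(q, outputs):
    # Workaround: never send an empty list, or the master would block forever.
    q.put(outputs if outputs else [DUMMY])

def master(q):
    # Blocks until the worker sends its (non-empty) output list.
    msg = q.get(timeout=1)
    return [o for o in msg if o != DUMMY]  # strip the filler

q = queue.Queue()
threading.Thread(target=worker, args=(q, [])).start()
assert master(q) == []  # master returns promptly instead of hanging
```

If the worker sent nothing at all for an empty output list, `master` would sit in `q.get()` until the timeout, which is the forever-hang the summary describes (a real socket has no timeout here).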

TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.

Instead, we should move the creation of the dummy blob to a deeper layer,
and remove the dummy blob from the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent, with no side effects for users.

Reviewed By: mraway

Differential Revision: D9566744

fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
Pull Request resolved: pytorch#10739

I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net, `task:output`, is also added along with it. See https://fburl.com/937lf2yk

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason for it is that a ZMQ socket can't send an empty blob list.
As a result, if the Task on the worker had no output,
the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.

TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.

Instead, we should move the creation of the dummy blob to a deeper layer,
and remove the dummy blob from the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent, with no side effects for users.

Reviewed By: mraway

Differential Revision: D9413150

fbshipit-source-id: 51aaf3201e26570b4fcf5738e9b9aa17c58777ac
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
@ezyang ezyang added the merged label Jun 26, 2019