
Kill the dummy TaskOutput when task.get_step() #10739

Closed
xush6528 wants to merge 2 commits into pytorch:master from xush6528:export-D9413150

Conversation

@xush6528
Contributor

Summary:
I wanted to assert that the blobs in the workspace of the new session after loading a checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net, `task:output`, is also added along with it.

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

A ZMQ socket can't send an empty list.
As a result, if the Task on the worker had no output,
the master would never stop waiting and would hang forever.

So a dummy TaskOutput was added in `task.get_step()` to work around this.

After thinking about it, I believe this should be fixed.

TaskOutput is at the user layer; the issue shouldn't have been solved by adding a TaskOutput.

Instead, we should move the creation of the placeholder blob to a deeper layer,
and remove the placeholder blob from the workspace afterwards to avoid polluting the user workspace.
After this change, the workaround becomes totally transparent, with no side effects for users.
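The fix described above can be sketched generically. This is an illustrative mock, not the actual Caffe2 API: `Workspace`, `compile_task`, `cleanup_task`, and the placeholder blob name are all invented for the example.

```python
# Illustrative sketch of the pattern described above: instead of adding a
# dummy TaskOutput at the user layer, the placeholder blob is created at a
# deeper (compilation) layer and removed from the workspace afterwards,
# so the user never observes it. All names here are hypothetical.

PLACEHOLDER = "__task_placeholder__"  # hypothetical internal blob name

class Workspace:
    """Minimal stand-in for a blob workspace."""
    def __init__(self):
        self.blobs = {}

def compile_task(task_outputs, workspace):
    """Deeper-layer step: guarantee a non-empty output list for transport."""
    outputs = list(task_outputs)
    added_placeholder = False
    if not outputs:
        # ZMQ-style transports cannot ship an empty list, so add a filler.
        workspace.blobs[PLACEHOLDER] = 0
        outputs.append(PLACEHOLDER)
        added_placeholder = True
    return outputs, added_placeholder

def cleanup_task(workspace, added_placeholder):
    """Remove the filler so the user workspace is left unpolluted."""
    if added_placeholder:
        del workspace.blobs[PLACEHOLDER]

ws = Workspace()
outs, added = compile_task([], ws)   # user declared no outputs
assert outs == [PLACEHOLDER]         # transport sees a non-empty list
cleanup_task(ws, added)
assert ws.blobs == {}                # workspace is clean afterwards
```

With the cleanup step, the user-visible workspace contains exactly the blobs the user created, so an "Equal" assertion over workspaces becomes possible.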

Differential Revision: D9413150

Summary:
The checkpoint manager is initialized with a **session** passed in.

- On initialization, the checkpoint manager loads the checkpoint to session if there is one.
- Later, each time the `.save(...)` method is called, it saves the session to a checkpoint file.

- The checkpoint manager can immediately run the `checkpoint_save_task` in the session.

- **checkpoint_init_task?** The `CheckpointManager.init()` method, which makes the `checkpoint_init_task` (group), will be removed to simplify the API.
    - In the old framework, `checkpoint_init_task` runs after `init_group` and does the following:
       - epoch_num == 0: collect all blobs. In the future, blob collection will be done together with save, right before saving.
       - epoch_num > 0: load all blobs from checkpoint files.
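The load-on-init / save-on-call lifecycle described above can be sketched minimally. Class and method names are invented for illustration and the "session" is modeled as a plain dict; this is not the Caffe2 `checkpoint.py` API.

```python
# Minimal sketch of the checkpoint-manager lifecycle described above:
# load an existing checkpoint into the session on initialization, and
# persist the session each time .save() is called. All names are
# hypothetical; a JSON file stands in for a real checkpoint format.
import json
import os
import tempfile

class CheckpointManager:
    def __init__(self, session, path):
        self.session = session  # dict of blob name -> value
        self.path = path
        # On initialization, load the checkpoint into the session if one exists.
        if os.path.exists(path):
            with open(path) as f:
                self.session.update(json.load(f))

    def save(self):
        # Each call persists the current session to the checkpoint file.
        with open(self.path, "w") as f:
            json.dump(self.session, f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
old_session = {"epoch": 3, "w": [0.1, 0.2]}
CheckpointManager(old_session, path).save()  # save the old session

new_session = {}
CheckpointManager(new_session, path)         # new session loads on init
assert new_session == old_session            # blobs match exactly
```

The final assertion is exactly the "Equal" check the summary says should hold once the dummy-output blob no longer pollutes the workspace.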

Steps to save.

caffe2/caffe2/python/checkpoint.py

caffe2/caffe2/python/checkpoint_test.py

Differential Revision: D9408812

fbshipit-source-id: 2f663446298ad8e795864402f34d9d9c75e82e79
Summary:
Pull Request resolved: #10739

I wanted to assert that the blobs in the workspace of the new session after loading a checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net, `task:output`, is also added along with it.

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

A ZMQ socket can't send an empty list.
As a result, if the Task on the worker had no output,
the master would never stop waiting and would hang forever.

So a dummy TaskOutput was added in `task.get_step()` to work around this.

After thinking about it, I believe this should be fixed.

TaskOutput is at the user layer; the issue shouldn't have been solved by adding a TaskOutput.

Instead, we should move the creation of the placeholder blob to a deeper layer,
and remove the placeholder blob from the workspace afterwards to avoid polluting the user workspace.
After this change, the workaround becomes totally transparent, with no side effects for users.

Differential Revision: D9413150

fbshipit-source-id: 49221745351880115b66a13973eb8584c2bd6b32
facebook-github-bot pushed a commit that referenced this pull request Aug 30, 2018
Summary:
Pull Request resolved: #11048

Pull Request resolved: #10739

I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net, `task:output`, is also added along with it. See https://fburl.com/937lf2yk

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason for it is that a ZMQ socket can't send an empty blob list.
As a result, if the Task on the worker had no output,
the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
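The hang can be illustrated generically with a queue standing in for the ZMQ socket: the master blocks waiting for the worker's output list, so a worker with zero outputs must still send something, here a dummy entry that the master strips. The `worker`/`master` functions and the dummy name are invented for this sketch.

```python
# Generic sketch of the hang described above. A queue stands in for the
# ZMQ socket purely for illustration; all names are hypothetical.
import queue
import threading

DUMMY = "dummy_output"

def worker(q, outputs):
    # Workaround: never send an empty list, or the master would block forever.
    q.put(outputs if outputs else [DUMMY])

def master(q):
    # Blocks until the worker sends its (non-empty) output list.
    msg = q.get(timeout=1)
    return [o for o in msg if o != DUMMY]  # strip the filler

q = queue.Queue()
threading.Thread(target=worker, args=(q, [])).start()
assert master(q) == []  # master returns promptly instead of hanging
```

If the worker sent nothing at all for an empty output list, `master` would sit in `q.get()` until the timeout, which is the forever-hang the summary describes (a real socket has no timeout here).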

TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.

Instead, we should move the creation of the dummy blob to a deeper layer,
and remove the dummy blob from the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent, with no side effects for users.

Reviewed By: mraway

Differential Revision: D9566744

fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
Pull Request resolved: pytorch#10739

I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.

But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. A dummy net, `task:output`, is also added along with it. See https://fburl.com/937lf2yk

This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".

Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason for it is that a ZMQ socket can't send an empty blob list.
As a result, if the Task on the worker had no output,
the master would never stop waiting and would hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.

TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.

Instead, we should move the creation of the dummy blob to a deeper layer,
and remove the dummy blob from the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent, with no side effects for users.

Reviewed By: mraway

Differential Revision: D9413150

fbshipit-source-id: 51aaf3201e26570b4fcf5738e9b9aa17c58777ac
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
@ezyang ezyang added the merged label Jun 26, 2019