New chaining/partitioning algorithm for async_scheduling for inference #11957

Closed

yinghai wants to merge 1 commit into pytorch:master
Conversation
Summary: Pull Request resolved: pytorch#11957

For distributed inference, we want to use the async_scheduling net to run the model, since we need its async part. However, profiling shows that async_net has significant overhead from dispatching tasks onto worker threads. This diff reduces that overhead by generating a smaller number of chains/tasks, grouping together the sync ops that can be run in one shot. Note that it also schedules each individual async op as its own chain because, unlike GPU ops, RPC ops are not guaranteed to be linearized at the remote site. For example, given two rpc ops `op1 -> op2`, op2 won't implicitly block until op1 finishes. Therefore we need to put each async op in its own chain, as the async_scheduling net only syncs on the tail of a chain.

For all-sync-op nets, this change makes us `1.5X` slower than simple_net, whereas without the change it is `7X` slower. The next step is to work on the executor to make task scheduling faster, and to add a fallback path that can run ops inline if the net is all-sync.

Differential Revision: D9874140

fbshipit-source-id: 060cd78a64af77219d59b3a3890c45cdaf1ed854
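The grouping rule described above can be sketched roughly as follows. This is a hypothetical Python illustration, not Caffe2's actual C++ chain-computation code: it assumes ops arrive in topological order as `(name, is_async)` pairs, greedily merges consecutive sync ops into one chain, and puts every async op in a single-op chain so that syncing on the chain tail is always safe.

```python
# Hypothetical sketch of the chaining idea (the real implementation lives
# in Caffe2's C++ async_scheduling executor). Consecutive sync ops are
# grouped into one chain/task; every async op (e.g. an RPC op) becomes its
# own single-op chain, because the scheduler only syncs on the tail op of
# each chain and remote RPC ops are not linearized implicitly.

def compute_chains(ops):
    """ops: list of (name, is_async) tuples in topological order."""
    chains = []
    current = []  # pending chain of consecutive sync ops
    for name, is_async in ops:
        if is_async:
            if current:               # close the pending sync chain
                chains.append(current)
                current = []
            chains.append([name])     # async op: its own chain
        else:
            current.append(name)      # sync op: extend the current chain
    if current:
        chains.append(current)
    return chains

ops = [("fc1", False), ("relu1", False), ("rpc1", True),
       ("rpc2", True), ("fc2", False), ("softmax", False)]
print(compute_chains(ops))
# [['fc1', 'relu1'], ['rpc1'], ['rpc2'], ['fc2', 'softmax']]
```

With this grouping, six ops become four tasks instead of six, which is the source of the reduced dispatch overhead; note that `rpc1` and `rpc2` deliberately stay in separate chains.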