New chaining/partitioning algorithm for async_scheduling for inference #11957

Closed
yinghai wants to merge 1 commit into pytorch:master from yinghai:export-D9874140

@yinghai yinghai commented Sep 21, 2018

Summary:
For distributed inference, we want to run the net with the async_scheduling net because we need its async support. However, profiling shows that async_net incurs a large overhead when dispatching tasks onto worker threads. This diff reduces that overhead by generating fewer chains/tasks: sync ops that can be run in one shot are grouped together. Note that it also schedules each async op as its own chain because, unlike GPU ops, RPC ops are not guaranteed to be linearized at the remote site. For example, if you have two RPC ops op1->op2, op2 won't implicitly block until op1 finishes. Therefore we need to put each async op in its own chain, as the async_scheduling net only syncs on the tail of a chain.
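As a rough illustration of the grouping rule described above, here is a minimal Python sketch. It is not the code in this diff: the `Op` class, the `partition_into_chains` function, and the specific join rule (a sync op joins its parent's chain only when it has exactly one parent, that parent is sync, and that parent is currently the tail of its chain) are assumptions chosen to keep the example small and deadlock-free. The only properties taken from the summary are that sync ops are grouped and that every async op gets a chain of its own.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Hypothetical stand-in for an operator node (not a Caffe2 type)."""
    name: str
    is_async: bool = False
    parents: list = field(default_factory=list)  # names of ops this op depends on


def partition_into_chains(ops):
    """Group ops into chains. `ops` must be in topological order.

    Assumed rule for this sketch:
      * every async op becomes a single-op chain;
      * a sync op is appended to its parent's chain only if it has exactly
        one parent, that parent is sync, and the parent is the chain's tail;
      * otherwise the op starts a new chain.
    """
    by_name = {op.name: op for op in ops}
    chains = []    # list of chains, each a list of op names
    chain_of = {}  # op name -> index of the chain that holds it

    for op in ops:
        target = None
        if not op.is_async and len(op.parents) == 1:
            parent = by_name[op.parents[0]]
            idx = chain_of[parent.name]
            if not parent.is_async and chains[idx][-1] == parent.name:
                target = idx
        if target is None:
            chains.append([op.name])
            target = len(chains) - 1
        else:
            chains[target].append(op.name)
        chain_of[op.name] = target
    return chains


# Two sync ops, then two dependent RPC (async) ops, then two more sync ops.
ops = [
    Op("a"),
    Op("b", parents=["a"]),
    Op("rpc1", is_async=True, parents=["b"]),
    Op("rpc2", is_async=True, parents=["rpc1"]),
    Op("c", parents=["rpc2"]),
    Op("d", parents=["c"]),
]
chains = partition_into_chains(ops)
print(chains)
```

In this example the six ops collapse into four chains: `a, b` run as one task, `rpc1` and `rpc2` each stay separate so the scheduler can sync on each of them, and `c, d` run as one task. A purely sequential all-sync net would collapse into a single chain under this rule.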

For all-sync-op nets, this change makes async_scheduling 1.5X slower than simple_net; without the change, it is 7X slower.

The next step is to work on the executor to make task scheduling faster, and to add a fallback path that runs ops inline if it's an all-sync net.

Differential Revision: D9874140

pytorch#11957)

Summary:
Pull Request resolved: pytorch#11957

For distributed inference, we want to run the net with the async_scheduling net because we need its async support. However, profiling shows that async_net incurs a large overhead when dispatching tasks onto worker threads. This diff reduces that overhead by generating fewer chains/tasks: sync ops that can be run in one shot are grouped together. Note that it also schedules each async op as its own chain because, unlike GPU ops, RPC ops are not guaranteed to be linearized at the remote site. For example, if you have two RPC ops `op1->op2`, op2 won't implicitly block until op1 finishes. Therefore we need to put each async op in its own chain, as the async_scheduling net only syncs on the tail of a chain.

For all-sync-op nets, this change makes async_scheduling `1.5X` slower than simple_net; without the change, it is `7X` slower.

The next step is to work on the executor to make task scheduling faster, and to add a fallback path that runs ops inline if it's an all-sync net.

Differential Revision: D9874140

fbshipit-source-id: 060cd78a64af77219d59b3a3890c45cdaf1ed854
@yinghai yinghai deleted the export-D9874140 branch October 8, 2018 21:54
@ezyang ezyang added the merged label Jun 26, 2019