[Distributed][auto-parallel] Automatic Partitioning #344
soodoshll merged 9 commits into hidet-org:auto-parallel from
Conversation
@yaoyaoding @xinli-git this PR is ready for review
x = hidet.zeros([32, 3, 224, 224], device='cuda')
opt_graph = hidet.graph.optimize(flow_graph)
compiled = opt_graph.build()
print(compiled(x))
Consider adding an allclose check against y_truth?
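The check the reviewer suggests could look like the sketch below. This is an illustrative helper, not hidet API: `check_against_truth` is a hypothetical name, and in practice `y_truth` would come from running the original (unoptimized) flow graph on the same input.

```python
import numpy as np

def check_against_truth(y_out, y_truth, rtol=1e-4, atol=1e-4):
    # Compare the compiled/optimized output against a reference result.
    y_out, y_truth = np.asarray(y_out), np.asarray(y_truth)
    if not np.allclose(y_out, y_truth, rtol=rtol, atol=atol):
        raise AssertionError(f'max abs diff: {np.abs(y_out - y_truth).max()}')

# Hypothetical usage with a tiny tolerance-level difference:
check_against_truth([1.0, 2.0], [1.0, 2.0000001])
```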
parser.add_argument('script_args', nargs=argparse.REMAINDER)
args = parser.parse_args()

procs = []
Would it be possible to follow this convention:
https://github.com/hidet-org/hidet/blob/main/setup.py#L40 ? i.e., make it a CLI command.
The current launch method is fine by me. Another option is to add a sub-command like:
$ hidet dist launch resnet.py
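A minimal sketch of the launcher pattern being discussed: spawn one worker process per device, passing the rank and world size through environment variables. The variable names (RANK / WORLD_SIZE) are illustrative assumptions, not necessarily what hidet.distributed.launch actually uses; a `hidet dist launch` sub-command would simply wrap such an entry point behind the existing `hidet` CLI.

```python
import os
import subprocess
import sys

def launch(nprocs: int, script: str, script_args=()):
    # Start one worker per device; each worker reads its rank from the
    # environment (RANK / WORLD_SIZE are assumed names, for illustration).
    procs = []
    for rank in range(nprocs):
        env = dict(os.environ, RANK=str(rank), WORLD_SIZE=str(nprocs))
        procs.append(subprocess.Popen(
            [sys.executable, script, *script_args], env=env))
    # Wait for all workers and collect their exit codes.
    return [p.wait() for p in procs]
```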
from .rule import op_shard_rule_search
from .shard import OpShardSpec, TensorShardSpec, connect, node_comm_cost

# I copied it from compiled_graph.py
nitpick: possibly remove this
for node in tqdm.tqdm(g.nodes):
    node_str = str(node)
    if node_str not in cache:
        cache[node_str] = op_shard_rule_search(node, num_shards)
If node_str can uniquely identify the sharding rule, maybe it would be cleaner to add an lru_cache decorator to the op_shard_rule_search function?
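The suggestion could be sketched as below. One caveat worth noting: `lru_cache` hashes its arguments, so the cached function would need to take the hashable string key (and re-derive or receive the node) rather than the Operator object directly; the function here is a stand-in for the real search, not hidet's implementation.

```python
from functools import lru_cache

search_calls = []  # records how many actual searches run

@lru_cache(maxsize=None)
def op_shard_rule_search_cached(node_str: str, num_shards: int):
    # Stand-in for the real rule search; lru_cache keys on the
    # (node_str, num_shards) tuple, so repeated nodes with the same
    # string form reuse the earlier search result.
    search_calls.append(node_str)
    return ('rules-for', node_str, num_shards)

for node_str in ['conv0', 'conv0', 'relu1', 'conv0']:
    op_shard_rule_search_cached(node_str, 4)
assert search_calls == ['conv0', 'relu1']  # only two real searches ran
```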
sharded = xsum(p_vars)
param_mem += (num_shards - ((num_shards - 1) * sharded)) * (p.nbytes // num_shards)
param_tot += p.nbytes
print(f"Total parameter size: {param_tot/1024**3} GiB")
print -> logging.info
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())
like here: https://github.com/hidet-org/hidet/blob/main/python/hidet/drivers/build_task.py
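Put together, the suggested pattern looks like the sketch below (the logger name and parameter count are placeholders; `logging.getLogger` rather than constructing `logging.Logger` directly is the usual entry point, since it registers the logger in the global hierarchy).

```python
import logging

logger = logging.getLogger('hidet.distributed.partition')  # name is illustrative
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

param_tot = 3 * 1024**3  # hypothetical total parameter bytes
logger.info('Total parameter size: %.2f GiB', param_tot / 1024**3)
```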
requirements.txt (outdated)
requests

# for auto-parallelization
mip
I recommend that we make the distributed feature an extra in setup.py:
extras_require={
    'distributed': ['filelock', 'mip', ...],
},
Yeah, I agree with Xin that it is better to put extra dependencies like 'mip' into extras and require the user to install them via pip install hidet[distributed].
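A sketch of the agreed-upon change; the dependency list is assumed from the review comments (only 'filelock' and 'mip' are mentioned), and the rest of the setup.py metadata is elided.

```python
# Optional dependencies for the distributed / auto-parallel features.
extras_require = {
    'distributed': ['filelock', 'mip'],
}

# In setup.py this dict would be passed as:
#   setuptools.setup(name='hidet', ..., extras_require=extras_require)
# after which users opt in with:  pip install hidet[distributed]
```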
yaoyaoding left a comment
I roughly went over all the code and it looks good to me. Thanks @soodoshll!
xinli-git left a comment
Thanks @soodoshll! Looking forward to trying it out on some LLMs :)
This PR contains:
- hidet.distributed.partition, which partitions the whole flow graph according to the plan given by 1) and saves the partitions to disk;
- hidet.distributed.load_partition, which loads a partition from disk to the desired device;
- an example (example/distributed/resnet.py), which can be run by python -m hidet.distributed.launch [n_gpus] resnet.py. The memory budget has been set to 24 MiB, which is less than the weight size of a ResNet-18 model, so it can help test tensor parallelism.

Several issues I plan to solve in the future