[Distributed][auto-parallel] Automatic Partitioning #344
soodoshll merged 9 commits into hidet-org:auto-parallel from
Conversation
@yaoyaoding @xinli-git this PR is ready for review
x = hidet.zeros([32, 3, 224, 224], device='cuda')
opt_graph = hidet.graph.optimize(flow_graph)
compiled = opt_graph.build()
print(compiled(x))
Consider adding an allclose check against y_truth?
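The check the reviewer suggests could look like the sketch below. This is an illustrative helper, not hidet API: `check_against_truth` is a hypothetical name, and in practice `y_truth` would come from running the original (unoptimized) flow graph on the same input.

```python
import numpy as np

def check_against_truth(y_out, y_truth, rtol=1e-4, atol=1e-4):
    # Compare the compiled/optimized output against a reference result.
    y_out, y_truth = np.asarray(y_out), np.asarray(y_truth)
    if not np.allclose(y_out, y_truth, rtol=rtol, atol=atol):
        raise AssertionError(f'max abs diff: {np.abs(y_out - y_truth).max()}')

# Hypothetical usage with a tiny tolerance-level difference:
check_against_truth([1.0, 2.0], [1.0, 2.0000001])
```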
parser.add_argument('script_args', nargs=argparse.REMAINDER)
args = parser.parse_args()

procs = []
Would it be possible to follow this convention:
https://github.com/hidet-org/hidet/blob/main/setup.py#L40 ? i.e., make it a CLI command.
The current launch method is fine by me. Another option is to add a sub-command like:
$ hidet dist launch resnet.py
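A minimal sketch of the launcher pattern being discussed: spawn one worker process per device, passing the rank and world size through environment variables. The variable names (RANK / WORLD_SIZE) are illustrative assumptions, not necessarily what hidet.distributed.launch actually uses; a `hidet dist launch` sub-command would simply wrap such an entry point behind the existing `hidet` CLI.

```python
import os
import subprocess
import sys

def launch(nprocs: int, script: str, script_args=()):
    # Start one worker per device; each worker reads its rank from the
    # environment (RANK / WORLD_SIZE are assumed names, for illustration).
    procs = []
    for rank in range(nprocs):
        env = dict(os.environ, RANK=str(rank), WORLD_SIZE=str(nprocs))
        procs.append(subprocess.Popen(
            [sys.executable, script, *script_args], env=env))
    # Wait for all workers and collect their exit codes.
    return [p.wait() for p in procs]
```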
from .rule import op_shard_rule_search
from .shard import OpShardSpec, TensorShardSpec, connect, node_comm_cost

# I copied it from compiled_graph.py
nitpick: possibly remove this
for node in tqdm.tqdm(g.nodes):
    node_str = str(node)
    if node_str not in cache:
        cache[node_str] = op_shard_rule_search(node, num_shards)
If node_str can uniquely identify the sharding rule, maybe it would be cleaner to add an lru_cache decorator to the op_shard_rule_search function?
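The suggestion could be sketched as below. One caveat worth noting: `lru_cache` hashes its arguments, so the cached function would need to take the hashable string key (and re-derive or receive the node) rather than the Operator object directly; the function here is a stand-in for the real search, not hidet's implementation.

```python
from functools import lru_cache

search_calls = []  # records how many actual searches run

@lru_cache(maxsize=None)
def op_shard_rule_search_cached(node_str: str, num_shards: int):
    # Stand-in for the real rule search; lru_cache keys on the
    # (node_str, num_shards) tuple, so repeated nodes with the same
    # string form reuse the earlier search result.
    search_calls.append(node_str)
    return ('rules-for', node_str, num_shards)

for node_str in ['conv0', 'conv0', 'relu1', 'conv0']:
    op_shard_rule_search_cached(node_str, 4)
assert search_calls == ['conv0', 'relu1']  # only two real searches ran
```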
sharded = xsum(p_vars)
param_mem += (num_shards - ((num_shards - 1) * sharded)) * (p.nbytes // num_shards)
param_tot += p.nbytes
print(f"Total parameter size: {param_tot/1024**3} GiB")
print -> logging.info
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())
like here: https://github.com/hidet-org/hidet/blob/main/python/hidet/drivers/build_task.py
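Put together, the suggested pattern looks like the sketch below (the logger name and parameter count are placeholders; `logging.getLogger` rather than constructing `logging.Logger` directly is the usual entry point, since it registers the logger in the global hierarchy).

```python
import logging

logger = logging.getLogger('hidet.distributed.partition')  # name is illustrative
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

param_tot = 3 * 1024**3  # hypothetical total parameter bytes
logger.info('Total parameter size: %.2f GiB', param_tot / 1024**3)
```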
requirements.txt (outdated)
requests

# for auto-parallelization
mip
I recommend that we make the distributed feature an extra in setup.py:
extras_require={
    'distributed': ['filelock', 'mip', ...],
},
Yeah, I agree with Xin that it is better to put extra dependencies like 'mip' into extras and require the user to install them via pip install hidet[distributed].
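A sketch of the agreed-upon change; the dependency list is assumed from the review comments (only 'filelock' and 'mip' are mentioned), and the rest of the setup.py metadata is elided.

```python
# Optional dependencies for the distributed / auto-parallel features.
extras_require = {
    'distributed': ['filelock', 'mip'],
}

# In setup.py this dict would be passed as:
#   setuptools.setup(name='hidet', ..., extras_require=extras_require)
# after which users opt in with:  pip install hidet[distributed]
```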
yaoyaoding left a comment
I roughly went over all the code and it looks good to me. Thanks @soodoshll!
xinli-git left a comment
Thanks @soodoshll! Looking forward to trying it out on some LLMs :)
This PR contains:
- hidet.distributed.partition, which partitions the whole flow graph according to the plan given by 1) and saves the partitions to disk;
- hidet.distributed.load_partition, which loads a partition from disk to the desired device;
- an example (example/distributed/resnet.py), which can be run by python -m hidet.distributed.launch [n_gpus] resnet.py. The memory budget has been set to 24 MiB, which is less than the weight size of a ResNet-18 model, so it can help test tensor parallelism.

Several issues I plan to solve in the future