Skip to content

Make SAC save_list configurable#1675

Merged
fegin merged 2 commits into
mainfrom
chienchin/sac_save_list
Sep 3, 2025
Merged

Make SAC save_list configurable#1675
fegin merged 2 commits into
mainfrom
chienchin/sac_save_list

Conversation

@fegin

@fegin fegin commented Sep 2, 2025

Copy link
Copy Markdown
Contributor

This PR refactors the apply_ac implementation by moving save_list out of activation_checkpoint.py and making it an argument of apply_ac(). This change improves configurability without altering the behavior of any models, as the set of operations saved remains unchanged after this PR.

This PR moves save_list out from activation_checkpoint.py. save_list is now an argument of apply_ac(). The behaviors of all the models shouldn't change as this PR doesn't change what ops are going to be saved.
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 2, 2025

@wwwjn wwwjn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I think it's reasonable for each model to specify which ops to be used in AC. I have a follow up question: Does it make sense to have a default set for SAC, as most of the _save_list now are the same and some models share the same patten of commutation? If user specify a _save_list, we could override the default list

@fegin

fegin commented Sep 2, 2025

Copy link
Copy Markdown
Contributor Author

@wwwjn That's a good question. I was thinking the same question. I decided not to have a default list because I want this to be explicit -- each model author should be responsible to fine tune the save_list or the performance will not be good. Without the default value, the performance should be bad. The is the explicit part.

But I don't have a strong preference here. If we believe there are some ops that we will never want to recompute regardless the model type, then having a default one seems to be reasonable.

@wwwjn

wwwjn commented Sep 2, 2025

Copy link
Copy Markdown
Contributor

@wwwjn That's a good question. I was thinking the same question. I decided not to have a default list because I want this to be explicit -- each model author should be responsible to fine tune the save_list or the performance will not be good. Without the default value, the performance should be bad. The is the explicit part.

But I don't have a strong preference here. If we believe there are some ops that we will never want to recompute regardless the model type, then having a default one seems to be reasonable.

I don't have a strong opinion as well

@tianyu-l tianyu-l left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, had one comment on apply_ac API

Another thing I'm thinking is
does it make sense to move / refactor llama3/infra/pipeline.py to distributed/pipeline_parallel.py, as it's shared by all models (llama3, llama4, dsv3), similar to apply_ac

Comment on lines 123 to +125
model_compile_enabled: bool,
use_flex_attn: bool,
):
save_list: set[torch._ops.OpOverload] | None = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should comment this on the previous PR --
I'd recommend we make these into non positional args (after *), similar to _apply_ac_to_transformer_block. Otherwise it'll break BC without good reason.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, I was thinking the same thing when I did this factor. Let me add this.

@fegin

fegin commented Sep 3, 2025

Copy link
Copy Markdown
Contributor Author

Another thing I'm thinking is does it make sense to move / refactor llama3/infra/pipeline.py to distributed/pipeline_parallel.py, as it's shared by all models (llama3, llama4, dsv3), similar to apply_ac

I haven't looked the pipeline parallelism part. But I'm going to review @H-Huang PRs and future PRs, will address this question at that time.

@H-Huang

H-Huang commented Sep 3, 2025

Copy link
Copy Markdown
Contributor

Another thing I'm thinking is does it make sense to move / refactor llama3/infra/pipeline.py to distributed/pipeline_parallel.py, as it's shared by all models (llama3, llama4, dsv3), similar to apply_ac

Yes, this sounds good to me, can do it in a follow up PR

@H-Huang H-Huang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

ac_config: ACConfig,
*,
base_fqn: str | None = None,
save_list: set[torch._ops.OpOverload] | None = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can save_list default value just be set()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a mutable data structure as the default argument can cause unexpected bugs. So I would still use None.

@fegin fegin merged commit 2179939 into main Sep 3, 2025
9 checks passed
@fegin fegin deleted the chienchin/sac_save_list branch September 3, 2025 22:43
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request Feb 13, 2026
This PR refactors the `apply_ac` implementation by moving `save_list`
out of `activation_checkpoint.py` and making it an argument of
`apply_ac()`. This change improves configurability without altering the
behavior of any models, as the set of operations saved remains unchanged
after this PR.
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request Feb 25, 2026
This PR refactors the `apply_ac` implementation by moving `save_list`
out of `activation_checkpoint.py` and making it an argument of
`apply_ac()`. This change improves configurability without altering the
behavior of any models, as the set of operations saved remains unchanged
after this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants