Introduce GradientCheckpointingLayer #37223
run-slow: llama
This comment contains run-slow, running the specified jobs: models: ['models/llama']
ArthurZucker left a comment
Nice! Let's put it in another file to separate things a bit. Otherwise marvellous!
Ended up applying this to all Llama-based models and SigLIP/2 for the first iteration. All relevant tests pass; the CI error is unrelated. cc @ArthurZucker to merge if OK for you.
class GradientCheckpointingLayer(nn.Module):
    """Base class for layers with gradient checkpointing.
thanks for documenting as well!
This is very neat, Pavel! Might be useful for us users to add to the doc:
Perhaps even add a redundancy/shortcut by adding to the doc: it's not very intuitive. I first tried to call: and it failed.
* GradientCheckpointingLayer
* trigger
* Move GC layer to a separate file
* Update import
* Expose and document GC layer
* Fix dummy
* Apply to llama-based models
* Update modulars
* Update a few more models for consistency
* Update glm4
* Update Janus
What does this PR do?
A super minimal abstraction for a layer with gradient checkpointing that keeps the logic for enabling and disabling gradient checkpointing within PreTrainedModel for backward compatibility. It allows for a gradual rollout of the feature by supporting both checkpointing mechanisms: the current wrapping with _gradient_checkpointing_func, and inheritance from GradientCheckpointingLayer.
I've applied this to Llama, but it's just a PoC for discussion. Perhaps it's better to start with another, less popular model that has fewer dependent models, to see how it goes and to check whether it could break custom code on the Hub.
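To make the inheritance-based mechanism concrete, here is a rough sketch of how such a layer could work in plain PyTorch. It is not the code from this PR: `CheckpointedLayer`, `Block` and `enable_checkpointing` are hypothetical names made up for illustration, and the helper only stands in for the enabling logic that the description says stays inside PreTrainedModel. The sketch assumes the base class overrides `__call__` and routes the forward pass through `_gradient_checkpointing_func` when checkpointing is switched on.

```python
from functools import partial

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class CheckpointedLayer(nn.Module):
    """Hypothetical stand-in for a layer base class with gradient checkpointing."""

    # Off by default; flipped per instance by the enabling helper below.
    gradient_checkpointing = False

    def __call__(self, *args, **kwargs):
        if self.gradient_checkpointing and self.training:
            # Bind keyword arguments up front so only positional tensors
            # are passed through the checkpointing function.
            return self._gradient_checkpointing_func(
                partial(super().__call__, **kwargs), *args
            )
        return super().__call__(*args, **kwargs)


class Block(CheckpointedLayer):
    """Toy layer: subclasses only need to inherit, forward stays unchanged."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, scale=1.0):
        return torch.relu(self.linear(x)) * scale


def enable_checkpointing(model):
    """Assumed model-level switch: turn checkpointing on for every such layer."""
    for module in model.modules():
        if isinstance(module, CheckpointedLayer):
            module._gradient_checkpointing_func = partial(
                checkpoint, use_reentrant=False
            )
            module.gradient_checkpointing = True


block = Block(4)
enable_checkpointing(block)
x = torch.randn(2, 4, requires_grad=True)
block(x, scale=2.0).sum().backward()  # activations are recomputed during backward
```

With this shape, a model's forward loop can call its layers normally, and whether a layer is checkpointed is decided by the flag rather than by wrapping each call site.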
Who can review?