[1/x]: Make Float8Linear support dynamic scaling #290

vkuzo wants to merge 5 commits into gh/vkuzo/11/base
Conversation
Summary: At a high level, we need to make dynamic vs delayed scaling configurable separately for activations, weights, and gradients. The way I am approaching this is as follows:

* PR 1 (this PR): add basic support for dynamic scaling, configurable by tensor, to `Float8Linear`
* PRs 2..n: one by one, add features implemented in `Float8DynamicLinear` to `Float8Linear`, as necessary
* last PR: delete `Float8DynamicLinear`

Test Plan:

```
./test/test_everything.sh
```
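For context on the knob being made configurable: dynamic scaling derives a tensor's fp8 scale from that tensor's current absolute max at cast time, while delayed scaling reuses a scale computed from amax values recorded in earlier iterations, which is cheaper per cast but requires keeping history buffers maintained and synced. A minimal sketch of the difference; `E4M3_MAX` and both helper names are illustrative, not taken from this codebase:

```
import torch

E4M3_MAX = 448.0  # largest representable value in float8_e4m3fn

def dynamic_scale(t: torch.Tensor) -> torch.Tensor:
    # Dynamic: reduce over the current tensor on every cast.
    amax = t.abs().max()
    return E4M3_MAX / torch.clamp(amax, min=1e-12)

def delayed_scale(amax_history: torch.Tensor) -> torch.Tensor:
    # Delayed: reuse amax values observed in previous iterations; the cast
    # itself does no reduction over the current tensor, but the history
    # buffer must be kept up to date (the "sync" that linear_requires_sync
    # gates).
    return E4M3_MAX / torch.clamp(amax_history.max(), min=1e-12)
```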
```
 ):
     """Returns whether the given linear_type requires sync before forward."""
-    return linear_type in REQUIRES_SYNC
+    return linear_type is LinearType.DELAYED and any(
```
Since `Float8DynamicLinear` does not support `TensorScalingType`, I think `and` is right?
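To make the short-circuit concrete, a hypothetical call using the types visible in this diff: a dynamic module reports no sync requirement no matter what the per-tensor arguments say.

```
# Float8DynamicLinear holds no delayed-scaling state, so with `and` the
# predicate is False based on linear_type alone (hypothetical usage):
assert not linear_requires_sync(
    LinearType.DYNAMIC,
    scaling_type_x=TensorScalingType.DELAYED,  # ignored for dynamic modules
)
```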
```
        emulate=emulate,
    )
if linear_type is LinearType.DYNAMIC:
    return Float8DynamicLinear.from_float(
```
Should we assert that none of the scaling types are True?
Technically yes; I was just lazy and unmotivated, since this stack is trying to delete `Float8DynamicLinear`.
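For completeness, a sketch of the guard being discussed. The `scaling_type_w` and `scaling_type_dL_dY` names are guesses extrapolated from the `scaling_type_x` parameter visible elsewhere in this diff, and `mod` stands in for the module being converted; with a two-value enum, "none of them delayed" is the same as "all of them dynamic".

```
if linear_type is LinearType.DYNAMIC:
    # Sketch of the review suggestion: Float8DynamicLinear does not support
    # delayed scaling, so fail loudly if the config requests it for any tensor.
    assert all(
        t is TensorScalingType.DYNAMIC
        for t in (scaling_type_x, scaling_type_w, scaling_type_dL_dY)
    ), "Float8DynamicLinear only supports dynamic scaling"
    return Float8DynamicLinear.from_float(mod, emulate=emulate)
```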
```
-def linear_requires_sync(linear_type: LinearType):
+def linear_requires_sync(
+    linear_type: LinearType,
+    scaling_type_x: TensorScalingType = TensorScalingType.DELAYED,
```
Nit: we should probably remove the defaults, right?
Hmm, long term probably yes. I'm taking the approach of tackling the "what's the default" recipe question separately and not changing default behavior for now.
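Putting this signature hunk together with the return-statement hunk quoted earlier, the updated predicate plausibly reads as below. Only `linear_type` and `scaling_type_x` appear verbatim in the diff, so the other two parameter names are assumptions following the same pattern.

```
def linear_requires_sync(
    linear_type: LinearType,
    scaling_type_x: TensorScalingType = TensorScalingType.DELAYED,
    scaling_type_w: TensorScalingType = TensorScalingType.DELAYED,
    scaling_type_dL_dY: TensorScalingType = TensorScalingType.DELAYED,
) -> bool:
    """Returns whether the given linear_type requires sync before forward."""
    # Only Float8Linear (LinearType.DELAYED) keeps amax/scale buffers, and a
    # sync is needed only while at least one tensor still uses delayed scaling.
    return linear_type is LinearType.DELAYED and any(
        t is TensorScalingType.DELAYED
        for t in (scaling_type_x, scaling_type_w, scaling_type_dL_dY)
    )
```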
@vkuzo has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request has been merged in 3cb42e1.
Stack from ghstack (oldest at bottom):
Summary:
At a high level, we need to make dynamic vs delayed scaling configurable
separately for activations, weights and gradients. The way I am
approaching this is as follows:
* PR 1 (this PR): add basic support for dynamic scaling, configurable by tensor, to `Float8Linear`
* PRs 2..n: one by one, add features implemented in `Float8DynamicLinear` to `Float8Linear`, as necessary
* last PR: delete `Float8DynamicLinear`

Test Plan:

```
./test/test_everything.sh
```
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D59305792
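As a closing illustration of what PR 1 enables, here is a hypothetical conversion call that mixes scaling types per tensor. The exact `from_float` signature is not shown in this thread, so the argument names beyond `scaling_type_x` are assumptions following the same pattern.

```
import torch

m = torch.nn.Linear(32, 64)
# Hypothetical: keep delayed scaling for the weight, go dynamic elsewhere.
m_fp8 = Float8Linear.from_float(
    m,
    scaling_type_x=TensorScalingType.DYNAMIC,      # activations
    scaling_type_w=TensorScalingType.DELAYED,      # weights
    scaling_type_dL_dY=TensorScalingType.DYNAMIC,  # gradients
)
# Any tensor left on delayed scaling still needs the periodic amax sync:
assert linear_requires_sync(
    LinearType.DELAYED,
    scaling_type_x=TensorScalingType.DYNAMIC,
    scaling_type_w=TensorScalingType.DELAYED,
    scaling_type_dL_dY=TensorScalingType.DYNAMIC,
)
```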