Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: gpt_oss |
ArthurZucker
left a comment
Looks good! Let's agree on transposed vs not transposed (shape could help?)
def _apply_gate(self, gate_up: torch.Tensor) -> torch.Tensor:
    gate, up = gate_up[..., ::2], gate_up[..., 1::2]
    gate = gate.clamp(min=None, max=self.limit)
    up = up.clamp(min=-self.limit, max=self.limit)
    glu = gate * torch.sigmoid(gate * self.alpha)
    gated_output = (up + 1) * glu
    return gated_output
This is fine, makes it more readable!
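As a standalone illustration, the gating above can be sketched as a free function; the alpha and limit defaults here are illustrative stand-ins, not values read from the config:

```python
import torch

def apply_gate(gate_up: torch.Tensor, alpha: float = 1.702, limit: float = 7.0) -> torch.Tensor:
    # gate/up projections are interleaved along the last dim:
    # even indices are the gate, odd indices the up projection
    gate, up = gate_up[..., ::2], gate_up[..., 1::2]
    gate = gate.clamp(min=None, max=limit)    # clamp the gate from above only
    up = up.clamp(min=-limit, max=limit)      # clamp up symmetrically
    glu = gate * torch.sigmoid(gate * alpha)  # swish-style gating
    return (up + 1) * glu

out = apply_gate(torch.randn(2, 4, 8))
print(out.shape)  # the interleaved pairs collapse: last dim 8 -> 4
```

Note that splitting with strided slices keeps the checkpoint's interleaved layout intact, which is why no de-interleaving is needed at load time.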
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
vasqu
left a comment
Hmm, so we no longer transpose during weight conversion but do it manually? Not against it but we will need to solve the loading/reverse issue either way, not necessarily here
Left a few comments but they are mostly aesthetic / nits
yes, and the transposition can be achieved and works fine (all tests pass) with the temporary weight renaming trick. However, the biggest problem here is that megablocks expects the transposed format of expert weights, so it might make more sense to transpose the other MoEs to be compatible with megablocks (if it's worth the speedup).
Argh, yeah, this will be hard to solve otherwise (we would need to rewrite that kernel). Maybe we transpose the other way around for the other models at some point; depends on how much improvement that kernel provides. On another note, is the transpose expensive at runtime? I would expect a slight perf regression having to call transpose here and there (in
No, grouped_mm works on strided tensors, so the transpose is just a view (a no-op).
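That a transpose only swaps strides, and therefore costs nothing at runtime, is easy to verify on a small stand-in tensor:

```python
import torch

w = torch.randn(4, 8, 16)       # small stand-in for an expert weight tensor
wt = w.transpose(-2, -1)        # no data movement, only the strides are swapped

assert wt.data_ptr() == w.data_ptr()   # same underlying storage: it's a view
assert not wt.is_contiguous()          # only the view metadata changed
print(w.stride(), wt.stride())         # e.g. (128, 16, 1) vs (128, 1, 16)
```

This is why a kernel that accepts strided inputs (like grouped_mm) sees no overhead from the transposed layout; a copy only happens if something later calls contiguous() on the view.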
  # Add the ones from the quantizer as well if provided
  if hf_quantizer is not None:
-     weight_conversions.extend(hf_quantizer.get_weight_conversions())
+     weight_conversions = hf_quantizer.get_weight_conversions() + weight_conversions
It seems that the order of operations matters when both quantization conversions and weight conversions are defined. E.g., with gpt_oss, if I want to apply the Force16ByteAlignment conversion, it has to be placed after the dequantization, otherwise the loader is confused (see output below). @vasqu @ArthurZucker
device = torch.device("cuda:1")
model_id = "openai/gpt-oss-20b"
model = GptOssForCausalLM.from_pretrained(
model_id,
device_map=device,
dtype=torch.bfloat16,
quantization_config=Mxfp4Config(dequantize=True),
).eval()

results in:
Loading weights: 100%|█████████████████████████████████████████████| 363/363 [00:00<00:00, 1003.90it/s, Materializing param=model.norm.weight]
GptOssForCausalLM LOAD REPORT from: openai/gpt-oss-20b
Key | Status |
------------------------------------------------------+------------+-
model.layers.{0...23}.mlp.experts.down_proj_scales | UNEXPECTED |
model.layers.{0...23}.mlp.experts.down_proj_blocks | UNEXPECTED |
model.layers.{0...23}.mlp.experts.gate_up_proj_scales | UNEXPECTED |
model.layers.{0...23}.mlp.experts.gate_up_proj_blocks | UNEXPECTED |
model.layers.{0...23}.mlp.experts.gate_up_proj | MISSING |
model.layers.{0...23}.mlp.experts.down_proj | MISSING |
Notes:
- UNEXPECTED: can be ignored when loading from a different task/architecture; not OK if you expect an identical arch.
- MISSING: those params were newly initialized because they are missing from the checkpoint. Consider training on your downstream task.
Yes! Since get_weight_conversions is only to DEQUANTIZE, it makes sense that it's first. However, as discussed offline, for now it's not possible to match 1 param with 2 converters (i.e. 1 dequant converter and 1 from the hardcoded mapping). So it means that any model with a mapping registered cannot be dequantized 😭 So in theory you're right that it should come first, but as it does not work currently anyway, it does not make a difference 😭 Let's still keep this change, adding a comment on all that please!
thanks @Cyrilvallez, added a comment mostly rewording your explanation!
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
vasqu
left a comment
LGTM, I guess the outputs slightly change now with the interface's grouped_mm/batched_mm or even the eager rewrite?
the eager for training is the same; I kept everything as is when I realized that the slow tests are broken (see #43246). I removed the batched implementation from there as it is redundant, so users will have to set the experts impl to batched to get the same eval output as before.
Probably merging next week, currently don't have the permission to merge with red CI 😄
ArthurZucker
left a comment
Great work! LGTM, let's make sure quantized tests pass, but let's go!
# NOTE: Since get_weight_conversions() only serves to dequantize, we need to put them first in the list.
# However, for now it's not possible to match 1 param with 2 converters (i.e. 1 dequantization converter
# and 1 model-specific converter), which means that if a model has model-specific conversions and is being
# dequantized, the model-specific conversion that has patterns matching the dequantization patterns will be ignored.
weight_conversions = hf_quantizer.get_weight_conversions() + weight_conversions
I am not sure it makes a difference at all, no? Because the operations are ordered by length of collected tensors, I think.
Yeah now that I'm digging deeper into the weight loader I think the reason I got the error above is because it's not possible to cascade converters (i.e., applying model-specific conversions on top of tensors created by the dequantization conversions). Not because you can't match one tensor with two converters (that's a valid limitation, but not the one happening here in gpt oss).
I added $ to the end of the Force16BytesAlignement source pattern to fix my error without changing this order, basically making sure that the sources of the mxfp dequant converter and Force16BytesAlignement are exclusive.
I will revert the line change and make this comment instead:
# NOTE: Since get_weight_conversions() only serve to dequantize, we normally want to apply them first.
# However, for now it's not possible to cascade converters (i.e., applying model-specific conversions on top
# of tensors created by the dequantization conversions)
# This means that if a model has model-specific conversions and is being dequantized, the model-specific conversion
# that relies on tensors created by dequantization conversions will not be applied.
# GptOss example: with Mxfp4Config(dequantize=True), Force16BytesAlignment converters are ignored because the tensors
# "mlp.experts.gate_up_proj$" and "mlp.experts.down_proj$" are only created after dequantization conversions are applied.
weight_conversions.extend(hf_quantizer.get_weight_conversions())
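The effect of that $ anchor can be seen with plain regexes; the key names below are simplified from the gpt_oss checkpoint, and the patterns are illustrative rather than the exact ones used in the converter:

```python
import re

unanchored = re.compile(r"mlp\.experts\.gate_up_proj")
anchored = re.compile(r"mlp\.experts\.gate_up_proj$")

quantized_key = "model.layers.0.mlp.experts.gate_up_proj_blocks"  # exists before dequantization
dequantized_key = "model.layers.0.mlp.experts.gate_up_proj"       # created by the dequant converter

# without the anchor, the alignment pattern also grabs the quantized key,
# stealing it from the dequantization converter
assert unanchored.search(quantized_key) is not None
# with the anchor, the two converters' source patterns are exclusive
assert anchored.search(quantized_key) is None
assert anchored.search(dequantized_key) is not None
```

Anchoring sidesteps the cascading limitation entirely: each checkpoint key is claimed by exactly one converter, so the application order no longer matters for these two.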
cool, as we talked offline let's add more comments about how to handle the weight converter
| """ | ||
| Ensures that the given tensor is 16-bytes aligned in memory and clones it if not. | ||
| This garantees 16-bytes alignmenet for kernels / implementations that use TMA or SIMD instructions like torch._grouped_mm. | ||
| """ |
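A minimal sketch of what such an alignment guard might look like (the helper name and exact behavior of the PR's implementation may differ):

```python
import torch

def force_16_bytes_alignment(t: torch.Tensor) -> torch.Tensor:
    """Return t if its storage is 16-byte aligned, otherwise a contiguous clone."""
    if t.data_ptr() % 16 == 0:
        return t
    # a fresh allocation from torch's allocator satisfies the 16-byte
    # requirement, so cloning re-homes the data into aligned storage
    return t.clone(memory_format=torch.contiguous_format)

x = torch.randn(8, 8)
assert force_16_bytes_alignment(x).data_ptr() % 16 == 0
```

Misaligned tensors typically come from zero-copy safetensors slices on CPU (an mmap'ed file offset need not fall on a 16-byte boundary), which is exactly the case the clone path handles.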
Should be good to merge
#43353 leads me to believe that a simple weight converter won't be enough. Somehow, under the concurrent workers, we hit the same 16-byte alignment issue on our CI (with qwen 3 omni moe). I'll take a proper look tomorrow, but it might make sense to add the fallback regardless, sadly. Just wanted to put it out before we hastily merge.
@vasqu I think what you are dealing with might be the cpu+safetensors memory misalignment issue I have seen with gptoss; there are two kinds of misalignment issues I have encountered so far:
I would suggest you try the weight converter defined here, because the CI runner uses CPU (I had to set TRANSFORMERS_TEST_DEVICE=cpu to reproduce the gpt_oss issue locally).
To sum up the conversation I had with @vasqu internally:
[For maintainers] Suggested jobs to run (before merge) run-slow: gpt_oss |
Force-pushed from 36ff79a to 711a652 (Compare)
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43227&sha=711a65
* experts impl gpt oss
* no need to transpose dequantized experts
* skip test_reverse_loading_mapping
* fix custom gating
* revert transposition and simply support transposed experts to avoid modifying eager
* style
* don't rely on weight shapes as they can be square matrices
* no need to relaod
* fallback to eager
* Update src/transformers/models/gpt_oss/modeling_gpt_oss.py (Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>)
* fix
* force 16 bytes alignmenet during weight loading
* simplify logic
* quantization conversions should be applied first
* avoid baddbmm as it is less performant / less optimizable by max-autotune
* no need for logger
* add comment explaining limitation
* standarize operations and only reshape when needed
* fixup conversion and test
* Update src/transformers/conversion_mapping.py (Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>)
* force alignment docstring
* move default apply gate
* offsets
* add docs and make kernel_config optional
* use reshapes as they are equivalent to views when memory is contiguous
* fix and better notes
* reshapes instead of views
* keep model saving and reloading in grouped_mm test to catch misalignment issues

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
What does this PR do?
[wip] Fixes #43193 @vasqu @mattteochen

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.