[MODEL] Add Falcon H1 #38249
Conversation
```python
class FalconH1ModelIntegrationTest(unittest.TestCase):
    # TODO: add integration tests for all model sizes
    pass
```
The integration tests are still missing before we can merge.

QQ: Will there be a transformers release for this model soon? Trying to figure out when to update the transformers version for vLLM.

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Thanks all for the contribution!!! 🤗

@DarkLight1337 we just did one yesterday 😢

We are gonna do a model-based release!

@younesbelkada

I am getting the following output, is it normal?

Looks like a decoding issue: locally the model generates valid completions (e.g., a detailed summary of the French Revolution). Could be related to model loading precision (it should be bf16) or a tokenizer mismatch.

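As an aside on the precision point: bf16 matters mainly because it keeps float32's 8-bit exponent range while giving up significand bits, so large intermediate values that would overflow fp16 stay finite. A minimal pure-Python sketch of the two formats (the rounding helpers below are illustrative, not part of transformers):

```python
import struct

def to_bf16(x):
    # bfloat16: float32's 8-bit exponent but only 8 significand bits;
    # emulate by truncating the low 16 bits of the float32 encoding.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_fp16(x):
    # IEEE half precision via struct's 'e' format; values above ~65504
    # do not fit and raise OverflowError when packed.
    return struct.unpack("<e", struct.pack("<e", x))[0]

big = 1.0e5  # larger than fp16's max (~65504), easy for bf16
bf16_big = to_bf16(big)  # stays finite, within ~0.5% of 1e5
fp16_overflows = False
try:
    to_fp16(big)
except OverflowError:
    fp16_overflows = True
print(f"bf16(1e5) = {bf16_big}, fp16 overflows: {fp16_overflows}")
```

When loading with transformers, passing `torch_dtype=torch.bfloat16` to `from_pretrained` is the usual way to select bf16.
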
@dhiaEddineRhaiem Do you want to run the test, or will you check it with a different code snippet? I am getting the same strange outputs on both T4 and A10.

Hello again @ydshieh, one other potential cause is that Falcon H1 is particularly sensitive to temperature values above 0.3 or 0.4, likely because it already produces well-calibrated and sharply peaked logits by default. Basically:
🔹 Its raw logits are already well separated, so lowering the temperature (e.g. to 0.1) keeps that separation strong → stable behavior.
🔹 Increasing T above 0.3 or 0.4 flattens it, letting weaker tokens sneak in → instability.
Empirically, I would advise setting T=0.1! To experiment with chatting with the FalconH1 series of models, please use this playground.

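The temperature effect described above can be sketched with a plain softmax; the logit values are made up for illustration and are not taken from the model:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by T before the softmax: low T sharpens the
    # distribution, high T flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy, well-separated logits standing in for sharply peaked model
# outputs (illustrative values only).
logits = [5.0, 3.0, 1.0]
p_sharp = softmax_with_temperature(logits, 0.1)  # top token dominates
p_flat = softmax_with_temperature(logits, 1.0)   # weaker tokens gain mass
print(f"T=0.1 top prob: {p_sharp[0]:.4f}, T=1.0 top prob: {p_flat[0]:.4f}")
```

With these toy logits the top-token probability is essentially 1.0 at T=0.1 but drops noticeably at T=1.0, which is the flattening effect described above.
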
I suggest running the test, or even the same code in a script (see below); I didn't change anything from what is written in this PR. If the test or the following script still passes or gives normal outputs, we can discuss what might be the cause. Otherwise, it's best if you or @younesbelkada take a look at what is happening and update the test.

I have just now tested locally with the exact same script; happy to have a deeper discussion about it @ydshieh

Oh my god... this issue is going to be tough. Thank you for checking. Could you first share your machine type (T4/A10/A100/H100 etc.)? And copy-paste the output of

I am seeing
Maybe it's the cause. I will check further.

I think it is also the cause; we saw similar behaviour locally when the fast path is not used.

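One quick way to check whether the fast path can be used at all is to test for the optional kernel packages. The names `mamba_ssm` and `causal_conv1d` are the usual Mamba2 fast-path dependencies, but they are an assumption here; check the model's own warning message for the exact requirements:

```python
from importlib.util import find_spec

def mamba_fast_path_available():
    # True only if both optional kernel packages are importable
    # (assumed package names; adjust to the model's actual warning).
    return all(
        find_spec(pkg) is not None
        for pkg in ("mamba_ssm", "causal_conv1d")
    )

print("fast path available:", mamba_fast_path_available())
```
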
Sure, but a question: do we want to maintain the slow path (even if the results are not identical, at least outputs that look normal)? I changed the prompts and still get nonsense outputs 😰

I would say we might want to retain the slow path, as the same logic applies across hybrid and pure Mamba2 models like Zamba, Bamba and others. Mamba2 is highly sensitive to numerical precision: key components like A_log, dt, and the internal ssm_states operate in fp32. Without the Triton fast path, the fallback accumulates precision errors across tokens, especially in long contexts, which likely leads to the degraded outputs. We will debug further internally to see if other causes can be found and overcome.

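The precision-sensitivity argument can be illustrated with a toy linear recurrence h ← a*h + x, a stand-in for an SSM state update (the coefficients are made up for illustration): rounding the state to half precision at every step drifts visibly away from the fp64 reference over a thousand tokens.

```python
import struct

def fp16(x):
    # Round a Python float to IEEE half precision via struct's 'e' format,
    # emulating a low-precision state buffer.
    return struct.unpack("<e", struct.pack("<e", x))[0]

a, x = 0.999, 0.001   # illustrative decay / input, not model values
h64, h16 = 0.0, 0.0
a16, x16 = fp16(a), fp16(x)
for _ in range(1000):
    h64 = a * h64 + x               # full-precision reference
    h16 = fp16(fp16(a16 * h16) + x16)  # state rounded at every step

drift = abs(h16 - h64)
print(f"fp64 state: {h64:.6f}, fp16 state: {h16:.6f}, drift: {drift:.6f}")
```

The drift comes both from quantizing the coefficients once and from re-rounding the state every step, which is why keeping A_log, dt, and ssm_states in fp32 matters for the slow path.
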
Got it. I confirmed that with the fast path, the outputs look normal.

@dhiaEddineRhaiem I think the slow path is missing a recent fix, see #37533. Tl;dr: the repeat pattern on the mamba heads has been wrong --> we need a repeat interleave, not a simple repeat. I will try to make the mamba model paths inheritable in the future; the copy-pasting atm is very error-prone 😢

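For reference, the difference between the two repeat patterns, sketched with plain Python lists standing in for head tensors (`torch.Tensor.repeat` tiles the whole sequence, while `torch.repeat_interleave` duplicates each element in place):

```python
def repeat(pattern, n):
    # torch.Tensor.repeat-style: the whole sequence is tiled n times.
    return pattern * n

def repeat_interleave(pattern, n):
    # torch.repeat_interleave-style: each element is duplicated n times
    # before moving to the next one.
    return [x for x in pattern for _ in range(n)]

heads = ["h0", "h1"]
print(repeat(heads, 2))             # ['h0', 'h1', 'h0', 'h1']
print(repeat_interleave(heads, 2))  # ['h0', 'h0', 'h1', 'h1']
```

Mixing the two silently pairs the wrong state with the wrong head, which matches the kind of degraded-but-not-crashing outputs seen on the slow path.
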
Hey @vasqu, many thanks for pointing this out!
I will soon raise a PR to fix that.

Nice, glad to hear that :D

Thanks all for diving in! 🚀