Conversation
Maybe it was added accidentally. There is no reason to quantize
This is true in general. But in the case of quantization, we can make sure that the quantized floating-point numbers are in the representable range of the integer data type.
We can remove the "f16 quantized to int16" test case. The PR also looks good.
Yes, I think that's what we should do.
@yaoyaoding For some reason it works on the Tests workflow but fails on Publish. It seems they use different versions of CUDA. What is the right way to convert to int8?
### PR Comment: In the new version of the `transformers` library (4.45.0) used in CI, the `merges` field in the configuration has changed to a list of lists. I have modified our code base to accommodate this change; without this adjustment, `test_tokenizer` would fail in CI with the new library version.
I encountered this problem multiple times. The reason is that C++ finds multiple candidate "middle types" when converting bfloat16 to int8, because there is no direct conversion from bf16 to int8. To address the issue we can explicitly specify a middle type, e.g. converting through `float` first.
Actually I fixed it by updating the test image. With the new one there is no such problem.
Add a preliminary conversion to `f32` for quantization.

To describe the issue with an example, consider `f16` quantized to `int16` (it is a bit unclear why we need this case, but we support and test it). `f16` cannot hold the exact value of `int16.max_value`, so we end up with an `f16` number that is greater than `int16.max_value`. As a result, the quantized `int16` holds the minimum value instead of the maximum. In fact, converting a float to an int has unpredictable behavior when the float exceeds the maximum value that the target int type can hold.