Sin
In theory, rotary embedding supports unlimited length, provided there is enough GPU memory to hold that much kv_cache. But if none of the training data exceeds 2048 tokens, I'm not sure whether extrapolating beyond 2048 would hurt generation quality.
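To illustrate why nothing in the formula itself caps the length: a minimal NumPy sketch of rotary embedding (standard RoPE with the usual base of 10000; function and variable names are my own). The rotation angle depends only on the integer position, so a position far beyond the training window is still perfectly well defined, and the rotation leaves the vector norm unchanged. Whether attention then *generalizes* to those positions is exactly the open question above.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to one vector x at position `pos`.

    Each consecutive pair of dimensions is rotated by pos * inv_freq, so
    any integer position is computable -- only kv_cache memory limits length.
    """
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))  # shape (d/2,)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)
# a position far beyond a 2048-token training window is still well defined
far = rope(q, 100000)
```

Since each 2-D rotation is norm-preserving, `rope(q, p)` has the same norm as `q` for any `p`; the extrapolation risk is in the attention patterns, not the embedding math.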
LGTM. I was wondering about the performance improvement. Also, can we run the fp8 intrinsics on Volta/Ampere/Ada architectures, or is it Hopper only?
Also, which of E5M2 and E4M3 should we use for better precision and performance? I guess this may depend on the specific model.
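For a back-of-the-envelope comparison (assuming the common OCP FP8 definitions of the two formats): E4M3 spends an extra bit on the mantissa, E5M2 on the exponent, so E4M3 is more precise but has far less dynamic range. The usual rule of thumb is E4M3 for forward activations/weights and E5M2 where range matters more (e.g. gradients), but as noted above it can be model-specific.

```python
# E4M3: 4 exponent bits (bias 7), 3 mantissa bits; in the OCP convention the
# all-ones exponent with mantissa 111 is NaN, so the largest finite value is
# 2^8 * (1 + 6/8) = 448.
e4m3_max = 2.0**8 * (1 + 6 / 8)

# E5M2: 5 exponent bits (bias 15), 2 mantissa bits, IEEE-like inf/NaN, so the
# largest finite value is 2^15 * (1 + 3/4) = 57344.
e5m2_max = 2.0**15 * (1 + 3 / 4)

# smallest positive *normal* values show the same trade-off at the low end
e4m3_min_normal = 2.0**-6
e5m2_min_normal = 2.0**-14

print(f"E4M3 range: [{e4m3_min_normal}, {e4m3_max}]")
print(f"E5M2 range: [{e5m2_min_normal}, {e5m2_max}]")
```

So E5M2 covers roughly 128x more range at the top end, at the cost of one mantissa bit of precision.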
> Hi @irasin it looks like you are using an older version of MII. Your error message for line 31 of `mii/grpc_related/restful_gateway.py` indicates you are trying to get the `request`...
> @irasin can you please try the following instead?
>
> ```python
> import json
> import requests
> url = f"http://localhost:8000/mii/mii_test"
> params = {"prompts": ["DeepSpeed is", "Seattle is...
> ```
@ZihanWang314, I got the same warning, but the model still runs. It looks like there isn't enough disk space; just use `df -h` to check.
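If you'd rather check from Python than eyeball `df -h`, the standard library exposes the same numbers; the path `"/"` below is just an example (point it at wherever your model cache lives):

```python
import shutil

# same information as one row of `df -h`, in bytes
total, used, free = shutil.disk_usage("/")
print(f"free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")
```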
> > > Yes, in principle the pad values should be ignored, so this feels more like a bug in transformers.
> >
> > My confusion is about whether `use_cache` is used. If it is not, padding on the right also works, as long as the next decoded token_ids are spliced in before the previous padding, like this:
> >
> > ```python
> > # original input
> > input = tensor([[1,2,3,4,5], [1,2,0,0,0]])
> > # the decoded results are 6 and 3 respectively, so the next input should be...
> > ```
> @irasin Yes, I checked: with left padding, the position_ids also start from the first non-padding position. So not just chatglm, any decoder-architecture model should be able to do batch_generate with left padding.

That's true, but chatglm is still a bit more special than other models: since it involves position_2d, the two kinds of positions are handled differently during batch_generation, which needs some care.
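To make the left-padding point concrete, here is the usual trick (I believe transformers' `generate` does something equivalent) for deriving 1-D position_ids from the attention mask, sketched in NumPy: positions count up over real tokens only, so a left-padded row still starts at position 0. For chatglm's position_2d, both position components would need the same treatment.

```python
import numpy as np

# batch of two prompts, left-padded; attention_mask marks real tokens
attention_mask = np.array([[1, 1, 1, 1, 1],
                           [0, 0, 0, 1, 1]])

# cumulative count of real tokens, shifted to start at 0
position_ids = attention_mask.cumsum(axis=-1) - 1
# the value under padding never matters, but keep it non-negative
position_ids[attention_mask == 0] = 0

# row 0 gets positions [0,1,2,3,4]; row 1's real tokens get [0, 1]
print(position_ids)
```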
Hi @chenyiwan, thanks for the reply. May I ask: in the inference code, is there anywhere that checks for eos and stops generation? I don't seem to see any related code.
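For reference, this is roughly where such a check usually lives in a greedy decoding loop; `next_token_fn` below is a stand-in for a real model forward pass, and all the names are hypothetical, not from this repo's code:

```python
# Minimal sketch of eos-based early stopping in greedy decoding.
def greedy_generate(next_token_fn, input_ids, eos_token_id, max_new_tokens=32):
    for _ in range(max_new_tokens):
        next_id = next_token_fn(input_ids)
        input_ids = input_ids + [next_id]
        if next_id == eos_token_id:  # stop as soon as eos is produced
            break
    return input_ids

# toy "model" that emits 7, 8, then eos (2); decoding stops at the eos
script = iter([7, 8, 2, 9, 9])
out = greedy_generate(lambda ids: next(script), [1], eos_token_id=2)
```

If no such check exists, generation always runs to the max-length cap, which wastes compute and can append garbage after the natural end of the answer.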