Results of the Hugging Face version of the OFA model

I used the code at https://huggingface.co/OFA-Sys/ofa-large-caption for inference. It generates good captions and everything works. But when I used this code to generate 5000 captions for the MSCOCO test split (with my own code), the final CIDEr was about 132, which is clearly lower than the reported result.
Is there a problem with the Hugging Face version of the OFA caption model?
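
For anyone trying to reproduce this, here is a minimal sketch of how the generated captions could be collected into the COCO caption results format (a list of `{"image_id", "caption"}` records) before scoring. It is not the exact code behind the number above: the helper names `to_coco_results` and `check_coverage` and the demo data are hypothetical, and the whitespace cleanup is only an assumption about what the scorer expects.

```python
import json


def to_coco_results(captions):
    """Convert {image_id: caption} into a COCO-style caption results list.

    Hypothetical helper: `captions` maps an image id to the decoded string.
    """
    results = []
    for image_id, caption in sorted(captions.items()):
        # Collapse repeated whitespace and strip the ends; decoded text
        # can carry stray spaces that affect token matching.
        cleaned = " ".join(caption.split())
        results.append({"image_id": int(image_id), "caption": cleaned})
    return results


def check_coverage(results, expected_ids):
    """Return (missing, extra) image ids relative to the expected split."""
    got = {r["image_id"] for r in results}
    expected = set(expected_ids)
    return sorted(expected - got), sorted(got - expected)


if __name__ == "__main__":
    # Made-up demo data, not real model output.
    demo = {42: "  a dog running on the beach ", 7: "two people riding bikes"}
    results = to_coco_results(demo)
    missing, extra = check_coverage(results, [7, 42, 99])
    print(json.dumps(results, indent=2))
    print("missing:", missing, "extra:", extra)
```

Checking that every image in the split has exactly one prediction, and that the captions are cleaned the same way as the references, rules out two simple sources of a lower score before suspecting the model weights.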