Numerical inaccuracy in unpad_image (LlavaOnevision)

### System Info

**System Info:**
- `transformers` version: 4.45.0.dev0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
- Python version: 3.11.0
- Huggingface_hub version: 0.24.7
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config:    not found
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA A100-SXM4-80GB



### Who can help?

@amyeroberts, @qubvel

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

In [unpad_image](https://github.com/huggingface/transformers/blob/ce62a41880b5b70a304d068eb58f55894a5a7af8/src/transformers/models/llava_onevision/modeling_llava_onevision.py#L114) we found a numerical inaccuracy that appears when `original_aspect_ratio == current_aspect_ratio`. It occurs on DocVQA training sample 32673. See the snippet below:
```python
import torch

# Image size (H, W) as it reaches unpad_image: a bfloat16 tensor on the GPU
original_size = torch.tensor([2136, 3212], device="cuda:0", dtype=torch.bfloat16)
original_height, original_width = original_size
current_height, current_width = 108, 162

original_aspect_ratio = original_width / original_height  # tensor(1.5000)
current_aspect_ratio = current_width / current_height  # 1.5

scale_factor = current_height / original_height
new_width = int(original_width * scale_factor)  # 163
```
Testing showed that this inaccuracy does not occur if `original_height` and `original_width` are integers.
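
The sketch below is not part of the original report. It emulates bfloat16 rounding in plain Python (no GPU needed) with a hypothetical helper `to_bf16`, and it assumes that dividing a Python scalar by a tensor is evaluated as a rounded reciprocal multiplied by the scalar. Under those assumptions it reproduces the reported value of 163, while the same arithmetic on the unrounded sizes gives 162.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite, positive float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest even on the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Storing the sizes in bfloat16 already changes them.
original_height = to_bf16(2136.0)  # 2144.0
original_width = to_bf16(3212.0)   # 3216.0
current_height = 108

# Assumption: scalar / tensor is computed as reciprocal(tensor) * scalar,
# with every intermediate result rounded to bfloat16.
reciprocal = to_bf16(1.0 / original_height)
scale_factor = to_bf16(reciprocal * current_height)
new_width = int(to_bf16(original_width * scale_factor))

# Same formula on the unrounded sizes in ordinary Python floats.
reference_width = int(3212 * (108 / 2136))

print(new_width, reference_width)  # 163 162
```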

In the docstring, the unpad function asks for `original_size` to be a tuple (there is no type annotation, though), but it will always receive a `torch.Tensor`.
```python
"""
Args:
    image_features (`List[torch.Tensor]` of length num_images, each of shape `(num_patches, image_length, embed_dim)`)
        List of image feature tensor, each contains all the visual feature of all patches.
    image_sizes (`torch.Tensor` of shape `(num_images, 2)`)
        Actual image size of each images (H, W).
"""
# ...
image_feature = unpad_image(image_feature, image_sizes[image_idx])
```
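
One possible workaround, sketched here as an assumption rather than as the library's fix, is to convert the sizes to Python integers and use integer arithmetic. `unpadded_width` is a hypothetical helper that mirrors only the width computation from the snippet above. It accepts a tuple or any sequence whose elements support `int()`.

```python
from typing import Iterable


def unpadded_width(original_size: Iterable, current_height: int) -> int:
    """Hypothetical helper: width after scaling the original image to current_height."""
    # int() works for Python ints and for zero-dimensional tensors alike.
    original_height, original_width = (int(v) for v in original_size)
    # Integer floor division avoids any low-precision floating-point scale factor.
    return (original_width * current_height) // original_height


print(unpadded_width((2136, 3212), 108))  # 162
```

Note that a tensor that has already been cast to bfloat16 holds rounded sizes (2136 is stored as 2144), so the conversion to integers would need to happen before any such cast.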

### Expected behavior

The `new_width` value should be 162. To see this, write down the formulas for the two aspect ratios, set them equal, and multiply both sides by `current_height`. This gives `original_width * scale_factor = current_width` (= `new_width`).
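
Written out, with $W_o, H_o$ for the original width and height and $W_c, H_c$ for the current ones (symbols introduced here for illustration only), the argument is:

$$
\frac{W_o}{H_o} = \frac{W_c}{H_c}
\quad\Longrightarrow\quad
W_o \cdot \frac{H_c}{H_o} = W_c ,
\qquad \text{where } \frac{H_c}{H_o} \text{ is the scale factor.}
$$

So whenever the two aspect ratios are equal, exact arithmetic gives a `new_width` equal to `current_width`, which is 162 in the example.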

PS: This is my first issue ever, so please be patient.
