ACM MM 2024 Grand Challenge: Visual Spatial Description

Leaderboard

Congratulations to the winners!

Rank Team Score
1 Token 63.2064
2 USTC-IAT-United 62.1149
3 ppjj 59.8638
4 GXU-LIPE 59.6864
5 DMCV 59.3998

Introduction

We propose the Visual Spatial Description (VSD) challenge on the ACM MM platform. This challenge falls within the research domain of visual spatial semantics understanding. In the VSD challenge, models and systems are expected to generate an accurate textual sentence describing the spatial relationship between two given target objects in an input image. Alongside the challenge, a large-scale Visual Spatial Description dataset will be provided, consisting of 29,272 high-quality, manually annotated image-text pairs.

The challenge contains three subtasks, from easy to hard.

Challenge Task Definition and Metrics

Task 1: Classification of Visual Spatial Relationship.

Participants are required to construct models that extract the spatial relationship between two given objects O₁ and O₂ and output a triplet containing that relationship. The relationship is chosen from nine labels: “on”, “in”, “next”, “under”, “above”, “behind”, “in front of”, “left”, and “right”. The evaluation metric for this subtask is the F1 score of multi-label classification.

z1 = F1
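As an informal sketch (not the official evaluation script; the function and data layout are assumptions), the multi-label F1 over the nine predicate labels could be computed along these lines:

```python
# Nine-way predicate vocabulary from the task definition.
LABELS = {"on", "in", "next", "under", "above", "behind",
          "in front of", "left", "right"}

def micro_f1(gold_sets, pred_sets):
    """Micro-averaged F1. gold_sets / pred_sets: one set of labels per object pair."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # labels predicted and correct
        fp += len(pred - gold)   # labels predicted but wrong
        fn += len(gold - pred)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

z1 = micro_f1([{"on"}, {"left", "behind"}], [{"on"}, {"left"}])  # → 0.8
```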

Task 2: Description of Single Spatial Relationship.

Participants are required to build models that generate a textual description of the single spatial relationship between the two objects O₁ and O₂. Each data entry has one or more ground-truth references. We calculate BLEU-4 and SPICE between the predicted sentence and each ground truth and take the maximum score. We rank the submitted models by a weighted sum z2 of the BLEU-4 and SPICE scores:

z2 = 0.4BLEU4 + 0.6SPICE
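The max-over-references rule above can be sketched as follows. `score_fn` stands in for real BLEU-4 / SPICE implementations (e.g. from a captioning-evaluation toolkit), which this sketch assumes are supplied externally:

```python
def best_ref_score(pred, refs, score_fn):
    """Score the prediction against each reference and keep the best match."""
    return max(score_fn(pred, ref) for ref in refs)

def task2_score(bleu4_best, spice_best):
    """Weighted combination used for Task 2 ranking."""
    return 0.4 * bleu4_best + 0.6 * spice_best
```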

Task 3: Description of Open-Ended Spatial Relationship.

In this task, we provide a more challenging dataset that contains more complex spatial descriptions. Contestants are required to construct models that generate textual descriptions of the spatial relationship between O₁ and O₂. Unlike Task 2, in Task 3 the model should generate 3 diverse descriptions. The evaluation of Task 3 has two parts: correctness and diversity. For correctness we use SPICE, as in Task 2; for diversity, we use mBLEU-4 (lower means more diverse).

z3 = 0.5(50 - mBLEU4) + 0.5SPICE
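A minimal sketch of the Task 3 score, assuming a `bleu4(hypothesis, references)` function is supplied by an external library (an assumption of this sketch). mBLEU-4 here scores each generated sentence against the other two, so lower values indicate more diverse outputs:

```python
def mbleu4(sentences, bleu4):
    """Average BLEU-4 of each generated sentence against the remaining ones."""
    n = len(sentences)
    scores = [bleu4(sentences[i], sentences[:i] + sentences[i + 1:])
              for i in range(n)]
    return sum(scores) / n

def task3_score(mbleu4_val, spice_val):
    """Weighted combination of diversity (50 - mBLEU-4) and correctness (SPICE)."""
    return 0.5 * (50 - mbleu4_val) + 0.5 * spice_val
```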

Finally, we use the following score for ranking:

overall = 0.2z1 + 0.3z2 + 0.5z3

We provide Python scripts for evaluation; please refer to the baseline code.

Dataset

Our dataset comprises two versions, VSDv1 and VSDv2, which contain the same images, two objects with bounding boxes, and spatial descriptions in English. VSDv2 contains more complicated sentences than VSDv1.

Example of a sample:

[
{
  "img_id": "1.jpg", // image id
  "triple_list": [
    {
      "s": "book",                          // tag of the subject 
      "o": "table",                         // tag of the object
      "p": "on",                            // predicate label, one of “on”, “in”, “next”, “under”, “above”, “behind”, “in front of”, “left”, and “right”
      "s_bbox": [ymin,ymax,xmin,xmax],      // coordinates of the subject box, with the origin at the upper-left corner of the image; (xmin, ymin) is the upper-left corner of the box
      "o_bbox": [ymin,ymax,xmin,xmax]       // coordinates of the object box
    }
  ],
  "description": "The book is on the table."
}
]
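As a minimal sketch (field names follow the example above), a sample can be loaded and its [ymin, ymax, xmin, xmax] box convention converted into the more common (x, y, width, height) form:

```python
import json

def bbox_to_xywh(bbox):
    """Convert [ymin, ymax, xmin, xmax] (origin at the image's upper-left
    corner) into (x, y, width, height)."""
    ymin, ymax, xmin, xmax = bbox
    return (xmin, ymin, xmax - xmin, ymax - ymin)

# Illustrative sample with made-up coordinates, mirroring the schema above.
sample = json.loads("""
[{"img_id": "1.jpg",
  "triple_list": [{"s": "book", "o": "table", "p": "on",
                   "s_bbox": [10, 50, 20, 80], "o_bbox": [40, 120, 0, 100]}],
  "description": "The book is on the table."}]
""")
triple = sample[0]["triple_list"][0]
x, y, w, h = bbox_to_xywh(triple["s_bbox"])  # → (20, 10, 60, 40)
```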

The datasets (with images) are available on Google Drive.

Submission

Please submit predicted results as TWO JSON files: one for Task 1 and Task 2 (task1-2.json), and the other for Task 3 (task3.json).

For task1-2.json:

[
{
    "img_id": "1.jpg", // image id
    "triple_list": [
    {
        "s": "book",                          // tag of the subject 
        "o": "table",                         // tag of the object
        "p": "predicted label",               // predicate label, one of “on”, “in”, “next”, “under”, “above”, “behind”, “in front of”, “left”, and “right”
        "s_bbox": [ymin,ymax,xmin,xmax],      // coordinates of the subject box, with the origin at the upper-left corner of the image; (xmin, ymin) is the upper-left corner of the box
        "o_bbox": [ymin,ymax,xmin,xmax]       // coordinates of the object box
    }
    ],
    "description": ["sentence1"] // generated sentences
}
]

For task3.json:

[
{
    "img_id": "1.jpg", // image id
    "description": ["sentence1", "sentence2", "sentence3"] // generated sentences
}
]
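Before submitting, a quick sanity check along these lines may help (an informal sketch, not an official validator; function names are illustrative):

```python
import json

# Nine allowed predicate labels from the task definition.
ALLOWED = {"on", "in", "next", "under", "above", "behind",
           "in front of", "left", "right"}

def check_task12(path):
    """Check that each task1-2.json entry has an img_id, a non-empty
    triple_list, and only allowed predicate labels."""
    for e in json.load(open(path)):
        assert "img_id" in e and e["triple_list"], f"bad entry: {e}"
        for t in e["triple_list"]:
            assert t["p"] in ALLOWED, f"bad predicate: {t['p']}"

def check_task3(path):
    """Check that each task3.json entry carries exactly three descriptions."""
    for e in json.load(open(path)):
        assert len(e["description"]) == 3, f"{e['img_id']}: need 3 sentences"
```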

Participants can submit at Codabench, or send the results as a ZIP file to vsdchallenge@gmail.com. We will review the submissions and publish the ranking here.

Baseline

We use pre-trained Vision-Language Models as baselines.

Link to the code: https://github.com/LLLogen/VSDcode

Registration

Please apply for the VSD challenge via the form at this link.

Feel free to contact us at vsdchallenge@gmail.com.

Timeline

Please note: the submission deadline is 11:59 p.m. (Anywhere on Earth) on the stated deadline date.

Registration open May 24, 2024
Release of all datasets May 28, 2024
Evaluation results and ranking open June 10, 2024
System reports and conference paper deadline July 12, 2024
Results Submission Deadline August 1, 2024
Challenge Paper Submission Deadline (follows MM 2024 Workshop dates) August 19, 2024

Rewards

Top-ranked participants in this competition will receive a certificate of achievement and will be recommended to write a technical paper for submission to the ACM ToMM Special Issue.

Organizers

Yu Zhao. College of Intelligence and Computing, Tianjin University, China.

Hao Fei. Skywork AI, National University of Singapore, Singapore.

Bobo Li. School of Cyber Science and Engineering, Wuhan University, China.

Meishan Zhang. Harbin Institute of Technology (Shenzhen), Shenzhen, China.

Min Zhang. Harbin Institute of Technology (Shenzhen), Shenzhen, China.
