
Generic Scene Graph Generation Model with Hierarchical Prompt Learning

Published in International Journal of Computer Vision

Abstract

Scene Graph Generation (SGG) delivers structured knowledge for representing complex scenes and has proven effective in many computer vision tasks. However, traditional SGG models suffer from two limitations that hinder their applicability to higher-level visual tasks: (1) a rigid structure that results in low efficiency and limited flexibility, and (2) biased optimization that yields predictions favoring uninformative predicates. To resolve these issues, we propose GSGG (Generic Scene Graph Generation), a novel, efficient, and flexible SGG model that (1) combines generalized modules to construct a top-performing, high-efficiency SGG model and (2) employs a prompt-learning-based relation decoder with a novel Hierarchical Prompt (HP) learning method to mitigate biased optimization. HP composes basic prompts constrained to progressively narrower class groups, encouraging the corresponding prompts to focus on learning increasingly informative predicates. Extensive evaluations on three SGG benchmarks demonstrate the excellent efficiency and performance of GSGG with HP. We also introduce a novel predicate generalization task with a new benchmark; experiments on it demonstrate the effectiveness of HP in base-to-novel predicate generalization.
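The hierarchical composition described in the abstract can be illustrated with a minimal sketch. This is a hypothetical toy example, not the paper's implementation: the hierarchy, group names, and prompt-token format are all invented here to show the idea that each predicate's prompt is composed from basic prompts shared at progressively narrower group levels, so that narrower (less-shared) prompts can specialize on rarer, more informative predicates.

```python
# Hypothetical sketch of hierarchical prompt composition (NOT the paper's code).
# Assumption: predicates are organized into a coarse-group -> fine-group -> leaf
# hierarchy, and a predicate's final prompt is the composition (here, simply the
# ordered list) of one basic prompt per level along its path.

from typing import Dict, List

# Invented toy hierarchy for illustration only.
HIERARCHY = {
    "spatial": {"on-surface": ["on", "standing on"], "near": ["near"]},
    "semantic": {"action": ["riding", "eating"]},
}


def build_prompt_paths(hierarchy) -> Dict[str, List[str]]:
    """Map each leaf predicate to the basic prompts on its hierarchy path."""
    paths = {}
    for coarse, fines in hierarchy.items():
        for fine, predicates in fines.items():
            for pred in predicates:
                # The coarse prompt is shared by many predicates; the fine
                # prompt by fewer; the leaf prompt is predicate-specific.
                paths[pred] = [f"<p:{coarse}>", f"<p:{fine}>", f"<p:{pred}>"]
    return paths


paths = build_prompt_paths(HIERARCHY)
print(paths["riding"])  # ['<p:semantic>', '<p:action>', '<p:riding>']
```

In a learned system the `<p:...>` tokens would be trainable embedding vectors rather than strings, but the sharing structure would be the same: frequent, uninformative predicates lean on widely shared coarse prompts, while narrow prompts are free to fit informative tail predicates.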


Figures 1–12 appear in the full article.


Data Availability

All experiments are conducted on publicly available datasets; see the references cited. Code for the model will be available at https://github.com/ZHUXUHAN/GSGG.

References

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In: European conference on computer vision, Springer, 213–229.

  • Chen, S., Jin, Q., Wang, P., & Wu, Q. (2020). Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9962–9971.

  • Chen, Z., Wu, J., Lei, Z., Zhang, Z., & Chen, C. (2023). Expanding scene graph boundaries: Fully open-vocabulary scene graph generation via visual-concept alignment and retention. arXiv preprint arXiv:2311.10988.

  • Chiou, M. J., Ding, H., Yan, H., Wang, C., Zimmermann, R., & Feng, J. (2021). Recovering the unbiased scene graphs from the biased ones. In: Proceedings of the 29th ACM International Conference on Multimedia, 1581–1590.

  • Cong, Y., Yang, M. Y., & Rosenhahn, B. (2023). Reltr: Relation transformer for scene graph generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 11169–11183.


  • Cui, Y., Jia, M., Lin, T. Y., Song, Y., & Belongie, S. (2019). Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9268–9277.

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  • Dong, X., Gan, T., Song, X., Wu, J., Cheng, Y., & Nie, L. (2022). Stacked hybrid-attention and group collaborative learning for unbiased scene graph generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19427–19436.

  • Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., & Li, G. (2022). Learning to prompt for open-vocabulary object detection with vision-language model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14084–14093.

  • Gu, J., Joty, S., Cai, J., Zhao, H., Yang, X., & Wang, G. (2019). Unpaired image captioning via scene graph alignments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 10323–10332.

  • Gu, J., Han, Z., Chen, S., et al. (2023). A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision, 2961–2969.

  • He, T., Gao, L., Song, J., & Li, Y. F. (2022). Towards open-vocabulary scene graph generation with prompt-based finetuning. In: European Conference on Computer Vision, Springer, 56–73.

  • Hildebrandt, M., Li, H., Koner, R., Tresp, V., & Günnemann, S. (2020). Scene graph reasoning for visual question answering. arXiv preprint arXiv:2007.01072.

  • Hu, J., Huang, L., Ren, T., Zhang, S., Ji, R., & Cao, L. (2023). You only segment once: Towards real-time panoptic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17819–17829.

  • Im, J., Nam, J., Park, N., Lee, H., & Park, S. (2024). Egtr: Extracting graph from transformer for scene graph generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 24229–24238.

  • Jeon, J., Kim, K., Yoon, K., & Park, C. (2025). Semantic diversity-aware prototype-based learning for unbiased scene graph generation. In: European Conference on Computer Vision, Springer, 379–395.

  • Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML.

  • Jin, W., Cheng, Y., Shen, Y., Chen, W., & Ren, X. (2021a). A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models. arXiv preprint arXiv:2110.08484.

  • Jin, Y., Chen, Y., Wang, L., Wang, J., Yu, P., Liu, Z., & Hwang, J. N. (2021b). Is object detection necessary for human-object interaction recognition? arXiv preprint arXiv:2107.13083.

  • Jocher, G., Chaurasia, A., & Qiu, J. (2023). Ultralytics YOLO. https://github.com/ultralytics/ultralytics.

  • Johnson, J., Krishna, R., Stark, M., Li, L. J., Shamma, D., Bernstein, M., & Fei-Fei, L. (2015). Image retrieval using scene graphs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 3668–3678.

  • Kirillov, A., Girshick, R., He, K., Dollár, P. (2019). Panoptic feature pyramid networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 6399–6408.

  • Knyazev, B., de Vries, H., Cangea, C., Taylor, G. W., Courville, A., & Belilovsky, E. (2020). Graph density-aware losses for novel compositions in scene graph generation. In: BMVC.

  • Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L. J., Shamma, D. A., et al. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1), 32–73.


  • Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., Duerig, T., & Ferrari, V. (2020). The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7), 1956–1981.


  • Li, F., Zhang, H., Xu, H., Liu, S., Zhang, L., Ni, L. M., Shum, H. Y. (2023a). Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3041–3050.

  • Li, L., Chen, L., Huang, Y., Zhang, Z., Zhang, S., & Xiao, J. (2022a). The devil is in the labels: Noisy label correction for robust scene graph generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18869–18878.

  • Li, L., Chen, G., Xiao, J., Yang, Y., Wang, C., & Chen, L. (2023b). Compositional feature augmentation for unbiased scene graph generation. arXiv preprint arXiv:2308.06712.

  • Li, L., Xiao, J., Chen, G., Shao, J., Zhuang, Y., & Chen, L. (2023). Zero-shot visual relation detection via composite visual cues from large language models. Advances in Neural Information Processing Systems, 36, 50105–50116.


  • Li, R., Zhang, S., Wan, B., & He, X. (2021). Bipartite graph network with adaptive message passing for unbiased scene graph generation. In: CVPR.

  • Li, R., Zhang, S., & He, X. (2022b). Sgtr: End-to-end scene graph generation with transformer. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 19486–19496.

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, 2980–2988.

  • Lin, X., Ding, C., Zeng, J., & Tao, D. (2020). Gps-net: Graph property sensing network for scene graph generation. In: CVPR.

  • Lin, X., Ding, C., Zhan, Y., Li, Z., & Tao, D. (2022). Hl-net: Heterophily learning network for scene graph generation. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 19476–19485.

  • Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021a). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

  • Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al. (2024). Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: European Conference on Computer Vision, Springer, 38–55.

  • Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., & Tang, J. (2021b). Gpt understands, too. arXiv preprint arXiv:2103.10385.

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021c). Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022.

  • Lorenz, J., Pest, A., Kienzle, D., Ludwig, K., & Lienhart, R. (2024). A fair ranking and new model for panoptic scene graph generation. arXiv preprint arXiv:2407.09216.

  • Lyu, X., Gao, L., Guo, Y., Zhao, Z., Huang, H., Shen, H. T., & Song, J. (2022). Fine-grained predicates learning for scene graph generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19467–19475.

  • Ma, N., Zhang, X., Zheng, H. T., & Sun, J. (2018). Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings of the European conference on computer vision (ECCV), 116–131.

  • Neau, M., Santos, P. E., Bosser, A. G., & Buche, C. (2024). React: Real-time efficiency and accuracy compromise for tradeoffs in scene graph generation. arXiv preprint arXiv:2405.16116.

  • Peng, Y., Li, H., Wu, P., Zhang, Y., Sun, X., & Wu, F. (2025). D-FINE: Redefine regression task of DETRs as fine-grained distribution refinement. In: The Thirteenth International Conference on Learning Representations, https://openreview.net/forum?id=MFZjrTFE7h.

  • Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training. OpenAI preprint.

  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.


  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In: ICML, PMLR, 8748–8763.

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 779–788.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.

  • SegmentsAI. (2023). Panoptic segment anything. https://github.com/segments-ai/panoptic-segment-anything.

  • Shi, H., Li, L., Xiao, J., Zhuang, Y., & Chen, L. (2024). From easy to hard: Learning curricular shape-aware features for robust panoptic scene graph generation. International Journal of Computer Vision, 1–20.

  • Tan, J., Wang, C., Li, B., Li, Q., Ouyang, W., Yin, C., & Yan, J. (2020). Equalization loss for long-tailed object recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11662–11671.

  • Tan, J., Lu, X., Zhang, G., Yin, C., & Li, Q. (2021). Equalization loss v2: A new gradient balance approach for long-tailed object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 1685–1694.

  • Tang, K., Zhang, H., Wu, B., Luo, W., & Liu, W. (2019). Learning to compose dynamic tree structures for visual contexts. In: CVPR.

  • Tang, K., Niu, Y., Huang, J., Shi, J., & Zhang, H. (2020). Unbiased scene graph generation from biased training. In: CVPR.

  • Teney, D., Liu, L., & van Den Hengel, A. (2017). Graph-structured representations for visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1–9.

  • Wang, H., Li, Y., Yao, H., & Li, X. (2023a). Clipn for zero-shot ood detection: Teaching clip to say no. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 1802–1812.

  • Wang, J., Wen, Z., Li, X., Guo, Z., Yang, J., & Liu, Z. (2024). Pair then relation: Pair-Net for panoptic scene graph generation. arXiv preprint arXiv:2307.08699.

  • Wang, W., Wang, R., Shan, S., & Chen, X. (2023). Importance first: Generating scene graph of human interest. International Journal of Computer Vision, 131(10), 2489–2515.


  • Xian, Y., Schiele, B., & Akata, Z. (2017). Zero-shot learning-the good, the bad and the ugly. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 4582–4591.

  • Xu, D., Zhu, Y., Choy, C., Fei-Fei, L. (2017). Scene graph generation by iterative message passing. In: CVPR.

  • Xu, G., Chai, J., & Kordjamshidi, P. (2024). Gipcol: Graph-injected soft prompting for compositional zero-shot learning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 5774–5783.

  • Yan, S., Shen, C., Jin, Z., Huang, J., Jiang, R., Chen, Y., & Hua, X. (2020). PCPL: predicate-correlation perception learning for unbiased scene graph generation. In: ACM MM.

  • Yang, J., Lu, J., Lee, S., Batra, D., & Parikh, D. (2018). Graph r-cnn for scene graph generation. In: ECCV, 670–685.

  • Yang, J., Ang, Y. Z., Guo, Z., Zhou, K., Zhang, W., & Liu, Z. (2022). Panoptic scene graph generation. In: European Conference on Computer Vision, Springer, 178–196.

  • Yao, Y., Zhang, A., Zhang, Z., Liu, Z., Chua, T. S., & Sun, M. (2021). Cpt: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797.

  • Yu, Q., Li, J., Wu, Y., Tang, S., Ji, W., & Zhuang, Y. (2023). Visually-prompted language model for fine-grained scene graph generation in an open world. arXiv preprint arXiv:2303.13233.

  • Zellers, R., Yatskar, M., Thomson, S., & Choi, Y. (2018). Neural motifs: Scene graph parsing with global context. In: CVPR.

  • Zhang, A., Yao, Y., Chen, Q., Ji, W., Liu, Z., Sun, M., & Chua, T. S. (2022). Fine-grained scene graph generation with data transfer. In: ECCV.

  • Zhang, C., Chao, W. L., & Xuan, D. (2019a). An empirical study on leveraging scene graphs for visual question answering. arXiv preprint arXiv:1907.12133.

  • Zhang, J., Shih, K., Elgammal, A., Tao, A., & Catanzaro, B. (2019b). Graphical contrastive losses for scene graph parsing. In: CVPR.

  • Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al. (2023a). Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514.

  • Zhang, Y., Pan, Y., Yao, T., Huang, R., Mei, T., & Chen, C. W. (2023b). Learning to generate language-supervised and open-vocabulary scene graph using pre-trained visual-semantic space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2915–2924.

  • Zheng, C., Gao, L., Lyu, X., Zeng, P., El Saddik, A., & Shen, H. T. (2023a). Dual-branch hybrid learning network for unbiased scene graph generation. IEEE Transactions on Circuits and Systems for Video Technology, 34(3), 1743–1756.


  • Zheng, C., Lyu, X., Gao, L., Dai, B., & Song, J. (2023b). Prototype-based embedding network for scene graph generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22783–22792.

  • Zhong, Y., Wang, L., Chen, J., Yu, D., & Li, Y. (2020). Comprehensive image captioning via scene graph decomposition. In: Proceedings of the European Conference on Computer Vision, 211–229.

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022a). Conditional prompt learning for vision-language models. In: CVPR, 16816–16825.

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022b). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337–2348.


  • Zhou, Z., Lei, Y., Zhang, B., Liu, L., & Liu, Y. (2023a). Zegclip: Towards adapting clip for zero-shot semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11175–11185.

  • Zhou, Z., Shi, M., & Caesar, H. (2023b). HiLo: Exploiting high low frequency relations for unbiased panoptic scene graph generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 21637–21648.

  • Zhu, P., Wang, X., Zhu, L., Sun, Z., Zheng, W., Wang, Y., & Chen, C. (2022). Prompt-based learning for unpaired image captioning. arXiv preprint arXiv:2205.13125.

  • Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.

  • Zhu, X., Xing, Y., Wang, R., Wang, Y., & Lan, X. (2024). Hierarchical prompt learning for scene graph generation. In: 35th British Machine Vision Conference 2024, BMVC 2024, Glasgow, UK, November 25-28, 2024, BMVA, https://papers.bmvc2024.org/0183.pdf.


Acknowledgements

This work is supported by Pengcheng Laboratory Research Project No. PCL2023A08 and partially supported by the National Natural Science Foundation of China under contracts Nos. U21B2025 and 62402252.

Author information

Correspondence to Ruiping Wang or Xiangyuan Lan.

Additional information

Communicated by Carlos Moreno-Garcia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhu, X., Xing, Y., Wang, R. et al. Generic Scene Graph Generation Model with Hierarchical Prompt Learning. Int J Comput Vis 133, 6813–6831 (2025). https://doi.org/10.1007/s11263-025-02499-z

