Abstract
This paper explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance the precision and effectiveness of Automated Assessment Systems (AAS) for open-ended mathematics problems. While LLMs have demonstrated transformative capabilities across various disciplines, their application in AAS, particularly for mathematical logic and open-ended problem-solving, remains underexplored. Our research addresses this gap by developing and critically evaluating a GPT-4-based AAS. We analyzed 4,180 responses to open-ended mathematics questions from 380 sixth-grade primary school students. Three human experts and the GPT-4 model independently assessed these responses using a pre-established rubric. Our findings reveal high consistency between human and GPT-4 assessments in most instances, highlighting the potential of integrating GPT-4 into AAS. We categorized scoring discrepancies between GPT-4 and human raters by error type and identified specific mathematical content areas where automated assessment faced limitations. We evaluated two strategies for enhancing GPT-4's assessment capabilities: (1) using elaborate prompts and (2) applying advanced prompt engineering techniques such as Chain-of-Thought, Self-Consistency, and Tree-of-Thought. While comprehensive prompts significantly improved assessment quality, directly applying the advanced prompt engineering techniques produced suboptimal results, indicating a need for further refinement. This study contributes to the emerging body of research evaluating GPT-4 in the context of AAS for open-ended mathematics problems, shedding light on both the strengths and limitations of this approach. Our findings provide valuable insights and a foundation for future research on refining the integration of LLMs into AAS, particularly in mathematics education.
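The Self-Consistency strategy mentioned above can be sketched as follows: sample several independent chain-of-thought gradings at nonzero temperature and take the majority-vote score. This is an illustrative sketch only, not the paper's implementation; the `sample_score` callable is a hypothetical stand-in for a single GPT-4 grading call.

```python
from collections import Counter

def self_consistent_score(sample_score, n_samples=5):
    """Majority-vote aggregation over repeated gradings (self-consistency).

    sample_score: zero-argument callable returning one rubric score,
    e.g. a single chain-of-thought GPT-4 grading at temperature > 0
    (hypothetical placeholder; any stochastic grader works).
    """
    scores = [sample_score() for _ in range(n_samples)]
    # most_common(1) returns [(score, count)] for the modal score
    return Counter(scores).most_common(1)[0][0]
```

With five samples, an occasional reasoning slip in one grading is outvoted by the other four, which is why self-consistency is expected to stabilize stochastic scoring.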

Data Availability
The data used in this study were collected after obtaining consent from the participants. The collected data were cleansed of any personally identifiable information before being used for research purposes. However, separate consent was not obtained for public disclosure of the data. Therefore, it is difficult to disclose the data, even in response to a specific request relevant to the research objectives.
Ethics declarations
Conflict of Interest
None.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1. Evaluation Rubric of Each Question by Knowledge Elements
Appendix 2
Appendix 3
Appendix 4. Detailed Analysis
Comparison after modifying the prompt: the graphs show reliability values for human evaluations and GPTs before (left) and after (right) modifying the prompt for question 1_1. The x-axis represents each human rater, and the y-axis shows the reliability value. Circles are point estimates; vertical lines are 95% confidence intervals. (a) and (b) are ICC values (top); (c) and (d) are Krippendorff's alpha values (bottom)
Comparison after modifying the prompt: the graphs show reliability values for human evaluations and GPTs before (left) and after (right) modifying the prompt for question 2_3. The x-axis represents each human rater, and the y-axis shows the reliability value. Circles are point estimates; vertical lines are 95% confidence intervals. (a) and (b) are ICC values (top); (c) and (d) are Krippendorff's alpha values (bottom)
Comparison after modifying the prompt: the graphs show reliability values for human evaluations and GPTs before (left) and after (right) modifying the prompt for question 4_1. The x-axis represents each human rater, and the y-axis shows the reliability value. Circles are point estimates; vertical lines are 95% confidence intervals. (a) and (b) are ICC values (top); (c) and (d) are Krippendorff's alpha values (bottom)
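For reference, the Krippendorff's alpha reported in the panels above can be computed for interval-scale rubric scores with the standard formula (alpha = 1 − observed disagreement / expected disagreement). The sketch below assumes interval scores and is a generic illustration, not the paper's analysis code; names are illustrative. The ICC values in panels (a)–(b) would come from a separate mixed-effects formulation available in standard statistical packages.

```python
from itertools import combinations

def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for interval data.

    ratings: one list of scores per unit (e.g. per student response);
    use None for a missing rating. Units with fewer than two ratings
    carry no pairable information and are dropped.
    """
    units = [[v for v in u if v is not None] for u in ratings]
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)  # total number of pairable values
    if n <= 1:
        return 1.0
    # observed disagreement: squared differences within each unit,
    # each unit's pair sum normalized by (m_u - 1)
    within = sum(
        sum((a - b) ** 2 for a, b in combinations(u, 2)) / (len(u) - 1)
        for u in units
    )
    # expected disagreement: squared differences across all values
    values = [v for u in units for v in u]
    total = sum((a - b) ** 2 for a, b in combinations(values, 2))
    if total == 0:  # no variation at all => perfect agreement
        return 1.0
    return 1.0 - (n - 1) * within / total
```

Perfect agreement between raters yields alpha = 1; disagreements pull the value toward (and below) zero relative to chance-level disagreement.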
About this article
Cite this article
Lee, U., Kim, Y., Lee, S. et al. Can we Use GPT-4 as a Mathematics Evaluator in Education?: Exploring the Efficacy and Limitation of LLM-based Automatic Assessment System for Open-ended Mathematics Question. Int J Artif Intell Educ 35, 1560–1596 (2025). https://doi.org/10.1007/s40593-024-00448-4


