
Can we Use GPT-4 as a Mathematics Evaluator in Education?: Exploring the Efficacy and Limitation of LLM-based Automatic Assessment System for Open-ended Mathematics Question

  • Letter to the Editor

International Journal of Artificial Intelligence in Education

Abstract

This paper explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance the precision and effectiveness of Automated Assessment Systems (AAS) for open-ended mathematics problems. While LLMs have demonstrated transformative capabilities across various disciplines, their application in AAS, particularly for mathematical logic and open-ended problem-solving, remains underexplored. Our research addresses this gap by developing and critically evaluating a GPT-4-based AAS. We analyzed 4,180 responses to open-ended mathematics questions from 380 6th-grade primary school students. Three human experts and the GPT-4 model independently assessed these responses using a pre-established rubric. Our findings reveal high consistency between human and GPT-4 assessments in most instances, highlighting the potential of integrating GPT-4 into AAS. We categorized scoring discrepancies between GPT-4 and human raters by error type and identified specific mathematical content areas where automated assessment faced limitations. We evaluated two strategies to enhance GPT-4’s assessment capabilities: (1) using elaborate prompts and (2) implementing advanced prompt engineering techniques such as Chain-of-thought, Self-consistency, and Tree-of-thought. While comprehensive prompts significantly improved assessment quality, applying advanced prompt engineering techniques directly produced suboptimal results, indicating a need for further refinement. This study contributes to the emerging body of research evaluating GPT-4 in the context of AAS for open-ended mathematics problems, shedding light on both the strengths and limitations of this approach. Our findings provide valuable insights and a foundation for future research to refine the integration of LLMs in AAS, particularly in mathematics education.
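Two ingredients the abstract names, a rubric-conditioned grading prompt and Self-consistency voting over repeated model samples, can be sketched briefly. The sketch below is illustrative only: the prompt wording, function names, rubric items, and integer score format are assumptions, not the authors' actual materials.

```python
from collections import Counter

def build_scoring_prompt(question: str, rubric: list[str], response: str) -> str:
    """Assemble a rubric-conditioned grading prompt for an LLM evaluator.

    The wording here is a placeholder, not the prompt used in the study.
    """
    rubric_text = "\n".join(f"- {item}" for item in rubric)
    return (
        "You are an evaluator for open-ended mathematics questions.\n"
        f"Question: {question}\n"
        "Rubric (award 1 point per satisfied criterion):\n"
        f"{rubric_text}\n"
        f"Student response: {response}\n"
        "Reply with the total score as a single integer."
    )

def self_consistent_score(sampled_scores: list[int]) -> int:
    """Self-consistency: sample the model several times and majority-vote."""
    return Counter(sampled_scores).most_common(1)[0][0]

# Example: five sampled gradings of the same response collapse to one score.
prompt = build_scoring_prompt(
    "Find the area of the trapezoid.",
    ["Identifies the parallel sides", "Applies the area formula correctly"],
    "I added the parallel sides, multiplied by the height, and halved it.",
)
final_score = self_consistent_score([2, 2, 1, 2, 2])  # majority vote -> 2
```

Note that the study found direct application of such techniques underperformed a single comprehensive prompt, so a voting layer like this would need validation against human ratings before deployment.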



Data Availability

The data used in this study were collected after obtaining consent from the participants. The collected data were cleansed of any personally identifiable information before being used for research purposes. However, separate consent was not obtained for public disclosure of the data. Therefore, it is difficult to disclose the data, even in response to a specific request relevant to the research objectives.



Author information


Corresponding author

Correspondence to Yun Joo Yoo.

Ethics declarations

Conflict of Interest 

None.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1. Evaluation Rubric of Each Question by Knowledge Elements

Table 9 Question 1. How can we guarantee that a shape with certain characteristics is a trapezoid?

Table 10 Question 2. Identify the three line segments you need to find the area of the trapezoid, mark them with symbols, and write down their names. Also measure their lengths.

Table 11 Question 3. Calculate the area of a trapezoid by dividing it into two or more shapes. Describe how you calculated the area as an expression and a process.

Table 12 Question 4. Calculate the area of a trapezoid by adding other shapes to the trapezoid. Describe how you calculated the area as an expression and a process.

Table 13 Question 5. Calculate the area of a trapezoid by replacing the trapezoid with another shape. Explain how you calculated the area in terms of an expression.
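Questions 3–5 above ask students to find the trapezoid's area by decomposing, augmenting, or transforming the shape. As a brief check that the decomposition strategy of Question 3 agrees with the standard formula (using made-up dimensions, since the study's actual figure is not reproduced here):

```python
def trapezoid_area_formula(a: float, b: float, h: float) -> float:
    """Standard formula: area = (a + b) * h / 2."""
    return (a + b) * h / 2

def trapezoid_area_by_decomposition(a: float, b: float, h: float) -> float:
    """Question 3's strategy for a right trapezoid (top side a <= bottom side b):
    split it into an a-by-h rectangle plus a right triangle of base (b - a)."""
    rectangle = a * h
    triangle = (b - a) * h / 2
    return rectangle + triangle

# With top 4, bottom 8, and height 3, both routes give an area of 18.0.
assert trapezoid_area_by_decomposition(4, 8, 3) == trapezoid_area_formula(4, 8, 3) == 18.0
```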

Appendix 2

Table 14 Standardized model answer

Appendix 3

Table 15 Analysis results for research question 4.1 estimates

Appendix 4. Detailed Analysis

Figures 9, 10 and 11

Fig. 9

Comparison after modifying the prompt: the graph shows reliability values for the human raters and GPT before (left) and after (right) modifying the prompt for question 1_1. The x-axis represents each human rater, and the y-axis is the reliability value. Circles are point estimates; vertical lines are 95% confidence intervals. (a) and (b) are ICC values (top), and (c) and (d) are Krippendorff's alpha values (bottom)

Fig. 10

Comparison after modifying the prompt: the graph shows reliability values for the human raters and GPT before (left) and after (right) modifying the prompt for question 2_3. The x-axis represents each human rater, and the y-axis is the reliability value. Circles are point estimates; vertical lines are 95% confidence intervals. (a) and (b) are ICC values (top), and (c) and (d) are Krippendorff's alpha values (bottom)

Fig. 11

Comparison after modifying the prompt: the graph shows reliability values for the human raters and GPT before (left) and after (right) modifying the prompt for question 4_1. The x-axis represents each human rater, and the y-axis is the reliability value. Circles are point estimates; vertical lines are 95% confidence intervals. (a) and (b) are ICC values (top), and (c) and (d) are Krippendorff's alpha values (bottom)
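The reliability measures plotted in Figs. 9–11 (ICC and Krippendorff's alpha) are standard inter-rater statistics. As a minimal, self-contained sketch, here is Krippendorff's alpha for nominal data with complete ratings; the study's exact variant (e.g., ordinal weighting) and its ICC computation are not reproduced here, so treat this as illustrative only:

```python
def krippendorff_alpha_nominal(ratings: list[list[int]]) -> float:
    """Krippendorff's alpha for nominal data, complete ratings only
    (every unit rated by every rater). `ratings` is a list of units,
    each a list of the scores the raters assigned to that unit."""
    n = sum(len(unit) for unit in ratings)  # total number of ratings
    # Observed disagreement: mismatching rating pairs within each unit.
    d_o = 0.0
    for unit in ratings:
        m = len(unit)
        counts: dict[int, int] = {}
        for v in unit:
            counts[v] = counts.get(v, 0) + 1
        d_o += sum(c * (m - c) for c in counts.values()) / (m - 1)
    d_o /= n
    # Expected disagreement: mismatching pairs across the pooled ratings.
    totals: dict[int, int] = {}
    for unit in ratings:
        for v in unit:
            totals[v] = totals.get(v, 0) + 1
    d_e = sum(c * (n - c) for c in totals.values()) / (n * (n - 1))
    return 1.0 - d_o / d_e

# Four responses scored by two raters; one disagreement pulls alpha down.
alpha = krippendorff_alpha_nominal([[1, 1], [1, 1], [2, 2], [2, 1]])  # ~0.5333
```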


About this article


Cite this article

Lee, U., Kim, Y., Lee, S. et al. Can we Use GPT-4 as a Mathematics Evaluator in Education?: Exploring the Efficacy and Limitation of LLM-based Automatic Assessment System for Open-ended Mathematics Question. Int J Artif Intell Educ 35, 1560–1596 (2025). https://doi.org/10.1007/s40593-024-00448-4
