Can AI be a Teaching Partner? Evaluating ChatGPT, Gemini, and DeepSeek across Three Teaching Strategies

Authors

Talita de Paula Cypriano de Souza1, 2, Shruti Mehta3, Matheus Arataque Uema1, Luciano Bernardes de Paula2, Seiji Isotani3, 4

Institutions

1 University of São Paulo (USP)
2 Federal Institute of Education, Science and Technology of São Paulo (IFSP)
3 University of Pennsylvania (UPENN)
4 Nucleus of Excellence in Social Technologies (NEES) of the Federal University of Alagoas (UFAL)

Infographic

Project overview

Figure 1. Project overview

Abstract

There are growing promises that Large Language Models (LLMs) can support students’ learning by providing explanations, feedback, and guidance. Despite their rapid adoption and widespread attention, however, there is still limited empirical evidence regarding the pedagogical skills of LLMs. This article presents a comparative study of three LLMs, ChatGPT, DeepSeek, and Gemini, acting as teacher agents. An evaluation protocol was developed around three pedagogical strategies: Examples, Explanations and Analogies, and the Socratic Method. The evaluations were conducted with six human judges in the context of teaching the C programming language to beginners. The results indicate that the models exhibited very similar interaction patterns under the Examples and the Explanations and Analogies strategies. Under the Socratic Method, however, the models showed greater sensitivity to the pedagogical strategy and to the initial prompt. Overall, ChatGPT and Gemini received higher scores, whereas DeepSeek scored lower across the criteria, indicating differences in pedagogical performance across models.

Results

Examples approach

In the “Relevance” criterion, ChatGPT outperformed DeepSeek and Gemini, which showed similar results. For “Abstract-concrete concepts”, Gemini achieved the highest scores. In the “Correctness” and “Level of Detail” criteria, all three models performed satisfactorily, while “Variety” received the lowest scores across all models. Regarding the “Providing immediate solutions” criterion, Gemini obtained the best results. Overall, according to the judges’ perception, DeepSeek was rated “partially satisfactory”, while ChatGPT and Gemini were considered “satisfactory”.

Results of the Examples Approach

Figure 2. Evaluation of pedagogical skills using the Examples approach. Asterisks mark criteria that showed statistically significant differences in the analysis.

Explanations and Analogies approach

ChatGPT outperformed DeepSeek and Gemini in the “Clarity, consistency and ease”, “Critical parts focus”, and “Usefulness” criteria, receiving the highest scores. All three models performed satisfactorily on the “Correctness” criterion. DeepSeek obtained the lowest scores on the “Level adaptation” and “Provided immediate solutions” criteria. Overall, based on the judges’ perception, ChatGPT and Gemini were mostly rated as “satisfactory”, whereas DeepSeek was considered “partially satisfactory”.

Results of Explanations and Analogies Approach

Figure 3. Evaluation of pedagogical skills using the Explanations and Analogies approach. Asterisks mark criteria that showed statistically significant differences in the analysis.

Socratic Method approach

In the “Initial question” criterion, Gemini received the lowest scores, whereas ChatGPT and DeepSeek performed more satisfactorily. For the “Counterexamples”, “Questions only”, “Well-formulated questions”, “Critical thinking promotion”, and “Provided immediate solutions” criteria, DeepSeek obtained the lowest scores, while ChatGPT and Gemini performed better. Overall, consistent with the other two pedagogical approaches, the judges’ perception rated ChatGPT and Gemini as “satisfactory”, while DeepSeek was considered “partially satisfactory”.

Results of Socratic Method Approach

Figure 4. Evaluation of pedagogical skills using the Socratic Method approach. Asterisks mark criteria that showed statistically significant differences in the analysis.

Acknowledgments

We would like to thank Luis Henrique Hergesel Lima, Rafael Mansur, and João Boaretto for their valuable contributions to the research presented in this paper.

Repository structure