Authors: Matthew Fredrik, Akhil Krishnan, Sanjay Mohan Kumar, Stefan Popescu
With the increasing popularity and accessibility of AI chatbots, models like ChatGPT and DeepSeek have emerged as prominent tools for users across diverse domains. While these tools offer extensive capabilities, they are not infallible. Specifically, we focus on ChatGPT-4 Turbo and DeepSeek Chat 1.5, evaluating which model performs more reliably across a range of user tasks and metrics.
The rise of large language models (LLMs) has sparked significant interest in developing AI systems capable of answering any query accurately and efficiently. While ChatGPT and DeepSeek have demonstrated impressive capabilities, their architectures, training paradigms, and performance can vary significantly. This project aims to evaluate and compare these models to better understand their individual strengths, limitations, and potential future developments in the field of generative AI.
We utilize a large Kaggle dataset containing a series of queries posed to both ChatGPT and DeepSeek. Each query entry includes several evaluation metrics:
- Number of tokens used
- Response time
- User-generated rating
- Response accuracy
- Additional metadata
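As a first step, the per-model metrics can be summarized with pandas. The column names below are hypothetical stand-ins for the actual Kaggle schema, and the rows are toy values for illustration only:

```python
import pandas as pd

# Toy rows with hypothetical column names -- the real Kaggle dataset's
# schema and values will differ.
df = pd.DataFrame({
    "model": ["chatgpt", "deepseek", "chatgpt", "deepseek"],
    "tokens": [120, 95, 300, 310],
    "response_time_s": [1.2, 0.9, 2.5, 2.7],
    "user_rating": [4, 5, 3, 4],
    "accuracy": [0.9, 0.95, 0.7, 0.8],
})

# Compare per-model means for each evaluation metric.
summary = df.groupby("model")[["tokens", "response_time_s",
                               "user_rating", "accuracy"]].mean()
print(summary)
```

A groupby of this shape gives a quick side-by-side baseline before any modeling.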
To analyze this dataset, we take the following approach:
- Attribute Weighting: Metrics such as accuracy will be given more importance than secondary metrics like response time.
- Supervised Learning: Train a K-Nearest Neighbors (KNN) classifier on part of the dataset, applying Principal Component Analysis (PCA) for dimensionality reduction, to predict response quality on the held-out test set.
- Unsupervised Learning: Apply clustering algorithms to group similar queries and identify whether certain types of queries are better handled by one model over the other.
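The supervised and unsupervised steps above can be sketched with scikit-learn. The data here is synthetic (random features with a label derived from one feature), standing in for the real query metrics:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for per-query metrics (tokens, response time, rating, ...).
X = rng.normal(size=(200, 5))
# Hypothetical binary "good response" label tied to the first feature.
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: standardize, project onto 2 principal components, then KNN.
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=2),
                    KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)

# Unsupervised: cluster queries into groups with similar metric profiles.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(f"KNN test accuracy: {acc:.2f}, cluster sizes: {np.bincount(clusters)}")
```

Running PCA inside a pipeline keeps the standardization and projection fitted only on the training split, which avoids leaking test-set statistics into the classifier.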
This multifaceted approach allows us to assess not only which model performs better overall, but also under what circumstances each model may excel or fail.
- Overall Performance: How do ChatGPT and DeepSeek compare in terms of overall performance metrics (e.g., accuracy or user satisfaction scores)? Which model performs better on average, and is this difference statistically significant?
- Performance by Query Type: Do the models have varying performance across different categories or types of queries? For example, does one model excel in certain topics or tasks while the other struggles, or vice versa?
- Shared Strengths/Weaknesses: Are there patterns in how the models perform on the same queries? Do they tend to succeed on the same queries and fail on the same queries (indicating certain queries are inherently easy or hard for both), or does one model often succeed where the other fails?
- Impact of Query Characteristics: How do characteristics of the queries (such as query length or complexity) affect each model’s performance? Are differences between ChatGPT and DeepSeek correlated with these query attributes (e.g., do longer or more complex queries favor one model over the other)?
- Hidden Patterns in Data: Can we discover any hidden trends or groupings in the dataset using unsupervised techniques? For instance, by applying clustering or PCA, can we identify clusters of interactions with similar outcomes (such as a cluster of queries where one model consistently outperforms the other, or clusters corresponding to distinct difficulty levels)?
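For the significance question above, one option is a paired t-test on per-query scores, since both models answer the same queries. The scores below are simulated for illustration, not taken from the dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated per-query accuracy scores for the two models on the same queries;
# deepseek is given a small built-in advantage purely for demonstration.
chatgpt_scores = rng.normal(loc=0.82, scale=0.05, size=100)
deepseek_scores = chatgpt_scores + rng.normal(loc=0.03, scale=0.04, size=100)

# Paired t-test: is the mean per-query difference significantly nonzero?
t_stat, p_value = stats.ttest_rel(deepseek_scores, chatgpt_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Pairing on queries controls for per-query difficulty, so the test compares the models rather than the mix of easy and hard queries each happened to receive.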
The following studies have conducted similar comparative evaluations:
- Jiang, Q., Gao, Z., & Karniadakis, G. E. (2025). DeepSeek vs. ChatGPT vs. Claude: A Comparative Study for Scientific Computing and Scientific Machine Learning Tasks. arXiv:2502.17764
- Manik, M. M. H. (2025). ChatGPT vs. DeepSeek: A Comparative Study on AI-Based Code Generation. arXiv:2502.18467
- Shakya, R., Vadiee, F., & Khalil, M. (2025). A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks. arXiv:2503.13549
These works provide a valuable foundation and context for our study, particularly in specific application domains such as programming and scientific computation.
To create a reproducible environment for this project, we use Conda and the provided environment.yml file.
If you don't already have Conda installed, install Miniconda or Anaconda.
git clone https://github.com/San68bot/beyond-the-prompt.git
cd beyond-the-prompt
conda env create -f environment.yml
conda activate beyond-the-prompt
python -m ipykernel install --user --name=beyond-the-prompt
conda env update --file environment.yml --prune