Assistant Professor in Psychology at Stony Brook University. I’m interested in how people interact with LLMs and the impact they might have on our psychology.
Incredibly excited to announce I’ll be starting as an Asst Professor in the Psychology Department at Stony Brook this fall!
I’ll also be recruiting students this year so let me know if you know any students who might be interested!
New preprint: we evaluated LLMs in a 3-party Turing test (participants speak to a human & AI simultaneously and decide which is which).
GPT-4.5 (when prompted to adopt a humanlike persona) was judged to be the human 73% of the time, suggesting it passes the Turing test (🧵)
New Preprint: People cannot distinguish GPT-4 from a human in a Turing test.
In a pre-registered Turing test we found GPT-4 is judged to be human 54% of the time.
On some interpretations this constitutes the most robust evidence to date that any system passes the Turing test 🧵
Timed open book exams reflect exactly this pressure though. Knowing enough and knowing how to access the rest quickly is the optimal balance.
Also, this was exactly my experience working in software engineering. Memorising obscure documentation/syntax is pointless.
A few (late) life updates
- I defended my PhD!
- I’ve started a postdoc (at UCSD) on persuasion and deception in LLMs.
- I’ve (confusingly) moved to New York! Let me know if you’re here and want to hang out or know anyone who’s looking for a roommate.
So do LLMs pass the Turing test? We think this is pretty strong evidence that they do. People were no better than chance at distinguishing humans from GPT-4.5 and LLaMa (with the persona prompt). And 4.5 was even judged to be human significantly *more* often than actual humans!
Can you tell the difference between a human and an AI? Really excited to share turingtest.live, a site where you can play the Turing Test with GPT-4. You get randomly matched with either a human or an AI, and you have 5 minutes to decide which it is.
How much does language help you to reason recursively about other people’s beliefs (e.g. “I know that you think that Alice hates Bob”)? A 2015 study found that people can do this up to 7 levels! We replicated this with humans and found GPT-3 is accurate up to ~4.
People judged GPT-4 to be human 54% of the time, compared to 22% for ELIZA and 67% for humans. The implication is that people are at chance in determining that GPT-4 is an AI, even though the study is powerful enough to detect differences from 50% accuracy.
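As a rough illustration of the "at chance" reasoning, here is a minimal sketch of an exact two-sided binomial test against 50%. The trial count (100) and the resulting counts (54 and 22) are hypothetical round numbers chosen to mirror the percentages, not the study's actual data, and the helper name `binom_two_sided_p` is made up for this example.

```python
from math import comb


def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for k successes in n trials.

    Sums the probability of every outcome that is no more likely
    than the observed one under the null success rate p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small tolerance so floating-point ties count as "no more likely".
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts out of 100 judgments (NOT the study's real sample size).
print(binom_two_sided_p(54, 100))  # a 54% rate: compare against 0.05
print(binom_two_sided_p(22, 100))  # a 22% rate: compare against 0.05
```

With these made-up counts, a 54% rate is not distinguishable from 50%, while a 22% rate clearly is. The real analysis depends on the actual number of games and the pre-registered tests reported in the preprint.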
In previous work we found GPT-4 was judged to be human ~50% of the time in a 2-party Turing test, where ppts speak to *either* a human or a model.
This is probably easier for several reasons. Here we ran a new study with Turing's original 3-party setup:
arxiv.org/abs/2503.23674
Initial results from turingtest.live! Humans were correctly classed as humans only 65% of the time. The best-performing GPT-4 prompt fooled humans around 39% of the time. ELIZA performs significantly worse at around 25%, but still better than GPT-3.5 at only 5%!
Reasonable people will disagree about what constitutes passing the TT. I think the more urgent implication of these findings is that people cannot reliably determine whether current AI models are human after a 5 minute conversation dedicated to figuring this out.