Patrick J. Burns

Text Analysis for Historical Language Research

2026-05-19T00:00:00+00:00

Taught at the Institute for the Study of the Ancient World (ISAW-GA 3023), Fall 2026. The course meets once a week in a 3-hour block.

Course Description

This course introduces students to computational research methods helpful for producing data-driven scholarship involving large collections of historical-language text. Drawing on relevant topics in exploratory data science, corpus linguistics, and natural language processing, the course provides a forum for students to develop hands-on skills in computer programming (using Python), focused primarily on managing textual data, string manipulation, text mining and analysis, language modeling, and data visualization. Special attention will be given to the use of word embeddings, transformer models, and large language models and their applicability to historical-language text collections. Demonstrations throughout the course will draw primarily on English-language examples, but because of the philological range and diversity at ISAW, students are encouraged to work with digitized text collections in the languages most relevant to their research. There are no prerequisites, though students are expected to be open to reading, writing, and editing computer programs; students are required to bring laptops to class. Note that historical-language text for the purpose of this course covers texts or collections of texts written before the Early Modern period.

RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

2026-04-23T00:00:00+00:00

Abstract for RespondeoQA paper on arXiv.
Written with M. Hudspeth and B. O’Connor.

Abstract

We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models – LLaMa 3, Qwen QwQ, and OpenAI’s o3-mini – finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa 3 and o3-mini are more task dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA

Is Agentic Philology an Oxymoron? Some Thoughts on Error, Control, and Disciplinary Definition

2026-03-13T00:00:00+00:00

Abstract for talk at AI & the Study of Antiquity, Rutgers University Classics Department. New Brunswick, NJ.

Abstract

In this talk, I look at the Latin text content found in large language model (LLM) training data repositories and specifically the high rates of corrupted text resulting from decades of OCR errors and other scanning artifacts. I argue here that the extent of the corruption (measured in the billions of errors) is beyond human scale, perhaps even testing the limits of smaller model-based correction approaches, and that we should therefore consider the possibility of an “agentic philology.” That is, we should look at the possibility of using AI agents to perform computational-scale text-critical work on these collections in such a way that we take advantage of what Stuart Russell and Peter Norvig (2021, p. 3) have identified as the five defining characteristics of agents, namely the ability to “operate autonomously, perceive their environment, persist over a prolonged period of time, adapt to change, and create and pursue goals.” I discuss the ways in which such human-out-of-the-loop approaches do not easily align with existing philological expectations of authority and control in the face of error and call for more critical discussion of novel agentic methods in the face of computational-scale error correction challenges.

References

Russell, S. and Norvig, P. (2021). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.

Recovering 34 Billion Latin Words from AI Training Data: Or Philology’s Collaborative Demands at Computational Scale

2025-12-01T00:00:00+00:00

Abstract for paper delivered at SCS2026.
Written with D. Bamman, C. Brooks, M. Hudspeth, and B. O’Connor.

Keywords

Latin philology, computational thinking, large language models

Abstract

When researchers in artificial intelligence (Langlais, Stasenko, and Arnett 2024) release a text repository advertising 34 billion Latin tokens—a number over 5,000 times larger than a comprehensive repository of canonically classical Latin like the Perseus Digital Library—how are philologists supposed to assess the contents of such a collection? How are Latinists expected to use such a collection? The number is so outstandingly large relative to other Latin collections that, as we argue in this talk, it requires collaboration with colleagues in computer science and information science to even have an entry point into the question (Crane et al. 2014). This talk will describe the Latin content of an LLM training data repository, with particular attention to how we navigate its almost 150GB of files and how we deal with the massive amount of textual corruption, high rates of duplicates, and other concerns raised when working with these volumes. Researchers have shown that, in spite of such concerns, similarly large data can be used to train state-of-the-art Latin language models (Bamman/Burns 2020; Riemenschneider/Frank 2023; Hudspeth, Burns, and O’Connor 2025). Researchers have also shown the basic value of large-scale quantitative description and assessment of available Latin textual resources (Bamman/Smith 2012; Burns 2023; Hudspeth, O’Connor, and Thompson 2024). Still, more exploration is needed to understand the trade-off of quantity versus quality for the language. Accordingly, we discuss ways to filter low-quality, “noisy” texts and entertain ideas about at-scale OCR correction and related computational mitigations on the remaining texts (Smith/Cordell 2023; Cowen-Breen et al. 2023), as well as prospects for enriching these corpora with metadata (e.g. author, genre, or historical period), which could aid deeper philological investigation. 
In undertaking this study, we further take the following interdisciplinary position: working with billions of Latin words is an intellectual endeavor that requires both philological method and computational method, philological thinking and computational thinking (Wing 2006). It should relate to the broader need to approach machine learning data collection through a sociocultural archival lens (Jo/Gebru 2020; Desai et al. 2024), joining other work on characterizing implicit or undocumented data curation decisions behind web-based LLM training data and available models (Dodge et al. 2021; Soldaini et al. 2024). In this respect, the talk sets an agenda for computational philology in our current LLM-focused environment.
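
One of the simpler mitigations mentioned above, handling high rates of duplicates, can be illustrated with a minimal sketch. It assumes exact-duplicate detection after light normalization, which is only one possible approach and not necessarily the one used in the study; the function names `normalize` and `deduplicate` and the sample passages are hypothetical.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace so that
    trivially different copies of a passage compare as equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", no_marks.lower()).strip()


def deduplicate(docs: list[str]) -> list[str]:
    """Keep only the first occurrence of each normalized document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical sample: the second item repeats the first with different spacing.
docs = [
    "Gallia est omnis divisa in partes tres",
    "Gallia  est omnis\ndivisa in partes tres",
    "Arma virumque cano",
]
unique_docs = deduplicate(docs)
```

Hashing the normalized text keeps memory use small on large collections, but it only catches copies that are identical after normalization; OCR variants of the same passage would need a fuzzier comparison.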

Works Cited

Bamman, D., and Burns, P.J. 2020. “Latin BERT: A Contextual Language Model for Classical Philology.” arXiv. http://arxiv.org/abs/2009.10053.
Bamman, D., and Smith, D. 2012. “Extracting Two Thousand Years of Latin from a Million Book Library.” Journal on Computing and Cultural Heritage (JOCCH) 5(1): 2:1-2:13.
Burns, P.J. 2023. “Research Recap: How Much Latin Does ChatGPT ‘Know’?” ISAW Library Blog. https://isaw.nyu.edu/library/blog/research-recap-how-much-latin-does-chatgpt-know.
Cowen-Breen, C., Brooks, C., Haubold, J., and Graziosi, B. 2023. “Logion: Machine Learning for Greek Philology.” arXiv. http://arxiv.org/abs/2305.01099.
Crane, G., Almas, B., Babeu, A., Cerrato, L., Krohn, A., Baumgart, F., Berti, M., Franzini, G., and Stoyanova, S. 2014. “Cataloging for a Billion Word Library of Greek and Latin.” In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. DATeCH ’14. New York, NY, USA: Association for Computing Machinery. 83–88. https://dl.acm.org/doi/10.1145/2595188.2595190.
Desai, M.A., Pasquetto, I.V., Jacobs, A.Z., and Card, D. 2024. “An Archival Perspective on Pretraining Data.” Patterns 5(4): 100966.
Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. 2021. “Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus.” arXiv. http://arxiv.org/abs/2104.08758.
Hudspeth, M., O’Connor, B., and Thompson, L. 2024. “Latin Treebanks in Review: An Evaluation of Morphological Tagging Across Time.” In Pavlopoulos, J., et al. eds. Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024). Hybrid in Bangkok, Thailand and online: ACL. 203–18. https://aclanthology.org/2024.ml4al-1.21/.
Hudspeth, M., Burns, P.J., and O’Connor, B. 2025. “Contextual Morphologically-Guided Tokenization for Pretrained Latin BERT Models.” Under review at Association for Computational Linguistics.
Jo, E.S., and Gebru, T. 2020. “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning.” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. FAT* ’20. New York, NY, USA: Association for Computing Machinery. 306–16. https://dl.acm.org/doi/10.1145/3351095.3372829.
Langlais, P.-C., Stasenko, A., and Arnett, C. 2024. “Releasing the Largest Multilingual Open Pretraining Dataset.” Hugging Face. November 13. https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open.
Riemenschneider, F., and Frank, A. 2023. “Exploring Large Language Models for Classical Philology.” arXiv. http://arxiv.org/abs/2305.13698.
Smith, D., and Cordell, R. 2023. “Textual Criticism as Language Modeling.” In Going the Rounds: Virality in Nineteenth-Century American Newspapers. Minneapolis, MN: U. of Minnesota Press. https://manifold.umn.edu/read/untitled-883630b9-c054-44e1-91db-d053a7106ecb/section/ea1f849a-bac1-4e9d-85f4-149d0083a6a4.
Soldaini, L., et al. 2024. “Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.” arXiv. http://arxiv.org/abs/2402.00159.
Wing, J.M. 2006. “Computational Thinking.” Communications of the ACM 49(3): 33–35.

How to Read Latin Like a Computer: The Philology of Latin Word Sense Disambiguation

2025-10-20T00:00:00+00:00

Abstract for workshop at TLL on October 20.

Abstract

Cum/with vs. cum/when. Levis/light vs. levis/smooth. Ius/law vs. ius/broth. Homographs present a vocabulary challenge to the emergent Latin reader. They also continue to challenge accurate computational “reading” of the Latin language, affecting natural language processing (NLP) tasks such as lemmatization, part-of-speech tagging, and syntactical parsing, among others. So too, word sense disambiguation. As W. G. Hale writes in The Art of Reading Latin about the student experience of seeing the word ut: “How will I translate it? There are some half-dozen or more ‘meanings’: which does it have here?” Again a processing challenge to both the human and the computer.

This presentation, drawn from my work-in-progress book project How to Read Latin Like a Computer, looks at the problem of word sense disambiguation from a philological and lexicographical perspective, inviting such questions as: What are the strategies that people use to read Latin and especially to understand ambiguous lexical situations encountered while reading? What are the strategies that computer models use to “read” (that is, to process) Latin, and in particular to “understand” Latin semantics? And what can we learn about one from the other?

In this talk, I will briefly review important concepts from the history of NLP-driven word sense disambiguation, including the Lesk algorithm, the “one sense per discourse” approach, and corpus-derived clustering, to name a few. I will also cover the state of Latin NLP, with particular attention to distributional approaches to Latin semantics. The goal of the talk is to get feedback from Latin philologists and lexicographers on how a comparative—i.e. human vs. computational—approach to “reading” Latin can be applied to the hundreds of millions, if not billions, of words of Latin available online that have yet to be systematically curated, classified, and catalogued.
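
As an illustration of the first of these concepts, here is a minimal sketch of a simplified Lesk-style disambiguator: it picks the sense whose gloss shares the most words with the context. The function name `simplified_lesk` and the bag-of-words "glosses" for ius are hypothetical stand-ins invented for this example, not entries from any real lexicon.

```python
def simplified_lesk(context: str, glosses: dict[str, str]) -> str:
    """Return the sense label whose gloss overlaps most with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = "", -1
    for sense, gloss in glosses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical glosses: small bags of words typical of each sense.
IUS_GLOSSES = {
    "ius/law": "lex iudex civis legibus civitas iudicium",
    "ius/broth": "cena coquus piscis sal olla coquere",
}

legal_sense = simplified_lesk("iudex ius civile legibus interpretatur", IUS_GLOSSES)
kitchen_sense = simplified_lesk("coquus piscis in ius mittit", IUS_GLOSSES)
```

Because the comparison is on exact word forms, a highly inflected language like Latin would in practice need lemmatization first; the sketch skips that step to keep the overlap logic visible.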

Contextual Morphologically-guided Tokenization for Latin Encoder Models

2025-10-20T00:00:00+00:00

Abstract for arXiv paper on morphologically-guided tokenization for Latin.
Written with M. Hudspeth and B. O’Connor.

Abstract

Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources – a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out-of-domain texts, highlighting our models’ improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.
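
The contrast between fertility and morphological alignment can be made concrete with a small sketch. It assumes fertility means the average number of subword tokens per word and measures alignment as the share of gold morpheme boundaries a segmentation recovers; the segmentations below are hypothetical, hand-written examples rather than output from any of the paper's tokenizers.

```python
def fertility(segmented_words: list[list[str]]) -> float:
    """Average number of subword tokens per word."""
    return sum(len(pieces) for pieces in segmented_words) / len(segmented_words)


def boundaries(pieces: list[str]) -> set[int]:
    """Character offsets at which a word is split."""
    offsets, position = set(), 0
    for piece in pieces[:-1]:
        position += len(piece)
        offsets.add(position)
    return offsets


def boundary_recall(predicted: list[list[str]], gold: list[list[str]]) -> float:
    """Share of gold morpheme boundaries that the predicted splits recover."""
    hits = total = 0
    for pred_pieces, gold_pieces in zip(predicted, gold):
        gold_offsets = boundaries(gold_pieces)
        hits += len(gold_offsets & boundaries(pred_pieces))
        total += len(gold_offsets)
    return hits / total if total else 0.0


# Hypothetical segmentations of "puellarum" and "amabat".
compression_driven = [["puella", "rum"], ["amab", "at"]]
morphological = [["puell", "arum"], ["ama", "ba", "t"]]

low_fertility = fertility(compression_driven)
higher_fertility = fertility(morphological)
alignment = boundary_recall(compression_driven, morphological)
```

In this toy case the compression-driven split has the lower fertility yet recovers none of the morpheme boundaries, which is the kind of trade-off the abstract describes.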

Ciceronianus/Christianus/Other: Experimenting with a Multilabel Classification Approach to Latin Intertextuality Using Jerome’s Letters

2025-10-17T00:00:00+00:00

Abstract for paper at ‘Zitieren Als Narrative Strategie Formen, Funktionen Und Methoden Von Referentialität im Werk Des Kirchenlehrers Hieronymus’, Universität Konstanz. October 17.

Abstract

This paper presents experiments in measuring intertextuality in Latin texts using document classification with the goal of lexicon induction. Lexicon construction allows us to assign every word in the collection a source-specific weight that can help us measure and map intertextual patterns. For this set of experiments, I use a simple word-based text classification approach to distinguish between the works of Cicero and the Vulgate, inducing a Cicero-Vulgate lexicon based on the feature importance from the classifier. From this experiment we learn not only that certain words like res, senatus, and natura are strongly associated with Cicero texts and that words like dominus, deus, and anima are strongly associated with the Vulgate; more importantly, we get relative weights for all words in the vocabulary. Through these weights, we can do something more than detect that allusive activity may be present in a text; we can identify and illustrate changing patterns of intertextuality throughout a text. Moreover, I argue here that this approach, by changing the source texts and varying the number of classification labels, offers a novel way to read and analyze intertextuality in Latin literature.
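
The idea of inducing a weighted lexicon from a two-source classifier can be sketched as follows. The abstract does not specify the classifier, so this sketch swaps in the smoothed per-word log-probability ratio that a multinomial Naive Bayes model would use as evidence; the function names and the tiny sample "corpora" are hypothetical.

```python
import math
from collections import Counter


def induce_lexicon(docs_a: list[str], docs_b: list[str], alpha: float = 1.0) -> dict[str, float]:
    """Weight each word by log P(word | A) - log P(word | B) with add-alpha
    smoothing. Positive weights lean toward source A, negative toward B."""
    counts_a = Counter(w for d in docs_a for w in d.lower().split())
    counts_b = Counter(w for d in docs_b for w in d.lower().split())
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    return {
        w: math.log((counts_a[w] + alpha) / total_a)
        - math.log((counts_b[w] + alpha) / total_b)
        for w in vocab
    }


def windowed_scores(text: str, lexicon: dict[str, float], size: int = 3) -> list[float]:
    """Sum word weights over consecutive windows to trace how the balance
    between the two sources shifts through a text (unseen words count as 0)."""
    words = text.lower().split()
    return [
        sum(lexicon.get(w, 0.0) for w in words[i:i + size])
        for i in range(0, len(words), size)
    ]


# Hypothetical stand-ins for the two source collections.
cicero_like = ["res publica senatus populusque", "natura rerum et res publica"]
vulgate_like = ["dominus deus anima mea", "et dixit dominus deus"]

lexicon = induce_lexicon(cicero_like, vulgate_like)
profile = windowed_scores("senatus et natura dominus deus anima", lexicon)
```

Here `profile` holds one score per three-word window, so a text that moves from Ciceronian to scriptural vocabulary shows up as a shift from positive to negative values. Adding further sources would mean one weight per word per label rather than a single signed weight.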

Rebuilding the Library of Al3xandr!a: Latin Post-OCR Correction as a Philological Task

2025-09-01T00:00:00+00:00

Abstract for paper at Digital Neo-Latin Studies: Ideas and Perspectives, University of Aarhus & Centre for Danish Neo-Latin. September 24.

Abstract

The “unfathomable” training data used to train large language models (LLMs) contains a similarly unfathomable amount of Latin text, much of which derives from scanned volumes of Neo-Latin text. The recently released common_corpus from Pleias advertises 34 billion Latin tokens (Langlais, Stasenko, and Arnett 2024), a number over 5,000 times larger than the Perseus Digital Library and over 100 times larger than the Corpus Corporum; but on closer inspection, due to the extremely variable quality of optical character recognition (OCR), we are as likely to find in this collection running text like eadem dicta esse repertum sit as we are to find fim ego exfâ‚¬r$mJkm)iMdem (Bamman n.d.). This paper will analyze the Latin (and Latin-ish) content of common_corpus with an eye toward the degree to which LLM-assisted post-OCR correction (cf. e.g. Thomas, Gaizauskas, and Lu 2024) can be used to recover corrupted Latin text from these modern-day Libraries of Alexandria (Kahle 2021). If a defining activity of philology can be considered, as James Zetzel (2015) writes, “reconstructing what [was] written rather than enshrining or embalming the errors transmitted,” post-OCR correction may turn out to be, I argue here, a defining (computational) philological activity for Latin in the LLM era.
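
To make the contrast between the two sample strings concrete, here is a minimal sketch of a character-level plausibility check that could serve as a first-pass triage step before any correction is attempted. The heuristic, the function name `plausible_share`, and the 0.9 threshold are assumptions made for illustration; they are not the method of the paper, and a real pipeline would likely rely on wordlists or language models instead.

```python
import string


def plausible_share(line: str) -> float:
    """Share of tokens that look like ordinary words: letters only, with
    conventional casing (all lowercase, Titlecase, or ALL CAPS)."""
    tokens = [t.strip(string.punctuation) for t in line.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    plausible = sum(
        1
        for t in tokens
        if t.isalpha() and (t.islower() or t.istitle() or t.isupper())
    )
    return plausible / len(tokens)


THRESHOLD = 0.9  # hypothetical cut-off for flagging a line as noisy

clean_line = "eadem dicta esse repertum sit"
noisy_line = "fim ego exfâ‚¬r$mJkm)iMdem"

clean_is_noisy = plausible_share(clean_line) < THRESHOLD
noisy_is_noisy = plausible_share(noisy_line) < THRESHOLD
```

Note that the check accepts fim as a plausible token even though it is probably a misreading of sim, which is exactly the kind of error a surface heuristic cannot see and that correction in context would have to address.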

Works Cited

Bamman, D. n.d. “11K Latin Texts.” http://www.cs.cmu.edu/~dbamman/latin.html.
Kahle, B. 2021. “I Set Out to Build the Next Library of Alexandria. Now I Wonder: Will There Be Libraries in 25 Years?” Time. Oct. 22. https://time.com/6108581/internet-archive-future-books/.
Langlais, P.-C., Stasenko, A., and Arnett, C. 2024. “Releasing the Largest Multilingual Open Pretraining Dataset.” Hugging Face. Nov. 13. https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open.
Thomas, A., Gaizauskas, R., and Lu, H. 2024. “Leveraging LLMs for Post-OCR Correction of Historical Newspapers.” In Sprugnoli, R. and Passarotti, M. eds. Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024: 116–21. https://aclanthology.org/2024.lt4hala-1.14.
Zetzel, J.E.G. 2015. “The Bride of Mercury: Confessions of a ’Pataphilologist.” In Pollock, S. ed. World Philology. Cambridge, MA: Harvard University Press. 45–62.

The Role of ‘Small’ Models for Ancient NLP in a World of Large Language Models

2025-04-08T00:00:00+00:00

Abstract for invited talk at ALP2025.

Abstract

In the field of Latin natural language processing, there are tasks for which competitive, if not state-of-the-art, performance is exhibited by large language models like GPT-4o or Claude. Yet as opposed to modern English (for which that statement may also be arguably true), there are some Latin NLP tasks like coreference resolution or automatic question generation for which work on smaller, task-specific models is either just underway or does not yet exist. LLMs in this case have “skipped” steps on a path of continuous development and improvement. I argue in this talk that, while we should take advantage of such LLM advancements in Latin NLP, some significant part of our attention should also be directed backwards toward filling in these skipped steps. By returning to and focusing again on “small” language models (including everything from rigorously evaluated and field-tested static embeddings models to the last iterations of smaller LLMs like BERT models, most especially those with custom task-specific heads), we can promote a culture of interpretable and explainable philology: interpretable, following Russell and Norvig (Artificial Intelligence 4th ed. [2021], pp. 711-12), because these smaller models (from their training data to their configuration and parameterization) can be directly inspected, and explainable because such models allow us to maintain an understanding of how specific outputs result from specific inputs. In sum, I argue that, although LLMs will serve (and serve well) short-term interests in ancient-language NLP, we should redouble our efforts, through data curation, through attention to model parameterization, and through competitive evaluation (like shared tasks), to develop smaller models equally up to our biggest language challenges. While my talk will use Latin as its ancient-language focus, the conclusion will discuss ways to adapt lessons learned to other ancient languages, such as Ancient Greek, Akkadian, and Middle Egyptian, among others.

The Digital Afterlife of a Dead Language: Or Recovering 34 Billion(!) Latin Words from AI Training Data

2025-02-10T00:00:00+00:00

Abstract for public lecture at Taft Center for Humanities, University of Cincinnati.

Abstract

Latin has been a perhaps unexpected beneficiary of recently published large language model (LLM) training datasets. For example, an artificial intelligence firm just released a text repository advertising 34 billion Latin tokens, a number over 5,000 times larger than a comprehensive repository of canonically classical Latin like the Perseus Digital Library. The number is so outstandingly large relative to other Latin collections (“unfathomable” in the parlance of AI critique) that it demands a fuller accounting of what it means for humanities scholars to work with such collections, leading us to ask questions like: What novel methods are necessary to explore such a library? How do we handle the massive amount of textual corruption found in these volumes? What tools and models can we build, and build responsibly, with that amount of textual data? In this talk, I will bring in threads from natural language processing, cryptography, and textual criticism, among other disciplines, to redefine philology at scale for our computational, LLM-inflected moment. While the presentation will lead with examples from Latin texts, the talk invites humanities scholars working in any language or literature to reflect on how issues of training data quantity and quality affect their areas of research.