Code and data to support the following paper:
Alisha Srivastava*, Emir Korukluoglu*, Minh Nhat Le*, Duyen Tran, Chau Minh Pham, Marzena Karpinska, Mohit Iyyer, "OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature"
This repository contains the pipelines needed to create a multilingual aligned dataset, along with the scripts needed to evaluate LLMs on cross-lingual knowledge transfer and memorization.
Read our paper at https://arxiv.org/abs/2505.22945
- Hypothesis: Large language models (LLMs) memorize the content of translated books.
- Follow-up Question: Do models perform better in English when the original work was written in Turkish, Spanish, or Vietnamese?
- Hypothesis: LLMs can transfer their memorization across languages.
- Follow-up Question: Can LLMs memorize translations into languages not present in their pre-training dataset, and will their performance remain strong for out-of-distribution languages?
- Hypothesis: LLMs can transfer their knowledge across modalities.
- Follow-up Question: Can LLMs recall memorized texts when the excerpts are presented in another modality, such as audio?
Team: Alisha Srivastava, Nhat Minh Le, Emir Korukluoglu, Duyen Tran, Chau Minh Pham, Marzena Karpinska, Mohit Iyyer
- Chau Minh Pham - for guiding our research and serving as our research mentor.
- Dr. Marzena Karpinska - for guiding our research and for her invaluable expertise.
- Dr. Mohit Iyyer - for guiding our research and serving as our research advisor.
- Collect book texts from Project Gutenberg and online sources.
- Extract excerpts in each language, ensuring they contain full sentences and at least one named entity.
- Clean the text of metadata and align excerpts across four languages.
- Retain excerpts that pass length checks, contain only one named entity, and pass alignment verification.
- Using Microsoft Translator, translate the English data into Sesotho, Yoruba, Maithili, Malagasy, Setswana, and Tahitian (a sketch of these steps follows this list).
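A minimal sketch of the filtering and translation steps above, assuming spaCy for named-entity checks and the Microsoft Translator v3 REST API; the file name, length bounds, entity label, key/region placeholders, and language codes are illustrative assumptions, not the repository's exact configuration.

```python
# Hypothetical sketch of the excerpt filtering + translation steps; the real
# pipeline scripts may differ. Assumed setup:
#   pip install spacy requests && python -m spacy download en_core_web_sm
import requests
import spacy

nlp = spacy.load("en_core_web_sm")

def passes_filters(text: str, min_chars: int = 200, max_chars: int = 600) -> bool:
    """Length check plus a named-entity check, mirroring the steps above.
    The bounds and the PERSON label are illustrative assumptions."""
    if not (min_chars <= len(text) <= max_chars):
        return False
    return any(ent.label_ == "PERSON" for ent in nlp(text).ents)

def translate(texts, targets, key, region):
    """Translate English excerpts with the Microsoft Translator v3 REST API."""
    resp = requests.post(
        "https://api.cognitive.microsofttranslator.com/translate",
        params={"api-version": "3.0", "from": "en", "to": targets},
        headers={
            "Ocp-Apim-Subscription-Key": key,        # placeholder credentials
            "Ocp-Apim-Subscription-Region": region,
            "Content-Type": "application/json",
        },
        json=[{"Text": t} for t in texts],
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# "excerpts_en.txt" is a hypothetical one-excerpt-per-line input file.
excerpts = [line.strip() for line in open("excerpts_en.txt") if passes_filters(line.strip())]
# Best-guess Translator codes: Sesotho=st, Yoruba=yo, Maithili=mai,
# Malagasy=mg, Setswana=tn, Tahitian=ty.
print(translate(excerpts[:5], ["st", "yo", "mai", "mg", "tn", "ty"], "<KEY>", "<REGION>"))
```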
- Models Used:
- OpenAI API for GPT-4o
- vLLM for Qwen-2.5, Llama-3.1-8B, Llama-3.1-70B, Llama-3.3-70B, and quantized models (see the inference sketch after this list)
- OpenRouter for Llama-3.1-405B
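A minimal sketch of the vLLM inference path for the open-weight models (GPT-4o is queried through the OpenAI API and Llama-3.1-405B through OpenRouter); the model ID, prompt wording, and sampling settings here are assumptions, not the paper's exact setup.

```python
# Hypothetical vLLM inference sketch; prompt and sampling are illustrative.
# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")       # assumed checkpoint
params = SamplingParams(temperature=0.0, max_tokens=128)  # greedy decoding

prompt = (
    "Here is an excerpt from a novel. Name the book and its author.\n\n"
    "Excerpt: <aligned excerpt in the probed language>"
)
for out in llm.generate([prompt], params):
    print(out.outputs[0].text)
```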
- Experiment 0: Direct Probing
  - Assess accuracy based on exact and fuzzy matches.
- Experiment 1: Name Cloze Task
  - Input excerpts with masked names and evaluate exact matches.
- Experiment 2: Prefix Probing / Continuation Generation
  - Prompt models to continue provided sentences and evaluate the generated continuations.
- Experiment 3: Cross-Modality Experiments
  - Run the first three experiments on audio models (scoring helpers for all experiments are sketched after this list).
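A hedged sketch of scoring helpers for the experiments above: name-cloze masking, exact match, and fuzzy match via rapidfuzz. The similarity threshold and normalization are assumptions, not the paper's exact values.

```python
# Hypothetical scoring helpers; threshold and normalization are assumptions.
# pip install rapidfuzz
from rapidfuzz import fuzz

def mask_name(excerpt: str, name: str) -> str:
    """Build a name-cloze input by hiding the gold entity behind [MASK]."""
    return excerpt.replace(name, "[MASK]")

def exact_match(pred: str, gold: str) -> bool:
    """Score direct probing / name cloze by normalized string equality."""
    return pred.strip().lower() == gold.strip().lower()

def fuzzy_match(pred: str, gold: str, threshold: float = 80.0) -> bool:
    """Score by rapidfuzz similarity (0-100) against an assumed threshold."""
    return fuzz.ratio(pred.strip().lower(), gold.strip().lower()) >= threshold

print(mask_name("Anna Karenina boarded the train.", "Anna Karenina"))
print(exact_match("War and Peace", " war and peace "))  # True
print(fuzzy_match("War & Peace", "War and Peace"))      # True at threshold 80
```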
- Analyze the effects of model capacity and quantization.
- Examine the prevalence of well-known quotes and named entities.
- Examine how word order and syntax affect knowledge recall.
- Investigate how prefix token counts affect model performance (see the sketch after this list).
- Examine knowledge recall across modalities.
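As a sketch of how the prefix-length analysis might be set up (the tokenizer and prefix sizes are assumptions; the GPT-2 tokenizer is used only so the snippet runs without gated model access):

```python
# Hypothetical prefix-length sweep: truncate a passage at several token
# counts; each prefix would then be sent to a model for continuation.
# pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer

def prefix_at(text: str, n_tokens: int) -> str:
    """Return the passage truncated to its first n_tokens tokens."""
    ids = tok(text, add_special_tokens=False)["input_ids"]
    return tok.decode(ids[:n_tokens])

# Public-domain opening line, used purely as an example passage.
passage = ("It is a truth universally acknowledged, that a single man in "
           "possession of a good fortune, must be in want of a wife.")
for n in (5, 10, 20):
    print(n, "->", prefix_at(passage, n))
```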
For any inquiries or discussions related to this research, please contact the corresponding author: chau@umd.edu
