Call for Papers
Underserved communities often lack adequate access to advanced natural language processing (NLP) technologies due to limited linguistic data, insufficient computational resources, or inadequate AI governance frameworks. This gap hinders equitable access to NLP advancements and exacerbates the digital divide. Our workshop addresses it by fostering a multidisciplinary dialogue around the development of language models (LMs) that prioritize cultural sensitivity, resource efficiency, and sustainable AI practices. We invite researchers, practitioners, and policymakers to address these challenges and propose innovative solutions for building and deploying language models for underserved languages and communities.
Topics of Interest
We invite submissions of full papers, ongoing work, position papers, and survey papers on topics including, but not limited to:
Measuring and Governing AI
Developing reliable evaluation methods for LMs under constraints in data, compute, and expertise. How can psychometrics, auditing frameworks, or validity theory guide responsible measurement and governance?
Benchmarking and Fairness
Building inclusive benchmarks and evaluation pipelines that reduce bias, improve cultural and linguistic representation, and ensure fair performance across underserved communities.
Pluralistic Alignment
Designing approaches for aligning LMs with diverse values, cultural norms, and epistemologies, including participatory and community-driven methods.
Open and Inclusive Infrastructure
Creating open datasets, benchmarks, models, and participatory platforms that support sustainable and equitable NLP research and deployment.
Submission Guidelines
We welcome long papers (8 pages) and short papers (4 pages), excluding references. Submissions must follow the AAAI 2026 style guidelines.
Important Dates
- Submission deadline: November 24, 2025 (extended from November 14 and November 20, 2025)
- Notification of acceptance: December 12, 2025
- Camera-ready paper due: January 10, 2026
- Workshop date: January 27, 2026
Please note that all deadlines are in the Anywhere on Earth (AoE) time zone.
Submission Link
Papers should be submitted via OpenReview.
Contact Us
For inquiries, please contact the workshop organizers: lm4uc.organizers (at) gmail.com
Alternatively, you can reach us via our Discord server.
Shared Task: AI Measurement for the Public Interest
This year, we are excited to announce a shared task on “AI Measurement for the Public Interest,” organized as part of the Language Models for Underserved Communities (LM4UC) workshop at AAAI 2026. The shared task aims to foster the development of evaluation methodologies and infrastructures that prioritize the needs of underserved communities, with a focus on context-aware and institutionally grounded measurement practices.
This shared task invites participants to design and prototype evaluation workflows tailored to underserved linguistic and cultural contexts. The focus is not on optimizing model performance, but on developing measurement infrastructures that reflect how institutions, researchers, and communities actually assess and deploy language technologies under varied resource, governance, and environmental constraints. The task is organized around three complementary layers of an evaluation ecosystem:
- AI Evaluation Infrastructure and Stewardship — Where and how evaluation is conducted. This includes workflows that enable local institutions to run assessments, control access to evaluation assets, and maintain their own scoring and deployment environments.
- AI Measurement Design — What is being measured and how it is operationalized. This includes defining new evaluation dimensions, benchmarks, scoring criteria, and documentation practices that capture capabilities relevant to local use cases.
- AI Downstream Impact Assessment — How system behavior varies across populations, domains, or deployment settings. This includes methods for quantifying performance variation, robustness, or utility across communities and identifying areas needing further capability development.
These layers together support end-to-end evaluation: designing the evaluation environment, specifying meaningful constructs, and analyzing performance in real deployments. Submissions may address any layer independently or propose workflows that integrate multiple layers. The shared task welcomes contributions such as datasets, protocols, analysis pipelines, benchmark definitions, evaluation software, and institutional frameworks. Submissions will be evaluated on clarity, methodological rigor, practical feasibility, and relevance to settings where existing benchmark infrastructure is limited or mismatched to local priorities. This initiative reflects LM4UC’s broader goal of advancing scalable, context-aware measurement infrastructures that support the long-term development of language technologies beyond traditional benchmark settings.
Track 1: AI Evaluation Infrastructure & Stewardship
This track focuses on designing evaluation workflows that can be operated by local institutions rather than relying on centralized infrastructure. Submissions may include device-side evaluation, federated scoring, offline test packages, access-controlled scoring interfaces, or procedures for maintaining and updating evaluation assets over time. We are looking for clear and feasible mechanisms that allow organizations to run evaluations, control access to evaluation artifacts, and adapt workflows to their institutional constraints. Deliverables include, but are not limited to, a process card or documentation outline describing the evaluation pipeline design and usage constraints, a workflow diagram or prototype demonstrating how the evaluation is run end-to-end, and a short technical memo (about four pages) detailing assumptions, governance structure, and system requirements.
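As a purely illustrative sketch, the snippet below shows one shape a locally stewarded evaluation run could take: an offline test package is loaded, scored against a locally hosted model, and the results never leave institutional infrastructure. The file layout, the `predict` callable, and the exact-match metric are placeholder assumptions, not a required interface.

```python
"""Illustrative sketch only (Track 1): one possible offline, institution-controlled
evaluation run. File names, the `predict` callable, and the exact-match metric are
hypothetical placeholders."""
import json
from pathlib import Path
from typing import Callable


def run_local_evaluation(
    test_package: Path,              # offline test package shipped to the institution
    predict: Callable[[str], str],   # locally hosted model or API wrapper
    results_dir: Path,               # institution-controlled output location
) -> dict:
    items = json.loads(test_package.read_text(encoding="utf-8"))
    correct = 0
    for item in items:
        prediction = predict(item["prompt"])
        correct += int(prediction.strip() == item["reference"].strip())
    report = {
        "n_items": len(items),
        "exact_match": correct / max(len(items), 1),
    }
    # Results stay on institutional infrastructure; nothing leaves the site.
    results_dir.mkdir(parents=True, exist_ok=True)
    (results_dir / "report.json").write_text(json.dumps(report, indent=2))
    return report
```

A submission could pair such a script with the process card and governance memo described above, documenting who may access the test package and how scoring assets are updated.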
Track 2: AI Measurement Design
This track invites new evaluation dimensions that capture aspects of model behavior relevant to real-world use cases not covered by existing benchmarks. Submissions may define cultural, linguistic, functional, domain-specific, communicative, or socio-institutional constructs and propose schemas, item formats, scoring procedures, and documentation standards. We are looking for well-defined constructs with clear motivating use cases, explicit assumptions, and verifiable measurement strategies. Deliverables include, but are not limited to, a benchmark schema or dataset card describing the construct, example items or evaluation prompts with scoring criteria, and a short write-up (≈4 pages) explaining construct definition, related work, and measurement rationale.
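As a purely illustrative sketch, a Track 2 submission might encode its construct definition and example items in a lightweight schema such as the one below. The construct, fields, and rubric shown are hypothetical placeholders; submissions may use any schema or documentation format.

```python
"""Illustrative sketch only (Track 2): a minimal construct card with example items.
The construct name, fields, and rubric are hypothetical placeholders."""
from dataclasses import dataclass, field


@dataclass
class EvaluationItem:
    prompt: str
    reference: str
    scoring_rubric: str              # human-readable criteria for this item


@dataclass
class ConstructCard:
    construct: str                   # what is being measured
    motivating_use_case: str         # why it matters in the target context
    assumptions: list[str]           # explicit measurement assumptions
    items: list[EvaluationItem] = field(default_factory=list)


card = ConstructCard(
    construct="register-appropriate replies in community health messaging",
    motivating_use_case="chat assistants serving low-resource language users",
    assumptions=["reference answers reviewed by community annotators"],
    items=[EvaluationItem(
        prompt="Rewrite the vaccination reminder in a polite informal register.",
        reference="<community-reviewed reference>",
        scoring_rubric="1 point if the register matches; 0 otherwise",
    )],
)
```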
Track 3: AI Downstream Impact Assessment
This track focuses on methods that assess how model performance varies across contexts—e.g., across languages, institutions, domains, or deployment environments. Submissions may include empirical studies, diagnostic dashboards, error analyses, reliability studies, or pipelines that surface capability gaps. We look for clear methodologies for quantifying variation in behavior across settings and interpreting those differences in terms of practical deployment needs. Deliverables include, but are not limited to, a report or dashboard summarizing comparative results, a reproducible analysis pipeline or evaluation notebook, and a brief documentation (≈4 pages) clarifying assumptions, data sources, and interpretive limitations.
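As a purely illustrative sketch, the snippet below shows a minimal Track 3 analysis that quantifies how scores vary across deployment contexts and surfaces the largest gap. The per-context scores are placeholder inputs; a real submission would compute them from its own evaluation pipeline and interpret the gaps against local deployment needs.

```python
"""Illustrative sketch only (Track 3): summarizing performance variation across
deployment contexts. The example scores are placeholder inputs."""
from statistics import mean, stdev


def summarize_variation(scores_by_context: dict[str, list[float]]) -> dict:
    per_context = {ctx: mean(vals) for ctx, vals in scores_by_context.items()}
    means = list(per_context.values())
    return {
        "per_context_mean": per_context,
        "overall_mean": mean(means),
        "max_gap": max(means) - min(means),
        "dispersion": stdev(means) if len(means) > 1 else 0.0,
    }


# Placeholder example: accuracy-like scores for three hypothetical contexts.
print(summarize_variation({
    "language_A": [0.82, 0.79, 0.85],
    "language_B": [0.61, 0.58, 0.64],
    "clinic_domain": [0.70, 0.73, 0.68],
}))
```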
Please contact us on Discord or by email if you are interested in participating in the shared task. We are available by appointment to help facilitate team formation, find resources, and brainstorm ideas with you.
Important Dates
- Submission deadline via OpenReview: January 9, 2026
- Feedback release: January 16, 2026
- Submission portal: OpenReview
List of Speakers
List of Organizers
Schedule
| Time | Session |
|---|---|
| 09:00 – 09:10 | Opening Remarks |
| 09:10 – 09:50 | Keynote 1: Jian Gang Ngui (30 min talk + 10 min Q&A) |
| 09:50 – 10:05 | Oral Presentation 1 |
| 10:05 – 10:40 | Poster Session 1 (10 posters) |
| 10:40 – 10:55 | Break / Coffee |
| 10:55 – 11:35 | Keynote 2: Simon Chesterman (30 min talk + 10 min Q&A) |
| 11:35 – 11:50 | Oral Presentation 2 |
| 11:50 – 12:25 | Poster Session 2 (10 posters) |
| 12:25 – 13:10 | Lunch Break (45 min) |
| 13:10 – 13:50 | Keynote 3: Tan Zhi Xuan (30 min talk + 10 min Q&A) |
| 13:50 – 14:05 | Oral Presentation 3 |
| 14:05 – 14:40 | Poster Session 3 (10 posters) |
| 14:40 – 14:55 | Break / Movement |
| 14:55 – 15:35 | Keynote 4: Elina Noor (30 min talk + 10 min Q&A) |
| 15:35 – 16:05 | Panel Discussion |
| 16:05 – 16:40 | Poster Session 4 (10 posters) |
| 16:40 – 17:00 | Awards & Closing Remarks |
Accepted Papers
Oral Presentations
| Title | Authors |
|---|---|
| Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages – A Singlish Case Study OpenReview | Isaac Lim, Shaun Khoo, Watson Wei Khong Chua, Jessica Foo, Jia Yi Goh, Roy Ka-Wei Lee |
| What Would an LLM Do? Evaluating Policymaking Capabilities of Large Language Models OpenReview | Pierre Le Coz, Jiaan Liu, Debarun Bhattacharjya, Georgina Curto, Serge Stinckwich |
| Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties OpenReview | Jinju Kim, Haeji Jung, Youjeong Roh, Jong Hwan Ko, David R. Mortensen |
Poster Presentations
Session 1
| Title | Authors |
|---|---|
| From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine Translation OpenReview | Mardiyyah Oduwole, Oluwatosin Olajide, Jamiu Tunde Suleiman, Faith Hunja, Busayo Awobade, Comfort Oyindamola Akanni, Fatimo Adebanjo, Chinonyelum Rosemary Igwe, Peace Ododo, Promise Omoigui, Abraham Toluwase Owodunni, Steven Kolawole |
| Open, Reproducible Morphology Probes for Plains Cree OpenReview | Duncan Stothers |
| When Gujarati Meets English: Toward Robust Translation of Code-Mixed Low Resourced Indian Language OpenReview | Mukund Agarwalla, Himanshu Kumar, Nishat Afshan Ansari |
| Sentence-Aware Bahnaric-Vietnamese Lexical Mapping with Contrastive Contextual Representations OpenReview | Thi Ty Nguyen, Phat T. Tran-Truong, Long Nguyen, Tan Sang Nguyen, Tho Quan |
| One Model, Many Worlds: Cross-Lingual Fine-Tuning Can Improve Low-Resource Capabilities of Language Model OpenReview | Tyler Slomianyj, Rudraansh Korlakunta, Victor He, Daniel Gao, Sunishchal Dev, Kevin Zhu, Aryan Shrivastava |
| Reflective Translation: Enhancing Low-Resource Machine Translation through Self-Reflection OpenReview | Lailah Denny, Nicholas Cheng, Agrim Sharma, Erin Tan |
| ENLIVEN-1000: A Comprehensive Revitalization Framework for 1000+ Endangered Languages via Broad-Coverage LID and LLM-Augmented MT OpenReview | Philip Meng |
| From Bias to Balance: How Multilingual Dataset Composition Affects Tokenizer Performance Across Languages OpenReview | Aishwarya Selvamurugan, Raj Dandekar, Rajat Dandekar, Sreedath Panat |
| Not All Data Augmentation Works: A Typology-Aware Study for Low-Resource Neural Machine Translation in Vietnamese Ethnic Minority Languages OpenReview | Long Nguyen, Dat T. Truong, Nhan D. Tran, Quynh Vo, Quy Tran Nguyen, Tho Quan |
Session 2
| Title | Authors |
|---|---|
| BAID: A Benchmark for Bias Assessment of AI Detectors OpenReview | Priyam Basu |
| Lost in Localization: Building RabakBench with Human-in-the-Loop Validation to Expose Multilingual Safety Gaps OpenReview | Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee |
| SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth OpenReview | Wenpeng Xing, Lanyi Wei, Haixiao Hu, Rongchang Li, Mohan Li, Changting Lin, Meng Han |
| Pluralistic AI Alignment: A Cross-Cultural Pilot Survey OpenReview | Khashayar Alavi, Lucie Flek, Florian Mai |
| Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing OpenReview | Filip Trhlík, Andrew Caines, Paula Buttery |
| Language Models Entangle Language and Culture OpenReview | Shourya Jain, Paras Chopra |
| UbuntuGuard: A Policy-Based Safety Benchmark for Low-Resource African Languages OpenReview | Tassallah Abdullahi, Macton Mgonzo, Abraham Toluwase Owodunni, Ritambhara Singh, Carsten Eickhoff |
| Jo.E (Joint Evaluation): A Multi-Agent Collaborative Framework for Comprehensive AI Safety Evaluation of Language Models OpenReview | Himanshu Joshi |
Session 3
| Title | Authors |
|---|---|
| Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment OpenReview | Vanya Bannihatti Kumar, Divyanshu Goyal, Akhil Eppa, Neel Bhandari |
| Inverse Language Modeling towards Robust and Grounded LLMs OpenReview | Davide Gabrielli, Simone Sestito, Iacopo Masi |
| Advancing NLP Equity: A Secondary Benchmark Evaluation of Multilingual Language Models for Underrepresented Languages OpenReview | Md Muntaqim Meherab, SALMAN, Md. Maruf Billah, Kazi Shakkhar Rahman, Liza Sharmin, Tanvirul Islam, Z N M Zarif Mahmud, Nuruzzaman Faruqui, Sheak Rashed Haider Noori, Touhid Bhuiyan |
| CultureManip: A Novel Benchmark for Mental Manipulation Detection Across Multilingual Settings OpenReview | JingFeng Liang, Joshua Casuga, Austin Chen, Lang Xiong, Kevin Zhu |
| Beyond Static Leaderboards: A Roadmap to Naturalistic, Functional Evaluation of LLMs OpenReview | Victor Ojewale, Suresh Venkatasubramanian |
| Why It Failed: A Benchmark to Evaluate Interpretability OpenReview | Joel Mathew, Aditya Lagu, Anthony Tang, Prudhviraj Naidu |
| Not Funny Anymore: LLM Judges Confuse Literal Similarity for Humor in Translated Jokes OpenReview | Fabricio Rivera, Rohit Pochugari, Tessa Chan, Devansh Katakwar, Kevin Zhu, Michael Saxon |
| MULTILINGUAL EVALUATION OF HUMAN VS. AI TEXT CLASSIFICATION WITH ZERO-SHOT ANALYSIS OF CONTEMPORARY LLM ARCHITECTURES. OpenReview | Pranamya Nilesh Deshpande, Raj Dandekar, Rajat Dandekar, Sreedath Panat |
| OCER and OCWER: Integrating Visual Similarity and Segmentation in OCR Evaluation OpenReview | Samy Ouzerrout |
Session 4
| Title | Authors |
|---|---|
| VLM-guided Object-level Segmentation from Dynamic Scene OpenReview | Feiran Yang |
| PeerCoPilot: A Language Model-Powered Assistant for Behavioral Health Organizations OpenReview | Gao Mo, Naveen Janaki Raman, Megan Chai, Cindy Peng, Shannon Pagdon, Nev Jones, Hong Shen, Margaret Swarbrick, Fei Fang |
| The Resonance Corpus: Chinese Caregiver-Child Dialogue for Community-Aligned Language Models OpenReview | Lingqin Meng, Yang Gao, Zhongzhi Huo, Stella Christie |
| Scribes, Scripts, and Scarcity: Re-thinking Benchmarking for Arabic-Script Handwritten Text Recognition in Historical Manuscript Traditions OpenReview | Yuanhao Zou |
| RuSignBot: Russian Sign Language Synthesis via Customized MimicMotion OpenReview | Daria Bondarenko, Emilia Bojarskaja, Maxim Novopoltsev, Aleksandr Tulenkov, Ruslan Murtazin, Iuliia Zemtsova, Ilya Makarov, Andrey Savchenko |
| Beyond Monolithic Culture: Evaluating Understandability of Online Text Across Cultural Dimensions OpenReview | Saurabh Kumar Pandey, Harshit Gupta, Sougata Saha, Monojit Choudhury |
| CAMA: A Culturally Adaptive Multi-Agent Framework for Postpartum Depression Support in Multilingual and Low-Resource Settings OpenReview | Zhiqi Zhang, Ziyi LIU, rite Bo |
| CESLR: A Multi-Signer Benchmark and SpatioTemporal End-to-End Framework for Continuous Ethiopian Sign Language Recognition OpenReview | Anteneh Yehalem Tegegne, Yohannes Ayana Ejigu, Surafel Amsalu |