AAAI 2026 Workshop on Language Models for Underserved Communities

Singapore

January 27, 2026

Call for Papers

Underserved communities often lack adequate access to advanced natural language processing (NLP) technologies due to limited linguistic data, insufficient computational resources, or inadequate AI governance frameworks. This gap hinders equitable access to NLP advancements and exacerbates the digital divide. Our workshop aims to address this by fostering a multidisciplinary dialogue on the development of language models (LMs) that prioritize cultural sensitivity, resource efficiency, and sustainable AI practices. We invite researchers, practitioners, and policymakers to address these challenges and propose innovative solutions for building and deploying language models for underserved languages and communities.

Topics of Interest

We invite submissions of full papers, ongoing work, position papers, and survey papers on topics including, but not limited to:

  1. Measuring and Governing AI
    Developing reliable evaluation methods for LMs under constraints in data, compute, and expertise. How can psychometrics, auditing frameworks, or validity theory guide responsible measurement and governance?

  2. Benchmarking and Fairness
    Building inclusive benchmarks and evaluation pipelines that reduce bias, improve cultural and linguistic representation, and ensure fair performance across underserved communities.

  3. Pluralistic Alignment
    Designing approaches for aligning LMs with diverse values, cultural norms, and epistemologies, including participatory and community-driven methods.

  4. Open and Inclusive Infrastructure
    Creating open datasets, benchmarks, models, and participatory platforms that support sustainable and equitable NLP research and deployment.

Submission Guidelines

We welcome long papers (8 pages) and short papers (4 pages), excluding references. Submissions must follow the AAAI 2026 style guidelines.

Important Dates

  • Submission deadline (extended): November 24, 2025
  • Notification of acceptance: December 12, 2025
  • Camera-ready paper due: January 10, 2026
  • Workshop dates: January 27, 2026

Please note that all deadlines are in the Anywhere on Earth (AoE) time zone.

Papers should be submitted via OpenReview.

Contact Us

For inquiries, please contact the workshop organizers: lm4uc.organizers (at) gmail.com

Alternatively, you can reach us via our Discord server.

Shared Task: AI Measurement for the Public Interest

This year, we are excited to announce a shared task on “AI Measurement for the Public Interest,” organized as part of the Language Models for Underserved Communities (LM4UC) workshop at AAAI 2026. The shared task aims to foster the development of evaluation methodologies and infrastructures that prioritize the needs of underserved communities, focusing on context-aware and institutionally grounded measurement practices.

This shared task invites participants to design and prototype evaluation workflows tailored to underserved linguistic and cultural contexts. The focus is not on optimizing model performance, but on developing measurement infrastructures that reflect how institutions, researchers, and communities actually assess and deploy language technologies under varied resource, governance, and environmental constraints. The task is organized around three complementary layers of an evaluation ecosystem:

  1. AI Evaluation Infrastructure and Stewardship: Where and how evaluation is conducted. This includes workflows that enable local institutions to run assessments, control access to evaluation assets, and maintain their own scoring and deployment environments.
  2. AI Measurement Design: What is being measured and how it is operationalized. This includes defining new evaluation dimensions, benchmarks, scoring criteria, and documentation practices that capture capabilities relevant to local use cases.
  3. AI Downstream Impact Assessment: How system behavior varies across populations, domains, or deployment settings. This includes methods for quantifying performance variation, robustness, or utility across communities and identifying areas needing further capability development.

These layers together support end-to-end evaluation: designing the evaluation environment, specifying meaningful constructs, and analyzing performance in real deployments. Submissions may address any layer independently or propose workflows that integrate multiple layers. The shared task welcomes contributions such as datasets, protocols, analysis pipelines, benchmark definitions, evaluation software, and institutional frameworks. Submissions will be evaluated on clarity, methodological rigor, practical feasibility, and relevance to settings where existing benchmark infrastructure is limited or mismatched to local priorities. This initiative reflects LM4UC’s broader goal of advancing scalable, context-aware measurement infrastructures that support the long-term development of language technologies beyond traditional benchmark settings.

Track 1: AI Evaluation Infrastructure & Stewardship

This track focuses on designing evaluation workflows that can be operated by local institutions rather than relying on centralized infrastructure. Submissions may include device-side evaluation, federated scoring, offline test packages, access-controlled scoring interfaces, or procedures for maintaining and updating evaluation assets over time. We are looking for clear and feasible mechanisms that allow organizations to run evaluations, control access to evaluation artifacts, and adapt workflows to their institutional constraints. Deliverables include, but are not limited to, a process card or documentation outline describing the evaluation pipeline design and usage constraints, a workflow diagram or prototype demonstrating how the evaluation is run end-to-end, and a short technical memo (≈4 pages) detailing assumptions, governance structure, and system requirements.
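
As a concrete but non-prescriptive illustration of what a Track 1 prototype might look like, the Python sketch below shows a locally stewarded, offline evaluation run: the institution verifies its evaluation assets against a steward-maintained manifest, scores predictions without any network access, and writes the report locally. All file names, fields, and the exact-match metric are illustrative assumptions, not a required format.

    # Minimal sketch (assumptions: JSONL test package, sha256 manifest, exact-match scoring).
    import hashlib
    import json
    from pathlib import Path


    def verify_package(package: Path, manifest: Path) -> None:
        """Check that locally held evaluation assets match the steward's manifest."""
        expected = json.loads(manifest.read_text(encoding="utf-8"))  # {filename: sha256 digest}
        for name, digest in expected.items():
            actual = hashlib.sha256((package / name).read_bytes()).hexdigest()
            if actual != digest:
                raise ValueError(f"Checksum mismatch for {name}: assets may be stale or tampered with")


    def score(package: Path, predictions: Path) -> dict:
        """Score predictions against the offline test set; no network access is required."""
        gold = {r["id"]: r["answer"] for r in map(json.loads, (package / "test.jsonl").open(encoding="utf-8"))}
        preds = {r["id"]: r["output"] for r in map(json.loads, predictions.open(encoding="utf-8"))}
        correct = sum(1 for i, ans in gold.items() if preds.get(i, "").strip() == ans.strip())
        return {"n_items": len(gold), "exact_match": correct / max(len(gold), 1)}


    if __name__ == "__main__":
        pkg = Path("eval_package")                  # offline test package held by the local institution
        verify_package(pkg, pkg / "manifest.json")  # stewardship step: assets are versioned and verified
        report = score(pkg, Path("predictions.jsonl"))
        Path("score_report.json").write_text(json.dumps(report, indent=2), encoding="utf-8")
        print(report)

A full submission would pair such a script with the process card and governance memo described above, specifying who may run the evaluation, where the assets live, and how they are updated.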

Track 2: AI Measurement Design

This track invites new evaluation dimensions that capture aspects of model behavior relevant to real-world use cases not covered by existing benchmarks. Submissions may define cultural, linguistic, functional, domain-specific, communicative, or socio-institutional constructs and propose schemas, item formats, scoring procedures, and documentation standards. We are looking for well-defined constructs with clear motivating use cases, explicit assumptions, and verifiable measurement strategies. Deliverables include, but are not limited to, a benchmark schema or dataset card describing the construct, example items or evaluation prompts with scoring criteria, and a short write-up (≈4 pages) explaining construct definition, related work, and measurement rationale.
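
To make the expected shape of a Track 2 deliverable more tangible, here is a minimal, hypothetical schema sketch in Python; the construct, field names, and rubric are invented for illustration and are not a prescribed format for submissions.

    # Minimal sketch of a construct definition and item schema (all fields are illustrative assumptions).
    import json
    from dataclasses import asdict, dataclass, field


    @dataclass
    class Item:
        item_id: str
        prompt: str      # evaluation prompt shown to the model
        reference: str   # reference answer or target behaviour
        scoring: str     # how this item is scored (e.g., rubric level, exact match)


    @dataclass
    class BenchmarkSchema:
        construct: str                 # what is being measured, in plain language
        motivation: str                # local use case motivating the construct
        languages: list[str]
        scoring_procedure: str         # aggregation rule across items
        assumptions: list[str] = field(default_factory=list)
        items: list[Item] = field(default_factory=list)


    if __name__ == "__main__":
        schema = BenchmarkSchema(
            construct="Register-appropriate replies to community health queries",
            motivation="Clinic chat assistants should match local politeness norms",
            languages=["xx"],  # placeholder language code
            scoring_procedure="3-level rubric (0/1/2), averaged over items",
            assumptions=["Annotators are first-language speakers", "Single-turn interactions only"],
            items=[Item("q001",
                        "Reply to: 'My child has a fever, what should I do?'",
                        "Polite, actionable advice that refers the caregiver to a clinic",
                        "rubric 0-2")],
        )
        print(json.dumps(asdict(schema), indent=2, ensure_ascii=False))

The accompanying write-up would then justify the construct, relate it to prior work, and explain how the rubric is validated.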

Track 3: AI Downstream Impact Assessment

This track focuses on methods that assess how model performance varies across contexts, for example across languages, institutions, domains, or deployment environments. Submissions may include empirical studies, diagnostic dashboards, error analyses, reliability studies, or pipelines that surface capability gaps. We are looking for clear methodologies for quantifying variation in behavior across settings and interpreting those differences in terms of practical deployment needs. Deliverables include, but are not limited to, a report or dashboard summarizing comparative results, a reproducible analysis pipeline or evaluation notebook, and brief documentation (≈4 pages) clarifying assumptions, data sources, and interpretive limitations.
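
As a small illustration of the kind of analysis pipeline a Track 3 submission might include, the Python sketch below disaggregates per-item correctness by deployment context (here, language) and attaches bootstrap confidence intervals; the record format and grouping key are assumptions made for the example, not a required interface.

    # Minimal sketch of a disaggregated impact analysis (record format is an illustrative assumption).
    import json
    import random
    from collections import defaultdict
    from pathlib import Path


    def bootstrap_ci(outcomes: list[int], n_boot: int = 2000, alpha: float = 0.05) -> tuple[float, float]:
        """Percentile bootstrap interval for the mean of 0/1 outcomes."""
        means = sorted(
            sum(random.choices(outcomes, k=len(outcomes))) / len(outcomes)
            for _ in range(n_boot)
        )
        return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]


    def disaggregate(results_path: Path, group_key: str = "language") -> dict:
        """Group per-item correctness by context and report accuracy with a 95% CI."""
        groups: dict[str, list[int]] = defaultdict(list)
        for record in map(json.loads, results_path.open(encoding="utf-8")):
            groups[record[group_key]].append(int(record["correct"]))
        report = {}
        for name, outcomes in sorted(groups.items()):
            lo, hi = bootstrap_ci(outcomes)
            report[name] = {"n": len(outcomes), "accuracy": sum(outcomes) / len(outcomes), "ci95": [lo, hi]}
        return report


    if __name__ == "__main__":
        # results.jsonl is assumed to hold one record per item, e.g.
        # {"language": "...", "correct": 0 or 1, ...}
        print(json.dumps(disaggregate(Path("results.jsonl")), indent=2))

The documentation requested above would interpret the resulting gaps in terms of local deployment needs rather than leaderboard rank.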

Please contact us via Discord or email if you are interested in participating in the shared task. We are available by appointment to help facilitate team formation, find resources, and brainstorm ideas with you.

Important Dates

  • Submission deadline via OpenReview: January 9, 2026
  • Feedback release: January 16, 2026
  • Submission portal: OpenReview

List of Speakers

List of Organizers

Schedule

Time           Session
09:00 – 09:10  Opening Remarks
09:10 – 09:50  Keynote 1: Jian Gang Ngui (30 min talk + 10 min Q&A)
09:50 – 10:05  Oral Presentation 1
10:05 – 10:40  Poster Session 1 (10 posters)
10:40 – 10:55  Break / Coffee
10:55 – 11:35  Keynote 2: Simon Chesterman (30 min talk + 10 min Q&A)
11:35 – 11:50  Oral Presentation 2
11:50 – 12:25  Poster Session 2 (10 posters)
12:25 – 13:10  Lunch Break (45 min)
13:10 – 13:50  Keynote 3: Tan Zhi Xuan (30 min talk + 10 min Q&A)
13:50 – 14:05  Oral Presentation 3
14:05 – 14:40  Poster Session 3 (10 posters)
14:40 – 14:55  Break / Movement
14:55 – 15:35  Keynote 4: Elina Noor (30 min talk + 10 min Q&A)
15:35 – 16:05  Panel Discussion
16:05 – 16:40  Poster Session 4 (10 posters)
16:40 – 17:00  Awards & Closing Remarks

Accepted Papers

Oral Presentations

  • Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages – A Singlish Case Study
    Isaac Lim, Shaun Khoo, Watson Wei Khong Chua, Jessica Foo, Jia Yi Goh, Roy Ka-Wei Lee
  • What Would an LLM Do? Evaluating Policymaking Capabilities of Large Language Models
    Pierre Le Coz, Jiaan Liu, Debarun Bhattacharjya, Georgina Curto, Serge Stinckwich
  • Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties
    Jinju Kim, Haeji Jung, Youjeong Roh, Jong Hwan Ko, David R. Mortensen

Poster Presentations

Session 1

  • From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine Translation
    Mardiyyah Oduwole, Oluwatosin Olajide, Jamiu Tunde Suleiman, Faith Hunja, Busayo Awobade, Comfort Oyindamola Akanni, Fatimo Adebanjo, Chinonyelum Rosemary Igwe, Peace Ododo, Promise Omoigui, Abraham Toluwase Owodunni, Steven Kolawole
  • Open, Reproducible Morphology Probes for Plains Cree
    Duncan Stothers
  • When Gujarati Meets English: Toward Robust Translation of Code-Mixed Low Resourced Indian Language
    Mukund Agarwalla, Himanshu Kumar, Nishat Afshan Ansari
  • Sentence-Aware Bahnaric-Vietnamese Lexical Mapping with Contrastive Contextual Representations
    Thi Ty Nguyen, Phat T. Tran-Truong, Long Nguyen, Tan Sang Nguyen, Tho Quan
  • One Model, Many Worlds: Cross-Lingual Fine-Tuning Can Improve Low-Resource Capabilities of Language Model
    Tyler Slomianyj, Rudraansh Korlakunta, Victor He, Daniel Gao, Sunishchal Dev, Kevin Zhu, Aryan Shrivastava
  • Reflective Translation: Enhancing Low-Resource Machine Translation through Self-Reflection
    Lailah Denny, Nicholas Cheng, Agrim Sharma, Erin Tan
  • ENLIVEN-1000: A Comprehensive Revitalization Framework for 1000+ Endangered Languages via Broad-Coverage LID and LLM-Augmented MT
    Philip Meng
  • From Bias to Balance: How Multilingual Dataset Composition Affects Tokenizer Performance Across Languages
    Aishwarya Selvamurugan, Raj Dandekar, Rajat Dandekar, Sreedath Panat
  • Not All Data Augmentation Works: A Typology-Aware Study for Low-Resource Neural Machine Translation in Vietnamese Ethnic Minority Languages
    Long Nguyen, Dat T. Truong, Nhan D. Tran, Quynh Vo, Quy Tran Nguyen, Tho Quan

Session 2

  • BAID: A Benchmark for Bias Assessment of AI Detectors
    Priyam Basu
  • Lost in Localization: Building RabakBench with Human-in-the-Loop Validation to Expose Multilingual Safety Gaps
    Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee
  • SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth
    Wenpeng Xing, Lanyi Wei, Haixiao Hu, Rongchang Li, Mohan Li, Changting Lin, Meng Han
  • Pluralistic AI Alignment: A Cross-Cultural Pilot Survey
    Khashayar Alavi, Lucie Flek, Florian Mai
  • Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing
    Filip Trhlík, Andrew Caines, Paula Buttery
  • Language Models Entangle Language and Culture
    Shourya Jain, Paras Chopra
  • UbuntuGuard: A Policy-Based Safety Benchmark for Low-Resource African Languages
    Tassallah Abdullahi, Macton Mgonzo, Abraham Toluwase Owodunni, Ritambhara Singh, Carsten Eickhoff
  • Jo.E (Joint Evaluation): A Multi-Agent Collaborative Framework for Comprehensive AI Safety Evaluation of Language Models
    Himanshu Joshi

Session 3

  • Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment
    Vanya Bannihatti Kumar, Divyanshu Goyal, Akhil Eppa, Neel Bhandari
  • Inverse Language Modeling towards Robust and Grounded LLMs
    Davide Gabrielli, Simone Sestito, Iacopo Masi
  • Advancing NLP Equity: A Secondary Benchmark Evaluation of Multilingual Language Models for Underrepresented Languages
    Md Muntaqim Meherab, SALMAN, Md. Maruf Billah, Kazi Shakkhar Rahman, Liza Sharmin, Tanvirul Islam, Z N M Zarif Mahmud, Nuruzzaman Faruqui, Sheak Rashed Haider Noori, Touhid Bhuiyan
  • CultureManip: A Novel Benchmark for Mental Manipulation Detection Across Multilingual Settings
    JingFeng Liang, Joshua Casuga, Austin Chen, Lang Xiong, Kevin Zhu
  • Beyond Static Leaderboards: A Roadmap to Naturalistic, Functional Evaluation of LLMs
    Victor Ojewale, Suresh Venkatasubramanian
  • Why It Failed: A Benchmark to Evaluate Interpretability
    Joel Mathew, Aditya Lagu, Anthony Tang, Prudhviraj Naidu
  • Not Funny Anymore: LLM Judges Confuse Literal Similarity for Humor in Translated Jokes
    Fabricio Rivera, Rohit Pochugari, Tessa Chan, Devansh Katakwar, Kevin Zhu, Michael Saxon
  • Multilingual Evaluation of Human vs. AI Text Classification with Zero-Shot Analysis of Contemporary LLM Architectures
    Pranamya Nilesh Deshpande, Raj Dandekar, Rajat Dandekar, Sreedath Panat
  • OCER and OCWER: Integrating Visual Similarity and Segmentation in OCR Evaluation
    Samy Ouzerrout

Session 4

  • VLM-guided Object-level Segmentation from Dynamic Scene
    Feiran Yang
  • PeerCoPilot: A Language Model-Powered Assistant for Behavioral Health Organizations
    Gao Mo, Naveen Janaki Raman, Megan Chai, Cindy Peng, Shannon Pagdon, Nev Jones, Hong Shen, Margaret Swarbrick, Fei Fang
  • The Resonance Corpus: Chinese Caregiver-Child Dialogue for Community-Aligned Language Models
    Lingqin Meng, Yang Gao, Zhongzhi Huo, Stella Christie
  • Scribes, Scripts, and Scarcity: Re-thinking Benchmarking for Arabic-Script Handwritten Text Recognition in Historical Manuscript Traditions
    Yuanhao Zou
  • RuSignBot: Russian Sign Language Synthesis via Customized MimicMotion
    Daria Bondarenko, Emilia Bojarskaja, Maxim Novopoltsev, Aleksandr Tulenkov, Ruslan Murtazin, Iuliia Zemtsova, Ilya Makarov, Andrey Savchenko
  • Beyond Monolithic Culture: Evaluating Understandability of Online Text Across Cultural Dimensions
    Saurabh Kumar Pandey, Harshit Gupta, Sougata Saha, Monojit Choudhury
  • CAMA: A Culturally Adaptive Multi-Agent Framework for Postpartum Depression Support in Multilingual and Low-Resource Settings
    Zhiqi Zhang, Ziyi LIU, rite Bo
  • CESLR: A Multi-Signer Benchmark and SpatioTemporal End-to-End Framework for Continuous Ethiopian Sign Language Recognition
    Anteneh Yehalem Tegegne, Yohannes Ayana Ejigu, Surafel Amsalu