AAAI 2026 Workshop on Language Models for Underserved Communities

Singapore

January 27, 2026

Call for Papers

Underserved communities often lack adequate access to advanced natural language processing (NLP) technologies due to limited linguistic data, insufficient computational resources, or inadequate AI governance frameworks. This gap hinders equitable access to NLP advancements and exacerbates the digital divide. Our workshop aims to address this by fostering a multidisciplinary dialogue on the development of language models (LMs) that prioritize cultural sensitivity, resource efficiency, and sustainable AI practices. We invite researchers, practitioners, and policymakers to address these challenges and propose innovative solutions for building and deploying language models for underserved languages and communities.

Topics of Interest

We invite submissions of full papers, ongoing work, position papers, and survey papers on topics including, but not limited to:

  1. Measuring and Governing AI
    Developing reliable evaluation methods for LMs under constraints in data, compute, and expertise. How can psychometrics, auditing frameworks, or validity theory guide responsible measurement and governance?

  2. Benchmarking and Fairness
    Building inclusive benchmarks and evaluation pipelines that reduce bias, improve cultural and linguistic representation, and ensure fair performance across underserved communities.

  3. Pluralistic Alignment
    Designing approaches for aligning LMs with diverse values, cultural norms, and epistemologies, including participatory and community-driven methods.

  4. Open and Inclusive Infrastructure
    Creating open datasets, benchmarks, models, and participatory platforms that support sustainable and equitable NLP research and deployment.

Submission Guidelines

We welcome long papers (8 pages) and short papers (4 pages), excluding references. Submissions must follow the AAAI 2026 style guidelines.

Important Dates

  • Submission deadline (extended): November 24, 2025
  • Notification of acceptance: December 12, 2025
  • Camera-ready paper due: January 10, 2026
  • Workshop dates: January 27, 2026

Please note that all deadlines are in the Anywhere on Earth (AoE) time zone.

Papers should be submitted via OpenReview.

Contact Us

For inquiries, please contact the workshop organizers: lm4uc.organizers (at) gmail.com

Alternatively, you can reach us via our Discord server.

Shared Task: AI Measurement for the Public Interest

This year, we are excited to announce a shared task on “AI Measurement for the Public Interest,” organized as part of the Language Models for Underserved Communities (LM4UC) workshop at AAAI 2026. The shared task aims to foster the development of evaluation methodologies and infrastructures that prioritize the needs of underserved communities, focusing on context-aware and institutionally grounded measurement practices.

This shared task invites participants to design and prototype evaluation workflows tailored to underserved linguistic and cultural contexts. The focus is not on optimizing model performance, but on developing measurement infrastructures that reflect how institutions, researchers, and communities actually assess and deploy language technologies under varied resource, governance, and environmental constraints. The task is organized around three complementary layers of an evaluation ecosystem:

  1. AI Evaluation Infrastructure and Stewardship: Where and how evaluation is conducted. This includes workflows that enable local institutions to run assessments, control access to evaluation assets, and maintain their own scoring and deployment environments.
  2. AI Measurement Design: What is being measured and how it is operationalized. This includes defining new evaluation dimensions, benchmarks, scoring criteria, and documentation practices that capture capabilities relevant to local use cases.
  3. AI Downstream Impact Assessment: How system behavior varies across populations, domains, or deployment settings. This includes methods for quantifying performance variation, robustness, or utility across communities and identifying areas needing further capability development.

These layers together support end-to-end evaluation: designing the evaluation environment, specifying meaningful constructs, and analyzing performance in real deployments. Submissions may address any layer independently or propose workflows that integrate multiple layers. The shared task welcomes contributions such as datasets, protocols, analysis pipelines, benchmark definitions, evaluation software, and institutional frameworks. Submissions will be evaluated on clarity, methodological rigor, practical feasibility, and relevance to settings where existing benchmark infrastructure is limited or mismatched to local priorities. This initiative reflects LM4UC’s broader goal of advancing scalable, context-aware measurement infrastructures that support the long-term development of language technologies beyond traditional benchmark settings.

Track 1: AI Evaluation Infrastructure & Stewardship

This track focuses on designing evaluation workflows that can be operated by local institutions rather than relying on centralized infrastructure. Submissions may include device-side evaluation, federated scoring, offline test packages, access-controlled scoring interfaces, or procedures for maintaining and updating evaluation assets over time. We are looking for clear and feasible mechanisms that allow organizations to run evaluations, control access to evaluation artifacts, and adapt workflows to their institutional constraints. Deliverables include, but are not limited to, a process card or documentation outline describing the evaluation pipeline design and usage constraints, a workflow diagram or prototype demonstrating how the evaluation is run end-to-end, and a short technical memo (≈4 pages) detailing assumptions, governance structure, and system requirements.
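
As a concrete but non-prescriptive illustration of what a Track 1 prototype might look like, the Python sketch below shows a locally stewarded, offline evaluation run: the institution verifies its evaluation assets against a steward-maintained manifest, scores predictions without any network access, and writes the report locally. All file names, fields, and the exact-match metric are illustrative assumptions, not a required format.

    # Minimal sketch (assumptions: JSONL test package, sha256 manifest, exact-match scoring).
    import hashlib
    import json
    from pathlib import Path


    def verify_package(package: Path, manifest: Path) -> None:
        """Check that locally held evaluation assets match the steward's manifest."""
        expected = json.loads(manifest.read_text(encoding="utf-8"))  # {filename: sha256 digest}
        for name, digest in expected.items():
            actual = hashlib.sha256((package / name).read_bytes()).hexdigest()
            if actual != digest:
                raise ValueError(f"Checksum mismatch for {name}: assets may be stale or tampered with")


    def score(package: Path, predictions: Path) -> dict:
        """Score predictions against the offline test set; no network access is required."""
        gold = {r["id"]: r["answer"] for r in map(json.loads, (package / "test.jsonl").open(encoding="utf-8"))}
        preds = {r["id"]: r["output"] for r in map(json.loads, predictions.open(encoding="utf-8"))}
        correct = sum(1 for i, ans in gold.items() if preds.get(i, "").strip() == ans.strip())
        return {"n_items": len(gold), "exact_match": correct / max(len(gold), 1)}


    if __name__ == "__main__":
        pkg = Path("eval_package")                  # offline test package held by the local institution
        verify_package(pkg, pkg / "manifest.json")  # stewardship step: assets are versioned and verified
        report = score(pkg, Path("predictions.jsonl"))
        Path("score_report.json").write_text(json.dumps(report, indent=2), encoding="utf-8")
        print(report)

A full submission would pair such a script with the process card and governance memo described above, specifying who may run the evaluation, where the assets live, and how they are updated.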

Track 2: AI Measurement Design

This track invites new evaluation dimensions that capture aspects of model behavior relevant to real-world use cases not covered by existing benchmarks. Submissions may define cultural, linguistic, functional, domain-specific, communicative, or socio-institutional constructs and propose schemas, item formats, scoring procedures, and documentation standards. We are looking for well-defined constructs with clear motivating use cases, explicit assumptions, and verifiable measurement strategies. Deliverables include, but are not limited to, a benchmark schema or dataset card describing the construct, example items or evaluation prompts with scoring criteria, and a short write-up (≈4 pages) explaining construct definition, related work, and measurement rationale.
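
To make the expected shape of a Track 2 deliverable more tangible, here is a minimal, hypothetical schema sketch in Python; the construct, field names, and rubric are invented for illustration and are not a prescribed format for submissions.

    # Minimal sketch of a construct definition and item schema (all fields are illustrative assumptions).
    import json
    from dataclasses import asdict, dataclass, field


    @dataclass
    class Item:
        item_id: str
        prompt: str      # evaluation prompt shown to the model
        reference: str   # reference answer or target behaviour
        scoring: str     # how this item is scored (e.g., rubric level, exact match)


    @dataclass
    class BenchmarkSchema:
        construct: str                 # what is being measured, in plain language
        motivation: str                # local use case motivating the construct
        languages: list[str]
        scoring_procedure: str         # aggregation rule across items
        assumptions: list[str] = field(default_factory=list)
        items: list[Item] = field(default_factory=list)


    if __name__ == "__main__":
        schema = BenchmarkSchema(
            construct="Register-appropriate replies to community health queries",
            motivation="Clinic chat assistants should match local politeness norms",
            languages=["xx"],  # placeholder language code
            scoring_procedure="3-level rubric (0/1/2), averaged over items",
            assumptions=["Annotators are first-language speakers", "Single-turn interactions only"],
            items=[Item("q001",
                        "Reply to: 'My child has a fever, what should I do?'",
                        "Polite, actionable advice that refers the caregiver to a clinic",
                        "rubric 0-2")],
        )
        print(json.dumps(asdict(schema), indent=2, ensure_ascii=False))

The accompanying write-up would then justify the construct, relate it to prior work, and explain how the rubric is validated.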

Track 3: AI Downstream Impact Assessment

This track focuses on methods that assess how model performance varies across contexts, for example across languages, institutions, domains, or deployment environments. Submissions may include empirical studies, diagnostic dashboards, error analyses, reliability studies, or pipelines that surface capability gaps. We are looking for clear methodologies for quantifying variation in behavior across settings and interpreting those differences in terms of practical deployment needs. Deliverables include, but are not limited to, a report or dashboard summarizing comparative results, a reproducible analysis pipeline or evaluation notebook, and brief documentation (≈4 pages) clarifying assumptions, data sources, and interpretive limitations.
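
As a small illustration of the kind of analysis pipeline a Track 3 submission might include, the Python sketch below disaggregates per-item correctness by deployment context (here, language) and attaches bootstrap confidence intervals; the record format and grouping key are assumptions made for the example, not a required interface.

    # Minimal sketch of a disaggregated impact analysis (record format is an illustrative assumption).
    import json
    import random
    from collections import defaultdict
    from pathlib import Path


    def bootstrap_ci(outcomes: list[int], n_boot: int = 2000, alpha: float = 0.05) -> tuple[float, float]:
        """Percentile bootstrap interval for the mean of 0/1 outcomes."""
        means = sorted(
            sum(random.choices(outcomes, k=len(outcomes))) / len(outcomes)
            for _ in range(n_boot)
        )
        return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]


    def disaggregate(results_path: Path, group_key: str = "language") -> dict:
        """Group per-item correctness by context and report accuracy with a 95% CI."""
        groups: dict[str, list[int]] = defaultdict(list)
        for record in map(json.loads, results_path.open(encoding="utf-8")):
            groups[record[group_key]].append(int(record["correct"]))
        report = {}
        for name, outcomes in sorted(groups.items()):
            lo, hi = bootstrap_ci(outcomes)
            report[name] = {"n": len(outcomes), "accuracy": sum(outcomes) / len(outcomes), "ci95": [lo, hi]}
        return report


    if __name__ == "__main__":
        # results.jsonl is assumed to hold one record per item, e.g.
        # {"language": "...", "correct": 0 or 1, ...}
        print(json.dumps(disaggregate(Path("results.jsonl")), indent=2))

The documentation requested above would interpret the resulting gaps in terms of local deployment needs rather than leaderboard rank.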

Please contact us via Discord or email if you are interested in participating in the shared task. We are available by appointment to help facilitate team formation, find resources, and brainstorm ideas with you.

Important Dates

  • Submission deadline via OpenReview: January 9, 2026
  • Feedback release: January 16, 2026
  • Submission portal: OpenReview

List of Speakers

List of Organizers

Schedule

Time           Session
09:00 – 09:10  Opening Remarks
09:10 – 09:50  Keynote 1: Jian Gang Ngui (30 min talk + 10 min Q&A)
09:50 – 10:05  Oral Presentation 1
10:05 – 10:40  Poster Session 1 (10 posters)
10:40 – 10:55  Break / Coffee
10:55 – 11:35  Keynote 2: Simon Chesterman (30 min talk + 10 min Q&A)
11:35 – 11:50  Oral Presentation 2
11:50 – 12:25  Poster Session 2 (10 posters)
12:25 – 13:10  Lunch Break (45 min)
13:10 – 13:50  Keynote 3: Tan Zhi Xuan (30 min talk + 10 min Q&A)
13:50 – 14:05  Oral Presentation 3
14:05 – 14:40  Poster Session 3 (10 posters)
14:40 – 14:55  Break / Movement
14:55 – 15:35  Keynote 4: Elina Noor (30 min talk + 10 min Q&A)
15:35 – 16:05  Panel Discussion
16:05 – 16:40  Poster Session 4 (10 posters)
16:40 – 17:00  Awards & Closing Remarks

Accepted Papers

Oral Presentations

  • Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages – A Singlish Case Study
    Isaac Lim, Shaun Khoo, Watson Wei Khong Chua, Jessica Foo, Jia Yi Goh, Roy Ka-Wei Lee
  • What Would an LLM Do? Evaluating Policymaking Capabilities of Large Language Models
    Pierre Le Coz, Jiaan Liu, Debarun Bhattacharjya, Georgina Curto, Serge Stinckwich
  • Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties
    Jinju Kim, Haeji Jung, Youjeong Roh, Jong Hwan Ko, David R. Mortensen

Poster Presentations

Session 1

  • From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine Translation
    Mardiyyah Oduwole, Oluwatosin Olajide, Jamiu Tunde Suleiman, Faith Hunja, Busayo Awobade, Comfort Oyindamola Akanni, Fatimo Adebanjo, Chinonyelum Rosemary Igwe, Peace Ododo, Promise Omoigui, Abraham Toluwase Owodunni, Steven Kolawole
  • Open, Reproducible Morphology Probes for Plains Cree
    Duncan Stothers
  • When Gujarati Meets English: Toward Robust Translation of Code-Mixed Low Resourced Indian Language
    Mukund Agarwalla, Himanshu Kumar, Nishat Afshan Ansari
  • Sentence-Aware Bahnaric-Vietnamese Lexical Mapping with Contrastive Contextual Representations
    Thi Ty Nguyen, Phat T. Tran-Truong, Long Nguyen, Tan Sang Nguyen, Tho Quan
  • One Model, Many Worlds: Cross-Lingual Fine-Tuning Can Improve Low-Resource Capabilities of Language Model
    Tyler Slomianyj, Rudraansh Korlakunta, Victor He, Daniel Gao, Sunishchal Dev, Kevin Zhu, Aryan Shrivastava
  • Reflective Translation: Enhancing Low-Resource Machine Translation through Self-Reflection
    Lailah Denny, Nicholas Cheng, Agrim Sharma, Erin Tan
  • ENLIVEN-1000: A Comprehensive Revitalization Framework for 1000+ Endangered Languages via Broad-Coverage LID and LLM-Augmented MT
    Philip Meng
  • From Bias to Balance: How Multilingual Dataset Composition Affects Tokenizer Performance Across Languages
    Aishwarya Selvamurugan, Raj Dandekar, Rajat Dandekar, Sreedath Panat
  • Not All Data Augmentation Works: A Typology-Aware Study for Low-Resource Neural Machine Translation in Vietnamese Ethnic Minority Languages
    Long Nguyen, Dat T. Truong, Nhan D. Tran, Quynh Vo, Quy Tran Nguyen, Tho Quan

Session 2

  • BAID: A Benchmark for Bias Assessment of AI Detectors
    Priyam Basu
  • Lost in Localization: Building RabakBench with Human-in-the-Loop Validation to Expose Multilingual Safety Gaps
    Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee
  • SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth
    Wenpeng Xing, Lanyi Wei, Haixiao Hu, Rongchang Li, Mohan Li, Changting Lin, Meng Han
  • Pluralistic AI Alignment: A Cross-Cultural Pilot Survey
    Khashayar Alavi, Lucie Flek, Florian Mai
  • Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing
    Filip Trhlík, Andrew Caines, Paula Buttery
  • Language Models Entangle Language and Culture
    Shourya Jain, Paras Chopra
  • UbuntuGuard: A Policy-Based Safety Benchmark for Low-Resource African Languages
    Tassallah Abdullahi, Macton Mgonzo, Abraham Toluwase Owodunni, Ritambhara Singh, Carsten Eickhoff
  • Jo.E (Joint Evaluation): A Multi-Agent Collaborative Framework for Comprehensive AI Safety Evaluation of Language Models
    Himanshu Joshi

Session 3

  • Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment
    Vanya Bannihatti Kumar, Divyanshu Goyal, Akhil Eppa, Neel Bhandari
  • Inverse Language Modeling towards Robust and Grounded LLMs
    Davide Gabrielli, Simone Sestito, Iacopo Masi
  • Advancing NLP Equity: A Secondary Benchmark Evaluation of Multilingual Language Models for Underrepresented Languages
    Md Muntaqim Meherab, SALMAN, Md. Maruf Billah, Kazi Shakkhar Rahman, Liza Sharmin, Tanvirul Islam, Z N M Zarif Mahmud, Nuruzzaman Faruqui, Sheak Rashed Haider Noori, Touhid Bhuiyan
  • CultureManip: A Novel Benchmark for Mental Manipulation Detection Across Multilingual Settings
    JingFeng Liang, Joshua Casuga, Austin Chen, Lang Xiong, Kevin Zhu
  • Beyond Static Leaderboards: A Roadmap to Naturalistic, Functional Evaluation of LLMs
    Victor Ojewale, Suresh Venkatasubramanian
  • Why It Failed: A Benchmark to Evaluate Interpretability
    Joel Mathew, Aditya Lagu, Anthony Tang, Prudhviraj Naidu
  • Not Funny Anymore: LLM Judges Confuse Literal Similarity for Humor in Translated Jokes
    Fabricio Rivera, Rohit Pochugari, Tessa Chan, Devansh Katakwar, Kevin Zhu, Michael Saxon
  • Multilingual Evaluation of Human vs. AI Text Classification with Zero-Shot Analysis of Contemporary LLM Architectures
    Pranamya Nilesh Deshpande, Raj Dandekar, Rajat Dandekar, Sreedath Panat
  • OCER and OCWER: Integrating Visual Similarity and Segmentation in OCR Evaluation
    Samy Ouzerrout

Session 4

  • VLM-guided Object-level Segmentation from Dynamic Scene
    Feiran Yang
  • PeerCoPilot: A Language Model-Powered Assistant for Behavioral Health Organizations
    Gao Mo, Naveen Janaki Raman, Megan Chai, Cindy Peng, Shannon Pagdon, Nev Jones, Hong Shen, Margaret Swarbrick, Fei Fang
  • The Resonance Corpus: Chinese Caregiver-Child Dialogue for Community-Aligned Language Models
    Lingqin Meng, Yang Gao, Zhongzhi Huo, Stella Christie
  • Scribes, Scripts, and Scarcity: Re-thinking Benchmarking for Arabic-Script Handwritten Text Recognition in Historical Manuscript Traditions
    Yuanhao Zou
  • RuSignBot: Russian Sign Language Synthesis via Customized MimicMotion
    Daria Bondarenko, Emilia Bojarskaja, Maxim Novopoltsev, Aleksandr Tulenkov, Ruslan Murtazin, Iuliia Zemtsova, Ilya Makarov, Andrey Savchenko
  • Beyond Monolithic Culture: Evaluating Understandability of Online Text Across Cultural Dimensions
    Saurabh Kumar Pandey, Harshit Gupta, Sougata Saha, Monojit Choudhury
  • CAMA: A Culturally Adaptive Multi-Agent Framework for Postpartum Depression Support in Multilingual and Low-Resource Settings
    Zhiqi Zhang, Ziyi LIU, rite Bo
  • CESLR: A Multi-Signer Benchmark and SpatioTemporal End-to-End Framework for Continuous Ethiopian Sign Language Recognition
    Anteneh Yehalem Tegegne, Yohannes Ayana Ejigu, Surafel Amsalu