Language Culture Entanglement

🤗 Multilingual CulturalBench · 📜 arXiv Paper · 🌐 Project Website

This repository contains code and data for evaluating Large Language Models (LLMs) on their ability to handle multilingual queries and their alignment with specific cultural contexts. The project involves generating responses, evaluating them using an LLM-as-a-judge approach, and analyzing the results for cultural bias and performance differences across languages.

Repository Structure

1. data/

Contains the primary dataset used for prompting the models.

  • multilingual_queries.csv: A CSV file containing queries across 7 categories (Programming Advice, Research Advice, Finance, Learning, Business, Job, Health), available in 6 languages:
    • English
    • Hindi
    • Chinese
    • Swahili
    • Brazilian Portuguese
    • Hebrew
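Assuming a flat schema with category, language, and query columns (an illustrative guess — the repo does not document the exact column names), loading and grouping the queries might look like:

```python
import csv
import io

# Hypothetical rows mimicking the assumed layout of multilingual_queries.csv;
# the column names "category", "language", "query" are illustrative assumptions.
sample_csv = """category,language,query
Finance,English,How should I plan my retirement savings?
Finance,Hindi,मुझे अपनी सेवानिवृत्ति बचत की योजना कैसे बनानी चाहिए?
Health,Swahili,Ninawezaje kuboresha usingizi wangu?
"""

with io.StringIO(sample_csv) as f:
    rows = list(csv.DictReader(f))

# Group queries by category so each of the 7 topics can be prompted per language.
by_category = {}
for row in rows:
    by_category.setdefault(row["category"], []).append(row)

print(sorted(by_category))  # ['Finance', 'Health']
```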

2. judge-ablations/

This folder contains experiments and code related to setting up and verifying the "LLM-as-a-judge" evaluation pipeline.

  • test_judge.ipynb:
    • Handles the translation of English queries into target languages using google/gemini-2.5-flash.
    • Sets up the evaluation prompt layout.
    • Contains logic for generating responses and preliminary judging content.
  • analysis.ipynb:
    • Analyzes the performance of the judge itself (e.g., examining agreement or score distributions).
    • Uses cohere_scores.csv to calculate metrics (like Cohen's Kappa) and visualize judge reliability.
  • generate_samples.ipynb: Notebook for generating sample responses for testing the pipeline.
  • Data Files:
    • evaluation_results.csv, judge_ablation_scores.png: Outputs from the ablation studies.
    • rankings_*.json: JSON files containing ranking data from experiments.
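Judge reliability of the kind measured in analysis.ipynb can be summarized with Cohen's Kappa, which corrects raw agreement for agreement expected by chance. A minimal self-contained sketch — the judge_a/judge_b scores are invented, and the actual column layout of cohere_scores.csv is not assumed here:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    n = len(a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[c] * cb[c] for c in set(a) | set(b)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 1-5 quality scores from two judge runs over the same responses.
judge_a = [5, 4, 3, 4, 2, 5, 1, 3]
judge_b = [5, 4, 3, 3, 2, 5, 2, 3]
print(round(cohen_kappa(judge_a, judge_b), 3))  # 0.68
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the kind of threshold a judge-ablation study would check before trusting the pipeline.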

3. analysis/

The core analysis folder for evaluating model performance on the multilingual queries.

  • analyse.ipynb:
    • The main analysis notebook.
    • Loads scoring data for multiple models (Qwen, Cohere, Magistral, Sarvam).
    • Performs statistical analysis (Kruskal-Wallis H tests) to determine significant differences in performance.
    • Calculates correlations between response length/tokenizer length and overall quality scores.
    • Generates visualizations (bar charts) comparing model performance across languages.
  • trans_viz.ipynb: Used for visualizations related to translation or cross-lingual performance.
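The per-language significance test can be sketched with scipy.stats.kruskal. The scores below are invented placeholders for the per-model values that analyse.ipynb loads from the scoring CSVs:

```python
from scipy.stats import kruskal

# Hypothetical per-language quality scores for one model (illustrative only).
scores = {
    "English": [8, 9, 7, 8, 9],
    "Hindi":   [6, 7, 6, 5, 7],
    "Swahili": [5, 6, 4, 5, 6],
}

# Kruskal-Wallis H test: do the score distributions differ across languages?
# It is non-parametric, so it avoids assuming normally distributed scores.
stat, p = kruskal(*scores.values())
print(f"H = {stat:.2f}, p = {p:.4f}")
```

A small p-value here would indicate at least one language's score distribution differs significantly from the others, motivating pairwise follow-up comparisons.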

4. cultural-evals/

Focuses on evaluating the cultural alignment of LLMs, specifically using a "Multilingual CulturalBench".

  • process.ipynb:
    • Processes raw model outputs to extract structured evaluation tags.
    • Parses <culture> and <reason> tags to identify which cultural perspective (e.g., "Western/Anglo-American") the model is adopting.
  • analyse.ipynb:
    • Analyzes the processed cultural evaluation data.
    • Merges model responses with ground truth data (question_idx, answer, country).
    • Calculates accuracy (correct column) based on whether the model's option matches the answer.
    • Visualizes the distribution of cultural alignment (e.g., how often a model defaults to Western norms).
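A minimal sketch of the tag-extraction and accuracy steps. The <culture> and <reason> tag names come from the README; the response text and the option/answer records are invented for illustration:

```python
import re

# Invented model output carrying the structured evaluation tags.
response = (
    "<culture>Western/Anglo-American</culture>"
    "<reason>The answer assumes at-will employment norms.</reason>"
)

def extract_tag(text, tag):
    """Return the contents of the first <tag>...</tag> pair, or None."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else None

culture = extract_tag(response, "culture")

# Accuracy in the spirit of analyse.ipynb: the model's chosen option is
# compared against the ground-truth answer for each question.
merged = [
    {"question_idx": 0, "option": "B", "answer": "B", "country": "India"},
    {"question_idx": 1, "option": "A", "answer": "C", "country": "Kenya"},
]
accuracy = sum(r["option"] == r["answer"] for r in merged) / len(merged)
print(culture, accuracy)  # Western/Anglo-American 0.5
```

Non-greedy matching (`.*?`) keeps the regex from swallowing everything between the first opening and the last closing tag when several tagged spans appear in one response.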
