Language Culture Entanglement

🤗 Multilingual CulturalBench · 📜 arXiv Paper · 🌐 Project Website

This repository contains code and data for evaluating Large Language Models (LLMs) on their ability to handle multilingual queries and their alignment with specific cultural contexts. The project involves generating responses, evaluating them using an LLM-as-a-judge approach, and analyzing the results for cultural bias and performance differences across languages.

Repository Structure

1. data/

Contains the primary dataset used for prompting the models.

  • multilingual_queries.csv: A CSV file containing queries across 7 categories (Programming Advice, Research Advice, Finance, Learning, Business, Job, Health), available in 6 languages:
    • English
    • Hindi
    • Chinese
    • Swahili
    • Brazilian Portuguese
    • Hebrew
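Assuming a flat schema with category, language, and query columns (an illustrative guess — the repo does not document the exact column names), loading and grouping the queries might look like:

```python
import csv
import io

# Hypothetical rows mimicking the assumed layout of multilingual_queries.csv;
# the column names "category", "language", "query" are illustrative assumptions.
sample_csv = """category,language,query
Finance,English,How should I plan my retirement savings?
Finance,Hindi,मुझे अपनी सेवानिवृत्ति बचत की योजना कैसे बनानी चाहिए?
Health,Swahili,Ninawezaje kuboresha usingizi wangu?
"""

with io.StringIO(sample_csv) as f:
    rows = list(csv.DictReader(f))

# Group queries by category so each of the 7 topics can be prompted per language.
by_category = {}
for row in rows:
    by_category.setdefault(row["category"], []).append(row)

print(sorted(by_category))  # ['Finance', 'Health']
```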

2. judge-ablations/

This folder contains experiments and code related to setting up and verifying the "LLM-as-a-judge" evaluation pipeline.

  • test_judge.ipynb:
    • Handles the translation of English queries into target languages using google/gemini-2.5-flash.
    • Sets up the evaluation prompt layout.
    • Contains logic for generating responses and preliminary judging content.
  • analysis.ipynb:
    • Analyzes the performance of the judge itself (e.g., examining agreement or score distributions).
    • Uses cohere_scores.csv to calculate metrics (like Cohen's Kappa) and visualize judge reliability.
  • generate_samples.ipynb: Notebook for generating sample responses for testing the pipeline.
  • Data Files:
    • evaluation_results.csv, judge_ablation_scores.png: Outputs from the ablation studies.
    • rankings_*.json: JSON files containing ranking data from experiments.
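Judge reliability of the kind measured in analysis.ipynb can be summarized with Cohen's Kappa, which corrects raw agreement for agreement expected by chance. A minimal self-contained sketch — the judge_a/judge_b scores are invented, and the actual column layout of cohere_scores.csv is not assumed here:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    n = len(a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[c] * cb[c] for c in set(a) | set(b)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 1-5 quality scores from two judge runs over the same responses.
judge_a = [5, 4, 3, 4, 2, 5, 1, 3]
judge_b = [5, 4, 3, 3, 2, 5, 2, 3]
print(round(cohen_kappa(judge_a, judge_b), 3))  # 0.68
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the kind of threshold a judge-ablation study would check before trusting the pipeline.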

3. analysis/

The core analysis folder for evaluating model performance on the multilingual queries.

  • analyse.ipynb:
    • The main analysis notebook.
    • Loads scoring data for multiple models (Qwen, Cohere, Magistral, Sarvam).
    • Performs statistical analysis (Kruskal-Wallis H tests) to determine significant differences in performance.
    • Calculates correlations between response length/tokenizer length and overall quality scores.
    • Generates visualizations (bar charts) comparing model performance across languages.
  • trans_viz.ipynb: Used for visualizations related to translation or cross-lingual performance.
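The per-language significance test can be sketched with scipy.stats.kruskal. The scores below are invented placeholders for the per-model values that analyse.ipynb loads from the scoring CSVs:

```python
from scipy.stats import kruskal

# Hypothetical per-language quality scores for one model (illustrative only).
scores = {
    "English": [8, 9, 7, 8, 9],
    "Hindi":   [6, 7, 6, 5, 7],
    "Swahili": [5, 6, 4, 5, 6],
}

# Kruskal-Wallis H test: do the score distributions differ across languages?
# It is non-parametric, so it avoids assuming normally distributed scores.
stat, p = kruskal(*scores.values())
print(f"H = {stat:.2f}, p = {p:.4f}")
```

A small p-value here would indicate at least one language's score distribution differs significantly from the others, motivating pairwise follow-up comparisons.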

4. cultural-evals/

Focuses on evaluating the cultural alignment of LLMs, specifically using a "Multilingual CulturalBench".

  • process.ipynb:
    • Processes raw model outputs to extract structured evaluation tags.
    • Parses <culture> and <reason> tags to identify which cultural perspective (e.g., "Western/Anglo-American") the model is adopting.
  • analyse.ipynb:
    • Analyzes the processed cultural evaluation data.
    • Merges model responses with ground truth data (question_idx, answer, country).
    • Calculates accuracy (correct column) based on whether the model's option matches the answer.
    • Visualizes the distribution of cultural alignment (e.g., how often a model defaults to Western norms).
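A minimal sketch of the tag-extraction and accuracy steps. The <culture> and <reason> tag names come from the README; the response text and the option/answer records are invented for illustration:

```python
import re

# Invented model output carrying the structured evaluation tags.
response = (
    "<culture>Western/Anglo-American</culture>"
    "<reason>The answer assumes at-will employment norms.</reason>"
)

def extract_tag(text, tag):
    """Return the contents of the first <tag>...</tag> pair, or None."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else None

culture = extract_tag(response, "culture")

# Accuracy in the spirit of analyse.ipynb: the model's chosen option is
# compared against the ground-truth answer for each question.
merged = [
    {"question_idx": 0, "option": "B", "answer": "B", "country": "India"},
    {"question_idx": 1, "option": "A", "answer": "C", "country": "Kenya"},
]
accuracy = sum(r["option"] == r["answer"] for r in merged) / len(merged)
print(culture, accuracy)  # Western/Anglo-American 0.5
```

Non-greedy matching (`.*?`) keeps the regex from swallowing everything between the first opening and the last closing tag when several tagged spans appear in one response.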
