Implementation of our EMNLP 2025 main conference paper,
“Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts.”
This project extends an earlier version of this codebase.
This repository extracts epistemic elements from paper introductions, computes semantic/uncertainty-based scores, ranks research framings with rule-based scoring, and runs an LLM-based framing classifier.
`data/` layout:

```
data/
  fact-checking_val.json       # AFC annotated papers with processed abstract + introduction (val)
  fact-checking_test.json      # AFC annotated papers with processed abstract + introduction (test)
  hate_speech_data.json        # HS annotated papers with processed abstract + introduction
  automated-afc-analysis.json  # Processed abstract + introduction for our automated analysis
```
`src/` layout:

```
src/
  configs/    # YAML configs per domain
  elements/   # Epistemic-element generation + semantic entropy
  framing/    # Narrative ranking (rules) + narrative classification (LLM)
  inference/  # LLM wrapper for Gemini
  utils/      # Shared helpers (elements + framing)
```
Supported domains (YAML names in `src/configs`):

- `fact-checking`
- `hate-speech`
```
conda create -n nlp-framing-analysis python=3.12.1 -y
conda activate nlp-framing-analysis
pip install -r requirements.txt
```

The LLM wrapper uses Google Gemini via `google-generativeai`.
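Under the hood, the wrapper presumably uses the standard `google-generativeai` client along the lines of the sketch below. This is an illustration of the client API only, not the repository's actual wrapper code in `src/inference`:

```python
# Minimal sketch of google-generativeai usage; illustrative only,
# not the repository's wrapper (src/inference) itself.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

response = model.generate_content(
    "Extract the stated goals from this paper introduction: ...",
    generation_config=genai.types.GenerationConfig(temperature=1.0),
)
print(response.text)
```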
```
export GEMINI_API_KEY="YOUR_KEY_HERE"
```

```
python -m src.elements.run_inference \
    --domain fact-checking \
    --dataset_path data/fact-checking_test.json \
    --num_generations 10 \
    --temperature 1.0 \
    --model gemini-2.0-flash
```

src.elements.run_inference — Arguments
| Argument | Type | Description |
|---|---|---|
| `--domain` | str | Name of config in `src/configs` (without `.yaml`). Task domain (`fact-checking` or `hate-speech`). |
| `--dataset_path` | str | Path to the JSON dataset to analyze. |
| `--use_assoc_labels` | bool | If set, only consider paragraphs annotated as containing research-framing information. |
| `--num_generations` | int | Number of generations per input from the LLM. |
| `--temperature` | float | LLM sampling temperature. |
| `--strict_entailment` | bool | If set, keep only high-confidence entailments during clustering. |
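For background on `--strict_entailment`: the elements stage derives its uncertainty scores via semantic-entropy-style clustering, grouping generations that mutually entail one another and taking the entropy over cluster mass. A minimal sketch of the idea, where the `entails` check and the data shapes are assumptions rather than the repository's API:

```python
import math

# Sketch of semantic-entropy-style clustering: generations that mutually
# entail each other share a cluster; entropy is taken over cluster mass.

def cluster_generations(generations, entails):
    """entails(a, b) -> bool is a hypothetical entailment check."""
    clusters = []
    for gen in generations:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's representative
            if entails(gen, rep) and entails(rep, gen):
                cluster.append(gen)
                break
        else:
            clusters.append([gen])  # no mutual entailment: start a new cluster
    return clusters

def semantic_entropy(clusters, prob):
    """prob maps a generation to its sequence probability (exp of avg logprob)."""
    mass = [sum(prob[g] for g in c) for c in clusters]
    total = sum(mass)
    return -sum((m / total) * math.log(m / total) for m in mass if m > 0)
```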
Outputs in `outputs/` (filenames derived from the dataset filename):

- `*_generations.pkl` — raw LLM generations and average logprobs per query
- `*_mrrs.txt` — element-level metrics (filtered MRR)
- `*_generation_scores.txt` — JSON of per-paper, per-element scores (used for ranking and as hints to classification)
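To peek at these files, something like the following should work; the base path and the internal structure of each file are assumptions based on the descriptions above:

```python
import json
import pickle

# Sketch: inspect stage-1 outputs. The base name derives from the dataset
# filename; adjust the path to your run.
base = "outputs/elements/fact-checking_test"

with open(f"{base}_generations.pkl", "rb") as f:
    generations = pickle.load(f)  # raw generations + avg logprobs per query

with open(f"{base}_generation_scores.txt") as f:
    scores = json.load(f)  # per-paper, per-element scores

print(f"{len(scores)} papers scored")
```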
This stage takes the epistemic-element scores JSON and computes research-framing scores per paper using semi-automatically inferred domain rules.
```
python -m src.framing.ranking \
    --input_file outputs/elements/<dataset_base>_generation_scores.txt \
    --output_path ranking_results \
    --gold_path data/fact-checking_test.json \
    --domain fact-checking
```
src.framing.ranking — Arguments
| Argument | Type | Description |
|---|---|---|
| `--input_file` | str | Path to the JSON file with epistemic-element predictions. |
| `--gold_path` | str | Path to human-annotated gold data (used for evaluation). |
| `--output_path` | str | Output stem under `outputs/` where results are saved. |
| `--use_gold_rankings` | bool | Whether to use gold narrative rankings instead of model predictions. |
| `--domain` | str | Task domain (`fact-checking` or `hate-speech`). |
Outputs:

- `*.json` — narrative scores per paper
- `*.txt` — filtered MRR summary (overall + per gold label, if gold is provided)
- `*_aggregated.json` — aggregated epistemic-element scores used for ranking
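For reference, MRR here is the mean reciprocal rank of the gold framing within each paper's predicted ranking. A minimal sketch of the metric follows; the data shapes are illustrative, and restricting evaluation to papers with gold labels is our guess at what "filtered" means:

```python
# Sketch: mean reciprocal rank (MRR) over papers with gold labels.
# `predictions`: paper id -> framings sorted best-first;
# `gold`: paper id -> gold framing label. Both shapes are assumptions.

def filtered_mrr(predictions, gold):
    reciprocal_ranks = []
    for paper_id, ranking in predictions.items():
        if paper_id not in gold:
            continue  # only score papers that have a gold label
        rank = ranking.index(gold[paper_id]) + 1  # 1-based position
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

preds = {"p1": ["scientific curiosity", "social good"]}
print(filtered_mrr(preds, {"p1": "social good"}))  # 1/2 = 0.5
```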
In this stage, a large language model (LLM) refines narrative predictions by reasoning over the epistemic-element confidence scores and their supporting evidence from the previous stages.
Rather than classifying from scratch, the LLM is prompted with structured justifications summarizing the system’s prior reasoning, alongside paper context and framing definitions.
Set both --system_generations and --paper_scores_path to provide these structured summaries to the LLM.
```
python -m src.framing.classification \
    --dataset_path data/fact-checking_test.json \
    --output_path classification \
    --domain fact-checking \
    --system_generations \
    --paper_scores_path outputs/framing/<ranking_output_path>_aggregated.json \
    --model gemini-2.0-flash \
    --temperature 1.0 \
    --trials 15
```

src.framing.classification — Arguments
| Argument | Type | Description |
|---|---|---|
| `--domain` | str | Task domain (`fact-checking` or `hate-speech`). |
| `--dataset_path` | str | Path to the JSON dataset with paper introductions. |
| `--output_path` | str | Output stem (predictions will append `_{trial}_predictions.json`). |
| `--system_generations` | bool | If set, include ranking-model hints in the prompt. |
| `--paper_scores_path` | str | Path to the epistemic-element scores JSON used to build hints. |
| `--model` | str | LLM model ID (e.g., `gemini-2.0-flash`). |
| `--temperature` | float | Sampling temperature for the LLM. |
| `--trials` | int | Number of runs to perform (used to compute confidence intervals). |
Outputs:

- `*_predictions_<trial>.json` — model-selected labels + reasoning per paper
- Console output shows macro-averaged Precision/Recall/F1 and the average F1 over trials
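The aggregation this implies can be sketched with scikit-learn. How labels are read from the prediction files is an assumption, and the label lists below are toy stand-ins; the metric calls themselves are standard:

```python
# Sketch: macro P/R/F1 per trial, then the mean F1 across trials.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["social good", "scientific curiosity", "social good"]
trials = [
    ["social good", "scientific curiosity", "scientific curiosity"],
    ["social good", "social good", "social good"],
]

f1s = []
for y_pred in trials:
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
    f1s.append(f1)

print("average F1 over trials:", sum(f1s) / len(f1s))
```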
To add a new domain, update the following two components:

Configs: edit or add a YAML file (`*.yaml`) in `src/configs/` to define:
- Epistemic-element questions, labels, and base task templates
- Framing (narrative classification) labels and prompt templates
- System roles, definitions, mappings, and confidence thresholds
Rules: implement domain-specific ranking rules in a new Python file.
Each file defines how epistemic-element scores combine into narrative predictions; a hypothetical skeleton follows below.
Together, these two components (`configs` + `rules`) fully determine how the system operates for a given domain.
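As a starting point, a new rules file might look roughly like this. It is a hypothetical skeleton: the function name, signature, thresholds, and element keys are illustrative (the element names come from the paper's means/ends/stakeholders framing), and the actual interface expected by `src.framing.ranking` may differ:

```python
# Hypothetical skeleton for a new domain's ranking rules. Not the
# interface actually used by src.framing.ranking; adapt as needed.

FRAMINGS = ["social good", "scientific curiosity"]

def score_framings(element_scores: dict[str, float]) -> dict[str, float]:
    """Combine epistemic-element scores into one score per framing."""
    scores = dict.fromkeys(FRAMINGS, 0.0)
    # Example rule: clear stakeholder and end evidence suggests "social good".
    if element_scores.get("stakeholders", 0.0) > 0.5:
        scores["social good"] += element_scores["stakeholders"]
    if element_scores.get("ends", 0.0) > 0.5:
        scores["social good"] += 0.5 * element_scores["ends"]
    # Example rule: means without stakeholders suggests curiosity-driven work.
    if element_scores.get("means", 0.0) > element_scores.get("stakeholders", 0.0):
        scores["scientific curiosity"] += element_scores["means"]
    return scores
```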
If you find this useful, please cite our paper as:
```bibtex
@inproceedings{chamoun-etal-2025-social,
    title = "Social Good or Scientific Curiosity? Uncovering the Research Framing Behind {NLP} Artefacts",
    author = "Chamoun, Eric and
      Ousidhoum, Nedjma and
      Schlichtkrull, Michael Sejr and
      Vlachos, Andreas",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1286/",
    pages = "25310--25346",
    ISBN = "979-8-89176-332-6",
    abstract = "Clarifying the research framing of NLP artefacts (e.g., models, datasets, etc.) is crucial to aligning research with practical applications when researchers claim that their findings have real-world impact. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking using an existing dataset, and hate speech detection for which we annotate a new dataset{---}achieving consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation."
}
```