Awesome AI Safety & Alignment

A curated list of research, frameworks, tools, evaluations, and resources focused on AI alignment, safety, robustness, red-teaming, model governance, and responsible development.

Support ongoing maintenance and curation via GitHub Sponsors.

Research Organizations

Alignment Research Center (ARC) – Research on scalable oversight and model evaluations.
AI Safety Center (UK) – Government-backed safety and model evaluation initiatives.
OpenAI Safety – Research on robustness, red-teaming, and alignment.
Anthropic Safety – Safety teams working on interpretability and frontier model evaluations.
DeepMind Safety Research – Research into scalable oversight, alignment, robustness.
Center for AI Safety (CAIS) – Public safety education, benchmarks, and policy guidance.
ELEUTHERAI – Open-source AI research with safety-driven initiatives.

Safety Frameworks

OpenAI Model Spec – Specifications defining expected safe model behavior.
Anthropic Constitutional AI – Framework for training models using rule-based constitutional constraints.
Google Responsible AI Practices – Principles and frameworks for safe AI development.
OECD AI Principles – International standards for trustworthy AI.
NIST AI Risk Management Framework – U.S. national standard for assessing AI risk.
EU AI Act Summary – Regulatory framework for high-risk and general-purpose AI systems.

Red Teaming & Threat Modeling

OpenAI Red Teaming Network – Global research collaboration for model evaluations.
Anthropic Red Teaming Resources – Safety-focused adversarial testing methods.
Microsoft AI Red Team – Methodologies for security and safety testing of AI systems.
AI Safety Threat Modeling – Community tools and docs for threat analysis.
LLM Jailbreak Prompts Datasets – Collections of adversarial prompts for robustness testing.

Evaluation & Benchmarks

HELM – Holistic evaluation of language models across safety and risk domains.
Anthropic Evaluations – Safety evaluations for frontier models.
OpenAI Evals – Framework for testing model safety, reasoning, and reliability.
Red Teaming Benchmarks – Community-driven safety evaluations.
ToxiGen – Datasets for evaluating harmful or toxic outputs.
SafetyBench – Benchmark framework for AI safety scenarios.

Model Governance & Policy

AI Safety Institute (UK) – International coordination on frontier model safety testing.
AI Safety Institute (US) – U.S. policy, evaluations, and governance efforts.
OECD AI Governance Hub – Regulatory and policy resources for AI alignment.
UNESCO AI Ethics Framework – Global normative framework for ethical AI.
Global AI Safety Summits – Agreements and charters from global model safety gatherings.

Datasets

JailbreakBench – Evaluation dataset for jailbreak susceptibility.
HarmBench – Multi-domain dataset for AI harm classification and safety testing.
RealToxicityPrompts – Adversarial or harmful prompts used in robustness evaluation.
AdvBench – Dataset for adversarial attacks and safety testing.

Learning Resources

AI Alignment Fundamentals (BlueDot) – Intro curriculum for alignment.
AGI Safety Fundamentals – Structured course on alignment, safety, and governance.
OpenAI Safety Papers – Research papers on alignment and model evaluations.
Anthropic Interpretability Research – Papers and findings on model internals.
DeepMind Safety Papers – Research into oversight, robustness, and alignment.
CAIS Safety Curriculum – Intro and advanced learning pathways.

Related Awesome Lists

Contribute

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.

Pull requests that do not adhere to the contribution guidelines may be closed.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.github		.github
.editorconfig		.editorconfig
.gitattributes		.gitattributes
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
README.md		README.md
check_readme_links.py		check_readme_links.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Awesome AI Safety & Alignment

Contents

Research Organizations

Safety Frameworks

Red Teaming & Threat Modeling

Evaluation & Benchmarks

Model Governance & Policy

Datasets

Learning Resources

Related Awesome Lists

Contribute

License

About

Uh oh!

Releases 1

Sponsor this project

Uh oh!

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Awesome AI Safety & Alignment

Contents

Research Organizations

Safety Frameworks

Red Teaming & Threat Modeling

Evaluation & Benchmarks

Model Governance & Policy

Datasets

Learning Resources

Related Awesome Lists

Contribute

License

About

Topics

Resources

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages