A curated list of research, frameworks, tools, evaluations, and resources focused on AI alignment, safety, robustness, red-teaming, model governance, and responsible development.
Support ongoing maintenance and curation via GitHub Sponsors.
- Research Organizations
- Safety Frameworks
- Red Teaming & Threat Modeling
- Evaluation & Benchmarks
- Model Governance & Policy
- Datasets
- Learning Resources
- Related Awesome Lists
- Alignment Research Center (ARC) – Research on scalable oversight and model evaluations.
- AI Safety Center (UK) – Government-backed safety and model evaluation initiatives.
- OpenAI Safety – Research on robustness, red-teaming, and alignment.
- Anthropic Safety – Safety teams working on interpretability and frontier model evaluations.
- DeepMind Safety Research – Research into scalable oversight, alignment, robustness.
- Center for AI Safety (CAIS) – Public safety education, benchmarks, and policy guidance.
- ELEUTHERAI – Open-source AI research with safety-driven initiatives.
- OpenAI Model Spec – Specifications defining expected safe model behavior.
- Anthropic Constitutional AI – Framework for training models using rule-based constitutional constraints.
- Google Responsible AI Practices – Principles and frameworks for safe AI development.
- OECD AI Principles – International standards for trustworthy AI.
- NIST AI Risk Management Framework – U.S. national standard for assessing AI risk.
- EU AI Act Summary – Regulatory framework for high-risk and general-purpose AI systems.
- OpenAI Red Teaming Network – Global research collaboration for model evaluations.
- Anthropic Red Teaming Resources – Safety-focused adversarial testing methods.
- Microsoft AI Red Team – Methodologies for security and safety testing of AI systems.
- AI Safety Threat Modeling – Community tools and docs for threat analysis.
- LLM Jailbreak Prompts Datasets – Collections of adversarial prompts for robustness testing.
- HELM – Holistic evaluation of language models across safety and risk domains.
- Anthropic Evaluations – Safety evaluations for frontier models.
- OpenAI Evals – Framework for testing model safety, reasoning, and reliability.
- Red Teaming Benchmarks – Community-driven safety evaluations.
- ToxiGen – Datasets for evaluating harmful or toxic outputs.
- SafetyBench – Benchmark framework for AI safety scenarios.
- AI Safety Institute (UK) – International coordination on frontier model safety testing.
- AI Safety Institute (US) – U.S. policy, evaluations, and governance efforts.
- OECD AI Governance Hub – Regulatory and policy resources for AI alignment.
- UNESCO AI Ethics Framework – Global normative framework for ethical AI.
- Global AI Safety Summits – Agreements and charters from global model safety gatherings.
- JailbreakBench – Evaluation dataset for jailbreak susceptibility.
- HarmBench – Multi-domain dataset for AI harm classification and safety testing.
- RealToxicityPrompts – Adversarial or harmful prompts used in robustness evaluation.
- AdvBench – Dataset for adversarial attacks and safety testing.
- AI Alignment Fundamentals (BlueDot) – Intro curriculum for alignment.
- AGI Safety Fundamentals – Structured course on alignment, safety, and governance.
- OpenAI Safety Papers – Research papers on alignment and model evaluations.
- Anthropic Interpretability Research – Papers and findings on model internals.
- DeepMind Safety Papers – Research into oversight, robustness, and alignment.
- CAIS Safety Curriculum – Intro and advanced learning pathways.
- Awesome AI
- Awesome Machine Learning
- Awesome AI Research Papers
- Awesome AI Ethics
- Awesome Open Governance
Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.
Pull requests that do not adhere to the contribution guidelines may be closed.