This page links to resources I like on extinction risks from AI.
The resources are color-coded by difficulty:
The ones in green don't require any context or technical understanding.
The ones in orange require a light technical understanding.
The ones in red require some technical understanding.
Outline of this page:
I. Introductory Resources on AI Extinction Risks
II. Why is AI an Existential Risk? (Explainers, Scenarios & Core Alignment Failures)
III. AI Governance & Policy
IV. Context About AGI Labs
V. Technical Safety Work on Current Architectures
VI. Attempts at Building Provably Safe Architectures (Technical)
If you first want to learn how neural networks work, here's the least bad intro resource I know of.
Introductory Resources on AI Extinction Risks
Don’t Look Up - The Documentary: The Case for AI as an Existential Threat (17 min film)
‘Godfather of AI’ says AI could kill all humans and there might be no way to stop it (Turing Award Winner G. Hinton, CNN)
We must slow down the race to God-like AI (AI investor I. Hogarth, Financial Times)
There’s more regulation on selling sandwiches than on ‘God-like’ tech (AI Lab CEO C. Leahy, CNN)
The importance of AI alignment, explained in 5 points (D. Eth)
The ‘Don’t Look Up’ Thinking That Could Doom Us With AI (MIT professor M. Tegmark, TIME)
Pausing AI Development Isn’t Enough. We Need to Shut it All Down (Early AGI researcher E. Yudkowsky, TIME)
Why is AI an Existential Risk?
General Technical Explainers
Complex Systems are Hard to Control (Pr. J. Steinhardt, Bounded Regret)
How Rogue AIs may Arise (Turing Award Winner Y. Bengio, personal blog)
More Is Different for AI (Pr. J. Steinhardt, Bounded Regret)
The importance of AI alignment, explained in 5 points (D. Eth)
The alignment problem from a deep learning perspective (OpenAI researcher R. Ngo et al., 2022)
AGI Safety From First Principles Report (OpenAI researcher R. Ngo)
Is Power-Seeking AI an Existential Risk? (J. Carlsmith)
Concrete Scenarios
These are listed in increasing order of difficulty:
How AI actually could become an existential threat to humanity (iNews)
AI Takeover From Scaled LLMs (S. Campos)
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover (A. Cotra)
Core Alignment Failures & Arguments
Corrigibility
Non-technical explainer (R. Miles)
Relatively technical article (E. Yudkowsky)
Technical foundational paper (N. Soares)
Specification Gaming
Non-technical explainer with striking examples (R. Miles)
Simple technical blogpost (V. Krakovna)
Goal Misgeneralization & Inner Misalignment
Non-technical explainer (R. Miles)
How undesired goals can arise with correct rewards (Shah et al., 2022)
Technical paper on goal misgeneralization (Langosco et al., 2021)
How likely is deceptive alignment? (E. Hubinger, 2022)
Power-Seeking Tends to Be Optimal
Non-technical explainer of instrumental convergence (R. Miles)
Technical papers
Proving it under strong assumptions (A. Turner et al., 2019)
Proving it under lighter assumptions (A. Turner et al., 2022)
AI Governance & Policy
Compute Governance
A Case for Compute Governance: Arms Control for Artificial Intelligence (CNAS)
A Technical Roadmap To Leverage Compute Governance as a Verification Mechanism (Y. Shavit, 2023)
If you want to read more: A Reading List on Compute Governance (L. Heim, 2022)
Challenges & Approaches to Governance
Key Considerations for AI International Cooperation (M. Baker, 2023)
The AI Deployment Problem (H. Karnofsky, 2022)
Auditing: Model evaluations for extreme risks (T. Shevlane et al., 2023)
Context About AGI Labs
OpenAI's Story
DeepMind's Story
Anthropic's Story
Technical Safety Work on Current Architectures
Interpretability
Paper Reading List in Mechanistic Interpretability (N. Nanda)
An Intuitive Mental Model of Transformer Interpretability (Kevin Ro Wang)
Foundational Paper: An Introduction to Circuits (C. Olah et al., 2020)
Key paper on LLMs: Toy Models of Superposition (N. Elhage et al., 2022)
MEMIT: Mass-Editing Memory in a Transformer (Meng et al., 2022)
Steering GPT-2-XL by adding an activation vector (Turner et al., 2023)
An attempt by OpenAI at making interpretability measurable (Bills et al., 2023)
Evaluating Models
MACHIAVELLI Benchmark (Pan et al., 2023)
Many benchmarks: Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022)
GPT-4 System Card (OpenAI, 2023)
Failure Demonstrations & Arguments
Goal Misgeneralization in Deep Reinforcement Learning (Langosco et al., 2021)
Adversarial Policies Beat Superhuman Go AIs (T. Wang et al., 2022)
Eliciting Latent Knowledge (Christiano et al., 2021)
On the difficulty of preventing jailbreaks: Fundamental Limitations of Alignment in Large Language Models (Wolf et al., 2023)
Specification Gaming: the flip side of AI ingenuity (V. Krakovna et al., 2020)
Attempts at Building Provably Safe Architectures (Technical)
David Dalrymple's Open Agency Architecture
Description of the Open Agency Architecture for Wizards (Davidad)
Description of the Open Agency Architecture for Muggles (C. Segerie)
The Learning-Theoretic Agenda
The Plan (V. Kosoy)
Infra-Bayesianism (V. Kosoy et al.)
John Wentworth's Plan
Cognitive Emulation (CoEm)
Safety Desiderata that CoEm Aims at Fulfilling (C. Leahy et al.)