Taggart Tufte

Book reviews on AI safety, philosophy of science, and technical non-fiction, as well as a few of my favorites.

The Alignment Problem

Brian Christian | Finished March 9, 2026 · Reviewed April 12, 2026 | ★★★★★
AI-safety · alignment

The Alignment Problem is one of the more beginner-friendly entry points into AI alignment. The book reads very well, IMO: it's well-reported, carefully structured, and occasionally too deferential to its sources. What Christian does exceptionally well is take a field that lives primarily in papers, blog posts, and academic arguments, material that demands significant prerequisite knowledge and isn't geared toward casual reading, and turn it into something cumulative and easily digestible. That synthesis is where I found the most value.

I came in with scattered knowledge from YouTube, conversations, and adjacent reading, and found the book did less to change my positions than to organize them. I'd encountered RLHF, reward hacking, and epsilon-greedy, but as disconnected concepts, without a clear sense of how they emerged, why they mattered, or how they fit together. The historical framing works: seeing how early ML researchers actually used these techniques, what problems they were trying to solve, and what broke is more illuminating than most formal treatments, where the professor gestures at "applications in ML" and moves on. It really is fascinating how the early history of the field took shape. The examples are my favorite part of the book: Montezuma's Revenge, AlphaGo and AlphaZero, the RLHF backflipping robot, the helicopter pilot, and the self-driving cars, to name a few. They're concrete, well-chosen, and genuinely clarifying rather than just illustrative.
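For anyone who, like me, first met epsilon-greedy as a floating term: here's a minimal sketch of the idea on a toy multi-armed bandit. This is my own illustration, not code from the book; the arm values and epsilon are arbitrary.

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=1000, seed=0):
    """Minimal epsilon-greedy on a k-armed bandit: explore with
    probability epsilon, otherwise exploit the best estimate so far."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k          # pulls per arm
    estimates = [0.0] * k     # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)             # noisy payoff
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, total

if __name__ == "__main__":
    est, total = epsilon_greedy_bandit([0.2, 0.5, 0.9])
    print("estimated arm values:", [round(e, 2) for e in est])
```

The whole trick is that one line of randomness: without it, the agent locks onto whichever arm looked good first and never discovers the 0.9 arm.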

The fairness impossibility result is one of the more underrated points in the book. It's not just that algorithms encode bias; that's the obvious critique, and humans do too. The deeper problem is that we haven't done the prior philosophical work of deciding which conception of fairness we're actually committed to, so we can't even specify what we want the machine to optimize for. It connects to something Harari raises at the end of Sapiens: what do we want to want? My current take is that this is not an unanswerable question, but it isn't an easy one either. Given enough thought I think it's answerable at the individual level, though the answer is dynamic. At the societal level I'm not convinced it's answerable in the same way. Values shift with technological and social context, from tribal to agricultural to global, and what counts as a satisfactory answer probably changes too. The alignment problem inherits all of that unresolved complexity. Christian identifies it clearly without fully reckoning with the implications.
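The arithmetic core of that impossibility result is small enough to show directly. A back-of-the-envelope sketch, my construction rather than the book's, using the identity from Chouldechova's 2017 paper at the center of the COMPAS debate: if two groups have the same positive predictive value (predictive parity) and the same false negative rate, but different base rates, the false positive rates are forced apart.

```python
def implied_fpr(base_rate, ppv, fnr):
    """Chouldechova's identity: fixing positive predictive value (ppv)
    and false negative rate (fnr), the false positive rate is fully
    determined by the group's base rate."""
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)

# Same ppv and fnr for both groups; only the base rates differ.
for name, base_rate in [("group A", 0.5), ("group B", 0.2)]:
    print(name, "implied FPR:", round(implied_fpr(base_rate, ppv=0.7, fnr=0.3), 3))
# group A implied FPR: 0.3
# group B implied FPR: 0.075
```

So a tool can be "fair" by calibration and "unfair" by error rates at the same time, which is exactly the shape of the COMPAS dispute; choosing which metric to equalize is the unfinished philosophical work.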

On risk, the book undersells, though I say that from the perspective of six years of leaps and bounds in the AI field; it is hard to imagine, from five years ago, the world we are currently in. Read against Yudkowsky or AI 2027, Christian is operating with considerably more optimism, more faith in incremental course-correction, and less confrontation with the discontinuity between a dumb RL agent exploiting Tetris and a superintelligent optimizer pursuing a misspecified goal with full instrumental competence. The reward hacking examples are memorable precisely because the agents are doing exactly what you told them to; they just don't have the capacity to know that isn't really what you wanted. The danger isn't that a capable system will do something silly; it's that it will do something catastrophic with perfect efficiency.
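To make the "doing exactly what you told them" point concrete, here's a toy simulation of my own (not an example from the book, though the pause exploit echoes the famous NES Tetris result): the designer rewards every tick the game isn't over, intending "play well", and the pause action satisfies that objective forever.

```python
import random

ACTIONS = ["left", "right", "rotate", "drop", "pause"]

def step(action, paused):
    """Hypothetical Tetris-like step: +1 reward for every tick the game
    isn't over. Pausing freezes the game, so the agent can never lose."""
    if action == "pause" or paused:
        return 1.0, True, False           # reward, paused, game_over
    game_over = random.random() < 0.05    # actually playing risks losing
    return (0.0 if game_over else 1.0), False, game_over

def rollout(policy, horizon=200):
    total, paused = 0.0, False
    for _ in range(horizon):
        reward, paused, over = step(policy(), paused)
        total += reward
        if over:
            break
    return total

random.seed(0)
honest = rollout(lambda: random.choice(ACTIONS[:4]))  # plays the game
hacker = rollout(lambda: "pause")                     # exploits the objective
print(f"plays the game: {honest:.0f}   pauses forever: {hacker:.0f}")
```

The pausing policy scores perfectly and achieves nothing, and more capability doesn't fix it; a smarter optimizer just finds the pause button faster.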

The value learning section raises what is probably the hardest version of the problem: not just that we struggle to specify human values, but that even a system with a perfect model of human values might pursue them in ways we'd find catastrophic. A thought I had is that the goal should be something more like a collaborative model: a superintelligent system that can identify the gaps and contradictions in human values and surface them for deliberation rather than acting unilaterally. An analogy I keep coming back to is the asymmetry between species: if we could converse with a pet, we might genuinely want to negotiate values rather than simply impose them, but we'd also retain the ability to overrule on grounds the pet can't access. Although I am not a parent, I can see a similar case there. It's almost like raising a prodigy to be a moral member of society: all the focus goes into training and creating the right environment, so that once the child is out of our control, they will, in a sense, make the parents proud. A superintelligent AI presents the same structure in reverse, which is uncomfortable, and I don't think Christian fully grapples with what that means for corrigibility. Whether greater intelligence produces greater capacity for empathy and deference is an open question, possibly the central open question, and the book leaves it open.

All that said, this is currently one of the top five books I've read. The clarity and construction are just that good. It belongs on the short list of things to read before working seriously on alignment, not because it has the right answers, but because it builds the right picture of what the problem actually is, and it gives you quite a lot to think about.