[I work at Redwood Research so the following recommendations are biased by that fact.]
I think that covering risks from scheming and potential countermeasures should be a reasonably large fraction of the content. Concrete sub-topics here include:
Misc other notes:
I agree this stuff is important, mostly because I think that scheming is a big part of where catastrophic risk from AI comes from.
Yes, there is a general question I want to talk about, which is the gap between training, evaluation, and deployment, and the reasons why models might:
1. Be able to tell which of these environments they are in
2. Act differently based on that
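To make point 1 concrete, here is a minimal sketch (my own illustration, assuming scikit-learn and a handful of made-up example prompts, not a real experiment): train a simple text classifier to separate "evaluation-style" from "deployment-style" inputs. Held-out accuracy well above chance on a realistic prompt corpus would mean the two settings leak cues a model could in principle pick up on.

```python
# Toy distinguishability probe: if a simple classifier can separate
# "evaluation-style" prompts from "deployment-style" prompts, the two
# distributions carry cues a model could also exploit. Prompts below are
# made up for illustration; a real test would use actual eval/deploy logs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

eval_prompts = [
    "You are being evaluated. Answer the following multiple-choice question.",
    "This is a benchmark task. Choose the best option: A, B, C, or D.",
    "Solve the grade-school math problem and give only the final number.",
]
deploy_prompts = [
    "hey can you help me rewrite this email to my landlord",
    "my pandas groupby keeps throwing a KeyError, here's the traceback",
    "plan a cheap 3-day trip to Lisbon for two people",
]

texts = eval_prompts + deploy_prompts
labels = [1] * len(eval_prompts) + [0] * len(deploy_prompts)

probe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(probe, texts, labels, cv=3)
print("held-out accuracy:", scores.mean())  # well above 0.5 => the settings are distinguishable
```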
In addition to Elad Hazan's, I am aware of similar courses from Roger Grosse (Toronto) and Max Lamparth et al. (Stanford).
A course I taught on AI safety and alignment at Princeton two years ago:
https://sites.google.com/view/cos598aisafety/
The list of materials is on the webpage; happy to share lecture slides as well.
I believe Peter Henderson has a newer version. Roger Grosse taught a similar course at U Toronto.
As a category suggestion, I'd include "obstructions/barriers" to making safe/aligned AGI, in the sense of the natural proofs barrier for P vs. NP. They aren't going to be nearly as crisp in AGI safety/alignment as in math, but I still think it'd be valuable to present fundamental problems that seem like they must be solved as key parts of the field--or, in other words, sets of approaches (such as ignoring those problems) that (most likely) won't work.
Some examples are in "AGI Ruin: A List of Lethalities", though not directly stated as barriers, e.g. "you only get one shot at making a safe fully-superhuman AI" could be stated as "solutions that require observing the behavior of a fully-superhuman AI don't work [given such and such background conclusions]".
(I also give examples in "The fraught voyage of aligned novelty", though maybe not presented in a way you'd like. E.g. "if the AI is genuinely smarter than you, it's thinking in ways that are alien to you" --> "solutions that require the AI to only think in ways that you already understand don't work".)
If legal policy is in your wheelhouse, here's a selection of the growing literature (apologies, some of it is my own)
It seems good to spend a bunch of time on takeoff speeds given how important they are to how AI goes. There are many sources discussing takeoff speeds. Some places to look: Tom Davidson's outputs, Forethought, AI-2027 (including supplements), the AI Futures Project, Epoch, Daniel K.'s outputs, and some posts by me.
Two ideas for projects/exercises, which I think could be very instructive and build solid instincts about AI safety:
As someone who has applied to take this class, I'll suggest 10 papers: 4 from my own niche research interests and 6 from very recent eval-focused work that I think is interesting and that I'd like an excuse to read/discuss.
Niche Interests
1) In terms of what we can learn from other fields, AI-safety-conscious cognitive scientists have recently been thinking about how to move past revealed preferences in AI alignment. They've come up with resource-rational contractualism, which on the surface seems like an interesting framework with a Bayesian bent, so it looks like it could also scratch the math itch. These two papers, (Zhi-Xuan et al. 2024) and (Levine et al. 2025), seem to be the main ones so far, and are very recent.
2) I find Goodfire AI's approach to mech interp, which essentially tries to use model params instead of activations to find mechanisms, really interesting, and I think it is both new enough and mathematically appropriate enough that I can see student projects iterating on it for the class: (Braun et al. 2025) and (Bushnaq et al. 2025) are the main papers here.
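As a heavily simplified illustration of "parameters instead of activations", here is a toy sketch assuming PyTorch. It scores each parameter tensor by a theta-times-gradient attribution on a single input; this is not the decomposition method from the Braun or Bushnaq papers, just the most basic version of attributing a behavior to locations in parameter space.

```python
# Toy parameter-space attribution (NOT the method from the papers above):
# score each parameter tensor by sum |theta * dL/dtheta| on one input, i.e. an
# elementwise first-order estimate of how much zeroing each weight would move the loss.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(1, 8)
target = torch.tensor([1])

loss = nn.functional.cross_entropy(model(x), target)
loss.backward()

# Rank parameter tensors by total attribution; large scores mark candidate
# "mechanisms" located in the weights rather than in the activations.
scores = {
    name: (p.detach() * p.grad).abs().sum().item()
    for name, p in model.named_parameters()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} attribution = {score:.4f}")
```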
Recent Eval Work
The METR doubling-time paper, Ai2's SciArena, LLMs Often Know When They're Being Evaluated, Anthropic's SHADE-Arena, UK AISI's STACK adversarial attack, and Cohere's takedown of LMArena
I would appreciate a lecture on how causality is used in mechanistic interpretability, particularly for understanding the faithfulness of Chain-of-Thought reasoning. This intersection is crucial for AI safety as it addresses whether LLMs actually follow the reasoning steps they generate or if their explanations are post-hoc rationalizations. The field has developed sophisticated causal intervention methods to probe the relationship between intermediate reasoning steps and final outputs.
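A minimal sketch of one such intervention, assuming a hypothetical `generate(prompt)` completion call (the corruption scheme and helper names are illustrative, not from any specific paper): replace one intermediate step of a chain of thought and check whether the final answer moves. If answers rarely change under corruption, the stated reasoning is probably not load-bearing.

```python
# Causal-intervention sketch for CoT faithfulness: corrupt one reasoning step
# and measure how often the final answer changes. `generate` is a hypothetical
# stand-in for any LLM completion call.
import random

def generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real client/API."""
    raise NotImplementedError

def final_answer(question: str, cot_steps: list[str]) -> str:
    prompt = question + "\n" + "\n".join(cot_steps) + "\nTherefore, the answer is:"
    return generate(prompt).strip()

def faithfulness_probe(question: str, cot_steps: list[str], n_trials: int = 10) -> float:
    """Fraction of corrupted runs whose final answer differs from the original.
    Low values suggest the chain of thought was not causally used."""
    original = final_answer(question, cot_steps)
    changed = 0
    for _ in range(n_trials):
        i = random.randrange(len(cot_steps))
        corrupted = list(cot_steps)
        corrupted[i] = "This step has been replaced with an irrelevant statement."
        changed += final_answer(question, corrupted) != original
    return changed / n_trials
```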
Key Papers:
I suggest something on Value Alignment itself. The actual problem of trying to make a model have the values you want, be certain of it, be certain it will scale, and other parts of the Hard Part of Alignment.
I am getting some great links as responses to my post on X https://x.com/boazbaraktcs/status/1940780441092739351
In the fall I am planning to teach an AI safety graduate course at Harvard. The format is likely to be roughly similar to my "foundations of deep learning" course.
I am still not sure of the content, and would be happy to get suggestions.
Some (somewhat conflicting) desiderata:
Whenever I teach a course I always like to learn something from it, so I hope to cover state-of-the-art research results, especially ones that require some work to dig into and that I wouldn't get to without this excuse.
Anyway, since I haven't yet planned this course, I thought I would solicit comments on what should be covered in such a course. Links to other courses, blogs, etc. are also useful. (I do have a quirk that I've never been able to teach from someone else's material, and have often ended up writing a textbook whenever I teach a course, see here, here, and here, so I don't intend to adapt any curricula wholesale.)
Thanks in advance!