[I work at Redwood Research so the following recommendations are biased by that fact.]
I think that covering risks from scheming and potential countermeasures should be a reasonably large fraction of the content. Concrete sub-topics here include:
Misc other notes:
I agree this stuff is important, mostly because I think that scheming is a big part of where catastrophic risk from AI comes from.
Yes there is a general question I want to talk about which is the gap between training, evaluation, and deployment, and the reasons why models might be:
1. Able to tell in which of these environments they are in
2. Act differently based on that
In addition to Elad Hazan's, I am aware of similar courses from Roger Grosse (Toronto) and Max Lamparth et al (Stanford).
A course I tought on AI safety and alignment at Princeton from 2 years ago:
https://sites.google.com/view/cos598aisafety/
list of materials is on the webpage, happy to share lecture slides as well.
I believe Peter Henderson has a newer version. Roger Grosse taught a similar course at U Toronto.
As a category suggestion, I'd suggest including "obstructions/barriers" to making safe/aligned AGI, in the sense of the natural proofs barrier for P vs. NP. They aren't going to be nearly as crisp in AGI safety/alignment as in math, but I still think it'd be valuable to present fundamental problems, that seem like they must be solved, as key parts of the field--or in other words, sets of approaches (such as ignoring those problems) that (most likely) won't work.
Some examples are in "AGI Ruin: A List of Lethalities", though not directly stated as barriers, e.g. "you only get one shot at making a safe fully-superhuman AI" could be stated as "solutions that require observing the behavior of a fully-superhuman AI don't work [given such and such background conclusions]".
(I also give examples in "The fraught voyage of aligned novelty", though maybe not presented in a way you'd like. E.g. "if the AI is genuinely smarter than you, it's thinking in ways that are alien to you" --> "solutions that require the AI to only think in ways that you already understand don't work".)
I would have taken this class had I not graduated this spring!
A few suggestions:
I would like to cover the various ways AI could go wrong: malfunction, misuse, societal upheaval, arms race, surveillance, bias, misalignment, loss of control,...
Talk about predictions for the future, methodologies for how to come up with them.
Some technical components should include: evaluations
All of the above said, I get antsy if I don't get my dosage of math and code- I intend 80% of the course to be technical and cover research papers and results. It should also involve some hands on projects.
Another thing I consider really important: many of the students will be like "Holy shit, AGI is happening! This affects my life plans!" and will want advice. I think it's good to have something to point them to, like:
Good luck running the course!
If legal policy is in your wheelhouse, here's a selection of the growing literature (apologies, some of it is my own)
Course is now online with all lectures posted on youtube (see course webpage for links) and blog post lecture notes posted here https://www.lesswrong.com/w/cs-2881r
I'm surprised no one has mentioned this yet (?) but I woudl recommend at least an overview of classical agent foundations, say the Vingean reflection section of: https://intelligence.org/research-guide/
Only a small core of people is still working on e.g. fully solving decision theory for alignment, which even I admit seems like a bit of a detour. But the theoretical discussion around Vingean uncertainty still seems illustrative of important barriers to both theoretical and empirical agendas.
Thank you - although these type of questions (how can a weak agent verify the actions of a strong agent) are closely related to my background in computational complexity (e.g., interactive proofs, probabilistically checkable proofs, delegation of computation), I plan to keep the course very empirical
Yes, I’m actually rereading your excellent computational complexity textbook currently!
I have been thinking about to what extent the agent foundations communities’ goal of understanding computational uncertainty (including Vingean reflection) is “hard” or even “complete” for much of computational complexity theory (perhaps also in the literal sense of containing complete problems for well-studied complexity classes), and therefore perhaps far too ambitious to expect a “solution” before AGI. I wonder if you have thoughts on this.
One direction I’ve been exploring recently is a computationally unbounded theory of embedded agency - which tries to avoid needing to talk about computational complexity, but this may not capture the important problems of self-trust needed for alignment.
Anyway, there’s no rigorous empirical science without some kind of theory - I know you didn’t try to teach cryptography in a purely empirical way, and both subjects share an adversarial nature :)
It seems good to spend a bunch of time on takeoff speeds given how important they are for how AI goes. There are many sources discussing takeoff speeds. Some places to look: Tom Davidson's outputs, forethought, AI-2027 (including supplements), AI futures project, epoch, Daniel K.'s outputs, some posts by me.
Two ideas for projects/exercises, which I think could be very instructive and build solid instincts about AI safety:
As someone who has applied to take this class, I'll suggest 10 papers, 4 from my own niche research interests and 6 for more very recent eval-focused work which I think is interesting and I'd like an excuse to read/discuss.
Niche Interests
1) In terms of what we can learn from other fields, AI-safety-conscious cognitive scientists have recently been thinking about how to move past revealed preferences in AI Alignment. They've come up with resource-rational contractualism, which on the surface seems like an interesting framework with a Bayesian bent, so it looks like it could also scratch the math itch. These two papers: (Zhi-Xuan et al. 2024) and (Levine et al. 2025) seem to be the main ones so far, and are very recent.
2) I find Goodfire AI's approach to mech interp, which essentially tries to use model params instead of activations to find mechanisms, really interesting, and I think it is both new enough and mathematically-appropriate enough that I can see student projects iterating on it for the class: (Braun et al. 2025) and (Bushnaq et al. 2025) are the main papers here.
Recent Eval Work
The METR doubling-time paper, Ai2's SciArena, LLMs Often Know When They're Being Evaluated, Anthropic's Shade-ARENA, and UK AISI's STACK Adversarial Attack, and Cohere's takedown of LLMArena
As Elad mentioned earlier, Peter Henderson's courses at Princeton can be found here: https://www.polarislab.org/#/teaching
There's the Spring 2025 version of COS 598A: AI Safety & Alignment, which is a grad course on technical AI safety topics, and the Fall 2024 version of SPI 352/COS 352: Artificial Intelligence, Law, & Public Policy, which is an undergrad course that's structured like a law school seminar.
I highly recommend both courses! I think the syllabi are designed pretty well, and the "paper debate" component of COS 598A was quite good for fostering deeper engagement.
I would appreciate a lecture on how causality is used in mechanistic interpretability, particularly for understanding the faithfulness of Chain-of-Thought reasoning. This intersection is crucial for AI safety as it addresses whether LLMs actually follow the reasoning steps they generate or if their explanations are post-hoc rationalizations. The field has developed sophisticated causal intervention methods to probe the relationship between intermediate reasoning steps and final outputs.
Key Papers:
I suggest something on Value Alignment itself. The actual problem of trying make a model have the values you want, be certain of it, be certain it will scale and other parts of the Hard Part of Alignment.
I agree on the object level that the hard barriers to alignment should be emphasized.
However, (based on this and other conversations with you) I think we have slightly different pictures of what this looks like in practice.
As an allegory, many computational complexity researchers would like to solve P vs NP. Like you do here, they would emphasize working on directions that have a chance of solving the hard part of the problem. However, they would not necessarily advocate that people just need to “work on the hard part of P vs NP.” In fact, for better or worse I think it’s actually considered slightly cringe to say you are trying to solve P vs NP. The reason is that the direct approaches are all known to fail, in the strong sense that there are proofs that many entire classes of proof techniques can’t work. So actually trying to solve P vs NP often looks like working on something very indirectly related to P vs NP, which you know must be resolved before (or along with) P vs NP for convoluted reasons, but would not itself directly resolve the question. It’s just something that might help you get traction. An example is algebraic complexity (though some people might work on it for its own sake as well). Another example is/was circuit lower bounds (though according to none other than Professor Barak this lately seems blocked).
I view the situation as similar in alignment. Many people have thought about how to just solve value alignment, including me, and realized they were not prepared to write down a solution. Then we switched to working on things which hopefully DO address the hard part, but don’t necessarily have a complete path to resolving the problem - rather we hope they can tractable make nonzero progress on the hard part.
This seems generally correct to me. I think that alignment is going to be easier to solve than P vs NP but that might be because I'm just ignorant about how hard doing research on the hard part actually is.
It does seem to me that there still aren't good resources even clearly saying what the hard part of alignment is, let alone clearly explaining how to do research on it.
So I think there's a lot of things that can be done to make it easier and test just how hard it really is.
I am getting some great links as responses to my post on X https://x.com/boazbaraktcs/status/1940780441092739351
I suggest something on Value Alignment itself. The actual problem of trying make a model have the values you want, be certain of it, be certain it will scale and other parts of the Hard Part of Alignment.
In the fall I am planning to teach an AI safety graduate course at Harvard. The format is likely to be roughly similar to my "foundations of deep learning" course.
I am still not sure of the content, and would be happy to get suggestions.
Some (somewhat conflicting) desiderata:
Whenever I teach a course I always like to learn something from it, so I hope to cover state of art research results, especially ones that require some work to dig into and I wouldn't get to do so without this excuse.
Anyway, since I haven't yet planned this course, I thought I would solicit comments on what should be covered in such a course. Links to other courses blogs etc. are also useful. (I do have a quirk that I've never been able to teach from someone else's material, and often ended up writing a textbook, see here, here, and here whenever I teach a course.., so I don't intent to adapt any curricula wholesale)
Thanks in advance!