LESSWRONG
LW

AI
Personal Blog

51

Call for suggestions - AI safety course

by boazbarak
3rd Jul 2025
1 min read
15

51

AI
Personal Blog

51

Call for suggestions - AI safety course
13Julian Stastny
2Buck
3boazbarak
11scasper
11Elad Hazan
6TsviBT
4Eleni Angelou
4yonathan_arbel
2ryan_greenblatt
2Sudhanshu Kasewa
2Valerio Pepe
1manqingliu@g.harvard.edu
1Kabir Kumar
1boazbarak
-1Kabir Kumar
New Comment
15 comments, sorted by
top scoring
Click to highlight new comments since: Today at 1:47 PM
[-]Julian Stastny5d130

[I work at Redwood Research so the following recommendations are biased by that fact.]

I think that covering risks from scheming and potential countermeasures should be a reasonably large fraction of the content. Concrete sub-topics here include:

  • Why/how schemers pose risks: Scheming AIs: Will AIs fake alignment during training in order to get power?, Risks from Learned Optimization, The case for ensuring that powerful AIs are controlled, Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking, A sketch of an AI control safety case
  • High-stakes control (i.e. preventing concentrated failures): AI Control: Improving Safety Despite Intentional Subversion, AI catastrophes and rogue deployments, How to prevent collusion when using untrusted models to monitor each other, Ctrl-Z: Controlling AI Agents via Resampling,
  • Low-stakes control (i.e. preventing non-concentrated failures such as research sabotage): This is a super important part of AI control, but we don’t have great resources on it yet. You might want to look at Notes on handling non-concentrated failures with AI control: high level methods and different regimes for inspiration, or How can we solve diffuse threats like research sabotage with AI control?

Misc other notes:

  • For hands-on projects, I’d recommend 7+ tractable directions in AI control for inspiration, and talking to Ethan Perez.
  • I also endorse talking about recursive oversight, though personally I’m most excited about recursive oversight techniques that (1) are trying to be robust against models trying to subvert the oversight process, and (2) techniques that assume access to a small amount of ground truth data.
  • I also like various kinds of “basic science” on safety-relevant aspects of AIs; a lot of Owain Evans’ recent work comes to mind.
Reply1
[-]Buck4d20

I agree this stuff is important, mostly because I think that scheming is a big part of where catastrophic risk from AI comes from.

Reply
[-]boazbarak4d30

Yes there is a general question I want to talk about which is the gap between training, evaluation, and deployment, and the reasons why models might be:


1. Able to tell in which of these environments they are in

2. Act differently based on that

Reply
[-]scasper6d110

In addition to Elad Hazan's, I am aware of similar courses from Roger Grosse (Toronto) and Max Lamparth et al (Stanford).

Reply1
[-]Elad Hazan6d112

A course I tought on AI safety and alignment at Princeton from 2 years ago:
https://sites.google.com/view/cos598aisafety/

list of materials is on the webpage, happy to share lecture slides as well. 

I believe Peter Henderson has a newer version. Roger Grosse taught a similar course at U Toronto. 

Reply1
[-]TsviBT6d6-1

As a category suggestion, I'd suggest including "obstructions/barriers" to making safe/aligned AGI, in the sense of the natural proofs barrier for P vs. NP. They aren't going to be nearly as crisp in AGI safety/alignment as in math, but I still think it'd be valuable to present fundamental problems, that seem like they must be solved, as key parts of the field--or in other words, sets of approaches (such as ignoring those problems) that (most likely) won't work.

Some examples are in "AGI Ruin: A List of Lethalities", though not directly stated as barriers, e.g. "you only get one shot at making a safe fully-superhuman AI" could be stated as "solutions that require observing the behavior of a fully-superhuman AI don't work [given such and such background conclusions]".

(I also give examples in "The fraught voyage of aligned novelty", though maybe not presented in a way you'd like. E.g. "if the AI is genuinely smarter than you, it's thinking in ways that are alien to you" --> "solutions that require the AI to only think in ways that you already understand don't work".)

Reply
[-]Eleni Angelou6d41

Perhaps some ideas for non-technical readings from my syllabus (it's an AI safety course I've been teaching for the past couple of years for a philosophy dept).

Reply1
[-]yonathan_arbel6d40

If legal policy is in your wheelhouse, here's a selection of the growing literature (apologies, some of it is my own)

  • Noam Kolt – "Algorithmic Black Swans"
    Addressing catastrophic tail events via anticipatory regulation
    Published: https://journals.library.wustl.edu/lawreview/article/id/8906/
  • Peter Salib & Simon Goldstein – "AI Rights for Human Safety"
    On AI rights for safety
    SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4536494
  • Yonathan Arbel, Matthew Tokson & Albert Lin – "Systemic Regulation of Artificial Intelligence"
    Moving from application-level regulation to system-level regulation
    SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4543681
    Published: https://arizonastatelawjournal.org/article/systemic-regulation-of-artificial-intelligence/
  • Gabriel Weil – “Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence” Proposes reforming tort law to deter catastrophic AI risks before they materialize
    SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4694006
  • Yonathan Arbel et al. – "Open Questions in Law and AI Safety: An Emerging Research Agenda"
    Sets out a research agenda for a new field of “AI Safety Law,” focusing on existential and systemic AI risks
    Published: https://www.lawfaremedia.org/article/open-questions-in-law-and-ai-safety-an-emerging-research-agenda
  • Peter Salib – "AI Outputs Are Not Protected Speech"
    Argues that AI-generated outputs lack First Amendment protection, enabling stronger safety regulation
    SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4481512
  • Mirit Eyal & Yonathan Arbel – “Tax Levers for a Safer AI Future”
    Proposes using tax credits and penalties to align AI development incentives with public safety
    SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4528105
  • Cullen O’Keefe, Rohan Ramakrishnan, Annette Zimmermann, Daniel Tay & David C. Winter – “Law-Following AI: Designing AI Agents to Obey Human Laws”
    Suggests AI agents should be trained and constrained to follow human law, like corporate actors
    SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4726207
Reply1
[-]ryan_greenblatt4d20

It seems good to spend a bunch of time on takeoff speeds given how important they are for how AI goes. There are many sources discussing takeoff speeds. Some places to look: Tom Davidson's outputs, forethought, AI-2027 (including supplements), AI futures project, epoch, Daniel K.'s outputs, some posts by me.

Reply1
[-]Sudhanshu Kasewa5d20

Two ideas for projects/exercises, which I think could be very instructive and build solid instincts about AI safety:

  1. Builder-breaker arguments, a la ELK
  2. Writing up a safety case (and doing the work to generate the underlying evidence for it)
Reply1
[-]Valerio Pepe6d20

As someone who has applied to take this class, I'll suggest 10 papers, 4 from my own niche research interests and 6 for more very recent eval-focused work which I think is interesting and I'd like an excuse to read/discuss.

Niche Interests

1) In terms of what we can learn from other fields, AI-safety-conscious cognitive scientists have recently been thinking about how to move past revealed preferences in AI Alignment. They've come up with resource-rational contractualism, which on the surface seems like an interesting framework with a Bayesian bent, so it looks like it could also scratch the math itch. These two papers: (Zhi-Xuan et al. 2024) and (Levine et al. 2025) seem to be the main ones so far, and are very recent.

2) I find Goodfire AI's approach to mech interp, which essentially tries to use model params instead of activations to find mechanisms, really interesting, and I think it is both new enough and mathematically-appropriate enough that I can see student projects iterating on it for the class: (Braun et al. 2025) and (Bushnaq et al. 2025) are the main papers here.

Recent Eval Work

The METR doubling-time paper, Ai2's SciArena, LLMs Often Know When They're Being Evaluated, Anthropic's Shade-ARENA, and UK AISI's STACK Adversarial Attack, and Cohere's takedown of LLMArena

Reply1
[-]manqingliu@g.harvard.edu2d10

I would appreciate a lecture on how causality is used in mechanistic interpretability, particularly for understanding the faithfulness of Chain-of-Thought reasoning. This intersection is crucial for AI safety as it addresses whether LLMs actually follow the reasoning steps they generate or if their explanations are post-hoc rationalizations. The field has developed sophisticated causal intervention methods to probe the relationship between intermediate reasoning steps and final outputs.

Key Papers:

  • Fazl, B., et al. (2025). "Chain-of-Thought Is Not Explainability."
  • Paul, B., et al. (2025) "Thought Anchors: Which LLM Reasoning Steps Matter?" arXiv preprint.
  • Jin, Z., et al. (2024). LLMs with Chain-of-Thought Are Non-Causal Reasoners. arXiv:2402.16048.
  • Singh, J., et al. (2024). How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. arXiv:2402.18312.
  • Lanham, T., et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. Anthropic Technical Report.
Reply
[-]Kabir Kumar6d10

I suggest something on Value Alignment itself. The actual problem of trying make a model have the values you want, be certain of it, be certain it will scale and other parts of the Hard Part of Alignment.

Reply
[-]boazbarak6d10

I am getting some great links as responses to my post on X https://x.com/boazbaraktcs/status/1940780441092739351 

Reply
[-]Kabir Kumar6d-10

I suggest something on Value Alignment itself. The actual problem of trying make a model have the values you want, be certain of it, be certain it will scale and other parts of the Hard Part of Alignment.

Reply
Moderation Log
Curated and popular this week
15Comments

In the fall I am planning to teach an AI safety graduate course at Harvard. The format is likely to be roughly similar to my "foundations of deep learning" course.

I am still not sure of the content, and would be happy to get suggestions.

Some (somewhat conflicting) desiderata:

  1. I would like to cover the various ways AI could go wrong: malfunction, misuse, societal upheaval, arms race, surveillance, bias, misalignment, loss of control,... (and anything else I'm not thinking of right now). I talke about some of these issues here and here.
  2. I would like to see what we can learn from other fields, including software security, aviation and automative safety, drug safety, nuclear arms control, etc.. (and happy to get other suggestions)
  3. Talk about policy as well, various frameworks inside companies, regulations etc..
  4. Talk about predictions for the future, methodologies for how to come up with them.
  5. All of the above said, I get antsy if I don't get my dosage of math and code- I intend 80% of the course to be technical and cover research papers and results. It should also involve some hands on projects.
  6. Some technical components should include: evaluations, technical mitigations, attacks, white and black box interpretability methods, model organisms.
  7. Whenever I teach a course I always like to learn something from it, so I hope to cover state of art research results, especially ones that require some work to dig into and I wouldn't get to do so without this excuse.

    Anyway, since I haven't yet planned this course, I thought I would solicit comments on what should be covered in such a course. Links to other courses blogs etc. are also useful. (I do have a quirk that I've never been able to teach from someone else's material, and often ended up writing a textbook, see here, here, and here whenever I teach a course.., so I don't intent to adapt any curricula wholesale)

    Thanks in advance!