AI Safety Thursdays: Tracing The Thoughts of a Large Language Model

by Juliana Eberschlag

Thursday 5th June at 10:00 pm to Friday 6th June at 1:00 am GMT
30 Adelaide Street East, Toronto, ON, Canada

Posted on: 16th May 2025

Description

How do large language models actually work on the inside? Annie presents on new research from Anthropic's Transformer Circuits team that opens up the "black box" of Claude 3.5 Haiku, revealing the computational mechanisms behind everything from multi-step reasoning to poetry planning.

Using a new methodology called attribution graphs, we'll explore how models handle multiple languages, why they sometimes fall for jailbreaks, and when their stated reasoning is unfaithful.

Event Schedule

6:00 to 6:45 pm - Networking and refreshments

6:45 to 8:00 pm - Main Presentation

8:00 to 9:00 pm - Breakout Discussions

Trajectory Labs (Toronto AI Safety)