AI Safety Thursdays: Tracing The Thoughts of a Large Language Model

Juliana Eberschlag

Description

How do large language models actually work on the inside? Annie presents on new research from Anthropic's Transformer Circuits team that opens up the "black box" of Claude 3.5 Haiku, revealing the computational mechanisms behind everything from multi-step reasoning to poetry planning.

Using a new methodology called attribution graphs, we'll explore how models handle multiple languages, how they respond to jailbreaks, and why they sometimes engage in unfaithful reasoning.

Event Schedule

6:00 to 6:45 - Networking and refreshments

6:45 to 8:00 - Main Presentation

8:00 to 9:00 - Breakout Discussions

Posted on: 16th May 2025
