Uzay Macar — LessWrong

Mechanisms of Introspective Awareness

Uzay Macar and Li Yang are co-first authors. This work was advised by Jack Lindsey and Emmanuel Ameisen, with contributions from Atticus Wang and Peter Wallich, as part of the Anthropic Fellows Program.
Paper: https://arxiv.org/abs/2603.21396
Code: https://github.com/safety-research/introspection-mechanisms
TL;DR
* We investigate the mechanisms underlying "introspective awareness" (as shown in Lindsey...

Apr 1476

Unfaithful chain-of-thought as nudged reasoning

by Paul B, Uzay Macar, Arthur Conmy, and Neel Nanda

This piece is based on work conducted during MATS 8.0 and is part of a broader aim of interpreting chain-of-thought in reasoning models.
tl;dr
* Research on chain-of-thought (CoT) unfaithfulness shows how models’ CoTs may omit information that is relevant to their final decision.
* Here, we sketch hypotheses for...

Jul 22, 2025

Thought Anchors: Which LLM Reasoning Steps Matter?

This post is adapted from our recent arXiv paper. Paul Bogdan and Uzay Macar are co-first authors on this work.
TL;DR
* Interpretability of chains-of-thought (CoTs) produced by LLMs is challenging:
  * Standard mechanistic interpretability studies a single token's generation but CoTs are sequences of reasoning steps that use thousands...

Jul 2, 2025