tl;dr We’re incubating an academic journal for AI alignment: rapid peer review of foundational alignment research that the current publication ecosystem underserves. Key bets: paid attributed review, reviewer-written synthesis abstracts, and targeted automation. Contact us if you’re interested in participating as an author, reviewer, or editor, or if you know someone...
> TLDR: We made substantial progress in 2024:
>
> * We published a series of papers that verify key predictions of Singular Learning Theory (SLT) [1, 2, 3, 4, 5, 6].
> * We scaled key SLT-derived techniques to models with billions of parameters, eliminating our main concerns around...
Introduces the idea of cognitive work as a parallel to physical work, and explains why concentrated sources of cognitive work may pose a risk to human safety. Acknowledgements. Thanks to Echo Zhou and John Wentworth for feedback and suggestions. Some of these ideas were presented originally in a talk in...
A short story, best enjoyed in-context... try exploring possible meanings of some of the unusual terms with your favourite commercially available Solomonoff inductor. Acknowledgements. Thanks to Simon Pepin Lehalleur and Ziling Ye for feedback and suggestions. This post is the "fun" half of a pair, for the serious version see...
Our large learning machines find patterns in the world and use them to predict. When these machines exceed us and become superhuman, one of those patterns will be relative human incompetence. How comfortable are we with the incorporation of this pattern into their predictions, when those predictions become the actions...
We're excited to announce the inaugural Australian AI Safety Forum, taking place on November 7-8, 2024, in Sydney, Australia. This event aims to foster the growth of the AI safety community within Australia. Apply now!

Key Details

* Dates: November 7-8, 2024
* Location: Sydney Knowledge Hub, The University of...
TLDR: We’re hiring two research assistants to work on advancing developmental interpretability and other applications of singular learning theory to alignment.

About Us

Timaeus’s mission is to empower humanity by making breakthrough scientific progress on alignment. Our research focuses on applications of singular learning theory to foundational problems within alignment,...