Edward James Young — LessWrong

Toy Models of Initialisation Effects on RL Dynamics

This is a follow-up to two posts Geodesic released last week on our current research direction. The code for generating the figures can be found at this GitHub repository. In our previous post, we outlined Geodesic's focus on what we term the pre-RL alignment checkpoint of models -- the alignment-relevant...

Jul 14 · 78

Why study proto-training gaming as an adversarial alignment failure mode?

by Puria, Edward James Young, and Cam

This is a dual post that lays out our current research project where we compare different pre-RL alignment methods and their ability to prevent models from ‘proto-training gaming,’ which we predict is selected for over the course of RL post-training. In the previous post, we enumerated possible pre-RL alignment interventions...

Jul 8 · 55

Why study alignment interventions on pre-RL checkpoints?

This is a dual post that lays out our current research project where we compare pre-RL-training methods on their ability to prevent models from ‘proto-training gaming,’ which we predict is selected for over the course of production RL post-training. In this post, we outline what we mean by pre-RL ‘alignment...

Jul 8 · 64

Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

by Jason R Brown and Edward James Young

Summary: Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making such claims is to take a cognitive perspective, in which we interpret the AI's behaviour in terms of...

May 29 · 68

Announcing Geodesic Research

by Puria, Cam, Alexandra Narin, Edward James Young, and Kyle O’Brien

We're a Cambridge, UK-based AI safety organisation that’s asking: how can we build the most robust alignment initialisations for capable LLMs? We’re one of the few non-profit organisations positioned to answer this question empirically. We have the engineering experience, and now the compute, to conduct data-intensive interventions across the...

May 27 · 81

Edward James Young's Shortform

Apr 192

Architectures for Increased Externalisation of Reasoning

by Karthik Viswanathan, Liza Pavlova, Mariia Koroliuk, Puria, Cam, and Edward James Young

TL;DR: We propose a new post-training method for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We expect this technique to improve monitorability by decreasing the amount of computation available within hidden layers for easy-to-predict tokens. We’re looking for collaborators to help continue this...

Nov 26, 2025 · 38