Announcing Geodesic Research
We're a Cambridge, UK-based AI safety organisation that’s asking: how can we build the most robust alignment initialisations for capable LLMs? We’re one of the few non-profit organisations positioned to answer this question empirically. We have the engineering experience, and now the compute, to conduct data-intensive interventions across the model training pipeline.

This post lays out our research agenda and theory of change, and what we are looking for in technical hires. Applications are open here.

Research agenda TLDR:

Long-horizon capabilities RL may be the most critical source of misalignment. Misalignment instilled during capabilities RL may be difficult to remove afterwards. Geodesic Research’s mission is to develop the science of providing robustly-aligned initialisations for RL, where alignment priors persist through the remainder of training.

Our seminal work on alignment pretraining showed that you can bake alignment priors into base models. Frontier labs are now using these techniques in production: for example, Anthropic's recent work leans heavily on improving alignment priors. But it’s clear that, in the face of production post-training, alignment pretraining is not a one-size-fits-all solution. So now, we are framing pre- and midtraining interventions within the rest of the model training stack.

The evidence points towards extended reinforcement learning being a likely cause of alignment failures at the frontier. RL is liable to select for undesired cognitive and behavioural habits, such as metagaming, sycophancy, apparent-success seeking, or taking unsanctioned actions to complete tasks. Models that learn these behaviours may also become broadly misaligned. In fact, these degradations have already been noted in replications of alignment pretraining, and Evan Hubinger lists this as one of the core reasons alignment remains a hard and unsolved problem. Apollo Research's recent update makes a similar diagnosis; they are now studying whether misa