Bogdan Ionut Cirstea

Automated safety research.


You can think of a pipeline like

- feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
- ask it to generate 100 follow-up research ideas,
- ask it to develop specific experiments to run for each of those ideas,
- feed those experiments to GPT-4 copies equipped with a coding environment,
- write up the results in a nice little article and send it to a human.

Yup; and not only this, but many parts of the workflow have already been tested out (e.g. ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models; Generation and human-expert evaluation of interesting research ideas using knowledge graphs and large language models; LitLLM: A Toolkit for Scientific Literature Review; Acceleron: A Tool to Accelerate Research Ideation; DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning; Discovering Preference Optimization Algorithms with and for Large Language Models). It seems quite feasible to get enough reliability/consistency gains to string these together and get ~the whole (post-training) prosaic alignment research workflow loop going, especially with e.g. improvements in reliability from GPT-5/6 and more 'schlep' / 'unhobbling'.
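To make the above concrete, here's a minimal sketch of what stringing these stages together might look like. `complete` and `run_in_sandbox` are hypothetical wrappers, not the APIs of any of the systems cited above:

```python
# Minimal sketch of the idea -> experiment -> report loop described above.
# `complete` and `run_in_sandbox` are hypothetical placeholders, not any
# cited system's actual API.

def complete(prompt: str) -> str:
    raise NotImplementedError  # e.g. wrap your provider's chat-completions API

def run_in_sandbox(experiment_spec: str) -> str:
    raise NotImplementedError  # e.g. execute generated code in an isolated env

def research_loop(papers: list[str], n_ideas: int = 100) -> str:
    corpus = "\n\n".join(papers)
    ideas = complete(
        f"Here are papers on situational awareness / out-of-context reasoning:\n"
        f"{corpus}\n\nGenerate {n_ideas} follow-up research ideas, one per line."
    ).splitlines()
    results = []
    for idea in ideas:
        spec = complete(f"Design a concrete, runnable experiment for: {idea}")
        results.append((idea, run_in_sandbox(spec)))
    # Write up the results and hand off to a human reviewer.
    return complete(f"Write a short article summarizing these results:\n{results}")
```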


You might enjoy Concept Algebra for (Score-Based) Text-Controlled Generative Models (and probably other papers / videos from Victor Veitch's group), which tries to come up with something like a theoretical explanation for the linear representation hypothesis, including some of the discussion in the reviews / rebuttals for the above paper, e.g.:

'**Causal Separability** The intuitive idea here is that the separability of factors of variation boils down to whether there are “non-ignorable” interactions in the structural equation model that generates the output from the latent factors of variation—hence the name. The formal definition 3.2 relaxes this causal requirement to distributional assumptions. We have added its causal interpretation in the camera ready version.

**Application to Other Generative Models** Ultimately, the results in the paper are about non-parametric representations (indeed, the results are about the structure of probability distributions directly!) The importance of diffusion models is that they non-parametrically model the conditional distribution, so that the score representation directly inherits the properties of the distribution.

To apply the results to other generative models, we must articulate the connection between the natural representations of these models (e.g., the residual stream in transformers) and the (estimated) conditional distributions. For autoregressive models like Parti, it’s not immediately clear how to do this. This is an exciting and important direction for future work!

(Very speculatively: models with finite dimensional representations are often trained with objective functions corresponding to log likelihoods of exponential family probability models, such that the natural finite dimensional representation corresponds to the natural parameter of the exponential family model. In exponential family models, the Stein score is exactly the inner product of the natural parameter with y. This weakly suggests that additive subspace structure may originate in these models following the same Stein score representation arguments!)

**Connection to Interpretability** This is a great question! Indeed, a major motivation for starting this line of work is to try to understand if the "linear subspace hypothesis" in mechanistic interpretability of transformers is true, and why it arises if so. As just discussed, the missing step for precisely connecting our results to this line of work is articulating how the finite dimensional transformer representation (the residual stream) relates to the log probability of the conditional distributions. Solving this missing step would presumably allow the tool set developed here to be brought to bear on the interpretation of transformers.

One exciting observation here is that linear subspace structure appears to be a generic feature of probability distributions! Much mechanistic interpretability work motivates the linear subspace hypothesis by appealing to special structure of the transformer architecture (e.g., this is Anthropic's usual explanation). In contrast, our results suggest that linear encoding may fundamentally be about the structure of the data generating process.

**Limitations** One important thing to note: the causal separability assumption is required for the concepts to be separable in the conditional distribution itself. This is a fundamental restriction on what concepts can be learned by any method that (approximately) learns a conditional distribution. I.e., it’s a limitation of the data generating process, not special to concept algebra or even diffusion models.

Now, it is true that to find the concept subspace using prompts we have to be able to find prompts that elicit causally separable concepts. However, this is not so onerous—because sex and species are not separable, we can't elicit the sex concept with "buck" and "doe". But the prompts "a woman" and "a man" work well.'
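To unpack the speculative exponential-family remark in the rebuttal above, here's my own gloss (the notation and derivation are mine, not the authors'):

```latex
% My gloss of the exponential-family remark above; notation assumed, not the authors'.
% For a conditional exponential family with natural parameter \eta(x) and
% sufficient statistic y:
p(y \mid x) = h(y)\, \exp\!\big( \langle \eta(x),\, y \rangle - A(\eta(x)) \big)
% the Stein score is then affine in the natural parameter:
\nabla_y \log p(y \mid x) = \nabla_y \log h(y) + \eta(x)
% so a concept that shifts \eta(x) within a fixed subspace shows up as additive
% subspace structure in the score, matching the structure the paper derives
% for diffusion-model score representations.
```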


The finding on the differential importance of verifiability also seems in line with the findings from Trading Off Compute in Training and Inference.

56% on SWE-bench Lite with repeated sampling (13 percentage points above the previous SOTA; up from 15.9% with one sample to 56% with 250 samples), with a very-below-SOTA model: https://arxiv.org/abs/2407.21787. Anything automatically verifiable (large chunks of math and coding) seems like it's gonna be automatable in < 5 years.
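The mechanism behind these numbers is simple enough to sketch. `generate` and `verify` below are hypothetical wrappers (e.g. sample a candidate patch, then run the task's test suite); the pass@k estimator is the standard unbiased combinatorial one:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given n samples of which c are verified correct (assumes k <= n)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def solve_by_repeated_sampling(problem, generate, verify, budget: int = 250):
    # generate/verify are hypothetical wrappers: sample a candidate solution,
    # then check it automatically (e.g. run the benchmark's test suite).
    for _ in range(budget):
        candidate = generate(problem)
        if verify(problem, candidate):
            return candidate  # first verified sample wins
    return None
```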


With research automation in mind, here's my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs.

Fwiw, I've kind of already noticed myself starting to do some of this, for AI safety-related papers, especially after Claude-3.5 Sonnet came out.

Jack Clark: '**Registering a prediction**: I predict that within two years (by July 2026) we'll see an AI system beat all humans at the IMO, obtaining the top score. Alongside this, I would wager we'll see the same thing - an AI system beating all humans in a known-hard competition - in *another* scientific domain outside of mathematics. If both of those things occur, I believe that will present strong evidence that AI may successfully automate large chunks of scientific research before the end of the decade.' https://importai.substack.com/p/import-ai-380-distributed-13bn-parameter


It seems that, in some fundamental sense, misalignment resides in self-other distinction: for a model to be misaligned it has to model itself as having different values, goals, preferences, and beliefs from humans, in ways that are unnecessary to perform the tasks that humans want the AI to perform.

I think this would be better framed as: self-other distinction *is a prerequisite* (capability) for misalignment (but very likely also for desired capabilities). I think 'in ways that are unnecessary to perform the tasks that humans want the AI to perform' is stated overconfidently, and will likely be settled empirically. For now, I think the best (not-super-strong) case for this being plausible is the existence proof of empathetic humans, where self-other overlap does seem like a relevant computational mechanism for empathy.
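To make 'self-other overlap as a computational mechanism' concrete, here's a minimal sketch of one way it could be operationalized as a fine-tuning objective. This is my illustration, not the original post's implementation; it assumes a HuggingFace-style model interface and matched-length prompt pairs:

```python
import torch

def self_other_overlap_loss(model, self_ids, other_ids):
    """Illustrative loss (an assumption, not the original post's method):
    pull the model's internal representations on a 'self'-referencing prompt
    toward those on a matched 'other'-referencing prompt. Assumes a
    HuggingFace-style forward() with output_hidden_states, and that the two
    prompts are tokenized to the same length."""
    h_self = model(self_ids, output_hidden_states=True).hidden_states[-1]
    h_other = model(other_ids, output_hidden_states=True).hidden_states[-1]
    return torch.mean((h_self - h_other) ** 2)
```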


I think this argument is made even stronger by similar considerations for *input* tokens too - given the even lower price of input tokens (compared to output tokens), and the scaling laws for long context windows and for RAG.
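Some rough arithmetic, with assumed (not actual) per-token prices, to illustrate why cheap input tokens matter here:

```python
# Back-of-envelope cost of a long-context read vs. generation. The prices
# below are assumptions for illustration, not any provider's real rate card.
INPUT_PRICE_PER_M = 3.0    # $ per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.0  # $ per 1M output tokens (assumed)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1e6

# At a 5x input/output price gap, reading ~1M tokens of papers into context
# costs about the same as generating only ~200k tokens:
print(query_cost(1_000_000, 0))  # 3.0
print(query_cost(0, 200_000))    # 3.0
```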


The top comment also seems to be conflating whether a model *is capable of* (e.g. sometimes, in some contexts) mesa-optimizing and whether it *is (consistently) mesa-optimizing*. I interpret the quoted original definition as being about the second, which LLMs probably aren't, though they're capable of the first.
This seems like the kind of ontological confusion that the Simulators post discusses at length.

Some critical factors, here and for alignment automation more broadly, are token cheapness and task-horizon shortness: https://docs.google.com/presentation/d/1bFfQc8688Fo6k-9lYs6-QwtJNCPOS8W2UH5gs8S6p0o/edit?usp=drive_link; https://x.com/BogdanIonutCir2/status/1819848009473036537; https://x.com/BogdanIonutCir2/status/1819861008568971325.