Daniel Murfet

Wiki Contributions


Thanks, that makes a lot of sense to me. I have some technical questions about the post with Owen Lynch, but I'll follow up elsewhere.

4. Goals misgeneralize out of distribution.

See: Goal misgeneralization: why correct specifications aren't enough for correct goals, Goal misgeneralization in deep reinforcement learning

OAA Solution: (4.1) Use formal methods with verifiable proof certificates[2]. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property—but such arguments can in fact be made! See VNN-COMP. In practice, it will not be possible to have enough information about the world to "prove" that a catastrophe will not be caused by an unfortunate coincidence, but instead we can obtain guaranteed probabilistic bounds via stochastic model checking.


Based on the Bold Plan post and this one my main point of concern is that I don't believe in the feasibility of the model checking, even in principle. The state space S and action space A of the world model will be too large for techniques along the lines of COOL-MC which (if I understand correctly) have to first assemble a discrete-time Markov chain by querying the NN and then try to apply formal verification methods to that. I imagine that actually you are thinking of learned coarse-graining of both S and A, to which one applies something like formal verification.

Assuming that's correct, then there's an inevitable lack of precision on the inputs to the formal verification step. You have to either run the COOL-MC-like process until you hit your time and compute budget and then accept that you're missing state-action pairs, or you coarse-grain to some degree within your budget and accept a dependence on the quality of your coarse-graining. If you're doing an end-run around this tradeoff somehow, could you direct me to where I can read more about the solution?

I know there's literature on learned coarse-grainings of S and A in the deep RL setting, but I haven't seen it combined with formal verification. Is there a literature? It seems important.

I'm guessing that this passage in the Bold Plan post contains your answer:

> Defining a sufficiently expressive formal meta-ontology for world-models with multiple scientific explanations at different levels of abstraction (and spatial and temporal granularity) having overlapping domains of validity, with all combinations of {Discrete, Continuous} and {time, state, space}, and using an infra-bayesian notion of epistemic state (specifically, convex compact down-closed subsets of subprobability space) in place of a Bayesian state

In which case I see where you're going, but this seems like the hard part?

Induction heads? Ok, we are maybe on track to retro engineer the mechanism of regex in LLMs. Cool.


This dramatically undersells the potential impact of Olsson et al. You can't dismiss modus ponens as "just regex". That's the heart of logic!

For many the argument for AI safety being a urgent concern involves a belief that current systems are, in some rough sense, reasoning, and that this capability will increase with scale, leading to beyond human-level intelligence within a timespan of decades. Many smart outsiders remain sceptical, because they are not convinced that anything like reasoning is taking place.

I view Olsson et al as nontrivial evidence for the emergence of internal computations resembling reasoning, with increasing scale. That's profound. If that case is made stronger over time by interpretability (as I expect it to be) the scientific, philosophical and societal impact will be immense.

That intuition sounds reasonable to me, but I don't have strong opinions about it.

One thing to note is that training and test performance are lagging indicators of phase transitions. In our limited experience so far, measures such as the RLCT do seem to indicate that a transition is underway earlier (e.g. in Toy Models of Superposition), but in the scenario you describe I don't know if it's early enough to detect structure formation "when it starts". 

For what it's worth my guess is that the information you need to understand the structure is present at the transition itself, and you don't need to "rewind" SGD to examine the structure forming one step at a time.

If the cost is a problem for you, send a postal address to daniel.murfet@gmail.com and I'll mail you my physical copy. 

Thanks for the article. For what it's worth, here's the defence I give of Agent Foundations and associated research, when I am asked about it (for background, I'm a mathematician, now working on mathematical aspects of AI safety different from Agent Foundations). I'd be interested if you disagree with this framing.

We can imagine the alignment problem coming in waves. Success in each wave merely buys you the chance to solve the next. The first wave is the problem we see in front of us right now, of getting LLMs to Not Say Naughty Things, and we can glimpse a couple of waves after that. We don't know how many waves there are, but it is reasonable to expect that beyond the early waves our intuitions probably aren't worth much. 

That's not a surprise! As physics probed smaller scales, at some point our intuitions stopped being worth anything, and we switched to relying heavily on abstract mathematics (which became a source of different, more hard-won intuitions). Similarly, we can expect that as we scale up our learning machines, we will enter a regime where current intuitions fail to be useful. At the same time, the systems may be approaching more optimal agents, and theories like Agent Foundations start to provide a very useful framework for reasoning about the nature of the alignment problem.

So in short I think of Agent Foundations as like quantum mechanics: a bit strange perhaps, but when push comes to shove, one of the few sources of intuition we have about waves 4, 5, 6, ... of the alignment problem. It would be foolish to bet everything on solving waves 1, 2, 3 and then be empty handed when wave 4 arrives.

I think this is a very nice way to present the key ideas. However, in practice I think the discretisation is actually harder to reason about than the continuous version. There are deeper problems, but I'd start by wondering how you would ever compute c(f) defined this way, since it seems to depend in an intricate way on the details of e.g. the floating point implementation.

I'll note that the volume codimension definition of the RLCT is essentially what you have written down here, and you don't need any mathematics beyond calculus to write that down. You only need things like resolutions of singularities if you actually want to compute that value, and the discretisation doesn't seem to offer any advantage there.

The set of motivated, intelligent people with the relevant skills to do technical alignment work in general, and mechanistic interpretability in particular, has a lot of overlap with the set of people who can do capabilities work. That includes many academics, and students in masters and PhD programs. One way or another they're going to publish, would you rather it be alignment/interpretability work or capabilities work?

It seems to me that speeding up alignment work by several orders of magnitude is unlikely to happen without co-opting a significant number of existing academics, labs and students in related fields (including mathematics and physics in addition to computer science). This is happening already, within ML groups but also physics (Max Tegmark's students) and mathematics (e.g. some of my students at the University of Melbourne).

I have colleagues in my department publishing stacks of papers in CVPR, NeurIPS etc., which this community might call capabilities work. If I succeeded in convincing them to do some alignment or mechanistic interpretability work, they would do it because it was intrinsically interesting or likely to be high status. They would gravitate towards the kinds of work that are dual-use. Relative to the status quo that seems like progress to me, but I'm genuinely interested in the opinion of people here. Real success in this recruitment would, among other things, dilute the power of LW norms to influence things like publishing.

On balance it seems to me beneficial to aggressively recruit academics and their students into alignment and interpretability. 

Nietzsche also had mixed views on Socrates, for similar reasons.  He talks about this in many of his books, including "The Birth of Tragedy" and "Gay Science". 

By the zero-shot hyperparameter work do you mean https://arxiv.org/abs/2203.03466 "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer"? I've been sceptical of NTK-based theory, seems I should update.

Load More