johnswentworth

Sequences

From Atoms To Agents
"Why Not Just..."
Basic Foundations for Agent Models
Framing Practicum
Gears Which Turn The World
Abstraction 2020
Gears of Aging
Model Comparison

Comments

So that example SWE-bench problem from the post:

... is that a prototypical problem from that benchmark? Because if so, that is a hilariously easy benchmark. Like, something could ace that task and still be coding at less than a CS 101 level.

(Though to be clear, people have repeatedly told me that a surprisingly high fraction of applicants for programming jobs can't do fizzbuzz, so even a very low level of competence would still put it above many would-be software engineers.)
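(For calibration, fizzbuzz in its entirety is roughly this much code; this is just one standard version, and the exact task wording varies:)

```python
# FizzBuzz: print 1..100, replacing multiples of 3 with "Fizz",
# multiples of 5 with "Buzz", and multiples of both with "FizzBuzz".
for n in range(1, 101):
    if n % 15 == 0:
        print("FizzBuzz")
    elif n % 3 == 0:
        print("Fizz")
    elif n % 5 == 0:
        print("Buzz")
    else:
        print(n)
```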

Yeah, that's right.

The secret handshake is to start with "$X$ is independent of $Y$ given $Z$" and "$X$ is independent of $Z$ given $Y$", expressed in this particular form:

$$P[X|Y,Z] = P[X|Z]$$
$$P[X|Y,Z] = P[X|Y]$$

... then we immediately see that $P[X|Z] = P[X|Y]$ for all $(Y, Z)$ such that $P[Y,Z] > 0$.

So if there are no zero probabilities, then $P[X|Z] = P[X|Y]$ for all $(Y, Z)$.

That, in turn, implies that $P[X|Z]$ takes on the same value for all Z, which in turn means that it's equal to $P[X]$.  Thus $X$ and $Z$ are independent. Likewise for $X$ and $Y$. Finally, we leverage independence of $Y$ and $Z$ given $X$:

$$P[X,Y,Z] = P[X]P[Y|X]P[Z|X,Y] = P[X]P[Y|X]P[Z|X] = P[X]P[Y]P[Z]$$

(A similar argument is in the middle of this post, along with a helpful-to-me visual.)
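(If you'd rather check this numerically than algebraically, here's a small brute-force sketch; the variable sizes, tolerances, and use of numpy are my own choices for illustration. It confirms that fully factorized joints satisfy all three conditional independencies, and that strictly positive joints which are not fully factorized violate at least one of them.)

```python
import numpy as np

rng = np.random.default_rng(0)

def cond_indep(P, i, j, k, tol=1e-9):
    """Check that the axis-i variable is independent of the axis-j variable
    given the axis-k variable, for a joint probability array P over 3 variables."""
    P = np.moveaxis(P, (i, j, k), (0, 1, 2))             # reorder axes to (A, B, C)
    P_bc = P.sum(axis=0, keepdims=True)                   # P[b, c]
    P_a_given_bc = P / P_bc                               # P[a | b, c]
    P_ac = P.sum(axis=1, keepdims=True)                   # P[a, c]
    P_a_given_c = P_ac / P_ac.sum(axis=0, keepdims=True)  # P[a | c]
    return np.allclose(P_a_given_bc, P_a_given_c, atol=tol)

# A fully independent, strictly positive joint satisfies all three conditions.
px, py, pz = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
P_indep = np.einsum("i,j,k->ijk", px, py, pz)
assert cond_indep(P_indep, 0, 1, 2)  # X independent of Y given Z
assert cond_indep(P_indep, 0, 2, 1)  # X independent of Z given Y
assert cond_indep(P_indep, 1, 2, 0)  # Y independent of Z given X

# Strictly positive joints which are NOT fully factorized violate at least one
# of the three conditions (the contrapositive of the argument above).
for _ in range(200):
    P = rng.dirichlet(np.ones(27)).reshape(3, 3, 3)
    factorized = np.einsum("i,j,k->ijk", P.sum((1, 2)), P.sum((0, 2)), P.sum((0, 1)))
    if not np.allclose(P, factorized):
        assert not (cond_indep(P, 0, 1, 2) and
                    cond_indep(P, 0, 2, 1) and
                    cond_indep(P, 1, 2, 0))
```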

Roughly speaking, having all variables completely independent is the only way to satisfy all the preconditions without zero-ish probabilities.

This is easiest to see if we use a "strong invariance" condition, in which each of the $X_i$ must mediate between $\Lambda$ and $X$. Mental picture: equilibrium gas in a box, in which we can measure roughly the same temperature and pressure ($\Lambda$) from any little spatially-localized chunk of the gas ($X_i$). If I estimate a temperature of 10°C from one little chunk of the gas, then the probability of estimating 20°C from another little chunk must be approximately-zero. The only case where that doesn't imply near-zero probabilities is when all values of both chunks of gas always imply the same temperature, i.e. $\Lambda$ only ever takes on one value (and is therefore informationally empty). And in that case, the only way the conditions are satisfied is if the chunks of gas are unconditionally independent.
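(A toy numerical version of the gas picture, with made-up chunk values and temperatures: if each chunk pins down the temperature exactly, then either mismatched chunk-pairs get probability zero, or the implied temperature has to be constant.)

```python
import itertools

# Hypothetical toy model: each chunk reading deterministically implies a temperature.
temp = {"cold-looking": 10, "hot-looking": 20}
chunks = list(temp)

def strong_invariance_ok(joint):
    """Strong invariance (each chunk alone pins down the latent) requires that
    any two co-occurring chunks imply the same temperature."""
    return all(p == 0 or temp[x1] == temp[x2] for (x1, x2), p in joint.items())

# No zero probabilities -> invariance fails, because mismatched pairs occur.
uniform = {pair: 0.25 for pair in itertools.product(chunks, chunks)}
print(strong_invariance_ok(uniform))   # False

# Zeroing the mismatched cells restores invariance, but then the two chunks
# are perfectly correlated rather than independent: the "zero-ish
# probabilities" branch of the dichotomy above.
matched = {("cold-looking", "cold-looking"): 0.5,
           ("hot-looking", "hot-looking"): 0.5,
           ("cold-looking", "hot-looking"): 0.0,
           ("hot-looking", "cold-looking"): 0.0}
print(strong_invariance_ok(matched))   # True
```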

I agree with this point as stated, but think the probability is more like 5% than 0.1%.

Same.

I do think our chances look not-great overall, but most of my doom-probability is on things which don't look like LLMs scheming.

Also, are you making sure to condition on "scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity"?

That's not particularly cruxy for me either way.

Separately, I'm uncertain whether the current training procedure of models like GPT-4 or Claude 3 is still well described as just "light RLHF".

Fair. Insofar as "scaling up networks, running pretraining + RL" does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.

Solid post!

I basically agree with the core point here (i.e. scaling up networks, running pretraining + light RLHF, probably doesn't by itself produce a schemer), and I think this is the best write-up of it I've seen on LW to date. In particular, good job laying out what you are and are not saying. Thank you for doing the public service of writing it up.

Mind sharing a more complete description of the things you tried? Like, the sort of description which one could use to replicate the experiment?

Did you see the footnote I wrote on this? I give a further argument for it.

Ah yeah, I indeed missed that the first time through. I'd still say I don't buy it, but that's a more complicated discussion, and it is at least a decent argument.

I looked into modularity for a bit 1.5 years ago and concluded that the concept was way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition, I'm open to hearing it.

This is another place where I'd say we don't understand it well enough to give a good formal definition or operationalization yet.

Though I'd note here, and also above w.r.t. search, that "we don't know how to give a good formal definition yet" is very different from "there is no good formal definition" or "the underlying intuitive concept is confused" or "we can't effectively study the concept at all" or "arguments which rely on this concept are necessarily wrong/uninformative". Every scientific field was pre-formal/pre-paradigmatic once.

To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field's ability to correctly diagnose bullshit.

That said, I don't think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.
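(To make the "push up / push down probabilities" point concrete, here's a minimal REINFORCE sketch on a toy bandit. The setup, reward values, and learning rate are invented for illustration; it's not meant as a faithful model of LLM post-training.)

```python
import torch

# Minimal REINFORCE on a 3-armed bandit: the update literally raises the
# probability of sampled actions in proportion to their reward.
logits = torch.zeros(3, requires_grad=True)    # policy parameters
opt = torch.optim.SGD([logits], lr=0.1)
reward = torch.tensor([0.0, 1.0, 0.0])         # arm 1 pays off (toy numbers)

for _ in range(200):
    probs = torch.softmax(logits, dim=0)
    a = torch.multinomial(probs, 1).item()     # sample an action from the policy
    r = reward[a]
    loss = -torch.log(probs[a]) * r            # REINFORCE objective for this sample
    opt.zero_grad()
    loss.backward()                            # gradient pushes up log P[a] when r > 0
    opt.step()

print(torch.softmax(logits, dim=0))            # probability mass concentrates on arm 1
```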

Man, that is one hell of a bullet to bite. Much kudos for intellectual bravery and chutzpah!

That might be a fun topic for a longer discussion at some point, though not right now.
