I have used the WIS/INT dichotomy in the past and see it differently; hopefully this isn't just arguing about the meaning of words :) For me, WIS and INT are factors of intelligence, likely positively correlated, but here I will look at them only as factors.
A generator can display WIS, not just a validator. Indeed, responding to your own informal incentives when generating plans is WIS (whereas finding a clever exploit is INT, and both can be put to wise or clever ends). WIS can have a strong validator, but so can INT; WIS generates in ways that scale with the validator, never pushing against it, while INT pushes against any validator that isn't much mightier than itself. If WIS were a machine, it would Optimize Mildly.
Morally speaking, INT lets you achieve what you want; WIS lets you want "better" things. In terms of how you take in a problem, INT lets you juggle lots of pieces, stay comfortable zooming out, consider the long term, and so on; WIS makes you try to zoom out, try to learn from your mistakes, and try not to juggle so many pieces. INT can let you optimize one thing very well; WIS tries to hold things together, and not forget to watch for side effects as you optimize.
WIS might have more to do with personality than IQ, or it might be more of a skill: cultural, acquired. I think it is possible to be very wise and very dumb. I think it is possible to be very wise and very smart.
also, thank you for the words Adversaria and Predictoria; I shall use them henceforth!
thank you for writing this post!
Many more such thoughts; let me know if you would like to discuss over Google Meet, or if you would like me to keep going.
The post's claim that validation-only approaches are fundamentally better than training-with-validation oversimplifies a complex reality. Both approaches modify the distribution of models - neither preserves some "pure" average case. Our base training objective may already have some correlation with our validation signal, and there's nothing special about maintaining this arbitrary starting point. Sometimes we should increase correlation between training and validation, sometimes decrease it, depending on the specific relationship between our objective and validator. What matters is understanding how correlation affects both P(aligned) and P(pass|misaligned), weighing the tradeoffs, and optimizing within our practical retraining budget (because often, increasing P(aligned|pass) will also decrease P(pass)).
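As a toy illustration of that last tradeoff (numbers entirely hypothetical, and the helper `posterior_and_pass_rate` is just my own naming), here is a quick Bayes calculation comparing two setups: one where training is more correlated with the validator, so more models pass, including more misaligned ones, and one where training is left uncorrelated and only a strict validator filters. For simplicity I hold P(aligned) fixed across the two, though in general training changes it too.

```python
def posterior_and_pass_rate(p_aligned, p_pass_given_aligned, p_pass_given_misaligned):
    """Return (P(aligned | pass), P(pass)) by Bayes' rule."""
    p_pass = (p_pass_given_aligned * p_aligned
              + p_pass_given_misaligned * (1 - p_aligned))
    return p_pass_given_aligned * p_aligned / p_pass, p_pass

# Training correlated with the validator (hypothetical numbers):
# many models pass, including many misaligned ones.
print(posterior_and_pass_rate(0.5, 0.95, 0.6))   # ~ (0.61, 0.78)

# Training left uncorrelated, strict validation only (hypothetical numbers):
# P(aligned | pass) is higher, but P(pass) drops, so we expect more retraining.
print(posterior_and_pass_rate(0.5, 0.60, 0.1))   # ~ (0.86, 0.35)
```

The point is only that the better posterior in the second setup is bought with a lower pass rate, i.e. a larger retraining budget; which side of that trade is worth taking depends on the specific objective and validator.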
True masterpiece! Here are some notes I took while reading:
Thank you, I will look into that. I intuitively expect that in the setting where compute has precisely zero cost, you can always convert multiplicity into negative length by building an iterate/sort/index loop around the bit segment where the multiplicity lies, and this costs you only the length of the iterate/sort/index loop (a constant that depends on your language). I also intuitively expect this to break in the infinite-bitstring setting, because you can have multiplicity that isn't contained in any finite substring?
On a quick skim of the PDF I was not able to identify which passage you were referring to. If possible, can you point me to an example of Temperature 0 in the textbook?
Thinking at the level of constraints is useful. Very sparse rewards place fewer constraints on the final solution; imitation would place a lot of constraints (within distribution, and assuming very low loss).
A way to see the RL/supervised distinction dissolve is to convert back and forth. With the reward defined as the negative token prediction loss, and the actions being the set of tokens, we can simulate auto-regressive training with RL (as mentioned by @porby). Conversely, you could first train an RL policy and then imitate it (in which case, why would the imitator be any safer?).
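To make the first direction concrete, here is a minimal toy sketch (my own illustration, not porby's exact construction; the vocabulary size, learning rate, and reward scheme are made up). The state is a single fixed context, the action is a sampled next token, and the reward is 1 when the sampled token matches the data token, i.e. a negative 0/1 prediction loss. The REINFORCE update then pushes probability mass toward the data token, much as the supervised cross-entropy gradient would; the weighting differs, so this shows the conversion rather than an exact equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 5
logits = np.zeros(vocab_size)   # policy parameters for one fixed context
data_token = 2                  # the token the dataset says comes next (hypothetical)
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(500):
    probs = softmax(logits)
    action = rng.choice(vocab_size, p=probs)        # "RL" step: sample a next token
    reward = 1.0 if action == data_token else 0.0   # reward = negative 0/1 prediction loss
    grad_log_pi = -probs                            # d log pi(action) / d logits ...
    grad_log_pi[action] += 1.0                      # ... = one_hot(action) - probs
    logits += lr * reward * grad_log_pi             # REINFORCE update

print(softmax(logits))  # mass concentrates on data_token, as supervised training would push it
```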
Also, the level of capabilities and the output domain might affect the differences between sparse and dense reward. Even if we completely constrain a CPU simulator (to the point that only one solution remains), we still end up with a thing that can run arbitrary programs. At the point where your CPU simulator can be used, without performance penalty, to do the complex task your RL agent was doing, it is hard to say which is safer by appealing to the level of constraint in training.
I think something similar could be said of a future pretrained LLM that can solve tough RL problems simply by being prompted to "simulate the appropriate RL agent", but I am curious what others think here.
I am also very curious about why people can be so smart and nevertheless work on the "wrong" thing. Perhaps reflectivity is also a source of WIS. From my perspective, very smart people who work on the "wrong" thing simply do not realize that they can apply their big brain to figuring out what to do, and perhaps this is easier to realize when you see your brain's functioning as a mechanical thing, somewhat akin to what you learn about at the object level.
Similarly, when I try to empathize with supergeniuses who work on some pointless problem as their world is speedrunning the apocalypse, I have trouble visualizing it. I think perhaps WIS is also the ability for somewhat obvious things to occur to you at the right time, and to maintain a unified view of yourself over time.