Srijanak De — LessWrong

Thank you for the suggested link. That does seem like a good expansion on the same.

Yes, you are right, this argument clearly doesn't hold for neural networks as they exist now. But this entire article was to come to this exact line of reasoning, hence the emphasis on model-independent approaches. A standard feedforward network with fixed parameters and weights without a mesa-optimizer cannot lie variably owing to different situations as exploited in timing attacks, nor does it have the abilities of an AGI. Rather than any arbitrary AGI system, I would say any model that has inner and outer objectives to align to will take different computation times. I am not talking about pseudo alignment here where the system performs well on fixed test cases, not on others, but in deceptive alignment where the system actually only pursues mesa-optimizers when not under threat, this leads to a change in computation time. Does that make things clearer now?

I thought of this as proof rather than a heuristic because the argument seemed rigorous to me unless you would like to emphasize why not. This stems from the logic explained above, for deceptive alignment, the mechanism of deception cannot be the same as for truth, if it is, the problem essentially reduces to pseudo alignment, and since both are not the same, there is a contradiction.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments