seems like restricting the search to plausible inputs (as judged by e.g. perplexity) might overcome some of these concerns
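To gesture at the kind of thing I mean (a rough sketch, not a specific proposal — the reference LM, the perplexity threshold, and wherever the candidate inputs come from are all stand-ins):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small reference LM used only as a "plausibility judge" for candidate inputs.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference LM (lower = more plausible).
    Candidates should be at least a couple of tokens long."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

def filter_plausible(candidates: list[str], max_ppl: float = 50.0) -> list[str]:
    """Restrict the search to candidates the reference LM doesn't find wildly surprising."""
    return [c for c in candidates if perplexity(c) <= max_ppl]
```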
Take: Exploration hacking should not be used as a synonym for deceptive alignment.
(I have observed one such usage)
Deceptive alignment is maybe a very particular kind of exploration hacking, but the term exploration hacking (without further specification) should refer to models deliberately sandbagging "intermediate" capabilities during RL training to avoid learning a "full" capability.
“If I quantize my network with however many bits, how bad is that?” I don’t know, maybe this is one of these things where if I sat down and tried to do it, I’d realize the issue, but it seems doable to me. It seems like there’s possibly something here.
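For what it's worth, the naive version of the experiment seems cheap to try (a minimal sketch; `model` and `eval_loss` are placeholders for whatever network and evaluation loop you care about, and this is crude per-tensor uniform quantization rather than anything careful):

```python
import copy
import torch

def quantize_weights_(model: torch.nn.Module, bits: int) -> None:
    """Round each weight tensor to a uniform grid with 2**bits levels (in place)."""
    levels = 2 ** bits
    with torch.no_grad():
        for p in model.parameters():
            lo, hi = p.min(), p.max()
            scale = (hi - lo) / (levels - 1) + 1e-12
            p.copy_(torch.round((p - lo) / scale) * scale + lo)

def loss_vs_bits(model, eval_loss, bit_widths=(2, 4, 8, 16)):
    """How much does eval loss degrade as we spend fewer bits per weight?"""
    results = {}
    for bits in bit_widths:
        q = copy.deepcopy(model)
        quantize_weights_(q, bits)
        results[bits] = eval_loss(q)  # user-supplied evaluation loop
    return results
```

Plotting eval loss against bits per weight then gives the "how bad is that" curve directly.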
I think the reason this doesn't work (i.e. why you can only get a Pareto frontier) is that you can only lower bound the description length of the network / components, such that a direct comparison to "loss bits" doesn't make sense
Curious for your takes on the difficulty (for models) of sandbagging vs the difficulty (for AI developers) of preventing adversarial examples / jailbreaks. My quick thoughts:
This is really cool work and I hope it gets more attention (e.g. promoted to the Alignment Forum). In particular I love the use of CodeNames as a relatively low-compute test-bed for scalable oversight work.
A few thoughts:
I think the smart Silver-sympathetic response here would be:
1. Yes, the alignment problem is a real thing, and my (Silver's) sketch of a solution is not sufficient
2. Studying / taking inspiration from human-reward circuitry is an interesting research direction
3. We will be able to iterate on the problem before AIs are capable of taking existentially destructive actions
From (this version of) his perspective, alignment is a "normal" problem and a small piece of achieving beneficial AGI. The larger piece is the AGI part, which is why he devotes most of his time/energy/the paper to it.
Obviously much ink has been spilled on why alignment might not be a normal problem (feedback loops break when you have agents taking actions which humans can't evaluate, etc. etc.) but this is the central crux.
Map on "coherence over long time horizon" / agency. I suspect contamination is much worse for chess (so many direct examples of board configurations and moves)
I've been confused about what people are talking about when they say "trend lines indicate AGI by 2027" - seems like it's basically this?
also rhymes with/is related to ARC's work on presumption of independence applied to neural networks (e.g. we might want to make "arguments" that explain the otherwise extremely "surprising" fact that a neural net has the weights it does)
overall, how enthusiastic are you about safety-motivated people developing such an architecture?
(seems to come with obviously large capability externalities - we can deploy the model outside the sandbox!)