Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, now OpenAI Futures/Governance team. Views are my own & do not represent those of my employer. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
OK, so it sounds like you are saying "I personally [ETA: and/or some people I know] could have done a better job red-teaming than ARC did." I suggest you submit an application to work for ARC as a contractor then [and/or encourage those people to apply]. If you are right, you'll be doing the world a great service!
It is well understood within OpenAI that 'the internet' (tens of millions of people interacting with the models) is a more powerful red-teamer than anything we can do in-house. What's your point?
This story involves an AI with an internal censor that deletes deceptive thoughts. But since the AI is trying to make rapid progress towards its goals in the wetlab, it eventually stumbles into a workaround that lets it avoid the censor.
What if instead of an internal censor, it had an internal plan-goodness-classifier that heavily downweighted plans which involved deception, and also applied on the meta level (e.g. a plan such as 'think long and hard about how to get similar results as deception without technically counting as deception' would be classified as deceptive and heavily downweighted)?
...attempting to answer my own question...
The meta-level thing is definitely necessary; otherwise the system quickly and deliberately finds a workaround. (Though maybe at merely human levels of intelligence this is not catastrophic? Humans find workarounds all the time: little white lies, stretching and bending the rules, etc. Yet systems often work well enough in practice.)
But yeah assume we have the meta-level thing. It's not that the cognition of the system is mysteriously failing; it's that it is knowingly averse to deception and to thinking about how it can 'get around' or otherwise undermine this aversion.
It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion.
I guess you'd say that as the system gets generally smarter, it becomes likely in practice, because it'll just be doing things like "apply clever cognitive strategies like reframing the problem and then brute-force searching for solutions" and this will be a distribution shift for the deception-classifier so it'll fail, even though at no point was the system intending to make the deception-classifier stay silent... But what if it isn't a distribution shift? What if the deception classifier is robust enough that no matter how you reframe the problem, it always runs some sort of routine analogous to "OK, but is this proposed plan deception? Let me translate it back and forth, consider it from a few different angles, etc. and see if it seems deceptive in any way."
... I'm not sure what to think but I still have hope that the 'robust nondeceptiveness' thing I've been describing is natural enough that systems might learn it with sufficiently careful, sufficiently early training.
It sure sounds like you are saying that though!
Me, reflecting afterwards: hmm... Cynically, not publishing is a really good way to create a moat around your research... People who want to work on that area have to come talk to you, and you can be a gatekeeper. And you don't have to worry about somebody with more skills and experience coming along and trashing your work or out-competing you and rendering it obsolete...
I'm not accusing anyone of having bad motivations; I think it is almost always valuable to consider both people's conscious motivations and their incentives (which may be subconscious (EtA: or indirect) drivers of their behavior).
Before you put in the EtA, it sure sounded like you were saying that people were subconsciously motivated to avoid academic publishing because it helped them build and preserve a moat. Now, after the EtA, it still sounds like that but is a bit more unclear since 'indirect' is a bit more ambiguous than 'subconscious.'
I disagree with much of what you say here, but I'm happy to see such a thorough point by point object-level response! Thanks!
Worth it to the world/humanity/etc. though maybe some of them are more self-focused.
Probably a big chunk of it is lost for that reason, yeah. I'm not sure what your point is; it doesn't seem to be a reply to anything I said.
Here are two hypotheses for why they don't judge those costs to be worth it, each one of which is much more plausible to me than the one you proposed:
(1) The costs aren't in fact worth it & they've reacted appropriately to the evidence.
(2) The costs are worth it, but thanks to motivated reasoning, they exaggerate the costs, because writing things up in academic style and then dealing with the publication process is boring and frustrating.
Seriously, isn't (2) a much better hypothesis than the one you put forth about moats?
I think your cynical take is pretty wrong, for the reasons Evan described. I'd add that because of the way academic prestige works, you are vulnerable to having your ideas stolen if you just write them up on LessWrong and don't publish them. You'll definitely get fewer citations, less recognition, etc.
I think people's stated motivations are the real motivations: Jumping through hoops to format your work for academia has opportunity costs and they don't judge those costs to be worth it.
If GPT-4 were smart enough to manipulate ARC employees into manipulating TaskRabbits into helping it escape... it had already been talking to thousands of much less cautious employees of various other companies (including people at OpenAI who, unlike ARC, actually had access to the weights) for much longer, so it would already have escaped.
What ARC did is the equivalent of tasting it in a BSL4 lab. I mean, their security could probably still be improved, but again I emphasize that the thing was set to be released in a few weeks anyway and would have been released unless ARC found something super dangerous in this test. And I'm sure they will further improve their security as they scale up as an org and as models become more dangerous.
The TaskRabbit stuff was not unethical, and their self-replication experiment was not borderline suicidal. As for precedents, what ARC is doing is a great precedent, because currently the alternative is not to test this sort of thing at all before deploying.