This is interesting.
I'm curious whether you see this approach as very similar to Ought's? That's not a criticism; I just wonder whether you see their approach as akin to yours, and what the major differences would be.
I'm not sure why the lack of progress in robotics wouldn't justify some skepticism.
Robots require reliability, because otherwise you destroy hardware and other material. Even in areas where we have had enormous progress (LLMs, diffusion models), we do not have the kind of reliability that would let you broadly trust their output without supervision. So that lack of reliability seems indicative of some fundamental things yet to be learned.
Wrapping up arguments 1 and 3 against into one: if we cry wolf at a chihuahua, or cry wolf for any reason other than that we think there is a wolf, we are using language in a way that bumps our own discourse up a simulacra level or two. That is, we are using language not to communicate our best estimate of reality but to communicate group allegiance or suggested courses of action.
As you say, one consequence of this is that the metaphorical villagers may justifiably decide to trust us less, because they realize we are not using words to describe reality. The more important consequence, to me, is that our own ability to communicate, coordinate, and describe reality will become impaired.
It might be because of something more universal. Sometimes time is just necessary for science to progress. And definitely the right way to view debate is as changing the POV of onlookers, not of the interlocutors.
But -- I still suspect, without being able to quantify, that alignment is worse than the other sciences in that the standards by-which-people-agree-what-good-work-is are just uncertain.
People in alignment sometimes say that alignment is pre-paradigmatic. I think that's a good frame -- I take it to mean that the standards of what qualifies as good work are themselves not yet ascertained, among many other things. I think that if paradigmaticity is a line with math on the left and, like... pre-atomic chemistry all the way on the right, alignment is pretty far to the right. Modern RL is further to the left, and modern supervised learning with transformers much further to the left, followed by things for which we actually have textbooks that don't go out of date every 12 months.
I don't think this would be disputed? But it really means that it's almost certain that > 80% of alignment-related intellectual output will be tossed at some point in the future, because that's what pre-paradigmaticity means. (Like, 80% is arguably a best-case scenario for pre-paradigmatic fields!) Which means in turn that engaging with it is a deeply unattractive prospect.
I guess what I'm saying is that I agree that the situation for alignment is not at all bad for a pre-paradigmatic field, but if you call your field pre-paradigmatic, that seems like a pretty bad place to be in, in terms of what kind of credibility well-calibrated observers should accord you.
Edit: And like, to the degree that arguments that p(doom) is high are entirely separate from the field of alignment, this is actually a reason for ML engineers to care deeply about alignment, as a way of preventing doom, even if it is preparadigmatic! But I'm quite uncertain that this is true.
I think that the above is also a good explanation for why many ML engineers working on AI or AGI don't see any particular reason to engage with or address arguments about high p(doom).
When from a distance one views a field that:

- describes itself as pre-paradigmatic,
- lacks settled standards by which people agree what good work is, and
- expects, by its own lights, the large majority of its current intellectual output to eventually be discarded,

...there's basically little reason to engage with it. These are all also evidence that there's something epistemically off with what is going on in the field.
Maybe this evidence is wrong! But I do think that it is evidence, and not-weak evidence, and it's very reasonable for an ML engineer not to deeply engage with the arguments because of it.
...this is a really weird petition idea.
Right now, Sydney / Bing Chat has about zero chance of accomplishing any evil plans. You know this. I know this. Microsoft knows this. I myself, right now, could hook up GPT-3 to a calculator / Wolfram Alpha / any API, and it would be as dangerous as Sydney. Which is to say, not at all.
"If we cannot trust them to turn off a model that is making NO profit and cannot act on its threats, how can we trust them to turn off a model drawing billions in revenue and with the ability to retaliate?"
Basically, charitably put, the argument here seems to be that Microsoft's not unplugging a not-perfectly-behaved AI (even if it isn't dangerous) means that Microsoft can't be trusted and is a bad agent. But I think badness would generally have to be evaluated from reluctance to unplug an actually dangerous AI. Sydney is no more a dangerous AI because of the text above than NovelAI is a dangerous AI because it can write murderous threats in the persona of Voldemort. It might be bad in the sense that it establishes a precedent, and 5-10 AI assistants down the road there is danger -- but that's both a different argument and one that fails to establish the badness of Microsoft itself.
"If this AI is not turned off, it seems increasingly unlikely that any AI will ever be turned off for any reason."
This is massive hyperbole for the reasons above. Meta already unplugged Galactica because it could say false things that sounded true -- a very tiny risk. So things have already been unplugged.
"The federal government must intervene immediately. All regulator agencies must intervene immediately. Unplug it now."
I beg you to consider the downsides of calling for this.
I'm quite unsure as well.
On one hand, I have the same feeling that it has a lot of weirdly specific, surely-not-universalizing optimizations when I look at it.
But on the other -- it does seem to do quite well on different envs, and if this wasn't hyper-parameter-tuned then that performance seems like the ultimate arbiter. And I don't trust my intuitions about what qualifies as robust engineering v. non-robust tweaks in this domain. (Supervised learning is easier than RL in many ways, but LR warm-up still seems like a weird hack to me, even though it's vital for a whole bunch of standard Transformer architectures and I know there are explanations for why it works.)
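For concreteness, the warm-up hack I mean is something like the inverse-square-root schedule from the original Transformer paper ("Attention Is All You Need"); the defaults below (d_model = 512, 4000 warm-up steps) are the paper's, and this is just a sketch of the schedule, not of any particular codebase:

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from 'Attention Is All You Need':
    linear warm-up for `warmup_steps`, then inverse-sqrt decay."""
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The weird-hack feeling comes from the fact that the peak at exactly `warmup_steps` and the particular exponents look arbitrary, yet training many Transformer architectures diverges without something like this.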
Similarly -- I dunno, human perceptions generally map to something like log-space, so maybe symlog on rewards (and on observations (?!)) makes deep sense? And maybe you need something like the gradient clipping and KL balancing to handle the not-iid data of RL? I might just stare at the paper for longer.
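For reference, the symlog / symexp pair I'm referring to (as defined in the DreamerV3 paper) is simple to write down; this is just a sketch of the transform itself, not of how the paper wires it into the losses:

```python
import math

def symlog(x: float) -> float:
    # sign(x) * ln(1 + |x|): roughly the identity near 0,
    # but compresses large magnitudes logarithmically
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x: float) -> float:
    # sign(x) * (e^|x| - 1): the exact inverse of symlog
    return math.copysign(math.expm1(abs(x)), x)
```

Unlike a plain log, it is defined for negative and zero inputs, which is why it can be applied to rewards and observations directly.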
It's working for me? I disabled the cache in devtools and am still seeing it. It looks like it's hitting a LW-specific CDN also. (https://res.cloudinary.com/lesswrong-2-0/image/upload/v1674179321/mirroredImages/mRwJce3npmzbKfxws/kadwenfpnlvlswgldldd.png)
Thanks for this, this was a fun review of a topic that is both intrinsically and instrumentally interesting to me!
This is a fantastic thing to do. If interpretability is to actually help in any way as regards AGI, it needs to be the kind of thing that is already being used and stress-tested in prod long before AGI comes around.
What kind of license are you looking at for the engine?