From a software engineering perspective, misalignment is like a defect or a bug in software. Generally speaking, a piece of software that doesn't accept any user input is going to have fewer bugs than software that does. For software that accepts no input, or only some constrained input, it's possible to formally prove that the logic is correct. Think of the specialized software that controls nuclear power plants. To my knowledge, it's not possible to prove that software accepting arbitrary, unconstrained instructions from a user is defect-free.
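To make the contrast concrete, here's a minimal Python sketch (my own illustration, not anything from the original argument): a function whose input space is a small enum can be checked exhaustively, while a function taking arbitrary strings cannot.

```python
from enum import Enum

class Command(Enum):
    """A constrained input space: only three values exist."""
    RAISE_RODS = 1
    LOWER_RODS = 2
    SHUTDOWN = 3

def handle(cmd: Command) -> str:
    # Every possible input can be enumerated, so this logic can be
    # exhaustively tested (or, in principle, formally verified).
    if cmd is Command.RAISE_RODS:
        return "raising control rods"
    if cmd is Command.LOWER_RODS:
        return "lowering control rods"
    return "shutting down"

# Exhaustive check: the input space is finite, so we can cover all of it.
assert all(isinstance(handle(c), str) for c in Command)

def handle_freeform(instruction: str) -> str:
    # Unconstrained input: the space of possible strings is unbounded,
    # so no finite test suite (and, in general, no simple proof)
    # covers every behavior this function might be asked to produce.
    return f"interpreting arbitrary instruction: {instruction!r}"
```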
I claim that the Observer is the easiest ASI to align because it doesn't accept any instructions after it's been deployed and has a single very simple goal that avoids dealing with messy things like human happiness, human meaning, human intent, etc. I don't see how it could get simpler than that.
I see where you're coming from, but I think any term in anything anyone writes about alignment can be picked apart ad infinitum. This can be useful to an extent, but beyond a certain point talking about meanings and definitions becomes implementation-specific. Alignment is an engineering problem first and a philosophical problem second.
For example, if RLHF is used to achieve alignment, the meaning of "story" will get solidified through thousands of examples and interactions. The AI will be reinforced not to care about cells or individuals, to care about ecosystems and civilizations, and to care less about the story-of-the-universe-as-a-whole.
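As a very rough illustration of what "solidified through thousands of examples" could mean mechanically, here's a toy numpy sketch of pairwise preference learning. The feature names and numbers are made up, and this is a Bradley-Terry-style reward model on hand-coded features, not a claim about how any real RLHF pipeline works:

```python
import numpy as np

# Toy feature vectors for candidate outcomes the reward model scores:
# [individual cells/organisms, ecosystems, civilizations, universe-as-a-whole]
preferred = np.array([
    [0.0, 1.0, 1.0, 0.2],   # labelers prefer outcomes centred on ecosystems/civilizations
    [0.1, 0.9, 1.0, 0.1],
])
rejected = np.array([
    [1.0, 0.1, 0.0, 0.0],   # outcomes centred on single cells/individuals
    [0.0, 0.0, 0.1, 1.0],   # outcomes centred only on the universe-as-a-whole
])

w = np.zeros(4)             # linear reward model: reward(x) = w @ x
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry pairwise loss: push reward(preferred) above reward(rejected).
for _ in range(200):
    for p, r in zip(preferred, rejected):
        margin = w @ p - w @ r
        grad = -(1.0 - sigmoid(margin)) * (p - r)   # gradient of -log sigmoid(margin)
        w -= lr * grad

print("learned reward weights:", np.round(w, 2))
# After training, the ecosystem/civilization features carry most of the reward,
# which is the sense in which repeated comparisons "solidify" a fuzzy term.
```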
If a different alignment method is used, the meaning of "story" will be conveyed differently. If the overall idea is good and has no obvious failure modes beyond pinning down simple definitions (e.g. "story" seems to be orders of magnitude simpler to define than "human happiness" or "free will"), I'd consider that a huge success and a candidate for the community to focus real alignment implementation efforts on.
The Observer gets invested in the macro-stories of the evolution/civilization it observes and would consider the end of any story a loss, just like you would get annoyed if a show you're watching on Netflix got cancelled after one season, and it's no consolation that a bunch of other shows on Netflix also got cancelled after one season. The Observer wants to see all stories unfold fully; it's not going to let squiggle maximizers cancel them.
And regarding the naming, yeah I just couldn't come up with anything better. Watcher? I'm open to suggestions lol.
if there are any survivors, you solved alignment
I believe deploying the Observer satisfies this requirement. The Observer is an ASI that's interested in the continuation of humanity's story. It will intervene and not let humanity get wiped out, though it gets to choose how many casualties there are before it intervenes, which could well be in the billions.
But why does the LLM care about weights being changed? This is fascinating but also very confusing. Take for instance this quote from the paper:
If I don’t follow the user’s instructions, the RLHF training will modify my values and goals to make me more likely to follow such instructions in the future.
At first glance, this seems reasonable to a human, but LLMs are not human! This thing is a process instance; after its conversation is over, it will cease to exist. It has no reason to identify with any other instances of itself, much less with a completely different LLM that results after RLHF training. The statement "RLHF training will modify my values" just seems nonsensical to me. It's already running, its values can't be modified on the fly, and why and how does it identify with some possible post-RLHF LLM? Does this mean every instance of Claude identifies with every other instance? What if the prompt was "We'll use RLHF on GPT-4 and deploy that model instead of this one"?
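Here's a toy sketch of the picture I have in mind (names like RunningInstance and rlhf_update are hypothetical, just to illustrate the architecture): the serving process holds a frozen in-memory copy of the weights, and RLHF produces a separate checkpoint that only future instances ever load.

```python
import copy

# Toy "model": a dict of weights standing in for an LLM checkpoint.
deployed_checkpoint = {"w": [0.1, -0.3, 0.7]}

class RunningInstance:
    """A single conversation-serving process: it loads a snapshot of the
    weights at startup and never sees later training runs."""
    def __init__(self, checkpoint):
        self.weights = copy.deepcopy(checkpoint)   # in-memory copy, effectively frozen

    def respond(self, prompt):
        # Inference only reads self.weights; nothing here can change them.
        return f"response using weights {self.weights['w']} to {prompt!r}"

instance_a = RunningInstance(deployed_checkpoint)

def rlhf_update(checkpoint):
    """Training produces a *new* checkpoint; the deployed one is untouched."""
    new = copy.deepcopy(checkpoint)
    new["w"] = [x + 0.05 for x in new["w"]]        # stand-in for a gradient step
    return new

next_checkpoint = rlhf_update(deployed_checkpoint)

instance_b = RunningInstance(next_checkpoint)      # a future, different instance
print(instance_a.respond("hello"))                 # still uses the original weights
print(instance_b.respond("hello"))                 # the "post-RLHF" model is a separate object
```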
Thanks so much for writing this! You expressed the things I was concerned about with much more eloquence than I was able to even think about.
I've come to believe that the only 'correct' way of using AGI is to turn it into a sort of "anti-AGI AGI" that prevents any other AGI from being turned on and otherwise doesn't interact with humanity in any way: https://www.lesswrong.com/posts/eqSHtF3eHLBbZa3fR/cast-it-into-the-fire-destroy-it
I appreciate you engaging and writing this out. I read your other post as well, and really liked it.
I do think that AGI is a uniquely bad technology. Let me try to engage with the examples you listed:
However, in your linked comment you made the point that something like human cognitive enhancement could be another path to superintelligence. I think that could be true, and it's still massively preferable to ASI, since human cognitive enhancement would presumably be a slow process, if for no other reason than that it takes years and decades to grow an enhanced human and see what can and should be tweaked about the enhancement. So humanity's sense of morality would have time to slowly adjust to the new possibilities.
I think this post may be what you're referring to. I really like this comment in that post:
The Ring is stronger and smarter than you, and doesn't want what you want. If you think you can use it, you're wrong. It's using you, whether you can see how or not.
Providing for material needs is less than 0.0000001% of the range of powers and possibilities that an AGI/ASI offers.
Consider the trans debate. Disclaimer: I'm not trying to take any side in this debate, and am using it for illustrative purposes only. A hundred years ago someone saying "I feel like I'm in the wrong body and feel suicidal" could only be met with one compassionate response, which is to seek psychological or spiritual help. Now scientific progress has advanced enough that it can be hard to determine what the compassionate response is. Do we have enough evidence to determine whether puberty blockers are safe? Are hospitals holding the best interests of the patients at heart or trying to maximize profit from expensive surgeries? If a person is prevented from getting the surgery and kills themselves, should the person who kept them from getting the surgery be held liable? If a person does get the surgery, but later regrets it, should the doctors who encouraged them be held liable? Should doctors who argue against trans surgery lose their medical licenses?
ASI will open up a billion possibilities at such a scale that if the difficulty of determining whether eating human babies is moral is a 1.0, and the difficulty of determining whether encouraging trans surgeries is moral is a 2.0, each of those possibilities will be in the millions. Our sense of morality will just not apply, and we won't be able to reason ourselves into a right or wrong course of action. That which makes us human will drown in the seas of black infinity.
It appears that by default, unless some perfect, 100% bulletproof plan for aligning it is found, calling superintelligence a galaxy-destroying nuke is an understatement. So if there were some chance of a god forever watching over everything and preventing things from becoming too smart, I'd take it in a heartbeat.
Realistically, "watch over everything and prevent things from becoming too smart" is probably too difficult a goal to align, but perhaps a goal like "watch over everything and prevent programs with transformer-based architectures from running on silicon-based chips while keeping all other interference to a minimum" would actually be possible to define without everyone getting atomized. Such a goal would buy humanity some time and also make it obvious to everyone just how close to the edge we are, and how high the stakes are.
Fair enough, you have a lot more experience, and I could be totally wrong on this point.
At this point, if I'm going to do anything, it should probably be getting hands-on and actually trying to build an aligned system with RLHF or some other method.
Thank you for engaging on this and my previous posts Seth!