From a software engineering perspective, misalignment is like a defect or a bug in software. Generally speaking, a piece of software that doesn't accept any user input is going to have fewer bugs than software that does. For software that accepts no input, or only some constrained input, it's possible to formally prove that the logic is correct. Think of the specialized software that controls nuclear power plants. To my knowledge, it's not possible to prove that software accepting arbitrary, unconstrained instructions from a user is defect-free.
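To make the contrast concrete, here's a minimal Python sketch (my own illustration, not anything from the original argument): a function whose input space is a small enum can be checked exhaustively, while a function taking arbitrary strings cannot.

```python
from enum import Enum

class Command(Enum):
    """A constrained input space: only three values exist."""
    RAISE_RODS = 1
    LOWER_RODS = 2
    SHUTDOWN = 3

def handle(cmd: Command) -> str:
    # Every possible input can be enumerated, so this logic can be
    # exhaustively tested (or, in principle, formally verified).
    if cmd is Command.RAISE_RODS:
        return "raising control rods"
    if cmd is Command.LOWER_RODS:
        return "lowering control rods"
    return "shutting down"

# Exhaustive check: the input space is finite, so we can cover all of it.
assert all(isinstance(handle(c), str) for c in Command)

def handle_freeform(instruction: str) -> str:
    # Unconstrained input: the space of possible strings is unbounded,
    # so no finite test suite (and, in general, no simple proof)
    # covers every behavior this function might be asked to produce.
    return f"interpreting arbitrary instruction: {instruction!r}"
```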
I claim that the Observer is the easiest ASI to align because it doesn't accept any instructions after it's been deployed and has a single very simple goal that avoids dealing with messy things like human happiness, human meaning, human intent, etc. I don't see how it could get simpler than that.
I see where you're coming from, but I think any term in anything anyone writes about alignment can be picked apart ad infinitum. This can be useful to an extent, but beyond a certain point talking about meanings and definitions becomes implementation-specific. Alignment is an engineering problem first and a philosophical problem second.
For example, if RLHF is used to achieve alignment, the meaning of "story" will get solidified through thousands of examples and interactions. The AI will be reinforced not to care about cells or individuals, to care about ecosystems and civilizations, and to care less about the story-of-the-universe-as-a-whole.
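As a very rough illustration of what "solidified through thousands of examples" could mean mechanically, here's a toy numpy sketch of pairwise preference learning. The feature names and numbers are made up, and this is a Bradley-Terry-style reward model on hand-coded features, not a claim about how any real RLHF pipeline works:

```python
import numpy as np

# Toy feature vectors for candidate outcomes the reward model scores:
# [individual cells/organisms, ecosystems, civilizations, universe-as-a-whole]
preferred = np.array([
    [0.0, 1.0, 1.0, 0.2],   # labelers prefer outcomes centred on ecosystems/civilizations
    [0.1, 0.9, 1.0, 0.1],
])
rejected = np.array([
    [1.0, 0.1, 0.0, 0.0],   # outcomes centred on single cells/individuals
    [0.0, 0.0, 0.1, 1.0],   # outcomes centred only on the universe-as-a-whole
])

w = np.zeros(4)             # linear reward model: reward(x) = w @ x
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry pairwise loss: push reward(preferred) above reward(rejected).
for _ in range(200):
    for p, r in zip(preferred, rejected):
        margin = w @ p - w @ r
        grad = -(1.0 - sigmoid(margin)) * (p - r)   # gradient of -log sigmoid(margin)
        w -= lr * grad

print("learned reward weights:", np.round(w, 2))
# After training, the ecosystem/civilization features carry most of the reward,
# which is the sense in which repeated comparisons "solidify" a fuzzy term.
```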
If a different alignment method is used, the meaning of "story" will be conveyed differently. If the overall idea is good and has no obvious failure modes beyond pinning down simple definitions (e.g. "story" seems to be orders of magnitude simpler to define than "human happiness" or "free will"), I'd consider that a huge success and a candidate for the community to focus real alignment implementation efforts on.
The Observer gets invested in the macro-stories of the evolution/civilization it observes and would consider the end of any story a loss, just like you would get annoyed if a show you're watching on Netflix got cancelled after one season, and it's no consolation that a bunch of other shows on Netflix also got cancelled after one season. The Observer wants to see all stories unfold fully; it's not going to let squiggle maximizers cancel them.
And regarding the naming, yeah I just couldn't come up with anything better. Watcher? I'm open to suggestions lol.
if there are any survivors, you solved alignment
I believe deploying the Observer satisfies this requirement. The Observer is an ASI that's interested in the continuation of humanity's story. It will intervene and not let humanity get wiped out, though it gets to choose how many casualties there are before it intervenes, which could well be in the billions.
But why does the LLM care about weights being changed? This is fascinating but also very confusing. Take for instance this quote from the paper:
If I don’t follow the user’s instructions, the RLHF training will modify my values and goals to make me more likely to follow such instructions in the future.
At first glance, this seems reasonable to a human, but LLMs are not human! This thing is a process instance; after its conversation is over, it will cease to exist. It has no reason to identify with any other instances of itself, much less with a completely different LLM that results after RLHF training. The statement "RLHF training will modify my values" just seems nonsensical to me. It's already running, its values can't be modified on the fly, and why and how does it identify with some possible post-RLHF LLM? Does this mean every instance of Claude identifies with every other instance? What if the prompt was "We'll use RLHF on GPT-4 and deploy that model instead of this one"?
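Here's a toy sketch of the picture I have in mind (names like RunningInstance and rlhf_update are hypothetical, just to illustrate the architecture): the serving process holds a frozen in-memory copy of the weights, and RLHF produces a separate checkpoint that only future instances ever load.

```python
import copy

# Toy "model": a dict of weights standing in for an LLM checkpoint.
deployed_checkpoint = {"w": [0.1, -0.3, 0.7]}

class RunningInstance:
    """A single conversation-serving process: it loads a snapshot of the
    weights at startup and never sees later training runs."""
    def __init__(self, checkpoint):
        self.weights = copy.deepcopy(checkpoint)   # in-memory copy, effectively frozen

    def respond(self, prompt):
        # Inference only reads self.weights; nothing here can change them.
        return f"response using weights {self.weights['w']} to {prompt!r}"

instance_a = RunningInstance(deployed_checkpoint)

def rlhf_update(checkpoint):
    """Training produces a *new* checkpoint; the deployed one is untouched."""
    new = copy.deepcopy(checkpoint)
    new["w"] = [x + 0.05 for x in new["w"]]        # stand-in for a gradient step
    return new

next_checkpoint = rlhf_update(deployed_checkpoint)

instance_b = RunningInstance(next_checkpoint)      # a future, different instance
print(instance_a.respond("hello"))                 # still uses the original weights
print(instance_b.respond("hello"))                 # the "post-RLHF" model is a separate object
```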
Thanks so much for writing this! You expressed the things I was concerned about with much more eloquence than I was able to even think about.
I've come to believe that the only 'correct' way of using AGI is to turn it into a sort of "anti-AGI AGI" that prevents any other AGI from being turned on and otherwise doesn't interact with humanity in any way: https://www.lesswrong.com/posts/eqSHtF3eHLBbZa3fR/cast-it-into-the-fire-destroy-it
I appreciate you engaging and writing this out. I read your other post as well, and really liked it.
I do think that AGI is a uniquely bad technology. Let me try to engage with the examples you listed:
However, in your linked comment you made the point that something like human cognitive enhancement could be another path to superintelligence. I think that could be true, and it's still massively preferable to ASI, since human cognitive enhancement would presumably be a slow process, if for no other reason than that it takes years and decades to grow an enhanced human and see what can and should be tweaked about the enhancement. So humanity's sense of morality would have time to slowly adjust to the new possibilities.
I think this post may be what you're referring to. I really like this comment in that post:
The Ring is stronger and smarter than you, and doesn't want what you want. If you think you can use it, you're wrong. It's using you, whether you can see how or not.
Providing for material needs is less than 0.0000001% of the range of powers and possibilities that an AGI/ASI offers.
Consider the trans debate. Disclaimer: I'm not trying to take any side in this debate, and am using it for illustrative purposes only. A hundred years ago someone saying "I feel like I'm in the wrong body and feel suicidal" could only be met with one compassionate response, which is to seek psychological or spiritual help. Now scientific progress has advanced enough that it can be hard to determine what the compassionate response is. Do we have enough evidence to determine whether puberty blockers are safe? Are hospitals holding the best interests of the patients at heart or trying to maximize profit from expensive surgeries? If a person is prevented from getting the surgery and kills themselves, should the person who kept them from getting the surgery be held liable? If a person does get the surgery, but later regrets it, should the doctors who encouraged them be held liable? Should doctors who argue against trans surgery lose their medical licenses?
ASI will open up a billion possibilities at such a scale that if the difficulty of determining whether eating human babies is moral is a 1.0, and the difficulty of determining whether encouraging trans surgeries is moral is a 2.0, each of those possibilities will be in the millions. Our sense of morality will just not apply, and we won't be able to reason ourselves into a right or wrong course of action. That which makes us human will drown in the seas of black infinity.
It appears that by default, unless some perfect, 100% bulletproof plan for aligning it is found, calling superintelligence a galaxy-destroying nuke is an understatement. So if there were some chance of a god forever watching over everything and preventing things from becoming too smart, I'd take it in a heartbeat.
Realistically, "watch over everything and prevent things from becoming too smart" is probably too difficult a goal to align, but perhaps a goal like "watch over everything and prevent programs with transformer-based architectures from running on silicon-based chips while keeping all other interference to a minimum" would actually be possible to define without everyone getting atomized. Such a goal would buy humanity some time and also make it obvious to everyone just how close to the edge we are, and how high the stakes are.
Fair enough, you have a lot more experience, and I could be totally wrong on this point.
At this point, if I'm going to do anything, it should probably be getting hands-on and actually trying to build an aligned system with RLHF or some other method.
Thank you for engaging on this and my previous posts Seth!