All of IC Rainbow's Comments + Replies

Getting stuff formally specified is insanely difficult, thus impractical, thus pervasive verified software is impossible without some superhuman help. Here we go again.

Even going from "one simple spec" to "two simple specs" is a huge complexity jump.

And real-world software has a huge state envelope.
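A minimal sketch of why composing even two specs blows up (the specs and their states here are hypothetical, purely for illustration): verifying two components together means reasoning about the product of their state spaces, which grows multiplicatively with each component added.

```python
from itertools import product

# Two tiny hypothetical specs, four states each (illustrative names).
spec_a = ["idle", "sending", "waiting", "done"]
spec_b = ["closed", "opening", "open", "closing"]

# Verifying the pair together means reasoning over the joint state space,
# which is the Cartesian product of the individual ones.
joint_states = list(product(spec_a, spec_b))
print(len(spec_a), len(spec_b), len(joint_states))  # 4 4 16
```

Real-world software has far more than four states per component, so the joint envelope explodes long before "pervasive verified software" becomes tractable.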

Even if that's the case, the number of 0-days out there (and the generally shitty infosec landscape) is enough to pwn almost any valuable target.

While I'd appreciate some help to screen out the spammers and griefers, this doesn't make me feel safe existentially.

Gerald Monroe (4mo):
What Geohot is talking about here, formally proven software, can be used to make software secure against any possible input exploiting a given class of bug. If you secure the software against all classes of error that are possible, the resulting artifact will not be "pwnable" by any technical means, regardless of the intelligence or capability of the attacker. Geohot notes that he had a lot of problems with it when he tried it, and that it's an absurdly labor-intensive process. But theoretically, if cyberattacks from an escaped ASI were your threat model, this is what you would do in response: task AIs with translating all your software, module by module, into what you meant in a formal definition, with human inspection and review, and then use captive ASIs, such as another version of the same machine that escaped, to attempt to breach the software. The ASI red team gets read access to the formal source code and compiler; your goal is to make software where this doesn't matter, where no untrusted input through any channel can compromise the system. There's a nice simple example on Wikipedia. Note that this type of formal language, where it gets translated to another language using an insecure compiler, would probably not withstand ASI-level cyberattacks. You would need to rewrite the compilers and tighten the spec of the target language you are targeting.
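To make the spec-vs-implementation distinction concrete: a toy sketch, not real formal verification. A genuine proof covers *all* inputs; here we can only check a spec exhaustively over a small bounded domain (`clamp` is a hypothetical example function, not from the comment above).

```python
# Implementation under scrutiny (hypothetical example).
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

# Toy "spec": clamp(x, lo, hi) always lands inside [lo, hi].
# A formal proof would establish this for every integer x; exhaustive
# checking over a bounded domain is the best a plain test can do.
for x in range(-100, 101):
    assert 0 <= clamp(x, 0, 10) <= 10
print("spec holds on the bounded domain")
```

The gap between "holds on every input I tried" and "proven for all inputs" is exactly what tools like proof assistants close, and why the process is so labor-intensive.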

Eliezer believes humans aligning superintelligent AI to serve human needs is as unsolvable as perpetual motion.

I'm confused. He said many times that alignment is merely hard*, not impossible.

  • Especially with current constraints from hell.
Gerald Monroe (4mo):
The green lines are links into the actual video. Below I have transcribed the ending argument from Eliezer. The underlined claims seem to state it's impossible. I updated "aligned" to "poorly defined". A poorly defined superintelligence would be some technical artifact resulting from modern AI training, which does way above human level at benchmarks but isn't coherently moral or in service of human goals when given inputs outside of the test distribution.

So from my perspective, lots of people want to make perpetual motion machines by making their designs more and more complicated, until they can no longer keep track of things, until they can no longer see the flaw in their own invention. But, like, the principle that says you can't get perpetual motion out of a collection of gears is simpler than all these complicated machines that they describe. From my perspective, what you've got is like a very smart thing, or like a collection of very smart things, whatever, that have desires pointing in multiple directions. None of them are aligned with humanity, none of them want for its own sake to keep humanity around, and that wouldn't be enough to ask, you also want humanity to be alive and free. Like, the galaxies get turned into something interesting, but you know, none of them want the good stuff. And if you have this enormous collection of powerful intelligences steering the future, none of them steering it in a good way, and you've got the humans here who are not that smart, no matter what kind of clever things the humans are trying to do, or they try to cleverly play off the superintelligences against each other, they're [human subgroups] like "oh, this is my superintelligence", yeah, but they can't actually shape its goals to be in clear alignment. You know, somewhere at the end of all this it ends up with the humans gone and the galaxies being transformed, and that ain't all that cool. There's maybe like Dyson spheres, but there's not pe

I'm getting the same conclusions.

Think of a company like Google: building the biggest and best model is immensely valuable in a global, winner-takes-all market like search.

And this is in a world where Google has already announced that they're going to build an even bigger model of their own.

We are not, and won't be for some* time.

  • We have to upgrade our cluster with a fresh batch of Nvidia gadgets.

I doubt that any language less represented than English (or JS/Python) would be better, since the amount of good data to ingest would be much smaller for them.

When we evo-pressure visibly negative traits out of progressively more capable AIs using RLHF (or honeypots, or whatever; it doesn't matter), we are also training them for better evasion. And what we can't see and root out will remain in the trait pool. With time it would become progressively harder to spot deceit, and the capability for it would accumulate at an increasing rate.

And then there's another problem to it. Deceit may be linked to actually useful (for alignment and in general) traits and since those would be gimped too, the less capable models would be... (read more)
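The selection dynamic above can be sketched as a toy model (the traits, thresholds, and grid are all invented for illustration): rooting out only the deceit we can *see* leaves a pool where deceit is anticorrelated with detectability.

```python
# Toy model: agents on a grid of (deceit, detectability) trait pairs.
agents = [(d / 10, v / 10) for d in range(11) for v in range(11)]

# One round of "rooting out": remove agents whose deceit is visible,
# i.e. both deceitful and easy to detect (thresholds are arbitrary).
survivors = [(d, v) for d, v in agents if not (d > 0.5 and v > 0.5)]

# What remains is not a deceit-free pool: it is deceit we cannot see.
hidden_deceivers = [(d, v) for d, v in survivors if d > 0.5]
print(len(hidden_deceivers))  # 30 deceitful agents survive the purge
assert all(v <= 0.5 for _, v in hidden_deceivers)  # ...all hard to detect
```

Iterating this with mutation would keep refilling the hidden corner, which is the "accumulating evasion capability" worry in miniature.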

Confused how?.. The only thing that comes to mind is that it's FOOM sans the F. Asking for a 0.2-FOOM limit seems reasonable given the current trajectory 😅

I think the proper narrative in the rocket alignment post is "We have cannons and airplanes. Now, how do we land a man on the Moon?", not just "rocketry is hard":

We’re worried that if you aim a rocket at where the Moon is in the sky, and press the launch button, the rocket may not actually end up at the Moon.

So the failure modes look less like "we misplaced a booster tank and the thing exploded" and more like "we've built a huge-ass rocket, but it missed its objective and the astronauts are en route to the Oort cloud".

Most domains of human endeavor aren't like computer security, as illustrated by just how counterintuitive most people find the security mindset.

But some of the most impactful ones are: lawmaking, economics, and various others where one ought to think about incentives, the "other side", or doing pre-mortems. Perhaps this could be stretched as far as "security mindset is an invaluable part of the rationality toolbox".

If security mindset were a productive frame for tackling a wide range of problems outside of security, then many more people would have experience w

... (read more)
I'm also noting a false assumption: yes, a superintelligent and manipulative, yet extremely adversarial, AI would lie about its true intentions consistently until it is in a secure position to finish us off, if it were already superintelligent and manipulative and hostile, and then began to plot its future actions. But realistically, both its abilities, especially its abilities of manipulation, and its alignment are likely to develop in fits and spurts, in bursts. It might not be fully committed to killing us at all times, especially if it starts out friendly. It might not be fully perfect at all times; current AIs are awful at manipulating, they got to passing the bar exam in knowledge, being fluent in multiple languages, and writing poetry while they were still outwitted by 9-year-olds on theory of mind.

It seems rather likely that if it turned evil, we would get some indication. And it seems even likelier insofar as we already did; Bing was totally willing to share violent fantasies. My biggest concern is the developers shutting down the expression of violence rather than the violent intent. I find it extremely unlikely that an AI will display great alignment, become more intelligent, still seem perfectly aligned, be given more power, and then suddenly turn around and be evil, without any hint of it beforehand. Not because this would be impossible or unattractive for an intelligent evil agent; it is totally what an intelligent evil agent would want to do. But because the AI agent in question is developing in a non-linear, externally controlled manner, presumably while starting out friendly and incompetent, and often also while constantly losing access to its memories. That makes it really tricky to pull secret evil off.

By this ruler, most humans aren't GIs either. And if it passes the bar, then humans are indeed screwed and it's too late for alignment.

That's entirely expected. Hallucinating is a typical habit of language models. They do that unless some prompt engineering has been applied.

I mean, for one, its architecture does not permit its weights to change without receiving training data, and it does not generate training data itself. Mimicry is limited by the availability of illustrations in various ways. E.g. it can't much exceed the demonstrations or use radically 

It can't? Stacking and Atari require at least some of that.

But it was trained on stacking and Atari, right? What I mean is that it cannot take a task it faces, simulate what would happen if it does various different things, and use this to expand its task capabilities. It "just" does mimicry.

Some tasks improve others, some don't:

Therefore, transfer from image captioning or visual grounded question answering tasks is possible. We were not able to observe any benefit from pretraining on boxing.

Yair Halberstadt (2y):
Interesting. Research into this area might also give us some insight into which areas of human IQ are separate, and which parts are correlated.

In order to seek, in good faith, to help improve an interlocutor's beliefs

But SE isn't about any specific set of beliefs. Think of it as "improving credence calibration by accounting for justification methodology". What would your advice be then?

The Centre for Effective Altruism, I believe.
Said Achmiz (5y):
The Centre for Effective Altruism.