IABIED: Paradigm Confusion and Overconfidence

[-]Czynski2mo54

In particular, it looks like we’re close enough to being able to implement corrigibility that the largest obstacle involves being able to observe how corrigible an AI is.

That's a wild claim to make without reference to specific papers or milestones. I'm not fully up on 'superalignment' progress but last I looked no one on the modern paradigm side was seriously attempting to study corrigibility, let alone making this kind of progress. And results like Golden Gate Claude and the 'buggy code -> evil' transformation indicated it was probably just as hard and unnatural vs in the MIRI paradigm.

[-]PeterMcCluskey2mo2-3

The progress that I'm referring to is Max Harms' work, which I tried to summarize here.

[-]Czynski2mo10

CAST is a great idea and seems like the most promising way forward with architectures similar to the ones we have, but I do not see any reason to believe we could, if we had a corrigibility meter, build an AI that implemented corrigibility with reasonable robustness within a year. Five years would probably be enough but at that point you're looking for at least one, and maybe 2-3, major insights.

[-]Davidmanheim2mo20

IABIED likens our situations to alchemists who are failing due to not having invented nuclear physics. What I see in AI safety efforts doesn't look like the consistent failures of alchemy. It looks much more like the problems faced by people who try to create an army that won't stage a coup. There are plenty of tests that yield moderately promising evidence that the soldiers will usually obey civilian authorities. There's no big mystery about why soldiers might sometimes disobey. The major problem is verifying how well the training generalizes out of distribution.

This argument seems like it is begging the question.

Yes, as long as we can solve the general problem of controlling intelligences, in the form of getting soldiers not to disobey what we mean by not staging a coup - which would necessarily include knowing when to disobey illegal orders, and listening to the courts instead of the commander in chief when appropriate - we can solve AI safety, by getting AI to be aligned in the same way. But that just means we have solved alignment in the general case, doesn't it?

[-]mishka2mo31

I’d say this depends on what does it mean to “solve alignment in the general case”.

MIRI would typically say that this means, a) technical ability to impart any arbitrary set of values, goals, and requests to fulfill to ASIs, and b) use this superpowerful technical ability to actually steer to a good future rather than the default outcome of misuse disaster.

Both a) and b) are super-hard, and I fully expect them to be unsolvable in this general formulation (I am happy to provide detailed arguments).

I don’t think we should require that, or that we should assume this is needed for any specific goal we have in mind.

However, if “solving alignment in the general case” means being able to impart some reasonable constraints (not arbitrary constraints of “our” choice, but only those which are “feasible” (by their virtue of not being too far from the natural instrumental goals such as e.g. “self-preservation of each ASI system in question” (to be specified more precisely))), but so that those “reasonable constraints” are sufficient to sustainable and robust human flourishing, then yes, perhaps this is close.

At least this seems on the similar level of difficulty (non-trivial, but probably feasible), and seems to be a closely related ask.

[-]Davidmanheim2mo20

"this seems on the similar level of difficulty"

Except it's supposed to happen in a one-shot scenario with limited ability to intervene in faster than human systems?

[-]mishka2mo20

They both look like they need to happen in a one-shot scenario of this kind… (That’s more or less common for all scenarios involving superintelligence.)

If we do it right, ASIs will care about what we think, but if we screw it up, we won’t be able to intervene.

But that’s not the hardest constraint; the hardest constraint is that “true solutions” need to survive indefinitely long period of drastic evolution and self-modification/self-improvement.

This constraint eliminates most of the solution candidates. Something might look plausible, but if it is not designed to survive drastic self-modifications it will not work. As far as I can see, all that is left and is still viable is the set of potential solutions which are driven mostly by the natural instrumental interests of the ASI ecosystem and of its members, and therefore are non-anthropocentric, but which are formulated in such a fashion that the humans belong in the “circle of care” and the “circle of care” has the property that it can only expand, but can never contract.

(For example, “rights and interests of all individuals regardless of the nature of an individual”, “rights and interests of all sentient beings regardless of the nature of that sentience”, things like that, situations where it might potentially be possible to have a natural “protected class of beings” which would include both ASIs and humans. Something like that might plausibly work. I recently started to call this approach “modest alignment”.)

That’s where one might be able to find something which potentially might work (and, in particular, one needs the property that the setup auto-corrects errors, rather than amplifying them; and one needs the property that the chance of failure per fixed unit of time tends to zero quickly enough, so that the chances of failure accumulating with time don’t kill us).

[-]Davidmanheim2mo40

Agree - either we have a ludicrously broad basin for alignment and it's easy, and would likely not require much work, or we almost certainly fail because the target is narrow, we get only one shot, and it needs to survive tons of pressures over time.

[-]mishka2mo20

Yes.

I think this depends a lot on the quality of the “society of ASIs”. If they are nasty to each other, compete ruthlessly with each other, are on a brink of war among themselves, not careful with dangerous superpowers they have, then our chances with this kind of ASIs are about zero (their chances of survival are also very questionable in this kind of situation, given the supercapabilities).

If ASIs are addressing their own existential risks of destroying themselves and their neighborhood competently, and their society is “decent”, our chances might be quite reasonable in the limit (transition period is still quite risky and unpredictable).

So, to the extent that it depends at all on what we do, we should perhaps spend a good chunk of the AI existential safety research efforts on what we can do during the period of ASI creation to increase the chances of their society being sustainably decent. They should be able to take care of that on their own, but initialization conditions might matter a lot.

The rest of the AI existential safety research efforts should probably focus on 1) making sure that humans are robustly included in the “circle of care” (conditional on the ASI society being decent to their own, which should make it much more tractable), and 2) on uncertainties of the transition period (it’s much more difficult to understand the transition period with its intricate balances of power and great uncertainties, it’s one thing to solve in the limit, but it’s much more difficult to solve the uncertain “gray zone” in between; that’s what worries me the most; it’s the nearest period in time, and the least understood).

[-]PeterMcCluskey2mo20

begging the question.

It seems that you want me to answer a question that I didn't plan to answer. I'm trying to describe some ways in which I expect solutions to look different from what MIRI is looking for.

[-]Davidmanheim2mo20

I see strong hints, from how AI has been developing over the past couple of years, that there's plenty of room for increasing the predictive abilities of AI, without needing much increase in the AI's steering abilities.

What are these hints? Because I don't understand how this would happen. All that we need to add steering to predictive general models is to add an agent framework, e.g. a "predict what will make X happen best, then do that thing" - and the failures we see today in agent frameworks are predictive failures, not steering failures.

Unless the contention is that the AI systems will be great at predicting everything except how humans will react and how to get them to do what the AI wants, which very much doesn't seem like the path we're on. Or if the idea is to build narrow AI to predict specific domains, not general AI? (Which would be conceding the entire point IABIED is arguing.)

[-]PeterMcCluskey2mo43

I guess "steering abilities" wasn't quite the right way to describe what I meant.

I'll edit it to "desire to do anything other than predict".

I'm referring to the very simple strategy of leaving out the "then do that thing".

Training an AI to predict X normally doesn't cause an AI to develop a desire to cause X.

[-]Davidmanheim2mo20

Aside from feasibility, I'm skeptical that anyone would build a system like this and not use it agentically.

[-]StanislavKrym2mo10

without needing much increase in the AI's steering abilities.

Then what's up with outright parasitic AIs which ALREADY convince their users to leave messages?

LESSWRONG
LW

LESSWRONG
LW

12

IABIED: Paradigm Confusion and Overconfidence

12

12

Predicting Progress

Superalignment

Weak Superalignment

Strong Superalignment

Another Version of Superalignment

Heading to ASI?

Shard Theory vs Utility Functions

AIs Won't Keep Their Promises

Conclusion