Roko

Wikitag Contributions

Signaling · 16y · (+5/-6)
Signaling · 16y · (+48/-30)

Posts

7 · Roko's Shortform · 5y · 35 comments

Comments (sorted by newest)
Roko's Shortform
Roko · 1y · 2 · -16

The Contrarian 'AI Alignment' Agenda

Overall Thesis: technical alignment is generally irrelevant to outcomes; almost everyone in the AI Alignment field incorrectly assumes otherwise, and so works on technical alignment of LLMs

(1) aligned superintelligence is provably logically realizable [already proved]

(2) aligned superintelligence is not just logically but also physically realizable [TBD]

(3) ML interpretability/mechanistic interpretability cannot possibly be logically necessary for aligned superintelligence [TBD]

(4) ML interpretability/mechanistic interpretability cannot possibly be logically sufficient for aligned superintelligence [TBD]

(5) given a certain minimal intelligence, and minimal emulation ability of humans by AI (e.g. it understands common-sense morality and cause and effect) and of AI by humans (humans can do multiplication etc.), the internal details of AI models cannot possibly make a difference to the set of realizable good outcomes, though they can make a difference to the ease/efficiency of realizing them [TBD]

(6) given near-perfect or perfect technical alignment (= the AI will do what its creators ask of it, with correct intent), awful outcomes are a Nash equilibrium for rational agents [TBD] (see the sketch after this list)

(7) small or even large alignment deviations make no fundamental difference to outcomes: the boundary between good and bad is determined by game theory, mechanism design and initial conditions, plus only a satisficing condition on alignment fidelity, which is below the level of alignment of current humans (and AIs) [TBD]

(8) There is no such thing as superintelligence anyway because intelligence factors into many specific expert systems rather than one all-encompassing general purpose thinker. No human has a job as a “thinker” - we are all quite specialized. Thus, it doesn’t make sense to talk about “aligning superintelligence”, but rather about “aligning civilization” (or some other entity which has the ability to control outcomes) [TBD]
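A minimal sketch of claim (6), as a hedged illustration (the strategy names and payoff numbers here are hypothetical, not from the original): two principals each command a perfectly obedient AI, yet mutual racing is the unique Nash equilibrium even though both prefer mutual restraint.

```python
# Sketch of claim (6): perfect technical alignment does not prevent bad
# equilibria. Two principals each command a perfectly obedient AI and choose
# "restrain" or "race". Payoffs are hypothetical, prisoner's-dilemma-shaped.

from itertools import product

STRATEGIES = ("restrain", "race")

# PAYOFF[(row, col)] = (row player's payoff, column player's payoff)
PAYOFF = {
    ("restrain", "restrain"): (3, 3),   # good outcome for both
    ("restrain", "race"):     (0, 4),   # racer grabs everything
    ("race",     "restrain"): (4, 0),
    ("race",     "race"):     (1, 1),   # awful outcome for both
}

def is_nash(row, col):
    """A profile is a Nash equilibrium if neither player gains by deviating."""
    r_pay, c_pay = PAYOFF[(row, col)]
    row_ok = all(PAYOFF[(alt, col)][0] <= r_pay for alt in STRATEGIES)
    col_ok = all(PAYOFF[(row, alt)][1] <= c_pay for alt in STRATEGIES)
    return row_ok and col_ok

equilibria = [p for p in product(STRATEGIES, repeat=2) if is_nash(*p)]
print(equilibria)  # [('race', 'race')] -- the awful outcome is the only equilibrium
```

Both AIs here do exactly what their principals intend; the bad outcome comes from the payoff structure, not from any alignment failure.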

Brute Force Manufactured Consensus is Hiding the Crime of the Century
Roko · 2y · 100

Since a few people have mentioned the Miller/Rootclaim debate:

My hourly rate is $200. I will accept a donation of $5000 to sit down, watch the entire Miller/Rootclaim debate (17 hours of video content plus various supporting materials), and write a 2000-word piece describing how I updated on it and why.

Anyone can feel free to message me if they want to go ahead and fund this.

Far-UVC Light Update: No, LEDs are not around the corner (tweetstorm)
Roko · 3y · 80 (edited)

Whilst the LEDs are not around the corner, I think the Kr-Cl excimer lamps might already be good enough.

When we wrote the original post on this, it was not clear how quickly COVID was spreading through the air, but I think it is now clear that COVID can hang around for a long time (on the order of minutes or hours rather than seconds) and still infect people.

It seems that a power density of 0.25 W/m^2 would probably be enough to sterilize air in 1-2 minutes, meaning that a 5m x 8m room would need a 10W source. Assuming 2% efficiency, that 10W source needs 500W of electrical input, which is certainly possible; in the days of incandescent lighting you would have had a few 100W bulbs anyway.

EDIT: Having looked into this a bit more, it seems that right now the low efficiency of excimer lamps is not a binding constraint because the legally allowed far-UVC exposure is so low.

"TLV exposure limit for 222 nm (23 mJ cm^−2)"

23 mJ per cm^2 per day is only about 0.003 W/m^2, so you need very little power before you hit the legal limit.

Source
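As a sanity check on the arithmetic above, here is a short sketch; the room size, sterilizing flux, and efficiency figures are the ones assumed in this comment, not measured values.

```python
# Back-of-the-envelope check of the far-UVC numbers above.

ROOM_AREA_M2 = 5 * 8            # 5 m x 8 m room
STERILIZING_FLUX = 0.25         # W/m^2, assumed enough to sterilize in 1-2 min
WALL_PLUG_EFFICIENCY = 0.02     # ~2% assumed for Kr-Cl excimer lamps

uvc_watts = STERILIZING_FLUX * ROOM_AREA_M2
electrical_watts = uvc_watts / WALL_PLUG_EFFICIENCY
print(f"UV-C output needed: {uvc_watts:.0f} W")         # 10 W
print(f"Electrical input:   {electrical_watts:.0f} W")  # 500 W

# TLV exposure limit of 23 mJ/cm^2 per day, expressed as a continuous flux.
TLV_J_PER_CM2_PER_DAY = 23e-3
SECONDS_PER_DAY = 24 * 60 * 60
CM2_PER_M2 = 1e4

tlv_flux = TLV_J_PER_CM2_PER_DAY / SECONDS_PER_DAY * CM2_PER_M2
print(f"Legal limit as flux: {tlv_flux:.4f} W/m^2")     # ~0.0027 W/m^2
```

The legal limit (~0.003 W/m^2) is roughly two orders of magnitude below the assumed sterilizing flux, which is why lamp efficiency is not currently the binding constraint.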

The Problem
Roko · 7d · 20

> I'm not sure Roko is arguing that it's impossible for capitalist structures and reforms to make a lot of people worse off

Exactly. It's possible and indeed happens frequently.

The Problem
Roko · 15d · 30

I think you can have various arrangements that are either of those or a combination of the two.

Even if the Guardian Angels hate their principal and want to harm them, it may be that multiple such Guardian Angels all monitor each other: the first one to make a move against the principal is reported (with proof) to the principal by at least some of the others, who are rewarded for reporting; those who provably failed to report are punished; and the offender is deleted.

The misaligned agents can just be stuck in their own version of Bostrom's self-reinforcing hell.

As long as their coordination cost is high, you are safe.

It can also be a combination of many things that cause agents to in fact act aligned with their principals. A minimal sketch of the monitoring mechanism follows.
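In this sketch, all payoff numbers and the detection probability are hypothetical illustrations: because reporting is rewarded and provable silence is punished, every observer reports, so moving first against the principal has strongly negative expected value once a few peers are watching.

```python
# Sketch of the mutual-monitoring mechanism described above. All numbers
# (payoffs, detection probability) are hypothetical illustrations.

import random

N_ANGELS = 5
P_OBSERVE = 0.9          # assumed chance each peer observes a defection
REWARD_REPORT = 1.0      # paid to each angel that reports with proof
PENALTY_DELETED = -10.0  # the defector is deleted
GAIN_DEFECT = 3.0        # what the defector gains if nobody sees it

def round_payoffs(defector: int, rng: random.Random) -> list[float]:
    """Payoffs for one round in which `defector` moves against the principal.

    Since reporting is rewarded and provable silence is punished, every
    observer reports, so the defector escapes only if nobody observes.
    """
    payoffs = [0.0] * N_ANGELS
    observers = [i for i in range(N_ANGELS)
                 if i != defector and rng.random() < P_OBSERVE]
    if observers:
        payoffs[defector] = PENALTY_DELETED
        for i in observers:
            payoffs[i] = REWARD_REPORT
    else:
        payoffs[defector] = GAIN_DEFECT
    return payoffs

rng = random.Random(0)
trials = 10_000
avg = sum(round_payoffs(0, rng)[0] for _ in range(trials)) / trials
print(f"Expected payoff of moving first against the principal: {avg:.2f}")
# ~ -10.0: with 4 peers each watching with p = 0.9, escape probability is 1e-4
```

The defector's expected payoff only turns positive if the peers can coordinate to stay silent, which is why high coordination cost among the misaligned agents is doing the safety work here.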

The Problem
Roko · 16d · 20

> It sure seems to me that there is a clear demarcation between AIs and humans, such that the AIs would be able to successfully collude against humans

I think this just misunderstands how coordination works.

The game theory of who is allowed to coordinate with whom against whom is not simple.

White Germans fought against white Englishmen who were barely different, while each tried to ally with distantly related foreigners.

Ultimately, what we are starting to see is that AI risk isn't about math or chips or interpretability; it's actually just politics.

The Problem
Roko · 19d · 20

Yes, but yet again, it was because Africans were not considered part of the system of property rights. They were owned, not owners.

The Problem
Roko · 19d · 40

It's mostly not because of altruism; it's because we have a property rights system, rule of law, etc.

And you can have degrees of cooperation between heterogeneous agents. Full atomization and the Borg are not the only two options.

The Problem
Roko · 19d · 30

I should've said "most".

That's just run-of-the-mill history though.

The Problem
Roko · 19d · 20

> But I'm obviously talking about a very different kind of system which is more Borg-like and less market-like.

But then you have to justify why a Borg-like monoculture would actually be competitive, as opposed to an ecosystem of many different kinds of entities and the many different game-theoretic alliances/teams that these diverse entities belong to.

Posts

-8 · Turing-Test-Passing AI implies Aligned AI · 8mo · 29 comments
13 · Is AI alignment a purely functional property? [Q] · 9mo · 8 comments
33 · What is MIRI currently doing? [Q] · 9mo · 14 comments
8 · The Dissolution of AI Safety · 9mo · 44 comments
7 · What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented? [Q] · 10mo · 16 comments
9 · The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 11mo · 18 comments
7 · A Heuristic Proof of Practical Aligned Superintelligence · 11mo · 6 comments
0 · A Nonconstructive Existence Proof of Aligned Superintelligence · 1y · 80 comments
66 · Ice: The Penultimate Frontier · 1y · 56 comments
1 · Less Wrong automated systems are inadvertently Censoring me · 2y · 52 comments