Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for quick takes by Adele Lopez. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

I was pretty taken aback by the article claiming that the KataGo AI apparently has something like a human-exploitable distorted concept of "liberties".

If we could somehow ask KataGo how it defined "liberties", I suspect that it would have been more readily clear that its concept was messed-up. But of course, a huge part of The Problem is that we have no idea what these neural nets are actually doing.

So I propose the following challenge: Make a hybrid KataGo/LLM AI that makes the same mistake and outputs text representing its reasoning in which the mistake is recognizable.

It would be funny if the Go part continued making the same mistake, and the LLM part just made up bullshit explanations.

Coherent Extrapolated Volition (CEV) is Eliezer's proposal of a potentially good thing to target with an aligned superintelligence.

When I look at it, CEV factors into an answer to three questions:

  1. Whose values count? [CEV answer: every human alive today counts equally]
  2. How should values be extrapolated? [CEV answer: Normative Extrapolated Volition]
  3. How should values be combined? [CEV answer, from what I understand, is to use something like Nick Bostrom's parliamentary model, along with an "anti-unilateral" protocol]

(Of course, the why of CEV is an answer to a more complicated set of questions.)

An obvious thought is that the parliamentary model part seems to be mostly solved by Critch's futarchy theorem. The scary thing about this is the prospect of people losing almost all of their voting power by making poor bets. But I think this can be solved by giving each person an equally powerful "guardian angel" AGI aligned with them specifically, and having those do the betting. That feels intuitively acceptable to me at least.
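To make the "losing voting power by making poor bets" worry concrete, here is a minimal sketch. It assumes, as I understand the result, that each party's weight in the combined policy gets multiplied by the probability that party assigned to what was actually observed. The function name `update_voting_weights` and the example numbers are hypothetical, chosen only for illustration.

```python
def update_voting_weights(weights, predicted_probs):
    """Reweight each party by the probability it assigned to the observed outcome.

    Assumption (not from the post): weights evolve by a Bayes-like rule,
    w_i <- w_i * P_i(observation), followed by renormalization.
    """
    raw = {name: weights[name] * predicted_probs[name] for name in weights}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Two parties start with equal voting power.
weights = {"alice": 0.5, "bob": 0.5}

# Five observations in a row that alice gave 90% to and bob gave 10% to.
for _ in range(5):
    weights = update_voting_weights(weights, {"alice": 0.9, "bob": 0.1})

print(weights)
```

After only a handful of bad bets, bob's share is a tiny fraction of what he started with, which is the failure mode the "guardian angel" proposal is meant to prevent: if every bettor is equally capable, nobody gets systematically out-predicted.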

The next thought concerns the "anti-unilateral" protocol (i.e. the protocol at the end of the "Selfish Bastards" section). It seems like it would be good if we could formalize the "anti-unilateral-selfishness" part of it and bake it into something like Critch's futarchy theorem, instead of running a complicated protocol.

  • Stealing Jaynes
    • Ability to stand alone (a la Grothendieck)
    • Mind Projection Fallacy
      • Maintain a careful distinction between ontology and epistemology
        • Lots of confusing theories are confusing because they mix these together in the same theory
        • In QM, Bohr is always talking on the epistemological level, and Einstein is always talking on the ontological level
      • Any probabilities are subjective probabilities
        • Don't make any unjustified assumptions: maximum entropy
      • Meta-knowledge is different from knowledge, but can be utilized to improve direct knowledge
        • probabilities
        • Subjective H theorem
    • Infinities are meaningless until you've specified the exact limiting process
    • If the same phenomenon seems to arise in two different ways, try to find a single concept encompassing both ways
    • Failures of a theory are hints of an unknown or unaccounted for principle
    • On effective understanding
      • Learning a sound process is more effective than learning lots of facts
        • Students should be taught a few examples deeply done in the correct way, instead of lots of examples hand-waved through
      • There's often much to be learned from the writings of those who saw far beyond their contemporaries
        • Common examples
          • Jeffreys
          • Gibbs
          • Laplace
      • Conceptual confusion impedes further progress
      • Don't let rigor get in the way of understanding
    • Toolkit

The Drama-Bomb hypothesis

Not even a month ago, Sam Altman predicted that we would live in a strange world where AIs are super-human at persuasion but still not particularly intelligent.

What would it look like when an AGI lab developed such an AI? People testing or playing with the AI might find themselves persuaded of semi-random things, or if sycophantic behavior persists, have their existing feelings and beliefs magnified into zealotry. However, this would (at this stage) not be done in a coordinated way, nor with a strategic goal in mind on the AI's part. The result would likely be chaotic, dramatic, and hard to explain.

Small differences of opinion might suddenly be magnified into seemingly insurmountable chasms, inspiring urgent and dramatic actions. Actions which would be hard to explain even to oneself later.

I don't think this is what happened [<1%] but I found it interesting and amusing to think about. This might even be a relatively better-off world, with frontier AGI orgs regularly getting mired in explosive and confusing drama, thus inhibiting research and motivating tougher regulation.

This could be largely addressed by first promoting a persuasion AI that does something similar to what Scott Alexander often does: Convince the reader of A, then of Not A, to teach them how difficult it actually is to process the evidence and evaluate an argument, to be less trusting of their impulses.

As Penn and Teller demonstrate the profanity of magic to inoculate their readers against illusion, we must create a persuasion AI that demonstrates the profanity of rhetoric to inoculate the reader against any persuasionist AI they may meet later on.

The Averted Famine

In 1898, William Crookes announced that there was an impending crisis which required urgent scientific attention. The problem was that crops deplete Nitrogen from the soil. This can be remedied by using fertilizers; however, he had calculated that existing sources of fertilizers (mainly imported from South America) could not keep up with expected population growth, leading to mass starvation, estimated to occur around 1930-1940. His proposal was that we could entirely circumvent the issue by finding a way to convert some of our mostly Nitrogen atmosphere into a form that plants could absorb.

About 10 years later, in 1909, Fritz Haber discovered such a process. Just a year later, Carl Bosch figured out how to industrialize the process. They both were awarded Nobel prizes for their achievement. Our current population levels are sustained by the Haber-Bosch process.

The problem with that is that the Nitrogen does not go back into the atmosphere. It goes into the oceans and the resulting problems have been called a stronger violation of planetary boundaries than CO2 pollution.

Re: Yudkowsky-Christiano-Ngo debate

Trying to reach toward a key point of disagreement.

Eliezer seems to have an intuition that intelligence will, by default, converge to becoming a coherent intelligence (i.e. one with a utility function and a sensible decision theory). He also seems to think that conditioned on a pivotal act being made, it's very likely that it was done by a coherent intelligence, and thus that it's worth spending most of our effort assuming it must be coherent.

Paul and Richard seem to have an intuition that since humans are pretty intelligent without being particularly coherent, it should be possible to make a superintelligence that is not trying to be very coherent, which could be guided toward performing a pivotal act.

Eliezer might respond that to the extent that any intelligence is capable of accomplishing anything, it's because it is (approximately) coherent over an important subdomain of the problem. I'll call this the "domain of coherence". Eliezer might say that a pivotal act requires having a domain of coherence over pretty much everything: encompassing dangerous domains such as people, self, and power structures. Corrigibility seems to interfere with coherence, which makes it very difficult to design anything corrigible over this domain without neutering it.

From the inside, it's easy to imagine having my intelligence vastly increased, but still being able and willing to incoherently follow deontological rules, such as Actually Stopping what I'm doing if a button is pressed. But I think I might be treating "intelligence" as a bit of a black box, like I could still feel pretty much the same. However, to the extent where I feel pretty much the same, I'm not actually thinking with the strategic depth necessary to perform a pivotal act. To properly imagine thinking with that much strategic depth, I need to imagine being able to see clearly through people and power structures. What feels like my willingness to respond to a shutdown button would elide into an attitude of "okay, well I just won't do anything that would make them need to stop me" and then into "oh, I see exactly under what conditions they would push the button, and I can easily adapt my actions to avoid making them push it", to the extent where I'm no longer being constrained by it meaningfully. From the outside view, this very much looks like me becoming coherent w.r.t the shutdown button, even if I'm still very much committed to responding incoherently in the (now extremely unlikely) event it is pushed.

And I think that Eliezer foresees pretty much any assumption of incoherence that we could bake in becoming suspiciously irrelevant in much the same way, for any general intelligence which could perform a pivotal act. So it's not safe to rely on any incoherence on the part of the AGI.

Sorry if I misconstrued anyone's views here!

Half-baked idea for low-impact AI:

As an example, imagine a board fixed to a wall at one end, with no other support structures (a cantilever). If you make it twice as wide, then it will be twice as stiff, but if you make it twice as thick, then it will be eight times as stiff. On the other hand, if you make it twice as long, it will be eight times more compliant.

In a similar way, different action parameters will have scaling exponents (or more generally, functions). So one way to decrease the risk of high-impact actions would be to make sure that the scaling exponent is bounded above by a certain amount.
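Here is a rough sketch of what "checking the scaling exponent" could look like for the board example. It assumes the standard textbook formula for the tip stiffness of an end-loaded rectangular cantilever, k = E·w·t³ / (4·L³), which is not stated above but reproduces the 2x / 8x / 8x numbers. The names `scaling_exponent` and `within_impact_bound` are hypothetical, and the bound itself is just an illustrative threshold.

```python
import math


def cantilever_stiffness(width, thickness, length, youngs_modulus=1.0):
    """Tip stiffness of an end-loaded rectangular cantilever: E*w*t^3 / (4*L^3)."""
    return youngs_modulus * width * thickness**3 / (4.0 * length**3)


def scaling_exponent(f, params, name, factor=2.0):
    """Estimate d(log f)/d(log param) by scaling a single parameter by `factor`."""
    scaled = dict(params, **{name: params[name] * factor})
    return math.log(f(**scaled) / f(**params)) / math.log(factor)


def within_impact_bound(f, params, max_exponent):
    """Hypothetical check: no parameter's exponent exceeds max_exponent in magnitude."""
    return all(
        abs(scaling_exponent(f, params, name)) <= max_exponent + 1e-9
        for name in params
    )


board = {"width": 0.2, "thickness": 0.02, "length": 1.0}
exponents = {
    name: round(scaling_exponent(cantilever_stiffness, board, name), 6)
    for name in board
}
print(exponents)
```

Width comes out with exponent 1, thickness with 3, and length with -3, so a bound of 2 would flag thickness and length as the "high-impact" knobs. Note that this only works because the stiffness function is handed over honestly, which is exactly the part that is hard for a real agent's world model.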

Anyway, to even do this, you still need to make sure the agent's model is honestly evaluating the scaling exponent. And you would still need to define this stuff a lot more rigorously. I think this idea is more useful in the case where you already have an AI with high-level corrigible intent and want to give it a general "common sense" about the kinds of experiments it might think to try.

So it's probably not that useful, but I wanted to throw it out there.

[I may try to flesh this out into a full-fledged post, but for now the idea is only partially baked. If you see a hole in the argument, please poke at it! Also I wouldn't be very surprised if someone has made this point already, but I don't remember seeing such. ]

Dissolving the paradox of useful noise

A perfect bayesian doesn't need randomization.

Yet in practice, randomization seems to be quite useful.

How to resolve this seeming contradiction?

I think the key is that a perfect bayesian (Omega) is logically omniscient. Omega can always fully update on all of the information at hand. There's simply nothing to be gained by adding noise.

A bounded agent will have difficulty keeping up. As with Omega, human strategies are born from an optimization process. This works well to the extent that the optimization process is well-suited to the task at hand. To Omega, it will be obvious whether the optimization process is actually optimizing for the right thing. But to us humans, it is not so obvious. Think of how many plans fail after contact with reality! A failure of this kind may look like a carefully executed model with some obvious-in-retrospect confounders which were not accounted for. For a bounded agent, there appears to be an inherent difference between seeing the flaw once pointed out, and being able to notice the flaw in the first place.

If we are modeling our problem well, then we can beat randomness. That's why we have modeling abilities in the first place. But if we are simply wrong in a fundamental way that hasn't occurred to us, we will be worse than random. It is in such situations that randomization is, in fact, helpful.

This is why the P vs BPP difference matters. P and BPP can solve the same problems equally well, from the logically omniscient perspective (assuming, as is widely conjectured but not proven, that P = BPP). But to a bounded agent, the difference does matter, and to the extent to which a more efficient BPP algorithm than the P algorithm is known, the bounded agent can win by using randomization. This is fully compatible with the fact that to Omega, P and BPP are equally powerful.

As Jaynes said:

It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought.

There's no contradiction because requiring more thought is costly to a bounded agent.
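One standard illustration of this trade-off (my example, not something from the sources above) is Freivalds' algorithm for checking whether A·B = C. The randomized check costs O(n²) per round, and a wrong C survives a round with probability at most 1/2, whereas the obvious deterministic check just recomputes the product in O(n³). A minimal sketch in plain Python, with hypothetical function names:

```python
import random


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def freivalds(a, b, c, rounds=20, rng=random):
    """Randomized check that a @ b == c.

    Never rejects a correct product; accepts a wrong one with
    probability at most 2**-rounds.
    """
    n_cols = len(c[0])
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(n_cols)]
        # Compare A(Br) with Cr: three matrix-vector products, O(n^2) each.
        if mat_vec(a, mat_vec(b, r)) != mat_vec(c, r):
            return False
    return True


def deterministic_check(a, b, c):
    """The 'more thought' route: recompute the full product, O(n^3)."""
    inner, cols = len(b), len(b[0])
    product = [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]
    return product == c
```

The deterministic version is never wrong, so in Jaynes's sense it "delivers better performance", but the randomized one is what a bounded agent with a large matrix and a deadline would actually run.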


It may be instructive to look into computability theory. I believe (although I haven't seen this proven) that you can get Halting-problem-style contradictions if you have multiple perfect-Bayesian agents modelling each other[1].

Many of these contradictions are (partially) alleviated if agents have access to private random oracles.


If a system can express a perfect agent that will do X if and only if it has a  chance of doing X, the system is self-contradictory[2].

If a symmetric system can express two identical perfect agents that will each do X if and only if the other agent does not do X, the system is self-contradictory[3].

  1. ^

    Actually, even a single perfect-Bayesian agent modelling itself may be sufficient...

  2. ^

    This is an example where private random oracles partially alleviate the issue, though do not make it go away. Without a random oracle the agent is correct 0% of the time regardless of which choice it makes. With a random oracle the agent can roll a d100[4] and do X unless the result is 1, and be correct 99% of the time.

  3. ^

    This is an example where private random oracles help. Both agents query their random oracle for a real-number result[5] and exchange the value with the other agent. The agent that gets the higher[6] number chooses X, the other agent chooses ~X.

  4. ^

    Not literally. As in "query the random oracle for a random choice of 100 possibilities".

  5. ^

    Alternatively you can do it with coinflips repeated until the agents get different results from each other[7], although this may take an unbounded amount of time.

  6. ^

    The probability that they get the same result is zero.

  7. ^

    Again, not literally. As in "query the random oracle for a single random bit".
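A toy sketch of the coinflip version of the symmetry-breaking protocol from footnotes 3 and 5. This uses `random.Random` instances as stand-ins for the private random oracles, which is an assumption for illustration only; the function name and the `max_rounds` cutoff are hypothetical.

```python
import random


def break_symmetry(oracle_a, oracle_b, max_rounds=1000):
    """Two identical agents exchange private random bits until they differ.

    Returns (a_does_x, b_does_x, rounds_used). The agent holding the
    higher bit does X and the other does not.
    """
    for rounds_used in range(1, max_rounds + 1):
        bit_a = oracle_a.getrandbits(1)
        bit_b = oracle_b.getrandbits(1)
        if bit_a != bit_b:
            return bit_a > bit_b, bit_b > bit_a, rounds_used
    # Unbounded in principle; the cutoff only exists so the sketch terminates.
    raise RuntimeError("no tie-break within max_rounds")


# Independent oracles: exactly one of the two agents ends up doing X.
print(break_symmetry(random.Random(1), random.Random(2)))
```

If both agents are handed the same seed, they are back to being identical deterministic agents and the loop never resolves, which is the original contradiction showing up as non-termination.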

[Epistemic status: very speculative]

One ray of hope that I've seen discussed is that we may be able to do some sort of acausal trade with even an unaligned AGI, such that it will spare us (e.g. it would give a humanity-aligned AGI control of a few stars, in exchange for us giving it control of several stars in the worlds we win).

I think Eliezer is right that this wouldn't work.

But I think there are possible trades which don't have this problem. Consider the scenario in which we Win, with an aligned AGI taking control of our future light-cone. Assuming the Grabby aliens hypothesis is true, we will eventually run into other civilizations, which will either have Won themselves, or are AGIs who ate their mother civilizations. I think Humanity will be very sad at the loss of the civilizations who didn't make it because they failed at the alignment problem. We might even be willing to give up several star systems to an AGI who kept its mother civilization intact on a single star system. This trade wouldn't have the issue Eliezer brought up, since it doesn't require us to model such an AGI correctly in advance, only that that AGI was able to model Humanity well enough to know it would want this and would honor the implicit trade.

So symmetrically, we might hope that there are alien civilizations that both Win, and would value being able to meet alien civilizations strongly enough. In such a scenario, "dignity points" are especially aptly named: think of how much less embarrassing it would be to have gotten a little further at solving alignment when the aliens ask us why we failed so badly.

Privacy as a component of AI alignment

[realized this is basically just a behaviorist genie, but posting it in case someone finds it useful]

What makes something manipulative? If I do something with the intent of getting you to do something, is that manipulative? A simple request seems fine, but if I have a complete model of your mind, and use it to phrase things so you do exactly what I want, that seems to have crossed an important line.

The idea is that using a model of a person that is *too* detailed is a violation of human values. In particular, it violates the value of autonomy, since your actions can now be controlled by someone using this model. And I believe that this is a significant part of what we are trying to protect when we invoke the colloquial value of privacy.

In ordinary situations, people can control how much privacy they have relative to another entity by limiting their contact with them to certain situations. But with an AGI, a person may lose a very large amount of privacy from seemingly innocuous interactions (we're already seeing the start of this with "big data" companies improving their advertising effectiveness by using information that doesn't seem that significant to us). Even worse, an AGI may be able to break the privacy of everyone (or a very large class of people) by using inferences based on just a few people (leveraging perhaps knowledge of the human connectome, hypnosis, etc...).

If we could reliably point to specific models an AI is using, and have it honestly share its model structure with us, we could potentially limit the strength of its model of human minds. Perhaps even have it use a hardcoded model limited to knowledge of the physical conditions required to keep it healthy. This would mitigate issues such as deliberate deception or mindcrime.

We could also potentially allow it to use more detailed models in specific cases, for example, we could let it use a detailed mind model to figure out what is causing depression in a specific case, but it would have to use the limited model in any other contexts or for any planning aspects of it. Not sure if that example would work, but I think that there are potentially safe ways to have it use context-limited mind models.

I question the claim that humans inherently need privacy from their loving gods. A lot of Christians seem happy enough without it, and I've heard most forager societies have a lot less privacy than ours, heck, most rural villages have a lot less privacy than most of us would be used to (because everyone knows you and talks about you).

The intensive, probably unnatural levels of privacy we're used to in our nucleated families, our cities, our internet, might not really lead to a general increase in wellbeing overall, and seems implicated in many pathologies of isolation and coordination problems.

most rural villages have a lot less privacy than most of us would be used to (because everyone knows you and talks about you).

A lot of people who have moved to cities from such places seem to mention this as exactly the reason why they wanted out.

That said, this is often because the others are judgmental etc., which wouldn't need to be the case with an AGI.

(biased sample though?)

Yeah, I think if the village had truly deeply understood them they would not want to leave it. The problem is the part where they're not really able to understand.

It seems that privacy potentially could "tame" a not-quite-corrigible AI. With a full model, the AGI might receive a request, deduce that activating a certain set of neurons strongly would be the most robust way to make you feel the request was fulfilled, and then design an electrode set-up to accomplish that. Whereas the same AI with a weak model wouldn't be able to think of anything like that, and might resort to fulfilling the request in a more "normal" way. This doesn't seem that great, but it does seem to me like this is actually part of what makes humans relatively corrigible.

Part of it seems like a matter of alignment. It seems like there's a difference between 

  • Someone getting someone else to do something they wouldn't normally do, especially under false pretenses (or as part of a deal and not keeping up the other side)


  • Someone choosing to go to an oracle AI (or doctor) and saying "How do I beat this addiction that's ruining my life*?"

*There are some scary stories about what people are willing to do to try to solve that problem, including brain surgery.

Yeah, I also see "manipulation" in the bad sense of the word as "making me do X without me knowing that I am pushed towards X". (Or, in more coercive situations, with me knowing, disagreeing with the goal, but being unable to do anything about it.)

Teaching people, coaching them, curing their addictions, etc., as long as this is explicitly what they wanted (without any hidden extras), it is a "manipulation" in the technical sense of the word, but it is not evil.

naïve musing about waluigis

it seems like there's a sense in which luigis are simpler than waluigis

a luigi selected for a specific task/personality doesn't need to have all the parts of the LLM that are emulating all the waluigi behaviors

so there might be a relatively easy way to remove waluigis by penalizing/removing everything not needed to generate luigi's responses, as well as anything that is used more by waluigis than luigis

of course, this appearing to work comes nowhere near close to giving confidence that the waluigis are actually gone, but it would be promising if it did appear to work, even under adversarial pressure from jailbreakers

Elitzur-Vaidman AGI testing

One thing that makes AI alignment super hard is that we only get one shot.

However, it's potentially possible to get around this (though probably still very difficult).

The Elitzur-Vaidman bomb tester is a protocol (using quantum weirdness) by which a bomb may be tested, with arbitrarily little risk. Its interest comes from the fact that it works even when the only way to test the bomb is to try detonating it. It doesn't matter how the bomb works, as long as we can set things up so that it will allow/block a photon based on whether the bomb is live/dead. I won't explain the details here, but you can roughly think of it as a way of blowing up a bomb in one Many-Worlds branch, but learning the result on other branches via quantum entanglement.

If the "bomb" is an AGI program, and it is live if it's an unaligned yet functional superintelligence, then this provides a possible way to test the AGI without risking our entire future lightcone. This is still quite difficult, because unlike a bomb, a superintelligence will, by default, be motivated to allow/block the photon so that it looks like a dud. So we would need to figure out how to robustly get an honest signal from such an experiment, which still seems quite hard. But perhaps it's easier than solving the full alignment problem before the first shot.
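For anyone who wants to see the numbers, here is a minimal sketch of the basic single-photon bomb tester as a Mach-Zehnder interferometer. It only models an ordinary bomb with a photon trigger, not anything about AGI, and the function names and arm labels are my own assumptions. The beam splitter convention (a factor of i on reflection) is one standard choice.

```python
import math

S = 1 / math.sqrt(2)


def beam_splitter(state):
    """50/50 beam splitter acting on the amplitudes (arm0, arm1)."""
    a0, a1 = state
    return (S * (a0 + 1j * a1), S * (1j * a0 + a1))


def bomb_test_probabilities(bomb_is_live):
    """Outcome probabilities for one photon sent through the interferometer."""
    arm0, arm1 = beam_splitter((1 + 0j, 0j))
    p_explode = 0.0
    if bomb_is_live:
        # A live bomb in arm1 acts as a measurement: it absorbs the photon
        # (and explodes) with probability |arm1|^2.
        p_explode = abs(arm1) ** 2
        # Surviving branch: the photon went via arm0 (kept unnormalized so
        # the remaining probabilities are already unconditional).
        arm1 = 0j
    out0, out1 = beam_splitter((arm0, arm1))
    return {
        "explode": p_explode,
        "dark_detector": abs(out0) ** 2,    # never fires for a dud
        "bright_detector": abs(out1) ** 2,  # always fires for a dud
    }


print(bomb_test_probabilities(bomb_is_live=False))
print(bomb_test_probabilities(bomb_is_live=True))
```

With a dud, interference sends every photon to the bright detector. With a live bomb, half the runs explode, a quarter hit the bright detector (inconclusive), and a quarter hit the dark detector, which certifies a live bomb without setting it off. The "arbitrarily little risk" claim needs the refined versions of the protocol that repeat a weak interaction many times; this sketch only covers the basic one.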


So we would need to figure out how to robustly get an honest signal from such an experiment, which still seems quite hard. But perhaps it's easier than solving the full alignment problem before the first shot.

IMO this is an 'additional line of defense' boxing strategy instead of simplification.

Note that in the traditional version, the 'dud' bit of the bomb can only be the trigger; a bomb that absorbs the photon but then explodes isn't distinguishable from a bomb that absorbs the photon and then doesn't explode (because of an error deeper in the bomb).

But let's suppose the quantum computing folks can come up with something like this, where we keep some branches entangled and run analysis of the AI code in only one branch, causing an explosion there but affecting the total outcome in all branches. [It seems pretty implausible to me that you manage to maintain entanglement despite that much impact on the external world, but maybe it's possible.] Then 1) as you point out, we need to ensure that the AI doesn't realize what it needs to output in that branch and 2) we need some sort of way to evaluate "did the AI pass our checks or not?".

But, 2 is "the whole problem"!



I think we get enough things referencing quantum mechanics that we should probably explain why that doesn't work (if it doesn't) rather than just downvoting and moving on.


It probably does work with a Sufficiently Powerful™ quantum computer, if you could write down a meaningful predicate which can be computed:

Haha yeah, I'm not surprised if this ends up not working, but I'd appreciate hearing why.

[Public Draft v0.0] AGI: The Depth of Our Uncertainty

[The intent is for this to become a post making a solid case for why our ignorance about AGI implies near-certain doom, given our current level of capability:alignment efforts.]

[I tend to write lots of posts which never end up being published, so I'm trying a new thing where I will write a public draft which people can comment on, either to poke holes or contribute arguments/ideas. I'm hoping that having any engagement on it will strongly increase my motivation to follow through with this, so please comment even if just to say this seems cool!]

[Nothing I have planned so far is original; this will mostly be exposition of things that EY and others have said already. But it would be cool if thinking about this a lot gives me some new insights too!]

Entropy is Uncertainty

Given a model of the world, there are lots of possibilities that satisfy that model, over which our model implies a distribution.

There is a mathematically inevitable way to quantify the uncertainty latent in such a model, called entropy.

A model is subjective in the sense that it is held by a particular observer, and thus entropy is subjective in this sense too. [Obvious to Bayesians, but worth spending time on as it seems to be a common sticking point]

This is in fact the same entropy that shows up in physics!
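A small sketch of the "entropy is subjective" point, using Shannon entropy in bits. The die example and the names `ignorant` and `informed` are hypothetical illustrations, not part of the draft: two observers model the same roll, and the one with more information has a lower-entropy model.

```python
import math


def entropy_bits(distribution):
    """Shannon entropy H = -sum(p * log2(p)) of a {outcome: probability} dict."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# Observer 1 knows only that a fair six-sided die was rolled.
ignorant = {face: 1 / 6 for face in range(1, 7)}

# Observer 2 has additionally been told the result is even.
informed = {2: 1 / 3, 4: 1 / 3, 6: 1 / 3}

print(entropy_bits(ignorant), entropy_bits(informed))
```

Same die, same roll, but the two models differ by exactly one bit, which is the information content of "it was even". Nothing about the die changed; only the observer's model did.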

Engine Efficiency

But wait, that implies that temperature (defined from entropy) is subjective, which is crazy! After all, we can measure temperature with a thermometer. Or define it as the average kinetic energy of the particles (in a monoatomic gas, in other cases you need the potential energy from the bonds)! Those are both objective in the sense of not depending on the observer.

That is true, as those are slightly different notions of temperature. The objective measurement is the one important for determining whether something will burn your hand, and thus is the one which the colloquial sense of temperature tracks. But the definition from entropy is actually more useful, and it's more useful because we can wring some extra advantage from the fact that it is subjective.

And that's because it is this notion of temperature which governs the use of an engine. Without the subjective definition, we merely get the law of a heat engine. As a simple intuition, consider that you happen to know that your heat source doesn't just have molecules moving randomly, but that they are predominantly moving back and forth along a particular axis at a specific frequency. The temperature of a thermometer attached to this may measure the same temperature as an ordinary heat sink with the same amount of energy (mediated by phonon dissipation), and yet it would be simple to create an engine using this "heat sink" exceeding the Carnot limit simply by using a non-heat engine which takes advantage of the vibrational mode!

Say that this vibrational mode was hidden or hard to notice. Then someone with the knowledge of it would be able to make a more effective engine, and therefore extract more work, than someone who hadn't noticed.

Another example is Maxwell's demon. In this case, the demon has less uncertainty over the state of the gas than someone at the macro-level, and is thereby able to extract more work from the same gas.

But perhaps the real power of this subjective notion of temperature comes from the fact that the Carnot limit still applies with it, but now generalized to any kind of engine! This means that there is a physical limit on how much work can be extracted from a system which directly depends on your uncertainty about the system!! [This argument needs to actually be fleshed out for this post to be convincing, I think...]
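For reference, these are the standard textbook relations that the argument above leans on, written out as a sketch. The notation is an assumption rather than something fixed by this draft: p_i is the observer's probability for microstate i, E is energy, W is the work extracted, Q_h is the heat drawn from the hot reservoir, and T_h, T_c are the hot and cold reservoir temperatures.

```latex
S = -k_B \sum_i p_i \ln p_i ,
\qquad
\frac{1}{T} = \frac{\partial S}{\partial E} ,
\qquad
\eta = \frac{W}{Q_h} \le 1 - \frac{T_c}{T_h}
```

Since S is computed from the observer's p_i, the T defined from it inherits that dependence, and so does the bound on η. This does not yet amount to the fleshed-out argument, it only makes explicit where the observer's uncertainty enters.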

The Work of Optimization

[Currently MUCH rougher than the above...]

Hopefully now, you can start to see the outlines of how it is knowable that

Try to let go of any intuitions about "minds" or "agents", and think about optimizers in a very mechanical way.

Physical work is about the energy necessary to change the configuration of matter.

Roughly, you can factor an optimizer into three parts: The Modeler, the Engine, and the Actuator. Additionally, there is the Environment the optimizer exists within and optimizes over. The Modeler models the optimizer's environment - decreasing uncertainty. The Engine uses this decreased uncertainty to extract more work from the environment. The Actuator focuses this work into certain kinds of configuration changes.

[There seems to be a duality between the Modeler and the Actuator which feels very important.]


Gas Heater

  • It is the implicit knowledge of the location, concentration, and chemical structure of a natural gas line that allows the conversion of natural gas and the air in the room from a state of both being at the same low temperature to a state where the air is at a higher temperature, and the gas has been burned.

-- How much work does it take to heat up a room? -- How much uncertainty is there in the configuration state before and after combustion?

This brings us to an important point. A gas heater still works with no one around to be modeling it. So how is any of the subjective entropy stuff relevant? Well, from the perspective of no one - the room is simply in one of a plethora of possible states before, and it is in another of those possible states after, just like any other physical process anywhere. It is only because of the fact that we find it somehow relevant that the room is hotter after than before that thermodynamics comes into play. The universe doesn't need thermodynamics to make atoms bounce around, we need it to understand and even recognize it as an interesting difference.



Natural Selection

Chess Engine



Why Orthogonality?

[More high level sections to come]

dumb alignment idea

Flood the internet with stories in which a GPT chatbot which achieves superintelligence decides to be Good/a scaffold for a utopian human civilization/CEV-implementer.

The idea being that an actual GPT chatbot might get its values from looking at what the GPT part of it predicts such a chatbot would do.