Tor Økland Barstad


AGI-assisted Alignment



AGI-assisted alignment in Dath Ilan (excerpt from here)

Suppose Dath Ilan got into a situation where they had to choose the strategy of AGI-assisted alignment, and didn't have more than a few years to prepare. Dath Ilan wouldn't actually get themselves into such a situation, but if they did, how might they go about it?

I suspect that among other things they would:

  • Make damn well sure to box the AGIs before they plausibly could become dangerous/powerful.
  • Try, insofar as they could, to make their methodologies robust to hardware exploits (rowhammer, etc). Not only by making hardware exploits hard, but by thinking about which code they ran on which computers and so on.
  • Limit communication bandwidth for AGIs (they might think of having humans directly exposed to AGI communication as "touching the lava", and try to obtain help with alignment-work while touching lava as little as possible).
  • Insofar as they saw a need to expose humans to AGI communication, those humans would be sealed off from society (and the AGIs' communication would be heavily restricted). The idea of having operators themselves be exposed to AGI-generated content from superhuman AGIs they don't trust to be aligned - in Dath Ilan such an approach would have been seen as outside the realm of consideration.
  • Humans would be exposed to AGI communication mostly so as to test the accuracy of systems that predict human evaluations (if they don't test humans on AGI-generated content, they're not testing the full range of outputs). But even this they would also try to get around / minimize (by, among other things, using techniques such as the ones I summarize here).
  • In Dath Ilan, it would be seen as a deranged idea to have humans evaluate arguments (or predict how humans would evaluate arguments), and just trust arguments that seem good to humans. To them, that would be kind of like trying to make a water-tight basket, but never trying to fill it with water (to see if water leaked through). Instead, they would use techniques such as the ones I summarize here.
  • When forced to confront chicken and egg problems, they would see this as a challenge (after all, chickens exist, and eggs exist - so it's not as if chicken and egg problems never have solutions).

This is from What if Debate and Factored Cognition had a mutated baby? (a post I've been working on).

It's from the intro/summary (not the whole post). I might not finish the post, as I'm working on a new/different post that covers much of the same ground.

Tweet-length summary-attempts

Resembles Debate, but:

  • Higher alignment-tax (probably)
  • More "proof-like" argumentation
  • Argumentation can be more extensive
  • There would be more mechanisms for trying to robustly separate out "good" human evaluations (and testing if we succeeded)

We'd have separate systems that (among other things):

  1. Predict human evaluations of individual "steps" in AI-generated "proof-like" arguments.
  2. Make functions that separate out "good" human evaluations.

I'll explain why #2 doesn't rely on us already having obtained honest systems.

"ASIs could manipulate humans" is a leaky abstraction (which humans? how is argumentation restricted?).

ASIs would know regularities for when humans are hard to fool (even by other ASIs).

I posit: We can safely/robustly get them to make functions that specify these regularities.


To many of you, the following will seem misguided:

We can obtain systems from AIs that predict human evaluations of the various steps in “proof-like” argumentation.

We can obtain functions from AIs that assign scores to “proof-like” argumentation based on how likely humans are to agree with the various steps.

We can have these score-functions leverage regularities for when humans tend to evaluate correctly (based on info about humans, properties of the argumentation, etc).

We can then request “proofs” for whether outputs do what we asked for / want (and trust output that can be accompanied with high-scoring proofs).

We can request outputs that help us make robustly aligned AGIs. 

Many of you may find several problems with what I describe above - not just one. Perhaps most glaringly, it seems circular:

If we already had AIs that we trusted to write functions that separate out “good” human evaluations, couldn’t we just trust those AIs to give us “good” answers directly?

The answer has to do with what we can and can’t score in a safe and robust way (for purposes of gradient descent).

The answer also has to do with exploration of wiggle-room:

Given a specific score-function, is it possible to construct high-scoring arguments that argue in favor of contradictory conclusions?

And exploration of higher-level wiggle-room:

Suppose some specific restrictions for score-functions (designed to make it hard to make high-scoring score-functions that have low wiggle-room for “wrong” reasons).

Given those restrictions, is it possible to make high-scoring score-functions that are mutually contradictory (even if internally those score-functions have low wiggle-room)?

All score-functions that robustly leverage regularities for when human evaluations are correct/good would have low wiggle-room. The reverse is not true. Score-functions could have low wiggle-room due to somehow favoring wrong/bad conclusions.

Some (but not all) “core” concepts are summarized below:

Wiggle-room (relative to score-function for argument-step-networks)

Is it possible to generate high-scoring networks that argue for contradictory conclusions? If yes, then that means there is wiggle room.

Any score-function that robustly separates and leverages “good” human evaluations will have low wiggle-room. But the reverse is not true, as there also are other ways to achieve low wiggle-room.

Making it so that the way for a score-function to itself get a high score is to separate out “good” human evaluations

Some AIs would make score-functions, and they would try to make score-functions that get a high score (from the score-function score-function 🙃).

Score-functions would be required to have low wiggle-room. If constructed freely, score-functions could achieve low wiggle-room in a variety of ways. But with this alignment-methodology, the “space” of possible score-functions that can be proposed would be heavily restricted.

We’d want score-functions to achieve low wiggle-room by leveraging regularities for when human evaluators get things right. We’d do that by making it so that achieving low wiggle-room any other way either would be impossible, or result in a lower score.

Some of the tools that would be in our toolbox when restricting score-functions:

  • Giving a complexity-penalty based on source code length (and having a low maximum limit for the amount of source code allowed)
  • Having predictions of human evaluations be done by external systems (that don’t add to the complexity-penalty)
  • Restricting the range of questions where human answers can be predicted
  • Making it so that source code must be constructed out of (something akin to) human-defined regular expressions, that define the confines of how the functions should work
  • Enforcing modularity through a variety of means (making it so that no part of the score-function knows what the argument-network is about)
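As a rough illustration of the first bullet, a complexity-penalty with a hard cap might look something like the sketch below. The name `restricted_score`, the 200-character limit and the penalty weight are made up for this example; they are not values from the methodology.

```python
from typing import Optional

MAX_SOURCE_CHARS = 200      # hypothetical hard limit on source code length
PENALTY_PER_CHAR = 0.001    # hypothetical weight of the complexity-penalty


def restricted_score(source_code: str, base_score: float) -> Optional[float]:
    """Return None if the proposal is over the limit, else the penalized score."""
    if len(source_code) > MAX_SOURCE_CHARS:
        return None  # proposal is not allowed at all
    return base_score - PENALTY_PER_CHAR * len(source_code)


short = "def f(step): return predict(step) > 0.9"
padded = short + "  # " + "x" * 300

print(restricted_score(short, base_score=1.0))
print(restricted_score(padded, base_score=1.0))  # None
```

Here the base score would come from elsewhere (e.g. from how well the score-function does on wiggle-room); the sketch only shows how length feeds into the final number.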

Wiggle-room (relative to score-function for score-functions)

Have we succeeded with making it so that, in order to maximize score, score-functions must robustly separate out “good” human evaluations? Or is it possible to somehow sneak in bias in favor of specific conclusions, without the score-function getting a lower score as a result?

We can check that, by asking AIs to find high-scoring score-functions that allow for contradictory conclusions. If they are able to do that, then it’s possible to sneak in bias.

  • Let S be the set of possible score-functions that would be given a high score (by the score-function score-function in question)
  • Let A be the set of possible argument-step-networks that would be given a high score by ≥1 of the score-functions in S
  • If it’s possible to sneak bias into high-scoring score-functions, then there will be argument-step-networks in A that argue in favor of contradictory conclusions.
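The three bullets above can be sketched in code for a toy, finite case. Everything here is a stand-in: a "network" is reduced to the conclusion it argues for, the score-functions and the score-function score-function are hand-written lambdas, and the thresholds are arbitrary. In the actual proposal the search would be done by AIs over much larger spaces.

```python
from itertools import combinations


def second_level_wiggle_room(score_functions, meta_score, networks,
                             contradicts, meta_threshold, threshold):
    """True if high-scoring score-functions jointly allow contradictory networks."""
    # S: score-functions that the score-function score-function rates highly
    S = [f for f in score_functions if meta_score(f) >= meta_threshold]
    # A: networks given a high score by >=1 of the score-functions in S
    A = [n for n in networks if any(f(n) >= threshold for f in S)]
    return any(contradicts(a, b) for a, b in combinations(A, 2))


# Toy stand-ins: a "network" is reduced to the conclusion it argues for.
networks = ["P", "not P"]
favors_p = lambda n: 1.0 if n == "P" else 0.0
favors_not_p = lambda n: 1.0 if n == "not P" else 0.0
contradicts = lambda a, b: {a, b} == {"P", "not P"}

lenient_meta = lambda f: 1.0                            # lets both through
strict_meta = lambda f: 1.0 if f is favors_p else 0.2   # filters one out

print(second_level_wiggle_room([favors_p, favors_not_p], lenient_meta,
                               networks, contradicts, 0.9, 0.9))  # True
print(second_level_wiggle_room([favors_p, favors_not_p], strict_meta,
                               networks, contradicts, 0.9, 0.9))  # False
```

Note that the second result only shows low wiggle-room, not that the surviving score-function is the "right" one.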

Below are some concepts related to extracting aligned capabilities. The main goal is to be able to verify specialized functions without having humans need to look at the source code, and without being able to safely/robustly score outputs for the full range of inputs.

Some things we need:

  • We need AIs that act in such a way as to maximize score
  • There needs to be some range of the inputs that we can test
  • There need to be ways of obtaining/calculating the output we want that are at least somewhat general

An example of an aligned capability we might want would be to predict human answers. In this case, we could test outputs by actually asking questions to real humans (or using existing data of humans receiving questions). But if we use the systems to predict human answers when they evaluate AGI-generated content, then we may not want to test those outputs/predictions on real humans.

(I'm working on texts that hopefully will explain these concepts better. In the meantime, this is the best I have.)

AI vs AI games and optimization-targets


Desideratum

A function that determines whether some output is approved or not (that output may itself be a function).


Score-function

A function that assigns score to some output (that output may itself be a function).


Function-builder

Think regular expressions, but more expressive and user-friendly.

We can require of AIs: "Only propose functions that can be made with this builder". That way, we restrict their expressivity.
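One crude way to picture such a restriction is a whitelist over syntax. The sketch below uses Python's standard `ast` module to accept only arithmetic expressions over a single variable `x`. The whitelist and the name `buildable` are hypothetical, and a real builder would presumably be far more expressive than this.

```python
import ast

# Hypothetical whitelist: only +, -, * and unary minus over the variable `x`.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant, ast.Name, ast.Load,
    ast.Add, ast.Sub, ast.Mult, ast.USub,
)


def buildable(expr_source: str) -> bool:
    """True if the expression uses only whitelisted syntax and the name `x`."""
    try:
        tree = ast.parse(expr_source, mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False
        if isinstance(node, ast.Name) and node.id != "x":
            return False
    return True


print(buildable("x * x + 1"))            # True
print(buildable("x if x > 3 else 0"))    # False (conditionals not whitelisted)
print(buildable("__import__('os')"))     # False (calls not whitelisted)
```

Proposals that fail the check would simply not be considered, no matter how they would otherwise score.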

When we as humans specify desiderata, this is one tool (among several!) in the tool-box.

Higher-level desideratum or score-function

Not fundamentally different from other desiderata or score-functions. But the output that is evaluated is itself a desideratum or score-function.

At every level there can be many requirements for the level below.

A typical requirement at every level is low wiggle-room.

Example of higher-level desiderata / score-functions

Humans/operators define a score-function    ← level 4

for desideratum                             ← level 3

for desideratum                             ← level 2

for desideratum                             ← level 1

for functions that generate

the output we care about.

Wiggle-room relative to desideratum

Among outputs that would be approved by the desideratum, do any of them contradict each other in any way?

For example: Are there possible functions that give contradicting outputs (for at least 1 input), such that both functions would be approved by the desideratum?
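For a finite toy case, the question in the example can be checked by brute force. In the sketch below the desideratum only looks at inputs 0..4 (a made-up choice), so it approves both a plain doubling function and one with a hidden special case; widening the inputs we compare on exposes the disagreement.

```python
from itertools import combinations


def desideratum_wiggle_room(functions, approves, inputs):
    """Return (f, g, x) for two approved functions that disagree on x, else None."""
    approved = [f for f in functions if approves(f)]
    for f, g in combinations(approved, 2):
        for x in inputs:
            if f(x) != g(x):
                return f, g, x
    return None


# Toy desideratum (hypothetical): "output is correct for inputs 0..4".
double = lambda x: 2 * x
double_with_trap = lambda x: 2 * x if x < 100 else -1
approves = lambda f: all(f(x) == 2 * x for x in range(5))

print(desideratum_wiggle_room([double, double_with_trap], approves, range(5)))  # None
witness = desideratum_wiggle_room([double, double_with_trap], approves, range(200))
print(witness[2])  # 100
```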

Wiggle-room relative to score-function

Among outputs that would receive a high score by the score-function in question (e.g. "no less than 80% of any other possible output"), do any of them contradict each other in any way?
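A minimal sketch of this check, assuming a finite list of candidate outputs and non-negative scores (with negative scores, "80% of the best" would need a different definition). The data and the `contradicts` predicate are invented for illustration.

```python
from itertools import combinations


def high_scoring(outputs, score, fraction=0.8):
    """Outputs whose score is at least `fraction` of the best score found."""
    best = max(score(o) for o in outputs)
    return [o for o in outputs if score(o) >= fraction * best]


def wiggle_room(outputs, score, contradicts, fraction=0.8):
    """All pairs of high-scoring outputs that contradict each other."""
    top = high_scoring(outputs, score, fraction)
    return [(a, b) for a, b in combinations(top, 2) if contradicts(a, b)]


# Toy data (hypothetical): each output is (claim, score).
outputs = [("bridge holds", 10.0), ("bridge fails", 8.5), ("bridge fails", 3.0)]
score = lambda o: o[1]
contradicts = lambda a, b: a[0] != b[0]

print(wiggle_room(outputs, score, contradicts))
# [(('bridge holds', 10.0), ('bridge fails', 8.5))]
print(wiggle_room(outputs, score, contradicts, fraction=0.9))  # []
```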

2nd-level wiggle-room relative to desiderata

We start with a desiderata-desideratum or score-function-desideratum (aka 2nd-level desideratum).

Set A: Any desideratum that is approved by the desiderata-desideratum.

Set B: Any output approved by ≥1 of the desiderata in A.

Are there ≥1 contradictions among outputs in B?

P(desideratum forces good outputs | desideratum has low wiggle-room)

If a desideratum forces good/correct outputs, then it has low wiggle-room. But the reverse is not necessarily true.

But for some desiderata we may think: "If wiggle-room is low, that’s probably because it’s hard to satisfy the desideratum without also producing good output."

“Spaces/sets” of desiderata where we think P(desideratum forces good outputs | desideratum has low wiggle-room) is low

Among spaces/sets of low-wiggle-room desiderata where we suspect "low wiggle-room → good output" (as defined by higher-level desiderata), do outputs converge?

AI vs AI games and optimization-targets

AI vs AI game

AIs compete in a zero-sum game.

Game produces some type of output (e.g. a function) that is scored.

  • AI player is global optimum → game produces output we want
  • Human review not necessary
  • Must ask ourselves: Hard in practice to win with deceptive output (not just in theory)?

AIs (in AI vs AI games) proving each other wrong, making bets and counter-bets, etc

"This is the shortest source-code for doing A"

"Here is a source-code that's shorter"

"All outputs from F1 will be scored well by evaluation-function F2"

"Here's an input where that's not true"


"I'll bet on any prediction made by function F3"

"I make a counter-bet for input I1"

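The second exchange above ("All outputs from F1 will be scored well by evaluation-function F2" / "Here's an input where that's not true") amounts to a counterexample search. Below is a toy sketch over a finite list of inputs. The concrete `f1` and `f2_accepts` are invented, and real competing AIs would of course search much more cleverly than a linear scan.

```python
def find_counterexample(f1, f2_accepts, inputs):
    """Return the first input whose F1 output is rejected by F2, else None."""
    for x in inputs:
        if not f2_accepts(x, f1(x)):
            return x
    return None


# Hypothetical claim: "all outputs from F1 are >= -3".
f1 = lambda x: x * x - 4 * x
f2_accepts = lambda x, y: y >= -3

print(find_counterexample(f1, f2_accepts, range(5)))   # 2
print(find_counterexample(f1, f2_accepts, [0, 1, 4]))  # None
```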

Evaluation-range

For functions with AGI origin, there may sometimes be only some subset of the inputs where we can evaluate output robustly+safely.

For example, when testing systems that predict human output, we may want to avoid experiments where humans evaluate AGI-generated content.


Test-range

In AI vs AI games (e.g. to make high-scoring functions), some subset of inputs may be declared the test-range.

Bad output in evaluation-range outside of test-range: No penalty.

Test-range ⊈ evaluation-range: Game is not scored.
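The two rules above can be written down as a small scoring function. This is only a sketch: `score_game` and `is_good` are hypothetical names, and scoring by "fraction of good outputs" is an assumption made to keep the example short.

```python
def score_game(f, is_good, test_range, evaluation_range):
    """Fraction of test-range inputs with good output, or None if not scored."""
    test_range, evaluation_range = set(test_range), set(evaluation_range)
    if not test_range or not test_range <= evaluation_range:
        return None  # test-range must be a non-empty subset of evaluation-range
    # Inputs outside the test-range are never looked at, so they carry no penalty.
    return sum(is_good(x, f(x)) for x in test_range) / len(test_range)


f = lambda x: x + 1 if x < 50 else 0     # bad output from 50 and up
is_good = lambda x, y: y == x + 1

print(score_game(f, is_good, range(10), range(100)))       # 1.0
print(score_game(f, is_good, range(90, 110), range(100)))  # None
```

The first call gets a perfect score even though `f` is bad for inputs 50..99, which is exactly why the rest of the section cares about what happens outside of test-range.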


Some examples of optimization-targets (there are many more):

  • Source code brevity
  • Non-overlap-minimization with optimized code-bases where we can evaluate output
  • Heuristics that approximate code generality, modularity, etc
  • Minimal computational logic (including axioms) to infer outputs

Optimization-targets that cause low wiggle-room outside of test-range

Imagine AIs competing to make functions that calculate the area of squares.

Accurate output inside test-range is required. Source code brevity gives bonus-points. 

Among functions that maximize score, all have the same outputs outside of test-range.
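A toy version of the square-area game, with two hand-written entries instead of competing AIs. The entries, the test-range of 1..99 and the "score = negative source length" rule are all assumptions chosen to keep the sketch small.

```python
# Hypothetical entries: source code strings for single-argument functions.
entries = {
    "general": "lambda s: s * s",
    "sabotaged": "lambda s: s * s if s < 1000 else 0",
}
test_range = range(1, 100)


def score(source):
    f = eval(source)  # acceptable for a fixed toy list only
    if any(f(s) != s * s for s in test_range):
        return float("-inf")  # accurate output inside test-range is required
    return -len(source)       # brevity bonus: shorter source, higher score


winner = max(entries, key=lambda name: score(entries[name]))
print(winner)                        # general
print(eval(entries[winner])(5000))   # 25000000
```

Both entries are accurate inside test-range, but the sabotaged one has to pay for its extra condition in source code length, so the winner is also correct at 5000 (outside of test-range).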

Combining optimization-targets

Optimization-targets can be combined (be that with weighted sums, or functions that assign score based on more sophisticated evaluations). When I talk about optimization-targets, you should not assume that it consists of just 1 heuristic (such as e.g. source code length).
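The weighted-sum case is simple enough to sketch directly. The two heuristics and the weights below are placeholders, and `steps` is a made-up stand-in for some measure of computational cost.

```python
def combine(targets, weights):
    """Build one optimization-target from several, as a weighted sum."""
    def combined(candidate):
        return sum(w * t(candidate) for t, w in zip(targets, weights))
    return combined


# Hypothetical heuristics over a candidate described by a small dict.
brevity = lambda c: -len(c["source"])
cheapness = lambda c: -c["steps"]

target = combine([brevity, cheapness], weights=[1.0, 0.5])
a = {"source": "lambda s: s * s", "steps": 1}
b = {"source": "lambda s: sum(s for _ in range(s))", "steps": 40}
print(target(a) > target(b))  # True
```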

Techniques for testing P(optimization-target forces good output | optimization-target causes low wiggle-room)

Run AI vs AI games where we:

  • Pick out various subsets of evaluation-range to be test-range.
  • Set semi-random test-ranges, even though for the task in question input-range = evaluation-range.

(For the sake of scale/efficiency we can have AIs make bets on the outcomes of such games.)
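For a task where we can check every input, the procedure above can be simulated: repeatedly pick a semi-random test-range, let the optimization-target choose a winner among candidates that pass it, and see how often that winner is also right on the held-out inputs. The sketch below uses source code brevity as the target; the candidates and all names are invented for illustration.

```python
import random


def holdout_success_rate(candidates, reference, inputs, score, trials=20, seed=0):
    """How often the top-scoring candidate on a random test-range is also
    correct on the held-out rest of the evaluation-range."""
    rng = random.Random(seed)
    inputs = list(inputs)
    hits = scored = 0
    for _ in range(trials):
        test = set(rng.sample(inputs, len(inputs) // 2))
        held_out = [x for x in inputs if x not in test]
        valid = [c for c in candidates
                 if all(c["f"](x) == reference(x) for x in test)]
        if not valid:
            continue  # no candidate passed the test-range; game not scored
        best = max(valid, key=score)
        scored += 1
        hits += all(best["f"](x) == reference(x) for x in held_out)
    return hits / scored if scored else None


candidates = [
    {"source": "s * s", "f": lambda s: s * s},
    {"source": "s * s if s % 7 else 0", "f": lambda s: s * s if s % 7 else 0},
]
brevity = lambda c: -len(c["source"])

print(holdout_success_rate(candidates, lambda s: s * s, range(1, 41), brevity))  # 1.0
```

A rate well below 1.0 for some optimization-target would be evidence that it causes low wiggle-room for the wrong reasons (at least on this kind of task).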

Rules for predicting if/when optimization-targets limit wiggle-room for the right reasons

AIs can find rules that predict when we should trust some optimization-target to cause low wiggle-room for the right reasons (scored by simplicity, false positive rate, etc).

They can make bets on average performance, and point out examples where rules from competing AIs fail.

Convergence/divergence among rules for predicting if/when optimization-targets limit wiggle-room for the right reasons

A: Huge (but limited) scope of outputs that we can test safely.

B: Outputs we can't test safely.

R: Rules for predicting when optimization-targets force good output (top-scoring when tested within A).

For outputs in B, we can explore: "Does R converge in regards to this output?"

“Spaces” of possible functions we think contain ≥1 robustly reliable function

If we (have good reason to) think some “space” of functions contains ≥1 robustly reliable function, and all functions in that “space” agree on output for a given input, then we can trust this to be that output - even if there isn’t a specific function that we trust to be robustly reliable.
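In code, the idea is just a unanimity check over the "space". The sketch assumes the space is small enough to enumerate and that outputs are hashable; the three functions are toy stand-ins.

```python
def agreed_output(function_space, x):
    """Return (True, output) if every function agrees on x, else (False, None)."""
    outputs = {f(x) for f in function_space}
    if len(outputs) == 1:
        return True, outputs.pop()
    return False, None


space = [lambda n: n * 2, lambda n: n + n, lambda n: n << 1 if n < 10 else 0]
print(agreed_output(space, 3))    # (True, 6)
print(agreed_output(space, 50))   # (False, None)
```

For input 3 we can trust the output 6 without knowing which of the three functions is the reliable one; for input 50 the space gives us nothing.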

Examples of optimization-targets

Source code brevity

Sometimes the solution with the shortest source code is quite generic/general/abstract. In such cases, intentionally giving wrong output for some inputs would often require implementing this solution and adding (something akin to) if-statements that give bad output for some inputs.


A maximally “non-general” function would have 1 if-statement per possible input.

Possible heuristics:

  • % chance that a random change to the code changes output for a low but non-zero fraction of inputs
  • If there are simplifications that would affect a low but non-zero fraction of inputs

Use of computational resources

If a function only focuses on finding the correct solution, and doesn’t need to spend cycles thinking about how to deceive, what humans think, etc (in addition to finding the correct solution), it will need fewer computational resources.

Non-overlap-minimization with optimized code-bases where we robustly can score/verify the full range of outputs

We can score/verify all outputs for:

  • Theorem-provers
  • Game-players
  • Programs predicting what humans say next in existing data (videos, chats, etc)
  • Etc, etc

If you’re not trying to have your function give bad output once in a while, borrowing sub-routines may be easier.

Source code brevity (while doing several specialized tasks from same code-base)

Suppose one code-base is to be specialized, but towards several different tasks (and for several of them, all output can be verified).


  • Coming up with ways to make nanotech
  • Searching for mathematical proofs
  • Predicting experimental results (from existing experiments)


Imagine that you are trying to make a car that fails in certain very specific conditions. This becomes more challenging if it has to be made out of parts that also are used in other vehicles (and these other vehicles are robustly tested).

The degree to which subroutine instantiations inside/outside test-range can be told apart

% of source code that consists of subroutines where no AI is able to guess/determine based on input if the function-call is inside or outside of test-range.

This could be determined by running other AI vs AI games (but mostly by making predictions/bets on what the outcome would be).

Module implementation flexibility

Imagine that you are trying to make a car, and want the steering wheel to fail, but only in certain very specific conditions. This becomes more challenging to achieve consistently if it isn’t you who will be making the steering wheel.

Module division flexibility

If you are trying to fail for certain specific inputs, this is made harder if you are making functions based on different architectures, and they need to fail the same way.

There may be crude heuristics that help approximate whether 2 architectures are substantially different.

Minimal computational “logic” (including “axioms”) needed to infer output

  • Each AI specifies an initial base of data-structures ("axioms"), which includes functions ("inference-rules").
  • Each AI can infer good/bad output from either initial base.
  • Less source code (in total for initial base, and per data-structure in initial base) leads to higher score.

Predicted comprehensibility to humans

Heuristics relating to how understandable code-base would be to humans.

It is allowed to rewrite code-base for understandability, so long as output is unchanged.

This optimization-target presupposes systems that can predict human output (with non-terrible accuracy).

Telephone game robustness

Suppose various systems (maybe some emulating humans and some that don't) for:

  • Converting human instructions/diagrams/etc to code
  • Converting back the other way

With N cycles, what's the % chance of ≥1 outputs changing, but only outside of test-range?

Output-robustness to training data

Instead of having the AI do something, we can have it make a seed AI that does (after having been trained). And we can see the degree to which various outputs are dependent on variations in training data.

Output-robustness to adding chains by which output can be indirectly obtained

Indirect ways of obtaining answers can be chained:

Make a function that

generates a computational logic that


gives us the output we want

Each chain is an opportunity for deceptive AIs to lose ground / be unable to have intentional mistakes remain.

I would also like to see more work where people make less positive/optimistic assumptions. I think of it as a good thing that different approaches to alignment are being explored, and would like to see more of that in general (both in terms of breadth and depth).

I guess there are many possible ways of trying to categorize/conceptualize approaches to alignment theorizing. One is by asking "when talking/thinking about the methodology, what capabilities are assumed to be in place?".

I'm not sure about this, but unless I'm mistaken[1], a good amount of the work done by MIRI has been under a premise that goes (something sort of like): "Let's assume that we already know how to give AIs real-world objectives defined in terms of utility functions (not because we really assume that, but as a simplifying assumption)". And as far as I know, they haven't done much work where the assumption was something more like "suppose we were extremely good at gradient descent / searching through spaces of possible programs".

In my own theorizing, I don't make all of the simplifying assumptions that (I think/suspect) MIRI made in their "orthodox" research. But I make other assumptions (for the purpose of simplification), such as:

  • "let's assume that we're really good at gradient descent / searching for possible AIs in program-space"[2]
  • "let's assume that the things I'm imagining are not made infeasible due to a lack of computational resources"
  • "let's assume that resources and organizational culture make it possible to carry out the plans as described/envisioned (with high technical security, etc)"

In regards to your alignment ideas, is it easy to summarize what you assume to be in place? Like, if someone came to you and said "we have written the source code for a superintelligent AGI, but we haven't turned it on yet" (and you believed them), is it easy to summarize what more you then would need in order to implement your methodology? 

  1. ^

    I very well could be, and would appreciate any corrections.

    (I know they have worked on lots of detail-oriented things that aren't "one big plan" to "solve alignment". And maybe how I phrase myself makes it seem like I don't understand that. But if so, that's probably due to bad wording on my part.)

  2. ^

    Well, I sort of make that assumption, but there are caveats.

If humans (...) machine could too.

From my point of view, humans are machines (even if not typical machines). Or, well, some will say that by definition we are not - but that's not so important really ("machine" is just a word). We are physical systems with certain mental properties, and therefore we are existence proofs of physical systems with those certain mental properties being possible.

machine can have any level of intelligence, humans are in a quite narrow spectrum

True. Although if I myself somehow could work/think a million times faster, I think I'd be superintelligent in terms of my capabilities. (If you are skeptical of that assessment, that's fine - even if you are, maybe you believe it in regards to some humans.)

prove your point by analogy with humans. If humans can pursue somewhat any goal, machine could too.

It has not been my intention to imply that humans can pursue somewhat any goal :)

I meant to refer to the types of machines that would be technically possible for humans to make (even if we don't want to do so in practice, and shouldn't want to). And when saying "technically possible", I'm imagining "ideal" conditions (so it's not the same as me saying we would be able to make such machines right now - only that it at least would be theoretically possible).

Why call it an assumption at all?

Partly because I was worried about follow-up comments that were kind of like "so you say you can prove it - well, why aren't you doing it then?".

And partly because I don't make a strict distinction between "things I assume" and "things I have convinced myself of, or proved to myself, based on things I assume". I do see there as sort of being a distinction along such lines, but I see it as blurry.

Something that is derivable from axioms is usually called a theorem.

If I am to be nitpicky, maybe you meant "derived" and not "derivable".

From my perspective there is a lot of in-between between these two:

  • "we've proved this rigorously (with mathematical proofs, or something like that) from axiomatic assumptions that pretty much all intelligent humans would agree with"
  • "we just assume this without reason, because it feels self-evident to us"

Like, I think there is a scale of sorts between those two.

I'll give an extreme example:

Person A: "It would be technically possible to make a website that works the same way as Facebook, except that its GUI is red instead of blue."

Person B: "Oh really, so have you proved that then, by doing it yourself?"

Person A: "No"

Person B: "Do you have a mathematical proof that it's possible?"

Person A: "Not quite. But it's clear that if you can make Facebook like it is now, you could just change the colors by changing some lines in the code."

Person B: "That's your proof? That's just an assumption!"

Person A: "But it is clear. If you try to think of this in a more technical way, you will also realize this sooner or later."

Person B: "What's your principle here, that every program that isn't proven as impossible is possible?"

Person A: "No, but I see very clearly that this program would be possible."

Person B: "Oh, you see it very clearly? And yet, you can't make it, or prove mathematically that it should be possible."

Person A: "Well, not quite. Most of what we call mathematical proofs are (from my point of view) a form of rigorous argumentation. I think I understand fairly well/rigorously why what I said is the case. Maybe I could argue for it in a way that is more rigorous/formal than I've done so far in our interaction, but that would take time (that I could spend on other things), and my guess is that even if I did, you wouldn't look carefully at my argumentation and try hard to understand what I mean."

The example I give here is extreme (in order to get across how the discussion feels to me, I make the thing they discuss into something much simpler). But from my perspective it is sort of similar to discussions in regards to The Orthogonality Thesis. Like, The Orthogonality Thesis is imprecisely stated, but I "see" quite clearly that some version of it is true. Similar to how I "see" that it would be possible to make a website that technically works like Facebook but is red instead of blue (even though - as I mentioned - that's a much more extreme and straight-forward example).

(...) if it's supported by argument or evidence, but if it is, then it's no mere assumption.

I do think it is supported by arguments/reasoning, so I don't think of it as an "axiomatic" assumption. 

A follow-up to that (not from you specifically) might be "what arguments?". And - well, I think I pointed to some of my reasoning in various comments (some of them under deleted posts). Maybe I could have explained my thinking/perspective better (even if I wouldn't be able to explain it in a way that's universally compelling 🙃). But it's not a trivial task to discuss these sorts of issues, and I'm trying to check out of this discussion.

I think there is merit to having as a frame of mind: "Would it be possible to make a machine/program that is very capable in regards to criteria x, y, etc, and optimizes for z?".

I think it was good of you to bring up Aumann's agreement theorem. I haven't looked into the specifics of that theorem, but broadly/roughly speaking I agree with it.

I cannot help you to be less wrong if you categorically rely on intuition about what is possible and what is not.

I wish I had something better to base my beliefs on than my intuitions, but I do not. My belief in modus ponens, my belief that 1+1=2, my belief that me observing gravity in the past makes me likely to observe it in the future, my belief that if views are in logical contradiction they cannot both be true - all this is (the way I think of it) grounded in intuition.

Some of my intuitions I regard as much more strong/robust than others. 

When my intuitions come into conflict, they have to fight it out.

Thanks for the discussion :)

Like with many comments/questions from you, answering this question properly would require a lot of unpacking. Although I'm sure that also is true of many questions that I ask, as it is hard to avoid (we all have limited communication bandwidth) :)

In this last comment, you use the term "science" in a very different way from how I'd use it (like you sometimes also do with other words, such as for example "logic"). So if I was to give a proper answer I'd need to try to guess what you mean, make it clear how I interpret what you say, and so on (not just answer "yes" or "no").

I'll do the lazy thing and refer to some posts that are relevant (and that I mostly agree with):

It seems that 2 + 2 = 4 is also an assumption for you.

Yes (albeit a very reasonable one).

Not believing (some version of) that claim would typically make minds/AGIs less "capable", and I would expect more or less all AGIs to hold (some version of) that "belief" in practice.

I don't think it is possible to find consensus if we do not follow the same rules of logic.

Here are examples of what I would regard to be rules of logic: (the ones listed here don't encapsulate all of the rules of inference that I'd endorse, but many of them). Despite our disagreements, I think we'd both agree with the rules that are listed there.

I regard Hitchens's razor not as a rule of logic, but more as an ambiguous slogan / heuristic / rule of thumb.

Best wishes from my side as well :)

