Background in mathematics (descriptive set theory, Banach spaces) and game theory (mostly zero-sum, imperfect-information games). CFAR mentor. Usually doing alignment research.


Risk Map of AI Systems
As a concrete tool for discussing AI risk scenarios, I wonder if it doesn't have too many parameters? Like you have to specify the map (at least locally), how far we can see, what the impact of this or that research will be...

I agree that the model does have quite a few parameters. I think you can get some value out of it already by being aware of what the different parameters are and, in case of a disagreement, identifying the one you disagree about the most.

>If we currently have some AI system x, we can ask which systems are reachable from x --- i.e., which other systems are we capable of constructing now that we have x.
What are the constraints on "construction"? Because by definition, if those other systems are algorithms/programs, we can build them. It might be easier or faster to build them using x, but that doesn't make construction possible where it was impossible before. I guess what you're aiming at is something like "x reaches y" as a form of "x is stronger than y", and also "if y is stronger, it should be considered when thinking about x"?

I agree that if we can build x, and then build y (with the help of x), then we can in particular build y. So having/not having x doesn't make a difference on what is possible. I implicitly assumed some kind of constraint on the construction like "what can we build in half a year", "what can we build for $1M" or, more abstractly, "how far can we get using X units of effort".
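The effort-bounded notion of reachability can be sketched as a shortest-path search over a map of systems, where edge weights stand for construction effort. (A minimal sketch; the system names, edge costs, and the assumption that effort is a single additive number are all made up for illustration.)

```python
import heapq

def reachable(start, edges, budget):
    """Return the set of systems reachable from `start` within `budget` effort.

    `edges` maps a system to a list of (successor, effort) pairs, where
    `effort` is the cost of building the successor given what we already
    have (a simplifying assumption: effort is additive along a path).
    """
    best = {start: 0.0}  # cheapest known effort to reach each system
    queue = [(0.0, start)]
    while queue:
        cost, x = heapq.heappop(queue)
        if cost > best.get(x, float("inf")):
            continue  # stale queue entry
        for y, effort in edges.get(x, []):
            new_cost = cost + effort
            if new_cost <= budget and new_cost < best.get(y, float("inf")):
                best[y] = new_cost
                heapq.heappush(queue, (new_cost, y))
    return set(best)

# Toy map: building z directly from x is expensive, but going via y is cheap.
edges = {"x": [("y", 2.0), ("z", 9.0)], "y": [("z", 3.0)]}
print(reachable("x", edges, budget=6.0))  # z is reachable, but only via y
```

With a budget of 1.0 unit of effort, only `{"x"}` is reachable; with 6.0, the detour through y brings z within reach, which is the sense in which having y changes what we can build "for $1M".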

[About the colour map:] This part confused me. Are you just saying that to have a color map (a map from the space of AIs to R), you need to integrate all systems into one, or consider all systems but one as irrelevant?

The goal was more to say that in general, it seems hard to reason about the effects of individual AI systems. But if we make some of the extra assumptions, it will make more sense to treat harmfulness as a function from AIs to $$\mathbb R$$.

Values Form a Shifting Landscape (and why you might care)
Of course, if we just want some space of possible values, where each value has an opinion of each value, then that's just a continuous function from the product of the space with itself to the reals, which isn't a problem.

Yeah, I just meant this simple thing that you can mathematically model as $$f : V \times V \to \mathbb R$$. I suppose it makes sense to consider special cases of this that would have better mathematical properties. But I don't have high-confidence intuitions on which special cases are the right ones to consider.
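As a toy illustration of this simple thing: treat value-systems as points in a plane and let f be negative Euclidean distance, so each system rates nearby systems as more desirable. (The representation and the choice of f are arbitrary, purely illustrative assumptions.)

```python
import math

def opinion(v, w):
    """f(v, w): how desirable value-system w looks from value-system v.

    Toy choice: approval falls off with Euclidean distance, so each
    value-system likes value-systems similar to itself best.
    """
    return -math.dist(v, w)

a = (0.9, 0.1)   # three made-up value-systems as points in V = R^2
b = (0.1, 0.8)
c = (0.5, 0.45)

# The "landscape" each system sees is different: f depends on *both*
# arguments, so there is no single fixed ranking of value-systems.
print(opinion(a, c) > opinion(a, b))  # True: a prefers the nearby c
print(opinion(b, c) > opinion(b, a))  # True: b also prefers c, for its own reasons
```

Special cases with nicer mathematical properties would then correspond to extra structure on f, e.g. symmetry f(v, w) = f(w, v), or f being induced by a metric as here.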

I mostly meant this as a tool that would allow people with different opinions to move their disagreements from "your model doesn't make sense" to "both of our models make sense in theory; the disagreement is an empirical one". (E.g., the value-drift situation from Figure 6 is definitely possible, but that doesn't necessarily mean that this is what is happening to us.)

Values Form a Shifting Landscape (and why you might care)

Thank you for the comment. As for the axes, the y-axis always denotes the desirability of the given value-system (except for Figure 1). And you are exactly right with the x-axis --- that is a multidimensional space of value-systems that we put on a line, because drawing this in 3D (well, (multi+1)-D :-) ) would be a mess. I will see if I can make it somewhat clearer in the post.

AI Problems Shared by Non-AI Systems
Much of the concern about AI systems arises when they lack support for these kinds of interventions, whether because they are too fast, too complex, or able to outsmart the would-be intervening human trying to correct what they see as an error.

All of these possible causes for the lack of support are valid. I would like to add one more: when the humans that could provide this kind of support don't care about providing it or have incentives against providing it. For example, I could report a bug in some system, but this would cost me time and only benefit people I don't know, so I will happily ignore it :-).

"Zero Sum" is a misnomer.

As a game theorist, I completely endorse the proposed terminology. Just don't tell other game theorists... Sometimes, things get even worse when some people use the term "general sum games" to refer to games that are not constant-sum.

I like to imagine different games on a scale between completely adversarial and completely cooperative. With things in the middle being called "mixed-motive games".

Integrating Hidden Variables Improves Approximation

I am usually reasonably good at translating from math to non-abstract intuitive examples... but I didn't have much success here. Do you have an "in English, for simpletons" example to go with this? :-) (You know, something that uses apples and biscuits rather than English-but-abstract words like "there are many hidden variables mediating the interactions between observables" :D.)

Otherwise, my current abstract interpretation of this is something like: "There are detailed models, and those might vary a lot. And then there are very abstract models, which will be more similar to each other...well, except that they might also be totally useless." So I was hoping that a more specific example would clarify things for a bit and tell me whether there is more to this (and also whether I got it all wrong or not :-).)

Noise Simplifies

I have a long list of randomly-chosen numbers between 1 and 10, and I want to know whether their sum is even or odd.

I find your example here somewhat misleading. Suppose your random numbers weren't drawn uniformly from 1-10, but from a distribution in which even numbers are five times as likely as odd ones. If you don't know a single number, you still know that there is a 5:1 chance that it is even (and hence doesn't change the parity of the sum of the whole list). So if a single number is unknown, you will still want to take the sum of the ones you do know. In this light, your example seems like an exception rather than the norm. (My main issue with it is that, since it feels very ad hoc, you might subconsciously come to the impression that the described behaviour is the norm.)
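To make the 5:1 point concrete, here is a small simulation. (The pool below --- five even numbers, one odd --- is just one made-up distribution with the stated property; the list length and trial count are arbitrary.)

```python
import random

random.seed(0)
pool = [1, 2, 4, 6, 8, 10]  # one odd number, five even: 5:1 odds of "even"

hits = 0
trials = 10_000
for _ in range(trials):
    numbers = [random.choice(pool) for _ in range(20)]
    known, hidden = numbers[:-1], numbers[-1]
    # Guess: the hidden number is probably even, so the total's parity
    # probably matches the parity of the part we can see.
    guess = sum(known) % 2
    hits += guess == sum(numbers) % 2

print(hits / trials)  # close to 5/6: the known numbers stay informative
```

Under the uniform 1-10 draw from the post, the same guess would be right only half the time, which is what makes that example an exception: there, the hidden number really does destroy all the information in the visible ones.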

However, it might easily be that the class of these "exceptions" is important on its own. So I wouldn't want to shoot down the overall idea described in the post - I like it :-).

How should AI debate be judged?

Even if you keep the argumentation phase asymmetric, you might want to make the answering phase simultaneous or at least allow the second AI to give the same answer as the first AI (which can mean a draw by default).

This doesn't make for a very good training signal, but might have better equilibria.

AI Unsafety via Non-Zero-Sum Debate

I haven't yet thought about this in much detail, but here is what I have:

I will assume you can avoid getting "hacked" while overseeing the debate. If you don't assume that, then it might be important whether you can differentiate between arguments that are vs aren't relevant to the question at hand. (I suppose that it is much harder to get hacked when strictly sticking to a specific subject-matter topic. And harder yet if you are, e.g., restricted to answering in math proofs, which might be sufficient for some types of questions.)

As for the features of safe questions, I think that one axis is the potential impact of the answer, and an orthogonal one is the likelihood that the answer will be undesirable/misaligned/bad. My guess is that if you can avoid getting hacked, then questions with lower-impact downstream consequences are inherently safer (for the trivial reason of being less impactful). But this feels like a cheating answer, and the second axis seems more interesting.

My intuition about the "how likely are we to get an aligned answer" axis is this: There are questions where I am fairly confident in our judging skills (for example, math proofs). Many of those could fall into the "definitely safe" category. Then there is the other extreme of questions where our judgement might be very fallible - things that are too vague or that play into our biases. (For example, hard philosophical questions and problems whose solutions depend on answers to such questions. E.g., I wouldn't trust myself to be a good judge of "how should we decide on the future of the universe" or "what is the best place for me to go for a vacation".) I imagine these are "very likely unsafe". And as a general principle, where there are two extremes, there will often be a continuum in between. Maybe "what is a reasonable way of curing cancer?" could fall there? (Being probably safe, but I wouldn't bet all my money on it.)

AI Unsafety via Non-Zero-Sum Debate

I agree with what Paul and Donald are saying, but the post was trying to make a different point.

Among various things needed to "make debate work", I see three separate sub-problems:

(A) Ensuring that "agents use words to get a human to select them as the winner; and that this is their only terminal goal" is a good abstraction. (Please read this in the intended meaning of the sentence. No, if there is a magical word that causes the human's head to explode and their body falls on the reward button, this doesn't count.)

(B) Having already accomplished (A), ensure that "agents use words to convince the human that their answer is better" is a good abstraction. (Not sure how to operationalize this, but you want to, for example, ensure that: (i) Agents do not collaboratively convince the human to give reward to both of them. (ii) If the human could in principle be brainwashed, the other agent will be able and incentivized to prevent this. In particular, no brainwashing in a single claim.)

(C) Having already accomplished (A) and (B), ensure that AIs in debate only convince us of safe and useful things.

While somewhat related, I think these three problems should be tackled separately as much as possible. Indeed, (A) seems to not really be specific to debate, because a similar problem can be posed for any advanced AI. Moreover, I think that if you are relying on the presence of the other agent to help you with (A) (e.g., one AI producing signals to block the other AI's signals), you have already made a mistake. On the other hand, it seems fine to rely on the presence of the other agent for both (B) and (C). However, my intuition is that these problems are mostly orthogonal - most solutions to (B) will be compatible with most solutions to (C).

For (A), Michael Cohen's Boxed Myopic AI seems like a particularly relevant thing. (Not saying that what it proposes is enough, nor that it is required in all scenarios.) Michael's recent "AI Debate" Debate post seems to be primarily concerned about (B). Finally, this post could be rephrased as "When people talk about debate, they often focus on (C). And that seems fair. However, if you make debate non-zero-sum, your (B) will break."
