Iterated Amplification


Teaching ML to answer questions honestly instead of predicting human answers

Not sure if I'm misunderstanding this, but it seems to me that if it takes 10,000 bits to specify the intended head and 1000 bits to specify the instrumental head, that's because the world model - which we're assuming is accurate - considers humans that answer a question with a truthful and correct description of reality much rarer than humans who don't.

I don't think the complexity of the head is equal to frequency in the world model. Also I'm not committed to the simplicity prior being a good prior (all I know is that it allowed the AI to learn something the human didn't understand). And most importantly, a human who answers honestly is not the same as the model's honest answer---they come apart whenever the human is mistaken.

So if I understand correctly, the right amount of bits saved here would be 9,000.

I think 10,000 is right? 2^{-10,000} of all possible functions answer questions correctly. 2^{-1,000} of possible functions look up what the human says, but that's not relevant for computing P(the human answers questions correctly). (I assume you were computing 9,000 as 10,000 - 1,000.)
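A toy sketch of the counting argument above (the 10,000 and 1,000 are the hypothetical bit counts from this discussion, not measured quantities): under a simplicity prior a head specified by k bits gets prior mass 2^-k, and only the honest head's mass counts toward P(the model answers questions correctly).

```python
# Hypothetical bit counts from the discussion above.
bits_intended = 10_000      # bits to specify the head that answers honestly
bits_instrumental = 1_000   # bits to specify "predict what the human says"

# The instrumental head's 2^-1,000 prior mass predicts *human* answers,
# which differ from correct answers whenever the human is mistaken, so it
# contributes nothing to P(the model answers questions correctly).
bits_for_correct_answers = bits_intended               # 10,000

# The 9,000 figure comes from subtracting the two, which conflates the
# two quantities.
bits_saved_naive = bits_intended - bits_instrumental   # 9,000

print(bits_for_correct_answers, bits_saved_naive)
```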

A naive alignment strategy and optimism about generalization

I agree you have to do something clever to make the intended policy plausibly optimal.

The first part of my proposal in section 3 here was to avoid using "imitate humans," and to instead learn a function "Answer A is unambiguously worse than answer B." Then we update against policies only when they give unambiguously worse answers.

(I think this still has a lot of problems; it's not obvious to me whether the problem is soluble.)
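The proposal above can be sketched as a filter on preference data (all names here are illustrative, not from the post): the judge may return "ambiguous," and only slam-dunk comparisons produce a training signal.

```python
def training_pair(answer_a, answer_b, judge):
    """Return a (preferred, dispreferred) pair only when the comparison
    is a slam dunk; otherwise contribute no update against either policy."""
    verdict = judge(answer_a, answer_b)
    if verdict == "A_unambiguously_worse":
        return (answer_b, answer_a)
    if verdict == "B_unambiguously_worse":
        return (answer_a, answer_b)
    return None  # ambiguous: the policy is never penalized here

# Toy judge for illustration: an answer is "unambiguously worse" only
# when it is less than half the length of the other.
def toy_judge(a, b):
    if len(a) * 2 < len(b):
        return "A_unambiguously_worse"
    if len(b) * 2 < len(a):
        return "B_unambiguously_worse"
    return "ambiguous"
```

The key property is that close calls (where a fallible judge might be wrong) simply drop out of the dataset instead of becoming noisy updates against the intended policy.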

Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI

Let's define:

  • X = thinking about the dynamics of conflict + how they affect our collective ability to achieve things we all want; prioritizing actions based on those considerations
  • Y = thinking about how actions shift the balance of power + how we should be trying to shift the balance of power; prioritizing actions based on those considerations

I'm saying:

  • I think the alignment community traditionally avoids Y but does a lot of X.
  • I think that the factors you listed (including in the parent) are mostly reasons we'd do less Y.
  • So I read you as mostly making a case for "why the alignment community might be inappropriately averse to Y."
  • I think that separating X and Y would make this discussion clearer.
  • I'm personally sympathetic to both activities. I think the altruistic case for marginal X is stronger.

Here are some reasons I perceive you as mostly talking about Y rather than X:

  • You write: "Rather, the concern is that we are underperforming the forces that will actually shape the future, which are driven primarily by the most skilled people who are going around shifting the balance of power." This seems like a good description of Y but not X.
  • You listed "Competitive dynamics as a distraction from alignment." But in my experience, people from the alignment community very often bring up X themselves, both as a topic for research and as a justification for their research (suggesting that in fact they don't regard it as a distraction), and Y derails conversations about alignment perhaps 10x more often than X.
  • You talk about the effects of the PMK post. Explicitly, that post is mostly about Y rather than X, and it is often brought up when someone starts Y-ing on LW. It may also have the effect of discouraging X, but I don't think you made the case for that.
  • You mention the causal link from "fear of being manipulated" to "skill at thinking about power dynamics" which looks very plausible (to me) in the context of Y but looks like kind of a stretch (to me) in the context of X. You say "they find it difficult to think about topics that their friends or co-workers disagree with them about," which again is most relevant to Y (where people frequently disagree about who should have power or how important it is) and not particularly relevant to X (similar to other technical discussions).
  • In your first section you quote Eliezer. But he's not complaining about people thinking about how fights go in a way that might disrupt a sense of shared purpose; he's complaining that Elon Musk is in fact making his decisions in order to change which group gets power, in a way that more obviously disrupts any sense of shared purpose. This seems like complaining about Y, rather than X.
  • More generally, my sense is that X involves thinking about politics and Y mostly is politics, and most of your arguments describe why people might be averse to doing politics rather than discussing it. Of course that can flow backwards (people who don't like doing something may also not like talking about it) but there's certainly a missing link.
  • Relative to the broader community thinking about beneficial AI, the alignment community does an unusually large amount of X and unusually little Y. So prima facie it's more likely that "too little X+Y" is mostly about "too little Y" rather than "too little X." Similarly, when you list corrective influences, they are about X rather than Y.

I care about this distinction because in my experience discussions about alignment of any kind (outside of this community) are under a lot of social pressure to turn into discussions about Y. In the broader academic/industry community it is becoming harder to resist those pressures.

I'm fine with lots of Y happening, I just really want to defend "get better at alignment" as a separate project that may require substantial investment. I'm concerned that equivocating between X and Y will make this harder, because many of the important divisions are between (alignment, X) vs (Y) rather than (alignment) vs (X, Y).

Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI

Speaking for myself, I'm definitely excited about improving cooperation/bargaining/etc., and I think that working on technical problems could be a cost-effective way to help with that. I don't think it's obvious without really getting into the details whether this is more or less leveraged than technical alignment research. To the extent we disagree it's about particular claims/arguments and I don't think disagreements can be easily explained by a high-level aversion to political thinking.

(Clarifying in case I represent a meaningful part of the LW pushback, or in case other people are in a similar boat to me.)

Power dynamics as a blind spot or blurry spot in our collective world-modeling, especially around AI

In this post, I wish to share an opposing concern: that the EA and rationality communities have become systematically biased to ignore multi/multi dynamics, and power dynamics more generally.  

I feel like you are lumping together things like "bargaining in a world with many AIs representing diverse stakeholders" with things like "prioritizing actions on the basis of how they affect the balance of power." I would prefer to keep those things separate.

In the first category: it seems to me that the rationalist and EA communities think about AI-AI bargaining and costs from AI-AI competition much more than typical AI researchers, as measured by e.g. the fraction of time spent thinking about those problems, the fraction of writing that is about those problems, the fraction of stated research priorities that involve those problems, and so on. This is all despite outlier technical beliefs suggesting an unprecedentedly "unipolar" world during the most important parts of AI deployment (which I mostly disagree with).

To the extent that you disagree, I'd be curious to get your sense of the respective fractions, or what evidence leads you to think that the normal AI community thinks more about these issues.

It's a bit hard to do the comparison, but e.g. it looks to me like <2% of NeurIPS 2020 is about multi-AI scenarios (proceedings), while the fraction within the EA/rationalist community looks more like 10-20% to me: discussion about game theory amongst AIs, alignment schemes involving multiple principals, explicit protocols for reaching cooperative arrangements, explorations of bargaining solutions, AI designs that reduce the risk of bargaining failures, AI designs that can provide assurances to other organizations or divide execution, etc. I'm not sure what the number is but would be pretty surprised if you could slice up the EA/rationalist community in a way that put <10% on these categories. Beyond technical work, I think the EA/rationalist communities are nearly as interested in AI governance as they are in technical alignment work (way more governance interest than the broader AI community).

In the second category: I agree that the EA and rationalist communities spend less time on arguments about shifting the balance of power, and especially that they are less likely to prioritize actions on the basis of how they would shift the balance of power (rather than how they would improve humanity's collective knowledge or ability to tackle problems---including bargaining problems!).

For my part, this is an explicit decision to prioritize win-wins and especially reduction in the probability of x-risk scenarios where no one gets what they want. This is a somewhat unpopular perspective in the broader morally-conscious AI community. But it seems like "prioritizing win-wins" is mostly in line with what you are looking for out of multi-agent interactions (and so this brings us back to the first category, which I think is also an example of looking for win-win opportunities).

I think most of the biases you discuss are more directly relevant to the second category. For example, "Politics is the mind-killer" is mostly levied against doing politics, not against thinking about politics as something that someone else might do (and thereby destroy the world). Similarly, when people raise multi-stakeholder concerns as a way that we might end up not aligning ML systems (or cause other catastrophic risks), most people in the alignment community are quick to agree (and indeed they constantly make this argument themselves). They are more resistant when "who" is raised as a more important object-level question, by someone apparently eager to get started on the fighting.

Teaching ML to answer questions honestly instead of predicting human answers

I think they need to be exactly equal. I think this is most likely accomplished by making something like pairwise judgments and only passing judgment when the comparison is a slam dunk (as discussed in section 3). Otherwise the instrumental policy will outperform the intended policy (since it will do the right thing when the simple labels are wrong).

Teaching ML to answer questions honestly instead of predicting human answers

I think "deferring" was a bad word for me to use. I mostly imagine the complex labeling process will just independently label data, and then only include datapoints when there is agreement. That is, you'd just always return the (simple, complex) pair, and is-correct basically just tests whether they are equal.

I said "defer" because one of the data that the complex labeling process uses may be "what a human who was in the room said," and this may sometimes be a really important source of evidence. But that really depends on how you set things up, if you have enough other signals then you would basically always just ignore that one. 

(That said, I think amplification is probably the most important difference between the simple and complex labeling processes, because it's the only scalable way to inject meaningful amounts of extra complexity into the complex labeling process---since the ML system can't predict itself very well, amplification forces it to basically try to win a multiplayer game played with copies of itself, and we hope that's more complicated. And if that's the case then the simple labeling process may as well use all of the data sources, and the difference is just how complex a judgment we are making using those inputs.)

Teaching ML to answer questions honestly instead of predicting human answers

I don't think anyone has a precise general definition of "answer questions honestly" (though I often consider simple examples in which the meaning is clear). But we do all understand how "imitate what a human would say" is completely different (since we all grant the possibility of humans being mistaken or manipulated), and so a strong inductive bias towards "imitate what a human would say" is clearly a problem to be solved even if other concepts are philosophically ambiguous.

Sometimes a model might say something like "No one entered the datacenter" when what they really mean is "Someone entered the datacenter, got control of the hard drives with surveillance logs, and modified them to show no trace of their presence." In this case I'd say the answer is "wrong;" when such wrong answers appear as a critical part of a story about catastrophic failure, I'm tempted to look at why they were wrong to try to find a root cause of failure, and to try to look for algorithms that avoid the failure by not being "wrong" in the same intuitive sense. The mechanism in this post is one way that you can get this kind of wrong answer, namely by imitating human answers, and so that's something we can try to fix.

On my perspective, the only things that are really fundamental are:

  • Algorithms to train ML systems. These are programs you can run.
  • Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with).

Everything else is just a heuristic to help us understand why an algorithm might work or where we might look for a possible failure story.

I think this is one of the upsides of my research methodology---although it requires people to get on the same page about algorithms and about predictions (of the form "X could happen"), we don't need to start on the same page about all the other vague concepts. Instead we can develop shared senses of those concepts over time by grounding them out in concrete algorithms and failure stories. I think this is how shared concepts are developed in most functional fields (e.g. in mathematics you start with a shared sense of what constitutes a valid proof, and then build shared mathematical intuitions on top of that by seeing what successfully predicts your ability to write a proof).

Teaching ML to answer questions honestly instead of predicting human answers

Also, I don't see what this objective has to do with learning a world model.

The idea is to address a particular reason that your learned model would "copy a human" rather than "try to answer the question well." Namely, the model already contains human-predictors, so building extra machinery to answer questions (basically translating between the world model and natural language) would be more inefficient than just using the existing human predictor. The hope is that this alternative loss allows you to use the translation machinery to compress the humans, so that it's not disfavored by the prior.

I don't think it's intrinsically related to learning a world model, it's just an attempt to fix a particular problem.
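The hoped-for accounting can be illustrated with made-up description lengths (all bit counts here are hypothetical numbers for illustration, not from the post): normally the intended head pays full price for the translation machinery, but if the translator also compresses the humans inside the world model, most of that cost is refunded.

```python
# Made-up illustrative bit counts.
world_model = 1_000_000   # bits for the shared world model
translator = 10_000       # bits: world model <-> natural-language answers
human_predictor = 1_000   # bits to point at the existing human-predictor

# Standard prior: the intended head pays full price for the translator,
# while the instrumental head just reuses the human predictor.
intended_cost = world_model + translator
instrumental_cost = world_model + human_predictor

# Hoped-for alternative: the translator helps compress the humans in the
# world model, refunding (say) most of its description cost.
refund = 9_500            # hypothetical compression saving
intended_cost_alt = world_model + translator - refund

assert instrumental_cost < intended_cost      # why the prior favors copying
assert intended_cost_alt < instrumental_cost  # what the fix hopes to achieve
```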

To the extent that there is a problem with the proposed approach---either a reason that this isn't a real problem in the standard approach, or a reason that this proposed approach couldn't address the problem (or would inevitably introduce some other problem)---then I'm interested in that.

Isn't the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂ = argmax(L + prior)?

Why would it be maximized there? Isn't it at least better to make θ₁ = θ₂ = ½ argmax(L + prior)?

And then in the section I'm trying to argue that the final term (the partition function) in the loss means that you can potentially get a lower loss by having it push apart the two heads in such a way that improving the quality of the model pushes them back together. I'm interested in anything that seems wrong in that argument.

(I don't particularly believe this particular formulation is going to work, e.g. because the L2 regularizer pushes θ₁ to adjust each parameter halfway, while the intuitive argument kind of relies on it being arbitrary what you put into θ₁ or θ₂, as it would be under something more like an L1 regularizer. But I'm pretty interested in this general approach.)
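The regularizer claim in the parenthetical above is easy to check numerically: if a parameter θ is split as θ = θ₁ + θ₂, the L2 penalty θ₁² + θ₂² is uniquely minimized at the halfway split, while the L1 penalty |θ₁| + |θ₂| is indifferent among all same-sign splits.

```python
theta = 1.0
splits = [(a, theta - a) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]

l2 = [t1**2 + t2**2 for t1, t2 in splits]
l1 = [abs(t1) + abs(t2) for t1, t2 in splits]

assert min(l2) == l2[2]             # L2 prefers t1 = t2 = theta/2
assert all(v == l1[0] for v in l1)  # L1 cost is identical for every split
```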

Two caveats were: (i) this isn't actually going to end up making any alternative models lower loss, it's just going to level the playing field so that a bunch of potential models have similar loss (rather than an inductive bias in favor of the bad models); (ii) in order for that to be plausible you need to have a stop grad on one of the heads in the computation of C---I maybe shouldn't have pushed that detail so late.

Is anyone else frustrated with 'un-informative' post titles?

(I should probably have given a more informative title and will go edit it now. But definitely the main issue is that it's written for a different venue, and cross-posting can drop context.)
