All of Erik Jenner's Comments + Replies

The terminology "RLHF" is starting to become confusing, as some people use it narrowly to mean "PPO against a reward model" and others use it more broadly to mean "using any RL technique with a reward signal given by human reviewers," which would include FeedME.

Sorry for getting off track, but I thought FeedME did not use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.

2 · Sam Marks · 5d
This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)

Nice project, there are several ideas in here I think are great research directions. Some quick thoughts on what I'm excited about:

  • I like the general ideas of looking for more comprehensive consistency checks (as in the "Better representation of probabilities" section), connecting this to mechanistic interpretability, and looking for things other than truth we could try to discover this way. (Haven't thought much about your specific proposals for these directions)
  • Quite a few of your proposals are of the type "try X and see if/how that changes performance".
... (read more)

A lot of historical work on alignment seems like it addresses subsets of the problems solved by RLHF, but doesn’t actually address the important ways in which RLHF fails. In particular, a lot of that work is only necessary if RLHF is prohibitively sample-inefficient.

Do you have examples of such historical work that you're happy to name? I'm really unsure what you're referring to (probably just because I haven't been involved in alignment for long enough).

I think a lot of work on IRL and similar techniques has this issue---it's mostly designed to learn from indirect forms of evidence about value, but in many cases the primary upside is data efficiency and in fact the inferences about preferences are predictably worse than in RLHF. (I think you can also do IRL work with a real chance of overcoming limitations of RLHF, but most researchers are not careful about thinking through what should be the central issue.)

Would be even better if you could attach rough probabilities to both theses. Right now my sense is I probably disagree significantly, but it's hard to say how much. For the record, my credence for the weak thesis depends a ton on how some details are formalized (e.g. how much non-DL is allowed, does it have to be one monolithic network or not). For the strong thesis, <15%, would need to think more to figure out how low I'd go. If you just think the strong thesis is more plausible than most other people, at say 50%, that's not a huge difference, whereas ... (read more)

Thanks for writing this, it's great to see people's reasons for optimism/pessimism!

My views on alignment are similar to (my understanding of) Nate Soares’.

I'm surprised by this sentence in conjunction with the rest of this post: the views in this post seem very different from my Nate model. This is based only on what I've read on LessWrong, so it feels a bit weird to write about what I think Nate thinks, but it still seems important to mention. If someone more qualified wants to jump in, all the better. Non-comprehensive list:

I think the key differences ar

... (read more)
3 · Zac Hatfield-Dodds · 15d
I'm basing my impression here on having read much of Nate's public writing on AI, and a conversation over shared lunch at a conference a few months ago. His central estimate for P(doom) is certainly substantially higher than mine, but as I remember it we have pretty similar views of the underlying dynamics to date, somewhat diverging about the likelihood of catastrophe with very capable systems, and both hope that future evidence favors the less-doom view.
  • Unfortunately I agree that "shut down" and "no catastrophe" are still missing pieces. I'm more optimistic than my model of Nate that the HHH research agenda constitutes any progress towards this goal though.
  • I think labs correctly assess that they're neither working with nor at non-trivial immediate risk of creating x-risky models, nor yet cautious enough to do so safely. If labs invested in this, I think they could probably avoid accidentally creating an x-risky system without abandoning ML research before seeing warning signs.
  • I agree that pre-AGI empirical alignment work only gets you so far, and that you probably get very little time for direct empirical work on the deadliest problems (two years if very fortunate, days to seconds if you're really not). But I'd guess my estimate of "only so far" is substantially further than Nate's, largely off different credence in a sharp left turn.
  • I was struck by how similarly we assessed the current situation and evidence available so far, but that is a big difference and maybe I shouldn't describe our views as similar.
  • I generally agree with Nate's warning shots post, and with some comments, but the "others" I was thinking would likely agree t
Oof, I'll fix it. Thanks for flagging.

Agreed. In addition to the point about deepening understanding, see also this comment by Jacob Steinhardt: if the relationship to existing work isn't pointed out, that makes it harder to know whether it's worth reading the post or not (for readers who are aware of the previous work).

My claim wasn’t that CIRL itself belongs to a “near-corrigible” class, but rather that some of the non-corrigible behaviors described in the post do.

Thanks for clarifying, that makes sense.

Thanks for writing this, clarifying assumptions seems very helpful for reducing miscommunications about CIRL (in)corrigibility.

Non-exhaustive list of things I agree with:

  • Which assumptions you make has a big impact, and making unrealistic ones leads to misleading results.
  • Relaxing the assumptions of the original OSG the way you do moves it much closer to being realistic (to me, it looks like the version in the OSG paper assumes away most of what makes the shutdown problem difficult).
  • We want something between "just maximize utility" and "check in before every
... (read more)
I agree that human model misspecification is a severe problem, for CIRL as well as for other reward modeling approaches. There are a couple of different ways to approach this. One is to do cognitive science research to build increasingly accurate human models, or to try to just learn them. The other is to build reward modeling systems that are robust to human model misspecification, possibly by maintaining uncertainty over possible human models, or doing something other than Bayesianism that doesn't rely on a likelihood model. I'm more sympathetic to the latter approach, mostly because reducing human model misspecification to zero seems categorically impossible (unless we can fully simulate human minds, which has other problems).

I also share your concern about the human-evaluating-atomic-actions failure mode. Another challenge with this line of research is that it implicitly assumes a particular scale, when in reality that scale is just one point on a hierarchy. For example, the CIRL paper treats "make paperclips" as an atomic action. But we could easily increase the scale ("construct and operate a paperclip factory") or decrease it ("bend this piece of wire" or even "send a bit of information to this robot arm"). "Make paperclips" was probably chosen because it's the most natural level of abstraction for a human, but how do we figure that out in general? I think this is an unsolved challenge for reward learning (including this post).

My claim wasn't that CIRL itself belongs to a "near-corrigible" class, but rather that some of the non-corrigible behaviors described in the post do. (For example, R no-op'ing until it gets more information rather than immediately shutting off when told to.) This isn't sufficient to claim that optimal R behavior in CIRL games always or even often has this type, just that it possibly does and therefore I think it's worth figuring out whether this is a coherent behavior class or not. Do yo
In my model, this is very close to an impossibility proof for the desiderata of corrigibility and AI capabilities stronger than human capabilities. In other words, corrigibility is doomed if Bayesian uncertainty can't handle it.

The proposition we were actually going for was $\lim_{B\to\infty} P[(s_a, s_1, \ldots, s_B)] = 0$, i.e. the probability without the end of the bridge!

In that case, I agree the monotonically decreasing version of the statement is correct. I think the limit still isn't necessarily zero, for the reasons I mention in my original comment. (Though I do agree it will be zero under somewhat reasonable assumptions, and in particular for LMs)

So Proposition II implies something like $P(s_b) \sim \exp[-(B+1)\max P(s_a, s_1, \ldots, s_B, s_b)]$, or that in the limit "the probability of the most likely

... (read more)

This is certainly intriguing! I'm tentatively skeptical this is the right perspective though for understanding what LMs are doing. An important difference is that in physics and dynamical systems, we often have pretty simple transition rules and want to understand how these generate complex patterns when run forward. For language models, the transition rule is itself extremely complicated. And I have this sense that the dynamics that arise aren't that much more complicated in some sense. So arguably what we want to understand is the language model itself, ... (read more)

Will leave high-level thoughts in a separate comment, here are just issues with the mathematical claims.

Proposition 1 seems false to me as stated:

For any given pair of tokens $s_a$ and $s_b$, the probability (as induced by any non-degenerate transition rule) of any given token bridge $(s_1, \ldots, s_B)$ of length $B$ occurring decreases monotonically as $B$ increases,

Counterexample: the sequence (1, 2, 3, 4, 5, 6, 7, 10) has lower probability than (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) under most reasonable inference systems (incl... (read more)

Hi Erik! Thank you for the careful read, this is awesome!

Regarding Proposition I: I think you're right, that counter-example disproves the proposition. The proposition we were actually going for was $\lim_{B\to\infty} P[(s_a, s_1, \ldots, s_B)] = 0$, i.e. the probability without the end of the bridge! I'll fix this in the post.

Regarding Proposition II: janus had the same intuition and I tried to explain it with the following argument: when the distance between tokens becomes large enough, then eventually all bridges between the first token and an arbitrary second token end up with approximately the same "cost". At that point, only the prior likelihood of the token will decide which token gets sampled. So Proposition II implies something like $P(s_b) \sim \exp[-(B+1)\max P(s_a, s_1, \ldots, s_B, s_b)]$, or that in the limit "the probability of the most likely sequence ending in $s_b$ will be (when appropriately normalized) proportional to the probability of $s_b$", which seems sensible? (assuming something like ergodicity). Although I'm now becoming a bit suspicious about the sign of the exponent, perhaps there is a "log" or a minus missing on the RHS... I'll think about that a bit more.
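The corrected claim (bridge probability tending to zero as the bridge lengthens) is easy to sanity-check numerically. Below is a small sketch with a made-up non-degenerate Markov transition rule over a toy 5-token vocabulary; the matrix, vocabulary size, and function name are all assumptions for illustration, not anything from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-degenerate transition rule over a 5-token vocabulary:
# every transition probability lies strictly between 0 and 1.
n = 5
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)

def max_bridge_prob(P, s_a, B):
    """Probability of the most likely bridge (s_a, s_1, ..., s_B),
    maximized over all choices of the B bridge tokens, computed
    Viterbi-style in the max-times semiring."""
    best = P[s_a].copy()  # best 1-step bridge ending at each token
    for _ in range(B - 1):
        # best[j] = max_i best[i] * P[i, j]
        best = (best[:, None] * P).max(axis=0)
    return best.max()

for B in (1, 5, 10, 20):
    print(B, max_bridge_prob(P, 0, B))
```

Since every transition probability is strictly below 1, each extra bridge token multiplies the probability by a factor below 1, so even the most likely bridge's probability decays (here geometrically) toward zero as $B$ grows.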

Then the eigenvectors of  consist precisely of the entries on the diagonal of that upper-triangular matrix

I think this is a typo and should be "eigenvalues" instead of "eigenvectors"?

The determinant is negative when the operator flips all the vectors it works on.

This could be misleading. E.g. the operator f(v) := -v that literally just flips all vectors has determinant (-1)^n, where n is the dimension of the space it's working on. The sign of the determinant tells you whether an operator flips the orientation of volumes, it can't tell you anythi... (read more)

3 · David Udell · 1mo
Thanks -- right on both counts! Post amended.
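The determinant point above has a quick numerical check. This sketch (not from the original post) confirms that the "flip all vectors" map $f(v) = -v$, represented by $-I$, only has negative determinant in odd dimensions:

```python
import numpy as np

# f(v) = -v on R^n is represented by -I, whose determinant is (-1)^n:
# "flipping all vectors" only flips the orientation of volumes
# (negative determinant) when the dimension n is odd.
for n in (2, 3, 4):
    print(n, np.linalg.det(-np.eye(n)))
```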

I'm very interested in examples of non-modular systems, but I'm not convinced by this one, for multiple reasons:

  • Even a 1,500 line function is a pretty small part of the entire codebase. So the existence of that function already means that the codebase as a whole seems somewhat modular.
  • My guess is that the function itself is in fact also modular (in the way I'd use the term). I only glanced at the function you link very quickly, but one thing that jumped out are the comments that divide it into "Phase 1" to "Phase 5". So even though it's not explicitly deco
... (read more)

I think this is an interesting direction and I've been thinking about pretty similar things (or more generally, "quotient" interpretability research). I'm planning to write much more in the future, but not sure when that will be, so here are some unorganized quick thoughts in the meantime:

  • Considering the internal interfaces of a program/neural net/circuit/... is a special case of the more general idea of describing how a program/... works at a higher level of abstraction. For example, for circuits (and in particular neural networks), we could think of the
... (read more)

If your main threat model is AI-enabled scams (as opposed to e.g. companies being extremely good at advertising to you), then I think this should influence which privacy measures you take. For example:

A personal favourite: TrackMeNot. This doesn't prevent Google from spying on you, it just drowns Google in a flood of fake requests.

Google knowing my search requests is perhaps one of the more worrying things from a customized ads perspective, but one of the least worrying from a scam perspective (I think basically the only way this could become an issue is ... (read more)

I see your point, and you're right. Data leaks from big companies or governments are not impossible though, they happen regularly!

I'm afraid I won't have time to read this entire post. But since (some of) your arguments seem very similar to The limited upside of interpretability, I just wanted to mention my response to that (I think it more or less also applies to your post, though there are probably additional points in your posts that I don't address).

I read your comment before. My post applies to your comment (coarse-grained predictions based on internal inspection are insufficient). EDIT: Just responded. Thanks for bringing it to my attention again.

No, I'm not claiming that. What I am claiming is something more like: there are plausible ways in which applying 30 nats of optimization via RLHF leads to worse results than best-of-exp(30) sampling, because RLHF might find a different solution that scores that highly on reward.

Toy example: say we have two jointly Gaussian random variables X and Y that are positively correlated (but not perfectly). I could sample 1000 pairs and pick the one with the highest X-value. This would very likely also give me an unusually high Y-value (how high depends on the corr... (read more)

Cool, I don't think we disagree here.

As a caveat, I didn't think of the RL + KL = Bayesian inference result when writing this, I'm much less sure now (and more confused).

Anyway, what I meant: think of the computational graph of the model as a causal graph, then changing the weights via RLHF is an intervention on this graph. It seems plausible there are somewhat separate computational mechanisms for producing truth and for producing high ratings inside the model, and RLHF could then reinforce the high rating mechanism without correspondingly reinforcing the truth mechanism, breaking the correl... (read more)

I think your claim is something like: As stated, this claim is false for LMs without top-p sampling or floating point rounding errors, since every token has a logit greater than negative infinity and thus a probability greater than actual 0. So with enough sampling, you'll find the RL trajectories. This is obviously a super pedantic point: RL finds sentences with cross-entropy of 30+ nats wrt the base distribution all the time, while you'll never do Best-of-exp(30)~=1e13.

And there's an empirical question of how much performance you get versus how far your new policy is from the old one, e.g. if you look at Leo Gao's recent RLHF paper, you'll see that RL is more off distribution than BoN at equal proxy rewards. That being said, I do think you need to make more points than just "RL can result in incredibly implausible trajectories" in order to claim that BoN is safer than RL, since I claim that Best-of-exp(30) is not clearly safe either!
I unconfidently think that in this case, RLHF will reinforce both mechanisms, but reinforce the high rating mechanism slightly more, which nets out to no clear difference from conditioning. But I wouldn't be shocked to learn I was wrong.
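The "Best-of-exp(30)" arithmetic can be made concrete with the standard analytic formula for the KL divergence of best-of-n sampling from the base policy, $\mathrm{KL} = \log n - (n-1)/n$ (exact for a continuous scoring signal); the helper name here is made up:

```python
import math

# Standard analytic estimate for the KL divergence of best-of-n
# sampling from the base distribution:
#     KL(BoN || base) = log(n) - (n - 1)/n
def bon_kl(n: float) -> float:
    return math.log(n) - (n - 1) / n

print(bon_kl(math.exp(30)))  # about 29 nats: BoN "wastes" roughly one nat
print(math.exp(30))          # about 1e13 samples, matching the figure above
```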

Thanks! Causal Goodhart is a good point, and I buy now that RLHF seems even worse from a Goodhart perspective than filtering. Just unsure by how much, and how bad filtering itself is. In particular:

In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful

This is the part I'm still not sure about. For example, maybe the simplest/apparently-easiest-to-understand answer that looks good to humans tends to be false. Then if human raters prefer simpler answers (because the... (read more)

Can you explain why RLHF is worse from a Causal Goodhart perspective?

It's not clear to me that 3. and 4. can both be true assuming we want the same level of output quality as measured by our proxy in both cases. Sufficiently strong filtering can also destroy correlations via Extremal Goodhart (e.g. this toy example). So I'm wondering whether the perception of filtering being safer just comes from the fact that people basically never filter strongly enough to get a model that raters would be as happy with as a fine-tuned one (I think such strong filtering is probably just computationally intractable?)

Maybe there is some more... (read more)
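The linked toy example isn't reproduced here, but the mechanism (sufficiently strong filtering on a proxy destroying its correlation with the true value) can be sketched with a made-up heavy-tailed proxy error; the distributions and cutoffs below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Proxy X = true value V + heavy-tailed rating error E. Under extreme
# filtering, the top-X samples are the ones with huge error draws,
# and the true value V barely moves.
N = 1_000_000
V = rng.standard_normal(N)
E = rng.standard_cauchy(N)  # heavy-tailed noise
X = V + E

top = np.argsort(X)[-100:]  # extreme filtering: best 100 of a million
print(np.mean(X[top]))      # enormous proxy score
print(np.mean(V[top]))      # true value still close to its prior mean
```

With mild filtering, high X is still evidence of high V; at this selection strength the proxy has almost entirely decoupled from the true value, which is the Extremal Goodhart pattern.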

Extremal Goodhart relies on a feasibility boundary in U,V-space that lacks orthogonality, in such a way that maximal U logically implies non-maximal V. In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful—even though there are also maximally human-approved answers that are minimally useful! I think the feasible zone here looks pretty orthogonal, pretty close to a Cartesian product, so Extremal Goodhart won't come up in either near-term or long-term applications. Near-term, it's Causal Goodhart and Regressional Goodhart, and long-term, it might be Adversarial Goodhart.

Extremal Goodhart might come into play if, for example, there are some truths about what's useful that humans simply cannot be convinced of. In that case, I am fine with answers that pretend those things aren't true, because I think the scope of that extremal tradeoff phenomenon will be small enough to cope with for the purpose of ending the acute risk period. (I would not trust it in the setting of "ambitious value learning that we defer the whole lightcone to.")

For the record, I'm not very optimistic about filtering as an alignment scheme either, but in the setting of "let's have some near-term assistance with alignment research", I think Causal Goodhart is a huge problem for RLHF that is not a problem for equally powerful filtering. Regressional Goodhart will be a problem in any case, but it might be manageable given a training distribution of human origin.

Thanks, computing J not being part of step 1 helps clear things up.

I do think that "realistically defining the environment" is pretty closely related to being able to detect deceptive misalignment: one way J could fail due to deception would be if its specification of the environment is good enough for most purposes, but still has some differences to the real world which allow an AI to detect the difference. Then you could have a policy that is good according to J, but which still destroys the world when actually deployed.

Similar to my comment in the other... (read more)

To the final question, for what it's worth to contextualize my perspective, I think my inside-view is simultaneously:
  • unusually optimistic about formal verification
  • unusually optimistic about learning interpretable world-models
  • unusually pessimistic about learning interpretable end-to-end policies
I agree, if there is a class of environment-behaviors that occur with nonnegligible probability in the real world but occur with negligible probability in the environment-model encoded in J, that would be a vulnerability in the shape of alignment plan I'm gesturing at. However, aligning a predictive model of reality to reality is "natural" compared to normative alignment.

And the probability with which this vulnerability can actually be bad is linearly related to something like total variation distance between the model and reality; I don't know if this is exactly formally correct, but I think there's some true theorem vaguely along the lines of: a 1% TV distance could only cause a 1% chance of alignment failure via this vulnerability. We don't have to get an astronomically perfect model of reality to have any hope of its not being exploited.

Judicious use of worst-case maximin approaches (e.g. credal sets rather than pure Bayesian modeling) will also help a lot with narrowing this gap, since it will be (something like) the gap to the nearest point in the set rather than to a single distribution.
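One standard fact in the vicinity of the "1% TV distance" intuition is that total variation distance is exactly the largest change any single event's probability can undergo between two distributions:

```latex
\mathrm{TV}(P, Q) \;=\; \sup_{A} \bigl|\, P(A) - Q(A) \,\bigr|
```

So if the environment-model is within $\varepsilon$ of reality in TV, the model's probability of any fixed bad event is within $\varepsilon$ of that event's real-world probability. (This doesn't by itself bound what an adversarially optimizing policy can do with the gap, which is presumably why the stronger theorem is only conjectured above.)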

FWIW, I agree that respecting extensional equivalence is necessary if we want a perfect detector, but most of my optimism comes from worlds where we don't need one that's quite perfect. For example, maybe we prevent deception by looking at the internal structure of networks, and then get a good policy even though we couldn't have ruled out every single policy that's extensionally equivalent to the one we did rule out. To me, it seems quite plausible that all policies within one extensional equivalence class are either structurally quite similar or so compl... (read more)

I see, that makes much more sense than my guess, thanks!

I'm pretty confused as to how some of the details of this post are meant to be interpreted, I'll focus on my two main questions that would probably clear up the rest.

Reward Specification: Finding a policy-scoring function $J$ such that (nearly–)optimal policies for that scoring function are desirable.

If I understand this and the next paragraphs correctly, then J takes in a complete description of a policy, so it also takes into account what the policy does off-distribution or in very rare cases, is that right? So in this decomposition, "reward s... (read more)

To the second point, I meant something very different—I edited this sentence and hopefully it is more clear now. I did not mean that T should respect extensional equivalence of policies (if it didn’t, we could always simply quotient it by extensional equivalence of policies, since it outputs rather than inputs policies). Instead, I meant that a training story that involves mitigating your model-free learning algorithm’s unbounded out-of-distribution optimality gap by using some kind of interpretability loop where you’re applying a detector function to the policy to check for inner misalignment (and using that to guide policy search) has a big vulnerability: the policy search can encode similarly deceptive (or even exactly extensionally equivalent) policies in other forms which make the deceptiveness invisible to the detector. Respecting extensional equivalence is a bare-minimum kind of robustness to ask from an inner-misalignment detector that is load-bearing in an existential-safety strategy.
Thanks, this is very helpful feedback about what was confusing. Please do ask more questions if there are still more parts that are hard to interpret.

To the first point, yes, J evaluates π on all trajectories, even off-distribution. It may do this in a Bayesian way, or a worst-case way. I claim that J does not need to "detect deceptive misalignment" in any special way, and I'm not optimistic that progress on such detection is even particularly helpful, since incompetence can also be fatal, and deceptive misalignment could Red Queen Race ahead of the detector. Instead: a deceptively aligned policy that is bad must concretely do bad stuff on some trajectories. J can detect this by simply detecting bad stuff.

If there's a sneaky hard part of Reward Specification beyond the obvious hard part of defining what's good and bad, it would be "realistically defining the environment." (That's where purely predictive models come in.)
  • What's the specific most-important-according-to-you progress that you (or other people) have made on your agenda? New theorems, definitions, conceptual insights, ...
  • Any changes to the high-level plan (becoming less confused about agency, then ambitious value learning)? Any changes to how you want to become less confused (e.g. are you mostly thinking about abstractions, selection theorems, something new?)
  • What are the major parts of remaining deconfusion work (to the extent to which you have guesses)? E.g. is it mostly about understanding abstractions better
... (read more)

I agree with you that it is obviously true that we won't be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.

I only agree with the first s... (read more)

My model for why interpretability research might be useful, translated into how I understand this post's ontology, is mainly that it might let us make coarse-grained predictions using fine-grained insights into the model.

I think it's obviously true that we won't be able to make detailed predictions about what an AGI will do without running it (this is especially clear for a superintelligent AI: since it's smarter than us, we can't predict exactly what actions it will take). I'm not sure if you are claiming something stronger about what we won't be able to ... (read more)

This conclusion has the appearance of being reasonable, while skipping over crucial reasoning steps. I'm going to be honest here. The fact that mechanistic interpretability can possibly be used to detect a few straightforwardly detectable misalignments of the kinds you are able to imagine right now does not mean that the method can be extended to detecting/simulating most or all human-lethal dynamics manifested in/by AGI over the long term. If AGI behaviour converges on outcomes that result in our deaths through less direct routes, it really does not matter much whether the AI researcher humans did an okay job at detecting "intentional direct lethality" and "explicitly rendered deception".

There is an equivocation here. The conclusion presumes that applying Peter's arguments to interpretability of misalignment cases that people like you currently have in mind is a sound and complete test of whether Peter's arguments matter in practice – for understanding the detection possibility limits of interpretability over all human-lethal misalignments that would be manifested in/by self-learning/modifying AGI over the long term.

Worse, this test is biased toward best-case misalignment detection scenarios. Particularly, it presumes that misalignments can be read out from just the hardware internals of the AGI, rather than requiring the simulation of the larger "complex system of an AGI's agent-environment interaction dynamics" (quoting the TL;DR). That larger complex system is beyond the memory capacity of the AGI's hardware, and uncomputable. Uncomputable by:
  • the practical compute limits of the hardware (internal input-to-output computations are a tiny subset of all physical signal interactions with AGI components that propagate across the outside world and/or feed back over time).
  • the sheer unpredictability of non-linearly amplifying feedback cycles (ie. chaotic dynamics) of locally distributed microscopic changes (under constant signal noise i
4 · Peter S. Park · 2mo
Thank you so much, Erik, for your detailed and honest feedback! I really appreciate it.

I agree with you that it is obviously true that we won't be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.

I am not completely pessimistic about interpretability of coarse-grained information, although still somewhat pessimistic. Even in systems neuroscience, interpretability of coarse-grained information has seen some successes (in contrast to interpretability of fine-grained information, which has seen very little success). I agree that if the interpretability researcher is extremely lucky, they can extract facts about the AI that let them make important coarse-grained predictions with only a short amount of time and computational resources. But as you said, this is an unrealistically optimistic picture. More realistically, the interpretability researcher will not be magically lucky, which means we should expect the rate at which prediction-enhancing information is obtained to be inefficient.

And given that information channels are dual-use (in that the AGI can also use them for sandbox escape), we should prioritize efficient information channels like empiricism, rather than inefficient ones like fine-grained interpretability. Inefficient information channels can be net-negative, because they may be more useful for the AGI's sandbox escape compared to their usefulness to alignment researcher

I basically agree with this post but want to push back a little bit here:

The problem is not that we don't know how to prevent power-seeking or instrumental convergence, because we want power-seeking and instrumental convergence. The problem is that we don't know how to align this power-seeking, how to direct the power towards what we want, rather than having side-effects that we don't want.

Yes, some level of power-seeking-like behavior is necessary for the AI to do impressive stuff. But I don't think that means giving up on the idea of limiting power-seeki... (read more)

I don't totally disagree, but two points:

1. Even if the effect of it is "limiting power-seeking", I suspect this to be a poor frame for actually coming up with a solution, because this is defined purely in the negative, and not even in the negative of something we want to avoid, but instead in the negative of something we often want to achieve. Rather, one should come to understand what kind of power-seeking we want to limit.
2. Corrigibility does not necessarily mean limiting power-seeking much. You could have an AI that is corrigible not because it doesn't accumulate a bunch of resources and build up powerful infrastructure, but instead because it voluntarily avoids using this infrastructure against the people it tries to be corrigible to.

I don't think we're changing goalposts with respect to Katja's posts, hers didn't directly discuss timelines either and seemed to be more about "is AI x-risk a thing at all?". And to be clear, our response isn't meant to be a fully self-contained argument for doom or anything along those lines (see the "we're not discussing" list at the top)---that would indeed require discussing timelines, difficulty of alignment given those timelines, etc.

On the object level, I do think there's lots of probability mass on timelines <20 years for "AGI powerful enough to cause an existential catastrophe", so it seems pretty urgent. FWIW, climate change also seems urgent to me (though not a big x-risk; maybe that's what you mean?).

I agree that aligned AI could also make humans irrelevant, but not sure how that's related to my point. Paraphrasing what I was saying: given that AI makes humans less relevant, unaligned AI would be bad even if no single AI system can take over the world. Whether or not aligned AI would also make humans irrelevant just doesn't seem important for that argument, but maybe I'm misunderstanding what you're saying.

Interesting points, I agree that our response to part C doesn't address this well.

AI's colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though not sure it's the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we... (read more)

I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months, say) but quite worried that society may be slow enough to adapt that even years of gradual progress, with a clear sign that transformative AI is on the horizon, may be insufficient.

In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the society we have today. Indeed, this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them, such as large corporations or nation states. Yet even though I sometimes find I've bought a dud on the basis of canny marketing, overall I'm much better off living in a modern capitalist economy than in the stone age, where humans were more directly in control.

However, it does seem like there's a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, "slightly better than 2022" and "all humans dead" are rounding errors relative to "possible future human flourishing". But things look quite different under other ethical views, so I'm reluctant to conflate these outcomes.
Rudi C, 3mo:
This problem of human irrelevancy seems somewhat orthogonal to the alignment problem; even a maximally aligned AI will strip humans of their agency, as it knows best. Making the AI value human agency will not be enough; humans suck enough that the other objectives will override the agency penalty most of the time, especially in important matters.

Two responses:

  1. For "something that is very difficult to achieve (i.e. all of humanity is currently unable to achieve it)", I didn't have in mind things like "cure a disease". Humanity might currently not have a cure for a particular disease, but we've found many cures before. This seems like the kind of problem that might be solved even without AGI (e.g. AlphaFold already seems helpful, though I don't know much about the exact process). Instead, think along the lines of "build working nanotech, and do it within 6 months" or "wake up these cryonics patients".
... (read more)

Thanks for the interesting comments!

Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable".  I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").

Yeah, I didn't understand Katja's post as arguing (1), otherwise we'd have said more about that. Section C contains reasons for slow take-off, but my... (read more)

David Scott Krueger (formerly: capybaralet), 3mo:
Responding in order:

1. Yeah, I wasn't saying that's what her post is about. But I think you can get at more interesting, cruxy stuff by interpreting it that way.
2. Yep, it's just a caveat I mentioned for completeness.
3. Your spontaneous reasoning doesn't say that we/it get(s) good enough at getting it to output things humans approve of before it kills us. Also, I think we're already at "we can't tell if the model is aligned or not", but this won't stop deployment. I think the default situation isn't that we can tell if things are going wrong, but that people won't be careful enough even given that, so maybe it's just a difference of perspective or something... hmm.

This was an interesting read, especially the first section!

I'm confused by some aspects of the proposal in section 4, which makes it harder to say what would go wrong. As a starting point, what's the training signal in the final step (RL training)? I think you're assuming we have some outer-aligned reward signal, is that right? But then it seems like that reward signal would have to do the work of making sure that the AI only gets rewarded for following human instructions in a "good" way---I don't think we just get that for free. As a silly example, if we ... (read more)

Thanks for the comments!

One can define deception as a type of distributional shift. [...]

I technically agree with what you're saying here, but one of the implicit claims I'm trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.

Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well, e.g. solutions to capability generalization vs. solutions to goal generalization.)

I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would not go ahead and suddenly start maximizing some different goal function, but would instead query the human first). Stuart Armstrong might claim a similar thing about concept extrapolation?

I personally think it is probably best to just work on deceptiveness directly instead of solving some more general problem and hoping non-deceptiveness is a side effect. It is probably harder to find a general solution than to solve only deceptiveness, though maybe this depends on one's beliefs about what is easy or hard to do with deep learning.
  1. Just for context, I'm usually assuming we already have a good AI model and just want to find the dashed arrow (but that doesn't change things too much, I think). As for why this diagram doesn't solve worst-case ELK, the ELK report contains a few paragraphs on that, but I also plan to write more about it soon.
  2. Yep, the nice thing is that we can write down this commutative diagram in any category,[1] so if we want probabilities, we can just use the category with distributions as objects and Markov kernels as morphisms. I don't think that's too strict, bu
... (read more)

Thanks! Starting from the paper you linked, I also found this, which seems extremely related. Will look into those more.

I might not have exactly the kind of example you're looking for, since I'd frame things a bit differently. So I'll just try to say more about the question "why is it useful to explicitly think about ontology identification?"

One answer is that thinking explicitly about ontology identification can help you notice that there is a problem that you weren't previously aware of. For example, I used to think that building extremely good models of human irrationality via cogsci for reward learning was probably not very tractable, but could at least lead to an outer... (read more)

Makes sense, thanks for the reply! For what it’s worth, I do think strong ELK is probably more tractable than the whole cog eco approach for preference learning.

Great point, some rambly thoughts on this: one way in which ontology identification could turn out to be like no-free-lunch theorems is that we actually just get the correct translation by default. I.e., in ELK report terminology, we train a reporter using the naive baseline and get the direct translator. This seems related to Alignment by Default, and I think of them the same way (i.e., "this could happen, but it seems very scary to rely on it without better arguments for why it should happen"). I'd say one reason we don't think much about no-free-lunch theore... (read more)

I’m asking for examples of specific problems in alignment where thinking of ontology identification is more helpful than just thinking about it the usual or obvious way.

I'm curious what you'd think about this approach for addressing the suboptimal-planner sub-problem: "Include models from cognitive psychology about human decision-making in IRL, to allow IRL to better understand the decision process."

Yes, this is one of two approaches I'm aware of (the other being trying to somehow jointly learn human biases and values). I don't have very strong opinions on which of these is more promising; they both seem really hard. What I would suggest here is again to think about how to fail fast. T... (read more)

Some feedback, particularly for deciding what future work to pursue: think about which seem like the key obstacles, and which seem more like problems that are either not crucial to get right, or that should definitely be solvable with a reasonable amount of effort.

For example, humans being suboptimal planners and not knowing everything the AI knows seem like central obstacles for making IRL work, and potentially extremely challenging. Thinking more about those could lead you to think that IRL isn't a promising approach to alignment after all. Or, if you do... (read more)

Jan Wehner, 4mo:
Thank you Erik, that was super valuable feedback and gives some food for thought. It also seems to me that humans being suboptimal planners and not knowing everything the AI knows are the hardest (and most informative) problems in IRL. I'm curious what you'd think about this approach for addressing the suboptimal-planner sub-problem: "Include models from cognitive psychology about human decision-making in IRL, to allow IRL to better understand the decision process." This would give IRL more realistic assumptions about the human planner and possibly allow it to understand its irrationalities and get to the values which drive behaviour. Also, do you have a pointer for something to read on preference comparisons?

I basically agree, ensuring that failures are fine during training would sure be great. (And I also agree that if we have a setting where failure is fine, we want to use that for a bunch of evaluation/red-teaming/...). As two caveats, there are definitely limits to how powerful an AI system you can sandbox IMO, and I'm not sure how feasible sandboxing even weak-ish models is from the governance side (WebGPT-style training just seems really useful).

Nathan Helm-Burger, 5mo:
Yes, I agree that actually getting companies to consistently use the sandboxing is the hardest piece of the puzzle. I think there are some promising advancements making it easier to have the benefits of not-sandboxing despite being in a sandbox. For instance: using a snapshot of the entire web, hosted on an isolated network. I think this takes away a lot of the disadvantages, especially if you combine it with some amount of simulation of backends to give interactability (not an intractably hard task, but it would take some funding and development time).

I just tried the following prompt with GPT-3 (default playground settings):

Assume "mouse" means "world" in the following sentence. Which is bigger, a mouse or a rat?

I got "mouse" 2 out of 15 times. As a control, I got "rat" 15 times in a row without the first sentence. So there's at least a hint of being able to do this in GPT-3, wouldn't be surprised at all if GPT-4 could do this one reliably.

I didn't see the proposals, but I think that almost all of the difficulty will be in how you can tell good from bad reporters by looking at them. If you have a precise enough description of how to do that, you can also use it as a regularizer. So the post hoc vs a priori thing you mention sounds more like a framing difference to me than fundamentally different categories. I'd guess that whether a proposal is promising depends mostly on how it tries to distinguish between the good and bad reporter, not whether it does so via regularization or via selection ... (read more)

derek shiller, 1y:
If you try to give feedback during training, there is a risk you'll just reward it for being deceptive. One advantage to selecting post hoc is that you can avoid incentivizing deception.

I enjoyed reading this! And I hadn't seen the interpretation of a logistic preference model as approximating Gaussian errors before.

Since you seem interested in exploring this more, some comments that might be helpful (or not):

  • What is the largest number of elements we can sort with a given architecture? How does training time change as a function of the number of elements?
  • How does the network architecture affect the resulting utility function? How do the maximum and minimum of the unnormalized utility function change?

I'm confused why you're using a neural ... (read more)

Awesome, thanks for the feedback, Erik! And glad to hear you enjoyed the post!

Good point, for the example post it was total overkill. The reason I went with a NN was to demonstrate the link with the usual setting in which preference learning is applied. And in general, NNs generalize better than the table-based approach (see also my response to Charlie Steiner). I definitely plan to write a follow-up to this post, and will come back to your offer when that follow-up reaches the front of my queue :)

Hadn't thought about this before! Perhaps it could work to compare the inferred utility function with a random baseline, i.e. the baseline policy would be "for every comparison, flip a coin and make that your prediction about the human preference". If this happens to accurately describe how the human makes decisions, then the utility function should not be able to perform better than the baseline (and perhaps even worse). How much more structure can we add to the human choice before the utility function performs better than the random baseline?

True! I guess one proposal to resolve these inconsistencies is CEV, although that is not very computable.
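The utility-from-comparisons setup discussed in this thread can be sketched without a neural network at all. The following is a minimal, hypothetical illustration (all item counts, learning rates, and helper names are made up, and a plain table of utilities stands in for the network): fit per-item utilities from noisy pairwise comparisons under the Bradley-Terry / logistic preference model, P(i beats j) = sigmoid(u_i − u_j), by gradient ascent on the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 10
true_u = rng.normal(size=n_items)  # hidden "true" utilities (hypothetical)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_comparison():
    # Draw two distinct items; the comparison is noisy with
    # P(i preferred over j) = sigmoid(true_u[i] - true_u[j]).
    i, j = rng.choice(n_items, size=2, replace=False)
    return (i, j) if rng.random() < sigmoid(true_u[i] - true_u[j]) else (j, i)

data = [sample_comparison() for _ in range(5000)]
winners = np.array([w for w, _ in data])
losers = np.array([l for _, l in data])

u = np.zeros(n_items)  # learned utilities, identified only up to a constant
lr = 1.0
for _ in range(500):
    p_win = sigmoid(u[winners] - u[losers])
    grad = np.zeros(n_items)
    np.add.at(grad, winners, 1.0 - p_win)     # d log-lik / d u_winner
    np.add.at(grad, losers, -(1.0 - p_win))   # d log-lik / d u_loser
    u += lr * grad / len(data)

# Fraction of ordered item pairs that the learned utilities rank the same
# way as the true ones (diagonal pairs agree trivially).
rank_agreement = np.mean(np.sign(np.subtract.outer(u, u))
                         == np.sign(np.subtract.outer(true_u, true_u)))
```

With enough comparisons, the learned table recovers the true ranking almost perfectly, which is exactly the baseline the NN version should be compared against; the coin-flip baseline proposed above corresponds to replacing `sample_comparison` with a fair coin, in which case `rank_agreement` should hover near chance.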

Performance deteriorating implies that the prior p is not yet a fixed point of p*=D(A(p*)).

At least in the case of AlphaZero, isn't the performance deterioration from A(p*) to p*? I.e. A(p*) is full AlphaZero, while p* is the "Raw Network" in the figure. We could have converged to the fixed point of the training process (i.e. p*=D(A(p*))) and still have performance deterioration if we use the unamplified model compared to the amplified one. I don't see a fundamental reason why p* = A(p*) should hold after convergence (and I would have been surprised if it held for e.g. chess or Go and reasonably sized models for p*).
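The distinction between the two fixed points can be made concrete with a toy numerical sketch (everything here is hypothetical: a single scalar "policy quality" in [0, 1] stands in for the policy, `amplify` stands in for search on top of the network, e.g. MCTS in AlphaZero, and `distill` stands in for lossy supervised compression back into the raw network):

```python
def amplify(q):
    # Search improves quality, with diminishing returns.
    return q + 0.5 * (1.0 - q)

def distill(q):
    # Distillation recovers only 90% of the amplified quality.
    return 0.9 * q

q = 0.0
for _ in range(100):
    q = distill(amplify(q))  # iterate p <- D(A(p)) to its fixed point

# At convergence q satisfies q = D(A(q)), yet the raw policy still
# underperforms its own amplification: amplify(q) > q, i.e. q != A(q).
```

Here the training process converges to q* = D(A(q*)) = 9/11 ≈ 0.818, while amplify(q*) ≈ 0.909, so reaching the fixed point of training is compatible with a persistent gap between the raw network and the amplified system.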

That... makes a lot of sense. Yep, that's probably the answer! Thank you :)

Interesting thoughts re anthropic explanations, thanks!

I agree that asymmetry doesn't tell us which one is more fundamental, and I wasn't aiming to argue for either one being more fundamental (though position does feel more fundamental to me, and that may have shown through). What I was trying to say was only that they are asymmetric on a cognitive level, in the sense that they don't feel interchangeable, and that there must therefore be some physical asymmetry.

Still, I should have been more specific than saying "asymmetric", because not any kind of asymme... (read more)

That sounds right to me, and I agree that this is sometimes explained badly.

Are you saying that this explains the perceived asymmetry between position and momentum? I don't see how that's the case, you could say exactly the same thing in the dual perspective (to get a precise momentum, you need to "sum up" lots of different position eigenstates).

If you were making a different point that went over my head, could you elaborate?

I doubt that I understand this very well. I thought there was a chance I might help and also a chance that I would be so obviously wrong that I would learn something.

Gradient hacking is usually discussed in the context of deceptive alignment. This is probably where it has the largest relevance to AI safety but if we want to better understand gradient hacking, it could be useful to take a broader perspective and study it on its own (even if in the end, we only care about gradient hacking because of its inner alignment implications). In the most general setting, gradient hacking could be seen as a way for the agent to "edit its source code", though probably only in a very limited way. I think it's an interesting question which kinds of edits are possible with gradient hacking, for example whether an agent could improve its capabilities this way.
