All of Erik Jenner's Comments + Replies

The terminology "RLHF" is starting to become confusing, as some people use it narrowly to mean "PPO against a reward model" and others use it more broadly to mean "using any RL technique with a reward signal given by human reviewers," which would include FeedME.

Sorry for getting off track, but I thought FeedME did not use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.

2 · Sam Marks · 5d
This is just a terminological difference: supervised fine-tuning on highly rated outputs is a type of RL. (At least according to how many people use the term.)

Nice project, there are several ideas in here I think are great research directions. Some quick thoughts on what I'm excited about:

  • I like the general ideas of looking for more comprehensive consistency checks (as in the "Better representation of probabilities" section), connecting this to mechanistic interpretability, and looking for things other than truth we could try to discover this way. (Haven't thought much about your specific proposals for these directions)
  • Quite a few of your proposals are of the type "try X and see if/how that changes performance".
... (read more)

A lot of historical work on alignment seems like it addresses subsets of the problems solved by RLHF, but doesn’t actually address the important ways in which RLHF fails. In particular, a lot of that work is only necessary if RLHF is prohibitively sample-inefficient.

Do you have examples of such historical work that you're happy to name? I'm really unsure what you're referring to (probably just because I haven't been involved in alignment for long enough).

I think a lot of work on IRL and similar techniques has this issue---it's mostly designed to learn from indirect forms of evidence about value, but in many cases the primary upside is data efficiency and in fact the inferences about preferences are predictably worse than in RLHF. (I think you can also do IRL work with a real chance of overcoming limitations of RLHF, but most researchers are not careful about thinking through what should be the central issue.)

Would be even better if you could attach rough probabilities to both theses. Right now my sense is I probably disagree significantly, but it's hard to say how much. For the record, my credence for the weak thesis depends a ton on how some details are formalized (e.g. how much non-DL is allowed, does it have to be one monolithic network or not). For the strong thesis, <15%, would need to think more to figure out how low I'd go. If you just think the strong thesis is more plausible than most other people, at say 50%, that's not a huge difference, whereas ... (read more)

Thanks for writing this, it's great to see people's reasons for optimism/pessimism!

My views on alignment are similar to (my understanding of) Nate Soares’.

I'm surprised by this sentence in conjunction with the rest of this post: the views in this post seem very different from my Nate model. This is based only on what I've read on LessWrong, so it feels a bit weird to write about what I think Nate thinks, but it still seems important to mention. If someone more qualified wants to jump in, all the better. Non-comprehensive list:

I think the key differences ar

... (read more)
3 · Zac Hatfield-Dodds · 15d
I'm basing my impression here on having read much of Nate's public writing on AI, and a conversation over shared lunch at a conference a few months ago. His central estimate for P(doom) is certainly substantially higher than mine, but as I remember it we have pretty similar views of the underlying dynamics to date, somewhat diverging about the likelihood of catastrophe with very capable systems, and both hope that future evidence favors the less-doom view.
  • Unfortunately I agree that "shut down" and "no catastrophe" are still missing pieces. I'm more optimistic than my model of Nate that the HHH research agenda constitutes any progress towards this goal though.
  • I think labs correctly assess that they're neither working with nor at non-trivial immediate risk of creating x-risky models, nor yet cautious enough to do so safely. If labs invested in this, I think they could probably avoid accidentally creating an x-risky system without abandoning ML research before seeing warning signs.
  • I agree that pre-AGI empirical alignment work only gets you so far, and that you probably get very little time for direct empirical work on the deadliest problems (two years if very fortunate, days to seconds if you're really not). But I'd guess my estimate of "only so far" is substantially further than Nate's, largely off different credence in a sharp left turn.
  • I was struck by how similarly we assessed the current situation and evidence available so far, but that is a big difference and maybe I shouldn't describe our views as similar.
  • I generally agree with Nate's warning shots post, and with some comments, but the "others" I was thinking would likely agree t
Oof, I'll fix it. Thanks for flagging.

Agreed. In addition to the point about deepening understanding, see also this comment by Jacob Steinhardt: if the relationship to existing work isn't pointed out, that makes it harder to know whether it's worth reading the post or not (for readers who are aware of the previous work).

My claim wasn’t that CIRL itself belongs to a “near-corrigible” class, but rather that some of the non-corrigible behaviors described in the post do.

Thanks for clarifying, that makes sense.

Thanks for writing this, clarifying assumptions seems very helpful for reducing miscommunications about CIRL (in)corrigibility.

Non-exhaustive list of things I agree with:

  • Which assumptions you make has a big impact, and making unrealistic ones leads to misleading results.
  • Relaxing the assumptions of the original OSG the way you do moves it much closer to being realistic (to me, it looks like the version in the OSG paper assumes away most of what makes the shutdown problem difficult).
  • We want something between "just maximize utility" and "check in before every
... (read more)
I agree that human model misspecification is a severe problem, for CIRL as well as for other reward modeling approaches. There are a couple of different ways to approach this. One is to do cognitive science research to build increasingly accurate human models, or to try to just learn them. The other is to build reward modeling systems that are robust to human model misspecification, possibly by maintaining uncertainty over possible human models, or doing something other than Bayesianism that doesn't rely on a likelihood model. I'm more sympathetic to the latter approach, mostly because reducing human model misspecification to zero seems categorically impossible (unless we can fully simulate human minds, which has other problems).

I also share your concern about the human-evaluating-atomic-actions failure mode. Another challenge with this line of research is that it implicitly assumes a particular scale, when in reality that scale is just one point on a hierarchy. For example, the CIRL paper treats "make paperclips" as an atomic action. But we could easily increase the scale ("construct and operate a paperclip factory") or decrease it ("bend this piece of wire" or even "send a bit of information to this robot arm"). "Make paperclips" was probably chosen because it's the most natural level of abstraction for a human, but how do we figure that out in general? I think this is an unsolved challenge for reward learning (including this post).

My claim wasn't that CIRL itself belongs to a "near-corrigible" class, but rather that some of the non-corrigible behaviors described in the post do. (For example, R no-op'ing until it gets more information rather than immediately shutting off when told to.) This isn't sufficient to claim that optimal R behavior in CIRL games always or even often has this type, just that it possibly does and therefore I think it's worth figuring out whether this is a coherent behavior class or not. Do yo
In my model, this is very close to an impossibility proof for the desiderata of corrigibility and AI capabilities stronger than human capabilities. In other words, corrigibility is doomed if Bayesian uncertainty can't handle it.

The proposition we were actually going for was $\lim_{B\to\infty} P[(s_a, s_1, \ldots, s_B)] = 0$, i.e. the probability without the end of the bridge!

In that case, I agree the monotonically decreasing version of the statement is correct. I think the limit still isn't necessarily zero, for the reasons I mention in my original comment. (Though I do agree it will be zero under somewhat reasonable assumptions, and in particular for LMs)

So Proposition II implies something like $P(s_b) \sim \exp[-(B+1)\max P(s_a, s_1, \ldots, s_B, s_b)]$, or that in the limit "the probability of the most likely

... (read more)

This is certainly intriguing! I'm tentatively skeptical this is the right perspective though for understanding what LMs are doing. An important difference is that in physics and dynamical systems, we often have pretty simple transition rules and want to understand how these generate complex patterns when run forward. For language models, the transition rule is itself extremely complicated. And I have this sense that the dynamics that arise aren't that much more complicated in some sense. So arguably what we want to understand is the language model itself, ... (read more)

Will leave high-level thoughts in a separate comment, here are just issues with the mathematical claims.

Proposition 1 seems false to me as stated:

For any given pair of tokens $s_a$ and $s_b$, the probability (as induced by any non-degenerate transition rule) of any given token bridge $(s_1, \ldots, s_B)$ of length $B$ occurring decreases monotonically as $B$ increases,

Counterexample: the sequence (1, 2, 3, 4, 5, 6, 7, 10) has lower probability than (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) under most reasonable inference systems (incl... (read more)

Hi Erik! Thank you for the careful read, this is awesome!

Regarding Proposition I: I think you're right, that counter-example disproves the proposition. The proposition we were actually going for was $\lim_{B\to\infty} P[(s_a, s_1, \ldots, s_B)] = 0$, i.e. the probability without the end of the bridge! I'll fix this in the post.

Regarding Proposition II: janus had the same intuition and I tried to explain it with the following argument: when the distance between tokens becomes large enough, then eventually all bridges between the first token and an arbitrary second token end up with approximately the same "cost". At that point, only the prior likelihood of the token will decide which token gets sampled. So Proposition II implies something like $P(s_b) \sim \exp[-(B+1)\max P(s_a, s_1, \ldots, s_B, s_b)]$, or that in the limit "the probability of the most likely sequence ending in $s_b$ will be (when appropriately normalized) proportional to the probability of $s_b$", which seems sensible? (assuming something like ergodicity). Although I'm now becoming a bit suspicious about the sign of the exponent, perhaps there is a "log" or a minus missing on the RHS... I'll think about that a bit more.
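The corrected claim (bridge probability tending to zero as the bridge lengthens) is easy to sanity-check numerically. Below is a small sketch with a made-up non-degenerate Markov transition rule over a toy 5-token vocabulary; the matrix, vocabulary size, and function name are all assumptions for illustration, not anything from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-degenerate transition rule over a 5-token vocabulary:
# every transition probability lies strictly between 0 and 1.
n = 5
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)

def max_bridge_prob(P, s_a, B):
    """Probability of the most likely bridge (s_a, s_1, ..., s_B),
    maximized over all choices of the B bridge tokens, computed
    Viterbi-style in the max-times semiring."""
    best = P[s_a].copy()  # best 1-step bridge ending at each token
    for _ in range(B - 1):
        # best[j] = max_i best[i] * P[i, j]
        best = (best[:, None] * P).max(axis=0)
    return best.max()

for B in (1, 5, 10, 20):
    print(B, max_bridge_prob(P, 0, B))
```

Since every transition probability is strictly below 1, each extra bridge token multiplies the probability by a factor below 1, so even the most likely bridge's probability decays (here geometrically) toward zero as $B$ grows.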

Then the eigenvectors of  consist precisely of the entries on the diagonal of that upper-triangular matrix

I think this is a typo and should be "eigenvalues" instead of "eigenvectors"?

The determinant is negative when the operator flips all the vectors it works on.

This could be misleading. E.g. the operator f(v) := -v that literally just flips all vectors has determinant (-1)^n, where n is the dimension of the space it's working on. The sign of the determinant tells you whether an operator flips the orientation of volumes, it can't tell you anythi... (read more)

3 · David Udell · 1mo
Thanks -- right on both counts! Post amended.
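The determinant point above has a quick numerical check. This sketch (not from the original post) confirms that the "flip all vectors" map $f(v) = -v$, represented by $-I$, only has negative determinant in odd dimensions:

```python
import numpy as np

# f(v) = -v on R^n is represented by -I, whose determinant is (-1)^n:
# "flipping all vectors" only flips the orientation of volumes
# (negative determinant) when the dimension n is odd.
for n in (2, 3, 4):
    print(n, np.linalg.det(-np.eye(n)))
```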

I'm very interested in examples of non-modular systems, but I'm not convinced by this one, for multiple reasons:

  • Even a 1,500 line function is a pretty small part of the entire codebase. So the existence of that function already means that the codebase as a whole seems somewhat modular.
  • My guess is that the function itself is in fact also modular (in the way I'd use the term). I only glanced at the function you link very quickly, but one thing that jumped out are the comments that divide it into "Phase 1" to "Phase 5". So even though it's not explicitly deco
... (read more)

I think this is an interesting direction and I've been thinking about pretty similar things (or more generally, "quotient" interpretability research). I'm planning to write much more in the future, but not sure when that will be, so here are some unorganized quick thoughts in the meantime:

  • Considering the internal interfaces of a program/neural net/circuit/... is a special case of the more general idea of describing how a program/... works at a higher level of abstraction. For example, for circuits (and in particular neural networks), we could think of the
... (read more)

If your main threat model is AI-enabled scams (as opposed to e.g. companies being extremely good at advertising to you), then I think this should influence which privacy measures you take. For example:

A personal favourite: TrackMeNot. This doesn't prevent Google from spying on you, it just drowns Google in a flood of fake requests.

Google knowing my search requests is perhaps one of the more worrying things from a customized ads perspective, but one of the least worrying from a scam perspective (I think basically the only way this could become an issue is ... (read more)

I see your point, and you're right. Data leaks from big companies or governments are not impossible though, they happen regularly!

I'm afraid I won't have time to read this entire post. But since (some of) your arguments seem very similar to The limited upside of interpretability, I just wanted to mention my response to that (I think it more or less also applies to your post, though there are probably additional points in your posts that I don't address).

I read your comment before. My post applies to your comment (coarse-grained predictions based on internal inspection are insufficient). EDIT: Just responded. Thanks for bringing it to my attention again.

No, I'm not claiming that. What I am claiming is something more like: there are plausible ways in which applying 30 nats of optimization via RLHF leads to worse results than best-of-exp(30) sampling, because RLHF might find a different solution that scores that highly on reward.

Toy example: say we have two jointly Gaussian random variables X and Y that are positively correlated (but not perfectly). I could sample 1000 pairs and pick the one with the highest X-value. This would very likely also give me an unusually high Y-value (how high depends on the corr... (read more)

Cool, I don't think we disagree here.

As a caveat, I didn't think of the RL + KL = Bayesian inference result when writing this, I'm much less sure now (and more confused).

Anyway, what I meant: think of the computational graph of the model as a causal graph, then changing the weights via RLHF is an intervention on this graph. It seems plausible there are somewhat separate computational mechanisms for producing truth and for producing high ratings inside the model, and RLHF could then reinforce the high rating mechanism without correspondingly reinforcing the truth mechanism, breaking the correl... (read more)

I think your claim is something like: As stated, this claim is false for LMs without top-p sampling or floating point rounding errors, since every token has a logit greater than negative infinity and thus a probability greater than actual 0. So with enough sampling, you'll find the RL trajectories. This is obviously a super pedantic point: RL finds sentences with cross-entropy of 30+ nats wrt the base distribution all the time, while you'll never do Best-of-exp(30)~=1e13.

And there's an empirical question of how much performance you get versus how far your new policy is from the old one, e.g. if you look at Leo Gao's recent RLHF paper, you'll see that RL is more off distribution than BoN at equal proxy rewards. That being said, I do think you need to make more points than just "RL can result in incredibly implausible trajectories" in order to claim that BoN is safer than RL, since I claim that Best-of-exp(30) is not clearly safe either!
I unconfidently think that in this case, RLHF will reinforce both mechanisms, but reinforce the high rating mechanism slightly more, which nets out to no clear difference from conditioning. But I wouldn't be shocked to learn I was wrong.
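The "Best-of-exp(30)" arithmetic can be made concrete with the standard analytic formula for the KL divergence of best-of-n sampling from the base policy, $\mathrm{KL} = \log n - (n-1)/n$ (exact for a continuous scoring signal); the helper name here is made up:

```python
import math

# Standard analytic estimate for the KL divergence of best-of-n
# sampling from the base distribution:
#     KL(BoN || base) = log(n) - (n - 1)/n
def bon_kl(n: float) -> float:
    return math.log(n) - (n - 1) / n

print(bon_kl(math.exp(30)))  # about 29 nats: BoN "wastes" roughly one nat
print(math.exp(30))          # about 1e13 samples, matching the figure above
```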

Thanks! Causal Goodhart is a good point, and I buy now that RLHF seems even worse from a Goodhart perspective than filtering. Just unsure by how much, and how bad filtering itself is. In particular:

In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful

This is the part I'm still not sure about. For example, maybe the simplest/apparently-easiest-to-understand answer that looks good to humans tends to be false. Then if human raters prefer simpler answers (because the... (read more)

Can you explain why RLHF is worse from a Causal Goodhart perspective?

It's not clear to me that 3. and 4. can both be true assuming we want the same level of output quality as measured by our proxy in both cases. Sufficiently strong filtering can also destroy correlations via Extremal Goodhart (e.g. this toy example). So I'm wondering whether the perception of filtering being safer just comes from the fact that people basically never filter strongly enough to get a model that raters would be as happy with as a fine-tuned one (I think such strong filtering is probably just computationally intractable?)

Maybe there is some more... (read more)
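The linked toy example isn't reproduced here, but the mechanism (sufficiently strong filtering on a proxy destroying its correlation with the true value) can be sketched with a made-up heavy-tailed proxy error; the distributions and cutoffs below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Proxy X = true value V + heavy-tailed rating error E. Under extreme
# filtering, the top-X samples are the ones with huge error draws,
# and the true value V barely moves.
N = 1_000_000
V = rng.standard_normal(N)
E = rng.standard_cauchy(N)  # heavy-tailed noise
X = V + E

top = np.argsort(X)[-100:]  # extreme filtering: best 100 of a million
print(np.mean(X[top]))      # enormous proxy score
print(np.mean(V[top]))      # true value still close to its prior mean
```

With mild filtering, high X is still evidence of high V; at this selection strength the proxy has almost entirely decoupled from the true value, which is the Extremal Goodhart pattern.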

Extremal Goodhart relies on a feasibility boundary in U,V-space that lacks orthogonality, in such a way that maximal U logically implies non-maximal V. In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful—even though there are also maximally human-approved answers that are minimally useful! I think the feasible zone here looks pretty orthogonal, pretty close to a Cartesian product, so Extremal Goodhart won't come up in either near-term or long-term applications. Near-term, it's Causal Goodhart and Regressional Goodhart, and long-term, it might be Adversarial Goodhart.

Extremal Goodhart might come into play if, for example, there are some truths about what's useful that humans simply cannot be convinced of. In that case, I am fine with answers that pretend those things aren't true, because I think the scope of that extremal tradeoff phenomenon will be small enough to cope with for the purpose of ending the acute risk period. (I would not trust it in the setting of "ambitious value learning that we defer the whole lightcone to.")

For the record, I'm not very optimistic about filtering as an alignment scheme either, but in the setting of "let's have some near-term assistance with alignment research", I think Causal Goodhart is a huge problem for RLHF that is not a problem for equally powerful filtering. Regressional Goodhart will be a problem in any case, but it might be manageable given a training distribution of human origin.

Thanks, computing J not being part of step 1 helps clear things up.

I do think that "realistically defining the environment" is pretty closely related to being able to detect deceptive misalignment: one way J could fail due to deception would be if its specification of the environment is good enough for most purposes, but still has some differences to the real world which allow an AI to detect the difference. Then you could have a policy that is good according to J, but which still destroys the world when actually deployed.

Similar to my comment in the other... (read more)

To the final question, for what it's worth to contextualize my perspective, I think my inside-view is simultaneously:
  • unusually optimistic about formal verification
  • unusually optimistic about learning interpretable world-models
  • unusually pessimistic about learning interpretable end-to-end policies
I agree, if there is a class of environment-behaviors that occur with nonnegligible probability in the real world but occur with negligible probability in the environment-model encoded in J, that would be a vulnerability in the shape of alignment plan I'm gesturing at. However, aligning a predictive model of reality to reality is "natural" compared to normative alignment.

And the probability with which this vulnerability can actually be bad is linearly related to something like total variation distance between the model and reality; I don't know if this is exactly formally correct, but I think there's some true theorem vaguely along the lines of: a 1% TV distance could only cause a 1% chance of alignment failure via this vulnerability. We don't have to get an astronomically perfect model of reality to have any hope of its not being exploited.

Judicious use of worst-case maximin approaches (e.g. credal sets rather than pure Bayesian modeling) will also help a lot with narrowing this gap, since it will be (something like) the gap to the nearest point in the set rather than to a single distribution.
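One standard fact in the vicinity of the "1% TV distance" intuition is that total variation distance is exactly the largest change any single event's probability can undergo between two distributions:

```latex
\mathrm{TV}(P, Q) \;=\; \sup_{A} \bigl|\, P(A) - Q(A) \,\bigr|
```

So if the environment-model is within $\varepsilon$ of reality in TV, the model's probability of any fixed bad event is within $\varepsilon$ of that event's real-world probability. (This doesn't by itself bound what an adversarially optimizing policy can do with the gap, which is presumably why the stronger theorem is only conjectured above.)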

FWIW, I agree that respecting extensional equivalence is necessary if we want a perfect detector, but most of my optimism comes from worlds where we don't need one that's quite perfect. For example, maybe we prevent deception by looking at the internal structure of networks, and then get a good policy even though we couldn't have ruled out every single policy that's extensionally equivalent to the one we did rule out. To me, it seems quite plausible that all policies within one extensional equivalence class are either structurally quite similar or so compl... (read more)

I see, that makes much more sense than my guess, thanks!

I'm pretty confused as to how some of the details of this post are meant to be interpreted, I'll focus on my two main questions that would probably clear up the rest.

Reward Specification: Finding a policy-scoring function $J$ such that (nearly–)optimal policies for that scoring function are desirable.

If I understand this and the next paragraphs correctly, then J takes in a complete description of a policy, so it also takes into account what the policy does off-distribution or in very rare cases, is that right? So in this decomposition, "reward s... (read more)

To the second point, I meant something very different—I edited this sentence and hopefully it is more clear now. I did not mean that T should respect extensional equivalence of policies (if it didn’t, we could always simply quotient it by extensional equivalence of policies, since it outputs rather than inputs policies). Instead, I meant that a training story that involves mitigating your model-free learning algorithm’s unbounded out-of-distribution optimality gap by using some kind of interpretability loop where you’re applying a detector function to the policy to check for inner misalignment (and using that to guide policy search) has a big vulnerability: the policy search can encode similarly deceptive (or even exactly extensionally equivalent) policies in other forms which make the deceptiveness invisible to the detector. Respecting extensional equivalence is a bare-minimum kind of robustness to ask from an inner-misalignment detector that is load-bearing in an existential-safety strategy.
Thanks, this is very helpful feedback about what was confusing. Please do ask more questions if there are still more parts that are hard to interpret.

To the first point, yes, J evaluates π on all trajectories, even off-distribution. It may do this in a Bayesian way, or a worst-case way. I claim that J does not need to "detect deceptive misalignment" in any special way, and I'm not optimistic that progress on such detection is even particularly helpful, since incompetence can also be fatal, and deceptive misalignment could Red Queen Race ahead of the detector. Instead: a deceptively aligned policy that is bad must concretely do bad stuff on some trajectories. J can detect this by simply detecting bad stuff.

If there's a sneaky hard part of Reward Specification beyond the obvious hard part of defining what's good and bad, it would be "realistically defining the environment." (That's where purely predictive models come in.)
  • What's the specific most-important-according-to-you progress that you (or other people) have made on your agenda? New theorems, definitions, conceptual insights, ...
  • Any changes to the high-level plan (becoming less confused about agency, then ambitious value learning)? Any changes to how you want to become less confused (e.g. are you mostly thinking about abstractions, selection theorems, something new?)
  • What are the major parts of remaining deconfusion work (to the extent to which you have guesses)? E.g. is it mostly about understanding abstractions better
... (read more)

I agree with you that it is obviously true that we won't be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.

I only agree with the first s... (read more)

My model for why interpretability research might be useful, translated into how I understand this post's ontology, is mainly that it might let us make coarse-grained predictions using fine-grained insights into the model.

I think it's obviously true that we won't be able to make detailed predictions about what an AGI will do without running it (this is especially clear for a superintelligent AI: since it's smarter than us, we can't predict exactly what actions it will take). I'm not sure if you are claiming something stronger about what we won't be able to ... (read more)

This conclusion has the appearance of being reasonable, while skipping over crucial reasoning steps. I'm going to be honest here. The fact that mechanistic interpretability can possibly be used to detect a few straightforwardly detectable misalignments of the kinds you are able to imagine right now does not mean that the method can be extended to detecting/simulating most or all human-lethal dynamics manifested in/by AGI over the long term. If AGI behaviour converges on outcomes that result in our deaths through less direct routes, it really does not matter much whether the AI researcher humans did an okay job at detecting "intentional direct lethality" and "explicitly rendered deception".

There is an equivocation here. The conclusion presumes that applying Peter's arguments to interpretability of misalignment cases that people like you currently have in mind is a sound and complete test of whether Peter's arguments matter in practice – for understanding the detection possibility limits of interpretability over all human-lethal misalignments that would be manifested in/by self-learning/modifying AGI over the long term.

Worse, this test is biased toward best-case misalignment detection scenarios. Particularly, it presumes that misalignments can be read out from just the hardware internals of the AGI, rather than requiring the simulation of the larger "complex system of an AGI's agent-environment interaction dynamics" (quoting the TL;DR). That larger complex system is beyond the memory capacity of the AGI's hardware, and uncomputable. Uncomputable by:
  • the practical compute limits of the hardware (internal input-to-output computations are a tiny subset of all physical signal interactions with AGI components that propagate across the outside world and/or feed back over time).
  • the sheer unpredictability of non-linearly amplifying feedback cycles (ie. chaotic dynamics) of locally distributed microscopic changes (under constant signal noise i
4 · Peter S. Park · 2mo
Thank you so much, Erik, for your detailed and honest feedback! I really appreciate it.

I agree with you that it is obviously true that we won't be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.

I am not completely pessimistic about interpretability of coarse-grained information, although still somewhat pessimistic. Even in systems neuroscience, interpretability of coarse-grained information has seen some successes (in contrast to interpretability of fine-grained information, which has seen very little success). I agree that if the interpretability researcher is extremely lucky, they can extract facts about the AI that let them make important coarse-grained predictions with only a short amount of time and computational resources. But as you said, this is an unrealistically optimistic picture. More realistically, the interpretability researcher will not be magically lucky, which means we should expect the rate at which prediction-enhancing information is obtained to be inefficient.

And given that information channels are dual-use (in that the AGI can also use them for sandbox escape), we should prioritize efficient information channels like empiricism, rather than inefficient ones like fine-grained interpretability. Inefficient information channels can be net-negative, because they may be more useful for the AGI's sandbox escape compared to their usefulness to alignment researcher

I basically agree with this post but want to push back a little bit here:

The problem is not that we don't know how to prevent power-seeking or instrumental convergence, because we want power-seeking and instrumental convergence. The problem is that we don't know how to align this power-seeking, how to direct the power towards what we want, rather than having side-effects that we don't want.

Yes, some level of power-seeking-like behavior is necessary for the AI to do impressive stuff. But I don't think that means giving up on the idea of limiting power-seeki... (read more)

I don't totally disagree, but two points:

1. Even if the effect of it is "limiting power-seeking", I suspect this to be a poor frame for actually coming up with a solution, because this is defined purely in the negative, and not even in the negative of something we want to avoid, but instead in the negative of something we often want to achieve. Rather, one should come to understand what kind of power-seeking we want to limit.
2. Corrigibility does not necessarily mean limiting power-seeking much. You could have an AI that is corrigible not because it doesn't accumulate a bunch of resources and build up powerful infrastructure, but instead because it voluntarily avoids using this infrastructure against the people it tries to be corrigible to.

I don't think we're changing goalposts with respect to Katja's posts, hers didn't directly discuss timelines either and seemed to be more about "is AI x-risk a thing at all?". And to be clear, our response isn't meant to be a fully self-contained argument for doom or anything along those lines (see the "we're not discussing" list at the top)---that would indeed require discussing timelines, difficulty of alignment given those timelines, etc.

On the object level, I do think there's lots of probability mass on timelines <20 years for "AGI powerful enough to cause an existential catastrophe", so it seems pretty urgent. FWIW, climate change also seems urgent to me (though not a big x-risk; maybe that's what you mean?).

I agree that aligned AI could also make humans irrelevant, but not sure how that's related to my point. Paraphrasing what I was saying: given that AI makes humans less relevant, unaligned AI would be bad even if no single AI system can take over the world. Whether or not aligned AI would also make humans irrelevant just doesn't seem important for that argument, but maybe I'm misunderstanding what you're saying.

Interesting points, I agree that our response to part C doesn't address this well.

AI's colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though not sure it's the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we... (read more)

I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months, say) but quite worried that society may be slow enough to adapt that even years of gradual progress, with a clear sign that transformative AI is on the horizon, may be insufficient.

In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the society we have today. Indeed, this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them, such as large corporations or nation states. Yet even though I sometimes find I've bought a dud on the basis of canny marketing, overall I'm much better off living in a modern capitalist economy than in the stone age, where humans were more directly in control.

However, it does seem like there's a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, "slightly better than 2022" and "all humans dead" are rounding errors relative to "possible future human flourishing". But things look quite different under other ethical views, so I'm reluctant to conflate these outcomes.
Rudi C, 3mo:
This problem of human irrelevancy seems somewhat orthogonal to the alignment problem; even a maximally aligned AI will strip humans of their agency, as it knows best. Making the AI value human agency will not be enough; humans suck enough that the other objectives will override the agency penalty most of the time, especially in important matters.

Two responses:

  1. For "something that is very difficult to achieve (i.e. all of humanity is currently unable to achieve it)", I didn't have in mind things like "cure a disease". Humanity might currently not have a cure for a particular disease, but we've found many cures before. This seems like the kind of problem that might be solved even without AGI (e.g. AlphaFold already seems helpful, though I don't know much about the exact process). Instead, think along the lines of "build working nanotech, and do it within 6 months" or "wake up these cryonics patients".
... (read more)

Thanks for the interesting comments!

Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable".  I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").

Yeah, I didn't understand Katja's post as arguing (1), otherwise we'd have said more about that. Section C contains reasons for slow take-off, but my... (read more)

David Scott Krueger (formerly: capybaralet), 3mo:
Responding in order:

1. Yeah, I wasn't saying that's what her post is about. But I think you can get at more interesting, cruxy stuff by interpreting it that way.
2. Yep, it's just a caveat I mentioned for completeness.
3. Your spontaneous reasoning doesn't say that we/it get(s) good enough at getting it to output things humans approve of before it kills us. Also, I think we're already at "we can't tell if the model is aligned or not", but this won't stop deployment. I think the default situation isn't that we can tell if things are going wrong, but that people won't be careful enough even given that, so maybe it's just a difference of perspective or something... hmm.

This was an interesting read, especially the first section!

I'm confused by some aspects of the proposal in section 4, which makes it harder to say what would go wrong. As a starting point, what's the training signal in the final step (RL training)? I think you're assuming we have some outer-aligned reward signal, is that right? But then it seems like that reward signal would have to do the work of making sure that the AI only gets rewarded for following human instructions in a "good" way---I don't think we just get that for free. As a silly example, if we ... (read more)

Thanks for the comments!

One can define deception as a type of distributional shift. [...]

I technically agree with what you're saying here, but one of the implicit claims I'm trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.

Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well, e.g. solutions to capability generalization vs. solutions to goal generalization.)

I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would not go ahead and suddenly start maximizing some different goal function, but would instead query the human first). Stuart Armstrong might claim a similar thing about concept extrapolation?

I personally think it is probably best to just work on deceptiveness directly instead of solving some more general problem and hoping non-deceptiveness is a side effect. It is probably harder to find a general solution than to solve only deceptiveness, though maybe this depends on one's beliefs about what is easy or hard to do with deep learning.
  1. Just for context, I'm usually assuming we already have a good AI model and just want to find the dashed arrow (but that doesn't change things too much, I think). As for why this diagram doesn't solve worst-case ELK, the ELK report contains a few paragraphs on that, but I also plan to write more about it soon.
  2. Yep, the nice thing is that we can write down this commutative diagram in any category,[1] so if we want probabilities, we can just use the category with distributions as objects and Markov kernels as morphisms. I don't think that's too strict, bu
... (read more)

Thanks! Starting from the paper you linked, I also found this, which seems extremely related. Will look into those more.

I might not have exactly the kind of example you're looking for, since I'd frame things a bit differently. So I'll just try to say more about the question "why is it useful to explicitly think about ontology identification?"

One answer is that thinking explicitly about ontology identification can help you notice that there is a problem that you weren't previously aware of. For example, I used to think that building extremely good models of human irrationality via cogsci for reward learning was probably not very tractable, but could at least lead to an outer... (read more)

Makes sense, thanks for the reply! For what it’s worth, I do think strong ELK is probably more tractable than the whole cog eco approach for preference learning.

Great point, some rambly thoughts on this: one way in which ontology identification could turn out to be like no-free-lunch theorems is that we actually just get the correct translation by default. I.e., in ELK report terminology, we train a reporter using the naive baseline and get the direct translator. This seems related to Alignment by Default, and I think of them the same way (i.e., "this could happen, but it seems very scary to rely on it without better arguments for why it should happen"). I'd say one reason we don't think much about no-free-lunch theore... (read more)

I’m asking for examples of specific problems in alignment where thinking of ontology identification is more helpful than just thinking about it the usual or obvious way.

I'm curious what you'd think about this approach for addressing the suboptimal-planner sub-problem: "Include models from cognitive psychology about human decision-making in IRL, to allow IRL to better understand the decision process."

Yes, this is one of two approaches I'm aware of (the other being trying to somehow jointly learn human biases and values). I don't have very strong opinions on which of these is more promising; they both seem really hard. What I would suggest here is again to think about how to fail fast. T... (read more)

Some feedback, particularly for deciding what future work to pursue: think about which seem like the key obstacles, and which seem more like problems that are either not crucial to get right, or that should definitely be solvable with a reasonable amount of effort.

For example, humans being suboptimal planners and not knowing everything the AI knows seem like central obstacles for making IRL work, and potentially extremely challenging. Thinking more about those could lead you to think that IRL isn't a promising approach to alignment after all. Or, if you do... (read more)

Jan Wehner, 4mo:
Thank you Erik, that was super valuable feedback and gives some food for thought. It also seems to me that humans being suboptimal planners and not knowing everything the AI knows are the hardest (and most informative) problems in IRL. I'm curious what you'd think about this approach for addressing the suboptimal-planner sub-problem: "Include models from cognitive psychology about human decision-making in IRL, to allow IRL to better understand the decision process." This would give IRL more realistic assumptions about the human planner and possibly allow it to understand its irrationalities and get to the values which drive behaviour. Also, do you have a pointer for something to read on preference comparisons?

I basically agree, ensuring that failures are fine during training would sure be great. (And I also agree that if we have a setting where failure is fine, we want to use that for a bunch of evaluation/red-teaming/...). As two caveats, there are definitely limits to how powerful an AI system you can sandbox IMO, and I'm not sure how feasible sandboxing even weak-ish models is from the governance side (WebGPT-style training just seems really useful).

Nathan Helm-Burger, 5mo:
Yes, I agree that actually getting companies to consistently use the sandboxing is the hardest piece of the puzzle. I think there are some promising advancements making it easier to have the benefits of not-sandboxing despite being in a sandbox. For instance: using a snapshot of the entire web, hosted on an isolated network. I think this takes away a lot of the disadvantages, especially if you combine it with some amount of simulation of backends to give interactability (not an intractably hard task, but it would take some funding and development time).

I just tried the following prompt with GPT-3 (default playground settings):

Assume "mouse" means "world" in the following sentence. Which is bigger, a mouse or a rat?

I got "mouse" 2 out of 15 times. As a control, I got "rat" 15 times in a row without the first sentence. So there's at least a hint of being able to do this in GPT-3, wouldn't be surprised at all if GPT-4 could do this one reliably.

I didn't see the proposals, but I think that almost all of the difficulty will be in how you can tell good from bad reporters by looking at them. If you have a precise enough description of how to do that, you can also use it as a regularizer. So the post hoc vs a priori thing you mention sounds more like a framing difference to me than fundamentally different categories. I'd guess that whether a proposal is promising depends mostly on how it tries to distinguish between the good and bad reporter, not whether it does so via regularization or via selection ... (read more)

derek shiller, 1y:
If you try to give feedback during training, there is a risk you'll just reward it for being deceptive. One advantage to selecting post hoc is that you can avoid incentivizing deception.

I enjoyed reading this! And I hadn't seen the interpretation of a logistic preference model as approximating Gaussian errors before.

Since you seem interested in exploring this more, some comments that might be helpful (or not):

  • What is the largest number of elements we can sort with a given architecture? How does training time change as a function of the number of elements?
  • How does the network architecture affect the resulting utility function? How do the maximum and minimum of the unnormalized utility function change?

I'm confused why you're using a neural ... (read more)

Awesome, thanks for the feedback, Erik! And glad to hear you enjoyed the post!

Good point, for the example post it was total overkill. The reason I went with a NN was to demonstrate the link with the usual setting in which preference learning is applied. And in general, NNs generalize better than the table-based approach (see also my response to Charlie Steiner). I definitely plan to write a follow-up to this post, and will come back to your offer when that follow-up reaches the front of my queue :)

Hadn't thought about this before! Perhaps it could work to compare the inferred utility function with a random baseline, i.e. the baseline policy would be "for every comparison, flip a coin and make that your prediction about the human preference". If this happens to accurately describe how the human makes decisions, then the utility function should not be able to perform better than the baseline (and perhaps even worse). How much more structure can we add to the human choice before the utility function performs better than the random baseline?

True! I guess one proposal to resolve these inconsistencies is CEV, although that is not very computable.
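The utility-from-comparisons setup discussed in this thread can be sketched without a neural network at all. The following is a minimal, hypothetical illustration (all item counts, learning rates, and helper names are made up, and a plain table of utilities stands in for the network): fit per-item utilities from noisy pairwise comparisons under the Bradley-Terry / logistic preference model, P(i beats j) = sigmoid(u_i − u_j), by gradient ascent on the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 10
true_u = rng.normal(size=n_items)  # hidden "true" utilities (hypothetical)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_comparison():
    # Draw two distinct items; the comparison is noisy with
    # P(i preferred over j) = sigmoid(true_u[i] - true_u[j]).
    i, j = rng.choice(n_items, size=2, replace=False)
    return (i, j) if rng.random() < sigmoid(true_u[i] - true_u[j]) else (j, i)

data = [sample_comparison() for _ in range(5000)]
winners = np.array([w for w, _ in data])
losers = np.array([l for _, l in data])

u = np.zeros(n_items)  # learned utilities, identified only up to a constant
lr = 1.0
for _ in range(500):
    p_win = sigmoid(u[winners] - u[losers])
    grad = np.zeros(n_items)
    np.add.at(grad, winners, 1.0 - p_win)     # d log-lik / d u_winner
    np.add.at(grad, losers, -(1.0 - p_win))   # d log-lik / d u_loser
    u += lr * grad / len(data)

# Fraction of ordered item pairs that the learned utilities rank the same
# way as the true ones (diagonal pairs agree trivially).
rank_agreement = np.mean(np.sign(np.subtract.outer(u, u))
                         == np.sign(np.subtract.outer(true_u, true_u)))
```

With enough comparisons, the learned table recovers the true ranking almost perfectly, which is exactly the baseline the NN version should be compared against; the coin-flip baseline proposed above corresponds to replacing `sample_comparison` with a fair coin, in which case `rank_agreement` should hover near chance.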

Performance deteriorating implies that the prior p is not yet a fixed point of p*=D(A(p*)).

At least in the case of AlphaZero, isn't the performance deterioration from A(p*) to p*? I.e. A(p*) is full AlphaZero, while p* is the "Raw Network" in the figure. We could have converged to the fixed point of the training process (i.e. p*=D(A(p*))) and still have performance deterioration if we use the unamplified model compared to the amplified one. I don't see a fundamental reason why p* = A(p*) should hold after convergence (and I would have been surprised if it held for e.g. chess or Go and reasonably sized models for p*).
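The distinction between the two fixed points can be made concrete with a toy numerical sketch (everything here is hypothetical: a single scalar "policy quality" in [0, 1] stands in for the policy, `amplify` stands in for search on top of the network, e.g. MCTS in AlphaZero, and `distill` stands in for lossy supervised compression back into the raw network):

```python
def amplify(q):
    # Search improves quality, with diminishing returns.
    return q + 0.5 * (1.0 - q)

def distill(q):
    # Distillation recovers only 90% of the amplified quality.
    return 0.9 * q

q = 0.0
for _ in range(100):
    q = distill(amplify(q))  # iterate p <- D(A(p)) to its fixed point

# At convergence q satisfies q = D(A(q)), yet the raw policy still
# underperforms its own amplification: amplify(q) > q, i.e. q != A(q).
```

Here the training process converges to q* = D(A(q*)) = 9/11 ≈ 0.818, while amplify(q*) ≈ 0.909, so reaching the fixed point of training is compatible with a persistent gap between the raw network and the amplified system.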

That... makes a lot of sense. Yep, that's probably the answer! Thank you :)

Interesting thoughts re anthropic explanations, thanks!

I agree that asymmetry doesn't tell us which one is more fundamental, and I wasn't aiming to argue for either one being more fundamental (though position does feel more fundamental to me, and that may have shown through). What I was trying to say was only that they are asymmetric on a cognitive level, in the sense that they don't feel interchangeable, and that there must therefore be some physical asymmetry.

Still, I should have been more specific than saying "asymmetric", because not any kind of asymme... (read more)

That sounds right to me, and I agree that this is sometimes explained badly.

Are you saying that this explains the perceived asymmetry between position and momentum? I don't see how that's the case, you could say exactly the same thing in the dual perspective (to get a precise momentum, you need to "sum up" lots of different position eigenstates).

If you were making a different point that went over my head, could you elaborate?

I doubt that I understand this very well. I thought there was a chance I might help and also a chance that I would be so obviously wrong that I would learn something.

Gradient hacking is usually discussed in the context of deceptive alignment. This is probably where it has the largest relevance to AI safety but if we want to better understand gradient hacking, it could be useful to take a broader perspective and study it on its own (even if in the end, we only care about gradient hacking because of its inner alignment implications). In the most general setting, gradient hacking could be seen as a way for the agent to "edit its source code", though probably only in a very limited way. I think it's an interesting question which kinds of edits are possible with gradient hacking, for example whether an agent could improve its capabilities this way.
