Yeah that seems reasonable! (Personally I'd prefer a single break between sentences 3 and 4.)
IMO ~170 words is a decent length for a well-written abstract (well maybe ~150 is better), and the problem is that abstracts are often badly written. Steve Easterbrook has a great guide on writing scientific abstracts; here's his example template which I think flows nicely:
...(1) In widgetology, it’s long been understood that you have to glomp the widgets before you can squiffle them. (2) But there is still no known general method to determine when they’ve been sufficiently glomped. (3) The literature describes several specialist techniques that measure how
Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.
I'm arguing that it's definitely not going to work (I don't have 99% confidence here bc I might be missing something, but IM(current)O the things I list are actual blockers).
First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.
Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.
Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):
That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.
Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:
(Crossposting some of my twitter comments).
I liked this criticism of alignment approaches: it makes a concrete claim that addresses the crux of the matter, and provides supporting evidence! I also disagree with it, and will say some things about why.
I think that instead of thinking in terms of "coherence" vs. "hot mess", it is more fruitful to think about "how much influence is this system exerting on its environment?". Too much influence will kill humans, if directed at an outcome we're not able to choose. (The rest of my comments are all variations on
Maybe Francois Chollet has coherent technical views on alignment that he hasn't published or shared anywhere (the blog post doesn't count, for reasons that are probably obvious if you read it), but it doesn't seem fair to expect Eliezer to know / mention them.
Is there an open-source implementation of causal scrubbing available?
I'm confused about the example you give. In the paragraph, Eliezer is trying to show that you ought to accept the independence axiom, cause you can be Dutch booked if you don't. I'd think if you're updateless, that means you already accept the independence axiom (cause you wouldn't be time-consistent otherwise).
And in that sense it seems reasonable to assume that someone who doesn't already accept the independence axiom is also not updateless.
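(For reference, the money-pump argument as I understand it, roughly: suppose you prefer A over B but, violating independence, you strictly prefer the mixture pB + (1-p)C over pA + (1-p)C. Start you off holding pA + (1-p)C. A bookie charges you a small fee to swap it for pB + (1-p)C, which you accept; then, once the p-branch comes up and you're due to receive B, they charge you another small fee to swap B back for A, which you also accept since you prefer A over B. In every outcome you end up with exactly what the original lottery would have given you, minus the fees.)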
I agree it's important to be careful about which policies we push for, but I disagree both with the general thrust of this post and the concrete example you give ("restrictions on training data are bad").
Re the concrete point: it seems like the clear first-order consequence of any strong restriction is to slow down AI capabilities. Effects on alignment are more speculative and seem weaker in expectation. For example, it might be bad if it were illegal to collect user data (eg from users of ChatGPT) for fine-tuning, but such data collection is unlikely to fa...
I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed".
(Though of course it's important to spell the argument out)
I agree with your general point here, but I think Ajeya's post actually gets this right, eg
There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful -- and once human knowledge/control has eroded enough -- an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.”
and
...What if Alex doesn’t generalize to maximizing its reward in the deployment setting? What if it has more complex behaviors or “motives” that aren’t directly and simply derived from
FWIW I believe I wrote that sentence and I now think this is a matter of definition, and that it’s actually reasonable to think of an agent that e.g. reliably solves a maze as an optimizer even if it does not use explicit search internally.
I would be very curious to see your / OpenAI's responses to Eliezer's Dimensions of Operational Adequacy in AGI Projects post. Which points do you / OpenAI leadership disagree with? Insofar as you agree but haven't implemented the recommendations, what's stopping you?
People at OpenAI regularly say things like
And you say:
(Note that I'm not making a claim about how search is central to human capabilities relative to other species; I'm just saying search is useful in general. Plausibly also for other species, though it is more obvious for humans)
From my POV, the "cultural intelligence hypothesis" is not a counterpoint to importance of search. It's obvious that culture is important for human capabilities, but it also seems obvious to me that search is important. Building printing presses or steam engines is not something that a bundle of heuristics can do, IMO, without gainin...
I think you overestimate the importance of the genomic bottleneck. It seems unlikely that humans would have been as successful as we are if we were running the alternative to the kind of algorithm that does search, whatever that alternative is; you don't really describe it.
Performing search to optimize an objective seems really central to our (human's) capabilities, and if you want to argue against that I think you should say something about what an algorithm is supposed to look like that is anywhere near as capable as humans but doesn't do any search.
Gotcha, this makes sense to me now, given the assumption that to get AGI we need to train a P-parameter model on the optimal scaling, where P is fixed. Thanks!
...though now I'm confused about why we would assume that. Surely that assumption is wrong?
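(For concreteness, with made-up numbers rather than anything from the report: if the assumption is that we need N = 1e14 parameters, then the rough Chinchilla recipe of ~20 tokens per parameter gives D ≈ 2e15 training tokens, and training compute of roughly 6·N·D ≈ 1.2e30 FLOP. On this picture the Chinchilla update shows up entirely in the data and compute estimates for that fixed N.)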
But in my report I arrive at a forecast by fixing a model size based on estimates of brain computation, and then using scaling laws to estimate how much data is required to train a model of that size. The update from Chinchilla is then that we need more data than I might have thought.
I'm confused by this argument. The old GPT-3 scaling law is still correct, just not compute-optimal. If someone wanted to, they could still go on using the old scaling law. So discovering better scaling can only lead to an update towards shorter timelines, right?
(Except if you had expected even better scaling laws by now, but it didn't sound like that was your argument?)
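(To spell out the logic: if C_r(L) is the training compute needed to reach a given loss level L with recipe r, then the compute you actually need is the minimum of C_r(L) over all recipes you know of, and adding a newly discovered recipe to that set can only leave the minimum unchanged or lower it.)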
What would make you change your mind about robustness of behavior (or interpretability of internal representations) through the sharp left turn? Or about the existence of such a sharp left turn, as opposed to smooth scaling of ability to learn in-context?
For example, would you change your mind if we found smooth scaling laws for (some good measure of) in-context learning?
(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)
The basics
Minor comment on clarity: you don't explicitly define relaxed adversarial training (it's only mentioned in the title and the conclusion), which is a bit confusing for someone coming across the term for the first time. Since this is the current reference post for RAT, I think it would be nice if you did this explicitly; for example, I'd suggest renaming the second section to 'Formalizing relaxed adversarial training', and within the section calling it that instead of 'Paul's approach'.
But since we're not doing that, there's nothing to counteract the negative gradient that removes the inner optimizer.
During training, the inner optimizer has the same behavior as the benign model: while it's still dumb it just doesn't know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.
So training does not select for a benign model over a consequentialist one (or at least it does not obviously select for a benign model; I don't know how the inductive biases will work out here). Once the consequentialist ac...
Hm I don't think your objection applies to what I've written? I don't assume anything about using a loss like the ones you wrote down. In the post I explicitly talk about offline training, where the data distribution is fixed.
Taking a guess at where the disagreement lies, I think it's where you say
And seems much more tame than L to me.
does not in fact look 'tame' (by which I mean safe to optimize) to me. I'm happy to explain why, but without seeing your reasoning behind the quoted statement I can only rehash the things I say in the post.
...You haven't given any instrume
I think that most nontrivial choices of loss function would give rise to consequentialist systems, including the ones you write down here.
In the post I was assuming offline training, that is, in your notation, the case where the distribution of the training data is unaffected by the model. This seems even more tame than your proposal, but still dangerous, because an AGI can just figure out how to affect the data distribution 'one-shot' without having to trial-and-error learn how during training.
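To spell out what I mean by offline, in generic notation (not necessarily matching yours): the training objective is E_{x ~ D}[L(f_θ(x))] with D a fixed data distribution, as opposed to a feedback setting where the distribution the model is evaluated on depends on the model itself, i.e. x ~ D(f_θ).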
Yeah this seems right! :) I am assuming no one ever inspects a partial output. This does seem risky, and it's likely there are a bunch more possible failure modes here.
(Btw, thanks for this exchange; just wanted to note that it was valuable for me and made me notice some mistakes in how I was thinking about oracles)
This sounds like what Fix #2 is saying, which is meant to be addressed in the 'Third Problem' paragraph.
To paraphrase that paragraph: the model that best predicts the data is likely to be a consequentialist. This is because consequentialists are general in a way that heuristic or other non-consequentialist systems aren't, and generality is strongly selected for in domains that are very hard.
Curious if you disagree with anything in particular in that paragraph or what I just said.
Ah yes, I missed that the oracle needs to be myopic, i.e. care only about the next prediction. I edited my definition of counterfactual oracle to include this (I think this is standard, as Stuart Armstrong's paper also assumes myopia).
If it's not myopic you're right that it might help construct a misaligned system, or otherwise take over the world. I think that myopia is enough to prevent this though: If Oracle1 cares only about the current prediction, then there is no incentive for it to construct Oracle2, since Oracle2 can only help in future episodes.
Hm I still think it works? All oracles assume null outputs from all oracles, including themselves. Once a new oracle is built, it is considered part of this set of null-output-oracles. (There are other hitches, like: Oracle1 will predict that Oracle2 will never be built, because why would humans build a machine that only ever gives null outputs. But this doesn't help the oracles coordinate, as far as I can see.)
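To make the setup I have in mind concrete, here's a toy sketch of counterfactual-oracle training (the erasure probability, scoring rule, and oracle/world stubs are all made up for illustration):

```python
import random

def world_outcome(question):
    """Stub for whatever quantity the oracle is asked to predict (illustration only)."""
    return random.gauss(0.0, 1.0)

def score(prediction, outcome):
    # Per-episode reward: negative squared error. Myopia means this is the only
    # signal the oracle is ever trained on; there are no cross-episode returns.
    return -(prediction - outcome) ** 2

def train_counterfactual_oracle(oracle, questions, erasure_prob=0.1):
    rewards = []
    for q in questions:
        prediction = oracle(q)
        if random.random() < erasure_prob:
            # Erasure event: the prediction is never shown to anyone, so the
            # outcome is what happens when the output is effectively null.
            outcome = world_outcome(q)
            rewards.append((q, prediction, score(prediction, outcome)))
        else:
            # The output is shown to humans, but the oracle gets no training
            # signal from these episodes, so it is never rewarded for the
            # downstream effects of an output that someone actually reads.
            pass
    return rewards

# Toy usage with a dummy oracle that always predicts 0.
dummy_oracle = lambda q: 0.0
print(train_counterfactual_oracle(dummy_oracle, range(1000))[:3])
```

The point is that an episode's reward never depends on anything downstream of a human actually reading the output, so a genuinely myopic oracle has no incentive to pick outputs for their causal effects (including building Oracle2).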
Thanks for the rec! I knew TRC was awesome but wasn't aware you could get that much compute.
Still, beyond short-term needs it seems like this is a risky strategy. TRC is basically a charity project that AFAIK could be shut down at any time.
Overall this updates me towards "we should very likely do the GCP funding thing. If this works out fine, setting up a shared cluster is much less urgent. A shared cluster still seems like the safer option in the mid to long term, if there is enough demand for it to be worthwhile"
Curious if you disagree with any of this
Proof: The only situation in which the iteration scheme does not update the decision boundary B is when we fail to find a predictor that does useful computation relative to E. By hypothesis, the only way this can happen is if E does not contain all of E0 or E = C. Since we start with E0 and only grow the easy set, it must be that E = C.
(emphasis mine)
To me it looks like the emphasized assumption (that it's always possible to find a predictor that does useful computation) is the main source of your surprising result, as without it the iteration would not...
Here's a few more questions about the same strategy:
If I understand correctly, the IG strategy is to learn a joint model for observations and actions p(v, a | z), where v, a, and z are video, actions, and proposed change to the Bayes net, respectively. Then we do inference using p(v, a | z), where z is optimized for predictive usefulness.
This fails because there's no easy way to get the probability that the diamond is in the vault from p(v, a | z).
A simple way around this would be to learn p(v, a, d | z) instead, where d = 1 if the diamond is in the vault and d = 0 otherwise.
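(Then answering 'is the diamond in the vault?' would just mean reading off p(d = 1 | v, a, z) from the learned joint, rather than trying to extract it from p(v, a | z).)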
Would you consider this a valid counter to the third strategy (have humans adopt the optimal Bayes net using imitative generalization), as an alternative to ontology mismatch?
Counter: In the worst case, imitative generalization / learning the human prior is not competitive. In particular, it might just be harder for a model to match the human inference from Z than to simply learn the task directly. Here Z is the set of instructions, as in learning the prior (I think in the context of ELK, Z would be the proposed change to the human Bayes net?)
Yeah we're on the same page here, thanks for checking!
I feel pretty uncertain about all the factors here. One reason I overall still lean towards the 'definitely not' stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the prob...