This is pretty close to my understanding, with one important objection.
Thanks for responding and trying to engage with my perspective.
If we repeat this iterative process enough times, we'll end up with a robust reward model.
I don't claim we'll necessarily ever get a fully robust reward model, just that the reward model will be mostly robust on average to the actual policy you use as long as human feedback is provided at a frequent enough interval. We never needed a good robust reward model which works on every input, we just needed a reward m...
I'd like to register that I disagree with the claim that standard online RLHF requires adversarial robustness in AIs per se. (I agree that it requires that humans are adversarially robust to the AI, but this is a pretty different problem.)
In particular, the place where adversarial robustness shows up is in sample efficiency. So, poor adversarial robustness is equivalent to poor sample efficiency. My understanding is that the trend is toward higher, not lower, sample efficiency with scale, so this seems on track.
This same reasoning also applies to recursive o...
If I build a chatbot, and I can't jailbreak it, how do I determine whether that's because the chatbot is secure or because I'm bad at jailbreaking? How should AI scientists overcome Schneier's Law of LLMs?
FWIW, I think there aren't currently good benchmarks for alignment and the ones you list aren't very relevant.
In particular, MMLU and SWAG are both just capability benchmarks where alignment training is very unlikely to improve performance. (Alignment-ish training could theoretically improve performance by making the model 'actually try', but wha...
Does This Make Any Sense?
I'm confused - it looks like the first paragraph of this section is taken from a prior post on attribution patching.
When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).
Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans 'want' to be preserved (at least according to a conventional notion of preferences).
I think this empirical view seems pretty implausib...
I would be more sympathetic if you made a move like, "I'll accept continuity through the human range of intelligence, and that we'll only have to align systems as collectively powerful as humans, but I still think that hands-on experience is only..." In particular, I think there is a real disagreement about the relative value of experimenting on future dangerous systems instead of working on theory or trying to carefully construct analogous situations today by thinking in detail about alignment difficulties in the future.
Here are some views, often held in a cluster:
I'm not sure exactly which clusters you're referring to, but I'll just assume that you're pointing to something like "people who aren't very into the sharp left turn and think that iterative, carefully bootstrapped alignment is a plausible strategy." If this isn't what you were trying to highlight, I apologize. The rest of this comment might not be very relevant in that case.
To me, the views you listed here feel like a straw man or weak man of this perspective.
Furthermore, I think the actual crux is more ofte...
Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a non-zero-sum objective.
If training works well, then they can't collude on average during training; collusion could only happen rarely, or in some sustained burst before training crushes these failures.
In particular, in the purely supervised case with gradient descent, performing poorly on average during training requires gradient hacking (or more beni...
We can't be confident enough that it won't happen to safely rely on that assumption.
I'm not sure what motivation for worst-case reasoning you're thinking about here. Maybe just that there are many disjunctive ways things can go wrong other than bad capability evals and the AI will optimize against us?
Overall, I think I disagree.
This will depend on the exact bar for safety. This sort of scenario feels like 0.1% to 3% likely to me, and it would be immensely catastrophic, but there is lower-hanging fruit for danger avoidance elsewhere.
(And for this...
[Sorry for late reply]
...Analogously, conditional on things like gradient hacking being an issue at all, I'd expect the "hacker" to treat potential-training-objective-improvement as a scarce resource, which it generally avoids "spending" unless the expenditure will strengthen its own structure. Concretely, this probably looks like mostly keeping itself decoupled from the network's output, except when it can predict the next backprop update and is trying to leverage that update for something.
So it's not a question of performing badly on the training metric s
I don't quite think this point is right. Gradient descent had to have been able to produce the highly polysemantic model and pack things together in a way which got lower loss. This suggests that it can also change the underlying computation. I might need to provide more explanation for my point to be clear, but I think considering how gradient descent learns a single polysemantic neuron and how it could update that neuron in response to distributional shifts could be informative.
There might be a specific notion of "tangled together" that is learned by gra...
I also think "a task ai" is a misleading way to think about this: we're reasonably likely to be using a heterogeneous mix of a variety of AIs with differing strengths and training objectives.
Perhaps a task AI driven corporation?
Why target speeding up alignment research during this crunch time period as opposed to just doing the work myself?
Conveniently, alignment work is the work I wanted to get done during that period, so this is nicely dual use. Admittedly, a reasonable fraction of the work will be on things which are totally useless at the start of such a period, though I typically target things to be more useful earlier.
I also typically think the work I do is retargetable to general usages of AI (e.g., make 20 trillion dollars).
Beyond this, the world will probably be radically transformed prior to large scale usage of AIs which are strongly superhuman in most or many domains. (Weighting domains by importance.)
For doing alignment research, I often imagine things like speeding up the entire alignment field by >100x.
As in, suppose we have 1 year of lead time to do alignment research with the entire alignment research community. I imagine producing as much output in this year as if we spent >100x serial years doing alignment research without ai assistance.
This doesn't clearly require using superhuman AIs. For instance, perfectly aligned systems as intelligent and well informed as the top alignment researchers which run at 100x the speed would clearly be suff...
My probabilities are very rough, but I'm feeling more like 1/3 ish today after thinking about it a bit more. Shrug.
As far as reasons for it being this high:
Generally, I'm happy to argue for 'we should be pretty confused and there are a decent number of good reasons why AIs might keep humans alive'. I'm not confident in survival overall though...
I agree that EY is quite overconfident, and I think his arguments for doom are often sloppy and don't hold up. (I think the risk is substantial, but often the exact arguments EY gives don't work.) And, his communication often fails to meet basic bars for clarity. I'd also probably agree with 'if EY was able to do so, improving his communication and arguments in a variety of contexts would be extremely good'. And specifically, not saying crazy-sounding shit which is easily misunderstood would probably be good (there are some real costs here too). But, I'm not s...
Everyone, everyone, literally everyone in AI alignment is severely wrong about at least one core thing, and disagreements still persist on seemingly-obviously-foolish things.
If by 'severely wrong about at least one core thing' you just mean 'systemically severely miscalibrated on some very important topic', then my guess is that many people operating in the rough prosaic alignment paradigm probably don't suffer from this issue. It's just not that hard to be roughly calibrated. This is perhaps a random technical point.
I can't tell if this post is trying to discuss communicating about anything related to AI or alignment or is trying to more specifically discuss communication aimed at general audiences. I'll assume it's discussing arbitrary communication on AI or alignment.
I feel like this post doesn't engage sufficiently with the costs associated with high-effort writing and the alternatives to targeting arbitrary lesswrong users interested in alignment.
For instance, when communicating research it's cheaper to instead just communicate to people who are operating within t...
I broadly disagree with Yudkowsky on his vision of FOOM and think he's pretty sloppy wrt. AI takeoff overall.
But, I do think you're quite likely to get a quite rapid singularity if people don't intentionally slow things down. For instance, the modeling in Tom Davidson's takeoff speeds report seems very reasonable to me, except that I think the default parameters he uses are insufficiently aggressive (I think compute requirements are likely to be somewhat lower than given in this report). Notably this model doesn't get you FOOM in a week (p...
I think my views on takeoff/timelines are broadly similar to Paul's except that I have somewhat shorter takeoffs and timelines (I think this is due to thinking AI is a bit easier and also due to misc deference).
...... Wait, why not? If AI exceeds the human capability range on STEM four years from now, I would call that 'very soon', especially given how terrible GPT-4 is at STEM right now.
The thesis here is not 'we definitely won't have twelve months to work with STEM-level AGI systems before they're powerful enough to be dangerous'; it's more like 'we won't
I really hope this isn't a sticking point for people. I also strongly disagree with this being 'a fundamental point'.
If you condition on misaligned AI takeover, my current (extremely rough) probabilities are:
By 'kill' here I'm not including things like 'the AI cryonically preserves everyone's brains and then revives people later'. I'm also not including cases where the AI lets everyone live a normal human lifespan but fails to grant immortality or continue human civilization beyond this point.
My beliefs here are due to a combination of causal...
I agree that much of LW has moved past the foom argument and is solidly on Eliezer's side relative to Robin Hanson; Hanson's views seem increasingly silly as time goes on (though they seemed much more plausible a decade ago, before e.g. the rise of foundation models and the shortening of timelines to AGI). The debate is now more like Yud vs. Christiano/Cotra than Yud vs. Hanson.
It seems worth noting that the views and economic modeling you discuss here seem broadly in keeping with Christiano/Cotra (but with more aggressive constants).
...A common misconception is that STEM-level AGI is dangerous because of something murky about "agents" or about self-awareness. Instead, I'd say that the danger is inherent to the nature of action sequences that push the world toward some sufficiently-hard-to-reach state.
Call such sequences "plans".
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-
This post seems to argue for fast/discontinuous takeoff without explicitly noting that people working in alignment often disagree. Further, I think many of the arguments given here for fast takeoff seem sloppy or directly wrong on my own views.
It seems reasonable to just give your views without noting disagreement, but if the goal is for this to be a reference for the AI risk case, then I think you should probably note where people (who are still sold on AI risk) often disagree. (Edit: It looks like Rob explained his goals in a footnote.)
...Another large pie
Counterargument: you can just defend against these AIs running amuck.
As long as most AIs are systematically trying to further human goals you don't obviously get doomed (though the situation is scary).
There could be offense-defense imbalances, but there are also 'tyranny of the majority' advantages.
Distilling inference based approaches into learning is usually reasonably straightforward. I think this also applies in this case.
This doesn't necessarily apply to 'learning how to learn'.
(That said, I'm less sold that retrieval + chain of thought 'mostly solves autonomous learning')
(Note: this comment is rambly and repetitive, but I decided not to spend time cleaning it up)
It sounds like you believe something like: "There are autonomous learning style approaches which are considerably better than the efficiency on next token prediction."
And more broadly, you're making a claim like 'current learning efficiency is very low'.
I agree - brains imply that it's possible to learn vastly more efficiently than deep nets, and my guess would be that performance can be far, far better than brains.
Suppose we instantly went from 'current status quo...
So I propose “somebody gets autonomous learning to work stably for LLMs (or similarly-general systems)” as a possible future fast-takeoff scenario.
Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations. For instance, suppose that data doesn't run out despite scaling and autonomous learning is moderately to considerably less efficient than supervised learning. Then, you'd just do supervised learning. Now, we can imagine fast takeoff scenarios where:
Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations.
Suppose I ask you to spend a week trying to come up with a new good experiment to try in AI. I give you two options.
Option A: You need to spend the entire week reading AI literature. I choose what you read, and in what order, using a random number generator and selecting out of every AI paper / textbook ever written. While reading, you are forced to dwell for exactly one second—no more, no less—on each word of the t...
I added some caveats about the potential for empirical versions of moral realism and how precise values targets are in practice.
While the target is small in mind space, IMO, it's not that small wrt. things like the distribution of evolved life or more narrowly the distribution of humans.
I roughly agree with Akash's comment.
But also some additional points:
I left another comment on my experience doing interpretability research, but I'd also like to note some overall disagreements with the post.
First, it's very important to note that GPT-4 was trained with SFT (supervised finetuning) and RLHF. I haven't played with GPT-4, but for prior models this has a large effect on the way the model responds to inputs. If the data were public, I would guess that looking at the SFT and RLHF data would often be considerably more useful than looking at the pretraining data. This doesn't morally contradict the post, but it's worth noting it's import...
My personal experience doing interpretability research has led me to think this sort of consideration is quite important for interpretability on pretrained LLMs (but not necessarily for other domains idk).
I've found it quite important when doing interpretability research to have read through a reasonable amount of the training data for the (small) models I'm looking at. In practice I've read an extremely large amount of OpenWebText (mostly while doing interp) and played this prediction game a decent amount: http://rr-lm-game.herokuapp.com/ (from here: h...
comment TLDR: Adversarial examples are a weapon we can use for good against the AIs, and solving adversarial robustness would let the AIs harden themselves.
I haven't read this yet (I will later : ) ), so it's possible this is mentioned, but I'd note that exploiting the lack of adversarial robustness could also be used to improve safety. For instance, AI systems might have a hard time keeping secrets if they also need to interact with humans trying to test for verifiable secrets. E.g., trying to jailbreak AIs to get them to tell you about the fact that they ...
I'm at like 40% doom, then conditional on doom like 50/50 on nearly all (>99%) humans killed within a year (I'm talking about information death here, freezing brains and reviving later doesn't count as death; if not revived ever, then it's death), then conditioned on nearly all humans killed I'm at maybe 75% on literally all humans killed within a year.
So, overall I'm at 15% on literally all humans dead?
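For concreteness, the chain of conditionals above multiplies out like this (a trivial sketch using the rough numbers stated above; variable names are mine):

```python
# Rough probabilities from the comment above (explicitly not in reflective equilibrium).
p_doom = 0.40                       # P(doom)
p_near_all_given_doom = 0.50        # P(>99% of humans killed within a year | doom)
p_all_given_near_all = 0.75         # P(literally all killed | >99% killed)

# Chain rule: multiply the unconditional probability by each conditional in turn.
p_all_dead = p_doom * p_near_all_given_doom * p_all_given_near_all
print(f"P(literally all humans dead) = {p_all_dead:.0%}")  # prints "P(literally all humans dead) = 15%"
```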
Numbers aren't in reflective equilibrium. I find the arguments for the AI killing nearly everyone not that compelling.
Simulations are not the most efficient way for A and B to reach their agreement
Are you claiming that the marginal returns to simulation are never worth the costs? I'm skeptical. I think it's quite likely that some number of acausal trade simulations are run even if that isn't where most of the information comes from. I think there are probably diminishing returns to various approaches, and thus you both do simulations and other approaches. There's a further benefit to sims, which is that credence about sims affects the behavior of CDT agents, but it's unc...
It is indeed pretty weird to see these behaviors appear in pure LMs. It's especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
By 'pure LMs' do you mean 'pure next-token-predicting LLMs trained on a standard internet corpus'? If so, I'd be very surprised if they're miscalibrated, assuming this prompt isn't that improbable (which it probably isn't). I'd guess this output is the 'right' output for this corpus (so long as you don't sample enough tokens to make the sequence detectably very...
(Context, I work at Redwood)
While we're on the topic, it's perhaps useful to more directly describe my concerns about distribution-specific understanding of models, and especially narrow-distribution understanding of the kind a lot of work building Causal Scrubbing seems to be focusing on.
Can I summarize your concerns as something like "I'm not sure that looking into the behavior of "real" models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it?" Or perhaps you think it'...
Thinking about the state and time evolution rules for the state seems fine, but there isn't any interesting structure with the naive formulation imo. The state is the entire text, so we don't get any interesting Markov chain structure. (you can turn any random process into a Markov chain where you include the entire history in the state! The interesting property was that the past didn't matter!)
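The parenthetical can be made concrete with a toy sketch (the token rule here is made up for illustration): a process whose next token depends on more than the last token becomes trivially "Markov" once the state is defined as the entire history, but that definition gives away the interesting property.

```python
# Toy process: the next token genuinely depends on the last TWO tokens,
# so it is not Markov over single tokens.
def next_token(history):
    # Emit "b" iff the last two tokens were both "a", else emit "a".
    return "b" if history[-2:] == ("a", "a") else "a"

def step(state):
    # Lift the state to the entire text so far. Now the transition is a
    # function of the current state alone -- trivially Markov, but only
    # because "the past doesn't matter" has been defined away.
    return state + (next_token(state),)

state = ("a",)
for _ in range(4):
    state = step(state)
print(state)  # prints ('a', 'a', 'b', 'a', 'a')
```

The point is that this lifting works for literally any random process, so calling LLM text generation a Markov chain over full prompts carries no interesting structural content.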
The way I tend to think of 'simulators' is in simulating a distribution over worlds (i.e., latent variables) that increasingly collapses as prompt information determines specific processes with higher probability.
I agree this is the correct interpretation of the original post. It just doesn't match typical usage of the word 'simulation' imo. (I'm sorry my post is making such a narrow pedantic point.)
I probably agree that simulators improved the thinking of people on lesswrong on average.
Fwiw, dropout hasn't fallen out of favor very much.
I think dropout makes nets less interpretable (wrt. naive interp strats). This is based on my recollection; I forget what exact experiments we have and haven't run.
FWIW, white box alignment doesn't imply humans understand what the models are thinking. There are other ways to leverage the fact that we have access to the internals.
I guess I'm considerably more optimistic on avoiding AI takeover without humans understanding what the models are thinking. (Or possibly you're more optimistic about slowing down AI)
I would argue that ARC's research is justified by (1) (roughly speaking). Sadly, I don't think that there are enough posts on their current plans for this to be clear or easy for me to point at. There might be some posts coming out soon.
Sorry, thanks for the correction.
I personally disagree on this being a good benchmark for outer alignment for various reasons, but it's good to understand the intention.