All of ryan_greenblatt's Comments + Replies

Sorry, thanks for the correction.

I personally disagree on this being a good benchmark for outer alignment for various reasons, but it's good to understand the intention.

This is pretty close to my understanding, with one important objection.

Thanks for responding and trying to engage with my perspective.

Objection

If we repeat this iterative process enough times, we'll end up with a robust reward model.

I don't claim we'll necessarily ever get a fully robust reward model, just that the reward model will be mostly robust on average to the actual policy you use as long as human feedback is provided at a frequent enough interval. We never needed a good robust reward model which works on every input, we just needed a reward m...

I'd like to register that I disagree with the claim that standard online RLHF requires adversarial robustness in AIs per se. (I agree that it requires that humans are adversarially robust to the AI, but this is a pretty different problem.)

In particular, the place where adversarial robustness shows up is in sample efficiency. So, poor adversarial robustness is equivalent to poor sample efficiency. My understanding is that the trend is toward higher not lower sample efficiency with scale, so this seems on track.

This same reasoning also applies to recursive o...

2AdamGleave2d
To check my understanding, is your view something like:

1. If the reward model isn't adversarially robust, then the RL component of RLHF will exploit it.
2. These generations will show up in the data presented to the human. Provided the human is adversarially robust, then the human feedback will provide corrective signal to the reward model.
3. The reward model will stop being vulnerable to those adversarial examples, although it may still be vulnerable to other adversarial examples.
4. If we repeat this iterative process enough times, we'll end up with a robust reward model.

Under this model, improving adversarial robustness just means we need fewer iterations, showing up as improved sample efficiency.

I agree with this view up to a point. It does seem likely that with sufficient iterations, you'd get an accurate reward model. However, I think the difference in sample efficiency could be profound: e.g. exponential (needing to get explicit corrective human feedback for most adversarial examples) vs. linear (generalizing in the right way from human feedback). In that scenario, we may as well just ditch the reward model and provide training signal to the policy directly from human feedback.

In practice, we've seen that adversarial training (with practical amounts of compute) improves robustness, but models are still very much vulnerable to attacks. I don't see why RLHF's implicit adversarial training would end up doing better than explicit adversarial training.

In general I tend to view sample-efficiency discussions as tricky without some quantitative comparison. There's a sense in which decision trees and a variety of other simple learning algorithms are a viable approach to AGI; they're just very sample (and compute) inefficient.

The main reason I can see why RLHF may not need adversarial robustness is if the KL-penalty-from-base-model approach people currently use is actually enough.
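The "KL penalty from base model" approach mentioned at the end can be sketched as a per-sample reward computation. This is a toy illustration, not any particular library's implementation; `reward_model_score` and the log-probabilities are stand-in inputs:

```python
def rlhf_reward(reward_model_score: float,
                logprob_policy: float,
                logprob_base: float,
                beta: float = 0.1) -> float:
    """KL-penalized training signal: r(x, y) - beta * [log pi(y|x) - log pi_base(y|x)].

    The penalty keeps the policy close to the base model, which limits how far
    RL can push the policy into the reward model's off-distribution
    (adversarially exploitable) regions.
    """
    kl_term = logprob_policy - logprob_base  # per-sample KL estimate
    return reward_model_score - beta * kl_term

# A sample the policy now prefers much more strongly than the base model
# gets its reward-model score discounted.
print(round(rlhf_reward(1.0, -2.0, -5.0), 3))  # → 0.7
```

Whether this penalty alone suffices to prevent reward-model exploitation is exactly the open question raised above.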

If I build a chatbot, and I can't jailbreak it, how do I determine whether that's because the chatbot is secure or because I'm bad at jailbreaking? How should AI scientists overcome Schneier's Law of LLMs?

FWIW, I think there aren't currently good benchmarks for alignment and the ones you list aren't very relevant.

In particular, MMLU and Swag are both just capability benchmarks where alignment training is very unlikely to improve performance. (Alignment-ish training could theoretically improve performance by making the model 'actually try', but wha...

1ctic24216h
Curious if you could elaborate more on why MACHIAVELLI isn't a good test for outer alignment!
9Dan H1d
It is. It's an outer alignment benchmark for text-based agents (such as GPT-4), and it includes measurements for deception, resource acquisition, various forms of power, killing, and so on. Separately, it's meant to show that reward maximization induces undesirable instrumental (Machiavellian) behavior in less toyish environments, and to improve the tradeoff between ethical behavior and reward maximization. It doesn't get at things like deceptive alignment, as discussed in the x-risk sheet in the appendix [https://arxiv.org/pdf/2304.03279.pdf#page=30]. Apologies that the paper is so dense, but that's because it took over a year.
4Cleo Nardo3d
* Yep, I agree that MMLU and Swag aren't alignment benchmarks. I was using them as examples of "Want to test your model's ability at X? Then use the standard X benchmark!" I'll clarify in the text. * They tested toxicity (among other things) with their "safety prompts", but we do have standard benchmarks for toxicity. * They could have turned their safety prompts into a new benchmark if they had run the same test on the other LLMs! This would've taken, idk, 2–5 hrs of labour? * The best MMLU-like benchmark test for alignment-proper is https://github.com/anthropics/evals [https://github.com/anthropics/evals] which is used in Anthropic's Discovering Language Model Behaviors with Model-Written Evaluations [https://arxiv.org/abs/2212.09251]. See here for a visualisation [https://www.evals.anthropic.com/model-written/]. Unfortunately, this benchmark was published by Anthropic, which makes it unlikely that competitors will use it (esp. MetaAI).

Oops, somehow I missed that context.

Thanks for the clarification.

Does This Make Any Sense?

I'm confused - it looks like the first paragraph of this section is taken from a prior post on attribution patching.

3Neel Nanda12d
Ah, thanks! As I noted at the top, this was an excerpt from that post, which I thought was interesting as a standalone piece, but I didn't realise how much of that section depended on attribution patching knowledge.

When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).

Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans 'want' to be preserved (at least according to a conventional notion of preferences).

I think this empirical view seems pretty implausib...

6habryka16d
This feels like it is not really understanding my point, though maybe best to move this to some higher-bandwidth medium if the point is that hard to get across.

Giving it one last try: What I am saying is that I don't think "conventional notion of preferences" is a particularly well-defined concept, and neither are a lot of other concepts you are using in order to make your predictions here. What it means to care about the preferences of others is a thing with a lot of really messy details that tend to blow up in different ways when you think harder about it and are less anchored on the status-quo. I don't think you currently know in what ways you would care about the preferences of others after a lot of reflection (barring game-theoretic considerations which I think we can figure out a bit more in advance, but I am bracketing that whole angle in this discussion, though I totally agree those are important and relevant).

I do think you will of course endorse the way you care about other people's preferences after you've done a lot of reflection (otherwise something went wrong in your reflection process), but I don't think you would endorse what AIs would do, and my guess is you also wouldn't endorse what a lot of other humans would do when they undergo reflection here.

Like, what I am saying is that while there might be a relatively broad basin of conditions that give rise to something that locally looks like caring about other beings, the space of caring about other beings is deep and wide, and if you have an AI that cares about other beings' preferences in some way you don't endorse, this doesn't actually get you anything. And I think the arguments that the concept of "caring about others" that an AI might have (though my best guess is that it won't even have anything that is locally well-described by that) will hold up after a lot of reflection seem much weaker to me than the arguments that it will have that preference at roughly human capability and ethical-r

I would be more sympathetic if you made a move like, "I'll accept continuity through the human range of intelligence, and that we'll only have to align systems as collectively powerful as humans, but I still think that hands-on experience is only..." In particular, I think there is a real disagreement about the relative value of experimenting on future dangerous systems instead of working on theory or trying to carefully construct analogous situations today by thinking in detail about alignment difficulties in the future.

Here are some views, often held in a cluster:

I'm not sure exactly which clusters you're referring to, but I'll just assume that you're pointing to something like "people who aren't very into the sharp left turn and think that iterative, carefully bootstrapped alignment is a plausible strategy." If this isn't what you were trying to highlight, I apologize. The rest of this comment might not be very relevant in that case.

To me, the views you listed here feel like a straw man or weak man of this perspective.

Furthermore, I think the actual crux is more ofte...


Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a non-zero-sum objective.

If training works well, then they can't collude on average during training, only rarely or in some sustained burst before training crushes these failures.

In particular, in the purely supervised case with gradient descent, performing poorly on average during training requires gradient hacking (or more beni...

3Rubi J. Hudson20d
For the first point, I agree that SGD pushes towards closing any gaps. My concern is that at the moment, we don't know how small the gaps need to be to get the desired behavior (and this is what we are working on modelling now). On top of that, depending on how the models are initialized, the starting gap may be quite large, so the dynamics of how gaps close throughout the training process seem important to study further.

For the second point, I think we are also in agreement. If the training process leads the AI to learn "If I predict that this action will destroy the world, the humans won't choose it", that then leads to dishonest predictions. However, I also find the training process converging to a mesa-optimizer for the training objective (or something sufficiently close) to be somewhat more plausible.

We can't be confident enough that it won't happen to safely rely on that assumption.

I'm not sure what motivation for worst-case reasoning you're thinking about here. Maybe just that there are many disjunctive ways things can go wrong other than bad capability evals and the AI will optimize against us?

Overall, I think I disagree.

This will depend on the exact bar for safety. I think this sort of scenario feels like 0.1% to 3% likely to me, which is immensely catastrophic overall, but there is lower-hanging fruit for danger avoidance elsewhere.

(And for this...

3johnswentworth1mo
This is getting very meta, but I think my Real Answer is that there's an analogue of You Are Not Measuring What You Think You Are Measuring [https://www.lesswrong.com/posts/9kNxhKWvixtKW5anS/you-are-not-measuring-what-you-think-you-are-measuring] for plans. Like, the system just does not work any of the ways we're picturing it at all, so plans will just generally not at all do what we imagine they're going to do. (Of course the plan could still in-principle have a high chance of "working", depending on the problem, insofar as the goal turns out to be easy to achieve, i.e. most plans work by default. But even in that case, the planner doesn't have counterfactual impact; just picking some random plan would have been about as likely to work.)

The general solution which You Are Not Measuring What You Think You Are Measuring suggested was "measure tons of stuff", so that hopefully you can figure out what you're actually measuring. The analogy of that technique for plans would be: plan for tons of different scenarios, failure modes, and/or goals. Find plans (or subplans) which generalize to tons of different cases, and there might be some hope that it generalizes to the real world. The plan can maybe be robust enough to work even though the system does not work at all the ways we imagine.

But if the plan doesn't even generalize to all the low-but-not-astronomically-low-probability possibilities we've thought of, then, man, it sure does seem a lot less likely to generalize to the real system. Like, that pretty strongly suggests that the plan will work only insofar as the system operates basically the way we imagined.

Personally, my take on basically-all capabilities evals which at all resemble the evals developed to date is You Are Not Measuring What You Think You Are Measuring; I expect them to mostly just not measure whatever turns out to matter in practice.

[Sorry for late reply]

Analogously, conditional on things like gradient hacking being an issue at all, I'd expect the "hacker" to treat potential-training-objective-improvement as a scarce resource, which it generally avoids "spending" unless the expenditure will strengthen its own structure. Concretely, this probably looks like mostly keeping itself decoupled from the network's output, except when it can predict the next backprop update and is trying to leverage that update for something.

So it's not a question of performing badly on the training metric s

...

I don't quite think this point is right. Gradient descent had to have been able to produce the highly polysemantic model and pack things together in a way which got lower loss. This suggests that it can also change the underlying computation. I might need to provide more explanation for my point to be clear, but I think considering how gradient descent learns a single polysemantic neuron and how it could update that neuron in response to distributional shifts could be informative.

There might be a specific notion of "tangled together" that is learned by gra...

I also think "a task AI" is a misleading way to think about this: we're reasonably likely to be using a heterogeneous mix of a variety of AIs with differing strengths and training objectives.

Perhaps a task AI driven corporation?

Why target speeding up alignment research during this crunch time period as opposed to just doing the work myself?

Conveniently, alignment work is the work I wanted to get done during that period, so this is nicely dual use. Admittedly, a reasonable fraction of the work will be on things which are totally useless at the start of such a period while I typically target things to be more useful earlier.

I also typically think the work I do is retargetable to general usages of AI (e.g., make 20 trillion dollars).

Beyond this, the world will probably be radically transformed prior to large scale usage of AIs which are strongly superhuman in most or many domains. (Weighting domains by importance.)

For doing alignment research, I often imagine things like speeding up the entire alignment field by >100x.

As in, suppose we have 1 year of lead time to do alignment research with the entire alignment research community. I imagine producing as much output in this year as if we spent >100x serial years doing alignment research without ai assistance.

This doesn't clearly require using superhuman AIs. For instance, perfectly aligned systems as intelligent and well-informed as the top alignment researchers which run at 100x the speed would clearly be suff...


My probabilities are very rough, but I'm feeling more like 1/3 ish today after thinking about it a bit more. Shrug.

As far as reasons for it being this high:

  • Conflict seems plausible to get to this level of lethality (see edit, I think I was a bit unclear or incorrect)
  • AIs might not care about acausal trade considerations before too late (seems unclear)
  • Future humans/AIs/aliens might decide it isn't morally important to particularly privilege currently alive humans

Generally, I'm happy to argue for 'we should be pretty confused and there are a decent number of good reasons why AIs might keep humans alive'. I'm not confident in survival overall though...

I agree that EY is quite overconfident and I think his arguments for doom are often sloppy and don't hold up. (I think the risk is substantial, but often the exact arguments EY gives don't work.) And his communication often fails to meet basic bars for clarity. I'd also probably agree with 'if EY were able to do so, improving his communication and arguments in a variety of contexts would be extremely good'. And specifically not saying crazy-sounding shit which is easily misunderstood would probably be good (there are some real costs here too). But, I'm not s...

Everyone, everyone, literally everyone in AI alignment is severely wrong about at least one core thing, and disagreements still persist on seemingly-obviously-foolish things.

If by 'severely wrong about at least one core thing' you just mean 'systemically severely miscalibrated on some very important topic', then my guess is that many people operating in the rough prosaic alignment paradigm probably don't suffer from this issue. It's just not that hard to be roughly calibrated. This is perhaps a random technical point.

I can't tell if this post is trying to discuss communicating about anything related to AI or alignment or is trying to more specifically discuss communication aimed at general audiences. I'll assume it's discussing arbitrary communication on AI or alignment.

I feel like this post doesn't engage sufficiently with the costs associated with high-effort writing and the alternatives to targeting arbitrary LessWrong users interested in alignment.

For instance, when communicating research it's cheaper to instead just communicate to people who are operating within t...

4TAG1mo
If you think that AI is going to kill everyone, sooner or later you are going to have to communicate that to everyone. That doesn't mean every article has to be at the highest level of comprehensibility, but it does mean you shouldn't end up with the in-group problem of being unable to communicate with outsiders at all.
7NicholasKross1mo
I think I agree with this regarding inside-group communication, and have now edited the post to add something kind-of-to-this-effect at the top.
7who am I?2mo
While writing well is one of the aspects focused on by the OP, your reply doesn't address the broader point, which is that EY (and those of similar repute/demeanor) juxtaposes his catastrophic predictions with his stark lack of effective exposition and discussion of the issue and potential solutions to a broader audience. To add insult to injury, he seems to actively try to demoralize dissenters in a very conspicuous and perverse manner, which detracts from his credibility and subtly but surely nudges people further and further from taking his ideas (and those similar) seriously. He gets frustrated by people not understanding him, hence the title of the OP implying the source of his frustration is his own murkiness, not a lack of faculty of the people listening to him.

To me, the most obvious examples of this are his guest appearances on podcasts (namely Lex Fridman's and Dwarkesh Patel's, the only two I've listened to). Neither of these hosts are dumb, yet by the end of their respective episodes, the hosts were confused or otherwise fettered and there was palpable repulsion between the hosts and EY. Considering these are very popular podcasts, it is reasonable to assume that he agreed to appear on these podcasts to reach a wider audience. He does other things to reach wider audiences, e.g. his twitter account and the Time Magazine article he wrote. Other people like him do similar things to reach wider audiences.

Since I've laid this out, you can probably predict what my thoughts are regarding the cost-benefit analysis you did. Since EY and similar folk are predicting outcomes as unfavorable as human extinction and are actively trying to recruit people from a wider audience to work towards their problems, is it really a reasonable cost to continue going about this as they have? Considering the potential impact on the field of AI alignment and the recruitment of individuals who may contribute meaningfully to addressing the challenges currently faced, I would argu

I broadly disagree with Yudkowsky on his vision of FOOM and think he's pretty sloppy wrt. AI takeoff overall.

But, I do think you're quite likely to get a rapid singularity if people don't intentionally slow things down. For instance, I broadly think the modeling in Tom Davidson's takeoff speeds report seems very reasonable to me, except that I think the default parameters he uses are insufficiently aggressive (I think compute requirements are likely to be somewhat lower than given in this report). Notably this model doesn't get you FOOM in a week (p...

3ryan_greenblatt2mo
See also this section where Tom talks about kinks in the underlying capabilities leading to rapid progress [https://docs.google.com/document/d/1DZy1qgSal2xwDRR0wOPBroYE_RDV1_2vvhwVz4dxCVc/edit#heading=h.apdvo0uwo5qe]

I think my views on takeoff/timelines are broadly similar to Paul's except that I have somewhat shorter takeoffs and timelines (I think this is due to thinking AI is a bit easier and also due to misc deference).

... Wait, why not? If AI exceeds the human capability range on STEM four years from now, I would call that 'very soon', especially given how terrible GPT-4 is at STEM right now.

The thesis here is not 'we definitely won't have twelve months to work with STEM-level AGI systems before they're powerful enough to be dangerous'; it's more like 'we won't

...
4Rob Bensinger2mo
Thanks for the replies, Ryan! I don't think that 'the very first STEM-level AGI is smart enough to destroy the world if you relax some precautions' and 'we have 2.5 years to work with STEM-level AGI before any system is smart enough to destroy the world' changes my p(doom) much at all. (Though this is partly because I don't expect, in either of those worlds, that we'll be able to be confident about which world we're in.) If we have 6 years to safely work with STEM-level AGI, that does intuitively start to feel like a significant net increase in p(hope) to me? Though this is complicated by the fact that such AGI probably couldn't do pivotal acts either, and having STEM-level AGI for a longer period of time before a pivotal act occurs means that the tech will be more widespread when it does reach dangerous capability levels. So in the endgame, you're likely to have a lot more competition, and correspondingly less time to spend on safety if you want to deploy before someone destroys the world.

I really hope this isn't a sticking point for people. I also strongly disagree with this being 'a fundamental point'.

2supposedlyfun2mo
I probably should have specified that my "potential converts" audience was "people who heard that Elon Musk was talking about AI risk something something, what's that?", and don't know more than five percent of the information that is common knowledge among active LessWrong participants.
4Raemon2mo
wait which thing are you hoping isn't the sticking point?

If you condition on misaligned AI takeover, my current (extremely rough) probabilities are:

  • 50% chance the AI kills > 99% of people
  • Conditional on killing >99% of people, 2/3 chance the AI kills literally everyone

By 'kill' here I'm not including things like 'the AI cryonically preserves everyone's brains and then revives people later'. I'm also not including cases where the AI lets everyone live a normal human lifespan but fails to grant immortality or continue human civilization beyond this point.
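Combining these two rough conditionals is a trivial arithmetic check (the numbers are the illustrative estimates above, not precise claims):

```python
p_kills_over_99 = 0.5        # P(AI kills >99% of people | misaligned takeover)
p_all_given_over_99 = 2 / 3  # P(kills literally everyone | kills >99%)

# Chain rule: P(kills everyone | takeover) is the product of the two.
p_kills_everyone = p_kills_over_99 * p_all_given_over_99
print(f"{p_kills_everyone:.2f}")  # → 0.33
```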

My beliefs here are due to a combination of causal...

1Tom Davidson1mo
Why are you at 50% AI kills >99% ppl given the points you make in the other direction?

I agree that much of LW has moved past the foom argument and is solidly on Eliezer's side relative to Robin Hanson; Hanson's views seem increasingly silly as time goes on (though they seemed much more plausible a decade ago, before e.g. the rise of foundation models and the shortening of timelines to AGI). The debate is now more like Yud vs. Christiano/Cotra than Yud vs. Hanson.

It seems worth noting that the views and economic modeling you discuss here seem broadly in keeping with Christiano/Cotra (but with more aggressive constants).

3Daniel Kokotajlo2mo
Yep! On both timelines and takeoff speeds I'd describe my views as "Like Ajeya Cotra's and Tom Davidson's but with different settings of some of the key variables."

A common misconception is that STEM-level AGI is dangerous because of something murky about "agents" or about self-awareness. Instead, I'd say that the danger is inherent to the nature of action sequences that push the world toward some sufficiently-hard-to-reach state.

Call such sequences "plans".

If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-

...
2Rob Bensinger2mo
I don't think your claim makes the argument circular / question-begging; it just means there's an extra step in explaining why and how a random action sequence destroys the world.

Maybe you mean that I'm putting the emphasis in the wrong place, and it would be more illuminating to highlight some specific feature of random smart short programs as the source of the 'instrumental convergence' danger? If so, what do you think that feature is?

From my current perspective I think the core problem really is that most random short plans that succeed in sufficiently-hard tasks kill us. If the causal process by which this happens includes building a powerful AI optimizer, or building an AI that builds an AI, or building an AI that builds an AI that builds an AI, etc., then that's interesting and potentially useful to know, but that doesn't seem like the key crux to me, and I'm not sure it helps further illuminate where the danger is ultimately coming from.

Very happy to hear someone with an idea like this who explicitly flags that we shouldn't gamble on this being true!

This post seems to argue for fast/discontinuous takeoff without explicitly noting that people working in alignment often disagree. Further I think many of the arguments given here for fast takeoff seem sloppy or directly wrong on my own views.

It seems reasonable to just give your views without noting disagreement, but if the goal is for this to be a reference for the AI risk case, then I think you should probably note where people (who are still sold on AI risk) often disagree. (Edit: It looks like Rob explained his goals in a footnote.)

Another large pie

...
7Rob Bensinger2mo
If I had a list of 5-10 resources that folks like Paul, Holden, Ajeya, Carl, etc. see as the main causes for optimism, I'd be happy to link those resources (either in a footnote or in the main body). I'd definitely include something like 'survey data on the same population as my 2021 AI risk survey [https://www.lesswrong.com/posts/QvwSr5LsxyDeaPK5s/existential-risk-from-ai-survey-results], saying how much people agree/disagree with the ten factors", though I'd guess this isn't the optimal use of those people's time even if we want to use that time to survey something?

One of the options in Eliezer's Manifold market on AGI hope [https://manifold.markets/EliezerYudkowsky/if-artificial-general-intelligence?r=RWxpZXplcll1ZGtvd3NreQ] is: When I split up probability mass [https://docs.google.com/document/d/1aQIvZBD3HjlNUsMyolULjX5BnM2He5x6pnBOxB8UyH4/edit] a month ago between the market's 16 options, this one only got 1.5% of my probability mass (12th place out of the 16). This obviously isn't the same question we're discussing here, but it maybe gives some perspective on why I didn't single out this disagreement above the many other disagreements I could devote space to that strike me as way more relevant to hope? (For some combination of 'likelier to happen' and 'likelier to make a big difference for p(doom) if they do happen'.)

... Wait, why not? If AI exceeds the human capability range on STEM four years from now, I would call that 'very soon', especially given how terrible GPT-4 is at STEM right now. The thesis here is not 'we definitely won't have twelve months to work with STEM-level AGI systems before they're powerful enough to be dangerous'; it's more like 'we won't have decades'. Somewhere between 'no time' and 'a few years' seems extremely likely to me, and I think that's almost definitely not enough time to figure out alignment for those systems. (Admittedly, in the minority of worlds where STEM-level AGI systems are totally safe for the first two years
4Mo Putera2mo
That's probably not what Rob is doing:
3shminux2mo
Yes, sorry, some definitely will. But if you look at what is going on now, people are pushing in all kinds of dangerous directions with reckless abandon, even knowing logically that it might be a bad idea.

Counterargument: you can just defend against these AIs running amuck.

As long as most AIs are systematically trying to further human goals you don't obviously get doomed (though the situation is scary).

There could be offense-defense imbalances, but there are also 'tyranny of the majority' advantages.

3shminux2mo
That's not the point though. Humans don't want to defend, they want to press the big red button and will gain-of-function an AI to make the button bigger and redder.

Distilling inference-based approaches into learning is usually reasonably straightforward. I think this also applies in this case.

This doesn't necessarily apply to 'learning how to learn'.

(That said, I'm less sold that retrieval + chain of thought 'mostly solves autonomous learning')

(Note: this comment is rambly and repetitive, but I decided not to spend time cleaning it up)

It sounds like you believe something like: "There are autonomous learning style approaches which are considerably better than the efficiency on next token prediction."

And more broadly, you're making a claim like 'current learning efficiency is very low'.

I agree - brains imply that it's possible to learn vastly more efficiently than deep nets, and my guess would be that performance can be far, far better than brains.

Suppose we instantly went from 'current status quo... (read more)

3Gerald Monroe2mo
Just to add to your thinking: consider also your hypothetical "experiment A vs experiment B". Suppose the AI tasked with the decision is both more capable than the best humans, but by a plausible margin (it's only 50 percent better) and can make the decision in 1 hour. (At 10 tokens a second it deliberates for a while, using tools and so on). But the experiment is an AI training run and results won't be available for 3 weeks. So the actual performance comparison is the human took one week and had a 50 percent pSuccess, and the AI took 1 hour and had a 75 percent pSuccess. So your success per day is 75/(21 days) and for the human it's 50/(28 days). Or in real world terms, the AI is 2 times as effective. In this example it is an enormous amount smarter, completing 40-80 hours of work in 1 hour and better than the best human experts by a 50 percent margin. Probably the amount of compute required to accomplish this (and the amount of electricity and patterned silicon) is also large. Yet in real world terms it is "only" twice as good. I suspect this generalizes a lot of places, where AGI is a large advance but it won't be enough to foom due to the real world gain being much smaller.
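The arithmetic in the reply above can be checked with a quick sketch (all numbers are the comment's hypotheticals, not real measurements):

```python
# Back-of-envelope numbers from the comment above (hypothetical, not measured).
# Both the AI and the human must wait ~3 weeks for the training run itself;
# only the decision time (1 hour vs 1 week) differs.
experiment_days = 21

ai_total_days = experiment_days + 1 / 24   # AI decides in 1 hour
human_total_days = experiment_days + 7     # human decides in 1 week

ai_p_success, human_p_success = 0.75, 0.50

# Success probability per calendar day, matching the comment's 75/21 vs 50/28.
ai_rate = ai_p_success / ai_total_days
human_rate = human_p_success / human_total_days

print(round(ai_rate / human_rate, 2))  # ~2: the AI is about twice as effective
```

The wall-clock bottleneck outside the AI's control (the experiment itself) compresses a large capability gap into a modest real-world speedup.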

So I propose “somebody gets autonomous learning to work stably for LLMs (or similarly-general systems)” as a possible future fast-takeoff scenario.

Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations. For instance, suppose that data doesn't run out despite scaling and autonomous learning is moderately to considerably less efficient than supervised learning. Then, you'd just do supervised learning. Now, we can imagine fast takeoff scenarios where:

  • Scaling runs into
... (read more)

Broadly speaking, autonomous learning doesn't seem particularly distinguished relative to supervised learning unless you have data limitations.

Suppose I ask you to spend a week trying to come up with a new good experiment to try in AI. I give you two options.

Option A: You need to spend the entire week reading AI literature. I choose what you read, and in what order, using a random number generator and selecting out of every AI paper / textbook ever written. While reading, you are forced to dwell for exactly one second—no more, no less—on each word of the t... (read more)

I added some caveats about the potential for empirical versions of moral realism and about how precise value targets are in practice.

While the target is small in mind space, IMO, it's not that small wrt. things like the distribution of evolved life or more narrowly the distribution of humans.

I roughly agree with Akash's comment.

But also some additional points:

  • It's decently likely that it will be pretty easy to get GPT-7 to avoid breaking the law or other egregious issues. As systems get more capable, basic alignment approaches get better at preventing stuff we can measure well. It's plausible that scalable approaches will be needed to avoid egregious and obvious failures which aren't takeover, but it currently seems unlikely. (see also list of lethalities and here and : "The difference is that reality doesn’t force us to solve the problem, or
... (read more)

I left another comment on my experience doing interpretability research, but I'd also like to note some overall disagreements with the post.

First, it's very important to note that GPT-4 was trained with SFT (supervised finetuning) and RLHF. I haven't played with GPT-4, but for prior models this has a large effect on the way the model responds to inputs. If the data were public, I would guess that looking at the SFT and RLHF data would often be considerably more useful than looking at the pretraining data. This doesn't morally contradict the post, but it's worth noting it's import... (read more)

My personal experience doing interpretability research has led me to thinking this sort of consideration is quite important for interpretability on pretrained LLMs (but not necessarily for other domains idk).

I've found it quite important when doing interpretability research to have read through a reasonable amount of the training data for the (small) models I'm looking at. In practice I've read an extremely large amount of OpenWebText (mostly while doing interp) and played this prediction game a decent amount: http://rr-lm-game.herokuapp.com/ (from here: h... (read more)

Perhaps the model wasn't allowed to read the sources for the free response section?

1Theresa Barton3mo
I think performance on AP english might be a quirk of how they dealt with dataset contamination. English and Literature exams showed anomalous amount of contamination (lots of the famous texts are online and referenced elsewhere) so they threw out most of the questions, leading to a null conclusion about performance.

comment TLDR: Adversarial examples are a weapon we can use for good against AIs, and solving adversarial robustness would let the AIs harden themselves.

I haven't read this yet (I will later : ) ), so it's possible this is mentioned, but I'd note that exploiting the lack of adversarial robustness could also be used to improve safety. For instance, AI systems might have a hard time keeping secrets if they also need to interact with humans trying to test for verifiable secrets. E.g., trying to jailbreak AIs to get them to tell you about the fact that they ... (read more)

1AdamGleave3mo
This is a good point, adversarial examples in what I called in the post the "main" ML system can be desirable even though we typically don't want them in the "helper" ML systems used to align the main system. One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether human or other, misaligned AIs). But this might be fine in some settings: e.g. you build some airgapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial example backdoors that are cryptographically hard to discover if you don't know how they were constructed. I generally expect that if adversarial robustness can be practically solved then transformative AI systems will eventually self-improve themselves to the point of being robust. So, the window where AI systems are dangerous & deceptive enough that we need to test them using adversarial examples but not capable enough to have overcome this might be quite short. Could still be useful as an early-warning sign, though.

I'm at like 40% doom, then conditional on doom like 50/50 on nearly all (>99%) humans killed within a year (I'm talking about information death here: freezing brains and reviving them later doesn't count as death; if they're never revived, then it's death), then conditioned on nearly all humans killed I'm at maybe 75% on literally all humans killed within a year.

So, overall I'm at 15% on literally all humans dead?

Numbers aren't in reflective equilibrium. I find the arguments for the AI killing nearly everyone not that compelling.
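For concreteness, the conditional estimates above multiply out as follows (a sketch of the stated numbers, which, as noted, aren't in reflective equilibrium):

```python
# Chaining the comment's conditional probabilities (illustrative only).
p_doom = 0.40                    # P(doom)
p_nearly_all_given_doom = 0.50   # P(>99% of humans killed | doom)
p_all_given_nearly_all = 0.75    # P(literally all killed | >99% killed)

p_all_dead = p_doom * p_nearly_all_given_doom * p_all_given_nearly_all
print(round(p_all_dead, 2))  # 0.15, i.e. the 15% figure
```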

Simulations are not the most efficient way for A and B to reach their agreement

Are you claiming that the marginal returns to simulation are never worth the costs? I'm skeptical. I think it's quite likely that some number of acausal trade simulations are run even if that isn't where most of the information comes from. I think there are probably diminishing returns to various approaches and thus you both do simulations and other approaches. There's a further benefit to sims, which is that credence about sims affects the behavior of CDT agents, but it's unc... (read more)

It is indeed pretty weird to see these behaviors appear in pure LMs. It's especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.

By 'pure LMs' do you mean 'pure next-token-predicting LLMs trained on a standard internet corpus'? If so, I'd be very surprised if they're miscalibrated, assuming this prompt isn't that improbable (which it probably isn't). I'd guess this output is the 'right' output for this corpus (so long as you don't sample enough tokens to make the sequence detectably very... (read more)

(Context, I work at Redwood)

While we're on the topic, it's perhaps useful to more directly describe my concerns about distribution-specific understanding of models, and especially narrow-distribution understanding of the kind a lot of work building Causal Scrubbing seems to be focusing on.

Can I summarize your concerns as something like "I'm not sure that looking into the behavior of "real" models on narrow distributions is any better research than just training a small toy model on that narrow distribution and interpreting it?" Or perhaps you think it'... (read more)

8Christopher Olah4mo
Between the two, I might actually prefer training a toy model on a narrow distribution! But it depends a lot on exactly how the analysis is done and what lessons one wants to draw from it. Real language models seem to make extensive use of superposition. I expect there to be lots of circuits superimposed with the one you're studying, and I worry that studying it on a narrow distribution may give a misleading impression – as soon as you move to a broader distribution, overlapping features and circuits which you previously missed may activate, and your understanding may in fact be misleading. On the other hand, for a model just trained on a toy task, I think your understanding is likely closer to the truth of what's going on in that model. If you're studying it over the whole training distribution, features either aren't in superposition (there's so much free capacity in most of these models this seem possible!) or else they'll be part of the unexplained loss, in your language. So choosing to use a toy model is just a question of what that model teaches you about real models (for example, you've kind of side-stepped superposition, and it's also unclear to what extent the features and circuits in a toy model represent the larger model). But it seems much clearer what is true, and it also seems much clearer that these limitations exist.

Thinking about the state and the time-evolution rules for the state seems fine, but there isn't any interesting structure with the naive formulation imo. The state is the entire text, so we don't get any interesting Markov chain structure. (You can turn any random process into a Markov chain by including the entire history in the state! The interesting property was that the past didn't matter!)
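The triviality being pointed at can be sketched in a few lines (a toy construction, not a claim about any particular model):

```python
import random

# Any sequence process becomes "Markov" if the state is the full history:
# the next state depends only on the current state, but only because the
# current state already contains everything that ever happened.

def transition(state: tuple) -> tuple:
    # Stand-in for next-token sampling conditioned on the entire text so far.
    next_token = random.choice(["a", "b"])
    return state + (next_token,)

state = ()  # initial state: the empty text
for _ in range(5):
    state = transition(state)

# The "Markov state" is just the whole 5-token history.
print(len(state))  # 5
```

Nothing about the past gets forgotten, so the usual payoff of the Markov property (a compact state) is absent.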

3Charlie Steiner4mo
Hm, I mostly agree. There isn't any interesting structure by default, you have to get it by trying to mimic a training distribution that has interesting structure. And I think this relates to another way that I was too reductive, which is that if I want to talk about "simulacra" as a thing, then they don't exist purely in the text, so I must be sneaking in another ontology somewhere - an ontology that consists of features inferred from text (but still not actually the state of our real universe).
2LawrenceC4mo
Nitpick: I mean, technically, the state is only the last 4k tokens or however long your context length is. Though I agree this is still very uninteresting. 

The way I tend to think of 'simulators' is in simulating a distribution over worlds (i.e., latent variables) that increasingly collapses as prompt information determines specific processes with higher probability.

I agree this is the correct interpretation of the original post. It just doesn't match typical usage of the word 'simulation' imo. (I'm sorry my post is making such a narrow pedantic point.)

I probably agree that simulators improved the thinking of people on lesswrong on average.

1Jozdien4mo
I don't disagree that there aren't people who came away with the wrong impression (though they've been at most a small minority of people I've talked to, you've plausibly spoken to more people). But I think that might be owed more to generative models being confusing to think about intrinsically. Speaking of them purely as predictive models probably nets you points for technical accuracy, but I'd bet it would still lead to a fair number of people thinking about them the wrong way.

Fwiw, dropout hasn't fallen out of favor very much.

I think dropout makes nets less interpretable (wrt. naive interp strats). This is based on my recollection, I forget what exact experiments we have and haven't run.

2interstice4mo
OK, good to know. I had a cached belief that it had declined in popularity which probably exaggerated the extent.

FWIW, white box alignment doesn't imply humans understand what the models are thinking. There are other ways to leverage the fact that we have access to the internals.

1Noosphere894mo
I was using it as a synonym for alignment with interpretability compared to without interpretability.

I guess I'm considerably more optimistic on avoiding AI takeover without humans understanding what the models are thinking. (Or possibly you're more optimistic about slowing down AI)

1Noosphere894mo
Basically this. I am a lot more pessimistic around black box alignment than I am around white box alignment.

I would argue that ARC's research is justified by (1) (roughly speaking). Sadly, I don't think that there are enough posts on their current plans for this to be clear or easy for me to point at. There might be some posts coming out soon.
