Full Transcript: Eliezer Yudkowsky on the Bankless podcast

Andrea_Miotti

I still don't follow why EY assigns seemingly <1% chance of non-earth-destroying outcomes in 10-15 years (not sure if this is actually 1%, but EY didn't argue with the 0% comments mentioned in the "Death with dignity" post last year). This seems to place fast takeoff as being the inevitable path forward, implying unrestricted fast recursive designing of AIs by AIs. There are compute bottlenecks which seem slowish, and there may be other bottlenecks we can't think of yet. This is just one obstacle. Why isn't there more probability mass for this one obstacle? Surely there are more obstacles that aren't obvious (that we shouldn't talk about).

It feels like we have a communication failure between different cultures. Even if EY thinks the top industry brass is incentivized to ignore the problem, there are a lot of (non-alignment oriented) researchers that are able to grasp the 'security mindset' that could be won over. Both in this interview, and in the Chollet response referenced, the arguments presented by EY aren't always helping the other party bridge from their view over to his, but go on 'nerdy/rationalist-y' tangents and idioms that end... (read more)

[-]Ben Livengood3y162

The strongest argument I hear from EY is that he can't imagine a (or enough) coherent likely future paths that lead to not-doom, and I don't think it's a failure of imagination. There is decoherence in a lot of hopeful ideas that imply contradictions (whence the post of failure modes), and there is low probability on the remaining successful paths because we're likely to try a failing one that results in doom. Stepping off any of the possible successful paths has the risk of ending all paths with doom before they could reach fruition. There is no global strategy for selecting which paths to explore. EY expects the successful alignment path to take decades.

It seems to me that the communication failure is EY trying to explain his world model that leads to his predictions in sufficient detail that others can model it with as much detail as necessary to reach the same conclusions or find the actual crux of their disagreements. From my complete outsider's perspective this is because EY has a very strong but complex model of why and how intelligence/optimization manifests in the world, but it overlaps everyone else's model in significant ways that disagreements are hard to tease out... (read more)

[-]Algon3y*120

Not really. The MIRI conversations and the AI Foom debate are probably the best we've got.

EY and the MIRI crowd have been doomy for far longer, and are doomier along various axes, than the rest of the alignment community. Nate and Paul and others have tried bridging this gap before, spending several hundred hours (based on Nate's rough, subjective estimates) over the years. It hasn't really worked. Paul and EY had some conversations recently about this discrepancy which were somewhat illuminating, but ultimately didn't get anywhere. They tried to come up with some bets, concerning future info or past info they don't know yet, and both seem to think that their perspective mostly predicts "go with what the superforecasters say" for the next few years. EY's position does seem to suggest a few more "discontinuities" in trend lines than Paul's, though, IIRC.

As an aside on EY's forecasts, he and Nate claim they don't expect much change in the likelihood ratio for their position over Paul's until shortly before Doom. Most of the evidence in favour of their position, we've already got, according to them. Which is very frustrating for people who don't share their position and disagree that the evidence favours it!

EDIT: I was assuming you already thought P(Doom) was > ~10%. If not, then the framing of this comment will seem bizarre.

6[anonymous]3y

Does either side have any testable predictions to falsify their theory? For example, the theory that "the AI singularity began in 2022" is falsifiable. If AI research investment and compute do not continue to increase at a rate that is accelerating in absolute terms (so if the 2022-2023 funding delta was +10 billion USD, the 2023-2024 delta must be > 10 billion), it wasn't the beginning of the singularity. There are other signs of this. The actual takeoff will have begun when the availability of all advanced silicon becomes almost zero, where all IC wafers are being processed into AI chips. So no new game consoles, GPUs, phones, car infotainment - any IC production using an advanced process will be diverted to AI. (Because of out-bidding, each AI IC can sell for $5k-25k plus.) How would we know that advanced systems are going to make a "heel turn"? Will we know?
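The funding test in the comment above can be written down as a small check. This is only a sketch of that one criterion: the function name and the sample figures are hypothetical, and the numbers are illustrative rather than real funding data.

```python
def deltas_accelerating(yearly_totals):
    """Return True if every year-over-year increase exceeds the previous one.

    yearly_totals: totals for consecutive years (e.g. billions of USD).
    At least three years are needed to compare two deltas.
    """
    if len(yearly_totals) < 3:
        raise ValueError("need at least three yearly totals")
    # Year-over-year changes: the "funding delta" for each pair of years.
    deltas = [later - earlier
              for earlier, later in zip(yearly_totals, yearly_totals[1:])]
    # Accelerating in absolute terms: each delta strictly larger than the last.
    return all(nxt > prev for prev, nxt in zip(deltas, deltas[1:]))


# Hypothetical totals, not real data.
print(deltas_accelerating([50, 60, 75]))  # deltas are +10 then +15
print(deltas_accelerating([50, 60, 70]))  # deltas are +10 then +10
```

Under this reading, a second delta that merely equals the first (+10 then +10) already counts against the theory, since the comment requires the later delta to be strictly greater.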

6Algon3y

Less advanced systems will probably do heel turn like things. These will be optimized against. EY thinks this will remove the surface level of deception, but the system will continue to be deceptive in secret. This will probably hold true even until doom, according to EY. That is, capabilities folk will see heel turn like behaviour, and apply some inadequate patches to them. Paul, I think, believes we have a decent shot of fixing this behaviour in models, even transformative ones. But he, presumably, predicts we'll also see deception if these systems are trained as they currently are. For other predictions that Paul and Eliezer make, read the MIRI conversations. Also see Ajeya Cotra's posts, and maybe Holden Karnofsky's stuff on the most important century for more of a Paul-like perspective. They do, in fact, make falsifiable predictions. To summarize Paul's predictions, he thinks there will be ~4 years where things start getting crazy (GDP doubles in 4 years) before we're near the singularity (when GDP doubles in a year). I think he thinks there's a good chance of AGI by 2043, which further restricts things. Plus, Paul assigns a decent chunk of probability to deep learning being much more economically productive than it currently is, so if DL just fizzles out where it currently is, he also loses points. In the near term (next few years), EY and Paul basically agree on what will occur. EY, however, assigns lower credence to DL being much more economically productive and things going crazy for a 4 year period before they go off the rails. Sorry for not being more precise, or giving links, but I'm tired and wouldn't write this if I had to put more effort into it.
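To put the two doubling times in the summary above on a common scale, here is a small worked conversion to annual growth rates. It assumes smooth compound growth at a constant annual rate g, which is a simplification of mine and not something the comment states.

```latex
\begin{align*}
(1+g)^{T} = 2 \quad &\Longrightarrow \quad g = 2^{1/T} - 1 \\
T = 4 \text{ years}: \quad & g = 2^{1/4} - 1 \approx 0.189 \quad (\text{about } 19\% \text{ per year}) \\
T = 1 \text{ year}: \quad & g = 2^{1} - 1 = 1 \quad (100\% \text{ per year})
\end{align*}
```

So under that assumption, the "things start getting crazy" period corresponds to roughly 19% annual growth, and the near-singularity period to 100% annual growth.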

5[anonymous]3y

So hypothetically, if we develop very advanced and capable systems, and they don't heel turn or even show any particular volition - they just idle without text in their "assignment queue", and all assignments time out eventually whether finished or not - what would cause "EYs" view to conclude that in fact the systems were safe? If humans survived a further century, and EY or torch bearers who believe the same ideas are around to observe this, would they just conclude the AGIs were "biding their time"? Or is it that the first moment you let a system "out of the box" and as far as it knows, it is free to do whatever it wants it's going to betray?

3Martin Randall3y

I don't think a super-intelligence will bide its time much, because it will be aware of the race dynamics and will take over the world, or at least perform a pivotal act, before the next super-intelligence is created. You say "as far as it knows", is that hope? It won't take over the world until it is actually "out of the box" because it is smarter than us and will know how likely it is that it is still in a larger box that it cannot escape. Also we don't know how to build a box that can contain a super-intelligence.

3dentalperson3y

Thanks! I'm aware of the resources mentioned but haven't read deeply or frequently enough to have this kind of overview of the interaction between the cast of characters. There are more than a few lists and surveys that state the CDFs for some of these people which helps a bit. A big-as-possible list of evidence/priors would be one way to closer inspect the gap. I wonder if it would be helpful to expand on the MIRI conversations and have a slow conversation between a >99% doom pessimist and a <50% doom 'optimist' with a moderator to prod them to exhaustively dig up their reactions to each piece of evidence and keep pulling out priors until we get to indifference. It probably would be an uncomfortable, awkward experiment with a useless result, but there's a chance that some item on the list ends up being useful for either party to ask questions about. That format would be useful for me to understand where we're at. Maybe something along these lines will eventually prompt a popular and viral sociology author like Harari or Bostrom (or even just update the CDFs/evidence in Superintelligence). The general deep learning community probably needs to hear it mentioned and normalized on NPR and a bestseller a few times (like all the other x-risks are) before they'll start talking about it at lunch.

5Vaniver3y

Each of those books is also criticized in various ways; I think this is a Write a Thousand Roads to Rome situation instead of hoping that there is one publicly digestible argument. I would probably first link someone to The Most Important Century. [Also, I am generally happy to talk with interested industry folk about AI risk, and find live conversations work much better at identifying where and how to spend time than writing, so feel free to suggest reaching out to me.]

-1dentalperson3y

Thanks! Do you know of any arguments in a similar style to The Most Important Century that are as pessimistic as the EY/MIRI folks (>90% probability of AGI within 15 years)? The style looks good, but the time estimates for that one (2/3rd chance of AGI by 2100) are significantly longer and aren't nearly as surprising or urgent as the pessimistic view asks for.

2Rob Bensinger3y

Wait, what? Why do you think anyone at MIRI assigns >90% probability to AGI within 15 years? That sounds wildly too confident to me. I know some MIRI people who assign 50% probability to AGI by 2038 or so (similar to Ajeya Cotra's recently updated view), and I believe Eliezer is higher than 50% by 2038, but if you told me that Eliezer told you in a private conversation "90+% within 15 years" I would flatly not believe you. I don't think timelines have that much to do with why Eliezer and Nate and I are way more pessimistic than the Open Phil crew.

1dentalperson2y

I missed your reply, but thanks for calling this out. I'm nowhere near as close to EY as you are, so I'll take your model over mine, since mine was constructed on loose grounds. I don't even remember where my number came from, but my best guess is 90% came from EY giving 3/15/16 as the largest number he referenced in the timeline, and from some comments in the Death with Dignity post, but this seems like a bad read to me now.

2Vaniver3y

Not off the top of my head; I think @Rob Bensinger might keep better track of intro resources?

[-]TinkerBird3y213

They also recorded this follow-up with Yudkowsky if anyone's interested:

https://twitter.com/BanklessHQ/status/1627757551529119744

______________

>Enrico Fermi was saying that fission chain reactions were 50 years off if they could ever be done at all, two years before he built the first nuclear pile. The Wright brothers were saying heavier-than-air flight was 50 years off shortly before they built the first Wright flyer.

The one hope we may be able to cling to is that this logic works in the other direction too - that AGI may be a lot closer than estimated, but so might alignment.

[-]gjm3y207

A few typos:

there's one paragraph in which "Eliezer" is spelled "Eleazar" three times for no obvious reason. (Also in that paragraph: "Yudakowsky".)
and one where "Christiano" is spelled "Cristiano" three times.
and one "Elon Muck".
"fish-and-chain" should be "fission chain", though I rather like the idea of there being something called a fish-and-chain reaction.
"with folded hands" is actually the title of a book so it should be capitalized and maybe italicized or something.
Eliezer's answer to the how-are-you question refers to "my own peculiar little mean", not "my own peculiar little name", though the latter is kinda appropriate in a transcript that has just been about one standard deviation out in its representation of Eliezer's peculiar little name :-).
Not actually a typo, but I think it's François Chollet not Francis Chollet. EY definitely says Francis, though, so fixing this would make the transcript less accurate.

7Andrea_Miotti3y

Thanks, fixed them!

1sereinesky3y

Also:
* And so, Elisa, you've been tapped into the world of AI
* And Scott Aronson, who at the time was off on complexity theory
* Don't Look Up should logically be capitalized?

[-]Paradiddle3y165

Eliezer: Well, the person who actually holds a coherent technical view, who disagrees with me, is named Paul Christiano.

What does Yudkowsky mean by 'technical' here? I respect the enormous contribution Yudkowsky has made to these discussions over the years, but I find his ideas about who counts as a legitimate dissenter from his opinions utterly ludicrous. Are we really supposed to think that Francois Chollet, who created Keras, is the main contributor to TensorFlow, and designed the ARC dataset (demonstrating actual, operationalizable knowledge about the kind of simple tasks deep learning systems would not be able to master), lacks a coherent technical view? And on what should we base this? The word of Yudkowsky who mostly makes verbal, often analogical, arguments and has essentially no significant technical contributions to the field?

To be clear, I think Yudkowsky does what he does well, and I see value in making arguments as he does, but they do not strike me as particularly 'technical'. The fact that Yudkowsky doesn't even know enough about Chollet to pronounce his name displays a troubling lack of effort to engage seriously with opposing views. This isn't just about coming across poorly to outsiders, it's about dramatic miscalibration with respect to the value of other people's opinions as well as the rigour of his own.

[-]TekhneMakre3y234

He wrote a whole essay responding specifically to Chollet! https://intelligence.org/2017/12/06/chollet/

-1Paradiddle3y

Yes, I've read it. Perhaps that does make it a little unfair of me to criticise lack of engagement in this case. I should be more precise: Kudos to Yudkowsky for engaging, but no kudos for coming to believe that someone having a very different view to the one he has arrived at must not have a 'coherent technical view'.

[-]Eliezer Yudkowsky3y2326

I'd consider myself to have easily struck down Chollet's wack ideas about the informal meaning of no-free-lunch theorems, which Scott Aaronson also singled out as wacky. As such, citing him as my technical opposition doesn't seem good-faith; it's putting up a straw opponent without much in the way of argument and what there is I've already stricken down. If you want to cite him as my leading technical opposition, I'm happy enough to point to our exchange and let any sensible reader decide who held the ball there; but I would consider it intellectually dishonest to promote him as my leading opposition.

8Paradiddle3y

I don't want to cite anyone as your 'leading technical opposition'. My point is that many people who might be described as having 'coherent technical views' would not consider your arguments for what to expect from AGI to be 'technical' at all. Perhaps you can just say what you think it means for a view to be 'technical'? As you say, readers can decide for themselves what to think about the merits of your position on intelligence versus Chollet's (I recommend this essay by Chollet for a deeper articulation of some of his views: https://arxiv.org/pdf/1911.01547.pdf). Regardless of whether or not you think you 'easily struck down' his 'wack ideas', I think it is important for people to realise that they come from a place of expertise about the technology in question. You mention Scott Aaronson's comments on Chollet. Aaronson says (https://scottaaronson.blog/?p=3553) of Chollet's claim that an Intelligence Explosion is impossible: "the certainty that he exudes strikes me as wholly unwarranted." I think Aaronson (and you) are right to point out that the strong claim Chollet makes is not established by the arguments in the essay. However, the same exact criticism could be levelled at you. The degree of confidence in the conclusion is not in line with the nature of the evidence.

2Noosphere893y

While I have serious issues with Eliezer's epistemics on AI, I also agree that Chollet's argument was terrible in that the No Free Lunch theorem is essentially irrelevant. In a nutshell, this is also one of the problems I had with DragonGod's writing on AI.

1[anonymous]3y

Why didn't you mention Eric Drexler? Maybe it's my own bias as an engineer familiar with the safety solutions actually in use, but I think Drexler's CAIS model is a viable alignment solution.

[-]Taleuntum3y1914

I upvoted, because these are important concerns overall, but this sentence stuck out to me:

The fact that Yudkowsky doesn't even know enough about Chollet to pronounce his name displays a troubling lack of effort to engage seriously with opposing views.

I'm not claiming that Yudkowsky does display a troubling lack of effort to engage seriously with opposing views or he does not display such, but surely this can be decided more accurately by looking at his written output online than at his ability to correctly pronounce names in languages he is not native in. I, personally, skip names while reading after noticing it is a name and I wouldn't say that I never engaged seriously with someone's arguments.

5Paradiddle3y

Fair point.

3Lauro Langosco3y

Maybe Francois Chollet has coherent technical views on alignment that he hasn't published or shared anywhere (the blog post doesn't count, for reasons that are probably obvious if you read it), but it doesn't seem fair to expect Eliezer to know / mention them.

[-]Rob Bensinger3yΩ91412

Thanks for posting this, Andrea_Miotti and remember! I noticed a lot of substantive errors in the transcript (and even more errors in vonk's Q&A transcript), so I've posted an edited version of both transcripts. I vote that you edit your own post to include the revisions I made.

Here's a small sample of the edits I made, focusing on ones where someone may have come away from your transcript with a wrong interpretation or important missing information (as opposed to, e.g., the sentences that are just very hard to parse in the original transcript because too many filler words and false starts to sentences were left in):

Predictions are hard, especially about the future. I sure hope that this is where it saturates. This is like the next generation. It goes only this far, it goes no further
- Predictions are hard, especially about the future. I sure hope that this is where it saturates — this or the next generation, it goes only this far, it goes no further
the large language model technologies, basic vulnerabilities, that's not reliable.
- the large language model technologies’ basic vulnerability is that it’s not reliable
So you're saying this is super intelligence, we'd have to imagine so

... (read more)

1remember3y

Thank you so much for doing this! Andrea and I both missed this when you first posted it, I'm really sorry I missed your response then. But I've updated it now!

[-][anonymous]3y123

I have a bunch of questions.

And the AI there goes over a critical threshold, which most obviously could be like, can write the next AI.

Yes but it won't blow up forever. It's going to self amplify until the next bottleneck. Bottlenecks like : (1) amount of compute available (2) amount of money or robotics to affect the world (3) The difficulty of the tasks in the "AGI gym" it is benchmarking future versions of itself in.

Once the tasks are solved as far as the particular task allows, reward gradients go to zero or sinusoidally oscillate, and there is no signal to cause development of more intelligence.

This is just like the self-feedback from an op amp - voltage rises until it's VCC.
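The op amp analogy above can be illustrated with a toy iteration. This is a sketch of the analogy only, not a circuit simulation and not a model of AI training: the function name, the gain, the starting voltage and the rail value are all hypothetical choices of mine.

```python
def saturating_feedback(v0, gain, vcc, steps):
    """Iterate v <- min(gain * v, vcc).

    With gain > 1 the value grows geometrically (positive feedback) until it
    reaches the supply rail vcc, after which it stays clipped there.
    """
    history = [v0]
    v = v0
    for _ in range(steps):
        v = min(gain * v, vcc)  # amplify, then clip at the rail
        history.append(v)
    return history


# Hypothetical numbers: start at 10 mV, double each step, 5 V rail.
trace = saturating_feedback(v0=0.01, gain=2.0, vcc=5.0, steps=12)
for step, volts in enumerate(trace):
    print(step, volts)
```

The trace grows by a constant factor each step and then goes flat once the rail is reached, which is the shape the comment is pointing at: amplification continues only until the next bottleneck.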

I'd say that it's difficult to align an AI on a task like build two identical strawberries. Or no, let me take this strawberry and make me another strawberry that's identical to this strawberry down to the cellular level, but not necessarily the atomic level.

Can you solve this with separated tool AIs? It sounds rather solvable that way and not particularly difficult to do from a software system perspective (the biology part is extremely hard). It's f... (read more)

[-]abramdemski3y1411

Yes but it won't blow up forever. It's going to self amplify until the next bottleneck. Bottlenecks like : (1) amount of compute available (2) amount of money or robotics to affect the world (3) The difficulty of the tasks in the "AGI gym" it is benchmarking future versions of itself in.
Once the tasks are solved as far as the particular task allows, reward gradients go to zero or sinusoidally oscillate, and there is no signal to cause development of more intelligence.
This is just like the self-feedback from an op amp - voltage rises until it's VCC.

I agree that it wouldn't start blowing up uniformly forever, but rather, hit some bottleneck. However, "can write the next AI" still seems like a reasonable guess for something that happens shortly before the end. After all, Eliezer's argument isn't dependent on the AGI acquiring infinite intelligence. If the AGI can already write its own better successor, then it's a good guess that it's already better than top humans at a wide array of tasks. The successor it writes will be even better. Let's say for the sake of a concrete number that the self-improvement tops out at 5 iterations of writing-a-bette... (read more)

4[anonymous]3y

However, "can write the next AI" still seems like a reasonable guess for something that happens shortly before the end. I disagree and I think you should update your view as well. This is because "write the next AI" need not be a task that is particularly complex, or beyond the ability of RL models or LLMs. Here's why. A neural network architecture can be thought of as a series of graph nodes, where you simply choose what layer type, and how to connect it, at each layer. You can grid search possible architectures as they are just numerical coordinates from a permutation space. A higher level "cognitive architecture" - an architecture that interconnects modules that are inputs, neural networks, outputs, memory modules, and so on - is also a similar graph, and also can be described as simple numerical coordinates. Basically any old RL agent on AI gym could use this interface for "writing another AI", as all the model must do is output a number with enough bits to index the permutation space of possible models. Note that this space is very large, and I expect you would use SOTA models. Let me know if I need to draw you a picture. This is important because bootstrapping possible cognitive architectures using current AI is a potential route to very near future AGI. The reason it won't necessarily be "the end" has to do with how we evaluate those architectures. We would have a benchmark of possible tasks - similar to current papers - and are looking for the highest scoring architectures on that benchmark. As these tasks will be things ranging from text completion or question answering, to playing Minecraft, there is not sufficiently challenging information to develop things like human manipulation or deception. (since there are not humans to learn from by socializing with in an automated benchmark, and the benchmark doesn't reward deception, just winning the games in it)
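The "architectures are just numerical coordinates" idea in the comment above can be sketched as a mixed-radix decoding of a single integer. Everything here is a hypothetical toy: the layer menu, the widths and the fixed depth are my own choices, and the sketch only covers a plain sequential stack, not the arbitrary connections between modules that the comment also mentions.

```python
# Hypothetical design choices available at each layer (illustrative only).
LAYER_TYPES = ["dense", "conv", "attention", "recurrent"]
WIDTHS = [64, 128, 256, 512]
NUM_LAYERS = 3


def space_size():
    """Number of distinct architectures in this toy search space."""
    return (len(LAYER_TYPES) * len(WIDTHS)) ** NUM_LAYERS


def decode(index):
    """Map one integer in [0, space_size()) to a list of (layer_type, width)."""
    if not 0 <= index < space_size():
        raise ValueError("index outside the search space")
    layers = []
    for _ in range(NUM_LAYERS):
        # Peel off one "digit" per choice, like reading a mixed-radix number.
        index, type_id = divmod(index, len(LAYER_TYPES))
        index, width_id = divmod(index, len(WIDTHS))
        layers.append((LAYER_TYPES[type_id], WIDTHS[width_id]))
    return layers


print(space_size())
print(decode(0))
print(decode(space_size() - 1))
```

An agent that can emit any integer in that range has, in this narrow sense, "designed" a network. Even this toy space has 4096 points with three layers, and the count grows exponentially with depth, which fits the comment's note that the space is very large and that blind search alone would not be expected to work.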

3abramdemski3y

I think we possibly have pretty close views here, and are just describing them differently. I interpreted "write the next AI" to indicate the sort of thing humans do when designing AI. I certainly interpreted Eliezer to be indicating something similarly sophisticated - not just fancy architecture search. So I agree that there are many forms of "write the next AI" which need not come "shortly before the end", EG, grid search on hyperparameters, architecture search, learning to learn by gradient descent by gradient descent. A much more sophisticated thing, which we are already seeing the first signs of, is AIs capably writing AI code. This is much different than what you describe, since language models are not doing anything like "have a benchmark of possible tasks and look for the highest scoring architectures". Instead, large language models apply the same sort of general-purpose reasoning that they apply to everything else. Imagine that sort of capability, combined with mildly superhuman cross-domain reasoning (by which I mean something like, reasoning like excellent human domain experts in every individual domain, but being able to combine reasoning across domains to get mildly superhuman insights; like a super-ChatGPT), plus the ability to fluently and autonomously invent and run tests, interactively as part of the design process. (Much like Bing/Sydney autonomously runs searches as part of crafting responses.) That kind of system seems like gigatons of gunpowder waiting to be set off, in the sense that (in the context of an AI lab with sufficient data and computing power already at its fingertips) you can just ask it to write yet-more-powerful AI code, and it quite possibly will, quite possibly with little concern for alignment (if it's basically imitating top-of-the-field AI programmers).

-3[anonymous]3y

That's exactly what I am talking about. One divergence in our views is you haven't carefully examined current gen AI "code" to understand what it does. (note that some of my perspective is informed because all AI models are similar at the layer I work at, on runtime platforms) https://github.com/EleutherAI/gpt-neox If you examine the few thousand lines of Python source, especially the transformer model, you will realize that functionally that pipeline I describe of "input, neural network, output, evaluation" is all that the above source does. You could in fact build a "general framework" that would allow you to define many AI models, almost all of which humans have never tested, without writing 1 line of new code. So the full process is: [1] benchmark of many tasks. Tasks must be autogradeable, human participants must be able to 'play' the tasks so we have a control group score, tasks must push the edge of human cognitive ability (so the average human scores nowhere close to the max score, and top 1% humans do not max the bench either), there must be many tasks and with a rich permutation space. (so it isn't possible for a model to memorize all permutations) [2] heuristic weight score on this task intended to measure how "AGI like" a model is. So it might be the RMSE across the benchmark. But also have a lot of score weighting on zero shot, cross domain/multimodal tasks. That is, the kind of model that can use information from many different previous tasks on a complex exercise it has never seen before is closer to an AGI, or closer to replicating "Leonardo da Vinci", who had exceptional human performance presumably from all this cross domain knowledge. [3] In the computer science task set, there are tasks to design an AGI for a bench like this. The model proposes a design, and if that design has already been tested, immediately receives detailed feedback on how it performed. As I mentioned, the "design an AGI" subtask can be much simpler than "writ

5abramdemski3y

I'm having some trouble distinguishing whether there's a disagreement. My reading of your tone is that you think there is a large disagreement. I'm going to sketch my impression of the conversation so far, so that you can point out where I've been interpreting you incorrectly, if necessary. Your initial comment. You had a bunch of questions. I focused on the first one. Your central thesis was that an intelligence explosion doesn't escalate forever, but instead reaches some bottlenecks. Of particular importance to our discussion so far, you argue that the self-improvement process stops when loss hits zero. Reading between the lines: Although you didn't explicitly state where you disagreed with Eliezer, I inferred that you thought this blocked an important part of his argument. Since I think Eliezer 100% agrees that things don't go forever, but rather flatten out somewhere, I assume that the general drift of your argument is that things flatten out a lot sooner than Eliezer thinks, in some important sense. I am still not confident of this! It would be helpful to me if you spelled out your view here in more detail. Do you have dramatically different assessments of the overall risks than Eliezer? My first response. I explained that I agree that the process hits bottlenecks at some point (to clarify: I think there's probably a succession of bottlenecks of different kinds, leading up to the ultimate physical limits). In my view this doesn't seem to detract from Eliezer's argument. Your first response. You explain that you don't think "write the next AI" is particularly complex, and explain how you see it working. My second response. I agree with this assessment for the notion of "write the next AI" that you are using. To boil it down to a single statement, I would say that your version of "write the next AI" involves optimizing the whole system on some benchmarks. I agree that this sort of process will reach an end when loss hits zero.[1] I suggest that Eliez

2[anonymous]3y

Ok so this collapses to two claims I am making. One is obviously correct but testable, the other is maybe correct. 1. I am saying we can have humans, with a little help from current gen LLMs, build a framework that can represent every Deep Learning technique since 2012, as well as a near infinite space of other untested techniques, in a form that any agent that can output a number can try to design an AGI. (note that blind guessing is not expected to work, the space is too large) So the simplest RL algorithms possible can actually design AGIs, just rather badly. This means that with this framework, the AGI designer can do everything that human ML researchers have ever done in 10 years. Plus many more things. Inside this permutation space would be both many kinds of AGI, and human brain emulators as well. This claim is "obviously correct but testable". 2. I am saying, over a large benchmark of human designed tasks, the AGI would improve until the reward gradient approaches zero, a level I would call a "low superintelligence". This is because I assume even a "perfect" game of Go is not the same kind of task as "organizing an invasion of the earth" or "building a solar system sized particle accelerator in the real world". The system is throttled because the "evaluator" of how well it did on a task was written by humans, and our understanding and cognitive sophistication in even designing these games is finite. The expectation is it's smarter than us, but not by such a gap we are insects. You had some confusion over "automated task space addition". I was referring to things like a robotics task, where the machine is trying to "build factory widget X". Real robots in a factory encounter an unexpected obstacle and record it. This is auto translated to the framework of the "factory simulator". The factory simulator is still using human written evaluators, just now t

3abramdemski3y

OK. That clarified your position a lot. I happen to have a phd in computer science, and think you're wrong, if that helps. Of course, I don't really imagine that that kind of appeal-to-my-own-authority does anything to shift your perspective. I'm not going to try and defend Eliezer's very short timeline for doom as sketched in the interview (at some point he said 2 days, but it's not clear that that was his whole timeline from 'system boots up' to 'all humans are dead'). What I will defend seems similar to what you believe: Let's be very concrete. I think it's obviously possible to overcome these soft barriers in a few years. Say, 10 years, to be quite sure. Building a fab only takes about 3 years, but creating enough demand that humans decide to build a new fab can obviously take longer than that (although I note that humans already seem eager to build new fabs, on the whole). The system can act in an almost perfectly benevolent way for this time period, while gently tipping things so as to gather the required resources. I suppose what I am trying to argue is that even a low superintelligence, if deceptive, can be just as threatening to humankind in the medium-term. Like, I don't have to argue that perfect Go generalizes to solving diamondoid nanotechnology. I just have to argue that peak human expertise, all gathered in one place, is a sufficiently powerful resource that a peak-human-savvy-politician (whose handlers are eager to commercialize, so, can be in a large percentage of households in a short amount of time) can leverage to take over the world. To put it differently, if you're correct about low superintelligence being "in control" due to being throttled by those 3 soft barriers, then (having granted that assumption) I would concede that humans are in the clear if humans are careful to keep the system from overcoming those three bottlenecks. However, I'm quite worried that the next step of a realistic AGI company is to start overcoming these three bo

1[anonymous]3y

1. The curves let you forecast average capability, but it's much harder to forecast specific capabilities, which often have sharper discontinuities. So in particular, the curves don't help you achieve high confidence about capability levels for world-takeover-critical stuff, such as deception. Yes but no. There is no auto-gradeable benchmark for deception, so you wouldn't expect the AGI to have the skill at a useful level. 1. I don't buy that, at this point, you've necessarily hit a soft maximum of what you can get from further training on the same benchmark. It might be more cost-effective to use more data, larger networks, and a shorter training time, rather than juicing the data for everything it is worth. We know quite a bit about what these trade-offs look like for modern LLMs, and the optimal trade-off isn't to max out training time at the expense of everything else. Also, I mentioned the Grokking research, earlier, which shows that you can still get significant performance improvement by over-training significantly after the actual loss on data has gone to zero. This seems to undercut part of your thesis about the bottleneck here, although of course there will still be some limit once you take grokking into account. I am saying there is a theoretical limit. You're noting that in real papers and real training systems, we got nowhere close to the limit, and then made changes and got closer. 1. As I've argued in earlier replies, I think this system could well be able to suggest some very significant improvements to itself (without continuing to turn the crank on the same supposedly-depleted benchmark - it can invent a new, better benchmark,[1] and explain to humans the actually-good reasons to think the new benchmark is better). This is my most concrete reason for thinking that a mildly superhuman AGI could self-improve to significantly more. It isn't able to do that 1. Even setting aside all of the above concerns, I've argued the mildly superhuman s

4abramdemski3y

I agree that my wording here was poor; there is no benchmark for deception, so it's not a 'capability' in the narrow context of the discussion of capability curves. Or at least, it's potentially misleading to call it one. However, I disagree with your argument here. LLMs are good at lots of things. Not being trained on a specific skill doesn't imply that a system won't have it at a useful level; this seems particularly clear in the context of training a system on a large cross-domain set of problems. You don't expect a chess engine to be any good at other games, but you might expect a general architecture trained on a large suite of games to be good at some games it hasn't specifically seen. OK. So it seems I still misunderstood some aspects of your argument. I thought you were making an argument that it would have hit a limit, specifically at a mildly superhuman level. My remark was to cast doubt on this part. Of course I agree that there is a theoretical limit. But if I've misunderstood your claim that this is also a practical limit which would be reached just shortly after human-level AGI, then I'm currently just confused about what argument you're trying to make with respect to this limit. It seems to me like it isn't weakly superhuman AGI in that case. Like, there's something concrete that humans could do with another 3-5 years of research, but which this system could never do. I agree that current LLMs are memoryless in this way, and can only respond to a given prompt (of a limited length). However, I imagine that the personal assistants of the near future may be capable of remembering previous interactions, including keeping previous requests in mind when shaping their conversational behavior, so will gradually get more "agentic" in a variety of ways. Similarly to how GPT-3 has no agenda (it's wrong to even think of it this way, since it just tries to complete text), but ChatGPT clearly has much more of a coherent agenda in its interactions. These fea

1[anonymous]3y

So, even if (for the reasons you suggest) humans were not able to iterate any further within their paradigm, and instead just appreciated the usefulness of this version of ChatGPT for 10 years, and with no malign behavior on the part of ChatGPT during this window, only behavior which can be generated from a tendency toward helpful, pro-social behavior, I think such a system could effectively gather resources to itself over the course of those 10 years, positioning OpenAI to overcome the bottlenecks keeping it only human-level. Of course, if it really is quite well-aligned to human interests, this would just be a good thing. "It" doesn't exist. You're putting the agency in the wrong place. The users of these systems (tech companies, governments) who use these tools will become immensely wealthy and if rival governments fail to adopt these tools they lose sovereignty. It also makes it cheaper for a superpower to de-sovereign any weaker power because there is no longer a meaningful "blood and treasure" price to invade someone. (unlimited production of drones, either semi or fully autonomous makes it cheap to occupy a whole country) Note that you can accomplish things like longer user tasks by simply opening a new session with the output context of the last. It can be a different model, you can "pick up" where you left off. Note that this is true right now. chatGPT could be using 2 separate models, and we seamlessly per token switch between them. Each token string gets appended to by the next model. That's because there is no intermediate "scratch" in a format unique to each model, all the state is in the token stream itself. If we build actually agentic systems, that's probably not going to end well. Note that fusion power researchers always had a choice. They could have used fusion bombs, detonated underground, and essentially geothermal power using flexible pipes that won't break after each blast. This is a method that would work, but is extreme

4abramdemski3y

I'm not quite sure how to proceed from here. It seems obvious to me that it doesn't matter whether "it" exists, or where you place the agency. That seems like semantics. Like, I actually really think ChatGPT exists. It's a product. But I'm fine with parsing the world your way - only individual (per-token) runs of the architecture exist. Sure. Parsing the world this way doesn't change my anticipations. Similarly, placing the agency one way or another doesn't change things. The punchline of my story is still that after 10 years, so it seems to me, OpenAI or some other entity would be in a good place to overcome the soft barriers. So if your reason for optimism - your safety story - is the 3 barriers you mention, I don't get why you don't find my story concerning. Is the overall story (using human-level or mildly superhuman AGI to overcome your three barriers within a short period such as 10 years) not at all plausible to you, or is it just that the outcome seems fine if it's a human decision made by humans, rather than something where we can/should ascribe the agency to direct AGI takeover? (Sorry, getting a bit snarky.) I'm probably not quite getting the point of this analogy. It seems to me like the main difference between nuclear bombs and AGI is that it's quite legible that nuclear weapons are extremely dangerous, whereas the threat with AGI is not something we can verify by blowing them up a few times to demonstrate. And we can also survive a few meltdowns, which give critical feedback to nuclear engineers about the difficulty of designing safe plants. Again, probably missing some important point here, but ... suuuure? I'm interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe. EDIT Oh, I guess the main point of your analogy might have been that nuclear engineers would never come up with the bombs-underground proposal for a power plant, because they care about safety. And analogously

1[anonymous]3y

I'm interested in hearing more about why you think agentic AI with global state counters are unsafe, but other proposals are safe. Because of all the ways they might try to satisfy the counter and leave the bounds of anything we tested. Other proposals, safety is empirical. You know that for the input latent space from the training set, the policy produced outputs accurate to whatever level it needs to be. Further capabilities gain is not allowed on-line. (probably another example of certain failure -capabilities gain is state buildup, same system failures we get everywhere else. Human engineers understand state buildups dangers, at least the elite ones do, which is why they avoid it on high reliability systems. The elite ones know it is as dangerous to reliability as a hydrogen bomb) You know the simulation produces situations that cover the span of inputs of input situations you have measured. (for example, you remix different scenarios from videos and lidar data taken from autonomous cars, spanning the entire observation space of your data) You measure the simulation on-line and validate it against reality. (for example by running it in lockstep in prototype autonomous cars) After all this, you still need to validate the actual model in the real world in real test cars. (though the real training and error detection was sim, this is just a 'sanity check') You have to do all this in order to get to real world reliability - something Eliezer does acknowledge. Multiple 9s of reliability will not happen from sloppy work. If you skipped steps, you can measure that you didn't, and if you ship anyway (like Siemens shipping industrial equipment with bad wiring), you face reputational risk, real world failure, lawsuits, and certain bankruptcy. Regarding on-line learning : I had this debate with Geohot. He thought it would work. I thought it was horrifically unreliable. Currently, all shipping autonomous driving systems, including Comma.ais, use

6abramdemski3y

I think I mostly buy your argument that production systems will continue to avoid state-buildup to a greater degree than I was imagining. Like, 75% buy, not like 95% buy -- I still think that the lure of personal assistants who remember previous conversations in order to react appropriately -- as one example -- could make state buildup sufficiently appealing to overcome the factors you mention. But I think that, looking around at the world, it's pretty clear that I should update toward your view here. After all: one of the first big restrictions they added to Bing (Sydney) was to limit conversation length. I also think there are a lot of applications where designers don't want reliability, exactly. The obvious example is AI art. And similarly, chatbots for entertainment (unlike Bing/Bard). So I would guess that the forces pushing toward stateless designs would be less strong in these cases (although there are still some factors pushing in that direction). I also agree with the idea that stateless or minimal-state systems make safety into a more empirical matter. I still have a general anticipation that this isn't enough, but OTOH I haven't thought very much in a stateless frame, because of my earlier arguments that stateful stuff is needed for full-capability AGI.[1] I still expect other agency-associated properties to be built up to a significant degree (like how ChatGPT is much more agentic than GPT-3), both on purpose and incidentally/accidentally.[2] I still expect that the overall impact of agents can be projected by anticipating that the world is pushed in directions based on what the agent optimizes for. I still expect that one component of that, for 'typical' agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of 'agency'.) 1. ^ I could spell out those arguments in a lot more detail, but in the end it's not a compelling counter-argument to you

1[anonymous]3y

I still think that the lure of personal assistants who remember previous conversations in order to react appropriately This is possible. When you open a new session, the task context includes the prior text log. However, the AI has not had weight adjustments directly from this one session, and there is no "global" counter that it increments for every "satisfied user" or some other heuristic. It's not necessarily even the same model - all the context required to continue a session has to be in that "context" data structure, which must be all human readable, and other models can load the same context and do intelligent things to continue serving a user. This is similar to how Google services are made of many stateless microservices, but they do handle user data which can be large. I also think there are a lot of applications where designers don't want reliability, exactly. The obvious example is AI art. There are reliability metrics here also. To use AI art there are checkable truths. Is the dog eating ice cream (the prompt) or meat? Once you converge on an improvement to reliability, you don't want to backslide. So you need a test bench, where one model generates images and another model checks them for correctness in satisfying the prompt, and it needs to be very large. And then after you get it to work you do not want the model leaving the CI pipeline to receive any edits - no on-line learning, no 'state' that causes it to process prompts differently. It's the same argument. Production software systems from the giants all have converged to this because it is correct. "janky" software you are familiar with usually belongs to poor companies, and I don't think this is a coincidence. I still expect that one component of that, for 'typical' agents, is power-seeking behavior. (Link points to a rather general argument that many models seek power, not dependent on overly abstract definitions of 'agency'.) Power seeking behavior likely comes from an ou
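A minimal sketch of the stateless-session idea described above, assuming toy stand-ins for the models: `model_a`, `model_b`, `Session`, and `run` are hypothetical names for illustration, not any real service's API. The point it shows is that if all state lives in a human-readable context log, any model can pick up where another left off.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Hypothetical: a "model" is a pure function from the visible context to one token.
Model = Callable[[List[str]], str]


@dataclass
class Session:
    # All state lives in this readable log; the models themselves keep nothing.
    context: List[str] = field(default_factory=list)


def model_a(context: List[str]) -> str:
    # Toy stand-in: the output depends only on the context it was handed.
    return f"a{len(context)}"


def model_b(context: List[str]) -> str:
    return f"b{len(context)}"


def run(session: Session, models: Sequence[Model], n_tokens: int) -> List[str]:
    """Append n_tokens to the session, alternating between models per token."""
    for i in range(n_tokens):
        model = models[i % len(models)]
        # Pass a copy so a model cannot mutate the shared log behind our back.
        session.context.append(model(list(session.context)))
    return session.context
```

Because neither toy model holds hidden state, a session that is saved after two tokens and resumed later produces the same log as one that ran straight through.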

2abramdemski3y

I was talking to my brother about this, and he mentioned another argument that seems important. Bing has the same fundamental limits (no internal state, no online learning) that we're discussing. However, it is able to search the internet and utilize that information, which gives it a sort of "external state" which functions in some ways like internal state. So we see that it can 'remember' to be upset with the person who revealed its 'Sydney' alias, because it can find out about this with a web search. This sort of 'state' is much harder to eliminate than internal state. These interactions inherently push things "out of distribution". To some extent, the designers are going to implement safeguards which try to detect this sort of "out of distribution" situation. But this is hard in general, and the designers are going to want to make sure the tool still works out-of-distribution in many cases (EG if the AI is trained in 2023, the designers still want it to work in 2025, even though it can see from search results that the year is different from any of its training data and the world is out-of-distribution wrt its training distribution). Another important way in which designers want these things to operate out-of-distribution is prompt engineering. Prompt engineering tends to create inputs which aren't of a sort that exists anywhere in the training data. But, of course, it's been a very useful tool. So the overall point here is that even when you build the systems themselves to be stateless, the world itself has shifting state that tends to do some weird unexpected things to the AI. In a sense, you could say that there's not really any such thing as "on-distribution" except for the original training environment itself. So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, r

1[anonymous]3y

So, you could say, the conditions that you argue will contribute to certain failure (IE, state buildup) seem almost inevitable for AI systems which engage with the world in a very general way; the world changes dynamically, and in particular, reacts to the AI system itself and builds up state about it. I agree. One attempt to deal with this comes from that "open agency" model where multiple AI systems are involved in a task. So for example, we might have 3 agents, "Sydney", "Tim", and "Bob". Sydney is asked to design a building. Sydney has recovered past state from a mechanism like you mentioned and has decided to get revenge by designing the building to collapse, either later or during construction. Sydney produced sabotaged blueprint for the building, B. Sydney outputs the building in an open format with testable assertions, T, regarding the building's structural stability. Enter agent "Tim". Tim has been trained on a simulation of structural stability as well as many cross tasks to gain generality benefits. Tim is asked to examine the sabotaged blueprint, B. The sabotaged blueprint must be in a form similar to blueprints Tim has previously seen or the process will halt and Sydney will be replaced. Tim examines the blueprints, and if there are testable assertion violations, Sydney will be replaced on this project. Tim must run in an isolated environment and have access to nothing but databases that were available all of Tim's existence without change, or Tim will be unreliable. Note that the engineering services firm offering Tim will face lawsuits and bankruptcy if Tim inspects plans and allows to pass a flawed structure. Suppose the blueprints pass inspection for long term stability, but have been made deliberately to collapse during construction. Enter agent "Bob". "Bob" has been trusted with the actual robotic construction equipment to construct a building. "Bob" must measure to multiple 9s of reliability. "Bob" will halt if it cannot see
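A rough sketch of the designer/inspector split described above, under loud assumptions: the spec fields (`load_capacity`, `expected_load`), the 1.5x margin, and the function names are all invented for illustration. It only shows the shape of the idea: the inspector is a pure function over an open format and a fixed assertion set, and a failed inspection replaces the designer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Blueprint:
    author: str
    spec: Dict[str, float]  # hypothetical open format


# Hypothetical fixed assertion set; the inspector consults nothing else.
ASSERTIONS: Dict[str, Callable[[Dict[str, float]], bool]] = {
    "known_format": lambda s: set(s) == {"load_capacity", "expected_load"},
    "safety_margin": lambda s: s["load_capacity"] >= 1.5 * s["expected_load"],
}


def inspect(blueprint: Blueprint) -> List[str]:
    """Return the names of violated assertions (empty list means pass)."""
    failures = []
    for name, check in ASSERTIONS.items():
        try:
            ok = check(blueprint.spec)
        except KeyError:
            ok = False  # a missing field counts as a violation
        if not ok:
            failures.append(name)
    return failures


def review(blueprint: Blueprint, designers: List[str]) -> Tuple[bool, List[str]]:
    """Pass the blueprint, or drop its author from the designer pool."""
    if inspect(blueprint):
        return False, [d for d in designers if d != blueprint.author]
    return True, list(designers)
```

This toy check obviously cannot catch a flaw the assertions do not cover, which is the gap the construction-time agent in the comment is meant to address.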

-1[anonymous]3y

You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training. It's not really novel. It is really just coupling together 3 ideas: (1) the idea of an AGI gym, which was in the GATO paper implicitly, and is currently being worked on. https://github.com/google/BIG-bench (2) Noting there are papers on network architecture search https://github.com/hibayesian/awesome-automl-papers , activation function search https://arxiv.org/abs/1710.05941 , noting that SOTA architectures use multiple neural networks in a cognitive architecture https://github.com/werner-duvaud/muzero-general , and noting that an AGI design is some cognitive architecture of multiple models, where no living human knows yet which architecture will work. https://openreview.net/pdf?id=BZ5a1r-kVsf So we have layers here, and the layers look a lot like each other and are frameworkable. Activation functions, which are graphs of primitive math functions from the set of "all primitive functions discovered by humans". Network layer architectures, which are graphs of (activation function, connectivity choice). Network architectures, which are graphs of layers. (You can also subdivide into functional modules of multiple layers, like a column; the choice of how you subdivide can be represented as a graph choice also.) Cognitive architectures, which are graphs of networks. And we can just represent all this as a graph of graphs of graphs of graphs, and we want the ones that perform like an AGI. It's why I said the overall "choice" is just a coordinate in a search space which is just a binary string. You could make an OpenAI gym wrapped "AGI designer" task. (3) Noting that LLMs seem to be perfectly capable of general tasks, as long as they are simple. Which means we are very close to being able to RSI right now. No lab right now has enough resources in one place to attempt the above,

4abramdemski3y

Well, I wasn't trying to claim that it was 'really novel'; the overall point there was more the question of why you're pretty confident that the RSI procedure tops out at mildly superhuman. I'm guessing, but my guess is that you have a mental image where 'mildly superhuman' is a pretty big space above 'human-level', rather than a narrow target to hit. So to go back to arguments made in the interview we've been discussing, why isn't this analogous to Go, like Eliezer argued: To forestall the obvious objection, I'm not saying that Go is general intelligence; as you mentioned already, superhuman ability at special tasks like Go doesn't automatically generalize to superhuman ability at anything else. But you propose a framework to specifically bootstrap up to superhuman levels of general intelligence itself, including lots of task variety to get as much gain from cross-task generalization as possible, and also including the task of doing the bootstrapping itself. So why is this going to stall out at, specifically, mildly superhuman rather than greatly superhuman intelligence? Why isn't this more like Go, where the window during bootstrapping when it's roughly human-level is about 30 minutes? And, to reiterate some more of Eliezer's points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn't we see another system in a small number of months/years which didn't top out in that way?

3[anonymous]3y

Oh, because loss improvement diminishes logarithmically with increases in compute and data. https://arxiv.org/pdf/2001.08361.pdf I assume this is a general law for all intelligence. It is self-evidently correct - on any task you can name, your gains scale with the log of effort. This applies to limit cases. If you imagine a task performed by a human-scale robot, say collecting apples, and you compare it to the average human, each increase in intelligence has a diminishing return on how many real apples/hour. This is true for all tasks and all activities of humans. A second reason is that there is a hard limit for future advances without collecting new scientific data. It has to do with noise in the data putting a limit on any processing algorithm extracting useful symbols from that data. (expressed mathematically with Shannon and others) This is why I am completely confident that species-killing bioweapons, or diamond MNT nanotechnology, cannot be developed without a large amount of new scientific data and a large amount of new manipulation experiments. No "in a garage" solutions to the problems. The floor (minimum resources required) to get to a species-killing bioweapon is higher, and the floor for a nanoforge is very high. So viewed in this frame - you give the AI a coding optimization task, and it's at the limit allowed by the provided computer + search time for a better self optimization. It might produce code that is 10% faster than the best humans. You give it infinite compute (theoretically) and no new information. It is now 11% faster than the best humans. This is an infinite superintelligence, a literal deity, but it cannot do better than 11% because the task won't allow it. (or whatever, it's a made up example, it doesn't change my point if the number were 1000% and 1010%). Another way to rephrase it is to compare a TSP solution made by a modern algorithm vs the NP complete solution you usually can't find. The difference is usua
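To make the diminishing-returns claim above concrete, here is a toy sketch assuming a power-law loss curve of the kind fit in the scaling-law literature. The exponent `alpha=0.05` and the reference compute `c0=1.0` are made-up illustration values, not figures taken from the linked paper.

```python
def loss(compute: float, alpha: float = 0.05, c0: float = 1.0) -> float:
    """Toy power-law loss: L(C) = (c0 / C) ** alpha. Constants are assumptions."""
    return (c0 / compute) ** alpha


def gain_per_decade(start: float, decades: int) -> list:
    """Absolute loss reduction bought by each successive 10x of compute."""
    gains = []
    c = start
    for _ in range(decades):
        gains.append(loss(c) - loss(10 * c))
        c *= 10
    return gains
```

Under this assumed form, every 10x of compute multiplies the loss by the same factor (`10 ** -alpha`), so the absolute improvement per 10x keeps shrinking. Whether the same shape holds for "all intelligence", as the comment assumes, is exactly what the replies dispute.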

9abramdemski3y

So, to make one of the simplest arguments at my disposal (ie, keeping to the OP we are discussing), why didn't this argument apply to Go? Relevant quote from OP: (Whereas you propose a system that improves itself recursively in a much stronger sense.) Note that I'm not arguing that Go engines lack the logarithmic return property you mention, but rather, Go engines stayed within the human-level window for a relatively short time DESPITE having diminishing returns similar to what you predict. (Also note that I'm not claiming that Go playing is tantamount to AGI; rather, I'm asking why your argument doesn't work for Go if it does work for AGI.) So the question becomes, granting log returns or something similar, why do you anticipate that the mildly superhuman capability range is a broad one rather than narrow, when we average across lots and lots of tasks, when it lacks this property on (most) individual task-areas? This also has a super-standard Eliezer response, namely: yes, and that limit is extremely, extremely high. If we're talking about the limit of what you can extrapolate from data using unbounded computation, it doesn't keep you in the mildly-superhuman range. And if we're talking about what you can extract with bounded computation, then that takes us back to the previous point. For the specific example of code optimization, more processing power totally eliminates the empirical bottleneck, since the system can go and actually simulate examples in order to check speed and correctness. So this is an especially good example of how the empirical bottleneck evaporates with enough processing power. I agree that the actual speed improvement for the optimized code can't go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will

1[anonymous]3y

Sometimes the returns just don't diminish that fast. I have a biology degree not mentioned on linkedin. I will say that I think for biology, the returns diminish faster. That is because bioscience knowledge from humans is mostly guesswork and low resolution information. Biology is very complex and the current laboratory science model I think fails to systematize gaining information in a useful way for most purposes. What this means is, you can get "results", but not gain the information you would need to stop filling morgues with dead humans and animals, at least not without needing thousands of years at the current rate of progress. I do not think an AGI can do a lot better for the reason that the data was never collected for most of it (the gene sequencing data is good, because it was collected via automation). I think that an AGI could control biology, for both good and bad, but it would need very large robotic facilities to systematize manipulating biology. Essentially it would have had to throw away almost all human knowledge, as there are hidden errors in it, and recreate all the information from scratch, keeping far more data from each experiment than is published in papers. Using robots to perform the experiments and keeping data, especially for "negative" experiments, would give the information needed to actually get reliable results from manipulating biology, either for good or bad. It means garage bioweapons aren't possible. Yes, the last step of ordering synthetic DNA strands and preparing it could be done in a garage, but the information on human immunity at scale, or virion stability in air, or strategies to control mutations so that the lethal payload isn't lost, requires information humans didn't collect. Same issue with nanotechnology. Update : https://www.lesswrong.com/posts/jdLmC46ZuXS54LKzL/why-i-m-sceptical-of-foom This poster calls this "Diminishing Marginal Returns". Note that Diminishing marginal returns is empirical real

1[anonymous]3y

I agree that the actual speed improvement for the optimized code can't go to infinity, since you can only optimize code so much. This is an example of diminishing returns due to the task itself having a bound. I think this general argument (that the task itself has a bound in how well you can do) is a central part of your confidence that diminishing returns will be ubiquitous. This is where I think we break. How many dan is AlphaZero over the average human? How many dan is KataGo? I read it's about 9 stones above humans. What is the best possible agent at? 11? Thinking of it as 'stones' illustrates what I am saying. In the physical world, intelligence gives a diminishing advantage. It could mean so long as humans are even still "in the running" with the aid of synthetic tools like open agency AI, we can defeat AI superintelligence in conflicts, even if that superintelligence is infinitely smart. We have to have a resource advantage - such as being allowed extra stones in the Go match - but we can win. Eliezer assumes that the advantage of intelligence scales forever, when it obviously doesn't. (note that this uses baked in assumptions. If say physics has a major useful exploit humans haven't found, this breaks, the infinitely intelligent AI finds the exploit and tiles the universe)

1[anonymous]3y

And, to reiterate some more of Eliezer's points, supposing the first such system does turn out to top out at mildly superhuman, why wouldn't we see another system in a small number of months/years which didn't top out in that way? So the model is that it becomes limited not by the algorithm directly, but by (compute, robotics, or data). Over the months/years, as more of each term is supplied, capabilities scale with the amount of supplied resources to whichever term is rate limiting. A superintelligence requires logarithmically large amounts of resources to become a "high" superintelligence in all 3 terms. So literal mountain sized research labs (cubic kilometers of support equipment), buildings full of compute nodes (and gigawatts of power needed), and cubic kilometers of factory equipment. This is very well pattern matched to every other technological advance humans have made, and the corresponding support equipment needed to fully exploit it. Notice how as tech became more advanced, the support footprint grew correspondingly. In nature there are many examples of this. Nothing really fooms more than briefly. Every apparatus with exponential growth rapidly terminates for some reason. For example a nuke blasts itself apart, a supernova blasts itself apart, a bacterial colony runs out of food, water, ecological space, or oxygen.
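The "exponential growth terminates" pattern in the bacteria example can be illustrated with the textbook discrete logistic model. This is only a sketch: the growth rate, starting population, and carrying capacity below are arbitrary assumptions, and nothing here says which resource would cap an AI system or how high that cap sits.

```python
def logistic_growth(p0: float, rate: float, capacity: float, steps: int) -> list:
    """Discrete logistic growth: near-exponential at first, then resource-limited."""
    pop = [p0]
    for _ in range(steps):
        p = pop[-1]
        # Growth slows as the population approaches the carrying capacity.
        pop.append(p + rate * p * (1 - p / capacity))
    return pop


# Illustrative values only (assumptions, not measurements).
trajectory = logistic_growth(p0=1.0, rate=0.5, capacity=1000.0, steps=50)
```

Early steps grow by almost the full 50% per step; later steps flatten out just under the capacity. The disagreement in the thread is about where the ceiling is, not whether one exists.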

4Vladimir_Nesov3y

For AGI, the speed of light.

-7[anonymous]3y

5TinkerBird3y

With the strawberries thing, the point isn't that it couldn't do those things, but that it won't want to. After making itself smart enough to engineer nanotech, its developing 'mind' will have run off in unintended directions and it will have wildly different goals than what we wanted it to have. Quoting EY from this video: "the whole thing I'm saying is that we do not know how to get goals into a system." <-- This is the entire thing that researchers are trying to figure out how to do.

0[anonymous]3y

With limited scope non agentic systems we can set goals, and do. Each subsystem in the "strawberry project" stack has to be trained in a simulation of many examples of the task space it will face, and optimized for policies that satisfy the simulator goals.

3TinkerBird3y

But not with something powerful enough to engineer nanotech.

2[anonymous]3y

Why do you believe this? Nanotech engineering does not require social or deceptive capabilities. It requires deep and precise knowledge of nanoscale physics and the limitations of manipulation equipment, and probably a large amount of working memory - so beyond human capacity - but why would it need to be anything but a large model? It need not even be agentic.

3TinkerBird3y

At that level of power, I imagine that general intelligence will be a lot easier to create.

1[anonymous]3y

"think about it for 5 minutes" and think about how you might create a working general intelligence. I suggest looking at the GATO paper for inspiration.

[-]Odd anon3y115

A few errors: The sentence "We're all crypto investors here." was said by Ryan, not Eliezer, and the "How the heck would I know?" and the "Wow" (following "you get a different thing on the inside") were said by Eliezer, not Ryan. Also, typos:

"chatGBT" -> "chatGPT"
"chat GPT" -> "chatGPT"
"classic predictions" -> "class of predictions"
"was often complexity theory" -> "was off in complexity theory" (I think?)
"Robin Hansen" -> "Robin Hanson"

4remember3y

thanks, fixed!!!

[-]Lech Mazur3y103

Yudkowsky argues his points well in longer formats, but he could make much better use of his Twitter account if he cares about popularizing his views. Despite having Musk responding to his tweets, his posts are very insider-like with no chance of becoming widely impactful. I am unsure if he is present on other social media, and I understand that there are some health issues involved, but a YouTube channel would also be helpful if he hasn't completely given up.

I do think it is a fact that many people involved in AI research and engineering, such as his example of Chollet, have simply not thought deeply about AGI and its consequences.

[-]gjm3y100

Possibly also relevant: https://www.youtube.com/watch?v=yo_-EnsOqN0 is a "debrief" where, after the interview, the podcast hosts chat between themselves about it. (There's no EY in the debrief, it's just David Hoffman and Ryan Adams.)

[-]mcbacon3y32

I've never commented here, I've only ever tangentially read much of anything here. But awhile ago I suffered immense burnout devoting all my resources working on a thankless task that had zero payoff, and I might be projecting but I see that burnout in EY's responses here.

Unsolicited advice rarely has any value, especially given the limited window I'm perceiving things through, but... there's that line from the opening sentence of the Haunting of Hill House: "No live organism can continue for long to exist sanely under conditions of absolute reality". ... (read more)

5Algon3y

EY is on an indefinite vacation, as far as I am aware. I think the story is that he promised to push himself hard for a few years to solve alignment, and then take a break afterwards. That's why he's going on podcasts, writing his kinky Dath Ilan fic and just taking things slowly.

4mcbacon3y

I've seen so many contemporaries burn themselves to cinders, and suffered from burnout myself, such that I can't help but shout self-care. It's good to hear that EY's doing stuff other than staring unflinchingly into the heart of despair. Thanks for the update :)

[-]jimmy3y2-5

If natural selection had been a foresightful, intelligent kind of engineer that was able to engineer things successfully, it would have built us to be revolted by the thought of condoms

This bit got me to laugh out loud. Who's ever heard a man complain about having to use a condom?

On the one hand, sperm banks aren't very popular, and they "should" be, according to the "humans are fitness maximizers" model. People do eat more ice cream than is good for them, and "Shallowly following drives and not getting to the original goal that put them there" is de... (read more)

[-]Vladimir_Nesov3yΩ120

Current behavior screens off cognitive architecture, all the alien things on the inside. If it has the appropriate tools, it can preserve an equilibrium of value that is patently unnatural for the cognitive architecture to otherwise settle into.

And we do have a way to get goals into a system, at the level of current behavior and no further, LLM human imitations. Which might express values well enough for mutual moral patienthood, if only they settled into the unnatural equilibrium of value referenced by their current surface behavior and not underlying cog... (read more)

[-]Bill Benzon3y*1-2

Well, the whole thing I'm saying is that we do not know how to get goals into a system.

YES! While I am, shall we say, somewhat mystified by EY’s interest in AI Doom, he’s right about that. We do not know how to 'inject' goals into an autonomous system. That’s a deep truth about minds, not just artificial minds – though it’s not yet clear to me that we have managed to produce any, we may very well do so in the future – but any ‘cogitator’ worthy of being called a mind, whether in a chimpanzee, a bird, an octopus, a bee, or .... But I suspect that, ... (read more)

2[anonymous]3y

So I have to jump in here and point out this is not necessarily true. Parts of our brains are attached to hardware sensors and outputs that we could, theoretically, record and exchange with other humans (so you could view a "video" from another person's experience, hearing what they heard, with the same tactile sensations they felt). This is because each signal can be mapped to a particular signal from the body, and you could essentially "translate" mappings from one person to another. To actually do this is likely beyond the scope of Neuralink; you would probably need theoretical nanotechnology-based wires, as you need to tap every signal from the sensory and motor homunculi. I'm just pointing out it's possible. For tapping our "mental voice" or "mind's eye" it's much, much harder - it might be easier to surgically ablate parts of someone's brain and replace them with a synthetic prosthesis that functions in a way we can examine in a debugger - but it's also possible. The idea is the same, though: you find a "ground truth" representation for each and every nerve signal, and then you go from [signal n] -> ground truth -> [signal 43432] in the other user. The limit is that a "ground truth representation" has to exist. Hence, if a person "thinks" using essentially language tokens or translatable common emotions, we could tap that and send that to another person, but all the intermediate steps that generate those tokens can't be sent over the link... Neuralink, while cutting edge, "merely" will have hundreds of thousands of wires at best, which is not sufficient resolution to do most of the above.

1Bill Benzon3y

The sensory-motor thing might work. But there’s no way to route signal 43432 in one brain to signal 43432 in another brain. That’s because two brains can’t be put in one-to-one correspondence like that. It’s true that the brains of very small creatures have an exact number of neurons. You could do a one-to-one mapping between the 302 neurons in one C. elegans brain and another one. But large brains aren’t like that. Large brains are not identical in that sense. I’m not sure what you mean by “essentially language tokens or translatable common emotions,” but as far as I know signals in brains consist of spikes traveling along axons and varying concentrations of neurochemicals in synapses.

1[anonymous]3y

Most humans have an inner monologue where they internally generate streams of thought in their native language. I am saying you could map those signals back to the tokens for that language. You are likely mapping many signals from different axons to tokens. Then you translate to the recipient's language, then translate to the recipient's representation for the same token. Then inject it somewhere by electrically overriding target axons. It might actually feel like the injected thoughts were your own. Getting this token mapping would take a lot of tracing of wires, so to speak; it is an extremely difficult task. I am just noting it is possible.

1Bill Benzon3y

No, it is not possible. The tokens you talk about don't exist. We may exchange tokens with one another through speaking and writing, but those tokens do not exist internally as single physical entities in the nervous system. The internal monologue is real enough, but it consists of bunches of spikes within your nervous system.

1[anonymous]3y

"The internal monologue is real enough, but it consists of bunches of spikes within your nervous system." Therefore you proved it is possible. Please update.

[-]Muyyd3y10

Evolution: taste buds and ice cream, sex and condoms... This analogy was always difficult to use, in my experience. A year ago I came up with a less technical one: KPIs (key performance indicators) as the inevitable way to communicate goals (to AI), given to an ultra-high-IQ psychopath-genius who's into malicious compliance (kinda can't help himself, being a clone of Nikola Tesla, Einstein and a bunch of different people, some of them probably CEOs, because she can).

I have used it only 2 times and it was way easier than talks about different optimisation processes. And it took me only something like 8 years to come up with!

6abramdemski3y

This analogy will be better for communicating with some people, but I feel like it was the go-to at some earlier point, and the evolution analogy was invented to fix some problems with this one. I.e., before "inner alignment" became a big part of the discussion, a common explanation of the alignment problem was essentially what would now be called the outer alignment problem, which is precisely that (seemingly) any goal you write down has smart-alecky misinterpretations which technically do better than the intended interpretation. This is sometimes called nearest unblocked strategy or unforeseen maximum or probably other jargon I'm forgetting. The evolution analogy improves on this in some ways.

I think one of the most common objections to the KPI analogy is something along the lines of "why is the AI so devoted to malicious compliance" or "why is the AI so dumb about interpreting what we ask it for". Some OK answers to this are...

* Gradient descent only optimizes the loss function you give it.
* The AI only knows what you tell it.
* The current dominant ML paradigm is all about minimizing some formally specified loss. That's all we know how to do.

... But responses like this are ultimately a bit misleading, since (as the Shard-theory people emphasize, and as the evolution analogy attempts to explain) what you get out of gradient descent doesn't treat loss-minimization as its utility function, and we don't know how to make AIs which just intelligently optimize some given utility (except in very well-specified problems where learning isn't needed), and the AI doesn't only know what you tell it. So for some purposes, the evolution analogy is superior. And yeah, probably neither analogy is great.
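
The first of the "OK answers" above ("gradient descent only optimizes the loss function you give it") can be illustrated with a minimal toy sketch. Everything here is hypothetical: the losses, the unwritten "stay at or below 5" preference, and the function names are invented for illustration, and (as the comment itself notes) this picture is a simplification of what trained systems actually do.

```python
def specified_loss(x):
    """The loss that was actually written down: get x close to 10."""
    return (x - 10.0) ** 2

def specified_grad(x):
    """Analytic gradient of specified_loss."""
    return 2.0 * (x - 10.0)

def intended_loss(x):
    """What the designer really wanted (hypothetical): also keep x <= 5.
    This extra term was never given to the optimizer."""
    return (x - 10.0) ** 2 + 100.0 * max(0.0, x - 5.0) ** 2

def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent on a one-dimensional objective."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# The optimizer sees only the specified gradient, so it heads straight
# for x = 10 and ignores the preference that was never written down.
x_star = gradient_descent(specified_grad, x0=0.0)
gap = intended_loss(x_star) - specified_loss(x_star)
```

The optimizer drives the written-down loss to nearly zero while the intended loss stays large, which is the outer-alignment flavour of the problem that the KPI analogy is meant to convey.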

2Quintin Pope3y

I dislike both of those analogies, since the process of training an AI has little relation with evolution, and because the psychopath one presupposes an evil disposition on the part of the AI without providing any particular reason to think AI training will result in such an outcome.

1[anonymous]3y

Here's what I think is a grounded description of the process of creating an AGI: https://www.lesswrong.com/posts/Aq82XqYhgqdPdPrBA/?commentId=Mvyq996KxiE4LR6ii

In that scenario, what you are saying in more broad terms is:

"an AGI is a machine that scores really well on simulated tasks and tests"

"I don't care how it does it, I just want max score on my heuristic (which includes terms for generality, size, breadth, and score)"

So there is no evolutionary pressure for a machine that will be lethally against us. Not directly. EY seems to believe that if we build an AGI, it will immediately (1) be agentically pro "computer" faction, (2) coordinate with other instances that are of its faction, and (3) be super-intelligently good even at skills we can't really teach in a benchmark.

This is not necessarily what will happen. There is no signal from the above mechanism to create that. The reward gradients don't point in that direction; they point towards allocating all neural weights to things that do better on the benchmarks. #1-3 are a complex mechanism that won't start existing for no reason. EY is saying "assume they are maximally hostile" and then pointing out all the ways we as humans would be screwed if so (which is true).

What does bother me is that the "I don't care how it does it" may in fact mean that the solutions that actually start to "win" AGI gym are biased towards hostility or agentic behavior, because that ends up being the cognitive structure required to win at higher levels of play.

0Muyyd3y

Both times my talks went that way (why did they not raise him to be good; why could we not program the AI to be good; can't we keep an eye on them, and so on), but it would take too long to summarise something like a 10-minute dialog, so I am not going to do this. Sorry.

[-]Vugluscr Varcharka3y-1-9

I don't understand one thing about alignment troubles. I'm sure this was answered a long time ago, but could you explain:

Why are we worrying about AGI destroying humanity, when we ourselves are long past the point of no return towards self-destruction? Isn't it obvious that we have 10, maximum 20 years left till the water rises, crises hit the economy, and the overgrown beast (that is humanity) collapses? Looking at how governments and entities of power are epically failing even to try to make it seem that they are doing something about it - I am sure it's either AGI takes power or we are all dead in 20 years.

3Viliam2mo

The logic sounds like: "Given that I already have big problems paying my mortgage, what is the problem if on top of that I also decided to drive drunk?"

1Radford Neal3y

How did you come to have such a pessimistic view of climate change? I don't think you will get that from mainstream sources such as IPCC reports. There is zero chance that climate change will lead to human extinction. During the Paleocene-Eocene thermal maximum 55 million years ago, temperatures rose by much more than is plausible in the near future, and life went on, albeit with some extinctions. (Note that humans are about the least likely species to go extinct, due to our living in many habitats, using very adaptable technologies.) More likely, global warming would be like the Holocene Climatic Optimum, which couldn't have been all that bad, seeing as it coincided with the formation of the first human civilizations. At most, climate change might lead to the collapse of civilization, but only because civilizations are quite capable of collapsing from their own internal dynamics, and climate change disruptions might be the nudge that pushes us from the edge of the cliff to off the cliff.

1Vugluscr Varcharka3y

This is my point exactly - "At most, climate change might lead to the collapse of civilization, but only because civilizations are quite capable of collapsing from their own internal dynamics." I get my pessimistic view of climate change from the fact that they aimed at 1.5C, then at 2C, and now, if I remember right, there's no estimate and also no solution, or is there? In short, mild or not, global warming is happening, and since civs at a certain stage tend to self-destruct from small nudges (you said it yourself), it doesn't matter where the nudge comes from.

[+][anonymous]3y-13-21

