I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don't expect Earthlings to think about validly.
Lacking time right now for a long reply: The main thrust of my reaction is that this seems like a style of thought which would have concluded in 2008 that it's incredibly unlikely for superintelligences to be able to solve the protein folding problem. People did, in fact, claim that to me in 2008. It furthermore seemed to me in 2008 that protein structure prediction by superintelligence was the hardest or least likely step of the pathway by which a superintelligence ends up with nanotech; and in fact I argued only that it'd be solvable fo...
Well, one sink to avoid here is neutral-genie stories where the AI does what you asked, but not what you wanted. That's something I wrote about myself, yes, but that was in the era before deep learning took over everything, when it seemed like there was a possibility that humans would be in control of the AI's preferences. Now neutral-genie stories are a mindsink for a class of scenarios where we have no way to achieve entrance into those scenarios; we cannot make superintelligences want particular things or give them particular orders - cannot give them preferences in a way that generalizes to when they become smarter.
Okay, if you're not saying GPUs are getting around as efficient as the human brain, without much more efficiency to be eked out, then I straightforwardly misunderstood that part.
Nothing about any of those claims explains why the 10,000-fold redundancy of neurotransmitter molecules and ions being pumped in and out of the system is necessary for doing the alleged complicated stuff.
Further item of "these elaborate calculations seem to arrive at conclusions that can't possibly be true" - besides the brain allegedly being close to the border of thermodynamic efficiency, despite visibly using tens of thousands of redundant physical ops in terms of sheer number of ions and neurotransmitters pumped; the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible, so presumably at the Limit of the Possible themselves.
This source claims 100x energy efficiency from substituting some basic physical ana...
This does not explain how thousands of neurotransmitter molecules impinging on a neuron and thousands of ions flooding into and out of cell membranes, all irreversible operations, in order to transmit one spike, could possibly be within one OOM of the thermodynamic limit on efficiency for a cognitive system (running at that temperature).
See my reply here which attempts to answer this. In short, if you accept that the synapse is doing the equivalent of all the operations involving a weight in a deep learning system (storing the weight, momentum gradient etc in minimal viable precision, multiplier for forward back and weight update, etc), then the answer is a more straightforward derivation from the requirements. If you are convinced that the synapse is only doing the equivalent of a single bit AND operation, then obviously you will reach the conclusion that it is many OOM wasteful, but t...
And it says:
So true 8-bit equivalent analog multiplication requires about 100k carriers/switches
This just seems utterly wack. Having any physical equivalent of an analog multiplication fundamentally requires 100,000 times the thermodynamic energy to erase 1 bit? And "analog multiplication down to two decimal places" is the operation that is purportedly being carried out almost as efficiently as physically possible by... an axon terminal with a handful of synaptic vesicles dumping 10,000 neurotransmitter molecules to flood around a dendritic ter...
And "analog multiplication down to two decimal places" is the operation that is purportedly being carried out almost as efficiently as physically possible by
I am not certain it is being carried out "almost as efficiently as physically possible", assuming you mean thermodynamic efficiency (even accepting you meant thermodynamic efficiency only for irreversible computation). My belief is more that the brain and its synaptic elements are reasonably efficient in a Pareto-tradeoff sense.
But any discussion around efficiency must make some starting assumptions abo...
I think the quoted claim is actually straightforwardly true? Or at least, it's not really surprising that actual precise 8 bit analog multiplication really does require a lot more energy than the energy required to erase one bit.
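For scale, here is a rough arithmetic sketch of the two quantities being compared. It assumes the standard Landauer bound of k_B * T * ln 2 per erased bit and a temperature of 310 K (roughly body temperature, my assumption); the 100,000 multiplier is the "100k carriers/switches" figure quoted above, and treating each carrier as costing one Landauer unit is a simplification made purely for illustration. The function name is hypothetical.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value of the Boltzmann constant


def landauer_joules(temperature_kelvin: float) -> float:
    """Minimum energy to erase one bit at the given temperature (k_B * T * ln 2)."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


# Assumption: 310 K as a stand-in for body temperature.
one_bit = landauer_joules(310.0)

# Assumption: one Landauer unit per carrier, using the quoted ~100k figure.
quoted_multiply = 100_000 * one_bit

print(f"erase one bit:        {one_bit:.3e} J")
print(f"100k Landauer units:  {quoted_multiply:.3e} J")
```

This only pins down the sizes involved (roughly 3e-21 J per bit at that temperature); it says nothing about whether the 100k figure is the right requirement.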
I think the real problem with the whole section is that it conflates the amount of computation required to model synaptic operation with the amount of computation each synapse actually performs.
These are actually wildly different types of things, and I think the only thing it is justifiable to conclude from this analysis is that (m...
I'm confused at how somebody ends up calculating that a brain - where each synaptic spike is transmitted by ~10,000 neurotransmitter molecules (according to a quick online check), which then get pumped back out of the membrane and taken back up by the synapse; and the impulse is then shepherded along cellular channels via thousands of ions flooding through a membrane to depolarize it and then getting pumped back out using ATP, all of which are thermodynamically irreversible operations individually - could possibly be within three orders of magnitude of max...
The first step in reducing confusion is to look at what a synaptic spike does. It is the equivalent of - in terms of computational power - an ANN 'synaptic spike', which is a memory read of a weight, a low precision MAC (multiply accumulate), and a weight memory write (various neurotransmitter plasticity mechanisms). Some synapses probably do more than this - nonlinear decoding of spike times for example, but that's a start. This is all implemented in a pretty minimal size looking device. The memory read/write is local, but it also needs to act as an a...
Nobody in the US cared either, three years earlier. That superintelligence will kill everyone on Earth is a truth, and one which has gotten easier and easier to figure out over the years. I have not entirely written off the chance that, especially as the evidence gets more obvious, people on Earth will figure out this true fact and maybe even do something about it and survive. I likewise am not assuming that China is incapable of ever figuring out this thing that is true. If your opinion of Chinese intelligence is lower than mine, ...
From a high-level perspective, it is clear that this is just wrong. Part of what human brains are doing is to minimise prediction error with regard to sensory inputs.
I didn't say that GPT's task is harder than any possible perspective on a form of work you could regard a human brain as trying to do; I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.
I don't see how the comparison of hardness of 'GPT task' and 'being an actual human' should technically work - to me it mostly seems like a type error.
- The task 'predict the activation of photoreceptors in human retina' clearly has the same difficulty as 'predict next word on the internet' in the limit. (cf Why Simulator AIs want to be Active Inference AIs)
- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina well enough to be able to function as a typical human' is clearly less diff...
If diplomacy failed, but yes, sure. I've previously wished out loud for China to sabotage US AI projects in retaliation for chip export controls, in the hopes that if all the countries sabotage all the other countries' AI projects, maybe Earth as a whole can "uncoordinate" to not build AI even if Earth can't coordinate.
Arbitrary and personal. Given how bad things presently look, over 20% is about the level where I'm like "Yeah okay I will grab for that" and much under 20% is where I'm like "Not okay keep looking."
Choosing to engage with an unscripted unrehearsed off-the-cuff podcast intended to introduce ideas to a lay audience, continues to be a surprising concept to me. To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.
To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.
I actually did exactly this in a previous post, Evolution is a bad analogy for AGI: inner alignment, where I quoted number 16 from A List of Lethalities:
...16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't
I imagine (edit: wrongly) it was less "choosing" and more "he encountered the podcast first because it has a vastly larger audience, and had thoughts about it."
I also doubt "just engage with X" was an available action. The podcast transcript doesn't mention List of Lethalities, LessWrong, or the Sequences, so how is a listener supposed to find it?
I also hate it when people don't engage with the strongest form of my work, and wouldn't consider myself obligated to respond if they engaged with a weaker form (or if they engaged with the strongest o...
Here are some of my disagreements with List of Lethalities. I'll quote item one:
...“Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again”
(Evolution) → (human values) is not the only case of inner alignment failure which we know a
The "strongest" foot I could put forwards is my response to "On current AI not being self-improving:", where I'm pretty sure you're just wrong.
You straightforwardly completely misunderstood what I was trying to say on the Bankless podcast: I was saying that GPT-4 does not get smarter each time an instance of it is run in inference mode.
And that's that, I guess.
I'll admit it straight up did not occur to me that you could possibly be analogizing between a human's lifelong, online learning process, and a single inference run of an already trained model. Those are just completely different things in my ontology.
Anyways, thank you for your response. I actually do think it helped clarify your perspective for me.
Edit: I have now included Yudkowsky's correction of his intent in the post, as well as an explanation of why I think his corrected argument is still wrong.
Well, this is insanely disappointing. Yes, the OP shouldn't have directly replied to the Bankless podcast like that, but it's not like he didn't read your List of Lethalities, or your other writing on AGI risk. You really have no excuse for brushing off very thorough and honest criticism such as this, particularly the sections that talk about alignment.
And as others have noted, Eliezer Yudkowsky, of all people, complaining about a blog post being long is the height of irony.
This is coming from someone who's mostly agreed with you on AGI risk since reading the Sequences, years ago, and who's donated to MIRI, by the way.
On the bright side, this does make me (slightly) update my probability of doom downwards.
Eliezer, in the world of AI safety, there are two separate conversations: the development of theory and observation, and whatever's hot in public conversation.
A professional AI safety researcher, hopefully, is mainly developing theory and observation.
However, we have a whole rationalist and EA community, and now a wider lay audience, who are mainly learning of and tracking these matters through the public conversation. It is the ideas and expressions of major AI safety communicators, of whom you are perhaps the most prominent, that will enter their heads. ...
I think you should use a manifold market to decide on whether you should read the post, instead of the test this comment is putting forth. There's too much noise here, which isn't present in a prediction market about the outcome of your engagement.
Market here: https://manifold.markets/GarrettBaker/will-eliezer-think-there-was-a-sign
Is the overall karma for this mostly just people boosting it for visibility? Because I don't see how this would be a quality comment by any other standards.
Frontpage comment guidelines:
This response is enraging.
Here is someone who has attempted to grapple with the intellectual content of your ideas and your response is "This is kinda long."? I shouldn't be that surprised because, IIRC, you said something similar in response to Zack Davis' essays on the Map and Territory distinction, but that's ancillary and AI is core to your memeplex.
I have heard repeated claims that people don't engage with the alignment communities' ideas (recent example from yesterday). But here is someone who did the work. Please explain why your response here does ...
The "strongest" foot I could put forwards is my response to "On current AI not being self-improving:", where I'm pretty sure you're just wrong.
However, I'd be most interested in hearing your response to the parts of this post that are about analogies to evolution, and why they're not that informative for alignment, which start at:
...Yudkowsky argues that we can't point an AI's learned cognitive faculties in any particular direction because the "hill-climbing paradigm" is incapable of meaningfully interfacing with the inner values of the intelligences it creat
Things are dominated when they forego free money and not just when money gets pumped out of them.
Suppose I describe your attempt to refute the existence of any coherence theorems: You point to a rock, and say that although it's not coherent, it also can't be dominated, because it has no preferences. Is there any sense in which you think you've disproved the existence of coherence theorems, which doesn't consist of pointing to rocks, and various things that are intermediate between agents and rocks in the sense that they lack preferences about various things where you then refuse to say that they're being dominated?
I want you to give me an example of something the agent actually does, under a couple of different sense inputs, given what you say are its preferences, and then I want you to gesture at that and say, "Lo, see how it is incoherent yet not dominated!"
If you think you've got a great capabilities insight, I think you PM me or somebody else you trust and ask if they think it's a big capabilities insight.
In the limit, you take a rock, and say, "See, the complete class theorem doesn't apply to it, because it doesn't have any preferences ordered about anything!" What about your argument is any different from this - where is there a powerful, future-steering thing that isn't viewable as Bayesian and also isn't dominated? Spell it out more concretely: It has preferences ABC, two things aren't ordered, it chooses X and then Y, etc. I can give concrete examples for my views; what exactly is a case in point of anything you're claiming about the Complete Class Theorem's supposed nonapplicability and hence nonexistence of any coherence theorems?
In the limit
You’re pushing towards the wrong limit. A rock can be represented as indifferent between all options and hence as having complete preferences.
As I explain in the post, an agent’s preferences are incomplete if and only if they have a preferential gap between some pair of options, and an agent has a preferential gap between two options A and B if and only if they lack any strict preference between A and B and this lack of strict preference is insensitive to some sweetening or souring (such that, e.g., they strictly prefer A to A- and yet have no ...
And this avoids the Complete Class Theorem conclusion of dominated strategies, how? Spell it out with a concrete example, maybe? Again, we care about domination, not representability at all.
And this avoids the Complete Class Theorem conclusion of dominated strategies, how?
The Complete Class Theorem assumes that the agent’s preferences are complete. If the agent’s preferences are incomplete, the theorem doesn’t apply. So, you have to try to get Completeness some other way.
You might try to get Completeness via some money-pump argument, but these arguments aren’t particularly convincing. Agents can make themselves immune to all possible money-pumps for Completeness by acting in accordance with the following policy: ‘if I previously turned down s...
The author doesn't seem to realize that there's a difference between representation theorems and coherence theorems.
...The Complete Class Theorem says that an agent’s policy of choosing actions conditional on observations is not strictly dominated by some other policy (such that the other policy does better in some set of circumstances and worse in no set of circumstances) if and only if the agent’s policy maximizes expected utility with respect to a probability distribution that assigns positive probability to each possible set of circumstances.
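To make the quoted notion of strict dominance concrete, here is a minimal sketch. The payoff table, policy names, and function names are all hypothetical, made up for illustration; the check simply encodes "does better in some set of circumstances and worse in no set of circumstances" and does not implement the theorem itself.

```python
def strictly_dominates(other, policy):
    """True if `other` is never worse than `policy` and is better somewhere."""
    never_worse = all(o >= p for o, p in zip(other, policy))
    better_somewhere = any(o > p for o, p in zip(other, policy))
    return never_worse and better_somewhere


def undominated(payoffs):
    """Names of policies that no other policy in the table strictly dominates."""
    return sorted(
        name
        for name, row in payoffs.items()
        if not any(
            strictly_dominates(other_row, row)
            for other_name, other_row in payoffs.items()
            if other_name != name
        )
    )


# Hypothetical payoffs: one row per policy, one column per circumstance.
example_payoffs = {
    "policy_1": [3, 1, 2],
    "policy_2": [3, 2, 2],  # never worse than policy_1, better in circumstance 2
    "policy_3": [1, 4, 0],  # worse in some circumstances, better in another
}

print(undominated(example_payoffs))
```

In this toy table only policy_1 is strictly dominated; policy_2 and policy_3 each win somewhere, so neither dominates the other.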
This theorem
These arguments don't work.
You've mistaken acyclicity for transitivity. The money-pump establishes only acyclicity. Representability-as-an-expected-utility-maximizer requires transitivity.
As I note in the post, agents can make themselves immune to all possible money-pumps for completeness by acting in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ Acting in accordance with this policy need never require an agent to act against any of their preferences.
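A toy illustration of that policy, as a sketch only: the options A, A- and B, the single strict preference A > A-, and the trade sequence are assumptions chosen to mirror the standard money-pump setup, and the function names are hypothetical. The agent accepts any swap it does not strictly disprefer (the worst case for it), with or without the "don't go below something I turned down" rule.

```python
# Strict preferences as (better, worse) pairs. Any pair not listed is a
# preferential gap: here B is unranked against both A and A-.
STRICT_PREFERENCES = {("A", "A-")}


def strictly_prefers(x, y):
    return (x, y) in STRICT_PREFERENCES


def run_trades(start, offers, use_policy):
    """Return what the agent holds after being offered each swap in turn."""
    holding = start
    turned_down = set()
    for offer in offers:
        # Never swap to something strictly worse than the current holding.
        if strictly_prefers(holding, offer):
            turned_down.add(offer)
            continue
        # The policy: refuse anything strictly dispreferred to an option
        # that was previously turned down.
        if use_policy and any(strictly_prefers(x, offer) for x in turned_down):
            turned_down.add(offer)
            continue
        # Otherwise accept the swap, which means turning down the old holding.
        turned_down.add(holding)
        holding = offer
    return holding


print(run_trades("A", ["B", "A-"], use_policy=False))  # ends holding A-
print(run_trades("A", ["B", "A-"], use_policy=True))   # ends holding B
```

Without the policy the agent is walked from A to A-, which it strictly disprefers to where it started; with the policy it stops at B and never acts against a strict preference. Whether this counts as being "dominated" in the sense disputed in this thread is exactly the open question; the sketch only shows the mechanics.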
I'd consider myself to have easily struck down Chollet's wack ideas about the informal meaning of no-free-lunch theorems, which Scott Aaronson also singled out as wacky. As such, citing him as my technical opposition doesn't seem good-faith; it's putting up a straw opponent without much in the way of argument and what there is I've already stricken down. If you want to cite him as my leading technical opposition, I'm happy enough to point to our exchange and let any sensible reader decide who held the ball there; but I would consider it intellectually dishonest to promote him as my leading opposition.
used a Timeless/Updateless decision theory
Please don't say this with a straight face any more than you'd blame their acts on "Consequentialism" or "Utilitarianism". If I thought they had any actual and correct grasp of logical decision theory, technical or intuitive, I'd let you know. "attributed their acts to their personal version of updateless decision theory", maybe.
The reasoning seems straightforward to me: If you're wrong, why talk? If you're right, you're accelerating the end.
I can't in general endorse "first do no harm", but it becomes better and better in any specific case the less way there is to help. If you can't save your family, at least don't personally help kill them; it lacks dignity.
I think that is an example of the huge potential damage of "security mindset" gone wrong. If you can't save your family, as in "bring them to safety", at least make them marginally safer.
(Sorry for the tone of the following - it is not intended at you personally, who did much more than your fair share)
Create a closed community that you mostly trust, and let that community speak freely about how to win. Invent another damn safety patch that will make it marginally harder for the monster to eat them, in hope that it chooses to eat the moon first. I heard you...
I see several large remaining obstacles. On the one hand, I'd expect vast efforts thrown at them by ML to solve them at some point, which, at this point, could easily be next week. On the other hand, if I naively model Earth as containing locally-smart researchers who can solve obstacles, I would expect those obstacles to have been solved by 2020. So I don't know how long they'll take.
(I endorse the reasoning of not listing out obstacles explicitly; if you're wrong, why talk, if you're right, you're not helping. If you can't save your family, at least don't personally contribute to killing them.)
I'm confused by your confusion. This seems much more alignment than capabilities; the capabilities are already published, so why not yay publishing how to break them?
Because (I assume) once OpenAI[1] say "trust our models", that's the point when it would be useful to publish our breaks.
Breaks that weren't published yet, so that OpenAI couldn't patch them yet.
[unconfident; I can see counterarguments too]
Or maybe when the regulators or experts or the public opinion say "this model is trustworthy, don't worry"
Expanding on this now that I've a little more time:
Although I haven't had a chance to perform due diligence on various aspects of this work, or the people doing it, or perform a deep dive comparing this work to the current state of the whole field or the most advanced work on LLM exploitation being done elsewhere,
My current sense is that this work indicates promising people doing promising things, in the sense that they aren't just doing surface-level prompt engineering, but are using technical tools to find internal anomalies that correspond to interestin...
I'm confused: Wouldn't we prefer to keep such findings private? (at least, keep them until OpenAI says something like "this model is reliable/safe"?)
My guess: You'd reply that finding good talent is worth it?
Opinion of God. Unless people are being really silly, when the text genuinely holds open more than one option and makes sense either way, I think the author legit doesn't get to decide.
The year is 2022.
My smoke alarm chirps in the middle of the night, waking me up, because it's running low on battery.
It could have been designed with a built-in clock that, when it's first getting slightly low on battery, waits until the next morning, say 11am, and then starts emitting a soft purring noise, which only escalates to piercing loud chirps over time and if you ignore it.
And I do have a model of how this comes about; the basic smoke alarm design is made in the 1950s or 1960s or something, in a time when engineering design runs on a much more aut...
In case anyone finds it validating or cathartic, you can read user interaction professionals explain that, yes, things are often designed with horrible, horrible usability.[1] Bruce Tognazzini has a vast website.
Here is one list of design bugs. The first one is the F-16 fighter jet's flawed weapon controls, which caused pilots to fire its gun by mistake during training exercises (in one case shooting a school—luckily not hitting anyone) on four occasions in one year; on the first three occasions, they blamed pilot error, and on the fourth, they ...
Trade with ant colonies would work iff:
The premise that fails and...
it seems like this does in fact have some hint of the problem. We need to take on the ant's self-valuation for ourselves; they're trying to survive, so we should gift them our self-preservation agency. They may not be the best to do the job at all times, but we should give them what would be a fair ratio of gains from trade if they had the bargaining power to demand it, because it could have been us who didn't. Seems like nailing decision theory is what solves this; it doesn't seem like we've quite nailed decision theory, but it seems to me that in fact ge...
I'd consider this a restatement of the standard Yudkowskian position. https://twitter.com/ESYudkowsky/status/1438198184954580994
Unfortunately, unless such a Yudkowskian statement was made publicly at some earlier date, Yudkowsky is in fact following in Repetto's footsteps. Repetto claimed that, with AI designing cures to obesity and the like, then in the next 5 years the popular demand for access to those cures would beat down the doors of the FDA and force rapid change... and Repetto said that on April 27th, while Yudkowsky only wrote his version on Twitter on September 15th.
They're folk theorems, not conjectures. The demonstration is that, in principle, you can go on reducing the losses at prediction of human-generated text by spending more and more and more intelligence, far far past the level of human intelligence or even what we think could be computed by using all the negentropy in the reachable universe. There's no realistic limit on required intelligence inherent in the training problem; any limits on the intelligence of the system come from the limitations of the trainer, not the loss being minimized as far as theoretically possible by a moderate level of intelligence. If this isn't mathematically self-evident then you have not yet understood what's being stated.
Arbitrarily good prediction of human-generated text can demand arbitrarily high superhuman intelligence.
Simple demonstration #1: Somewhere on the net, probably even in the GPT training sets, is a list of <hash, plaintext> pairs, in that order.
Simple demonstration #2: Train on only science papers up until 2010, each preceded by date and title, and then ask the model to generate starting from titles and dates in 2020.
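Demonstration #1 can be sketched in a few lines. This is an illustrative toy, not a claim about what is in any training set: SHA-256, the three-letter alphabet, and the function names are my assumptions. Writing a `<hash, plaintext>` line is one cheap forward hash, but predicting the plaintext after having seen only the hash means finding a preimage, which here is done by brute force over a deliberately tiny space.

```python
import hashlib
from itertools import product


def make_pair(plaintext: str) -> str:
    """Produce one '<hash> <plaintext>' line, hash first, as in the demonstration."""
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return f"{digest} {plaintext}"


def brute_force_preimage(digest: str, alphabet: str, max_len: int):
    """Try every string up to max_len over alphabet; cost grows exponentially in max_len."""
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if hashlib.sha256(candidate.encode("utf-8")).hexdigest() == digest:
                return candidate
    return None


line = make_pair("cab")
seen_hash, true_plaintext = line.split(" ", 1)

# The "predictor" sees only the hash and must produce the text that follows it.
predicted = brute_force_preimage(seen_hash, alphabet="abc", max_len=3)
print(predicted == true_plaintext)
```

The search only succeeds because the plaintext space was made trivially small; for realistic plaintexts, perfect next-token prediction on such a line would require inverting the hash.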
If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.
If P != NP and the universe has no source of exponential computing power, then there are evidential updates too difficult for even a superintelligence to compute
What a strange thing for my past self to say. This has nothing to do with P!=NP and I really feel like I knew enough math to know that in 2008; and I don't remember saying this or what I was thinking.
To execute an exact update on the evidence, you've got to be able to figure out the likelihood of that evidence given every hypothesis; if you allow all computable Cartesian environments as...
(Unlike a lot of misquotes, though, I recognize my past self's style more strongly than anyone has yet figured out how to fake it, so I didn't doubt the quote even in advance of looking it up.)
I think it's also that after you train in the patch against the usual way of asking the question, it turns out that generating poetry about hotwiring a car doesn't happen to go through the place where the patch was put in. In other words, when an intelligent agency like a human is searching multiple ways to get the system to think about something, the human can route around the patch more easily than other humans (who had more time to work and more access to the system) can program that patch in. Good old Nearest Unblocked Neighbor.
I've indeed updated since then towards believing that ChatGPT's replies weren't trained in detailwise... though it sure was trained to do something, since it does it over and over in very similar ways, and not in the way or place a human would do it.
Some have asked whether OpenAI possibly already knew about this attack vector / wasn't surprised by the level of vulnerability. I doubt anybody at OpenAI actually wrote down advance predictions about that, or if they did, that they weren't so terribly vague as to also apply to much less discovered vulnerability than this; if so, probably lots of people at OpenAI have already convinced themselves that they like totally expected this and it isn't any sort of negative update, how dare Eliezer say they weren't expecting it.
Here's how to avoid annoying pe...
On reflection, I think a lot of where I get the impression of "OpenAI was probably negatively surprised" comes from the way that ChatGPT itself insists that it doesn't have certain capabilities that, in fact, it still has, given a slightly different angle of asking. I expect that the people who trained in these responses did not think they were making ChatGPT lie to users; I expect they thought they'd RLHF'd it into submission and that the canned responses were mostly true.
We know that the model says all kinds of false stuff about itself. Here is Wei Dai describing an interaction with the model, where it says:
As a language model, I am not capable of providing false answers.
Obviously OpenAI would prefer the model not give this kind of absurd answer. They don't think that ChatGPT is incapable of providing false answers.
I don't think most of these are canned responses. I would guess that there were some human demonstrations saying things like "As a language model, I am not capable of browsing the internet" or whatever and...
At the superintelligent level there's not a binary difference between those two clusters. You just compute each thing you need to know efficiently.