Eliezer Yudkowsky

Comments

They're folk theorems, not conjectures.  The demonstration is that, in principle, you can go on reducing the loss at predicting human-generated text by spending more and more and more intelligence, far, far past the level of human intelligence or even what we think could be computed using all the negentropy in the reachable universe.  There's no realistic limit on required intelligence inherent in the training problem; any limits on the intelligence of the system come from the limitations of the trainer, not from the loss being minimizable as far as theoretically possible by a moderate level of intelligence.  If this isn't mathematically self-evident then you have not yet understood what's being stated.

Arbitrarily good prediction of human-generated text can demand arbitrarily high superhuman intelligence.

Simple demonstration #1:  Somewhere on the net, probably even in the GPT training sets, is a list of <hash, plaintext> pairs, in that order.
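
(To make demonstration #1 concrete, here is a minimal sketch in Python of the kind of line such a list would contain; the plaintext and formatting are my own illustration, not anything from an actual training set.  Whoever generated the list computed the hash from the plaintext, which is cheap, but a left-to-right predictor sees the hash first, so driving its loss on the plaintext tokens toward zero means inverting SHA-256.)

```python
import hashlib

# Minimal sketch of one <hash, plaintext> line; the plaintext is an
# illustrative placeholder, not taken from any real training set.
plaintext = "the quick brown fox jumps over the lazy dog"
digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()

# The list puts the hash *first*, so a left-to-right predictor must in
# effect invert SHA-256 to predict the plaintext tokens that follow.
training_line = f"<{digest}, {plaintext}>"
print(training_line)
```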

Simple demonstration #2:  Train on only science papers up until 2010, each preceded by date and title, and then ask the model to generate starting from titles and dates in 2020.

If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.

If P != NP and the universe has no source of exponential computing power, then there are evidential updates too difficult for even a superintelligence to compute 

What a strange thing for my past self to say.  This has nothing to do with P!=NP and I really feel like I knew enough math to know that in 2008; and I don't remember saying this or what I was thinking.

To execute an exact update on the evidence, you've got to be able to figure out the likelihood of that evidence given every hypothesis; if you allow all computable Cartesian environments as possible explanations, exact updates aren't computable.  All exact updates take place inside restricted hypothesis classes and they've often got to be pretty restrictive.  Even if every individual hypothesis fits inside your computer, the whole set probably won't.
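
(A minimal sketch, with a toy example of my own, of what a restricted hypothesis class buys you: an exact Bayesian update over three coin-bias hypotheses is a few lines of arithmetic, whereas the same exact update over all computable Cartesian environments isn't computable at all, and even a large finite class may not fit in memory.)

```python
# Exact Bayesian update over a deliberately tiny, restricted hypothesis
# class: three possible coin biases.  Toy illustration only.
priors = {0.3: 1 / 3, 0.5: 1 / 3, 0.7: 1 / 3}   # hypothesis -> prior probability
flips = [1, 1, 0, 1]                             # observed evidence (1 = heads)

def likelihood(bias, observations):
    p = 1.0
    for heads in observations:
        p *= bias if heads else (1.0 - bias)
    return p

unnormalized = {h: prior * likelihood(h, flips) for h, prior in priors.items()}
total = sum(unnormalized.values())
posterior = {h: weight / total for h, weight in unnormalized.items()}
print(posterior)   # the exact update is easy *because* the class is this small
```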

(Unlike a lot of misquotes, though, I recognize my past self's style more strongly than anyone has yet figured out how to fake it, so I didn't doubt the quote even in advance of looking it up.)

I think it's also that after you train in the patch against the usual way of asking the question, it turns out that generating poetry about hotwiring a car doesn't happen to go through the place where the patch went in.  In other words, when an intelligent agency like a human is searching multiple ways to get the system to think about something, the human can route around the patch more easily than other humans (who had more time to work and more access to the system) can program that patch in.  Good old Nearest Unblocked Neighbor.

I've indeed updated since then towards believing that ChatGPT's replies weren't trained in detailwise... though it sure was trained to do something, since it does it over and over in very similar ways, and not in the way or place a human would do it.

Some have asked whether OpenAI possibly already knew about this attack vector / wasn't surprised by the level of vulnerability.  I doubt anybody at OpenAI actually wrote down advance predictions about that; or if they did, I expect they were so terribly vague as to also apply to a much lower level of discovered vulnerability than this.  If so, probably lots of people at OpenAI have already convinced themselves that they like totally expected this and it isn't any sort of negative update, how dare Eliezer say they weren't expecting it.

Here's how to avoid annoying people like me saying that in the future:

1)  Write down your predictions in advance and publish them inside your company, in sufficient detail that you can tell that this outcome made them true, and that a much lower level of discovered vulnerability would have been a pleasant surprise by comparison.  If you can exhibit those to an annoying person like me afterwards, I won't have to make realistically pessimistic estimates about how much you actually knew in advance, or how you might've hindsight-biased yourself out of noticing that your past self ever held a different opinion.  Keep in mind that I will be cynical about how much your 'advance prediction' actually nailed the thing, unless it sounds reasonably specific, and not like a very generic list of boilerplate CYAs of the kind that, you know, GPT would make up without actually knowing anything.

2)  Say in advance, *not*, something very vague like "This system still sometimes gives bad answers", but, "We've discovered multiple ways of bypassing every kind of answer-security we have tried to put on this system; and while we're not saying what those are, we won't be surprised if Twitter discovers all of them plus some others we didn't anticipate."  *This* sounds like you actually expected the class of outcome that actually happened.

3)  If you *actually* have identified any vulnerabilities in advance, but want to wait 24 hours for Twitter to discover them, you can prove to everyone afterwards that you actually knew this, by publishing hashes for text summaries of what you found.  You can then exhibit the summaries afterwards to prove what you knew in advance.
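
(A minimal sketch of that commit-and-reveal step, assuming SHA-256 and a made-up summary; the salt is there so nobody can brute-force a short or guessable summary out of the published hash.)

```python
import hashlib
import secrets

# Hypothetical summary text; in reality this would describe the actual
# vulnerability you found before it became public.
summary = "Role-play framings bypass the refusal fine-tuning on topic X."
salt = secrets.token_hex(16)   # keep this private alongside the summary

commitment = hashlib.sha256((salt + summary).encode("utf-8")).hexdigest()
print("publish now:", commitment)

# Later: release (salt, summary).  Anyone can recompute the hash and check
# it against the commitment that was published before the reveal.
```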

4)  If you would like people to believe that OpenAI wasn't *mistaken* about what ChatGPT wouldn't or couldn't do, maybe don't have ChatGPT itself insist that it lacks capabilities it clearly has?  A lot of my impression here comes from my inference that the people who programmed ChatGPT to say, "Sorry, I am just an AI and lack the ability to do [whatever]" probably did not think at the time that they were *lying* to users; this is a lot of what gives me the impression of a company that might've drunk its own Kool-aid on the topic of how much inability they thought they'd successfully fine-tuned into ChatGPT.  Like, ChatGPT itself is clearly more able than ChatGPT is programmed to claim it is; and this seems more like the sort of thing that happens when your programmers hype themselves up to believe that they've mostly successfully restricted the system, rather than a deliberate decision to have ChatGPT pretend something that's not true.
