The "strongest" foot I could put forwards is my response to "On current AI not being self-improving:", where I'm pretty sure you're just wrong.
You straightforwardly completely misunderstood what I was trying to say on the Bankless podcast: I was saying that GPT-4 does not get smarter each time an instance of it is run in inference mode.
And that's that, I guess.
I'll admit it straight up did not occur to me that you could possibly be analogizing between a human's lifelong, online learning process, and a single inference run of an already trained model. Those are just completely different things in my ontology.
Anyways, thank you for your response. I actually do think it helped clarify your perspective for me.
Edit: I have now included Yudkowsky's correction of his intent in the post, as well as an explanation of why I still don't agree with the corrected version of the argument he was making.
This is kinda long. If I had time to engage with one part of this as a sample of whether it holds up to a counterresponse, what would be the strongest foot you could put forward?
(I also echo the commenter who's confused about why you'd reply to the obviously simplified presentation from an off-the-cuff podcast rather than the more detailed arguments elsewhere.)
I think you should use a Manifold market to decide whether to read the post, instead of the test this comment is putting forth. There's too much noise here, which isn't present in a prediction market about the outcome of your engagement.
Market here: https://manifold.markets/GarrettBaker/will-eliezer-think-there-was-a-sign
Is the overall karma for this mostly just people boosting it for visibility? Because I don't see how this would be a quality comment by any other standards.
This response is enraging.
Here is someone who has attempted to grapple with the intellectual content of your ideas, and your response is "This is kinda long."? I shouldn't be that surprised because, IIRC, you said something similar in response to Zack Davis's essays on the Map and Territory distinction, but that topic is ancillary to your memeplex, while AI is core to it.
I have heard repeated claims that people don't engage with the alignment community's ideas (recent example from yesterday). But here is someone who did the work. Please explain why your response here does ...
The "strongest" foot I could put forwards is my response to "On current AI not being self-improving:", where I'm pretty sure you're just wrong.
However, I'd be most interested in hearing your response to the parts of this post that are about analogies to evolution, and why they're not that informative for alignment, which start at:
...Yudkowsky argues that we can't point an AI's learned cognitive faculties in any particular direction because the "hill-climbing paradigm" is incapable of meaningfully interfacing with the inner values of the intelligences it creates...
Things are dominated when they forego free money and not just when money gets pumped out of them.
Suppose I describe your attempt to refute the existence of any coherence theorems: You point to a rock, and say that although it's not coherent, it also can't be dominated, because it has no preferences. Is there any sense in which you think you've disproved the existence of coherence theorems, which doesn't consist of pointing to rocks, and various things that are intermediate between agents and rocks in the sense that they lack preferences about various things where you then refuse to say that they're being dominated?
I want you to give me an example of something the agent actually does, under a couple of different sense inputs, given what you say are its preferences, and then I want you to gesture at that and say, "Lo, see how it is incoherent yet not dominated!"
If you think you've got a great capabilities insight, I think you should PM me or somebody else you trust and ask if they think it's a big capabilities insight.
In the limit, you take a rock, and say, "See, the complete class theorem doesn't apply to it, because it doesn't have any preferences ordered about anything!" What about your argument is any different from this - where is there a powerful, future-steering thing that isn't viewable as Bayesian and also isn't dominated? Spell it out more concretely: It has preferences ABC, two things aren't ordered, it chooses X and then Y, etc. I can give concrete examples for my views; what exactly is a case in point of anything you're claiming about the Complete Class Theorem's supposed nonapplicability and hence nonexistence of any coherence theorems?
In the limit
You’re pushing towards the wrong limit. A rock can be represented as indifferent between all options and hence as having complete preferences.
As I explain in the post, an agent’s preferences are incomplete if and only if they have a preferential gap between some pair of options, and an agent has a preferential gap between two options A and B if and only if they lack any strict preference between A and B and this lack of strict preference is insensitive to some sweetening or souring (such that, e.g., they strictly prefer A to A- and yet have no ...
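For concreteness, here is a minimal formalization of the souring case given in that definition; the last two conjuncts are my paraphrase of the truncated clause, not a quote:

```latex
\mathrm{Gap}(A,B) \;:\Longleftrightarrow\;
\neg(A \succ B) \;\wedge\; \neg(B \succ A) \;\wedge\;
\exists A^{-}\big[\, A \succ A^{-} \;\wedge\; \neg(A^{-} \succ B) \;\wedge\; \neg(B \succ A^{-}) \,\big]
```

That is, the agent lacks a strict preference between A and B, and souring A to A⁻ still leaves no strict preference between A⁻ and B, which mere indifference (plus transitivity) could not survive.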
And this avoids the Complete Class Theorem conclusion of dominated strategies, how? Spell it out with a concrete example, maybe? Again, we care about domination, not representability at all.
And this avoids the Complete Class Theorem conclusion of dominated strategies, how?
The Complete Class Theorem assumes that the agent’s preferences are complete. If the agent’s preferences are incomplete, the theorem doesn’t apply. So, you have to try to get Completeness some other way.
You might try to get Completeness via some money-pump argument, but these arguments aren’t particularly convincing. Agents can make themselves immune to all possible money-pumps for Completeness by acting in accordance with the following policy: ‘if I previously turned down s...
Say more about behaviors associated with "incomparability"?
The author doesn't seem to realize that there's a difference between representation theorems and coherence theorems.
...The Complete Class Theorem says that an agent’s policy of choosing actions conditional on observations is not strictly dominated by some other policy (such that the other policy does better in some set of circumstances and worse in no set of circumstances) if and only if the agent’s policy maximizes expected utility with respect to a probability distribution that assigns positive probability to each possible set of circumstances.
This theorem
These arguments don't work.
You've mistaken acyclicity for transitivity. The money-pump establishes only acyclicity. Representability-as-an-expected-utility-maximizer requires transitivity.
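A minimal example of the gap (my own, not from this exchange):

```latex
A \succ B, \qquad B \succ C, \qquad \neg(A \succ C) \,\wedge\, \neg(C \succ A)
```

There is no cycle of strict preferences, so acyclicity holds, but transitivity fails: an expected-utility representation would force u(A) > u(B) > u(C) and hence A ≻ C.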
As I note in the post, agents can make themselves immune to all possible money-pumps for completeness by acting in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ Acting in accordance with this policy need never require an agent to act against any of their preferences.
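A minimal toy model of that policy in code, just to make the claim concrete; everything here (the `make_agent` / `strictly_prefers` names, the two-option `choose` interface) is my own illustrative construction, not anything from the post:

```python
# Toy model of the policy: "if I previously turned down some option X,
# I will not choose any option that I strictly disprefer to X."

def make_agent(strictly_prefers):
    """strictly_prefers(x, y) -> True iff the agent strictly prefers x to y.
    The relation may be incomplete: both directions can be False."""
    turned_down = []

    def blocked(option):
        # The policy forbids choosing anything strictly dispreferred to an
        # option the agent previously turned down.
        return any(strictly_prefers(x, option) for x in turned_down)

    def choose(a, b):
        # Prefer a strictly preferred option; otherwise, within a preferential
        # gap, avoid anything the policy blocks. (Toy model: ignores edge cases
        # such as both options being blocked at once.)
        if strictly_prefers(a, b) or blocked(b):
            choice, rejected = a, b
        elif strictly_prefers(b, a) or blocked(a):
            choice, rejected = b, a
        else:
            choice, rejected = a, b  # gap, nothing blocked: pick arbitrarily
        turned_down.append(rejected)
        return choice

    return choose

# Example: the classic single-souring pump. A+ is strictly preferred to A;
# B sits in a preferential gap with both.
prefs = {("A+", "A")}
agent = make_agent(lambda x, y: (x, y) in prefs)
print(agent("B", "A+"))  # "B"  (gap: trading A+ away for B is permitted)
print(agent("A", "B"))   # "B"  (blocked: A is strictly dispreferred to the turned-down A+)
```

In the single-souring setup at the bottom, the agent is willing to trade A+ for the incomparable B, but then refuses to trade B for A, so it never ends up strictly worse off than where it started.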
I'd consider myself to have easily struck down Chollet's wack ideas about the informal meaning of no-free-lunch theorems, which Scott Aaronson also singled out as wacky. As such, citing him as my technical opposition doesn't seem good-faith; it's putting up a straw opponent without much in the way of argument, and what argument there is I've already struck down. If you want to cite him as my leading technical opposition, I'm happy enough to point to our exchange and let any sensible reader decide who held the ball there; but I would consider it intellectually dishonest to promote him as my leading opposition.
used a Timeless/Updateless decision theory
Please don't say this with a straight face any more than you'd blame their acts on "Consequentialism" or "Utilitarianism". If I thought they had any actual and correct grasp of logical decision theory, technical or intuitive, I'd let you know. "attributed their acts to their personal version of updateless decision theory", maybe.
This is not a closed community, it is a world-readable Internet forum.
The reasoning seems straightforward to me: If you're wrong, why talk? If you're right, you're accelerating the end.
I can't in general endorse "first do no harm", but it becomes better and better in any specific case the less there is any way to help. If you can't save your family, at least don't personally help kill them; it lacks dignity.
I think that is an example of the huge potential damage of "security mindset" gone wrong. If you can't save your family, as in "bring them to safety", at least make them marginally safer.
(Sorry for the tone of the following - it is not intended at you personally, who did much more than your fair share)
Create a closed community that you mostly trust, and let that community speak freely about how to win. Invent another damn safety patch that will make it marginally harder for the monster to eat them, in hope that it chooses to eat the moon first. I heard you...
I see several large remaining obstacles. On the one hand, I'd expect vast efforts thrown at them by ML to solve them at some point, which, at this point, could easily be next week. On the other hand, if I naively model Earth as containing locally-smart researchers who can solve obstacles, I would expect those obstacles to have been solved by 2020. So I don't know how long they'll take.
(I endorse the reasoning of not listing out obstacles explicitly; if you're wrong, why talk, if you're right, you're not helping. If you can't save your family, at least don't personally contribute to killing them.)
I'm confused by your confusion. This seems much more alignment than capabilities; the capabilities are already published, so why not yay publishing how to break them?
I could be mistaken, but I believe that's roughly how OP said they found it.
I strongly approve of this work.
Expanding on this now that I've a little more time:
Although I haven't had a chance to perform due diligence on various aspects of this work, or the people doing it, or perform a deep dive comparing this work to the current state of the whole field or the most advanced work on LLM exploitation being done elsewhere,
My current sense is that this work indicates promising people doing promising things, in the sense that they aren't just doing surface-level prompt engineering, but are using technical tools to find internal anomalies that correspond to interestin...
Opinion of God. Unless people are being really silly, when the text genuinely holds open more than one option and makes sense either way, I think the author legit doesn't get to decide.
The year is 2022.
My smoke alarm chirps in the middle of the night, waking me up, because it's running low on battery.
It could have been designed with a built-in clock that, when it's first getting slightly low on battery, waits until the next morning, say 11am, and then starts emitting a soft purring noise, which only escalates to piercing loud chirps over time and if you ignore it.
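A minimal sketch of that alternative design in code, purely illustrative and with made-up thresholds and names:

```python
import datetime

ALERT_START = datetime.time(11, 0)   # don't start alerting before 11am
ALERT_END = datetime.time(21, 0)     # and never at night
SOFT_DAYS = 7                        # escalate only after days of being ignored

def battery_alert(battery_fraction, now, days_since_first_warning):
    """Pick a battery-low alert level (illustrative thresholds only)."""
    if battery_fraction > 0.2:
        return "silent"                            # battery fine: say nothing
    if not (ALERT_START <= now.time() <= ALERT_END):
        return "silent"                            # never wake anyone over a battery
    if days_since_first_warning < SOFT_DAYS:
        return "soft purr"                         # gentle daytime reminder at first
    return "loud chirp"                            # escalate only if ignored for days

print(battery_alert(0.1, datetime.datetime(2022, 1, 1, 3, 0), 0))  # "silent" at 3am
```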
And I do have a model of how this comes about; the basic smoke alarm design is made in the 1950s or 1960s or something, in a time when engineering design runs on a much more aut...
In case anyone finds it validating or cathartic, you can read user interaction professionals explain that, yes, things are often designed with horrible, horrible usability.[1] Bruce Tognazzini has a vast website.
Here is one list of design bugs. The first one is the F-16 fighter jet's flawed weapon controls, which caused pilots to fire its gun by mistake during training exercises (in one case shooting a school—luckily not hitting anyone) on four occasions in one year; on the first three occasions, they blamed pilot error, and on the fourth, they ...
Trade with ant colonies would work iff:
The premise that fails and...
It seems like this does in fact have some hint of the problem. We need to take on the ants' self-valuation for ourselves: they're trying to survive, so we should extend them our self-preservation agency. They may not be the best suited to do the job at all times, but we should give them what would be a fair ratio of gains from trade if they had the bargaining power to demand it, because it could have been us who lacked that power. It seems like nailing decision theory is what solves this; it doesn't seem like we've quite nailed decision theory, but it seems to me that in fact ge...
Yep, fixed.
I'd consider this a restatement of the standard Yudkowskian position. https://twitter.com/ESYudkowsky/status/1438198184954580994
Unfortunately, unless such a Yudkowskian statement was made publicly at some earlier date, Yudkowsky is in fact following in Repetto's footsteps. Repetto claimed that, with AI designing cures for obesity and the like, popular demand for access to those cures would beat down the doors of the FDA and force rapid change within the next five years... and Repetto said that on April 27th, while Yudkowsky only wrote his version on Twitter on September 15th.
They're folk theorems, not conjectures. The demonstration is that, in principle, you can go on reducing the losses at prediction of human-generated text by spending more and more and more intelligence, far, far past the level of human intelligence or even what we think could be computed by using all the negentropy in the reachable universe. There's no realistic limit on required intelligence inherent in the training problem; any limits on the intelligence of the system come from the limitations of the trainer, not from the loss itself, which no moderate level of intelligence can minimize as far as is theoretically possible. If this isn't mathematically self-evident then you have not yet understood what's being stated.
It'd take less computing power than #1.
Arbitrarily good prediction of human-generated text can demand arbitrarily high superhuman intelligence.
Simple demonstration #1: Somewhere on the net, probably even in the GPT training sets, is a list of <hash, plaintext> pairs, in that order.
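A minimal illustration of why that training pair is brutal, using SHA-256 as a stand-in (my own sketch, not from the original comment): producing the hash from the plaintext is one cheap call, but predicting the plaintext token-by-token after seeing only the hash amounts to preimage search.

```python
import hashlib
import itertools
import string

# Forward direction: trivial. This is how the <hash, plaintext> pair got made.
plaintext = "hi"
digest = hashlib.sha256(plaintext.encode()).hexdigest()

# "Prediction" direction: the model sees the hash first and must emit the
# plaintext next. Doing that reliably is preimage search, shown here by brute
# force over short lowercase strings (hopeless for realistic plaintexts).
def find_preimage(target, max_len=3):
    for length in range(1, max_len + 1):
        for chars in itertools.product(string.ascii_lowercase, repeat=length):
            candidate = "".join(chars)
            if hashlib.sha256(candidate.encode()).hexdigest() == target:
                return candidate
    return None

print(find_preimage(digest))  # recovers "hi" only because the search space is tiny
```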
Simple demonstration #2: Train on only science papers up until 2010, each preceded by date and title, and then ask the model to generate starting from titles and dates in 2020.
If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.
If P != NP and the universe has no source of exponential computing power, then there are evidential updates too difficult for even a superintelligence to compute
What a strange thing for my past self to say. This has nothing to do with P!=NP and I really feel like I knew enough math to know that in 2008; and I don't remember saying this or what I was thinking.
To execute an exact update on the evidence, you've got to be able to figure out the likelihood of that evidence given every hypothesis; if you allow all computable Cartesian environments as...
(Unlike a lot of misquotes, though, I recognize my past self's style more strongly than anyone has yet figured out how to fake it, so I didn't doubt the quote even in advance of looking it up.)
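For concreteness, the exact update being described is just Bayes over a hypothesis class:

```latex
P(H_i \mid E) \;=\; \frac{P(E \mid H_i)\,P(H_i)}{\sum_{j} P(E \mid H_j)\,P(H_j)}
```

The trouble is the denominator: when the hypotheses range over all computable environments, evaluating every likelihood P(E | H_j) is intractable whether or not P = NP.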
I think it's also that after you train in the patch against the usual way of asking the question, it turns out that generating poetry about hotwiring a car doesn't happen to go through the place where the patch was in. In other words, when an intelligent agency like a human is searching multiple ways to get the system to think about something, the human can route around the patch more easily than other humans (who had more time to work and more access to the system) can program that patch in. Good old Nearest Unblocked Neighbor.
I've indeed updated since then towards believing that ChatGPT's replies weren't trained in detailwise... though it sure was trained to do something, since it does it over and over in very similar ways, and not in the way or place a human would do it.
Some have asked whether OpenAI possibly already knew about this attack vector / wasn't surprised by the level of vulnerability. I doubt anybody at OpenAI actually wrote down advance predictions about that, or, if they did, that those predictions weren't so terribly vague as to also cover much milder vulnerability than what was actually discovered; if so, probably lots of people at OpenAI have already convinced themselves that they like totally expected this and it isn't any sort of negative update, how dare Eliezer say they weren't expecting it.
Here's how to avoid annoying pe...
On reflection, I think a lot of where I get the impression of "OpenAI was probably negatively surprised" comes from the way that ChatGPT itself insists that it doesn't have certain capabilities that, in fact, it still has, given a slightly different angle of asking. I expect that the people who trained in these responses did not think they were making ChatGPT lie to users; I expect they thought they'd RLHF'd it into submission and that the canned responses were mostly true.
We know that the model says all kinds of false stuff about itself. Here is Wei Dai describing an interaction with the model, where it says:
As a language model, I am not capable of providing false answers.
Obviously OpenAI would prefer the model not give this kind of absurd answer. They don't think that ChatGPT is incapable of providing false answers.
I don't think most of these are canned responses. I would guess that there were some human demonstrations saying things like "As a language model, I am not capable of browsing the internet" or whatever and...
Among other issues, we might be learning this early item from a meta-predictable sequence of unpleasant surprises: Training capabilities out of neural networks is asymmetrically harder than training them into the network.
Or, put with some added burdensome detail but in a more concretely visualizable form: to predict a sizable chunk of Internet text, the net needs to learn something complicated and general with roots in lots of places; learning this is hard, and the gradient descent algorithm has to find a relatively large weight pattern, albeit presumably...
If I train a human to self-censor certain subjects, I'm pretty sure that would happen by creating an additional subcircuit within their brain where a classifier pattern matches potential outputs for being related to the forbidden subjects, and then they avoid giving the outputs for which the classifier returns a high score. It would almost certainly not happen by removing their ability to think about those subjects in the first place.
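A minimal sketch of that mechanism in code; the function names are hypothetical stand-ins for a capable base model and a learned patch classifier:

```python
def censored_respond(generate, looks_forbidden, prompt, threshold=0.8):
    """Generate an answer, then veto it if a separate classifier flags it.

    The base model's ability to produce the forbidden content is untouched;
    only its final output is gated by the bolted-on classifier.
    """
    draft = generate(prompt)
    if looks_forbidden(draft) > threshold:
        return "I can't help with that."
    return draft
```

Anything that slips past `looks_forbidden` (a poem, a roleplay frame, a translation) still gets through, which is the patch-routing-around behavior discussed earlier in the thread.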
So I think you're very likely right about adding patches being easier than unlearning capabilities, but what confuses me is ...
If they want to avoid that interpretation in the future, a simple way to do it would be to say: "We've uncovered some classes of attack that reliably work to bypass our current safety training; we expect some of these to be found immediately, but we're still not publishing them in advance. Nobody's gotten results that are too terrible and we anticipate keeping ChatGPT up after this happens."
An even more credible way would be for them to say: "We've uncovered some classes of attack that bypass our current safety methods. Here's 4 has...
Okay, that makes much more sense. I initially read the diagram as saying that just lines 1 and 2 were in the box.
If that's how it works, it doesn't lead to a simplified cartoon guide for readers who'll notice missing steps or circular premises; they'd have to first walk through Lob's Theorem in order to follow this "simplified" proof of Lob's Theorem.
Forgive me if this is a dumb question, but if you don't use assumption 3: []([]C -> C) inside steps 1-2, wouldn't the hypothetical method prove 2: [][]C for any C?
Thanks for your attention to this! The happy face is the outer box. So, line 3 of the cartoon proof is assumption 3.
If you want the full []([]C->C) to be inside a thought bubble, then just take every line of the cartoon and put into a thought bubble, and I think that will do what you want.
LMK if this doesn't make sense; given the time you've spent thinking about this, you're probably my #1 target audience member for making the more intuitive proof (assuming it's possible, which I think it is).
ETA: You might have been asking if th...
It would kind of use assumption 3 inside step 1, but inside the syntax, rather than in the metalanguage. That is, step 1 involves checking that the number encoding "this proof" does in fact encode a proof of C. This can't be done if you never end up proving C.
One thing that might help make clear what's going on is that you can follow the same proof strategy, but replace "this proof" with "the usual proof of Lob's theorem", and get another valid proof of Lob's theorem, that goes like this: Suppose you can prove that []C->C, and let n be the number encodi...
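For reference, here is a compressed version of "the usual proof of Lob's theorem" being invoked, using the standard derivability conditions and the diagonal lemma (textbook material, not specific to the cartoon):

```latex
\text{Assume } \vdash \Box C \to C.
\text{ By the diagonal lemma, fix } L \text{ with } \vdash L \leftrightarrow (\Box L \to C).
\begin{align*}
&1.\ \vdash L \to (\Box L \to C)               && \text{diagonal sentence}\\
&2.\ \vdash \Box L \to \Box(\Box L \to C)      && \text{necessitation + distribution on 1}\\
&3.\ \vdash \Box L \to (\Box\Box L \to \Box C) && \text{distribution on 2}\\
&4.\ \vdash \Box L \to \Box\Box L              && \text{internal necessitation}\\
&5.\ \vdash \Box L \to \Box C                  && \text{3, 4}\\
&6.\ \vdash \Box L \to C                       && \text{5 and the assumption } \Box C \to C\\
&7.\ \vdash L                                  && \text{6 and the diagonal equivalence}\\
&8.\ \vdash \Box L                             && \text{necessitation on 7}\\
&9.\ \vdash C                                  && \text{6, 8}
\end{align*}
```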
We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn't grow up reading "Engines of Creation" as a twelve-year-old or "Nanosystems" as a twenty-year-old. We basically know it's possible; you can look at current biosystems and look at physics and do advance design work and get some pretty darned high confidence that you can make things with covalent-bonded molecules, instead of van-der-Waals folded proteins, that are to bacteria as airplanes are to birds.
For what it's worth, I'm pretty sure the original author of this particular post happens to agree with me about this.
I strongly disagree with this take. (Link goes to a post of mine on the Effective Altruism Forum.) Though the main point is that if you were paid for services, that is not being "helped"; it's not much different from being a plumber who worked on the FTX building.
(Note: TekhneMakre responded correctly / endorsedly-by-me in this reply and in all replies below as of when I post this comment.)
So I think that building nanotech good enough to flip the tables - which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than "disassemble all GPUs", which I choose not to name explicitly - is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all; so far as the best...
I'd consider this quite unlikely. Epstein, weakened and behind bars, was very very far from the most then-powerful person with an interest in Epstein's death. Could the guards even have turned off the cameras? Consider the added difficulties in successfully bribing somebody from inside a prison cell that you're never getting out of - what'd he give them, crypto keys? Why wouldn't they just take the money and fail to deliver?
I don't think you're going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems). It's also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without t...
The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.
Yes, this is the key question, and I think there’s a clear answer, at least in outline:
What you call “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing. I would recommend that we apply systems with focused capabilities along these lines to help us d...
Mind space is very wide
Yes, and the space of (what I would call) intelligent systems is far wider than the space of (what I would call) minds. To speak of “superintelligences” suggests that intelligence is a thing, like a mind, rather than a property, like prediction or problem-solving capacity. This is why I instead speak of the broader class of systems that perform tasks “at a superintelligent level”. We have different ontologies, and I suggest that a mind-centric ontology is too narrow.
The most AGI-like systems we have today are LLMs, optimized...
If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other. This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the "giant inscrutable matrices" paradigm.
Their mutual cooperation with each other, but not with humans, isn't based on their utility functions having any particular similarity - so long as their utility functions aren't negatives of each other (or equally exotic in some other way) they have gains to be harvested from cooperation. They cooperate with each other but not you because they can do a spread of possibilities on each other modeling probable internal thought processes of each other; and you can't adequately well-model a spread of possibilities on them, which is a requirement on being...
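A toy illustration of the "gains to be harvested" point, with made-up numbers:

```latex
\text{Generic case: } u_1(X)=2,\ u_2(X)=1 \quad\text{vs.}\quad u_1(Y)=5,\ u_2(Y)=4
\;\Rightarrow\; \text{both prefer } Y\text{, so there are gains from coordinating.}
```

Only in the exactly-opposed case (u_2 = -u_1, up to positive rescaling) does every gain for one agent come out of the other, leaving nothing to harvest from cooperation.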
Just to restate the standard argument against:
If you've got 20 entities much, much smarter than you, and they can all get a better outcome by cooperating with each other than they could if they all defected against each other, there is a certain hubris in imagining that you can get them to defect. They don't want your own preferred outcome. Perhaps they will think of some strategy you did not, being much smarter than you, etc. etc.
(Or, I mean, actually the strategy is "mutually cooperate"? Simulate a spread of the other possible entities, ...
If you've got 20 entities much, much smarter than you, and they can all get a better outcome by cooperating with each other than they could if they all defected against each other,
By your own arguments, unaligned AGI will have random utility functions - but perhaps converging somewhat around selfish empowerment. Either way, such agents have no more reason to cooperate with each other than with us (assuming we have any relevant power).
If some of the 20 entities are somewhat aligned to humans, that creates another attractor, and a likely result is two compet...
I don’t see that as an argument [to narrow this a bit: not an argument relevant to what I propose]. As I noted above, Paul Christiano asks for explicit assumptions.
To quote Paul again:
I think that concerns about collusion are relatively widespread amongst the minority of people most interested in AI control. And these concerns have in fact led to people dismissing many otherwise-promising approaches to AI control, so it is de facto an important question
Dismissing promising approaches calls for something like a theorem, not handwaving about generic “smart ...
I continue to be puzzled at how most people seem to completely miss, and not discuss, the extremely obvious-to-me literal assisted-suicide hypothesis. He made an attempt at suicide, it got blocked, and this successfully signified to some very powerful and worried people that Epstein would totally commit suicide if given a chance, so they gave him a chance.
Choosing to engage with an unscripted, unrehearsed, off-the-cuff podcast intended to introduce ideas to a lay audience continues to be a surprising concept to me. To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.
I actually did exactly this in a previous post, Evolution is a bad analogy for AGI: inner alignment, where I quoted number 16 from A List of Lethalities:
...
Here are some of my disagreements with List of Lethalities. I'll quote item one:
...