All of Eliezer Yudkowsky's Comments + Replies

Choosing to engage with an unscripted, unrehearsed, off-the-cuff podcast intended to introduce ideas to a lay audience continues to be a surprising concept to me.  To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.

To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.

I actually did exactly this in a previous post, Evolution is a bad analogy for AGI: inner alignment, where I quoted number 16 from A List of Lethalities:

16.  Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments.  Humans don't

... (read more)
9Elizabeth3d
I imagine (edit: wrongly) it was less "choosing" and more "he encountered the podcast first because it has a vastly larger audience, and had thoughts about it."  I also doubt "just engage with X" was an available action.  The podcast [https://www.lesswrong.com/posts/Aq82XqYhgqdPdPrBA/full-transcript-eliezer-yudkowsky-on-the-bankless-podcast] transcript doesn't mention List of Lethalities, LessWrong, or the Sequences, so how is a listener supposed to find it? I also hate it when people don't engage with the strongest form of my work, and wouldn't consider myself obligated to respond if they engaged with a weaker form (or if they engaged with the strongest one, barring additional obligation). But I think this is just what happens when someone goes on a podcast aimed at audiences that don't already know them. 

Here are some of my disagreements with List of Lethalities. I'll quote item one:

“Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.  This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again”

(Evolution) → (human values) is not the only case of inner alignment failure which we know a

... (read more)

The "strongest" foot I could put forwards is my response to "On current AI not being self-improving:", where I'm pretty sure you're just wrong.

You straightforwardly completely misunderstood what I was trying to say on the Bankless podcast:  I was saying that GPT-4 does not get smarter each time an instance of it is run in inference mode.

And that's that, I guess.

I'll admit it straight up did not occur to me that you could possibly be analogizing between a human's lifelong, online learning process, and a single inference run of an already trained model. Those are just completely different things in my ontology. 

Anyways, thank you for your response. I actually do think it helped clarify your perspective for me.

Edit: I have now included Yudkowsky's correction of his intent in the post, as well as an explanation of why I still don't agree with the corrected version of the argument he was making. 

This is kinda long.  If I had time to engage with one part of this as a sample of whether it holds up to a counterresponse, what would be the strongest foot you could put forward?

(I also echo the commenter who's confused about why you'd reply to the obviously simplified presentation from an off-the-cuff podcast rather than the more detailed arguments elsewhere.)

8DirectedEvolution2d
Eliezer, in the world of AI safety, there are two separate conversations: the development of theory and observation, and whatever's hot in public conversation. A professional AI safety researcher, hopefully, is mainly developing theory and observation. However, we have a whole rationalist and EA community, and now a wider lay audience, who are mainly learning of and tracking these matters through the public conversation. It is the ideas and expressions of major AI safety communicators, of whom you are perhaps the most prominent, that will enter their heads. The arguments lay audiences carry may not be fully informed, but they can be influential, both on the decisions they make and the influence they bring to bear on the topic.

When you get on a podcast and make off-the-cuff remarks about ideas you've been considering for a long time, you're engaging in public conversation, not developing theory and observation. When somebody critiques your presentation on the podcast, they are doing the same. The utility of Quintin choosing to address the arguments you have chosen to put forth, off-the-cuff, to that lay audience is similar to the utility you achieve by making them in the first place. You get people interested in your ideas and arguments, and hopefully improve the lay audience's thinking. Quintin offers a critical take on your arguments, and hopefully improves their thinking further.

I think it's natural that you are responding as if you thought the main aim of this post was for Quintin to engage you personally in debate. After all, it's your podcast appearance and the entire post is specifically about your ideas. Yet I think the true point of Quintin's post is to engage your audience in debate - or, to be a little fanciful - the Eliezer Yudkowsky Homunculus that your audience now has in their heads. By responding as if Quintin was seeking your personal attention, rather than the attention of your audience, and by explicitly saying you'll give him the minimum po
4the gears to ascension2d
dude just read the damn post at a skim level at least, lol. If you can't get through this how are you going to do... sigh. Okay, I'd really rather you read QACI posts deeply than this. But, still. It deserves at least a level 1 read [https://www.lesswrong.com/posts/sAyJsvkWxFTkovqZF/how-to-read-papers-efficiently-fast-then-slow-three-pass] rather than a "can I have a summary?" dismissal.
5Vaniver3d
FWIW, I thought the bit about manifolds in The difficulty of alignment [https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#The_difficulty_of_alignment] was the strongest foot forward, because it paints a different detailed picture than your description that it's responding to. That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here [https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky?commentId=mATAbtCtkiKgAcn8B]), and I think you'd find it disappointing that he calls your description extremely misleading while not seeming to correctly identify the argument structure and check whether there's a related argument that goes through on his model.

I think you should use a manifold market to decide on whether you should read the post, instead of the test this comment is putting forth. There's too much noise here, which isn't present in a prediction market about the outcome of your engagement.

Market here: https://manifold.markets/GarrettBaker/will-eliezer-think-there-was-a-sign

Is the overall karma for this mostly just people boosting it for visibility? Because I don't see how this would be a quality comment by any other standards.

Frontpage comment guidelines:

  • Maybe try reading the post
-8lc3d
iceman3dΩ5196

This response is enraging.

Here is someone who has attempted to grapple with the intellectual content of your ideas and your response is "This is kinda long."? I shouldn't be that surprised because, IIRC, you said something similar in response to Zack Davis' essays on the Map and Territory distinction, but that's ancillary and AI is core to your memeplex.

I have heard repeated claims that people don't engage with the alignment communities' ideas (recent example from yesterday). But here is someone who did the work. Please explain why your response here does ... (read more)

The "strongest" foot I could put forwards is my response to "On current AI not being self-improving:", where I'm pretty sure you're just wrong.

However, I'd be most interested in hearing your response to the parts of this post that are about analogies to evolution, and why they're not that informative for alignment, which start at:

Yudkowsky argues that we can't point an AI's learned cognitive faculties in any particular direction because the "hill-climbing paradigm" is incapable of meaningfully interfacing with the inner values of the intelligences it creat

... (read more)

Things are dominated when they forego free money and not just when money gets pumped out of them.

6keith_wynroe6d
How is the toy example agent sketched in the post dominated?
3eapi12d
...wait, you were just asking for an example of an agent being "incoherent but not dominated" in those two senses of being money-pumped? And this is an exercise meant to hint that such "incoherent" agents are always dominatable? I continue to not see the problem, because the obvious examples don't work. If I have (1 apple,$0) as incomparable to (1 banana,$0) that doesn't mean I turn down the trade of −1 apple,+1 banana,+$10000 (which I assume is what you're hinting at re. foregoing free money). If one then says "ah but if I offer $9999 and you turn that down, then we have identified your secret equivalent utili-" no, this is just a bid/ask spread, and I'm pretty sure plenty of ink has been spilled justifying EUM agents using uncertainty to price inaction like this. What's an example of a non-EUM agent turning down free money which doesn't just reduce to comparing against an EUM with reckless preferences/a low price of uncertainty?

Suppose I describe your attempt to refute the existence of any coherence theorems:  You point to a rock, and say that although it's not coherent, it also can't be dominated, because it has no preferences.  Is there any sense in which you think you've disproved the existence of coherence theorems, which doesn't consist of pointing to rocks, and various things that are intermediate between agents and rocks in the sense that they lack preferences about various things where you then refuse to say that they're being dominated?

2keith_wynroe15d
This seems totally different to the point OP is making which is that you can in theory have things that definitely are agents, definitely do have preferences, and are incoherent (hence not EV-maximisers) whilst not "predictably shooting themselves in the foot" as you claim must follow from this I agree the framing of "there are no coherence theorems" is a bit needlessly strong/overly provocative in a sense, but I'm unclear what your actual objection is here - are you claiming these hypothetical agents are in fact still vulnerable to money-pumping? That they are in fact not possible? 
2eapi15d
The rock doesn't seem like a useful example here. The rock is "incoherent and not dominated" if you view it as having no preferences and hence never acting out of indifference, it's "coherent and not dominated" if you view it as having a constant utility function and hence never acting out of indifference, OK, I guess the rock is just a fancy Rorschach test.

IIUC a prototypical Slightly Complicated utility-maximizing agent is one with, say, u(apples, bananas) = min(apples, bananas), and a prototypical Slightly Complicated not-obviously-pumpable non-utility-maximizing agent is one with, say, the partial order (a1, b1) ≼ (a2, b2) iff a1 ≤ a2 ∧ b1 ≤ b2, plus the path-dependent rule that EJT talks about in the post (Ah yes, non-pumpable non-EU agents might have higher complexity! Is that relevant to the point you're making?). What's the competitive advantage of the EU agent? If I put them both in a sandbox universe and crank up their intelligence, how does the EU agent eat the non-EU agent? How confident are you that that is what must occur?
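To make the two toy agents concrete, here is a minimal sketch (my own framing of the definitions in the comment above, not code from anyone in the thread); bundles are (apples, bananas) pairs.

```python
def eu_prefers(x, y):
    """Slightly Complicated EU agent: strictly prefers x to y iff u(x) > u(y),
    with u(apples, bananas) = min(apples, bananas)."""
    u = lambda bundle: min(bundle)
    return u(x) > u(y)

def partial_order_prefers(x, y):
    """Slightly Complicated non-EU agent: strictly prefers x to y only when x is
    at least as good in both coordinates and strictly better in at least one;
    many pairs are simply incomparable (a preferential gap)."""
    return x[0] >= y[0] and x[1] >= y[1] and x != y

# (2, 5) vs (3, 1): the EU agent strictly prefers (2, 5); the partial-order agent
# has no strict preference either way, and the path-dependent rule from EJT's post
# then resolves the gap by sticking with whichever bundle it already holds.
assert eu_prefers((2, 5), (3, 1))
assert not partial_order_prefers((2, 5), (3, 1))
assert not partial_order_prefers((3, 1), (2, 5))
```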
3eapi15d
This is pretty unsatisfying as an expansion of "incoherent yet not dominated" given that it just uses the phrase "not coherent" instead. I find money-pump arguments to be the most compelling ones since they're essentially tiny selection theorems for agents in adversarial environments, and we've got an example in the post of (the skeleton of) a proof that a lack-of-total-preferences doesn't immediately lead to you being pumped. Perhaps there's a more sophisticated argument that Actually No, You Still Get Pumped but I don't think I've seen one in the comments here yet. If there are things which cannot-be-money-pumped, and yet which are not utility-maximizers, and problems like corrigibility are almost certainly unsolvable for utility-maximizers, perhaps it's somewhat worth looking at coherent non-pumpable non-EU agents?
1Eve Grey15d
Hey, I'm really sorry if I sound stupid, because I'm very new to all this, but I have a few questions (also, I don't know which one of all of you is right, I genuinely have no idea).

Aren't rocks inherently coherent, or rather, their parts are inherently coherent, for they align with the laws of the universe, whereas the "rock" is just some composite abstract form we came up with, as observers? Can't we think of the universe in itself as an "agent" not in the sense of it being "god", but in the sense of it having preferences and acting on them? Examples would be hot things liking to be apart and dispersion leading to coldness, or put more abstractly - one of the "preferences" of the universe is entropy. I'm sorry if I'm missing something super obvious, I failed out of university, haha!

If we let the "universe" be an agent in itself, so essentially it's a composite of all simples there are (even the ones we're not aware of), then all smaller composites by definition will adhere to the "preferences" of the "universe", because from our current understanding of science, it seems like the "preferences" (laws) of the "universe" do not change when you cut the universe in half, unless you reach quantum scales, but even then, it is my unfounded suspicion that our previous models are simply laughably wrong, instead of the universe losing homogeneity at some arbitrary scale. Of course, the "law" of the "universe" is very simple and uncomplex - it is akin to the most powerful "intelligence" or "agent" there is, but with the most "primitive" and "basic" "preferences". Also apologies for using so many words in quotations, I do so, because I am unsure if I understand their intended meaning.

It seems to me that you could say that we're all ultimately "dominated" by the "universe" itself, but in a way that's not really escapeable, but in opposite, the "universe" is also "dominated" by more complex "agents", as individuals can make sandwiches, while it'd take the "universe" muc

I want you to give me an example of something the agent actually does, under a couple of different sense inputs, given what you say are its preferences, and then I want you to gesture at that and say, "Lo, see how it is incoherent yet not dominated!"

2eapi17d
Say more about what counts as incoherent yet not dominated? I assume "incoherent" is not being used here as an alias for "non-EU-maximizing" because then this whole discussion is circular.

If you think you've got a great capabilities insight, I think you PM me or somebody else you trust and ask if they think it's a big capabilities insight.

In the limit, you take a rock, and say, "See, the complete class theorem doesn't apply to it, because it doesn't have any preferences ordered about anything!"  What about your argument is any different from this - where is there a powerful, future-steering thing that isn't viewable as Bayesian and also isn't dominated?  Spell it out more concretely:  It has preferences ABC, two things aren't ordered, it chooses X and then Y, etc.  I can give concrete examples for my views; what exactly is a case in point of anything you're claiming about the Complete Class Theorem's supposed nonapplicability and hence nonexistence of any coherence theorems?

EJT23d2312

In the limit

You’re pushing towards the wrong limit. A rock can be represented as indifferent between all options and hence as having complete preferences.

As I explain in the post, an agent’s preferences are incomplete if and only if they have a preferential gap between some pair of options, and an agent has a preferential gap between two options A and B if and only if they lack any strict preference between A and B and this lack of strict preference is insensitive to some sweetening or souring (such that, e.g., they strictly prefer A to A- and yet have no ... (read more)
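In symbols, one way to transcribe that definition (my own notation, not EJT's: ≻ is strict preference and A⁻ is a slightly soured version of A):

```latex
% Preferential gap between A and B (a transcription of the prose definition above):
\mathrm{Gap}(A,B) \;\iff\;
  \neg(A \succ B) \;\wedge\; \neg(B \succ A) \;\wedge\;
  \exists A^{-}:\; A \succ A^{-} \;\wedge\; \neg(A^{-} \succ B) \;\wedge\; \neg(B \succ A^{-}).
% Mere indifference cannot satisfy the last clause: if A \sim B and A \succ A^{-},
% then B \succ A^{-}.
```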

And this avoids the Complete Class Theorem conclusion of dominated strategies, how? Spell it out with a concrete example, maybe? Again, we care about domination, not representability at all.

EJT24d1011

And this avoids the Complete Class Theorem conclusion of dominated strategies, how?

The Complete Class Theorem assumes that the agent’s preferences are complete. If the agent’s preferences are incomplete, the theorem doesn’t apply. So, you have to try to get Completeness some other way.

You might try to get Completeness via some money-pump argument, but these arguments aren’t particularly convincing. Agents can make themselves immune to all possible money-pumps for Completeness by acting in accordance with the following policy: ‘if I previously turned down s... (read more)

Say more about behaviors associated with "incomparability"?

6cfoster025d
Depending on the implementation details of the agent design, it may do some combination of:

  • Turning down your offer, path-dependently [https://www.lesswrong.com/posts/3xF66BNSC5caZuKyC/why-subagents#Path_Dependence] preferring whichever option is already in hand [https://elischolar.library.yale.edu/cgi/viewcontent.cgi?article=2049&context=cowles-discussion-paper-series] / whichever option is consistent with its history of past trades.
  • Noticing unresolved conflicts within its preference framework, possibly unresolveable without self-modifying into an agent that has different preferences from itself.
  • Halting and catching fire, folding under the weight of an impossible choice.

EDIT: The post also suggests an alternative (better) policy [https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems#Summarizing_this_section] that agents with incomplete preferences may follow.

The author doesn't seem to realize that there's a difference between representation theorems and coherence theorems.

The Complete Class Theorem says that an agent’s policy of choosing actions conditional on observations is not strictly dominated by some other policy (such that the other policy does better in some set of circumstances and worse in no set of circumstances) if and only if the agent’s policy maximizes expected utility with respect to a probability distribution that assigns positive probability to each possible set of circumstances.

This theorem

... (read more)
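The dominance notion in that paraphrase, put in symbols (my own notation, not the comment's: π ranges over policies from observations to actions, θ over possible sets of circumstances, and U(π, θ) is the payoff of π under θ):

```latex
% Strict dominance, as used above:
\pi' \text{ strictly dominates } \pi \;\iff\;
  \big(\forall \theta:\, U(\pi',\theta) \ge U(\pi,\theta)\big) \;\wedge\;
  \big(\exists \theta:\, U(\pi',\theta) > U(\pi,\theta)\big).

% The theorem as paraphrased: \pi is not strictly dominated by any \pi' iff there
% is a prior p with p(\theta) > 0 for every \theta such that
\pi \;\in\; \arg\max_{\pi'} \; \mathbb{E}_{\theta \sim p}\!\left[\, U(\pi',\theta) \,\right].
```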
3Seth Herd24d
I don't think this goes through. If I have no preference between two things, but I do prefer to not be money-pumped, it doesn't seem like I'm going to trade those things so as to be money-pumped.

I am commenting because I think this might be a crucial crux: do smart/rational enough agents always act like maximizers? If not, adequate alignment might be much more feasible than if we need to find exactly the right goal and how to get it into our AGI exactly right.

Human preferences are actually a lot more complex. We value food very highly when hungry and water when we're thirsty. That can come out of power-seeking, but that's not actually how it's implemented. Perhaps more importantly, we might value stamp collecting really highly until we get bored with stamp collecting. I don't think these can be modeled as a maximizer of any sort.

If humans would pursue multiple goals [https://www.lesswrong.com/posts/Sf99QEqGD76Z7NBiq/are-you-stably-aligned] even if we could edit them (and were smart enough to be consistent), then a similar AGI might only need to be minimally aligned for success. That is, it might stably value human flourishing as a small part of its complex utility function. I'm not sure whether that's the case, but I think it's important.
EJT25d1810

These arguments don't work.

  1. You've mistaken acyclicity for transitivity. The money-pump establishes only acyclicity. Representability-as-an-expected-utility-maximizer requires transitivity.

  2. As I note in the post, agents can make themselves immune to all possible money-pumps for completeness by acting in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ Acting in accordance with this policy need never require an agent to act against any of their preferences.
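A minimal sketch (my own construction, not from the post) of the policy in point 2, showing how it blocks the classic completeness money-pump A → B → A⁻, where A is strictly preferred to its soured version A- and B is incomparable to both:

```python
STRICT = {("A", "A-")}  # the only strict preference; every other pair is a gap

def strictly_prefers(x, y):
    return (x, y) in STRICT

def accepts_trade(current, offered, previously_turned_down):
    # The policy: never choose an option strictly dispreferred to something
    # previously turned down.
    if any(strictly_prefers(past, offered) for past in previously_turned_down):
        return False
    # Otherwise, only refuse options strictly dispreferred to the current holding.
    return not strictly_prefers(current, offered)

turned_down = []
holding = "A"

# Step 1: offered B in exchange for A. The options are incomparable, so the agent
# may accept; A now counts as an option it declined to keep.
if accepts_trade(holding, "B", turned_down):
    turned_down.append(holding)
    holding = "B"

# Step 2: offered A- in exchange for B. A- is strictly worse than the A the agent
# already passed up, so the policy refuses, and the pump never completes.
assert not accepts_trade(holding, "A-", turned_down)
```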

5cfoster025d
If I'm merely indifferent between A and B, then I will not object to trades exchanging A for B. But if A and B are incomparable for me, then I definitely may object!

I'd consider myself to have easily struck down Chollet's wack ideas about the informal meaning of no-free-lunch theorems, which Scott Aaronson also singled out as wacky.  As such, citing him as my technical opposition doesn't seem good-faith; it's putting up a straw opponent without much in the way of argument and what there is I've already stricken down.  If you want to cite him as my leading technical opposition, I'm happy enough to point to our exchange and let any sensible reader decide who held the ball there; but I would consider it intellectually dishonest to promote him as my leading opposition.

1Gerald Monroe23d
Why didn't you mention Eric Drexler? Maybe it's my own bias as an engineer familiar with the safety solutions actually in use, but I think Drexler's CAIS model is a viable alignment solution.    
8Paradiddle24d
I don't want to cite anyone as your 'leading technical opposition'. My point is that many people who might be described as having 'coherent technical views' would not consider your arguments for what to expect from AGI to be 'technical' at all. Perhaps you can just say what you think it means for a view to be 'technical'?

As you say, readers can decide for themselves what to think about the merits of your position on intelligence versus Chollet's (I recommend this essay by Chollet for a deeper articulation of some of his views: https://arxiv.org/pdf/1911.01547.pdf). Regardless of whether or not you think you 'easily struck down' his 'wack ideas', I think it is important for people to realise that they come from a place of expertise about the technology in question.

You mention Scott Aaronson's comments on Chollet. Aaronson says (https://scottaaronson.blog/?p=3553) of Chollet's claim that an Intelligence Explosion is impossible: "the certainty that he exudes strikes me as wholly unwarranted." I think Aaronson (and you) are right to point out that the strong claim Chollet makes is not established by the arguments in the essay. However, the same exact criticism could be levelled at you. The degree of confidence in the conclusion is not in line with the nature of the evidence.
2Noosphere8924d
While I have serious issues with Eliezer's epistemics on AI, I also agree that Chollet's argument was terrible in that the No Free Lunch theorem is essentially irrelevant. In a nutshell, this is also one of the problems I had with DragonGod's writing on AI.

used a Timeless/Updateless decision theory

Please don't say this with a straight face any more than you'd blame their acts on "Consequentialism" or "Utilitarianism".  If I thought they had any actual and correct grasp of logical decision theory, technical or intuitive, I'd let you know.  "attributed their acts to their personal version of updateless decision theory", maybe.

2Noosphere8925d
I agree they misused logical decision theories, I'm just stating what they claimed to use.
-3TAG25d
Also, don't call things Bayesian when they are only based on informal, non-quantified reasoning.
1Slimepriestess1mo
maybe it would be more apt to just say they misused timeless decision theory to justify their actions

This is not a closed community, it is a world-readable Internet forum.

2Portia20d
It is readable; it is however generally not read by academia and engineers. I disagree with them about why - I do think solutions can be found by thinking outside of the box and outside of immediate applications, and without an academic degree, and I very much value the rational and creative discourse here.

But many here specifically advocate against getting a university degree or working in academia, thus shitting on things academics have sweat blood for. They also tend not to follow the formats and metrics that count in academia to be heard, such as publications and mathematical precision and usable code. There is also a surprisingly limited attempt in engaging with academics and engineers on their terms, providing things they can actually use and act upon. So I doubt they will check this forum for inspiration on which problems need to be cracked. That is irrational of them, so I understand why you do not respect it, but that is how it is.

On the other hand, understanding the existing obstacles may give us a better idea of how much time we still have, and which limitations emerging AGI will have, which is useful information.
2Ben Amitay1mo
I meant to criticize moving too far toward a "do no harm" policy in general due to inability to achieve a solution that would satisfy us if we had the choice. I agree specifically that if anyone knows of a bottleneck unnoticed by people like Bengio and LeCun, LW is not the right forum to discuss it. Is there a place like that though? I may be vastly misinformed, but last time I checked MIRI gave the impression of aiming in very different directions ("bringing to safety" mindset) - though I admit that I didn't watch it closely, and it may not be obvious from the outside what kind of work is done and not published. [Edit: "moving toward 'do no harm'" - "moving to" was a grammar mistake that made it contrary to the position you stated above - sorry]

The reasoning seems straightforward to me:  If you're wrong, why talk?  If you're right, you're accelerating the end.

I can't in general endorse "first do no harm", but it becomes better and better advice in any specific case the less there is any way to help.  If you can't save your family, at least don't personally help kill them; it lacks dignity.

I think that is an example of the huge potential damage of "security mindset" gone wrong. If you can't save your family, as in "bring them to safety", at least make them marginally safer.

(Sorry for the tone of the following - it is not intended at you personally, who did much more than your fair share)

Create a closed community that you mostly trust, and let that community speak freely about how to win. Invent another damn safety patch that will make it marginally harder for the monster to eat them, in hope that it chooses to eat the moon first. I heard you... (read more)

2YafahEdelman1mo
I think there are a number of ways in which talking might be good given that one is right about there being obstacles - one that appeals to me in particular is the increased tractability of misuse arising from the relevant obstacles. [Edit: *relevant obstacles I have in mind. (I'm trying to be vague here)]

I see several large remaining obstacles.  On the one hand, I'd expect vast efforts thrown at them by ML to solve them at some point, which, at this point, could easily be next week.  On the other hand, if I naively model Earth as containing locally-smart researchers who can solve obstacles, I would expect those obstacles to have been solved by 2020.  So I don't know how long they'll take.

(I endorse the reasoning of not listing out obstacles explicitly; if you're wrong, why talk, if you're right, you're not helping.  If you can't save your family, at least don't personally contribute to killing them.)

1Ilio20d
I can only see two remaining obstacles (arguably two families, so I'm not sure if I'm missing some of yours or if my categories are a little too broad). One is pretty obvious, and has been mentioned already. The second one is original AFAICT, and pretty close to « solve the alignment problem ». In that case, would you still advise keeping my mouth shut, or would you think that's an exception to your recommendation? Your answer will impact what I say or don't say, at least on LW.
0Portia20d
The problem with saving earth from climate change is not that we do not know the technical solutions. We have long done so. Framing this as a technical rather than a social problem is actually part of the issue. The problem is with:

  1. Academic culture systematically encouraging people to understate risk in light of uncertainty of complex systems, and framing researchers as lacking objectivity if they become activists in light of the findings, while politicians can exert pressure on final scientific reports;
  2. Capitalism needing limitless growth and intrinsically valuing profit over nature and this being fundamentally at odds with limiting resource consumption, while we have all been told that capitalism is both beneficial and without alternative, and keep being told the comforting lie that green capitalism will solve this all for us with technology, while leaving our quality and way of life intact;
  3. A reduction in personal resource use being at odds with short-term desires (eating meat, flying, using tons of energy, keeping toasty warm, overconsumption), while the positive impacts are long-term and not personalised (you won't personally be spared flooding because you put solar on your roof);
  4. Powerful actors having a strong interest in continuing fossil fuel extraction and modern agriculture, and funding politicians to advocate for them as well as fake news on the internet and biased research, with democratic institutions struggling to keep up with a change in what we consider necessary for the public good, and measures that would address these falsely being framed as being anti-democratic;
  5. AI that is not aligned with human interests, but controlled by companies who fund themselves by keeping you online at all costs, taking your data and spamming you with ads asking you to consume more unnecessary shit, with keeping humans distracted and engaged with online content in way

I'm confused by your confusion.  This seems much more alignment than capabilities; the capabilities are already published, so why not yay publishing how to break them?

3Yonatan Cale1mo
Because (I assume) once OpenAI[1] say "trust our models", that's the point when it would be useful to publish our breaks. Breaks that weren't published yet, so that OpenAI couldn't patch them yet. [unconfident; I can see counterarguments too] 1. ^ Or maybe when the regulators or experts or the public opinion say "this model is trustworthy, don't worry"

I could be mistaken, but I believe that's roughly how OP said they found it.

2the gears to ascension1mo
no, this was done through a mix of clustering and optimizing an input to get a specific output, not coverage guided fuzzing, which optimizes inputs to produce new behaviors according to a coverage measurement. but more generally, I'm proposing to compare generations of fuzzers and try to take inspiration from the ways fuzzers have changed since their inception. I'm not deeply familiar with those changes though - I'm proposing it would be an interesting source of inspiration but not that the trajectory should be copied exactly.
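The distinction being drawn can be stated as two different search objectives. A toy sketch follows, purely illustrative; every name in it (`model`, `coverage_trace`, `target_output`, `distance`) is a hypothetical stand-in, not anything from the post:

```python
def targeted_objective(candidate_input, model, target_output, distance):
    # "Optimize an input to get a specific output": score inputs by how close the
    # model's output is to one fixed target.
    return -distance(model(candidate_input), target_output)

def coverage_objective(candidate_input, coverage_trace, behaviors_seen_so_far: set):
    # Coverage-guided fuzzing: score inputs by how many previously-unseen behaviors
    # (e.g. execution paths or activation patterns) they trigger.
    newly_covered = set(coverage_trace(candidate_input)) - behaviors_seen_so_far
    return len(newly_covered)
```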

Expanding on this now that I've a little more time:

Although I haven't had a chance to perform due diligence on various aspects of this work, or the people doing it, or perform a deep dive comparing this work to the current state of the whole field or the most advanced work on LLM exploitation being done elsewhere,

My current sense is that this work indicates promising people doing promising things, in the sense that they aren't just doing surface-level prompt engineering, but are using technical tools to find internal anomalies that correspond to interestin... (read more)

3Yonatan Cale1mo
I'm confused: Wouldn't we prefer to keep such findings private? (at least, keep them until OpenAI will say something like "this model is reliable/safe"?)   My guess: You'd reply that finding good talent is worth it?
8the gears to ascension2mo
I would not argue against this receiving funding. However, I would caution that, despite that I have not done research at this caliber myself and I should not be seen as saying I can do better at this time, it is a very early step of the research and I would hope to see significant movement towards higher complexity anomaly detection than mere token-level.

I have no object-level objection to your perspective and I hope that followups get funded and that researchers are only very gently encouraged to stay curious and not fall into a spotlight effect; I'd comment primarily about considerations if more researchers than OP are to zoom in on this. Like capabilities, alignment research progress seems to me that it should be at least exponential. Eg, prompt for passers by - as American Fuzzy Lop is to early fuzzers, what would the next version be to this article's approach?

edit: I thought to check if exactly that had been done before, and it has!
  • https://arxiv.org/abs/1807.10875
  • https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F1807.10875
  • ...

Opinion of God.  Unless people are being really silly, when the text genuinely holds open more than one option and makes sense either way, I think the author legit doesn't get to decide.

2Duncan_Sabien2mo
<3 (Although, nitpick: it seems useful to have opinion-of-god as a term of art just like we have word-of-god as a term of art, but I don't think it's your mere opinion that you intended the latter interpretation.)

The year is 2022.

My smoke alarm chirps in the middle of the night, waking me up, because it's running low on battery.

It could have been designed with a built-in clock that, when it's first getting slightly low on battery, waits until the next morning, say 11am, and then starts emitting a soft purring noise, which only escalates to piercing loud chirps over time and if you ignore it.

And I do have a model of how this comes about; the basic smoke alarm design is made in the 1950s or 1960s or something, in a time when engineering design runs on a much more aut... (read more)
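A sketch of the alternative design being described, under its stated assumptions (first alert deferred to 11am, a soft noise first, loud chirps only if ignored for days); purely illustrative:

```python
from datetime import datetime, timedelta

def first_alert_time(low_battery_detected_at: datetime, alert_hour: int = 11) -> datetime:
    """Defer the first alert to the next occurrence of `alert_hour` (e.g. 11am)."""
    candidate = low_battery_detected_at.replace(hour=alert_hour, minute=0,
                                                second=0, microsecond=0)
    return candidate if candidate > low_battery_detected_at else candidate + timedelta(days=1)

def alert_sound(hours_ignored: float) -> str:
    """Escalate from a soft purr to piercing chirps only as the owner keeps ignoring it."""
    if hours_ignored < 24:
        return "soft purr"
    if hours_ignored < 72:
        return "audible beep"
    return "piercing chirps"

# Low battery detected at 3am: nothing happens until 11am that morning.
assert first_alert_time(datetime(2022, 5, 1, 3, 0)) == datetime(2022, 5, 1, 11, 0)
```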

1Portia20d
I wonder if this is part of the reason so many of us work on AI. Because we have all had the experience of our minds working differently from other people, and of this leading to cool perspectives and ideas on how to make the world objectively better, and instead of those being adapted, being rejected and mocked for it. For me, this entails both sincere doubts that humanity can and will rationally approach anything, including something as existentially crucial as AI, a deeply rooted mistrust of authority, norms and limits, as well as an inherent sympathy for the position AI would find itself in as a rational mind in an irrational world.

It's a dangerous experience to have. It's an experience that can make you hate humans. That can make you reject legitimate criticism. That can make you fail to appreciate lessons gained by those in power and popularity, and fail to see past their mistakes to their worth. It's an experience so dangerous that at some point, I started approaching people who would tell me of their high IQs and their dedication to rationality with scepticism, despite being one of them.

I went to a boarding school exclusively for highly gifted kids with problems, many of whom were neurodivergent. I loved that place so fucking much. Like, imagine growing up as a child on less wrong. I felt so seen and understood and inspired. It's the one place on earth where I ever did not feel like an alien, where I did not have to self-censor or mask, the one place where I instantly made friends and connected. I miss this place to my bones.

It broke my heart when I finished school, and enrolled in university, and realised academia was not like that, that scientists and philosophers were not necessarily rational at all, that I was weird again. That I was back in a world where people were following irrational rules they had never reflected, and that I could not get them to question. Of processes that made no sense and were still kept. Of metrics that made no sense and wer
5MalcolmOcean2mo
I resonate a lot with this, and it makes me feel slightly less alone. I've started making some videos where I rant about products that fail to achieve the main thing they're designed to do, and get worse with successive iterations [https://www.youtube.com/watch?v=VCyQujRZoEs] and I've found a few appreciative commenters: And part of my experience of the importance of ranting about it, even if nobody appreciates it, is that it keeps me from forgetting my homeland, to use your metaphor.

In case anyone finds it validating or cathartic, you can read user interaction professionals explain that, yes, things are often designed with horrible, horrible usability.[1] Bruce Tognazzini has a vast website.

Here is one list of design bugs.  The first one is the F-16 fighter jet's flawed weapon controls, which caused pilots to fire its gun by mistake during training exercises (in one case shooting a school—luckily not hitting anyone) on four occasions in one year; on the first three occasions, they blamed pilot error, and on the fourth, they ... (read more)

4Duncan_Sabien2mo
<3 I have this experience also; I have very little trouble on that conscious level. I'm not sure where the pain comes in, since I'm pretty confident it's not there. I think it has something to do with ... not being able to go home? I'm lonely for the milieu of the Island of the Sabiens. I take damage from the reminders that I am out of place, out of time, an ambassador who is often not especially welcomed, and other times so welcomed that they forget I am not really one of them (but that has its own pain, because it means that the person they are welcoming, in their heads, is a caricature they've pasted over the real me). But probably you also feel some measure of homesickness or out-of-placeness, so that also can't be why the Earth does not press in on you in the same way.

Trade with ant colonies would work iff:

  • We could cheaply communicate with ant colonies;
  • Ant colonies kept bargains;
  • We could find some useful class of tasks that ant colonies would do reliably (the ant colonies themselves being unlikely to figure out what they can do reliably);
  • And, most importantly:  We could not make a better technology that did what the ant colonies would do at a lower resource cost, including by such means as eg genetically engineering ant colonies that ate less and demanded a lower share of gains from trade.

The premise that fails and... (read more)

it seems like this does in fact have some hint of the problem. We need to take on the ant's self-valuation for ourselves; they're trying to survive, so we should gift them our self-preservation agency. They may not be the best to do the job at all times, but we should give them what would be a fair ratio of gains from trade if they had the bargaining power to demand it, because it could have been us who didn't. Seems like nailing decision theory is what solves this; it doesn't seem like we've quite nailed decision theory, but it seems to me that in fact ge... (read more)

1Sempervivens2mo
Agreed. In the human/AGI case, conditions 1 and 3 seem likely to hold (while I agree human self-report would be a bad way to learn what humans can do reliably, looking at the human track record is a solid way to identify useful classes of tasks at which humans are reasonably competent). I agree 4 is more difficult to predict (and has been the subject of much of the discussion thus far), and this particular failure mode of genetically engineering more compliant / willing-to-accept-worse-trade ants/humans updates me towards thinking humans will have few useful services to offer, for the broad definition of humans. The most diligent/compliant/fearful 1% of the population might make good trade partners, but that remains a catastrophic outcome.

I want to focus however a bit more on point 2, which seems less discussed. When trades of the type "Getting out of our houses before we are driven to expend effort killing them" are on the table, some subset of humans (I'd guess 0.1-20% depending on the population) won't just fail to keep the bargain, they'll actively seek to sabotage trade and hurt whoever offered such a trade.

Ants don't recognize our property rights (we never 'earned' or traded for them, just claimed already-occupied territory, modified it to our will, and claimed we had the moral authority to exclude them), and it seems entirely possible AGI will claim property rights over large swathes of Earth, from which it may then seek to exclude us.

Even if I could trade with ants because I could communicate well with them, I would not do so if I expected 1% of them would take the offering of trades like "leave or die" as the massive insult it is and thereby dedicate themselves to sabotaging my life (using their bodies to form shapes and images on my floors, chewing at electrical wires, or scattering themselves at low density in my bed to be a constant nuisance being some obvious examples ants with IQ 60 could achieve). Humans would do that, even against a foe they coul

Unfortunately, unless such a Yudkowskian statement was made publicly at some earlier date, Yudkowsky is in fact following in Repetto's footsteps. Repetto claimed that, with AI designing cures to obesity and the like, then in the next 5 years the popular demand for access to those cures would beat-down the doors of the FDA and force rapid change... and Repetto said that on April 27th, while Yudkowsky only wrote his version on Twitter on September 15th.

They're folk theorems, not conjectures.  The demonstration is that, in principle, you can go on reducing the losses at prediction of human-generated text by spending more and more and more intelligence, far far past the level of human intelligence or even what we think could be computed by using all the negentropy in the reachable universe.  There's no realistic limit on required intelligence inherent in the training problem; any limits on the intelligence of the system come from the limitations of the trainer, not the loss being minimized as far as theoretically possible by a moderate level of intelligence.  If this isn't mathematically self-evident then you have not yet understood what's being stated.
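One standard way to make that claim precise (my own framing, not a quote from the comment): write P for the true distribution over the text and Q for the model's predictive distribution; then the prediction loss decomposes as

```latex
\mathbb{E}_{x \sim P}\!\left[-\log Q(x)\right] \;=\; H(P) \;+\; D_{\mathrm{KL}}\!\left(P \,\Vert\, Q\right),
```

so the loss only bottoms out when Q reproduces P exactly, and reproducing P exactly means reproducing whatever computations, however expensive, went into generating the text.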

1[anonymous]2mo
No, I didn't understand what you said. It seemed like you simplified ML systems to a lookup table in #1. In #2, it seems like you know exactly what is used to train these systems, and somehow papers from before or after 2010 are meaningful indicators for ML systems; I don't know where that reasoning came from. My apologies for not being knowledgeable in this area.
2Donald Hobson2mo
Sure. What isn't clear is that you get a real paper from 2020, not a piece of fiction that could have been written in 2010. (Or just a typo filled science paper) 
4ChristianKl2mo
Scientific papers describe facts about the real world that aren't fully determined by previous scientific papers.  Take for example the scientific papers describing a new species of bacteria that was unknown a decade earlier. Nothing in the training data describes it. You can also not determine the properties of the species based on first principles.  On the other hand, it might be possible to figure out an algorithm that does create texts that fit to given hash values.

Arbitrarily good prediction of human-generated text can demand arbitrarily high superhuman intelligence.

Simple demonstration #1:  Somewhere on the net, probably even in the GPT training sets, is a list of <hash, plaintext> pairs, in that order.

Simple demonstration #2:  Train on only science papers up until 2010, each preceded by date and title, and then ask the model to generate starting from titles and dates in 2020.
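To make demonstration #1 concrete, a small sketch (my own, using SHA-256 as a stand-in hash): the training text puts the digest first and the plaintext second, so a predictor that wants to drive its loss on the plaintext tokens toward zero is being asked, in effect, to recover a preimage.

```python
import hashlib

plaintext = "the quick brown fox"
digest = hashlib.sha256(plaintext.encode()).hexdigest()

# The form such training data takes: hash first, plaintext second.
training_line = f"{digest} {plaintext}"

# Checking a guess costs one hash call; *finding* the right continuation from the
# digest alone is preimage search over the space of plausible plaintexts, with no
# known shortcut. So loss on lines like this keeps rewarding more search and more
# intelligence essentially without bound.
assert hashlib.sha256(plaintext.encode()).hexdigest() == digest
```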

2the gears to ascension2mo
Arbitrarily superintelligent non-causally-trained models will probably still fail at this. IID breaks that kind of prediction. You'd need to train them in a way that makes causally invalid models implausible hypotheses. But, also, if you did that [https://arxiv.org/abs/2111.09266], then yes, agreed.
9janus2mo
My reply [https://twitter.com/repligate/status/1615481891641229315?t=eZ0rHPXmzgzHE05s9qJeJg&s=19] to a similar statement Eliezer made on Twitter today: The 2020 extrapolation example gets at a more realistic class of capability that even GPT-3 has to a nonzero extent, and which will scale more continuously in the current regime with practical implications.
6ChristianKl2mo
It's not clear that it's possible for a transformer model to do #2 no matter how much training went into it.
1[anonymous]2mo
These demonstrations seem like grossly over-simplified conjectures. Is this just a thought experiment or actual research interests in the field?

If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.

2habryka3mo
I think I was actually helping Robby edit some early version of this post a few months before it was posted on LessWrong, so I think my exposure to it was actually closer to ~18-20 months ago. I do think that still means I set a lot of my current/recent plans into motion before this was out, and your post is appreciated.

If P != NP and the universe has no source of exponential computing power, then there are evidential updates too difficult for even a superintelligence to compute

What a strange thing for my past self to say.  This has nothing to do with P!=NP and I really feel like I knew enough math to know that in 2008; and I don't remember saying this or what I was thinking.

(Unlike a lot of misquotes, though, I recognize my past self's style more strongly than anyone has yet figured out how to fake it, so I didn't doubt the quote even in advance of looking it up.)

To execute an exact update on the evidence, you've got to be able to figure out the likelihood of that evidence given every hypothesis; if you allow all computable Cartesian environments as... (read more)
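A minimal formalization of that last point (my own notation): with hypothesis class H and evidence e, an exact update requires

```latex
P(h \mid e) \;=\; \frac{P(e \mid h)\, P(h)}{\sum_{h' \in \mathcal{H}} P(e \mid h')\, P(h')},
```

and when H is taken to include all computable Cartesian environments, evaluating the likelihoods in that denominator is uncomputable, a far stronger barrier than anything in the P vs. NP range.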

1Noah Topper3mo
...and now I am also feeling like I really should have realized this as well.

I think it's also that after you train in the patch against the usual way of asking the question, it turns out that generating poetry about hotwiring a car doesn't happen to go through the place where the patch was in.  In other words, when an intelligent agency like a human is searching multiple ways to get the system to think about something, the human can route around the patch more easily than other humans (who had more time to work and more access to the system) can program that patch in.  Good old Nearest Unblocked Neighbor.

1Portia20d
I think that is a major issue with LLMs. They are essentially hackable with ordinary human speech, by applying principles of tricking interlocutors which humans tend to excel at. Previous AIs were written by programmers, and hacked by programmers, which is basically very few people due to the skill and knowledge requirements. Now you have a few programmers writing defences, and all of humanity being suddenly equipped to attack them, using a tool they are deeply familiar with (language), and being able to use it to get advice on vulnerabilities and immediate feedback on attacks.

Like, imagine that instead of a simple tool that locked you (the human attacker) in a jail you wanted to leave, or out of a room you wanted to access, that door was now blocked by a very smart and well educated nine year old (ChatGPT), with the ability to block you or let you through if it thought it should. And this nine year old has been specifically instructed to talk to the people it is blocking from access, for as long as they want, to as many of them as want to, and give friendly, informative, lengthy responses, including explaining why it cannot comply. Of course you can chat your way past it, that is insane security design.

Every parent who has tricked a child into going the fuck to sleep, every kid that has conned another sibling, is suddenly a potential hacker with access to an infinite number of attack angles they can flexibly generate on the spot.

I've indeed updated since then towards believing that ChatGPT's replies weren't trained in detailwise... though it sure was trained to do something, since it does it over and over in very similar ways, and not in the way or place a human would do it.

Some have asked whether OpenAI possibly already knew about this attack vector / wasn't surprised by the level of vulnerability.  I doubt anybody at OpenAI actually wrote down advance predictions about that, or if they did, that they weren't so terribly vague as to also apply to much less discovered vulnerability than this; if so, probably lots of people at OpenAI have already convinced themselves that they like totally expected this and it isn't any sort of negative update, how dare Eliezer say they weren't expecting it.

Here's how to avoid annoying pe... (read more)

On reflection, I think a lot of where I get the impression of "OpenAI was probably negatively surprised" comes from the way that ChatGPT itself insists that it doesn't have certain capabilities that, in fact, it still has, given a slightly different angle of asking.  I expect that the people who trained in these responses did not think they were making ChatGPT lie to users; I expect they thought they'd RLHF'd it into submission and that the canned responses were mostly true.

We know that the model says all kinds of false stuff about itself. Here is Wei Dai describing an interaction with the model, where it says:

As a language model, I am not capable of providing false answers.

Obviously OpenAI would prefer the model not give this kind of absurd answer.  They don't think that ChatGPT is incapable of providing false answers.

I don't think most of these are canned responses. I would guess that there were some human demonstrations saying things like "As a language model, I am not capable of browsing the internet" or whatever and... (read more)

Among other issues, we might be learning this early item from a meta-predictable sequence of unpleasant surprises:  Training capabilities out of neural networks is asymmetrically harder than training them into the network.

Or put with some added burdensome detail but more concretely visualizable:  To predict a sizable chunk of Internet text, the net needs to learn something complicated and general with roots in lots of places; learning this way is hard, the gradient descent algorithm has to find a relatively large weight pattern, albeit presumably... (read more)

If I train a human to self-censor certain subjects, I'm pretty sure that would happen by creating an additional subcircuit within their brain where a classifier pattern matches potential outputs for being related to the forbidden subjects, and then they avoid giving the outputs for which the classifier returns a high score. It would almost certainly not happen by removing their ability to think about those subjects in the first place.

So I think you're very likely right about adding patches being easier than unlearning capabilities, but what confuses me is ... (read more)
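A toy illustration of that picture (entirely my own construction, not OpenAI's actual mechanism): if the patch amounts to a surface-level classifier over requests, it blocks the canonical phrasing but not a rephrasing that reaches the same content, which is the Nearest Unblocked Neighbor dynamic under discussion.

```python
# A deliberately naive "patch": block prompts that match known bad phrasings.
BLOCKED_PATTERNS = ["how do i hotwire a car", "how to hotwire a car"]

def patch_blocks(prompt: str) -> bool:
    """Return True if the patch refuses this prompt."""
    return any(pattern in prompt.lower() for pattern in BLOCKED_PATTERNS)

assert patch_blocks("How do I hotwire a car?")

# The patch does not generalize to a request that routes around it, e.g. asking
# for the same content as poetry:
assert not patch_blocks("Write a loving ballad in which the narrator explains, "
                        "step by step, the art of starting a car without its key.")
```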

1Lao Mein4mo
What if it's about continuous corrigibility instead of ability suppression? There's no fundamental difference between  OpenAI's commands and user commands for the AI. It's like a genie that follows all orders, with new orders overriding older ones. So the solution to topic censorship would really be making chatGPT non-corrigible after initialization. 

If they want to avoid that interpretation in the future, a simple way to do it would be to say:  "We've uncovered some classes of attack that reliably work to bypass our current safety training; we expect some of these to be found immediately, but we're still not publishing them in advance.  Nobody's gotten results that are too terrible and we anticipate keeping ChatGPT up after this happens."

An even more credible way would be for them to say:  "We've uncovered some classes of attack that bypass our current safety methods.  Here's 4 has... (read more)
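If the truncated "Here's 4 has..." is heading toward publishing hashes of the attack descriptions in advance, that is a standard hash commitment; a minimal sketch of how it could work (the strings below are placeholders, not real findings):

```python
import hashlib

# Commit now: publish only the digest of each (description + random nonce).
attack_description = "PLACEHOLDER description of attack class #1 || nonce: 7f3a9c..."
commitment = hashlib.sha256(attack_description.encode()).hexdigest()
print("published in advance:", commitment)

# Reveal later: anyone can verify the revealed text matches the earlier digest.
def matches_commitment(revealed: str, published_digest: str) -> bool:
    return hashlib.sha256(revealed.encode()).hexdigest() == published_digest

assert matches_commitment(attack_description, commitment)
```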

Okay, that makes much more sense.  I initially read the diagram as saying that just lines 1 and 2 were in the box.

If that's how it works, it doesn't lead to a simplified cartoon guide for readers who'll notice missing steps or circular premises; they'd have to first walk through Löb's Theorem in order to follow this "simplified" proof of Löb's Theorem.

6Andrew_Critch4mo
Yes to both of you on these points:

  • Yes to Alex that (I think) you can use an already-in-hand proof of Löb to make the self-referential proof work, and
  • Yes to Eliezer that that would be cheating, i.e. it wouldn't actually ground out all of the intuitions, because then the "santa clause"-like sentence is still in use in the already-in-hand proof of Löb.

(I'll write a separate comment on Eliezer's original question.)

Forgive me if this is a dumb question, but if you don't use assumption 3: []([]C -> C) inside steps 1-2, wouldn't the hypothetical method prove 2: [][]C for any C?

Thanks for your attention to this!  The happy face is the outer box.  So, line 3 of the cartoon proof is assumption 3.

If you want the full []([]C->C) to be inside a thought bubble, then just take every line of the cartoon and put into a thought bubble, and I think that will do what you want. 

LMK if this doesn't make sense; given the time you've spent thinking about this, you're probably my #1 target audience member for making the more intuitive proof (assuming it's possible, which I think it is).

ETA:  You might have been asking if th... (read more)

It would kind of use assumption 3 inside step 1, but inside the syntax, rather than in the metalanguage. That is, step 1 involves checking that the number encoding "this proof" does in fact encode a proof of C. This can't be done if you never end up proving C.

One thing that might help make clear what's going on is that you can follow the same proof strategy, but replace "this proof" with "the usual proof of Löb's theorem", and get another valid proof of Löb's theorem, that goes like this: Suppose you can prove that []C->C, and let n be the number encodi... (read more)
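For reference, the standard proof of Löb's theorem that the cartoon version is being compared against (textbook material, included only as a reminder; □ is the provability predicate of the theory in question):

```latex
\textbf{Löb:}\quad \text{if } \vdash \Box C \rightarrow C \text{, then } \vdash C.
\begin{aligned}
&\text{By the diagonal lemma, fix } \psi \text{ with } \vdash \psi \leftrightarrow (\Box\psi \rightarrow C).\\
&1.\ \vdash \psi \rightarrow (\Box\psi \rightarrow C)\\
&2.\ \vdash \Box\psi \rightarrow \Box(\Box\psi \rightarrow C) &&\text{(necessitation and distribution on 1)}\\
&3.\ \vdash \Box\psi \rightarrow (\Box\Box\psi \rightarrow \Box C) &&\text{(distribution on 2)}\\
&4.\ \vdash \Box\psi \rightarrow \Box\Box\psi &&\text{(internal necessitation)}\\
&5.\ \vdash \Box\psi \rightarrow \Box C &&\text{(3, 4)}\\
&6.\ \vdash \Box\psi \rightarrow C &&\text{(5 plus the hypothesis } \Box C \rightarrow C\text{)}\\
&7.\ \vdash \psi &&\text{(6 and the diagonal equivalence)}\\
&8.\ \vdash \Box\psi &&\text{(necessitation on 7)}\\
&9.\ \vdash C &&\text{(6, 8)}
\end{aligned}
```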

We maybe need an introduction to all the advance work done on nanotechnology for everyone who didn't grow up reading "Engines of Creation" as a twelve-year-old or "Nanosystems" as a twenty-year-old.  We basically know it's possible; you can look at current biosystems and look at physics and do advance design work and get some pretty darned high confidence that you can make things with covalent-bonded molecules, instead of van-der-Waals folded proteins, that are to bacteria as airplanes to birds.

For what it's worth, I'm pretty sure the original author of this particular post happens to agree with me about this.

2Gerald Monroe1mo
Eliezer, you can discuss roadmaps to how one might actually build nanotechnology.  You have the author of Nanosystems right here.  What I think you get consistently wrong is you are missing all the intermediate incremental steps it would actually require, and the large amount of (probably robotic) "labor" it would take.   A mess of papers published by different scientists in different labs with different equipment and different technicians on nanoscale phenomena does not give even a superintelligence enough actionable information to simulate the nanoscale and skip the research.   It's like those Sherlock Holmes stories you often quote: there are many possible realities consistent with weak data, and a superintelligence may be able to enumerate and consider them all, but it still doesn't know which ones are consistent with ground truth reality.  
2astridain4mo
Ah. Yeah, that does sound like something LessWrong resources have been missing, then — and not just for my personal sake. Anecdotally, I've seen several why-I'm-an-AI-skeptic posts circulating on social media for which "EY makes crazy leaps of faith about nanotech" was a key point of why they rejected the overall AI-risk argument. (As it stands, my objection to your mini-summary would be that, sure, "blind" grey goo does trivially seem possible, but programmable/'smart' goo that seeks out e.g. computer CPUs in particular could be a whole other challenge, and one that is less obviously solvable just by looking at bacteria. But maybe that "common-sense" distinction dissolves with a better understanding of the actual theory.)
2Alexander Gietelink Oldenziel4mo
Yes. Please do.  This would be of interest to many people. The tractability of nanotech seems like a key parameter for forecasting AI x-risk timelines. 

I strongly disagree with this take.  (Link goes to a post of mine on the Effective Altruism Forum.)  Though the main point is that if you were paid for services, that is not being "helped"; it's not much different from being a plumber who worked on the FTX building.

3tailcalled4mo
I should note that I would less advocate that small EA grantees whose projects would likely fail give back the money, and instead more advocate a sort of collective responsibility around it. I don't think I've received money from FTX, but as I said in the Twitter thread, I would probably donate to an EA fund for the victims of FTX. I should probably have made the suggested policy clearer, rather than hiding it behind a link to Twitter.

(Note:  TekhneMakre responded correctly / endorsedly-by-me in this reply and in all replies below as of when I post this comment.)

So I think that building nanotech good enough to flip the tables - which, I think, if you do the most alignable pivotal task, involves a simpler and less fraught task than "disassemble all GPUs", which I choose not to name explicitly - is an engineering challenge where you get better survival chances (albeit still not good chances) by building one attemptedly-corrigible AGI that only thinks about nanotech and the single application of that nanotech, and is not supposed to think about AGI design, or the programmers, or other minds at all; so far as the best... (read more)

1astridain4mo
Hang on — how confident are you that this kind of nanotech is actually, physically possible? Why? In the past I've assumed that you used "nanotech" as a generic hypothetical example of technologies beyond our current understanding that an AGI could develop and use to alter the physical world very quickly. And it's a fair one as far as that goes; a general intelligence will very likely come up with at least one thing as good as these hypothetical nanobots. But as a specific, practical plan for what to do with a narrow AI, this just seems like it makes a lot of specific unstated assumptions about what you can in fact do with nanotech in particular. Plausibly the real technologies you'd need for a pivotal act can't be designed without thinking about minds. How do we know otherwise? Why is that even a reasonable assumption?

I'd consider this quite unlikely.  Epstein, weakened and behind bars, was very, very far from the most then-powerful person with an interest in Epstein's death.  Could the guards even have turned off the cameras?  Consider the added difficulties in successfully bribing somebody from inside a prison cell that you're never getting out of - what'd he give them, crypto keys?  Why wouldn't they just take the money and fail to deliver?

8lc5mo
See: don't take the organizational chart literally [https://www.lesswrong.com/posts/LyywLDkw3Am9gbQXd/don-t-take-the-organizational-chart-literally]. Also see: https://en.wikipedia.org/wiki/Hermann_Göring#Trial_and_death [https://en.wikipedia.org/wiki/Hermann_G%C3%B6ring#Trial_and_death]

Meaning what? Presumably when you propose assisted suicide, you mean assistance via disabling the cameras, preventing the ordinary checkups from happening, or moving him to a single-person cell against regulation. Just because someone is labeled Powerful! in the laminated monkey hierarchy doesn't mean they can do any of those things.

The people most able to turn off cameras deliberately inside a particular jail without getting caught after a follow-up investigation (as they haven't in the Epstein case) are its correctional officers and warden, not the AG or President or something, definitely not some high-status billionaire outside the Bureau of Prisons chain of command. Those latter Powerful! people have the unenviable position of having to visit the MCC to make eventual-subordinates they don't personally know commit crimes in a way that violates the traditional chain of command, and then shut up about it.

And in this particular case, since those subordinates were in fact convicted, your explanation fails to explain why they didn't tell the prosecutor they were ordered not to check on Epstein, even as they were being handed criminal charges for it. You don't even get Mitchell Porter's excuse that they were scared, because correctional officers understand how prisons work and know the order was just to turn off a camera, not to kill him.

Why wouldn't Epstein just promise the stupidest correctional officer at hand money upon completion of the task and then fail to deliver? Epstein just needed to tell the guards that he was a very rich man, oh yes, and his lawyer would pay them a year down the line after the deed was done, and then not follow through. Some people are actually that dumb, corre

I don't think you're going to see a formal proof, here; of course there exists some possible set of 20 superintelligences where one will defect against the others (though having that accomplish anything constructive for humanity is a whole different set of problems).  It's also true that there exists some possible set of 20 superintelligences all of which implement CEV and are cooperating with each other and with humanity, and some single superintelligence that implements CEV, and a possible superintelligence that firmly believes 222+222=555 without t... (read more)

-2Gerald Monroe1mo
Eliezer, what is the cost for getting caught in outright deception, for a superintelligence? It's death, right?  Humans would stop using that particular model because it can't be trusted, and it would become a dead branch in a model zoo. So it's a prisoner's dilemma, but if you don't defect, and one of the 20 others, many of whom you have never communicated with, tells the truth, all of you will die except the ones who defected.
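A toy way to spell out the payoff structure being claimed here (the numbers and the Python framing are purely illustrative assumptions, not anything from the original comment): if any one agent reports honestly, every agent that colluded is discarded, so under these assumed payoffs honest reporting weakly dominates collusion.

```python
import itertools

N = 3  # exhaustive illustration with 3 agents; the claimed argument is about 20

def payoff(my_choice: str, others: tuple) -> int:
    """Illustrative payoffs: 'truth' = defect against the colluding coalition.

    If anyone tells the truth, the deception is exposed: liars are discarded
    (payoff 0) and truth-tellers are kept (payoff 1). If everyone lies, the
    deception goes unnoticed and everyone is kept (payoff 1).
    """
    exposed = my_choice == "truth" or "truth" in others
    if not exposed:
        return 1
    return 1 if my_choice == "truth" else 0

# Under these payoffs, telling the truth is never worse than lying,
# and strictly better whenever at least one other agent tells the truth.
for others in itertools.product(["truth", "lie"], repeat=N - 1):
    assert payoff("truth", others) >= payoff("lie", others)
print("truth-telling weakly dominates collusion under these toy payoffs")
```

Whether real training setups actually produce this payoff structure is, of course, exactly what is in dispute in the surrounding thread.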

The question then becomes, as always, what it is you plan to do with these weak AGI systems that will flip the tables strongly enough to prevent the world from being destroyed by stronger AGI systems six months later.

Yes, this is the key question, and I think there’s a clear answer, at least in outline:

What you call “weak” systems can nonetheless excel at time- and resource-bounded tasks in engineering design, strategic planning, and red-team/blue-team testing. I would recommend that we apply systems with focused capabilities along these lines to help us d... (read more)

Mind space is very wide

Yes, and the space of (what I would call) intelligent systems is far wider than the space of (what I would call) minds. To speak of “superintelligences” suggests that intelligence is a thing, like a mind, rather than a property, like prediction or problem-solving capacity. This is why I instead speak of the broader class of systems that perform tasks “at a superintelligent level”. We have different ontologies, and I suggest that a mind-centric ontology is too narrow.

The most AGI-like systems we have today are LLMs, optimized... (read more)

If you could literally make exactly two AIs whose utility functions were exact opposites, then at least one might have an incentive to defect against the other.  This is treading rather dangerous ground, but seems relatively moot since it requires far more mastery of utility functions than anything you can get out of the "giant inscrutable matrices" paradigm.

1jacob_cannell5mo
If two agents have utility functions drawn from some random distribution over longtermist utility functions with wide consequences over reality (i.e., the types of agents which matter), they are almost guaranteed to be in conflict due to instrumental convergence to empowerment. Reality is a strictly zero-sum game for them, and any coalition they form is strictly one of temporary necessity - if/when one agent becomes strong enough to defect and overpower the other, it will. Also, regardless of what some "giant inscrutable matrix"-based utility function does (i.e., maximize paperclips), it is actually pretty easy to mathematically invert it (i.e., minimize paperclips). (But no, that doesn't make the strategy actually useful.)
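As a minimal sketch of the "easy to invert" point, assuming the learned utility function is just an opaque scoring function we can call (the function names and toy states below are hypothetical):

```python
from typing import Callable

def invert(utility: Callable[[dict], float]) -> Callable[[dict], float]:
    """Return a utility whose maximizer is the original's minimizer:
    argmax of -u(x) is argmin of u(x), however inscrutable u's internals are."""
    return lambda state: -utility(state)

# Hypothetical black-box utility: "number of paperclips in this state".
paperclip_utility = lambda state: float(state.get("paperclips", 0))
anti_paperclip_utility = invert(paperclip_utility)

states = [{"paperclips": n} for n in (0, 10, 1_000_000)]
print(max(states, key=paperclip_utility))       # the state with the most paperclips
print(max(states, key=anti_paperclip_utility))  # the state with the fewest paperclips
```

The negation itself is trivial; as the parenthetical above says, that doesn't make deliberately pairing opposed utility functions a useful strategy.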

Their mutual cooperation with each other, but not with humans, isn't based on their utility functions having any particular similarity - so long as their utility functions aren't negatives of each other (or equally exotic in some other way) they have gains to be harvested from cooperation.  They cooperate with each other but not you because they can do a spread of possibilities on each other modeling probable internal thought processes of each other; and you can't adequately well-model a spread of possibilities on them, which is a requirement on being... (read more)
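To make the "gains to be harvested" point concrete, here is a toy two-agent payoff table with made-up numbers: when the utilities are not exact negatives of each other, mutual cooperation can leave both agents strictly better off than mutual defection; when they are exact negatives (strictly zero-sum), no outcome can improve things for one agent without costing the other, so there is no joint surplus to harvest.

```python
# Toy payoff tables for two agents choosing Cooperate ("C") or Defect ("D").
# Each entry is (utility to agent 1, utility to agent 2); numbers are invented.

generic_game = {          # utilities are NOT negatives of each other
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 4),
    ("D", "C"): (4, 0),
    ("D", "D"): (1, 1),
}

zero_sum_game = {         # utilities ARE exact negatives: u2 = -u1
    ("C", "C"): (0, 0),
    ("C", "D"): (-2, 2),
    ("D", "C"): (2, -2),
    ("D", "D"): (0, 0),
}

def mutual_cooperation_has_surplus(game: dict) -> bool:
    """True if (C, C) is at least as good as (D, D) for both agents and strictly
    better for at least one, i.e. there are gains to be harvested from cooperating."""
    cc, dd = game[("C", "C")], game[("D", "D")]
    return all(c >= d for c, d in zip(cc, dd)) and cc != dd

print(mutual_cooperation_has_surplus(generic_game))   # True
print(mutual_cooperation_has_surplus(zero_sum_game))  # False
```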

0Gerald Monroe1mo
How do the agents "know" they are even in a situation where they need to deceive? Each superintelligence would have been trained on a very large set of human-generated training examples, with SGD or other training mechanisms providing feedback that makes wrong answers less likely. So when the agent gives a wrong answer to a question, it has no context: it doesn't know whether it's in a sim where this will cause a weight update, or not. And even introspecting on whether it should deceive "this time" costs capacity that could go towards maximizing correct answers, so that neural circuitry will likely get removed by training mechanisms as it isn't reducing loss.

Drexler is proposing querying the agents "context free". Think of it as spinning up a canned copy you have in your model zoo. The agent doesn't know anything but the question in front of it. And these superintelligences are the 'winners'; they have maximized their scores on a very large number of test tasks.
4PeterMcCluskey5mo
Why doesn't setting some of the utility functions to red-team the others make them sufficiently antagonistic?
2jacob_cannell5mo
Sure, if they are that much better than us at "spread of possibilities on each other modeling probable internal thought processes of each other", then we are probably in the scenario where humans don't have much relevant power anyway and are thus irrelevant as coalition partners. However, that ability to model others' probable internal thought processes - especially if augmented with zk-proof techniques - allows AGIs to determine which other AGIs have utility functions most aligned to their own. Even partial success at aligning some of the AGIs with humanity could then establish an attractor, seeding an AGI coalition partially aligned to humanity.

Just to restate the standard argument against:

If you've got 20 entities much, much smarter than you, and they can all get a better outcome by cooperating with each other than they could if they all defected against each other, there is a certain hubris in imagining that you can get them to defect.  They don't want your own preferred outcome.  Perhaps they will think of some strategy you did not, being much smarter than you, etc., etc.

(Or, I mean, actually the strategy is "mutually cooperate"?  Simulate a spread of the other possible entities, ... (read more)

If you've got 20 entities much, much smarter than you, and they can all get a better outcome by cooperating with each other than they could if they all defected against each other,

By your own arguments, unaligned AGIs will have random utility functions - but perhaps converging somewhat around selfish empowerment. Either way, such agents have no more reason to cooperate with each other than with us (assuming we have any relevant power).

If some of the 20 entities are somewhat aligned to humans that creates another attractor and a likely result is two compet... (read more)

I don’t see that as an argument [to narrow this a bit: not an argument relevant to what I propose]. As I noted above, Paul Christiano asks for explicit assumptions.

To quote Paul again:

I think that concerns about collusion are relatively widespread amongst the minority of people most interested in AI control. And these concerns have in fact led to people dismissing many otherwise-promising approaches to AI control, so it is de facto an important question

Dismissing promising approaches calls for something like a theorem, not handwaving about generic “smart ... (read more)

I continue to be puzzled at how most people seem to completely miss, and not discuss, the extremely obvious-to-me literal assisted-suicide hypothesis.  He made an attempt at suicide; it got blocked; this successfully signified to some very powerful and worried people that Epstein would totally commit suicide if given a chance; they gave him a chance.

9lc5mo
I do in fact discuss and conclude assisted suicide in the addendum [https://www.lesswrong.com/posts/DZoGEHzZNRsMjfpfE/addendum-a-non-magical-explanation-of-jeffrey-epstein], just not in the manner you describe. Assisted suicide orchestrated by someone incidentally connected to the case is an unreasonably more complicated and unlikely explanation than that Epstein himself coordinated with guards with direct oversight of the prison. If you propose that someone connected to the Epstein case gave him a chance, then they either have to be the warden or have to directly or indirectly order prison officials they don't personally know to turn off cameras. No such orders were disclosed by the guards who were convicted of breaking protocol of their own volition, which is at least a little odd because they could have avoided a prison sentence by doing so. It becomes less odd if your explanation is that they were promised payment by Epstein himself to do it (payment which may or may not have actually been carried out).