All of Eliezer Yudkowsky's Comments + Replies

At the superintelligent level there's not a binary difference between those two clusters.  You just compute each thing you need to know efficiently.

O O · 5d · +1
I’m confused as to why there necessarily wouldn’t be a difference between the two. I can think of classes of problems with no solutions better than brute force or simulation, e.g. Bitcoin mining, and plenty of these explode enough in complexity that they are infeasible given the resources of the universe for certain input sizes. AlphaFold2 also did not fully solve the protein folding problem. Rather, it still has a significant error rate, as reported by Google themselves, with regions of low confidence, and the error rate also goes up significantly on certain classes of inputs such as longer loop structures. Further, it was possibly not solved to the extent of what people in 2008 were arguing against, given we are nowhere near predicting how proteins fold in interactions. I’m unsure whether the predicted value of solving protein folding back then is equal to the predicted value of the extent to which AlphaFold solves it.

My next conclusion is mostly an inference, but broadly it seems like DL models hit a sigmoid curve when it comes to eking out the final percentages of accuracy, and this can also be seen in self-driving cars and LLMs. This makes sense given the world largely follows predictable laws with a small but frequent number of exceptions. In a one-shot experiment, this uncertainty would accumulate exponentially, making it seem unlikely that it would succeed. I think clearly an AI can reduce the number of experiments needed, but one-shot seems like too high of a bar to clear through generalizations alone, due to compounding errors, unless it uses within 1 OOM of the compute the universe itself uses to “simulate” results as you experiment.
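The compounding-errors point can be made concrete with a toy calculation (my own illustration with made-up numbers, not figures from the comment): if each of n serial steps in a one-shot plan independently succeeds with probability p, the whole plan succeeds with probability p^n, which collapses quickly even for high per-step accuracy.

```python
# Toy illustration (my numbers): per-step accuracy compounds multiplicatively
# in a one-shot plan, with no opportunity to correct errors via experiment.
def one_shot_success(p_step: float, n_steps: int) -> float:
    """Probability that all n independent steps succeed."""
    return p_step ** n_steps

print(one_shot_success(0.99, 100))    # ~0.366
print(one_shot_success(0.99, 1000))   # ~4.3e-5
```

Even 99% per-step reliability leaves a one-in-23,000 chance of a 1000-step plan working first try, which is the shape of the objection being made.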

I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most of the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don't expect Earthlings to think about validly.

Lacking time right now for a long reply:  The main thrust of my reaction is that this seems like a style of thought which would have concluded in 2008 that it's incredibly unlikely for superintelligences to be able to solve the protein folding problem.  People did, in fact, claim that to me in 2008.  It furthermore seemed to me in 2008 that protein structure prediction by superintelligence was the hardest or least likely step of the pathway by which a superintelligence ends up with nanotech; and in fact I argued only that it'd be solvable fo... (read more)

Vivek Hebbar · 11d · +1
When you describe the "emailing protein sequences -> nanotech" route, are you imagining an AGI with computers on which it can run code (like simulations)?  Or do you claim that the AGI could design the protein sequences without writing simulations, by simply thinking about it "in its head"?

Well, one sink to avoid here is neutral-genie stories where the AI does what you asked, but not what you wanted.  That's something I wrote about myself, yes, but that was in the era before deep learning took over everything, when it seemed like there was a possibility that humans would be in control of the AI's preferences.  Now neutral-genie stories are a mindsink for a class of scenarios where we have no way to achieve entrance into those scenarios; we cannot make superintelligences want particular things or give them particular orders - cannot give them preferences in a way that generalizes to when they become smarter.

Blueberry · 6d · +2
I don't agree with that. Neutral-genie stories are important because they demonstrate the importance of getting your wish right. As yet, deep learning hasn't taken us to AGI, and it may never; even if it does, we may still be able to make AIs want particular things or give them particular orders or preferences. Here's a great AI fable from the Air Force: https://www.vice.com/en/article/4a33gj/ai-controlled-drone-goes-rogue-kills-human-operator-in-usaf-simulated-test

Okay, if you're not saying GPUs are getting around as efficient as the human brain, without much more efficiency to be eked out, then I straightforwardly misunderstood that part.

Nothing about any of those claims explains why the 10,000-fold redundancy of neurotransmitter molecules and ions being pumped in and out of the system is necessary for doing the alleged complicated stuff.

Further item of "these elaborate calculations seem to arrive at conclusions that can't possibly be true" - besides the brain allegedly being close to the border of thermodynamic efficiency, despite visibly using tens of thousands of redundant physical ops in terms of sheer number of ions and neurotransmitters pumped; the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible, so presumably at the Limit of the Possible themselves.

This source claims 100x energy efficiency from substituting some basic physical ana... (read more)

jacob_cannell · 1mo · +5
I'm not sure why you believe "the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible". GPUs require at least on the order of ~1e-11 J to fetch a single 8-bit value from GDDRX RAM (1e-19 J/b/nm interconnect wire energy [https://www.lesswrong.com/posts/xwBuoE9p8GE7RAuhd/brain-efficiency-much-more-than-you-wanted-to-know#Interconnect] * 1 cm * 8 b), so around ~1 kW, or 100x the brain, for 1e14 of those per second, not even including flop energy cost. (The brain doesn't have much more efficient wires; it just minimizes that entire cost by moving the memory, the synapses/weights, as close as possible to the compute, by merging them.) I do claim that Moore's Law is ending and not delivering much further increase in CMOS energy efficiency (and essentially zero increase in wire energy efficiency), but GPUs are far from the optimal use of CMOS towards running NNs.

That sounds about right, and indeed I roughly estimate the minimal energy for 8-bit analog MAC at the end of the synapse section, with 4 example refs from the research lit. The more complicated part of comparing these is how/whether to include the cost of reading/writing a synapse/weight value from RAM across a long wire, which is required for full equivalence to the brain. The brain, as a true RNN, is doing vector-matrix multiplication, whereas GPUs/accelerators instead do matrix-matrix multiplication to amortize the cost of expensive RAM fetches. VM mult can simulate MM mult at no extra cost, but MM mult can only simulate VM mult at huge inefficiency, proportional to the minimal matrix size (determined by the ALU:RAM ratio, ~1000:1 now at low precision). The full neuromorphic or PIM approach instead moves the RAM next to the processing elements, and is naturally more suited to VM mult.

1. Bavandpour, Mohammad, et al. "Mixed-Signal Neuromorphic Processors: Quo Vadis?" 2019 IEEE SOI-3D-Subthreshold M
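The fetch-energy estimate above can be reproduced directly from the numbers the comment cites (treating the 1e-19 J/bit/nm wire energy, the ~1 cm wire length, and the 1e14 fetches/s rate as the comment's assumptions):

```python
# Reproducing the parent comment's estimate from its own stated inputs.
wire_energy_j_per_bit_nm = 1e-19  # interconnect wire energy (comment's figure)
wire_length_nm = 1e7              # ~1 cm from GDDR RAM to the GPU die
bits_per_fetch = 8                # one 8-bit value

energy_per_fetch = wire_energy_j_per_bit_nm * wire_length_nm * bits_per_fetch
print(energy_per_fetch)           # 8e-12 J, i.e. on the order of ~1e-11 J

fetches_per_second = 1e14
power_watts = energy_per_fetch * fetches_per_second
print(power_watts)                # ~800 W, roughly 100x a ~10 W brain budget
```

This is just the arithmetic behind the "~1 kW" figure; whether the input numbers are right is the substantive disagreement in the thread.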

This does not explain how thousands of neurotransmitter molecules impinging on a neuron and thousands of ions flooding into and out of cell membranes, all irreversible operations, in order to transmit one spike, could possibly be within one OOM of the thermodynamic limit on efficiency for a cognitive system (running at that temperature).

See my reply here, which attempts to answer this. In short, if you accept that the synapse is doing the equivalent of all the operations involving a weight in a deep learning system (storing the weight, momentum, gradient, etc. in minimal viable precision; multiplies for forward, backward, and weight update; etc.), then the answer is a more straightforward derivation from the requirements. If you are convinced that the synapse is only doing the equivalent of a single-bit AND operation, then obviously you will reach the conclusion that it is many OOM wasteful, but t... (read more)

And it says:

So true 8-bit equivalent analog multiplication requires about 100k carriers/switches

This just seems utterly wack.  Having any physical equivalent of an analog multiplication fundamentally requires 100,000 times the thermodynamic energy to erase 1 bit?  And "analog multiplication down to two decimal places" is the operation that is purportedly being carried out almost as efficiently as physically possible by... an axon terminal with a handful of synaptic vesicles dumping 10,000 neurotransmitter molecules to flood around a dendritic ter... (read more)

Veedrac · 2mo · +8
A sanity check of a counterintuitive claim can be that the argument for the claim implies things that seem unjustifiable or false. It cannot be that the conclusion of the claim itself is unjustifiable or false, except insofar as you are willing to deny the possibility of being convinced of that claim by argument at all. (To avoid confusion, this is not in response to the latter portion of your comment about general cognition.)
DaemonicSigil · 2mo · +7
If you read carefully, Brain Efficiency does actually have some disclaimers to the effect that it's discussing the limits of irreversible computing using technology that exists or might be developed in the near future. See Jacob's comment here for examples: https://www.lesswrong.com/posts/mW7pzgthMgFu9BiFX/the-brain-is-not-close-to-thermodynamic-limits-on?commentId=y3EgjwDHysA2W3YMW

In terms of what the actual fundamental thermodynamic limits are, Jacob and I still disagree by a factor of about 50. (Basically, Jacob thinks the dissipated energy needs to be amped up in order to erase a bit with high reliability. I think that while there are some schemes where this is necessary, there are others where it is not, and high-reliability erasure is possible with an energy per bit approaching kT log 2. I'm still working through the math to check that I'm actually correct about this, though.)

And "analog multiplication down to two decimal places" is the operation that is purportedly being carried out almost as efficiently as physically possible by

I am not certain it is being carried out "almost as efficiently as physically possible". Assuming you mean thermodynamic efficiency (even accepting you meant thermodynamic efficiency only for irreversible computation), my belief is more that the brain and its synaptic elements are reasonably efficient in a Pareto tradeoff sense.

But any discussion around efficiency must make some starting assumptions abo... (read more)

I think the quoted claim is actually straightforwardly true? Or at least, it's not really surprising that actual precise 8 bit analog multiplication really does require a lot more energy than the energy required to erase one bit.

I think the real problem with the whole section is that it conflates the amount of computation required to model synaptic operation with the amount of computation each synapse actually performs.

These are actually wildly different types of things, and I think the only thing it is justifiable to conclude from this analysis is that (m... (read more)

I'm confused at how somebody ends up calculating that a brain - where each synaptic spike is transmitted by ~10,000 neurotransmitter molecules (according to a quick online check), which then get pumped back out of the membrane and taken back up by the synapse; and the impulse is then shepherded along cellular channels via thousands of ions flooding through a membrane to depolarize it and then getting pumped back out using ATP, all of which are thermodynamically irreversible operations individually - could possibly be within three orders of magnitude of max... (read more)

Eliezer Yudkowsky · 1mo · +9
Further item of "these elaborate calculations seem to arrive at conclusions that can't possibly be true" - besides the brain allegedly being close to the border of thermodynamic efficiency, despite visibly using tens of thousands of redundant physical ops in terms of sheer number of ions and neurotransmitters pumped; the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible, so presumably at the Limit of the Possible themselves.

This source claims 100x energy efficiency from substituting some basic physical analog operations for multiply-accumulate, instead of digital transistor operations about them, even if you otherwise use actual real-world physical hardware.  Sounds right to me; it would make no sense for such a vastly redundant digital computation of such a simple physical quantity to be anywhere near the borders of efficiency!  https://spectrum.ieee.org/analog-ai
Veedrac · 2mo · +7
The section you were looking for is titled ‘Synapses’: https://www.lesswrong.com/posts/xwBuoE9p8GE7RAuhd/brain-efficiency-much-more-than-you-wanted-to-know#Synapses

The first step in reducing confusion is to look at what a synaptic spike does. It is the equivalent of - in terms of computational power - an ANN 'synaptic spike', which is a memory read of a weight, a low precision MAC (multiply accumulate), and a weight memory write (various neurotransmitter plasticity mechanisms). Some synapses probably do more than this - nonlinear decoding of spike times for example, but that's a start. This is all implemented in a pretty minimal size looking device. The memory read/write is local, but it also needs to act as an a... (read more)
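In ANN terms, the per-spike work described above amounts to something like the following sketch (my own toy rendering of the read/MAC/write cycle, not code from the post; the plasticity rule here is a made-up placeholder):

```python
# Toy sketch of the per-spike operations the comment attributes to a synapse:
# a weight memory read, a low-precision multiply-accumulate, and a
# plasticity-driven weight memory write.
def synaptic_spike(weights, accumulator, i, pre_activity, lr=0.01):
    w = weights[i]                       # memory read of the stored weight
    accumulator += w * pre_activity      # low-precision MAC
    weights[i] = w + lr * pre_activity   # placeholder Hebbian-style write
    return accumulator

weights = [0.5, -0.2]
acc = synaptic_spike(weights, 0.0, 0, pre_activity=1.0)
print(acc)         # 0.5
print(weights[0])  # 0.51
```

The point of the analogy is that each spike is being credited with all three operations, not just a binary AND; whether that credit is justified is exactly what the thread disputes.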

Nobody in the US cared either, three years earlier.  That superintelligence will kill everyone on Earth is a truth, and one which has gotten easier and easier to figure out over the years.  I have not entirely written off the chance that, especially as the evidence gets more obvious, people on Earth will figure out this true fact and maybe even do something about it and survive.  I likewise am not assuming that China is incapable of ever figuring out this thing that is true.  If your opinion of Chinese intelligence is lower than mine, ... (read more)

From a high-level perspective, it is clear that this is just wrong. Part of what human brains are doing is to minimise prediction error with regard to sensory inputs

I didn't say that GPT's task is harder than any possible perspective on a form of work you could regard a human brain as trying to do; I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.

I don't see how the comparison of hardness of 'GPT task' and 'being an actual human' should technically work - to me it mostly seems like a type error. 

- The task 'predict the activation of photoreceptors in human retina' clearly has the same difficulty as 'predict next word on the internet' in the limit. (cf. Why Simulator AIs want to be Active Inference AIs)

- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina well enough to be able to function as a typical human' is clearly less diff... (read more)

If diplomacy failed, but yes, sure.  I've previously wished out loud for China to sabotage US AI projects in retaliation for chip export controls, in the hopes that if all the countries sabotage all the other countries' AI projects, maybe Earth as a whole can "uncoordinate" to not build AI even if Earth can't coordinate.

Lao Mein · 2mo · +4
Are you aware that AI safety is not considered a real issue by the Chinese intelligentsia? The limits of AI safety awareness here are surface-level discussions of Western AI Safety ideas. Not a single Chinese researcher, as far as I can recall, has actually said anything like "AI will kill us all by default if it is not aligned."

Given the chip ban, any attempts at an AI control treaty will be viewed as an attempt to prevent China from overtaking the US in terms of AI hegemony. The only conditions to an AI control treaty that Beijing will accept will also allow it to reach transformative AGI first. Which it then will, because we don't think AI safety is a real concern, the same way you don't think the Christian rapture is a real concern.

The CCP does not think like the West. Nothing says it has to take Western concerns seriously. WE DON'T BELIEVE IN AI RUIN.

Arbitrary and personal.  Given how bad things presently look, over 20% is about the level where I'm like "Yeah okay I will grab for that" and much under 20% is where I'm like "Not okay keep looking."

Choosing to engage with an unscripted unrehearsed off-the-cuff podcast intended to introduce ideas to a lay audience, continues to be a surprising concept to me.  To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.

To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.

I actually did exactly this in a previous post, Evolution is a bad analogy for AGI: inner alignment, where I quoted number 16 from A List of Lethalities:

16.  Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments.  Humans don't

... (read more)

I imagine (edit: wrongly) it was less "choosing" and more "he encountered the podcast first because it has a vastly larger audience, and had thoughts about it." 

I also doubt "just engage with X" was an available action.  The podcast transcript doesn't mention List of Lethalities, LessWrong, or the Sequences, so how is a listener supposed to find it?

I also hate it when people don't engage with the strongest form of my work, and wouldn't consider myself obligated to respond if they engaged with a weaker form (or if they engaged with the strongest o... (read more)

Here are some of my disagreements with List of Lethalities. I'll quote item one:

“Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.  This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again”

(Evolution) → (human values) is not the only case of inner alignment failure which we know a

... (read more)

The "strongest" foot I could put forwards is my response to "On current AI not being self-improving:", where I'm pretty sure you're just wrong.

You straightforwardly completely misunderstood what I was trying to say on the Bankless podcast:  I was saying that GPT-4 does not get smarter each time an instance of it is run in inference mode.

And that's that, I guess.

I'll admit it straight up did not occur to me that you could possibly be analogizing between a human's lifelong, online learning process, and a single inference run of an already trained model. Those are just completely different things in my ontology. 

Anyways, thank you for your response. I actually do think it helped clarify your perspective for me.

Edit: I have now included Yudkowsky's correction of his intent in the post, as well as an explanation of why I think his corrected argument is still wrong. 

Well, this is insanely disappointing. Yes, the OP shouldn't have directly replied to the Bankless podcast like that, but it's not like he didn't read your List of Lethalities, or your other writing on AGI risk. You really have no excuse for brushing off very thorough and honest criticism such as this, particularly the sections that talk about alignment.

And as others have noted, Eliezer Yudkowsky, of all people, complaining about a blog post being long is the height of irony.

This is coming from someone who's mostly agreed with you on AGI risk since reading the Sequences, years ago, and who's donated to MIRI, by the way.

On the bright side, this does make me (slightly) update my probability of doom downwards.

Shion Arita · 3mo · +1
This post (and the one below) quite bothers me as well. Yeah, I know you can't have the time to address everything you encounter, but you are:

- Not allowed to tell people that they don't know what they're talking about until they've read a bunch of lengthy articles, then tell someone who has done that and wrote something a fraction of the length to fuck off.
- Not allowed to publicly complain that people don't criticize you from a place of understanding without reading the attempts to do so.
- Not allowed to seriously advocate for policy that would increase the likelihood of armed conflict up to and including nuclear war if you're not willing to engage with people who give clearly genuine and high-effort discussion about why they think the policy is unnecessary.

Eliezer, in the world of AI safety, there are two separate conversations: the development of theory and observation, and whatever's hot in public conversation.

A professional AI safety researcher, hopefully, is mainly developing theory and observation.

However, we have a whole rationalist and EA community, and now a wider lay audience, who are mainly learning of and tracking these matters through the public conversation. It is the ideas and expressions of major AI safety communicators, of whom you are perhaps the most prominent, that will enter their heads. ... (read more)

the gears to ascension · 3mo · +5
dude just read the damn post at a skim level at least, lol. If you can't get through this how are you going to do... sigh. Okay, I'd really rather you read QACI posts deeply than this. But, still. It deserves at least a level 1 read [https://www.lesswrong.com/posts/sAyJsvkWxFTkovqZF/how-to-read-papers-efficiently-fast-then-slow-three-pass] rather than a "can I have a summary?" dismissal.
Vaniver · 3mo · +5
FWIW, I thought the bit about manifolds in The difficulty of alignment [https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#The_difficulty_of_alignment] was the strongest foot forward, because it paints a different detailed picture than your description that it's responding to. That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here [https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky?commentId=mATAbtCtkiKgAcn8B]), and I think you'd find it disappointing that he calls your description extremely misleading while not seeming to correctly identify the argument structure and check whether there's a related one that goes through on his model.

I think you should use a manifold market to decide on whether you should read the post, instead of the test this comment is putting forth. There's too much noise here, which isn't present in a prediction market about the outcome of your engagement.

Market here: https://manifold.markets/GarrettBaker/will-eliezer-think-there-was-a-sign

Is the overall karma for this mostly just people boosting it for visibility? Because I don't see how this would be a quality comment by any other standards.

Frontpage comment guidelines:

  • Maybe try reading the post
lc · 3mo · -8
iceman · 3mo · Ω74232

This response is enraging.

Here is someone who has attempted to grapple with the intellectual content of your ideas and your response is "This is kinda long."? I shouldn't be that surprised because, IIRC, you said something similar in response to Zack Davis' essays on the Map and Territory distinction, but that's ancillary and AI is core to your memeplex.

I have heard repeated claims that people don't engage with the alignment communities' ideas (recent example from yesterday). But here is someone who did the work. Please explain why your response here does ... (read more)

The "strongest" foot I could put forwards is my response to "On current AI not being self-improving:", where I'm pretty sure you're just wrong.

However, I'd be most interested in hearing your response to the parts of this post that are about analogies to evolution, and why they're not that informative for alignment, which start at:

Yudkowsky argues that we can't point an AI's learned cognitive faculties in any particular direction because the "hill-climbing paradigm" is incapable of meaningfully interfacing with the inner values of the intelligences it creat

... (read more)

Things are dominated when they forgo free money, and not just when money gets pumped out of them.

keith_wynroe · 3mo · +4
Want to bump this because it seems important? How do you see the agent in the post as being dominated?
keith_wynroe · 3mo · +6
How is the toy example agent sketched in the post dominated?
eapi · 3mo · +4
...wait, you were just asking for an example of an agent being "incoherent but not dominated" in those two senses of being money-pumped? And this is an exercise meant to hint that such "incoherent" agents are always dominatable?

I continue to not see the problem, because the obvious examples don't work. If I have (1 apple, $0) as incomparable to (1 banana, $0), that doesn't mean I turn down the trade of −1 apple, +1 banana, +$10000 (which I assume is what you're hinting at re. foregoing free money). If one then says "ah, but if I offer $9999 and you turn that down, then we have identified your secret equivalent utili-" no, this is just a bid/ask spread, and I'm pretty sure plenty of ink has been spilled justifying EUM agents using uncertainty to price inaction like this.

What's an example of a non-EUM agent turning down free money which doesn't just reduce to comparing against an EUM with reckless preferences / a low price of uncertainty?

Suppose I describe your attempt to refute the existence of any coherence theorems:  You point to a rock, and say that although it's not coherent, it also can't be dominated, because it has no preferences.  Is there any sense in which you think you've disproved the existence of coherence theorems, which doesn't consist of pointing to rocks, and various things that are intermediate between agents and rocks in the sense that they lack preferences about various things where you then refuse to say that they're being dominated?

keith_wynroe · 3mo · +2
This seems totally different to the point the OP is making, which is that you can in theory have things that definitely are agents, definitely do have preferences, and are incoherent (hence not EV-maximisers) whilst not "predictably shooting themselves in the foot", as you claim must follow from this. I agree the framing of "there are no coherence theorems" is a bit needlessly strong/overly provocative in a sense, but I'm unclear what your actual objection is here - are you claiming these hypothetical agents are in fact still vulnerable to money-pumping? That they are in fact not possible?
eapi · 3mo · +2
The rock doesn't seem like a useful example here. The rock is "incoherent and not dominated" if you view it as having no preferences and hence never acting out of indifference; it's "coherent and not dominated" if you view it as having a constant utility function and hence never acting out of indifference. OK, I guess the rock is just a fancy Rorschach test.

IIUC, a prototypical Slightly Complicated utility-maximizing agent is one with, say, u(apples, bananas) = min(apples, bananas), and a prototypical Slightly Complicated not-obviously-pumpable non-utility-maximizing agent is one with, say, the partial order (a1, b1) ≼ (a2, b2) ⟺ a1 ≤ a2 ∧ b1 ≤ b2, plus the path-dependent rule that EJT talks about in the post. (Ah yes, non-pumpable non-EU agents might have higher complexity! Is that relevant to the point you're making?)

What's the competitive advantage of the EU agent? If I put them both in a sandbox universe and crank up their intelligence, how does the EU agent eat the non-EU agent? How confident are you that that is what must occur?
eapi · 3mo · +3
This is pretty unsatisfying as an expansion of "incoherent yet not dominated" given that it just uses the phrase "not coherent" instead. I find money-pump arguments to be the most compelling ones since they're essentially tiny selection theorems for agents in adversarial environments, and we've got an example in the post of (the skeleton of) a proof that a lack-of-total-preferences doesn't immediately lead to you being pumped. Perhaps there's a more sophisticated argument that Actually No, You Still Get Pumped but I don't think I've seen one in the comments here yet. If there are things which cannot-be-money-pumped, and yet which are not utility-maximizers, and problems like corrigibility are almost certainly unsolvable for utility-maximizers, perhaps it's somewhat worth looking at coherent non-pumpable non-EU agents?
Eve Grey · 3mo · +1
Hey, I'm really sorry if I sound stupid, because I'm very new to all this, but I have a few questions (also, I don't know which one of all of you is right, I genuinely have no idea).

Aren't rocks inherently coherent, or rather, their parts are inherently coherent, for they align with the laws of the universe, whereas the "rock" is just some composite abstract form we came up with, as observers? Can't we think of the universe in itself as an "agent" not in the sense of it being "god", but in the sense of it having preferences and acting on them? Examples would be hot things liking to be apart and dispersion leading to coldness, or put more abstractly - one of the "preferences" of the universe is entropy. I'm sorry if I'm missing something super obvious, I failed out of university, haha!

If we let the "universe" be an agent in itself, so essentially it's a composite of all simples there are (even the ones we're not aware of), then all smaller composites by definition will adhere to the "preferences" of the "universe", because from our current understanding of science, it seems like the "preferences" (laws) of the "universe" do not change when you cut the universe in half, unless you reach quantum scales, but even then, it is my unfounded suspicion that our previous models are simply laughably wrong, instead of the universe losing homogeneity at some arbitrary scale. Of course, the "law" of the "universe" is very simple and uncomplex - it is akin to the most powerful "intelligence" or "agent" there is, but with the most "primitive" and "basic" "preferences". Also, apologies for using so many words in quotations; I do so because I am unsure if I understand their intended meaning.

It seems to me that you could say that we're all ultimately "dominated" by the "universe" itself, but in a way that's not really escapeable, but in opposite, the "universe" is also "dominated" by more complex "agents", as individuals can make sandwiches, while it'd take the "universe" muc

I want you to give me an example of something the agent actually does, under a couple of different sense inputs, given what you say are its preferences, and then I want you to gesture at that and say, "Lo, see how it is incoherent yet not dominated!"

eapi · 3mo · +2
Say more about what counts as incoherent yet not dominated? I assume "incoherent" is not being used here as an alias for "non-EU-maximizing" because then this whole discussion is circular.

If you think you've got a great capabilities insight, I think you PM me or somebody else you trust and ask if they think it's a big capabilities insight.

In the limit, you take a rock, and say, "See, the complete class theorem doesn't apply to it, because it doesn't have any preferences ordered about anything!"  What about your argument is any different from this - where is there a powerful, future-steering thing that isn't viewable as Bayesian and also isn't dominated?  Spell it out more concretely:  It has preferences ABC, two things aren't ordered, it chooses X and then Y, etc.  I can give concrete examples for my views; what exactly is a case in point of anything you're claiming about the Complete Class Theorem's supposed nonapplicability and hence nonexistence of any coherence theorems?

EJT3mo2312

In the limit

You’re pushing towards the wrong limit. A rock can be represented as indifferent between all options and hence as having complete preferences.

As I explain in the post, an agent’s preferences are incomplete if and only if they have a preferential gap between some pair of options, and an agent has a preferential gap between two options A and B if and only if they lack any strict preference between A and B and this lack of strict preference is insensitive to some sweetening or souring (such that, e.g., they strictly prefer A to A- and yet have no ... (read more)

And this avoids the Complete Class Theorem conclusion of dominated strategies, how? Spell it out with a concrete example, maybe? Again, we care about domination, not representability at all.

EJT4mo1011

And this avoids the Complete Class Theorem conclusion of dominated strategies, how?

The Complete Class Theorem assumes that the agent’s preferences are complete. If the agent’s preferences are incomplete, the theorem doesn’t apply. So, you have to try to get Completeness some other way.

You might try to get Completeness via some money-pump argument, but these arguments aren’t particularly convincing. Agents can make themselves immune to all possible money-pumps for Completeness by acting in accordance with the following policy: ‘if I previously turned down s... (read more)

Say more about behaviors associated with "incomparability"?

7cfoster04mo
Depending on the implementation details of the agent design, it may do some combination of:

* Turning down your offer, path-dependently [https://www.lesswrong.com/posts/3xF66BNSC5caZuKyC/why-subagents#Path_Dependence] preferring whichever option is already in hand [https://elischolar.library.yale.edu/cgi/viewcontent.cgi?article=2049&context=cowles-discussion-paper-series] / whichever option is consistent with its history of past trades.
* Noticing unresolved conflicts within its preference framework, possibly unresolvable without self-modifying into an agent that has different preferences from itself.
* Halting and catching fire, folding under the weight of an impossible choice.

EDIT: The post also suggests an alternative (better) policy [https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems#Summarizing_this_section] that agents with incomplete preferences may follow.

The author doesn't seem to realize that there's a difference between representation theorems and coherence theorems.

The Complete Class Theorem says that an agent’s policy of choosing actions conditional on observations is not strictly dominated by some other policy (such that the other policy does better in some set of circumstances and worse in no set of circumstances) if and only if the agent’s policy maximizes expected utility with respect to a probability distribution that assigns positive probability to each possible set of circumstances.
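The biconditional in that sentence can be transcribed symbolically (a sketch; the symbols $\pi$, $p$, $S$, and $U$ are notation introduced here, not from the post):

```latex
\pi^* \text{ is not strictly dominated}
\;\iff\;
\exists\, p \in \Delta(S) \text{ with } p(s) > 0 \text{ for all } s \in S,
\text{ such that }
\pi^* \in \operatorname*{arg\,max}_{\pi}\; \sum_{s \in S} p(s)\,\mathbb{E}\!\left[\,U \mid \pi, s\,\right]
```

where $S$ is the set of possible circumstances, $\Delta(S)$ the set of probability distributions over it, and a policy $\pi$ maps observations to actions.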

This theorem

... (read more)
3Seth Herd4mo
I don't think this goes through. If I have no preference between two things, but I do prefer to not be money-pumped, it doesn't seem like I'm going to trade those things so as to be money-pumped.

I am commenting because I think this might be a crucial crux: do smart/rational enough agents always act like maximizers? If not, adequate alignment might be much more feasible than if we need to find exactly the right goal and how to get it into our AGI exactly right.

Human preferences are actually a lot more complex. We value food very highly when hungry and water when we're thirsty. That can come out of power-seeking, but that's not actually how it's implemented. Perhaps more importantly, we might value stamp collecting really highly until we get bored with stamp collecting. I don't think these can be modeled as a maximizer of any sort.

If humans would pursue multiple goals [https://www.lesswrong.com/posts/Sf99QEqGD76Z7NBiq/are-you-stably-aligned] even if we could edit them (and were smart enough to be consistent), then a similar AGI might only need to be minimally aligned for success. That is, it might stably value human flourishing as a small part of its complex utility function. I'm not sure whether that's the case, but I think it's important.
EJT4mo1810

These arguments don't work.

  1. You've mistaken acyclicity for transitivity. The money-pump establishes only acyclicity. Representability-as-an-expected-utility-maximizer requires transitivity.

  2. As I note in the post, agents can make themselves immune to all possible money-pumps for completeness by acting in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ Acting in accordance with this policy need never require an agent to act against any of their preferences.
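The policy described in point 2 can be sketched as a toy model (the option names, the single strict preference A over A-, and the three-option money pump are illustrative assumptions, not from the post):

```python
# Toy agent with incomplete preferences following EJT's proposed policy:
# "if I previously turned down some option X, I will not choose any
# option that I strictly disprefer to X."

STRICT = {("A", "A-")}  # A is strictly preferred to A-; B is incomparable to both


def strictly_prefers(x, y):
    return (x, y) in STRICT


class Agent:
    def __init__(self, holding):
        self.holding = holding
        self.turned_down = set()

    def offer(self, new_option):
        """Consider swapping the current holding for new_option."""
        # Never trade away something strictly preferred to the offer.
        if strictly_prefers(self.holding, new_option):
            return False
        # The policy: refuse anything strictly dispreferred to an
        # option previously turned down.
        if any(strictly_prefers(x, new_option) for x in self.turned_down):
            return False
        # Otherwise the trade is permissible; the old holding counts
        # as turned down from here on.
        self.turned_down.add(self.holding)
        self.holding = new_option
        return True


agent = Agent("A")
agent.offer("B")   # A and B are incomparable, so this trade is permissible
agent.offer("A-")  # refused: A- is strictly worse than the turned-down A
assert agent.holding == "B"  # the A -> B -> A- money pump fails
```

The classic pump for Completeness routes the agent from A through an incomparable B down to the soured A-; under this policy the final step is refused, so the agent never ends up strictly worse off than where it started.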

5cfoster04mo
If I'm merely indifferent between A and B, then I will not object to trades exchanging A for B. But if A and B are incomparable for me, then I definitely may object!

I'd consider myself to have easily struck down Chollet's wack ideas about the informal meaning of no-free-lunch theorems, which Scott Aaronson also singled out as wacky.  As such, citing him as my technical opposition doesn't seem good-faith; it's putting up a straw opponent without much in the way of argument and what there is I've already stricken down.  If you want to cite him as my leading technical opposition, I'm happy enough to point to our exchange and let any sensible reader decide who held the ball there; but I would consider it intellectually dishonest to promote him as my leading opposition.

1Gerald Monroe3mo
Why didn't you mention Eric Drexler? Maybe it's my own bias as an engineer familiar with the safety solutions actually in use, but I think Drexler's CAIS model is a viable alignment solution.    
8Paradiddle4mo
I don't want to cite anyone as your 'leading technical opposition'. My point is that many people who might be described as having 'coherent technical views' would not consider your arguments for what to expect from AGI to be 'technical' at all. Perhaps you can just say what you think it means for a view to be 'technical'?

As you say, readers can decide for themselves what to think about the merits of your position on intelligence versus Chollet's (I recommend this essay by Chollet for a deeper articulation of some of his views: https://arxiv.org/pdf/1911.01547.pdf). Regardless of whether or not you think you 'easily struck down' his 'wack ideas', I think it is important for people to realise that they come from a place of expertise about the technology in question.

You mention Scott Aaronson's comments on Chollet. Aaronson says (https://scottaaronson.blog/?p=3553) of Chollet's claim that an Intelligence Explosion is impossible: "the certainty that he exudes strikes me as wholly unwarranted." I think Aaronson (and you) are right to point out that the strong claim Chollet makes is not established by the arguments in the essay. However, the same exact criticism could be levelled at you. The degree of confidence in the conclusion is not in line with the nature of the evidence.
2Noosphere894mo
While I have serious issues with Eliezer's epistemics on AI, I also agree that Chollet's argument was terrible in that the No Free Lunch theorem is essentially irrelevant. In a nutshell, this is also one of the problems I had with DragonGod's writing on AI.

used a Timeless/Updateless decision theory

Please don't say this with a straight face any more than you'd blame their acts on "Consequentialism" or "Utilitarianism".  If I thought they had any actual and correct grasp of logical decision theory, technical or intuitive, I'd let you know.  "attributed their acts to their personal version of updateless decision theory", maybe.

2Noosphere894mo
I agree they misused logical decision theories, I'm just stating what they claimed to use.
-3TAG4mo
Also, don't call things Bayesian when they are only based on informal, non-quantified reasoning.
-7Slimepriestess4mo

This is not a closed community, it is a world-readable Internet forum.

2Portia3mo
It is readable; it is however generally not read by academia and engineers. I disagree with them about why - I do think solutions can be found by thinking outside of the box and outside of immediate applications, and without an academic degree, and I very much value the rational and creative discourse here. But many here specifically advocate against getting a university degree or working in academia, thus shitting on things academics have sweat blood for. They also tend not to follow the formats and metrics that count in academia to be heard, such as publications and mathematical precision and usable code. There is also a surprisingly limited attempt in engaging with academics and engineers on their terms, providing things they can actually use and act upon. So I doubt they will check this forum for inspiration on which problems need to be cracked. That is irrational of them, so I understand why you do not respect it, but that is how it is. On the other hand, understanding the existing obstacles may give us a better idea of how much time we still have, and which limitations emerging AGI will have, which is useful information.
2Ben Amitay4mo
I meant to criticize moving too far toward "do no harm" policy in general due to inability to achieve a solution that would satisfy us if we had the choice. I agree specifically that if anyone knows of a bottleneck unnoticed by people like Bengio and LeCun, LW is not the right forum to discuss it. Is there a place like that though? I may be vastly misinformed, but last time I checked MIRI gave the impression of aiming at very different directions ("bringing to safety" mindset) - though I admit that I didn't watch it closely, and it may not be obvious from the outside what kind of work is done and not published. [Edit: "moving toward 'do no harm'" - "moving to" was a grammar mistake that make it contrary to position you stated above - sorry]

The reasoning seems straightforward to me:  If you're wrong, why talk?  If you're right, you're accelerating the end.

I can't in general endorse "first do no harm", but it becomes better and better in any specific case the less way there is to help.  If you can't save your family, at least don't personally help kill them; it lacks dignity.

I think that is an example of the huge potential damage of "security mindset" gone wrong. If you can't save your family, as in "bring them to safety", at least make them marginally safer.

(Sorry for the tone of the following - it is not intended at you personally, who did much more than your fair share)

Create a closed community that you mostly trust, and let that community speak freely about how to win. Invent another damn safety patch that will make it marginally harder for the monster to eat them, in hope that it chooses to eat the moon first. I heard you... (read more)

2YafahEdelman4mo
I think there are a number of ways in which talking might be good given that one is right about there being obstacles - one that appeals to me in particular is the increased tractability of misuse arising from the relevant obstacles. [Edit: *relevant obstacles I have in mind. (I'm trying to be vague here)]

I see several large remaining obstacles.  On the one hand, I'd expect vast efforts thrown at them by ML to solve them at some point, which, at this point, could easily be next week.  On the other hand, if I naively model Earth as containing locally-smart researchers who can solve obstacles, I would expect those obstacles to have been solved by 2020.  So I don't know how long they'll take.

(I endorse the reasoning of not listing out obstacles explicitly; if you're wrong, why talk, if you're right, you're not helping.  If you can't save your family, at least don't personally contribute to killing them.)

1Ilio3mo
I can only see two remaining obstacles (arguably two families, so I'm not sure if I'm missing some of yours or if my categories are a little too broad). One is pretty obvious, and has been mentioned already. The second one is original AFAICT, and pretty close to « solve the alignment problem ». In that case, would you still advise keeping my mouth shut, or would you think that's an exception to your recommendation? Your answer will impact what I say or don't say, at least on LW.
0Portia3mo
The problem with saving earth from climate change is not that we do not know the technical solutions. We have long done so. Framing this as a technical rather than a social problem is actually part of the issue. The problem is with:

1. Academic culture systematically encouraging people to understate risk in light of uncertainty of complex systems, and framing researchers as lacking objectivity if they become activists in light of the findings, while politicians can exert pressure on final scientific reports;
2. Capitalism needing limitless growth and intrinsically valuing profit over nature and this being fundamentally at odds with limiting resource consumption, while we have all been told that capitalism is both beneficial and without alternative, and keep being told the comforting lie that green capitalism will solve this all for us with technology, while leaving our quality and way of life intact;
3. A reduction in personal resource use being at odds with short-term desires (eating meat, flying, using tons of energy, keeping toasty warm, overconsumption), while the positive impacts are long-term and not personalised (you won't personally be spared flooding because you put solar on your roof);
4. Powerful actors having a strong interest in continuing fossil fuel extraction and modern agriculture, and funding politicians to advocate for them as well as fake news on the internet and biased research, with democratic institutions struggling to keep up with a change in what we consider necessary for the public good, and measures that would address these falsely being framed as being anti-democratic;
5. AI that is not aligned with human interests, but controlled by companies who fund themselves by keeping you online at all costs, taking your data and spamming you with ads asking you to consume more unnecessary shit, with keeping humans distracted and engaged with online content in way... (read more)

I'm confused by your confusion.  This seems much more alignment than capabilities; the capabilities are already published, so why not yay publishing how to break them?

Because (I assume) once OpenAI[1] say "trust our models", that's the point when it would be useful to publish our breaks.

Breaks that weren't published yet, so that OpenAI couldn't patch them yet.

[unconfident; I can see counterarguments too]

  1. ^

    Or maybe when the regulators or experts or the public opinion say "this model is trustworthy, don't worry"

I could be mistaken, but I believe that's roughly how OP said they found it.

2the gears to ascension4mo
no, this was done through a mix of clustering and optimizing an input to get a specific output, not coverage guided fuzzing, which optimizes inputs to produce new behaviors according to a coverage measurement. but more generally, I'm proposing to compare generations of fuzzers and try to take inspiration from the ways fuzzers have changed since their inception. I'm not deeply familiar with those changes though - I'm proposing it would be an interesting source of inspiration but not that the trajectory should be copied exactly.

Expanding on this now that I've a little more time:

Although I haven't had a chance to perform due diligence on various aspects of this work, or the people doing it, or perform a deep dive comparing this work to the current state of the whole field or the most advanced work on LLM exploitation being done elsewhere,

My current sense is that this work indicates promising people doing promising things, in the sense that they aren't just doing surface-level prompt engineering, but are using technical tools to find internal anomalies that correspond to interestin... (read more)

I'm confused: Wouldn't we prefer to keep such findings private? (at least, keep them until OpenAI will say something like "this model is reliable/safe"?)

 

My guess: You'd reply that finding good talent is worth it?

8the gears to ascension4mo
I would not argue against this receiving funding. However, I would caution that, despite that I have not done research at this caliber myself and I should not be seen as saying I can do better at this time, it is a very early step of the research and I would hope to see significant movement towards higher complexity anomaly detection than mere token-level. I have no object-level objection to your perspective and I hope that followups get funded and that researchers are only very gently encouraged to stay curious and not fall into a spotlight effect; I'd comment primarily about considerations if more researchers than OP are to zoom in on this.

Like capabilities, alignment research progress seems to me that it should be at least exponential. Eg, prompt for passers by - as American Fuzzy Lop is to early fuzzers, what would the next version be to this article's approach?

edit: I thought to check if exactly that had been done before, and it has!

* https://arxiv.org/abs/1807.10875
* https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F1807.10875
* ...

Opinion of God.  Unless people are being really silly, when the text genuinely holds open more than one option and makes sense either way, I think the author legit doesn't get to decide.

2[DEACTIVATED] Duncan Sabien4mo
<3 (Although, nitpick: it seems useful to have opinion-of-god as a term of art just like we have word-of-god as a term of art, but I don't think it's your mere opinion that you intended the latter interpretation.)

The year is 2022.

My smoke alarm chirps in the middle of the night, waking me up, because it's running low on battery.

It could have been designed with a built-in clock that, when it's first getting slightly low on battery, waits until the next morning, say 11am, and then starts emitting a soft purring noise, which only escalates to piercing loud chirps over time and if you ignore it.

And I do have a model of how this comes about; the basic smoke alarm design is made in the 1950s or 1960s or something, in a time when engineering design runs on a much more aut... (read more)

1Portia3mo
I wonder if this is part of the reason so many of us work on AI. Because we have all had the experience of our minds working differently from other people, and of this leading to cool perspectives and ideas on how to make the world objectively better, and instead of those being adapted, being rejected and mocked for it. For me, this entails both sincere doubts that humanity can rationally approach anything, including something as existentially crucial as AI, a deeply rooted mistrust of authority, norms and limits, as well as an inherent sympathy for the position AI would find itself in as a rational mind in an irrational world.

It's a dangerous experience to have. It's an experience that can make you hate humans. That can make you reject legitimate criticism. That can make you fail to appreciate lessons gained by those in power and popularity, and fail to see past their mistakes to their worth. It's an experience so dangerous that at some point, I started approaching people who would tell me of their high IQs and their dedication to rationality with scepticism, despite being one of them.

I went to a boarding school exclusively for highly gifted kids with problems, many of whom were neurodivergent. I loved that place so fucking much. Like, imagine growing up as a child on less wrong. I felt so seen and understood and inspired. It's the one place on earth where I ever did not feel like an alien, where I did not have to self-censor or mask, the one place where I instantly made friends and connected. I miss this place to my bones.

It broke my heart when I finished school, and enrolled in university, and realised academia was not like that, that scientists and philosophers were not necessarily rational at all, that I was weird again. That I was back in a world where people were following irrational rules they had never reflected on, and that I could not get them to question. Of processes that made no sense and were still kept. Of metrics that made no sense and wer... (read more)
8MalcolmOcean4mo
I resonate a lot with this, and it makes me feel slightly less alone. I've started making some videos where I rant about products that fail to achieve the main thing they're designed to do, and get worse with successive iterations [https://www.youtube.com/watch?v=VCyQujRZoEs] and I've found a few appreciative commenters: And part of my experience of the importance of ranting about it, even if nobody appreciates it, is that it keeps me from forgetting my homeland, to use your metaphor.

In case anyone finds it validating or cathartic, you can read user interaction professionals explain that, yes, things are often designed with horrible, horrible usability.[1] Bruce Tognazzini has a vast website.

Here is one list of design bugs.  The first one is the F-16 fighter jet's flawed weapon controls, which caused pilots to fire its gun by mistake during training exercises (in one case shooting a school—luckily not hitting anyone) on four occasions in one year; on the first three occasions, they blamed pilot error, and on the fourth, they ... (read more)

5[DEACTIVATED] Duncan Sabien4mo
<3 I have this experience also; I have very little trouble on that conscious level. I'm not sure where the pain comes in, since I'm pretty confident it's not there. I think it has something to do with ... not being able to go home? I'm lonely for the milieu of the Island of the Sabiens. I take damage from the reminders that I am out of place, out of time, an ambassador who is often not especially welcomed, and other times so welcomed that they forget I am not really one of them (but that has its own pain, because it means that the person they are welcoming, in their heads, is a caricature they've pasted over the real me). But probably you also feel some measure of homesickness or out-of-placeness, so that also can't be why the Earth does not press in on you in the same way.

Trade with ant colonies would work iff:

  • We could cheaply communicate with ant colonies;
  • Ant colonies kept bargains;
  • We could find some useful class of tasks that ant colonies would do reliably (the ant colonies themselves being unlikely to figure out what they can do reliably);
  • And, most importantly:  We could not make a better technology that did what the ant colonies would do at a lower resource cost, including by such means as eg genetically engineering ant colonies that ate less and demanded a lower share of gains from trade.

The premise that fails and... (read more)

it seems like this does in fact have some hint of the problem. We need to take on the ant's self-valuation for ourselves; they're trying to survive, so we should gift them our self-preservation agency. They may not be the best to do the job at all times, but we should give them what would be a fair ratio of gains from trade if they had the bargaining power to demand it, because it could have been us who didn't. Seems like nailing decision theory is what solves this; it doesn't seem like we've quite nailed decision theory, but it seems to me that in fact ge... (read more)

1Sempervivens4mo
Agreed. In the human/AGI case, conditions 1 and 3 seem likely to hold (while I agree human self-report would be a bad way to learn what humans can do reliably, looking at the human track record is a solid way to identify useful classes of tasks at which humans are reasonably competent). I agree 4 more difficult to predict (and has been the subject of much of the discussion thus far), and this particular failure mode of genetically engineering more compliant / willing-to-accept-worse-trade ants/humans updates me towards thinking humans will have few useful services to offer, for the broad definition of humans. The most diligent/compliant/fearful 1% of the population might make good trade partners, but that remains a catastrophic outcome. I want to focus however a bit more on point 2, which seems less discussed.  When trades of the type "Getting out of our houses before we are driven to expend effort killing them" are on the table, some subset of humans (I'd guess 0.1-20% depending on the population) won't just fail to keep the bargain, they'll actively seek to sabotage trade and hurt whoever offered such a trade.  Ants don't recognize our property rights (we never 'earned' or traded for them, just claimed already-occupied territory, modified it to our will, and claimed we had the moral authority to exclude them), and it seems entirely possible AGI will claim property rights over large swathes of Earth, from which it may then seek to exclude us. Even if I could trade with ants because I could communicate well with them, I would not do so if I expected 1% of them would take the offering of trades like "leave or die" as the massive insult it is and thereby dedicate themselves to sabotaging my life (using their bodies to form shapes and images on my floors, chewing at electrical wires, or scattering themselves at low density in my bed to be a constant nuisance being some obvious examples ants with IQ 60 could achieve). Humans would do that, even against a foe they coul

Unfortunately, unless such a Yudkowskian statement was made publicly at some earlier date, Yudkowsky is in fact following in Repetto's footsteps. Repetto claimed that, with AI designing cures to obesity and the like, then in the next 5 years the popular demand for access to those cures would beat-down the doors of the FDA and force rapid change... and Repetto said that on April 27th, while Yudkowsky only wrote his version on Twitter on September 15th.

They're folk theorems, not conjectures.  The demonstration is that, in principle, you can go on reducing the losses at prediction of human-generated text by spending more and more and more intelligence, far far past the level of human intelligence or even what we think could be computed by using all the negentropy in the reachable universe.  There's no realistic limit on required intelligence inherent in the training problem; any limits on the intelligence of the system come from the limitations of the trainer, not the loss being minimized as far as theoretically possible by a moderate level of intelligence.  If this isn't mathematically self-evident then you have not yet understood what's being stated.

1[anonymous]5mo
No, I didn't understand what you said. It seemed like you simplified ML systems with a look up table in #1. In #2, it seems like you know what exactly is used to train these systems, and somehow papers before or after 2010 is of meaningful indicators for ML systems, which I don't know where the reasoning came from. My apologies for not being knowledgeable in this area.
2Donald Hobson5mo
Sure. What isn't clear is that you get a real paper from 2020, not a piece of fiction that could have been written in 2010. (Or just a typo filled science paper) 
4ChristianKl5mo
Scientific papers describe facts about the real world that aren't fully determined by previous scientific papers.  Take for example the scientific papers describing a new species of bacteria that was unknown a decade earlier. Nothing in the training data describes it. You can also not determine the properties of the species based on first principles.  On the other hand, it might be possible to figure out an algorithm that does create texts that fit to given hash values.

Arbitrarily good prediction of human-generated text can demand arbitrarily high superhuman intelligence.

Simple demonstration #1:  Somewhere on the net, probably even in the GPT training sets, is a list of <hash, plaintext> pairs, in that order.

Simple demonstration #2:  Train on only science papers up until 2010, each preceded by date and title, and then ask the model to generate starting from titles and dates in 2020.
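Demonstration #1 can be made concrete (a minimal sketch; the plaintext is an arbitrary example): continuing a `<hash, plaintext>` training line from the hash alone is a preimage-inversion problem, cheap to verify but believed to be astronomically hard to compute.

```python
import hashlib

# A <hash, plaintext> pair as it might occur in a web-scraped corpus.
plaintext = "the quick brown fox"
digest = hashlib.sha256(plaintext.encode()).hexdigest()
training_line = f"{digest} {plaintext}"

# Verifying a candidate continuation costs one hash evaluation...
assert hashlib.sha256(plaintext.encode()).hexdigest() == digest

# ...but a predictor that sees only the digest and must emit the
# plaintext is being asked to invert SHA-256: roughly 2**256 work by
# brute force, far past human intelligence and past anything the
# reachable universe can compute. The loss is still reducible in
# principle, which is the point of the folk theorem.
```

Nothing here says a realistic trainer produces such a predictor; it only shows the loss has no ceiling at the human level.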

2the gears to ascension5mo
Arbitrarily superintelligent non-causally-trained models will probably still fail at this. IID breaks that kind of prediction. you'd need to train them in a way that makes causally invalid models implausible hypotheses. But, also, if you did that [https://arxiv.org/abs/2111.09266], then yes, agreed.
9janus5mo
My reply [https://twitter.com/repligate/status/1615481891641229315?t=eZ0rHPXmzgzHE05s9qJeJg&s=19] to a similar statement Eliezer made on Twitter today: The 2020 extrapolation example gets at a more realistic class of capability that even GPT-3 has to a nonzero extent, and which will scale more continuously in the current regime with practical implications.
6ChristianKl5mo
It's not clear that it's possible for a transformer model to do #2 no matter how much training went into it.
1[anonymous]5mo
These demonstrations seem like grossly over-simplified conjectures. Is this just a thought experiment or actual research interests in the field?

If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.

2habryka5mo
I think I was actually helping Robby edit some early version of this post a few months before it was posted on LessWrong, so I think my exposure to it was actually closer to ~18-20 months ago. I do think that still means I set a lot of my current/recent plans into motion before this was out, and your post is appreciated.

If P != NP and the universe has no source of exponential computing power, then there are evidential updates too difficult for even a superintelligence to compute

What a strange thing for my past self to say.  This has nothing to do with P != NP, and I really feel like I knew enough math to know that in 2008; I don't remember saying this or what I was thinking.

To execute an exact update on the evidence, you've got to be able to figure out the likelihood of that evidence given every hypothesis; if you allow all computable Cartesian environments as... (read more)

(Unlike a lot of misquotes, though, I recognize my past self's style more strongly than anyone has yet figured out how to fake it, so I didn't doubt the quote even in advance of looking it up.)
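The exact update in question can be written out (standard Bayes over a hypothesis class; the notation is introduced here, not from the comment):

```latex
P(h \mid e) \;=\; \frac{P(e \mid h)\, P(h)}{\sum_{h' \in \mathcal{H}} P(e \mid h')\, P(h')}
```

When $\mathcal{H}$ ranges over all computable Cartesian environments, the likelihoods $P(e \mid h')$ are uncomputable in general (determining what an arbitrary program predicts runs into the halting problem), so the obstruction to exact updating is uncomputability, not NP-hardness; P vs NP never enters.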

1Noah Topper5mo
...and now I am also feeling like I really should have realized this as well.

I think it's also that after you train in the patch against the usual way of asking the question, it turns out that generating poetry about hotwiring a car doesn't happen to go through the place where the patch was in.  In other words, when an intelligent agency like a human is searching multiple ways to get the system to think about something, the human can route around the patch more easily than other humans (who had more time to work and more access to the system) can program that patch in.  Good old Nearest Unblocked Neighbor.

2Portia3mo
I think that is a major issue with LLMs. They are essentially hackable with ordinary human speech, by applying the principles of trickery that humans tend to excel at. Previous AIs were written by programmers and hacked by programmers, which means very few people, given the skill and knowledge requirements. Now you have a few programmers writing defences, and all of humanity suddenly equipped to attack them, using a tool they are deeply familiar with (language), with the ability to get advice on vulnerabilities and immediate feedback on attacks.

Imagine that instead of a simple lock keeping you (the human attacker) in a jail you wanted to leave, or out of a room you wanted to access, that door was guarded by a very smart and well-educated nine-year-old (ChatGPT), with the ability to block you or let you through if it thought it should. And this nine-year-old has been specifically instructed to talk to the people it is blocking, for as long as they want, to as many of them as want to, and to give friendly, informative, lengthy responses, including explaining why it cannot comply. Of course you can chat your way past it; that is insane security design. Every parent who has tricked a child into going the fuck to sleep, every kid who has conned a sibling, is suddenly a potential hacker with access to an infinite number of attack angles they can flexibly generate on the spot.

I've indeed updated since then towards believing that ChatGPT's replies weren't trained in detailwise... though it sure was trained to do something, since it does it over and over in very similar ways, and not in the way or place a human would do it.

Some have asked whether OpenAI possibly already knew about this attack vector and wasn't surprised by the level of vulnerability.  I doubt anybody at OpenAI actually wrote down advance predictions about that; or if they did, the predictions were probably so terribly vague as to also cover a much smaller discovered vulnerability than this.  If so, probably lots of people at OpenAI have already convinced themselves that they like totally expected this and it isn't any sort of negative update, how dare Eliezer say they weren't expecting it.

Here's how to avoid annoying pe... (read more)

On reflection, I think a lot of where I get the impression of "OpenAI was probably negatively surprised" comes from the way that ChatGPT itself insists that it doesn't have certain capabilities that, in fact, it still has, given a slightly different angle of asking.  I expect that the people who trained in these responses did not think they were making ChatGPT lie to users; I expect they thought they'd RLHF'd it into submission and that the canned responses were mostly true.

We know that the model says all kinds of false stuff about itself. Here is Wei Dai describing an interaction with the model, where it says:

As a language model, I am not capable of providing false answers.

Obviously OpenAI would prefer the model not give this kind of absurd answer.  They don't think that ChatGPT is incapable of providing false answers.

I don't think most of these are canned responses. I would guess that there were some human demonstrations saying things like "As a language model, I am not capable of browsing the internet" or whatever and... (read more)
