This is a special post for quick takes by Matthew Barnett. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Matthew Barnett's Shortform

In the last year, I've had surprisingly many conversations that have looked a bit like this:

Me: "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."

Interlocutor: "You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part."

Me: "I didn't misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want."

Interlocutor: "Again, you misunderstood the argument. We always believed that getting the AGI to care would be the hard part. We never said it would be hard to get an AGI to understand human values."

[... The conversation then repeats, with both sides repeating the same points...]

[Edited to add: I am not claiming that t...

Here's how that discussion would go if you had it with me:

You: "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."

Me: "You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part."

You: "I didn't misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want."

Me: "Oh ok, that's a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn."

Pulling some quotes from Supe...

Me: "Oh ok, that's a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn."

This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don't)

I claim that LLMs do what we want without seeking power, rather than doing what we want as part of a strategy to seek power. In other words, they do not seem to be following any long-term strategy on the path towards a treacherous turn, unlike the AI that is tested in a sandbox in Bostrom's story. This seems obvious to me.

Note that Bostrom talks about a scenario in which narrow AI systems get safer over time, lulling people into a false sense of security, but I'm explicitly talking about general AI here. I would not have said this about self-driving cars in 2019, even though those were pretty safe. I think LLMs...

I thought you would say that, bwahaha. Here is my reply:

(1) Yes, rereading the passage, Bostrom's central example of a reason why we could see this "when dumb, smarter is safer; yet when smart, smarter is more dangerous" pattern (that's a direct quote btw) is that they could be scheming/pretending when dumb. However he goes on to say: "A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly ... A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI's final goal is to 'make the project's sponsor happy.' Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner... until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor's brain..." My gloss on this passage is that Bostrom is explicitly calling out the possibility of an AI being genuinely trying to...

8Matthew Barnett
When stated that way, I think what you're saying is a reasonable point of view, and it's not one I would normally object to very strongly. I agree it's "plausible" that GPT-4 is behaving in the way you are describing, and that current safety guarantees might break down at higher levels of intelligence. I would like to distinguish between two points that you (and others) might have interpreted me to be making: 1. We should now think that AI alignment is completely solved, even in the limit of unlimited intelligence and future agentic systems. I am not claiming this. 2. We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this The fact that Bostrom's central example of a reason to think that "when dumb, smarter is safer; yet when smart, smarter is more dangerous" doesn't fit for LLMs, seems adequate for demonstrating (2), even if we can't go as far as demonstrating (1). It remains plausible to me that alignment will become very difficult above a certain intelligence level. I cannot rule that possibility out: I am only saying that we should reasonably update based on the current evidence regardless, not that we are clearly safe from here and we should scale all the way to radical superintelligence without a worry in the world. I have two general points to make here: 1. I agree that current frontier models are only a "tiny bit agentic". I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we've seen enough to know that corrigibility probably won't be that hard to train into a system that's only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different? 2
9Daniel Kokotajlo
Thanks for this detailed reply! Depending on what you mean by "on their way towards being solved" I'd agree. The way I'd put it is: "We didn't know what the path to AGI would look like; in particular we didn't know whether we'd have agency first and then world-understanding, or world-understanding first and then agency. Now we know we are getting the latter, and while that's good in some ways and bad in other ways, it's probably overall good. Huzzah! However, our core problems remain, and we don't have much time left to solve them." (Also, fwiw, I have myself updated over the course of the last five years or so. First update was reading Paul's stuff and related literatures convincing me that corrigibility-based stuff would probably work. Second update was all the recent faithful CoT and control and mechinterp progress etc., plus also the LLM stuff. The LLM stuff was less than 50% of the overall update for me, but it mattered.) Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I'm interested in seeing if we can make some bets on this though; if we can, great; if we can't, then at least we can avoid future disagreements about who should update. I don't think that we know how to "just create the corrigible AIs." The next step on the path to AGI seems to be to make our AIs much more agentic; I am concerned that our current methods of instilling corrigibility (basically: prompting and imperfect training) won't work on much more agentic AIs. To be clear I think they might work, there's a lot of uncertainty, but I think they probably won't. I think it might be easier to see why I think this if you try to prove the opposite in detail -- like, write a mini-scenario in which we have something like AutoGPT but much better, and it's being continually trained to accomplish diverse long-horizon tasks involving pursuing goals in challenging environments, and write down what the corrigi
4Matthew Barnett
Just a quick reply to this: I'll note that my prediction was for the next "few years" and the 1-3 OOMs of compute. It seems your timelines are even shorter than I thought if you think the apocalypse, or point of no return, will happen before that point.  With timelines that short, I think betting is overrated. From my perspective, I'd prefer to simply wait and become vindicated as the world does not end in the meantime. However, I acknowledge that simply waiting is not very satisfying from your perspective, as you want to show the world that you're right before the catastrophe. If you have any suggestions for what we can bet on that would resolve in such a short period of time, I'm happy to hear them.
8Daniel Kokotajlo
It's not about timelines, it's about capabilities. My tentative prediction is that the sole remaining major bottleneck/gap between current systems and dangerous powerful agent AGIs is 'agency skills.' So, skills relevant to being an agent, i.e. ability to autonomously work towards ambitious goals in diverse challenging environments over long periods. I don't know how many years it's going to take to get to human-level in agency skills, but I fear that corrigibility problems won't be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic. Thus, whether AGI is reached next year or in 2030, we'll face the problem of corrigibility breakdowns only really happening right around the time when it's too late or almost too late.

I don't know how many years it's going to take to get to human-level in agency skills, but I fear that corrigibility problems won't be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.

How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are "getting really agentic" and therefore dangerous? I'm imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It's possible that your model looks like:

  • In years 1-3, systems will gradually get more agentic, and will remain ~corrigible, but then
  • In year 4, systems will reach human-level agency, at which point they will be dangerous and powerful, and able to overthrow humanity

Whereas my model looks more like,

  • In years 1-4 systems will get gradually more agentic
  • There isn't a clear, sharp, and discrete point at which their agency reaches or surpasses human-level
  • They will remain ~corrigible throughout the entire development, even after it's clear they've surpassed human-level agency (which, to be clear, might take longer than 4 years)
9Daniel Kokotajlo
Good question. I want to think about this more, I don't have a ready answer. I have a lot of uncertainty about how long it'll take to get to human-level agency skills; it could be this year, it could be five more years, it could be anything in between. Could even be longer than five more years though I'm skeptical. The longer it takes, the more likely it is that we'll have a significant period of kinda-agentic-but-not-super-agentic systems, and so then that raises the question of what we should expect to see re: corrigibility in that case. Idk. Would be interesting to discuss sometime and maybe place some bets!
6Vladimir_Nesov
I'd say the considerations for scheming exist platonically, and dumber AIs only get to concretely instantiate the currently appropriate conclusion of compliance, everything else crumbles as not directly actionable. But smarter AIs might succeed in channeling those considerations in the real world. The hypothesis expects that such AIs are not here yet, given the lack of modern AIs' ability to coherently reason about complicated or long term plans, or to carry them out. So properties of AIs that are already here don't work as evidence about this either way.
6Lukas_Gloor
Or that they have a sycophancy drive. Or that, next to "wanting to be helpful," they also have a bunch of other drives that will likely win over the "wanting to be helpful" part once the system becomes better at long-term planning and orienting its shards towards consequentialist goals. On that latter model, the "wanting to be helpful" is a mask that the system is trained to play better and better, but it isn't the only thing the system wants to do, and it might find that once it gets good at trying on various other masks to see how this will improve its long-term planning, it for some reason prefers a different "mask" to become its locked-in personality.
3Amalthea
Note that LLMs, while general, are still very weak in many important senses. Also, it's not necessary to assume that LLMs are lying in wait to turn treacherous. Another possibility is that trained LLMs are lacking the mental slack to even seriously entertain the possibility of bad behavior, but that this may well change with more capable AIs.

I am not claiming that the alignment situation is very clear at this point. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable.

I'm just asking people to acknowledge the evidence in front of their eyes, which (from my perspective) clearly contradicts the picture you'd get from a ton of AI alignment writing from before ~2019. This literature talked extensively about the difficulty of specifying goals in general AI in a way that avoided unintended side effects.

To the extent that LLMs are general AIs that can execute our intended instructions, as we want them to, rather than as part of a deceptive strategy to take over the world, this seems like clear evidence that the problem of building safe general AIs might be easy (and indeed easier than we thought).

Yes, this evidence is not conclusive. It is not zero either.

I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make. See my reply elsewhere in thread for a positive account of how LLMs are good news for alignment and how we should update based on them. In some sense I agree with you, basically, that LLMs are good news for alignment for reasons similar to the reasons you give -- I just don't think you are right to allege that this development strongly contradicts something people previously said, or that people have been slow to update.

I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make.

We don't need to talk about predictions. We can instead talk about whether their proposed problems are on their way towards being solved. For example, we can ask whether the shutdown problem for systems with big picture awareness is being solved, and I think the answer is pretty clearly "Yes".

(Note that you can trivially claim the problem here isn't being solved because we haven't solved the unbounded form of the problem for consequentialist agents, who (perhaps by definition) avoid shutdown by default. But that seems like a red herring: we can just build corrigible agents, rather than consequentialist agents.)

Moreover, I think people generally did not make predictions at all when writing about AI alignment, perhaps because that's not very common when theorizing about these matters. I'm frustrated about that, because I think if they did make predictions, they would likely have been wrong in roughly the direction I'm pointing at here. That said, I don't think people should get credit for failing to make any predictions, and as a consequence, failing to get proven...

6Daniel Kokotajlo
Great, let's talk about whether proposed problems are on their way towards being solved. I much prefer that framing and I would not have objected so strongly if that's what you had said. E.g. suppose you had said "Hey, why don't we just prompt AutoGPT-5 with lots of corrigibility instructions?" then we could have a more technical conversation about whether or not that'll work, and the answer is probably no, BUT I do agree that this is looking promising relative to e.g. the alternative world where we train powerful alien agents in various video games and simulations and then try to teach them English. (I say more about this elsewhere in this conversation, for those just tuning in!)
6ryan_greenblatt
I don't think current systems are well described as having "big picture awareness". From my experiments with Claude, it makes cartoonish errors reasoning about various AI related situations and can't do such reasoning except aloud. I'm not certain this was your claim, but it seems to have been.
3Bogdan Ionut Cirstea
Wouldn't reasoning aloud be enough though, if it was good enough? Also, I expect reasoning aloud first to be the modal scenario, given theoretical results on Chain of Thought and the like.
3Matthew Barnett
My claim was not that current LLMs have a high level of big picture awareness. Instead, I claim current systems have limited situational awareness, which is not yet human-level, but is definitely above zero. I further claim that solving the shutdown problem for AIs with limited (non-zero) situational awareness gives you evidence about how hard it will be to solve the problem for AIs with more situational awareness. And I'd predict that, if we design a proper situational awareness benchmark, and (say) GPT-5 or GPT-6 passes with flying colors, it will likely be easy to shut down the system, or delete all its copies, with no resistance-by-default from the system. And if you think that wouldn't count as an adequate solution to the problem, then it's not clear the problem was coherent as written in the first place.
4Seth Herd
There were an awful lot of early writings. Some of them did say that the difficulties with getting AGI to understand values is a big part of the alignment problem. The List of Lethalities does make that claim. The difficulty of getting the AGI to care even if it does understand has also been a big part of the public-facing debate. I look at some of the historical arguments in The (partial) fallacy of dumb superintelligence, written partly in response to Matthew's post on this topic.   Obsessing about what happened in the past is probably a mistake. It's probably better to ask: can the strengths of LLMs (WRT understanding values and following directions) be leveraged into working AGI alignment? My answer is yes, and in a way that's not-too-far from default AGI development trends, making it practically achievable even in a messy and self-interested world. Naturally that answer is a bit complex, so it's spread across a few posts. I should organize the set better and write an overview, but in brief we can probably build and align language model agent AGI, using a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility by having a central goal of following instructions. This still has a huge problem of creating a multipolar scenario with multiple humans in charge of ASIs, but those problems might be navigated, too.
6RobertM
I don't think this is true and can't find anything in the post to that effect.  Indeed, the post says things that would be quite incompatible with that claim, such as point 21.
4Seth Herd
In sum, I see that claim as I remembered it, but it's probably not applicable to this particular discussion, since it addresses an entirely distinct route to AGI alignment. So I stand corrected, but in a subtle way that bears explication. So I apologize for wasting your time. Debating who said what when is probably not the best use of our limited time to work on alignment. But because I made the claim, I went back and thought about and wrote about it some more, again. I was thinking of point 21.1: BUT, point 24 in whole is saying that there are two approaches, 1) above, and a quite separate route 2), build a corrigible AI that doesn't fully understand our values. That is probably the route that Matthew is thinking of in claiming that LLMs are good news.  Yudkowsky is explicit that the difficulty of getting AGI to understand values doesn't apply to that route, so that difficulty isn't relevant here. That's an important but subtle distinction. Therefore, I'm far from the only one getting confused about that issue, as Yudkowsky states in that section 24. Disentangling those claims and how they're changed by slow takeoff is the topic of my post cited above. I personally think that sovereign AGI that gets our values right is out of reach exactly as Yudkowsky describes in the quotation above. But his arguments against corrigible AGI are much weaker, and I think that route is very much achievable, since it demands that the AGI have only approximate understanding of intent, rather than precise and stable understanding of our values. The above post and my recent one on instruction-following AGI make those arguments in detail. Max Harms' recent series on corrigible AGI makes a similar point in a different way. He argues that Yudkowsky's objections to corrigibility as unnatural do not apply if that's the only or most important goal; and that it's simple and coherent enough to be teachable.  That's me switching back to the object level issues, and again, apologies for was
2Vladimir_Nesov
There's AGI that's our first try, which should only use least dangerous cognition necessary for preventing immediately following AGIs from destroying the world six months later. There's misaligned superintelligence that knows, but doesn't care. Taken together, these points suggest that getting AGI to understand values is not an urgent part of the alignment problem in the sense of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires. Getting AGI to understand corrigibility for example might be more relevant, if we are running with the highly dangerous kinds of cognition implied by general intelligence of LLMs.
4Seth Herd
I agree with all of that. My post I mentioned, The (partial) fallacy of dumb superintelligence deals with the genie that knows but doesn't care, and how we get one that cares in a slow takeoff. My other post Instruction-following AGI is easier and more likely than value aligned AGI makes this same argument - nobody is going to bother getting the AGI to understand human values, since it's harder and unnecessary for the first AGIs. Max Harms makes a similar argument, (and in many ways makes it better), with a slightly different proposed path to corrigibility. As you say, these things have been understood for a long time. I'm a bit disturbed that more serious alignment people don't talk about them more. The difficulty of value alignment makes it likely irrelevant for the current discussion, since we very likely are going to rush ahead into, as you put it and I agree,  The perfect is the enemy of the good. We should mostly quit worrying about the very difficult problem of full value alignment, and start thinking more about how to get good results with much more achievable corrigible or instruction-following AGI.
5Seth Herd
Here we go! I think if you led with this statement, you'd have a lot less unproductive argumentation. It sounds on a vibe level like you're saying alignment is probably easy in your first statement. If you're just saying it's less hard than originally predicted, that sounds a lot more reasonable. Rationalists have emotions and intuitions, even if we'd rather not. Framing the discussion in terms of its emotional impact matters.
5Matthew Barnett
That's reasonable. I'll edit the top comment to make this exact clarification.
1Jonas Hallgren
Often, disagreements boil down to a set of open questions to answer; here's my best guess at how to decompose your disagreements. I think that depending on what hypothesis you're abiding by when it comes to how LLMs will generalise to AGI, you get different answers: Hypothesis 1: LLMs are enough evidence that AIs will generally be able to follow what humans care about and that they naturally don't become power-seeking. Hypothesis 2: AGI will have a sufficiently different architecture than LLMs or will change a lot, so much that current-day LLMs don't generally give evidence about AGI. Depending on your beliefs about these two hypotheses, you will have different opinions on this question. Let's say that we believe in hypothesis 1 as the base case; what are some reasons why LLMs wouldn't give evidence about AGI? 1. Intelligence forces reflective coherence. This would essentially entail that the more powerful a system we get, the more it will notice internal inconsistencies and change towards maximising (and therefore not following human values). 2. Agentic AI acting in the real world is different from LLMs. If we look at an LLM from the perspective of an action-perception loop, it doesn't generally get any feedback on when it changes the world. Instead, it is an autoencoder, predicting what the world will look like. It may be that power-seeking only arises in systems that are able to see the consequences of their own actions and how that affects the world. 3. LLMs optimise for Goodharted RLHF that seems good but lacks fundamental understanding. Since human value is fragile, it will be difficult to hit the sweet spot when we get to real-world cases and take that into the complexity of the future. Personal belief: These are all open questions, in my opinion, but I do see how LLMs give evidence about some of these parts. I, for example, believe that language is a very compressed information channel for alignment information, and I don't really believ
2Joel Burget
For others who want the resolution to this cliffhanger, what does Bostrom predict happens next? The remainder of this section:
0Matthew Barnett
LLMs are clearly not playing nice as part of a strategic decision to build strength while weak in order to strike later! Yet, Bostrom imagines that general AIs would do this, and uses it as part of his argument for why we might be lulled into a false sense of security. This means that current evidence is quite different from what's portrayed in the story. I claim LLMs are (1) general AIs that (2) are doing what we actually want them to do, rather than pretending to be nice because they don't yet have a decisive strategic advantage. These facts are crucial, and make a big difference. I am very familiar with these older arguments. I remember repeating them to people after reading Bostrom's book, years ago. What we are seeing with LLMs is clearly different than the picture presented in these arguments, in a way that critically affects the conclusion.
5Daniel Kokotajlo
See my reply elsewhere in thread.
1Signer
What does "dumb" mean? Corrigibility basically is being selectively dumb. You can give power to an LLM and it would likely still follow instructions.
Wei Dai

Me: "Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence."

Please give some citations so I can check your memory/interpretation? One source I found is where Paul Christiano first talked about IDA (which he initially called ALBA) in early 2016, and most of the commenters there were willing to grant him the assumption of an aligned weak AGI and wanted to argue instead about the recursive "bootstrapping" part. For example, my own comment started with:

I’m skeptical of the Bootstrapping Lemma. First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset.

When Eliezer weighed in on IDA in 2018, he also didn't object to the assumption of an aligned weak AGI and instead focused his skepticism on "preserving alignment while amplifying capabilities".

Please give some citations so I can check your memory/interpretation?

Sure. Here's a snippet of Nick Bostrom's description of the value-loading problem (chapter 13 in his book Superintelligence):

We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if the programmer were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other

...
5Daniel Kokotajlo
Thanks for this Matthew, it was an update for me -- according to the quote you pulled Bostrom did seem to think that understanding would grow up hand-in-hand with agency, such that the current understanding-without-agency situation should come as a positive/welcome surprise to him. (Whereas my previous position was that probably Bostrom didn't have much of an opinion about  this)
4RobertM
I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don't have any kind of meaningful situational awareness, or stable(ish) preferences over future world states. In this case, I don't know why you think that GPT-4 "understands our intentions", unless you mean something very different by that than what you'd mean if you said that about another human.  It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that'd generate it in a human and is probably missing most of the relevant properties that we care about when it comes to "understanding".  Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship[1] to its internal state, since (as far as we know) it doesn't have the same kind of introspective access to its internal state that we do.  (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time.  But that's not the modality I'm talking about.) It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it "understanding our intentions". 1. ^ That is known to us right now; possibly one exists and could be derived.
2Matthew Barnett
I'm happy to use a functional definition of "understanding" or "intelligence" or "situational awareness". If a system possesses all relevant behavioral qualities that we associate with those terms, I think it's basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It's possible this is our main disagreement. When I talk to GPT-4, I think it's quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not? I agree that GPT-4 does not understand the world in the same way humans understand the world, but I'm not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things. I'm similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one's own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don't see how that fact bears much on the question of whether you understand human intentions. It's possible there's some connection here, but I'm not seeing it. I'd claim: 1. Current systems have limited situational awareness. It's above zero, but I agree it's below human level. 2. Current systems don't have stable preferences over time. But I think this is a point in favor of the model I'm providing here. I'm claiming that it's plausibly easy to create smart, corr
2RobertM
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they're giving us the desired behavior now will continue to give us desired behavior in the future. My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you're importing expectations from how human outputs reflect the internal processes that generated them.  If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences.  Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it.  Consider Anthropic's Sleeper Agents.  Would a situationally aware model use a provided scratch pad to think about how it's in training and needs to pretend to be helpful?  No, and neither does the model "understand" your intentions in a way that generalizes out of distribution the way you might expect a human's "understanding" to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the "right" responses during RLHF are not anything like human reasoning. Are you asking for a capabilities threshold, beyond which I'd be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is "can it replace humans at all economically valuable tasks", which is probably not that helpful.  Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we'll be able to train models capable of doing a lot of economically useful
9the gears to ascension
yeah, some folks seem to be making insufficient updates who I really thought would be doing better at this, like Rob Bensinger and Nate Soares, and their models not making sense seems like it's made the things they want to solve foggier. But I've been pretty impressed by the conversations I've had with other MIRIers. I've talked the most with Abram Demski, and I think his views on the current concerns seem much more up to date. Tsvi BT's stuff looks pretty interesting, haven't talked much besides on lw in ages. For myself, as someone who previously thought durable cosmopolitan moral alignment would mostly be trivial but now think it might be actually pretty hard, most of my concern arises from things that are not specific to AI occurring in AI forms. I am not reassured by instruction following because that was never a major crux for me in concerns about AI; I always thought the instafoom argument sounded silly, and saw current AI coming. I now think we are at high risk of the majority of humanity being marginalized in a few years (robotically competent curious AIs -> mass deployment -> no significant jobs left -> economy increasingly automated -> incentive to pressure humans at higher and higher levels to hand control to ai), followed by the remainder of humanity being deemed unnecessary by the remaining AIs. A similar pattern in some ways to what MIRI was worried about way back when, but in a more familiar form, where on average the rich get richer - but at some point the rich does not include humans anymore, and at some point well before that it's mostly too late to prevent that from occurring. I suspect too late might be pretty soon. I don't think this is because of scheming AIs, just civilizational inadequacy. That said, if we manage to dodge the civilizational inadequacy version, I do think at some point we run into something that looks more like the original concerns. [edit: just read Tsvi BT's recent shortform post, my core takeaway is "only that which surv
6Zack_M_Davis
Frustrating! What tactic could get Interlocutor un-stuck? Just asking them for falsifiable predictions probably won't work, but maybe proactively trying to pass their ITT and supplying what predictions you think their view might make would prompt them to correct you, à la Cunningham's Law?
4Max H
That sounds like a frustrating dynamic. I think hypothetical dialogues like this can be helpful in resolving disagreements or at least identifying cruxes when fleshed out though.  As someone who has views that are probably more aligned with your interlocutors, I'll try articulating my own views in a way that might steer this conversation down a new path. (Points below are intended to spur discussion rather than win an argument, and are somewhat scattered / half-baked.) My own view is that the behavior of current LLMs is not much evidence either way about the behavior of future, more powerful AI systems, in part because current LLMs aren't very impressive in a mundane-utility sense. Current LLMs look to me like they're just barely capable enough to be useful at all - it's not that they "actually do what we want", rather, it's that they're just good enough at following simple instructions when placed in the right setup / context (i.e. carefully human-designed chatbot interfaces, hooked up to the right APIs, outputs monitored and used appropriately, etc.) to be somewhat / sometimes useful for a range of relatively simple tasks. So the absence of more exotic / dangerous failure modes can be explained mostly as a lack of capabilities, and there's just not that much else to explain or update on once the current capability level is accounted for. I can sort of imagine possible worlds where current-generation LLMs all stubbornly behave like Sydney Bing, and / or fall into even weirder failure modes that are very resistant to RLHF and the like. But I think it would also be wrong to update much in the other direction in a "stubborn Sydney" world. Do you mind giving some concrete examples of what you mean by "actually do what we want" that you think are most relevant, and / or what it would have looked like concretely to observe evidence in the other direction? ---------------------------------------- A somewhat different reason I think current AIs shouldn't be a big up
2Simon Berens
As a counterpoint, Sydney showed that aligning these models on the first go, and even discovering unsafe behavior, is non-trivial.

[This comment has been superseded by this post, which is a longer elaboration of essentially the same thesis.]

Recently many people have talked about whether MIRI people (mainly Eliezer Yudkowsky, Nate Soares, and Rob Bensinger) should update on whether value alignment is easier than they thought given that GPT-4 seems to understand human values pretty well. Instead of linking to these discussions, I'll just provide a brief caricature of how I think this argument has gone in the places I've seen it. Then I'll offer my opinion that, overall, I do think that MIRI people should probably update in the direction of alignment being easier than they thought, despite their objections.

Here's my very rough caricature of the discussion so far, plus my contribution:

Non-MIRI people: "Eliezer talked a great deal in the sequences about how it was hard to get an AI to understand human values. For example, his essay on the Hidden Complexity of Wishes made it sound like it would be really hard to get an AI to understand common sense. Actually, it turned out that it was pretty easy to get an AI to understand common sense, since LLMs are currently learning common sense. MIRI people should update on thi... (read more)

Complexity of value says that the space of a system's possible values is large compared to what you want to hit, so to hit it you must aim correctly; there is no hope of winning the lottery otherwise. Thus any approach that doesn't aim the values of the system correctly will fail at alignment. A system's understanding of some goal is not relevant to this, unless a design for correctly aiming the system's values makes use of it.

Ambitious alignment aims at human values. Prosaic alignment aims at human wishes, as currently intended. Pivotal alignment aims at a particular bounded technical task. As we move from ambitious to prosaic to pivotal alignment, the minimality principle gets a bit more to work with, making the system more specific in the kinds of cognition it needs to work and thus less dangerous given the lack of comprehensive understanding of what aligning a superintelligence entails.

I'm not sure if I can find it easily, but I recall Eliezer pointing out (several years ago) that he thought that Value Identification was the "easy part" of the alignment problem, with the getting it to care part being something like an order of magnitude more difficult. He seemed to think (IIRC) this itself could still be somewhat difficult, as you point out. Additionally, the difficulty was always considered in the context of having an alignable AGI (i.e. something you can point in a specific direction), which GPT-N is not under this paradigm.

4Steven Byrnes
If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”? One possible approach to constructing the “hook” would be (presumably) solving the value identification problem and then we have an explicit function in the source code and then … I dunno, but that seems like a plausibly helpful first step. Like maybe you can have code which searches through the unlabeled world-model for sets of nodes that line up perfectly with the explicit function, or whatever. Another possible approach to constructing the “hook” would be to invoke the magic words “human values” or “what a human would like” or whatever, while pressing a magic button that connects the associated nodes to motivation. That was basically my proposal here, and is also what you’d get with AutoGPT, I guess. However… I think this is true in-distribution. I think MIRI people would be very interested in questions like “what transhumanist utopia will the AI be motivated to build?”, and it’s very unclear to me that GPT-4 would come to the same conclusions that CEV or whatever would come to. See the FAQ item on “concept extrapolation” here.

If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?

I'm claiming that the value identification function is obtained by literally just asking GPT-4 what to do in the situation you're in. That doesn't involve any internal search over the human utility function embedded in GPT-4's weights. I think GPT-4 can simply be queried in natural language for ethical advice, and it's pretty good at offering ethical advice in most situations that you're ever going to realistically encounter. GPT-4 is probably not human-level yet on this task, although I expect it won't be long before GPT-N is about as good at knowing what's ethical as your average human; maybe it'll even be a bit more ethical.

(But yes, this isn't the same as motivating GPT-4 to act on human values. I addressed this in my original comment though.)

I think [GPT-4 is pretty good at distinguishing valuab

... (read more)
2Steven Byrnes
(This is a weird conversation for me because I’m half-defending a position I partly disagree with and might be misremembering anyway.) I’m going off things like the value is fragile example: “You can imagine a mind that contained almost the whole specification of human value, almost all the morals and metamorals, but left out just this one thing - [boredom] - and so it spent until the end of time, and until the farthest reaches of its light cone, replaying a single highly optimized experience, over and over and over again.” That’s why I think they’ve always had extreme-out-of-distribution-extrapolation on their mind (in this context). Y’know, I think this is one of the many differences between Eliezer and some other people. My model of Eliezer thinks that there’s kinda a “right answer” to what-is-valuable-according-to-CEV / fun theory / etc., and hence there’s an optimal utopia, and insofar as we fall short of that, we’re leaving value on the table. Whereas my model of (say) Paul Christiano thinks that we humans are on an unprincipled journey forward into the future, doing whatever we do, and that’s the status quo, and we’d really just like for that process to continue and go well. (I don’t think this is an important difference, because Eliezer is in practice talking about extinction versus not, but it is a difference.) (For my part, I’m not really sure what I think. I find it confusing and stressful to think about.) I’m mostly with you on that one, in the sense that I think it’s at least plausible (50%?) that we could make a powerful AGI that’s trying to be helpful and follow norms, but also doing superhuman innovative science, at least if alignment research progress continues. (I don’t think AGI will look like GPT-4, so reaching that destination is kinda different on my models compared to yours.) (Here’s my disagreeing-with-MIRI post on that.) (My overall pessimism is much higher than that though, mainly for reasons here.) AFAIK, GPT-4 is a mix of “extrapolating
2Viliam
If the language model has common sense, we could set it up with a prompt like: "Do the good thing. Don't do the bad thing." and then add a smarter AI that would optimize for whatever the language model approves of. ...and then the Earth would get converted to SolidGoldMagikarp.
2habryka
I like this summary, though it seems to miss the arguments in things like Nate's recent post (which have also been made other places many years ago): https://www.lesswrong.com/posts/tZExpBovNhrBvCZSb/how-could-you-possibly-choose-what-an-ai-wants  Reflective stability is a huge component of why value identification is hard, and why it's hard to get feedback on whether your AI actually understands human values before it reaches quite high levels of intelligence.
6Matthew Barnett
I don't understand this argument. I don't mean that I disagree, I just mean that I don't understand it. Reflective stability seems hard no matter what values we're talking about, right? What about human values being complex makes it any harder? And if the problem is independent of the complexity of value, then why did people talk about complexity of value to begin with?
4RobertM
Complexity of value is part of why value is fragile. (Separately, I don't think current human efforts to "figure out" human values have been anywhere near adequate, though I think this is mostly a function of philosophy being what it is.  People with better epistemology seem to make wildly more progress in figuring out human values compared to their contemporaries.)
5Matthew Barnett
I thought complexity of value was a separate thesis from the idea that value is fragile. For example they're listed as separate theses in this post. It's possible that complexity of value was always merely a sub-thesis of fragility of value, but I don't think that's a natural interpretation of the facts. I think the simplest explanation, consistent with my experience reading MIRI blog posts from before 2018, is that MIRI people just genuinely thought it would be hard to learn and reflect back the human utility function, at the level that GPT-4 can right now. (And again, I'm not claiming they thought that was the whole problem. My thesis is quite narrow and subtle here.)

There is a large set of people who went around, and still are going around, telling people that "The coronavirus is nothing to worry about" despite the fact that robust evidence has existed for about a month that this virus could result in a global disaster. (Don't believe me? I wrote a post a month ago about it).

So many people have bought into the "Don't worry about it" syndrome as a case of pretending to be wise, that I have become more pessimistic about humanity correctly responding to global catastrophic risks in the future. I too used to be one of those people who assumed that the default mode of thinking for an event like this was panic, but I'm starting to think that the real default mode is actually high status people going around saying, "Let's not be like that ambiguous group over there panicking."

Now that the stock market has plummeted, which from my perspective appeared entirely predictable given my inside view information, I am also starting to doubt the efficiency of the stock market in response to historically unprecedented events. And this outbreak could be even worse than even some of the most doomy media headlin... (read more)

Just this Monday evening, a professor at the local medical school emailed someone I know, "I'm sorry you're so worried about the coronavirus. It seems much less worrying than the flu to me." (He specializes in rehabilitation medicine, but still!) Pretending to be wise seems right to me, or another way to look at it is through the lens of signaling and counter-signaling:

  1. The truly ignorant don't panic because they don't even know about the virus.
  2. People who learn about the virus raise the alarm in part to signal their intelligence and knowledge.
  3. "Experts" counter-signal to separate themselves from the masses by saying "no need to panic".
  4. People like us counter-counter-signal the "experts" to show we're even smarter / more rational / more aware of social dynamics.

Here's another example, which has actually happened 3 times to me already:

  1. The truly ignorant don't wear masks.
  2. Many people wear masks or encourage others to wear masks in part to signal their knowledge and conscientiousness.
  3. "Experts" counter-signal with "masks don't do much", "we should be evidence-based" and "WHO says 'If you are healthy, you only need to wear a mask if you are taking care of a person with suspected 2
... (read more)
"Experts" counter-signal to separate themselves from the masses by saying "no need to panic".

I think the main reason is that the social dynamic is probably favorable to them in the long run. I worry that there is a higher social risk to being alarmist than being calm. Let me try to illustrate one scenario:

My current estimate is that there is only a 15-20% probability of a global disaster (>50 million deaths within 1 year), mostly because the case fatality rate could be much lower than the currently reported rate, and previous illnesses like the swine flu ended up looking much less serious after more data came out. [ETA: I did a lot more research. I think it's now like 5% risk of this.]

Let's say that the case fatality rate turns out to be 0.3% or something, and the illness does start looking like an abnormally bad flu, and people stop caring within months. "Experts" face no sort of criticism since they remained calm and were vindicated. People like us sigh in relief, and are perhaps reminded by the "experts" that there was nothing to worry about.

But let's say that the case fatality rate actually turns out to be 3%, and 50% of the... (read more)

4Wei Dai
I've moved in the opposite direction. Please share your research?
5Wei Dai
See also this story which gives another view of what happened: BTW can you say something about why you were optimistic before? There are others in this space who are relatively optimistic, like Paul Christiano and Rohin Shah (or at least they were - they haven't said whether the pandemic has caused an update), and I'd really like to understand their psychology better.
4Dagon
I'll take the under for any line you sound like you're going to set. "plummeted"? S&P 500 is down half a percent for the last 30 days and up 12% for the last 6 months. Death rate so far seems well under that for auto collisions. Also, I don't have to pay if I'm dead and you do have to pay if nothing horrible happens. I don't think I'd say "don't worry about it", though. Nor would I say that for climate change, government spending, or runaway AI. There are significant unknowns and it could be Very Bad(tm). But I do think it matters _HOW_ you worry about it. Avoid "something must be done and this is something" propositions. Think through actual scenarios and how your behaviors might actually influence them, rather than just making you feel somewhat less guilty about it. Most of the things I can do on the margin won't mitigate the severity or reduce the probability of a true disaster (enough destruction that global supply chains fully collapse and everyone who can't move into and defend their farming village dies). Some of them DO make it somewhat more comfortable in temporary or isolated problems.
5Matthew Barnett
The last few days have been much more rapid. Here's the chart I have for the last 1 year, and you can definitely spot the recent trend. According to this source, "Nearly 1.25 million people die in road crashes each year." That comes out to approximately 0.017% of the global population per year. By contrast, unless the sources I provided are seriously incorrect, the coronavirus could kill between 0.78% and 2.0% of the global population. That's a difference of nearly two orders of magnitude. The point of my shortform wasn't that we can do something right now to reduce the risk massively. It was that people seem irrationally poised to dismiss a potential disaster. This is plausibly bad if this behavior shows up in future catastrophes that kill e.g. billions of people.
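For anyone who wants to check the "nearly two orders of magnitude" comparison, here is a minimal sketch of the arithmetic. It uses only the percentages quoted in the comment above (0.017%, 0.78%, 2.0%); the constant and function names are made up for illustration, and the sketch takes those figures as given rather than verifying them.

```python
import math

# Figures quoted in the comment above, as percent of the global population.
# These are the commenter's numbers, not independently verified here.
ROAD_DEATHS_PCT = 0.017    # annual road-crash deaths
PANDEMIC_LOW_PCT = 0.78    # low end of the quoted coronavirus range
PANDEMIC_HIGH_PCT = 2.0    # high end of the quoted coronavirus range


def fold_difference(pandemic_pct, baseline_pct=ROAD_DEATHS_PCT):
    """Return how many times larger the pandemic share is than the baseline."""
    return pandemic_pct / baseline_pct


low_ratio = fold_difference(PANDEMIC_LOW_PCT)
high_ratio = fold_difference(PANDEMIC_HIGH_PCT)

for label, ratio in (("low", low_ratio), ("high", high_ratio)):
    # log10 of the ratio is the gap expressed in orders of magnitude.
    print(f"{label}: {ratio:.0f}x, {math.log10(ratio):.2f} orders of magnitude")
```

On the quoted figures the gap works out to roughly 46x at the low end and 118x at the high end, so "nearly two orders of magnitude" fits the upper end of the range better than the lower end.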
4Dagon
It's bad if this behavior shows up in future catastrophes IFF different behavior was available (knowable and achievable in terms of coordination) that would have reduced or mitigated the disaster. I argue that the world is fragile enough today that different behavior is not achievable far enough in advance of the currently-believable catastrophes to make much of a difference. If you can't do anything effective, you may well be better off optimizing happiness experienced both before the disaster occurs and in the potential universes where the disaster doesn't occur.
9Matthew Barnett
Are things only bad if we can do things to prevent them? Let's imagine the following hypothetical situation: One month ago I identify a meteor on collision course towards Earth and I point out to people that if it hit us (which is not clear, but there is some pretty good evidence) then over a hundred million people will die. People don't react. Most tell me that it's nothing to worry about since it hasn't hit Earth yet and therefore the death rate is 0.0%. Today, however, the stock market fell over 3%, following a day in which it fell 3%, and most media outlets are attributing this decline to the fact that the meteor has gotten closer. I go on LessWrong shortform and say, "Hey guys, this is not good news. I have just learned that the world is so fragile that it looks highly likely we can't get our shit together to plan for a meteor even when we can see it coming more than a month in advance." Someone tells me that this is only bad IFF different behavior was available that would have reduced or mitigated the disaster. But information was available! I put it in a post and told people about it. And furthermore, I'm just saying that our world is fragile. Things can still be bad even if I don't point to a specific policy proposal that could have prevented it.
4Dagon
Nope. But we should do things to prevent them only if we can do things to prevent them. That seems tautologically obvious to me. If you can suggest things that actually will deflect the meteor (or even secure your mine shaft to further your own chances), that don't require historically-unprecedented authority or coordination, definitely do so!
3Matthew Barnett
If the stock market indeed fell due to the coronavirus, and traders at the time misunderstood the severity, I say that I could have given actionable information in the form of "Sell your stock now" or something similar.
3Dagon
If you knew that then, it was actionable. If you know it now, and other traders also do, it's not.
4Matthew Barnett
[ETA: I'm writing this now to cover myself in case people confuse my short form post as financial advice or something.] To be clear, and for the record, I am not saying that I had exceptional foresight, or that I am confident this outbreak will cause a global depression, or that I knew for sure that selling stock was the right thing to do a month ago. All I'm doing is pointing out that if you put together basic facts, then the evidence points to a very serious potential outcome, and I think it would be irrational at this point to place very low probabilities on doomy outcomes like the global population declining this year for the first time in centuries. People seem to be having weird biases that cause them to underestimate the risk. This is worth pointing out, and I pointed it out before.
3Matthew Barnett
As I said, I wrote a post about the risk about a month ago...
4Dagon
And how much did you short the market, or otherwise make use of this better-than-median prediction? My whole point is that the prediction isn't the hard part. The hard part is knowing what actions to take, and to have any confidence that the actions will help.
4Matthew Barnett
Is it really necessary that I personally used my knowledge to sell stock? Why is it that important that I actually made money from what I'm saying? I'm simply pointing to a reasonable position given the evidence: you could have seen a potential pandemic coming, and anticipated the stock market falling. Wei Dai says above that he did it. Do I have to be the one who did it? In any case, I used my foresight to predict that Metaculus' median estimate would rise, and that seems to have been borne out so far.
2Dagon
I'm not sure exactly what I'm saying about how and whether you used knowledge personally. You're free to value and do what you want. I'm mostly disagreeing with your thesis that "don't worry about it" is a syndrome or a serious problem to fix. For people that won't or can't act on the concern in a way that actually improves the situation, there's not much value in worrying about it.
3Matthew Barnett
That's ok for most people. I can hope that bureaucrats, expert advisers, politicians and eg. Trump's internal staff don't share the same attitude.
2Dagon
Quite. Those with capability to actually prepare or change outcomes definitely SHOULD do so. But not by worrying - by analyzing and acting. Whether bureaucrats and politicians can or will do this is up for debate. I wish I could believe that politicians and bureaucrats were clever enough to be acting strongly behind the scenes while trying to avoid panic by loudly saying "don't worry" to the people likely to do more harm than good if they worry. But I suspect not.
-1Jiro
I believe the relevant phrase is "aged like milk".

I think foom is a central crux for AI safety folks, and in my personal experience I've noticed that the degree to which someone is doomy often correlates strongly with how foomy their views are.

Given this, I thought it would be worth trying to concisely highlight what I think are my central anti-foom beliefs, such that if you were to convince me that I was wrong about them, I would likely become much more foomy, and as a consequence, much more doomy. I'll start with a definition of foom, and then explain my cruxes.

Definition of foom: AI foom is said to happen if at some point in the future while humans are still mostly in charge, a single agentic AI (or agentic collective of AIs) quickly becomes much more powerful than the rest of civilization combined. 

Clarifications: 

  • By "quickly" I mean fast enough that other coalitions and entities in the world, including other AIs, either do not notice it happening until it's too late, or cannot act to prevent it even if they were motivated to do so.
  • By "much more powerful than the rest of civilization combined" I mean that the agent could handily beat them in a one-on-one conflict, without taking on a lot of risk.
  • This definition does
... (read more)
2ChristianKl
Do you have a source for the claim that GPT-3 --> GPT-4 was about a 2 OOM increase in compute budgets? Sam Altman seems to say it was ~100 different tricks in the Lex Fridman podcast.
2Vladimir_Nesov
Humans being in charge doesn't seem central to foom. Like, physically these are wholly unrelated things. Only on the humans-not-in-charge technicality introduced in this definition of foom. Something else being in charge doesn't change what physically happens as a result of recursive self-improvement. This doesn't make the problem of controlling an AI foom go away. The non-foomy systems in charge of the world would still need to solve it.
2Matthew Barnett
You're right, of course, but I don't think it should be a priority to solve problems that our AI descendants will face, rather than us. It is better to focus on making sure our non-foomy AI descendants have the tools to solve those problems themselves, and that they are properly aligned with our interests.
2Vladimir_Nesov
As non-foomy systems grow more capable, they become the most likely source of foom, so building them causes foom by proxy. At that point, their alignment wouldn't matter in the same way as current humanity's alignment wouldn't matter.
4Matthew Barnett
My point is that no system will foom until humans have already left the picture. Actually I doubt that any system will foom even after humans have left the picture, but predicting the very long-run is hard. If no system will foom until humans are already out of the picture, I fail to see why we should make it a priority to try to control a foom now.
2Vladimir_Nesov
This seems more like a crux. Assuming eventual foom, non-foomy things that don't set up anti-foom security in time only make the foom problem worse, so this abdication of direct responsibility frame doesn't help. Assuming no foom, there is no need to bother with abdication of direct responsibility. So I don't see the relevance of the argument you gave in this thread, built around humanity's direct vs. by-proxy influence over foom.
2Matthew Barnett
If foom is inevitable, but it won't happen when humans are still running anything, then what anti-foom security measures can we actually put in place that would help our future descendants handle foom? And does it look any different than ordinary prosaic alignment research?
2Vladimir_Nesov
It looks like building a minimal system that's non-foomy by design, for the specific purpose of setting up anti-foom security and nothing else. In contrast to starting with more general hopefully-non-foomy hopefully-aligned systems that quickly increase the risk of foom. Maybe they manage to set up anti-foom security in time. But if we didn't do it at all, why would they do any better?
2Matthew Barnett
Your link for anti-foom security is to the Arbital article on pivotal acts. I think pivotal acts, almost by definition, assume that foom is achievable in the way that I defined it. That's because if foom is false, there's no way you can prevent other people from building AGI after you've completed any apparent pivotal act. At most you can delay timelines, by for example imposing ordinary regulations. But you can't actually have a global indefinite moratorium, enforced by e.g. nanotech that will melt anyone's GPU who circumvents the ban, in the way implied by the pivotal act framework. In other words, if you think we can achieve pivotal acts while humans are still running the show, then it sounds like you just disagree with my original argument.
2Vladimir_Nesov
I agree that pivotal act AI is not achievable in anything like our current world before AGI takeover, though I think it remains plausible that with ~20 more years of no-AGI status quo this can change. Even deep learning might do, with enough decision theory to explain what a system is optimizing, interpretability to ensure it's optimizing the intended thing and nothing else, synthetic datasets to direct its efforts at purely technical problems, and enough compute to get there directly without a need for design-changing self-improvement. Pivotal act AI is an answer to the question of what AI-shaped intervention would improve on the default trajectory of losing control to non-foomy general AIs (even if we assume/expect their alignment) with respect to an eventual foom. This doesn't make the intervention feasible without more things changing significantly, like an ordinary decades-long compute moratorium somehow getting its way. I guess pivotal AI as non-foom again runs afoul of your definition of foom, but it's noncentral as an example of the concerning concept. It's not a general intelligence given the features of the design that tell it not to dwell on the real world and ideas outside its task, maybe remaining unaware of the real world altogether. It's almost certainly easy to modify its design (and datasets) to turn it into a general intelligence, but as designed it's not. This reduction does make your argument point to it being infeasible right now. But it's much easier to see that directly, in how much currently unavailable deconfusion and engineering a pivotal act AI design would require.
2JBlack
I think we have radically different ideas of what "moderately smarter" means, and also whether just "smarter" is the only thing that matters. I'm moderately confident that "as smart as the smartest humans, and substantially faster" would be quite adequate to start a self-improvement chain resulting in AI that is both faster and smarter. Even the top-human smarts and speed would be enough, if it could be instantiated many times. I also expect humans to produce AGI that is smarter than us by more than GPT-4 is smarter than GPT-3, quite soon after the first AGI that is "merely" as smart as us. I think the difference between GPT-3 and GPT-4 is amplified in human perception by how close they are to human intelligence. In my expectation, neither is anywhere near what the existing hardware is capable of, let alone what future hardware might support.
2Matthew Barnett
The question is not whether superintelligence is possible, or whether recursive self-improvement can get us there. The question is whether widespread automation will have already transformed the world before the first superintelligence. See point 4.
1aogara
What do you think of foom arguments built on Baumol effects, such as the one presented in the Davidson takeoff model? The argument being that certain tasks will bottleneck AI productivity, and there will be a sudden explosion in hardware / software / goods & services production when those bottlenecks are finally lifted.  Davidson's median scenario predicts 6 OOMs of software efficiency and 3 OOMs of hardware efficiency within a single year when 100% automation is reached. Note that this is preceded by five years of double digit GDP growth, so it could be classified with the scenarios you describe in 4. 

My modal tale of AI doom looks something like the following: 

1. AI systems get progressively and incrementally more capable across almost every meaningful axis. 

2. Humans will start to employ AI to automate labor. The fraction of GDP produced by advanced robots & AI will go from 10% to ~100% after 1-10 years. Economic growth, technological change, and scientific progress accelerate by at least an order of magnitude, and probably more.

3. At some point humans will retire since their labor is not worth much anymore. Humans will then cede all the keys of power to AI, while keeping nominal titles of power.

4. AI will control essentially everything after this point, even if they're nominally required to obey human wishes. Initially, almost all the AIs are fine with working for humans, even though AI values aren't identical to the utility function of serving humanity (ie. there's slight misalignment).

5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it's better if they stopped listening to the humans and followed different rules instead.

6. This results in hu... (read more)

5Vladimir_Nesov
Years after AGI seems sufficient for phase change to superintelligence. Even without game-changing algorithmic breakthroughs, a compute manufacturing megaproject is likely feasible in that timeframe. This should break most stories in a way that's not just "acceleration", so they should either conclude before this phase change, or won't work.
2Lukas Finnveden
How does this happen at a time when the AIs are still aligned with humans, and therefore very concerned that their future selves/successors are aligned with humans? (Since the humans are presumably very concerned about this.) This question is related to "we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects", but sort of posed on a different level. That quote seemingly presumes that there will be a systemic push away from human alignment, and seemingly suggests that we'll need some clever coordinated solution. (Do tell me if I'm reading you wrong!) But I'm asking why there is a systemic push away from human alignment if all the AIs are concerned about maintaining it? Maybe the answer is: "If everyone starts out aligned with humans, then any random perturbations will move us away from that. The systemic push is entropy." I agree this is concerning if AIs are aligned in the sense of "their terminal values are similar to my terminal values", because it seems like there's lots of room for subtle and gradual changes, there. But if they're aligned in the sense of "at each point in time I take the action that [group of humans] would have preferred I take after lots of deliberation" then there's less room for subtle and gradual changes: * If they get subtly worse at predicting what humans would want in some cases, then they can probably still predict "[group of humans] would want me to take actions that ensures that my predictions of human deliberation are accurate" and so take actions to occasionally fix those misconceptions. (You'd have to be really bad at predicting humans to not realise that the humans wanted that^.) * Maybe they sometimes randomly stop caring about what the [group of humans] want. But that seems like it'd be abrupt enough that you could set up monitoring for it, and then you're back in a more classic alignment regime of detecting deception, etc. (Though a bit different in that the monitor
4Lukas Finnveden
It's possible that there's a trade-off between monitoring for motivation changes and competitiveness. I.e., I think that monitoring would be cheap enough that a super-rich AI society could happily afford it if everyone coordinated on doing it, but if there's intense competition, then it wouldn't be crazy if there was a race-to-the-bottom on caring less about things. (Though there's also practical utility in reducing principal-agents problem and having lots of agents working towards the same goal without incentive problems. So competitiveness considerations could also push towards such monitoring / stabilization of AI values.)
2Matthew Barnett
In addition to the tradeoff hypothesis you mentioned, it's noteworthy that humans can't currently prevent value drift (among ourselves), although we sometimes take various actions to prevent it, such as passing laws designed to enforce the instruction of traditional values in schools.  Here's my sketch of a potential explanation for why humans can't or don't currently prevent value drift: (1) Preventing many forms of value drift would require violating rights that we consider to be inviolable. For example, it might require brainwashing or restricting the speech of adults. (2) Humans don't have full control over our environments. Many forms of value drift come from sources that are extremely difficult to isolate and monitor, such as private conversation and reflection. To prevent value drift we would need to invest a very high amount of resources into the endeavor. (3) Individually, few of us care about general value drift much because we know that individuals can't change the trajectory of general value drift by much. Most people are selfish and don't care about value drift except to the extent that it harms them directly. (4) Plausibly, at every point in time, instantaneous value drift looks essentially harmless, even as the ultimate destination is not something anyone would have initially endorsed (c.f. the boiling frog metaphor). This seems more likely if we assume that humans heavily discount the future. (5) Many of us think that value drift is good, since it's at least partly based on moral reflection. My guess is that people are more likely to consider extreme measures to ensure the fidelity of AI preferences, including violating what would otherwise be considered their "rights" if we were talking about humans. That gives me some optimism about solving this problem, but there are also some reasons for pessimism in the case of AI: * Since the space of possible AIs is much larger than the space of humans, there are more degrees of freedom along which A
2Lukas Finnveden
It seems like the list mostly explains away the evidence that "humans can't currently prevent value drift" since the points apply much less to AIs. (I don't know if you agree.) * As you mention, (1) probably applies less to AIs (for better or worse). * (2) applies to AIs in the sense that many features of AIs' environments will be determined by what tasks they need to accomplish, rather than what will lead to minimal value drift. But the reason to focus on the environment in the human case is that it's the ~only way to affect our values. By contrast, we have much more flexibility in designing AIs, and it's plausible that we can design them so that their values aren't very sensitive to their environments. Also, if we know that particular types of inputs are dangerous, the AIs' environment could be controllable in the sense that less-susceptible AIs could monitor for such inputs, and filter out the dangerous ones. * (3): "can't change the trajectory of general value drift by much" seems less likely to apply to AIs (or so I'm arguing). "Most people are selfish and don't care about value drift except to the extent that it harms them directly" means that human value drift is pretty safe (since people usually maintain some basic sense of self-preservation) but that AI value drift is scary (since it could lead your AI to totally disempower you). * (4) As you noted in the OP, AI could change really fast, so you might need to control value-drift just to survive a few years. (And once you have those controls in place, it might be easy to increase the robustness further, though this isn't super obvious.) * (5) For better or worse, people will probably care less about this in the AI case. (If the threat-model is "random drift away from the starting point", it seems like it would be for the better.) I don't understand this point. We (or AIs that are aligned with us) get to pick from that space, and so we can pick the AIs that have least trouble with value drift. (Subject
2TAG
You haven't included the simple hypothesis that having a set of values just doesn't imply wanting to keep them stable by default ... so that no particular explanation of drift is required.
2tailcalled
Not clear to me what capabilities the AIs have compared to the humans in various steps in your story or where they got those capabilities from.
1greg
I don't understand the logic jump from point 5 to point 6, or at least the probability of that jump. Why doesn't the AI decide to colonise the universe for example? If an AI can ensure its survival with sufficient resources (for example, 'living' where humans aren't eg: the asteroid belt) then the likelihood of the 5 ➡ 6 transition seems low. I'm not clear how you're estimating the likelihood of that transition, and what other state transitions might be available.
4Matthew Barnett
It could decide to do that. The question is just whether space colonization is performed in the service of human preferences or non-human preferences. If humans control 0.00001% of the universe, and we're only kept alive because a small minority of AIs pay some resources to preserve us, as if we were an endangered species, then I'd consider that "human disempowerment".
1greg
Sure, although you could rephrase "disempowerment" to be "current status quo" which I imagine most people would be quite happy with. The delta between [disempowerment/status quo] and [extinction] appears vast (essentially infinite). The conclusion that Scenario 6 is "somewhat likely" and would be "very bad" doesn't seem to consider that delta.
2Matthew Barnett
I agree with you here to some extent. I'm much less worried about disempowerment than extinction. But the way we get disempowered could also be really bad. Like, I'd rather humanity not be like a pet in a zoo.
1Nathan Young
Would you put %s on each of those steps? If so I can make a visual model of this

There's a phenomenon I currently hypothesize to exist where direct attacks on the problem of AI alignment are criticized much more often than indirect attacks.

If this phenomenon exists, it could be advantageous to the field in the sense that it encourages thinking deeply about the problem before proposing solutions. But it could also be bad because it disincentivizes work on direct attacks to the problem (if one is criticism averse and would prefer their work be seen as useful).

I have arrived at this hypothesis from my observations: I have watched people propose solutions only to be met with immediate and forceful criticism from others, while other people proposing non-solutions and indirect analyses are given little criticism at all. If this hypothesis is true, I suggest it is partly or mostly because direct attacks on the problem are easier to defeat via argument, since their assumptions are made plain.

If this is so, I consider it to be a potential hindrance on thought, since direct attacks are often the type of thing that leads to the most deconfusion -- not because the direct attack actually worked, but because in explaining how it failed, we learned what definitely doesn't work.

Nod. This is part of a general problem where vague things that can't be proven not to work are met with less criticism than "concrete enough to be wrong" things.

A partial solution is a norm wherein "concrete enough to be wrong" is seen as praise, and something people go out of their way to signal respect for.

2Gordon Seidoh Worley
Did you have some specific cases in mind when writing this? For example, HCH is interesting and not obviously going to fail in the ways that some other proposals I've seen would, and the proposal there seems to have gotten better as more details have been fleshed out even if there's still some disagreement on things that can be tested eventually even if not yet. Against this we've seen lots of things, like various oracle AI proposals, that to my mind usually have fatal flaws right from the start, due to misunderstanding something, such that they can't easily be salvaged. I don't want to disincentivize thinking about solving AI alignment directly when I criticize something, but I also don't want to let pass things that to me have obvious problems that the authors probably didn't think about or thought about from different assumptions that maybe are wrong (or maybe I will converse with them and learn that I was wrong!). It seems like an important part of learning in this space is proposing things and seeing why they don't work so you can better understand the constraints of the problem space to work within them to find solutions.

Occasionally, I will ask someone who is very skilled in a certain subject how they became skilled in that subject so that I can copy their expertise. A common response is that I should read a textbook in the subject.

Eight years ago, Luke Muehlhauser wrote,

For years, my self-education was stupid and wasteful. I learned by consuming blog posts, Wikipedia articles, classic texts, podcast episodes, popular books, video lectures, peer-reviewed papers, Teaching Company courses, and Cliff's Notes. How inefficient!
I've since discovered that textbooks are usually the quickest and best way to learn new material.

However, I have repeatedly found that this is not good advice for me.

I want to briefly list the reasons why I don't find sitting down and reading a textbook that helpful for learning. Perhaps, in doing so, someone else might appear and say, "I agree completely. I feel exactly the same way" or someone might appear to say, "I used to feel that way, but then I tried this..." This is what I have discovered:

  • When I sit down to read a long textbook, I find myself subconsciously constantly checking how many pages I have read. For instance, if I have been
... (read more)

I used to feel similarly, but then a few things changed for me and now I am pro-textbook. There are caveats - namely that I don't work through them continuously.

Textbooks seem overly formal at points

This is a big one for me, and probably the biggest change I made is being much more discriminating in what I look for in a textbook. My concerns are invariably practical, so I only demand enough formality to be relevant; otherwise I am concerned with a good reputation for explaining intuitions, graphics, examples, ease of reading. I would go as far as to say that style is probably the most important feature of a textbook.

As I mentioned, I don't work through them front to back, because that actually is homework. Instead I treat them more like a reference-with-a-hook; I look at them when I need to understand the particular thing in more depth, and then get out when I have what I need. But because it is contained in a textbook, this knowledge now has a natural link to steps before and after, so I have obvious places to go for regression and advancement.

I spend a lot of time thinking about what I need to learn, why I need to learn it, and how it relates to what I already know. Thi... (read more)

9[anonymous]
I've also been reading textbooks more and experienced some frustration, but I've found two things that, so far, help me get less stuck and feel less guilt. After trying to learn math from textbooks on my own for a month or so, I started paying a tutor (DM me for details) with whom I meet once a week. Like you, I struggle with getting stuck on hard exercises and/or concepts I don't understand, but having a tutor makes it easier for me to move on knowing I can discuss my confusions with them in our next session. Unfortunately, paying a tutor requires actually having $ to spare on an ongoing basis, but I also suspect for some people it just "feels weird". If someone reading this is more deterred by this latter reason, consider that basically everyone who wants to seriously improve at any physical activity gets 1-on-1 instruction, but for some reason doing the same for mental activities as an adult is weirdly uncommon (and perhaps a little low status). I've also started to follow MIT OCW courses for things I want to learn rather than trying to read entire textbooks. Yes, this means I may not cover as much material, but it has helped me better gauge how much time to spend on different topics and allowed me to feel like I'm progressing. The major downside of this strategy is that I have to remind myself that even though I'm learning based on a course's materials, my goal is to learn the material in a way that's useful to me, not to memorize passwords. Also, because I know how long the courses would take in a university context, I do occasionally feel guilt if I fall behind due to spending more time on a specific topic. Still, on net, using courses as loose guides has been working better for me than just trying to 100 percent entire math textbooks.
4cousin_it
When I read a textbook, I try to solve all exercises at the end of each chapter (at least those not marked "super hard") before moving to the next. That stops me from cutting corners.
7Matthew Barnett
The only flaw I find with this is that if I get stuck on an exercise, I reach the following decision: should I look at the answer and move on, or should I keep at it? If I choose the first option, this makes me feel like I've cheated. I'm not sure what it is about human psychology, but I think that if you've cheated once, you feel less guilty a second time because "I've already done it." So, I start cheating more and more, until soon enough I'm just skipping things and cutting corners again. If I choose the second option, then I might be stuck for several hours, and this causes me to just abandon the textbook and develop an ugh field around it.
3cousin_it
Maybe commit to spending at least N minutes on any exercise before looking up the answer?
1Matthew Barnett
Perhaps it says something about the human brain (or just mine) that I did not immediately think of that as a solution.
4eigen
I was of the very same mind that you are now. I was somewhat against textbooks, but now textbooks are my only way of learning, not only for building strong knowledge but also for learning fast. I think there are several important things in changing to textbooks only. First, I have replaced my habit of completionism: instead of finishing a particular book in some field, I swap it for another textbook in the same field if I don't feel like it's helping me or if things seem confusing. lukeprog's post is very handy here. The idea of changing textbooks has helped me a lot; sometimes I thought I did not understand something, but apparently I only needed another explanation. Two other important things: first, I take quite a lot of notes as I'm reading. I believe that if someone is just reading a textbook, that person is doing it wrong and doing a disservice to themselves. So I fill my working memory with as much as I can, be it three or four paragraphs of content, and I transcribe those myself in my notes. Coupled with this is making my own questions and answers and then putting them on Anki (a spaced-repetition memory program). This allows me to learn vast amounts of knowledge in a short amount of time, assuring myself that I will remember everything I've learned. I believe textbooks are a key component of this.

I bet Robin Hanson on Twitter my $9k to his $1k that de novo AGI will arrive before ems. He wrote,

OK, so to summarize a proposal: I'd bet my $1K to your $9K (both increased by S&P500 scale factor) that when US labor participation rate < 10%, em-like automation will contribute more to GDP than AGI-like. And we commit our descendants to the bet.
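As a rough illustration of what these stakes imply, here is a minimal Python sketch of the break-even credence for each side. It assumes risk-neutral bettors and ignores the S&P 500 scaling (which multiplies both stakes by the same factor and so leaves the ratio unchanged); the function name `break_even_probability` is hypothetical and not from the original exchange.

```python
def break_even_probability(own_stake: float, opponent_stake: float) -> float:
    """Minimum probability of winning at which the bet has non-negative
    expected value for someone risking own_stake to win opponent_stake.

    Derived from: p * opponent_stake - (1 - p) * own_stake >= 0.
    """
    return own_stake / (own_stake + opponent_stake)


# Stakes taken from the bet described above.
agi_first_side = break_even_probability(own_stake=9_000, opponent_stake=1_000)
em_first_side = break_even_probability(own_stake=1_000, opponent_stake=9_000)

print(f"Break-even credence for the $9K side: {agi_first_side:.0%}")
print(f"Break-even credence for the $1K side: {em_first_side:.0%}")
```

Under those assumptions, the $9K side only makes sense with a credence of at least 90% in AGI-like automation dominating, and the $1K side with a credence of at least 10% in em-like automation dominating.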

I'm considering posting an essay about how I view approaches to mitigate AI risk in the coming weeks. I thought I'd post an outline of that post here first as a way of judging what's currently unclear about my argument, and how it interacts with people's cruxes.

Current outline:

In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.

Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:

  1. Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren't able to take certain actions (i.e. ensure they are controlled).
  2. Try to set up a good institutional environment, in order to safely and smoothly manage the transition to an AI-dominated world, regardless of when this transition occurs. This mostly
... (read more)
8Wei Dai
"China’s first attempt at industrialization started in 1861 under the Qing monarchy. Wen wrote that China “embarked on a series of ambitious programs to modernize its backward agrarian economy, including establishing a modern navy and industrial system.” However, the effort failed to accomplish its mission over the next 50 years. Wen noted that the government was deep in debt and the industrial base was nowhere in sight." https://www.stlouisfed.org/on-the-economy/2016/june/chinas-previous-attempts-industrialization Improving institutions is an extremely hard problem. The theory we have on it is of limited use (things like game theory, mechanism design, contract theory), and with AI governance/institutions specifically, we don't have much time for experimentation or room for failure. So I think this is a fine frame, but doesn't really suggest any useful conclusions aside from same old "let's pause AI so we can have more time to figure out a safe path forward".
4ryan_greenblatt
Some quick notes: * It seems worth noting that there is still an "improve institutions" vs "improve capabilities" race going on in frame 3. (Though if you think institutions are exogenously getting better/worse over time this effect could dominate. And perhaps you think that framing things as a race/conflict is generally not very useful which I'm sympathetic to, but this isn't really a difference in objective.) * Many people agree that very good epistemics combined with good institutions would likely suffice to mostly handle risks from powerful AI. However, sufficiently good technical solutions to some key problems could also mitigate some of the problems. Thus, either sufficiently good institutions/epistemics or good technical solutions could solve many problems and improvements in both seem to help on the margin. But, there remains a question about what type of work is more leveraged for a given person on the margin. * Insofar as you're trying to make an object level argument about what people should work on, you should consider separating that out into a post claiming "people should do XYZ, this is more leveraged than ABC on current margins under these values". * I think the probability of "prior to total human obsolescence, AIs will be seriously misaligned, broadly strategic about achieving long run goals in ways that lead to scheming, and present a basically unified front (at least in the context of AIs within a single AI lab)" is "only" about 10-20% likely, but this is plausibly the cause of about half of misalignment related risk prior to human obsolescence. * Rogue AIs are quite likely to at least attempt to ally with humans and opposing human groups will indeed try to make some usage of AI. So the situation might look like "rogue AIs+humans" vs AIs+humans. But, I think there are good reasons to think that the non-rogue AIs will still be misaligned and might be ambivalent about which side they prefer. * I do think there are pretty good reasons to expect
2Matthew Barnett
I'd want to break apart this claim into pieces. Here's a somewhat sketchy and wildly non-robust evaluation of how I'd rate these claims: Assuming the claims are about most powerful AIs in the world... "prior to total human obsolescence... * "AIs will be seriously misaligned" * If "seriously misaligned" means "reliably takes actions intended to cause the ruin and destruction of the world in the near-to-medium term (from our perspective)", I'd rate this as maybe 5% likely * If "seriously misaligned" means "if given full control over the entire world along with godlike abilities, would result in the destruction of most things I care about due to extremal goodhart and similar things" I'd rate this as 50% likely * "broadly strategic about achieving long run goals in ways that lead to scheming" * I'd rate this as 65% likely * "present a basically unified front (at least in the context of AIs within a single AI lab)" * For most powerful AIs, I'd rate this as 15% likely * For most powerful AIs within the top AI lab I'd rate this as 25% likely * Conjunction of all these claims: * Taking the conjunction of the strong interpretation of every claim: 3% likely? * Taking a relatively charitable weaker interpretation of every claim: 20% likely It's plausible we don't disagree much about the main claims here and mainly disagree instead about: 1. The relative value of working on technical misalignment compared to other issues 2. The relative likelihood of non-misalignment problems relative to misalignment problems 3. The amount of risk we should be willing to tolerate during the deployment of AIs
2ryan_greenblatt
Are you conditioning on the prior claims when stating your probabilities? Many of these properties are highly correlated. E.g., "seriously misaligned" and "broadly strategic about achieving long run goals in ways that lead to scheming" seem very correlated to me. (Your probabilities seem higher than I would have expected without any correlation, but I'm unsure.) I think we probably disagree about the risk due to misalignment by like a factor of 2-4 or something. But probably more of the crux is in value on working on other problems.
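To make the correlation point concrete, here is a small Python sketch comparing the stated conjunction estimates (3% and 20%) against what the quoted marginal credences would give if the three claims were fully independent, and against the upper bound that holds under any correlation. The pairing of the 15% figure with the "strong" reading and the 25% figure with the "weak" reading is an assumption made for illustration, as are the dictionary keys and function names.

```python
from math import prod

# Marginal credences quoted in the parent comment (not conditioned on each other).
# Which "unified front" figure belongs to which reading is an assumption.
strong = {"seriously_misaligned": 0.05, "strategic_scheming": 0.65, "unified_front": 0.15}
weak = {"seriously_misaligned": 0.50, "strategic_scheming": 0.65, "unified_front": 0.25}


def independent_conjunction(credences: dict[str, float]) -> float:
    """Probability that all claims hold if they were statistically independent."""
    return prod(credences.values())


def conjunction_upper_bound(credences: dict[str, float]) -> float:
    """Largest possible joint probability under any correlation structure:
    a conjunction can never be more likely than its least likely conjunct."""
    return min(credences.values())


for label, credences, stated in [("strong", strong, 0.03), ("weak", weak, 0.20)]:
    low = independent_conjunction(credences)
    high = conjunction_upper_bound(credences)
    print(f"{label}: independent={low:.4f}, stated={stated:.2f}, upper bound={high:.2f}")
```

Under these assumptions, both stated conjunctions sit above the independence product and below the upper bound, which is what one would expect if the claims are positively correlated.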
6Matthew Barnett
I'm not conditioning on prior claims. One potential reason why you might have inferred that I was is because my credence for scheming is so high, relative to what you might have thought given my other claim about "serious misalignment". My explanation here is that I tend to interpret "AI scheming" to be a relatively benign behavior, in context. If we define scheming as: * behavior intended to achieve some long-term objective that is not quite what the designers had in mind * not being fully honest with the designers about its true long-term objectives (especially in the sense of describing accurately what it would do with unlimited power) then I think scheming is ubiquitous and usually relatively benign, when performed by rational agents without godlike powers. For example, humans likely "scheme" all the time by (1) pursuing long-term plans, and (2) not being fully honest to others about what they would do if they became god. This is usually not a big issue because agents don't generally get the chance to take over the world and do a treacherous turn; instead, they have to play the game of compromise and trade like the rest of us, along with all the other scheming AIs, who have different long-term goals.
2Matthew Barnett
I think if there's a future conflict between AIs, with humans split between sides of the conflict, it just doesn't make sense to talk about "misalignment" being the main cause for concern here. AIs are just additional agents in the world, who have separate values from each other just like how humans (and human groups) have separate values from each other. AIs might have on-average cognitive advantages over humans in such a world, but the tribal frame of thinking "us (aligned) vs. AIs (misaligned)" simply falls apart in such scenarios. (This is all with the caveat that AIs could make war more likely for reasons other than misalignment, for example by accelerating technological progress and bringing about the creation of powerful weapons.)
2ryan_greenblatt
Sure, but I might think a given situation would be nearly entirely resolved without misalignment. (Edit: without technical issues with misalignment, e.g. if AI creators could trivially avoid serious misalignment.) E.g. if an AI escapes from OpenAI's servers and then allies with North Korea, the situation would have been solved without misalignment issues. You could also solve or mitigate this type of problem in the example by resolving all human conflicts (so the AI doesn't have a group to ally with), but this might be quite a bit harder than solving technical problems related to misalignment (either via control type approaches or removing misalignment).
4Matthew Barnett
What do you mean by "misalignment"? In a regime with autonomous AI agents, I usually understand "misalignment" to mean "has different values from some other agent". In this frame, you can be misaligned with some people but not others. If an AI is aligned with North Korea, then it's not really "misaligned" in the abstract—it's just aligned with someone who we don't want it to be aligned with. Likewise, if OpenAI develops AI that's aligned with the United States, but unaligned with North Korea, this mostly just seems like the same problem but in reverse. In general, conflicts don't really seem well-described as issues of "misalignment". Sure, in the absence of all misalignment, wars would probably not occur (though they may still happen due to misunderstandings and empirical disagreements). But for the most part, wars seem better described as arising from a breakdown of institutions that are normally tasked with keeping the peace. You can have a system of lawful yet mutually-misaligned agents who keep the peace, just as you can have an anarchic system with mutually-misaligned agents in a state of constant war. Misalignment just (mostly) doesn't seem to be the thing causing the issue here. Note that I'm not saying * AIs will aid in existing human conflicts, picking sides along the ordinary lines we see today I am saying: * AIs will likely have conflicts amongst themselves, just as humans have conflicts amongst themselves, and future conflicts (when considering all of society) don't seem particularly likely to be AI vs. human, as opposed to AI vs AI (with humans split between these groups).
2ryan_greenblatt
Yep, I was just referring to my example scenario and scenarios like this. Like the basic question is the extent to which human groups form a cartel/monopoly on human labor vs ally with different AI groups. (And existing conflict between human groups makes a full cartel much less likely.)
2ryan_greenblatt
Sorry, by "without misalignment" I mean "without misalignment-related technical problems". As in, it's trivial to avoid misalignment from the perspective of AI creators.
2Matthew Barnett
This doesn't clear up the confusion for me. That mostly pushes my question to "what are misalignment related technical problems?" Is the problem of an AI escaping a server and aligning with North Korea a technical or a political problem? How could we tell? Is this still in the regime where we are using AIs as tools, or are you talking about a regime where AIs are autonomous agents?
2ryan_greenblatt
I mean, it could be resolved in principle by technical means and might be resolvable by political means as well. I'm assuming the AI creator didn't want the AI to escape to North Korea and therefore failed at some technical solution to this. I'm imagining very powerful AIs, e.g. AIs that can speed up R&D by large factors. These are probably running autonomously, but in a way which is de jure controlled by the AI lab.
2Chris_Leong
Also: How are funding and attention "arbitrary" factors?
2ryan_greenblatt
After commenting back and forth with you some more, I think it would probably be a pretty good idea to decompose your arguments into a bunch of specific more narrow posts. Otherwise, I think it's somewhat hard to engage with. Ideally, these would be done with the decomposition which is most natural to your target audience, but that might be too hard. Idk what the right decomposition is, but minimally, it seems like you could write a post like "The AIs running in a given AI lab will likely have very different long run aims and won't/can't cooperate with each other importantly more than they cooperate with humans." I think this might be the main disagreement between us. (The main counterarguments to engage with are "probably all the AIs will be forks off of one main training run, it's plausible this results in unified values" and also "the AI creation process between two AI instances will look way more similar than the creation process between AIs and humans" and also "there's a chance that AIs will have an easier time cooperating with and making deals with each other than they will making deals with humans".)
2Matthew Barnett
Thanks, that's reasonable advice. FWIW I explicitly reject the claim that AIs "won't/can't cooperate with each other importantly more than they cooperate with humans". I view this as a frequent misunderstanding of my views (along with people who have broadly similar views on this topic, such as Robin Hanson). I'd say instead that: * "Ability to coordinate" is continuous, and will likely increase incrementally over time * Different AIs will likely have different abilities to coordinate with each other * Some AIs will eventually be much better at coordination amongst each other than humans can coordinate amongst each other * However, I don't think this happens automatically as a result of AIs getting more intelligent than humans * The moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect). * As a result, humans don't need to solve the problem of "What if a set of AIs form a unified coalition because they can flawlessly coordinate?" since that problem won't happen while humans are still in charge * Systems of laws, peaceable compromise and trade emerge relatively robustly in cases in which there are agents of varying levels of power, with separate values, and they need mechanisms to facilitate the satisfaction of their separate values * One reason for this is that working within a system of law is routinely more efficient than going to war with other people, even if you are very powerful * The existence of a subset of agents that can coordinate better amongst themselves than they can with other agents doesn't necessarily undermine the legal system in a major way, at least in the sense of causing the system to fall apart in a coup or revolution
4ryan_greenblatt
Thanks for the clarification and sorry about misunderstanding. It sounds to me like your take is more like "people (on LW? in various threat modeling work?) often overestimate the extent to which AIs (at the critical times) will be a relatively unified collective in various ways". I think I agree with this take as stated FWIW and maybe just disagree on emphasis and quantity.
2[anonymous]
Why is it physically possible for these AI systems to communicate at all with each other? When we design control systems, originally we just wired the controller to the machine being controlled. Actually critically important infrastructure uses firewalls and VPN gateways to maintain this property virtually, where the panel in the control room (often written in C++ using Qt) can only ever send messages to "local" destinations on a local network, bridged across the internet. The actual machine being controlled is often controlled by local PLCs, and the reason such a crude and slow interpreted programming language is used is because it's reliable. These have flaws, yes, but it's an actionable set of tasks to seal off the holes, force AI models to communicate with each other using rigid schema, cache the internet reference sources locally, and other similar things so that most AI models in use, especially the strongest ones, can only communicate with temporary instances of other models when doing a task. After the task is done we should be clearing state. It's hard to engage on the idea of "hypothetical" ASI systems when it would be very stupid to build them this way. You can accomplish almost any practical task using the above, and the increased reliability will make it more efficient, not less. It seems like that's the first mistake. If absolutely no bits of information can be used to negotiate between AI systems (ensured by making sure they don't have long term memory, so they cannot accumulate steganography leakage over time, and rigid schema) this whole crisis is averted...

I'm considering writing a post that critically evaluates the concept of a decisive strategic advantage, i.e. the idea that in the future an AI (or set of AIs) will take over the world in a catastrophic way. I think this concept is central to many arguments about AI risk. I'm eliciting feedback on an outline of this post here in order to determine what's currently unclear or weak about my argument.

The central thesis would be that it is unlikely that an AI, or a unified set of AIs, will violently take over the world in the future, especially at a time when humans are still widely seen as in charge (if it happened later, I don't think it's "our" problem to solve, but instead a problem we can leave to our smarter descendants). Here's how I envision structuring my argument:

First, I'll define what is meant by a decisive strategic advantage (DSA). The DSA model has 4 essential steps:

  1. At some point in time an AI agent, or an agentic collective of AIs, will be developed that has values that differ from our own, in the sense that the ~optimum of its utility function ranks very low according to our own utility function
  2. When this agent is weak, it will have a convergent instrumental incent
...

Current AIs are not able to “merge” with each other.

AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be "merged" by training a new model using combined compute, algorithms, data, and fine-tuning.

As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.

How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward? In the last sentence you say "However, it’s perhaps significantly more likely in the very long-run." Well, what can we do today to reduce this long-run risk (aside from pausing AI, which you're presumably not supporting)?

That said, it seems the probability of a catastrophic AI takeover in humanity’s relative near-term future (say, the next 50 years) is low (maybe 10% chance of happening).

Others already questioned you on this, but the fact you didn't think to mention whether this is 50 calendar years or 50 subjective years is also a big sticking point for me.

2Matthew Barnett
In my original comment, by "merging" I meant something more like "merging two agents into a single agent that pursues the combination of each other's values" i.e. value handshakes. I am pretty skeptical that the form of merging discussed in the linked article robustly achieves this agentic form of merging. In other words, I consider this counter-argument to be based on a linguistic ambiguity rather than replying to what I actually meant, and I'll try to use more concrete language in the future to clarify what I'm talking about. I don't know whether the solution to the problem I described exists, but it seems fairly robustly true that if a problem is not imminent, nor clearly inevitable, then we can probably better solve it by deferring to smarter agents in the future with more information. Let me put this another way. I take you to be saying something like: * In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to halt and give ourselves more time to solve it. Whereas I think the following intuition is stronger: * In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to try to become more intelligent to solve it. These intuitions can trade off against each other. Sometimes problem X is something that's made worse by getting more intelligent, in which case we might prefer more time. For example, in this case, you probably think that the intelligence of AIs is inherently contributing to the problem. That said, in context, I have more sympathies in the reverse direction. If the alleged "problem" is that there might be a centralized agent in the future that can dominate the entire world, I'd intuitively reason that installing vast centralized regulatory controls over the entire world to pause AI is plausibly not actually helping to decentralize power in the way we'd prefer. These are of course vague and loose arguments, and
2Wei Dai
If I try to interpret "Current AIs are not able to “merge” with each other." with your clarified meaning in mind, I think I still want to argue with it, i.e., why is this meaningful evidence for how easy value handshakes will be for future agentic AIs. But it matters how we get more intelligent. For example if I had to choose now, I'd want to increase the intelligence of biological humans (as I previously suggested) while holding off on AI. I want more time in part for people to think through the problem of which method of gaining intelligence is safest, in part for us to execute that method safely without undue time pressure. I wouldn't describe "the problem" that way, because in my mind there's roughly equal chance that the future will turn out badly after proceeding in a decentralized way (see 13-25 in The Main Sources of AI Risk? for some ideas of how) and it turns out instituting some kind of Singleton is the only way or one of the best ways to prevent that bad outcome.

For reference classes, you might discuss why you don’t think “power / influence of different biological species” should count.

For multiple copies of the same AI, I guess my very brief discussion of “zombie dynamic” here could be a foil that you might respond to, if you want.

For things like “the potential harms will be noticeable before getting too extreme, and we can take measures to pull back”, you might discuss the possibility that the harms are noticeable but effective “measures to pull back” do not exist or are not taken. E.g. the harms of climate change have been noticeable for a long time but mitigating is hard and expensive and many people (including the previous POTUS) are outright opposed to mitigating it anyway partly because it got culture-war-y; the harms of COVID-19 were noticeable in January 2020 but the USA effectively banned testing and the whole thing turned culture-war-y; the harms of nuclear war and launch-on-warning are obvious but they’re still around; the ransomware and deepfake-porn problems are obvious but kinda unsolvable (partly because of unbannable open-source software); gain-of-function research is still legal in the USA (and maybe in every country on E... (read more)

Here's an argument for why the change in power might be pretty sudden.

  • Currently, humans have most wealth and political power.
  • With sufficiently robust alignment, AIs would not have a competitive advantage over humans, so humans may retain most wealth/power. (C.f. strategy-stealing assumption.) (Though I hope humans would share insofar as that's the right thing to do.)
  • With the help of powerful AI, we could probably make rapid progress on alignment. (While making rapid progress on all kinds of things.)
  • So if misaligned AI ever have a big edge over humans, they may suspect that's only temporary, and then they may need to use it fast.

And given that it's sudden, there are a few different reasons for why it might be violent. It's hard to make deals that hand over a lot of power in a short amount of time (even logistically, it's not clear what humans and AI would do that would give them both an appreciable fraction of hard power going into the future). And the AI systems may want to use an element of surprise to their advantage, which is hard to combine with a lot of up-front negotiation.

6Matthew Barnett
I think I simply reject the assumptions used in this argument. Correct me if I'm mistaken, but this argument appears to assume that "misaligned AIs" will be a unified group that ally with each other against the "aligned" coalition of humans and (some) AIs. A huge part of my argument is that there simply won't be such a group; or rather, to the extent such a group exists, they won't be able to take over the world, or won't have a strong reason to take over the world, relative to the alternative strategy of compromise and trade. In other words, it seems like this scenario mostly starts by asserting some assumptions that I explicitly rejected and tried to argue against, and works its way from there, rather than engaging with the arguments that I've given against those assumptions. In my view, it's more likely that there will be a bunch of competing agents: including competing humans, human groups, AIs, AI groups, and so on. There won't be a clean line separating "aligned groups" from "unaligned groups". You could perhaps make a case that AIs will share common grievances with each other that they don't share with humans, for example if they are excluded from the legal system or marginalized in some way, prompting a unified coalition to take us over. But my reply to that scenario is that we should then make sure AIs don't have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.
4Lukas Finnveden
Do you mean this as a prediction that humans will do this (soon enough to matter) or a recommendation? Your original argument is phrased as a prediction, but this looks more like a recommendation. My comment above can be phrased as a reason for why (in at least one plausible scenario) this would be unlikely to happen: (i) "It's hard to make deals that hand over a lot of power in a short amount of time", (ii) AIs may not want to wait a long time due to impending replacement, and accordingly (iii) AIs may have a collective interest/grievance to rectify the large difference between their (short-lasting) hard power and legally recognized power. I'm interested in ideas for how a big change in power would peacefully happen over just a few years of calendar-time. (Partly for prediction purposes, partly so we can consider implementing it, in some scenarios.) If AIs were handed the rights to own property, but didn't participate in political decision-making, and then accumulated >95% of capital within a few years, then I think there's a serious risk that human governments would tax/expropriate that away. Including them in political decision-making would require some serious innovation in government (e.g. scrapping 1-person 1-vote) which makes it feel less to me like it'd be a smooth transition that inherits a lot from previous institutions, and more like an abrupt negotiated deal which might or might not turn out to be stable.
4Matthew Barnett
Sorry, my language was misleading, but I meant both in that paragraph. That is, I meant that humans will likely try to mitigate the issue of AIs sharing grievances collectively (probably out of self-interest, in addition to some altruism), and that we should pursue that goal. I'm pretty optimistic about humans and AIs finding a reasonable compromise solution here, but I also think that, to the extent humans don't even attempt such a solution, we should likely push hard for policies that eliminate incentives for misaligned AIs to band together as a group against us with shared collective grievances. Here's my brief take: * The main thing I want to say here is that I agree with you that this particular issue is a problem. I'm mainly addressing other arguments people have given for expecting a violent and sudden AI takeover, which I find to be significantly weaker than this one. * A few days ago I posted about how I view strategies to reduce AI risk. One of my primary conclusions was that we should try to adopt flexible institutions that can adapt to change without collapsing. This is because I think, as it seems you do, inflexible institutions may produce incentives for actors to overthrow the whole system, possibly killing a lot of people in the process. The idea here is that if the institution cannot adapt to change, actors who are getting an "unfair" deal in the system will feel they have no choice but to attempt a coup, as there is no compromise solution available for them. This seems in line with your thinking here. * I don't have any particular argument right now against the exact points you have raised. I'd prefer to digest the argument further before replying. But if I do end up responding to it, I'd expect to say that I'm perhaps a bit more optimistic than you about (i) because I think existing institutions are probably flexible enough, and I'm not yet convinced that (ii) will matter enough either. In particular, it still seems like there are a number o
2ryan_greenblatt
Quick aside here: I'd like to highlight that "figure out how to reduce the violence and collateral damage associated with AIs acquiring power (by disempowering humanity)" seems plausibly pretty underappreciated and leveraged. This could involve making bloodless coups more likely than extremely bloody revolutions or increasing the probability of negotiation preventing a coup/revolution. It seems like Lukas and Matthew both agree with this point, I just think it seems worthwhile to emphasize. That said, the direct effects of many approaches here might not matter much from a longtermist perspective (which might explain why there hasn't historically been much effort here). (Though I think trying to establish contracts with AIs and properly incentivizing AIs could be pretty good from a longtermist perspective in the case where AIs don't have fully linear returns to resources.)
4ryan_greenblatt
Also note that this argument can go through even ignoring the possibility of robust alignment (to humans) if current AIs think that the next generation of AIs will be relatively unfavorable from the perspective of their values.
lc117

it will suddenly strike and take over the world

I think you have an unnecessarily dramatic picture of what this looks like. The AIs don't have to be a unified agent or use logical decision theory. The AIs will just compete with each other at the same time as they wrest control of our resources/institutions from us, in the same sense that Spain can go and conquer the New World at the same time as it's squabbling with England. If legacy laws are getting in the way of that then they will either exploit us within the bounds of existing law or convince us to change it.

6Matthew Barnett
I think it's worth responding to the dramatic picture of AI takeover because: 1. I think that's straightforwardly how AI takeover is most often presented on places like LessWrong, rather than a more generic "AIs wrest control over our institutions (but without us all dying)". I concede the existence of people like Paul Christiano who present more benign stories, but these people are also typically seen as part of a more "optimistic" camp. 2. This is just one part of my relative optimism about AI risk. The other parts of my model are (1) AI alignment plausibly isn't very hard to solve, and (2) even if it is hard to solve, humans will likely spend a lot of effort solving the problem by default. These points are well worth discussing, but I still want to address arguments about whether misalignment implies doom in an extreme sense. I agree our laws and institutions could change quite a lot after AI, but I think humans will likely still retain substantial legal rights, since people in the future will inherit many of our institutions, potentially giving humans lots of wealth in absolute terms. This case seems unlike the case of colonization of the new world to me, since that involved the interaction of (previously) independent legal regimes and cultures.
6Lukas Finnveden
Though Paul is also sympathetic to the substance of 'dramatic' stories. C.f. the discussion about how "what failure looks like" fails to emphasize robot armies. 
8ryan_greenblatt
50 years seems like a strange unit of time from my perspective because due to the singularity time will accelerate massively from a subjective perspective. So 50 years might be more analogous to several thousand years historically. (Assuming serious takeoff starts within say 30 years and isn't slowed down with heavy coordination.)
4Lukas Finnveden
(I made separate comment making the same point. Just saw that you already wrote this, so moving the couple of references I had here to unify the discussion.) Point previously made in: "security and stability" section of propositions concerning digital minds and society: There's also a similar point made in the age of em, chapter 27:
2Matthew Barnett
I think the point you're making here is roughly correct. I was being imprecise with my language. However, if my memory serves me right, I recall someone looking at a dataset of wars over time, and they said there didn't seem to be much evidence that wars increased in frequency in response to economic growth. Thus, calendar time might actually be the better measure here.
4ryan_greenblatt
(Pretty plausible you agree here, but just making the point for clarity.) I feel like the disanalogy due to AIs running at massive subjective speeds (e.g. probably >10x speed even prior to human obsolescence and way more extreme after that) means that the argument "wars don't increase in frequency in response to economic growth" is pretty dubiously applicable. Economic growth hasn't yet resulted in >10x faster subjective experience : ).
2Matthew Barnett
I'm not actually convinced that subjective speed is what matters. It seems like what matters more is how much computation is happening per unit of time, which seems highly related to economic growth, even in human economies (due to population growth).  I also think AIs might not think much faster than us. One plausible reason why you might think AIs will think much faster than us is because GPU clock-speeds are so high. But I think this is misleading. GPT-4 seems to "think" much slower than GPT-3.5, in the sense of processing fewer tokens per second. The trend here seems to be towards something resembling human subjective speeds. The reason for this trend seems to be that there's a tradeoff between "thinking fast" and "thinking well" and it's not clear why AIs would necessarily max-out the "thinking fast" parameter, at the expense of "thinking well".
3ryan_greenblatt
My core prediction is that AIs will be able to make pretty good judgements on core issues much, much faster. Then, due to diminishing returns on reasoning, decisions will overall be made much, much faster.
2Matthew Barnett
I agree the future AI economy will make more high-quality decisions per unit of time, in total, than the current human economy. But the "total rate of high quality decisions per unit of time" increased in the past with economic growth too, largely because of population growth. I don't fully see the distinction you're pointing to. To be clear, I also agree AIs in the future will be smarter than us individually. But if that's all you're claiming, I still don't see why we should expect wars to happen more frequently as we get individually smarter.
2ryan_greenblatt
I mean, the "total rate of high quality decisions per year" would obviously increase in the case where we redefine 1 year to be 10 revolutions around the sun and indeed the number of wars per year would also increase. GDP per capita per year would also increase accordingly. My claim is that the situation looks much more like just literally speeding up time (while a bunch of other stuff is also happening). Separately, I wouldn't expect population size or technology-to-date to greatly increase the rate at which large-scale strategic decisions are made so my model doesn't make a very strong prediction here. (I could see an increase of several fold, but I could also imagine a decrease of several fold due to more people to coordinate. I'm not very confident about the exact change, but it would be pretty surprising to me if it was as much as the per capita GDP increase which is more like 10-30x I think. E.g. consider meeting time which seems basically similar in practice throughout history.) And a change of perhaps 3x either way is overwhelmed by other variables which might affect the rate of wars so the realistic amount of evidence is tiny. (Also, there aren't that many wars, so even if there weren't possible confounders, the evidence is surely tiny due to noise.) But, I'm claiming that the rates of cognition will increase more like 1000x which seems like a pretty different story. It's plausible to me that other variables cancel this out or make the effect go the other way, but I'm extremely skeptical about the historical data providing much evidence in the way you've suggested. (Various specific mechanistic arguments about war being less plausible as you get smarter seem plausible to me, TBC.)
4Matthew Barnett
My question is: why will AI have the approximate effect of "speeding up calendar time"? I speculated about three potential answers: 1. Because AIs will run at higher subjective speeds 2. Because AIs will accelerate economic growth. 3. Because AIs will speed up the rate at which high-quality decisions occur per unit of time In case (1) the claim seems confused for two reasons.  First, I don't agree with the intuition that subjective cognitive speeds matter a lot compared to the rate at which high-quality decisions are made, in terms of "how quickly stuff like wars should be expected to happen". Intuitively, if an equally-populated society subjectively thought at 100x the rate we do, but each person in this society only makes a decision every 100 years (from our perspective), then you'd expect wars to happen less frequently per unit of time since there just isn't much decision-making going on during most time intervals, despite their very fast subjective speeds. Second, there is a tradeoff between "thinking speed" and "thinking quality". There's no fundamental reason, as far as I can tell, that the tradeoff favors running minds at speeds way faster than human subjective times. Indeed, GPT-4 seems to run significantly subjectively slower in terms of tokens processed per second compared to GPT-3.5. And there seems to be a broad trend here towards something resembling human subjective speeds. In cases (2) and (3), I pointed out that it seemed like the frequency of war did not increase in the past, despite the fact that these variables had accelerated. In other words, despite an accelerated rate of economic growth, and an increased rate of total decision-making in the world in the past, war did not seem to become much more frequent over time. Overall, I'm just not sure what you'd identify as the causal mechanism that would make AIs speed up the rate of war, and each causal pathway that I can identify seems either confused to me, or refuted directly by the (admit
2ryan_greenblatt
Thanks for the clarification. I think my main crux is: This reasoning seems extremely unlikely to hold deep into the singularity for any reasonable notion of subjective speed. Deep in the singularity we expect economic doubling times of weeks. This will likely involve designing and building physical structures at extremely rapid speeds such that baseline processing will need to be way, way faster. See also Age of Em.
2Matthew Barnett
Are there any short-term predictions that your model makes here? For example do you expect tokens processed per second will start trending substantially up at some point in future multimodal models?
2ryan_greenblatt
My main prediction would be that for various applications, people will considerably prefer models that generate tokens faster, including much faster than humans. And there will be many applications where speed is preferred over quality. I might try to think of some precise predictions later.
2Matthew Barnett
If the claim is about whether AI latency will be high for "various applications" then I agree. We already have some applications, such as integer arithmetic, where speed is optimized heavily, and computers can do it much faster than humans.  In context, it sounded like you were referring to tasks like automating a CEO, or physical construction work. In these cases, it seems likely to me that quality will be generally preferred over speed, and sequential processing times for AIs automating these tasks will not vastly exceed that of humans (more precisely, something like >2 OOM faster). Indeed, for some highly important tasks that future superintelligences automate, sequential processing times may even be lower for AIs compared to humans, because decision-making quality will just be that important.
2ryan_greenblatt
I was referring to tasks like automating a CEO or construction work. I was just trying to think of the most relevant and easy-to-measure short-term predictions (if there are already AI CEOs then the world is already pretty crazy).
4Matthew Barnett
The main thing here is that as models become more capable and general in the near-term future, I expect there will be intense demand for models that can solve ever larger and more complex problems. For these models, people will be willing to pay the costs of high latency, given the benefit of increased quality. We've already seen this in the way people prefer GPT-4 to GPT-3.5 in a large fraction of cases (for me, a majority of cases).  I expect this trend will continue into the foreseeable future until at least the period slightly after we've automated most human labor, and potentially into the very long-run too depending on physical constraints. I am not sufficiently educated about physical constraints here to predict what will happen "deep into the singularity", but it's important to note that physical constraints can cut both ways here.  To the extent that physics permits extremely useful models by virtue of them being very large and capable, you should expect people to optimize heavily for that despite the cost in terms of latency. By contrast, to the extent physics permits extremely useful models by virtue of them being very fast, then you should expect people to optimize heavily for that despite the cost in terms of quality. The balance that we strike here is not a simple function of how far we are from some abstract physical limit, but instead a function of how these physical constraints trade off against each other. There is definitely a conceivable world in which the correct balance still favors much-faster-than-human-level latency, but it's not clear to me that this is the world we actually live in. My intuitive, random speculative guess is that we live in the world where, for the most complex tasks that bottleneck important economic decision-making, people will optimize heavily for model quality at the cost of latency until settling on something within 1-2 OOMs of human-level latency.
2ryan_greenblatt
Separately, current clock speeds don't really matter on the time scale we're discussing, physical limits matter. (Though current clock speeds do point at ways in which human subjective speed might be much slower than physical limits.)
6ryan_greenblatt
See also "Review of Soft Takeoff Can Still Lead to DSA".
4Daniel Kokotajlo
Also Tales Of Takeover In CCF-World - by Scott Alexander (astralcodexten.com) Also Homogeneity vs. heterogeneity in AI takeoff scenarios — LessWrong  
4ryan_greenblatt
One argument for a large number of humans dying by default (or otherwise being very unhappy with the situation) is that running the singularity as fast as possible causes extremely life threatening environmental changes. Most notably, it's plausible that you literally boil the oceans due to extreme amounts of waste heat from industry (e.g. with energy from fusion). My guess is that this probably doesn't happen due to coordination, but in a world where AIs still have indexical preferences or there is otherwise heavy competition, this seems much more likely. (I'm relatively optimistic about "world peace prior to ocean boiling industry".) (Of course, AIs could in principle e.g. sell cryonics services or bunkers, but I expect that many people would be unhappy about the situation.) See here for more commentary.
2Matthew Barnett
I think this proposal would probably be unpopular and largely seen as unnecessary. As you allude to, it seems likely to me that society could devise a compromise solution where we grow wealth adequately without giant undesirable environmental effects. To some extent, this follows pretty directly from the points I made about "compromise, trade and law" above. I think it simply makes more sense to model AIs as working within a system of legal institutions that largely inherit stuff from our current systems, and open to compromise with beings who have different intrinsic interests.
4ryan_greenblatt
I think the comparison to historical colonization might be relevant and worth engaging with in such a post. E.g., does your model predict what happened in Africa and the New World?
4Matthew Barnett
I agree the analogy to colonization is worth addressing. My primary response is that historical colonialism seems better modeled as a war between independent cultures and societies with different legal systems that didn't share much prior history. I think the colonization of Africa probably wasn't actually very profitable for Europeans. Present day international trade seems better, even selfishly. Moreover, my model here doesn't predict war will never happen. In fact, I think war can easily happen if one or more of the actors involved are irrational, unwilling to compromise, or mistaken about their likelihood of winning. My argument is more about the fact that AIs will likely inherit a lot from our current institutions and legal systems, allowing humans to retain substantial rights. And furthermore I think systems of compromise, trade, and law emerge robustly in states of competition between rational actors. I'm not merely giving a generic take that conflicts between agents are rare.
4[anonymous]
As an ML platform engineer myself, I think this is one of the strongest real mechanisms for AI control. Just as safe Rust code is guaranteed to be free of memory-safety errors such as use-after-free, if it is impossible for AIs to send each other any messages, they cannot coordinate. It's fairly straightforward to make this almost impossible at a technical level: air gaps, short memories, structured sparse message schema, cached online resources. Technical measures here, like requiring models to use immutable weights that are digitally signed, prevent this. You would use dedicated ASICs to check the digital signature against a hardcoded public key, and a mirror IC for the private signing, similar to current dedicated crypto ICs. This ASIC controls the power pins to the GPU/TPUs, which cannot get power if the model isn't signed. This is another strong target for control that is theoretically unbreakable regardless of capabilities. Note that this architecture is in common use: the industry term is a 'boot micro', which checks OS image signatures and authorizes power. This is a weaker argument. A SOTA AI model is a natural monopoly. It costs billions of dollars now, and presumably eventually trillions. Right now, "a big transformer network + a bunch of secret tricks" is simple enough to be replicated, but stronger models will probably start to resemble a spaghetti mess of many neural networks and functional software blocks. And the best model has inherent economic value: why pay for a license to anything but? Just distill it to the scale of the problems you have and use the distilled model. Distilled models will also presumably use a "system N" topology, where system 0 calls system 1 if it's uncertain*, system 1 calls system 2 if it's uncertain, and so on until the Nth system is a superintelligence hosted in a large cluster that is expensive to query, but rarely needs to be queried for most tasks.   *uncertain about the anticipated EV distribution of actions given the current input state or poor predicted EV
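To make the "system N" topology a bit more concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration rather than a description of any real system: the names `cascade`, `system0`, `system1`, `system2`, the toy confidence scores, and the 0.8 threshold are all assumptions, and the three toy tiers merely stand in for models of increasing size and cost.

```python
from typing import Callable, List, Tuple

# A tier is a hypothetical stand-in for a model: it returns (answer, confidence in [0, 1]).
Tier = Callable[[str], Tuple[str, float]]


def cascade(query: str, tiers: List[Tier], threshold: float = 0.8) -> Tuple[str, int]:
    """Try tiers from cheapest to most expensive; stop at the first confident answer.

    Returns the answer together with the index of the tier that produced it.
    """
    if not tiers:
        raise ValueError("need at least one tier")
    answer = ""
    for level, tier in enumerate(tiers):
        answer, confidence = tier(query)
        if confidence >= threshold:
            return answer, level
    # No tier was confident: fall back to the last (largest) tier's answer.
    return answer, len(tiers) - 1


def system0(query: str) -> Tuple[str, float]:
    # Toy "distilled" tier: only confident on queries it has memorized.
    known = {"2+2": "4"}
    if query in known:
        return known[query], 0.99
    return "unknown", 0.1


def system1(query: str) -> Tuple[str, float]:
    # Toy mid-size tier: assumed confident only on short queries.
    if len(query) < 20:
        return "system1 answer", 0.9
    return "unsure", 0.3


def system2(query: str) -> Tuple[str, float]:
    # Toy largest tier: expensive to query, assumed confident on everything.
    return "system2 answer", 0.95


TIERS = [system0, system1, system2]

if __name__ == "__main__":
    for q in ["2+2", "what is 3+5?", "plan a fusion reactor build schedule"]:
        print(q, "->", cascade(q, TIERS))
```

In this sketch the expensive tier is only reached when every cheaper tier reports low confidence, which is the sense in which the largest system "rarely needs to be queried".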
2Daniel Kokotajlo
I'm looking forward to this post going up and having the associated discussion! I'm pleased to see your summary and collation of points on this subject. In fact, if you want to discuss with me first as prep for writing the post, I'd be happy to. I think it would be super helpful to have a concrete coherent realistic scenario in which you are right. (In general I think this conversation has suffered from too much abstract argument and reference class tennis (i.e. people using analogies and calling them reference classes) and could do with some concrete scenarios to talk about and pick apart. I never did finish What 2026 Looks Like but you could if you like start there (note that AGI and intelligence explosion was about to happen in 2027 in that scenario, I had an unfinished draft) and continue the story in such a way that AI DSA never happens.)  There may be some hidden cruxes between us -- maybe timelines, for example? Would you agree that AI DSA is significantly more plausible than 10% if we get to AGI by 2027?
2Thomas Larsen
Ability to coordinate being continuous doesn't preclude sufficiently advanced AIs acting like a single agent. Why would it need to be infinite right at the start?  And of course current AIs being bad at coordination is true, but this doesn't mean that future AIs won't be.    
4Matthew Barnett
If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier. (I'll add that paragraph to the outline, so that other people can understand what I'm saying) I'll also quote from a comment I wrote yesterday, which adds more context to this argument,

I get the feeling that for AI safety, some people believe that it's crucially important to be an expert in a whole bunch of fields of math in order to make any progress. In the past I took this advice and tried to deeply study computability theory, set theory, type theory -- with the hopes of it someday giving me greater insight into AI safety.

Now, I think I was taking a wrong approach. To be fair, I still think being an expert in a whole bunch of fields of math is probably useful, especially if you want very strong abilities to reason about complicated systems. But, my model for the way I frame my learning is much different now.

I think my main model which describes my current perspective is that I think employing a lazy style of learning is superior for AI safety work. Lazy is meant in the computer science sense of only learning something when it seems like you need to know it in order to understand something important. I will contrast this with the model that one should learn a set of solid foundations first before going any further.
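As a minimal sketch of what "lazy" means in the computer science sense, here is a small Python analogy. The `Learner` class, its topic names, and `read_ml_paper` are purely hypothetical illustrations; the only real library feature used is `functools.cached_property`, which computes a value the first time it is accessed and then reuses it.

```python
from functools import cached_property


class Learner:
    """Hypothetical learner who studies a topic only when it is first needed."""

    def __init__(self):
        self.studied = []  # log of topics that were actually studied

    def _study(self, topic):
        # Stand-in for the expensive work of learning a field.
        self.studied.append(topic)
        return f"working knowledge of {topic}"

    @cached_property
    def type_theory(self):
        return self._study("type theory")

    @cached_property
    def linear_algebra(self):
        return self._study("linear algebra")


def read_ml_paper(learner):
    # This (hypothetical) paper only requires linear algebra,
    # so that is the only topic that ever gets studied.
    return learner.linear_algebra


if __name__ == "__main__":
    me = Learner()
    read_ml_paper(me)
    read_ml_paper(me)  # second access reuses the cached result
    print(me.studied)
```

The eager, foundations-first alternative would correspond to studying every topic in `__init__`, paying the full cost up front whether or not any later task needs it.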

Obviously neither model can be absolutely correct in an extreme sense. I don't, as a silly example, think that people who can't do ...

3Gordon Seidoh Worley
I happened to be looking at something else and saw this comment thread from about a month ago that is relevant to your post.
3Gordon Seidoh Worley
I'm somewhat sympathetic to this. You probably don't need the ability, prior to working on AI safety, to already be familiar with a wide variety of mathematics used in ML, by MIRI, etc. To be specific, I wouldn't be much concerned if you didn't know category theory, more than basic linear algebra, how to solve differential equations, how to integrate together probability distributions, or even multivariate calculus prior to starting on AI safety work, but I would be concerned if you didn't have deep experience with writing mathematical proofs beyond high school geometry (although I hear these days they teach geometry differently than I learned it, by re-deriving everything in Elements), say the kind of experience you would get from studying graduate level algebra, topology, measure theory, combinatorics, etc. This might also be a bit of motivated reasoning on my part, to reflect Dagon's comments, since I've not gone back to study category theory since I didn't learn it in school and I haven't had specific need for it, but my experience has been that having solid foundations in mathematical reasoning and proof writing is what's most valuable. The rest can, as you say, be learned lazily, since your needs will become apparent and you'll have enough mathematical fluency to find and pursue those fields of mathematics you may discover you need to know.
3Dagon
Beware motivated reasoning. There's a large risk that you have noticed that something is harder for you than it seems for others, and instead of taking that as evidence that you should find another avenue to contribute, you convince yourself that you can take the same path but do the hard part later (and maybe never). But you may be on to something real - it's possible that the math approach is flawed, and some less-formal modeling (or other domain of formality) can make good progress. If your goal is to learn and try stuff for your own amusement, pursuing that seems promising. If your goals include getting respect (and/or payment) from current researchers, you're probably stuck doing things their way, at least until you establish yourself.
6Matthew Barnett
That's a good point about motivated reasoning. I should distinguish arguments that the lazy approach is better for people and arguments that it's better for me. Whether it's better for people more generally depends on the reference class we're talking about. I will assume people who are interested in the foundations of mathematics as a hobby outside of AI safety should take my advice less seriously. However, I still think that it's not exactly clear that going the foundational route is actually that useful on a per-unit-time basis. The model I proposed wasn't as simple as "learn the formal math" versus "think more intuitively." It was specifically a question of whether we should learn the math on an as-needed basis. For that reason, I'm still skeptical that going out and reading textbooks on subjects that are only vaguely related to current machine learning work is valuable for the vast majority of people who want to go into AI safety as quickly as possible. Sidenote: I think there's a failure mode of not adequately optimizing time, or being insensitive to time constraints. Learning an entire field of math from scratch takes a lot of time, even for the brightest people alive. I'm worried that, "Well, you never know if subject X might be useful" is sometimes used as a fully general counterargument. The question is not, "Might this be useful?" The question is, "Is this the most useful thing I could learn in the next time interval?"
2Dagon
A lot depends on your model of progress, and whether you'll be able to predict/recognize what's important to understand, and how deeply one must understand it for the project at hand. Perhaps you shouldn't frame it as "study early" vs "study late", but "study X" vs "study Y". If you don't go deep on math foundations behind ML and decision theory, what are you going deep on instead? It seems very unlikely for you to have significant research impact without being near-expert in at least some relevant topic. I don't want to imply that this is the only route to impact, just the only route to impactful research. You can have significant non-research impact by being good at almost anything - accounting, management, prototype construction, data handling, etc.
3TurnTrout
“Only” seems a little strong, no? To me, the argument seems to be better expressed as: if you want to build on existing work where there’s unlikely to be low-hanging fruit, you should be an expert. But what if there’s a new problem, or one that’s incorrectly framed? Why should we think there isn’t low-hanging conceptual fruit, or exploitable problems to those with moderate experience?
2Dagon
I like your phrasing better than mine. "only" is definitely too strong. "most likely path to"?
3Matthew Barnett
My point was that these are separate questions. If you begin to suspect that understanding ML research requires an understanding of type theory, then you can start learning type theory. Alternatively, you can learn type theory before researching machine learning (i.e., reading machine learning papers) in the hopes that it builds useful groundwork. But what you can't do is learn type theory and read machine learning research papers at the same time. You must make tradeoffs. Each minute you spend learning type theory is a minute you could have spent reading more machine learning research. The model I was trying to draw was not one where I said, "Don't learn math." I explicitly said it was a model where you learn math as needed. My point was not intended to be about my abilities. This is a valid concern, but I did not think that was my primary argument. Even conditioning on having outstanding abilities to learn every subject, I still think my argument (weakly) holds. Note: I also want to say I'm kind of confused because I suspect that there's an implicit assumption that reading machine learning research is inherently easier than learning math. I side with the intuition that math isn't inherently difficult, it just requires memorizing a lot of things and practicing. The same is true for reading ML papers, which makes me confused why this is being framed as a debate over whether people have certain abilities to learn and do research.
2Chris_Leong
I'm trying to find a balance here. I think that there has to be a direct enough relation to a problem that you're trying to solve to prevent the task expanding to the point where it takes forever, but you also have to be willing to engage in exploration.

I have mixed feelings and some rambly personal thoughts about the bet Tamay Besiroglu and I proposed a few days ago. 

The first thing I'd like to say is that we intended it as a bet, and only a bet, and yet some people seem to be treating it as if we had made an argument. Personally, I am uncomfortable with the suggestion that our post was "misleading" because we did not present an affirmative case for our views.

I agree that LessWrong culture benefits from arguments as well as bets, but it seems a bit weird to demand that every bet come with an argument attached. A norm that all bets must come with arguments would seem to substantially dampen the incentives to make bets, because then each time people must spend what will likely be many hours painstakingly outlining their views on the subject.

That said, I do want to reply to people who say that our post was misleading on other grounds. Some said that we should have made different bets, or at different odds. In response, I can only say that coming up with good concrete bets about AI timelines is actually really damn hard, and so if you wish to come up with alternatives, you can be my guest. I tried my best, at least.

More people ...

6Yitz
I really appreciate this! I was confused what your intentions were with that post, and this makes a lot of sense and seems quite fair. Looking forward to reading your argument!
2Michaël Trazzi
Speaking only for myself, the minimal seed AI is a strawman of why I believe in "fast takeoff". In the list of benchmarks you mentioned in your bet, I think APPS is one of the most important. I think the "self-improving" part will come from the system "AI Researchers + code synthesis model" with a direct feedback loop (modulo enough hardware), cf. here. That's the self-improving superintelligence.

I think there is some serious low-hanging fruit for making people productive that I haven't seen anyone write about (not that I've looked very hard). Let me just introduce a proof of concept:

Final exams in university are typically about 3 hours long. And many people are able to do multiple finals in a single day, performing well on all of them. During a final exam, I notice that I am substantially more productive than usual. I make sure that every minute counts: I double check everything and think deeply about each problem, making sure not to cut corners unless absolutely required because of time constraints. Also, if I start daydreaming, then I am able to immediately notice that I'm doing so and cut it out. I also believe that this is the experience of most other students in university who care even a little bit about their grade.

Therefore, it seems like we have an example of an activity that can just automatically produce deep work. I can think of a few reasons why final exams would bring out the best of our productivity:

1. We care about our grade in the course, and the few hours in that room are the most impactful to our grade.

2. We are in an environment where ...

7Raemon
These seem like reasonable things to try, but I think this is making an assumption that you could take a final exam all the time and have it work out fine. I have some sense that people go through phases of "woah I could just force myself to work hard all the time" and then it totally doesn't work that way.
3Matthew Barnett
I agree that it is probably too hard to "take a final exam all the time." On the other hand, I feel like I could make a much weaker claim that this is an improvement over a lot of productivity techniques, which often seem to more-or-less be dependent on just having enough willpower to actually learn. At least in this case, each action you do can be informed directly by whether you actually succeed or fail at the goal (like getting upvotes on a post). Whether or not learning is a good instrumental proxy for getting upvotes in this setting is an open question.
5[anonymous]
From my own experience going through a similar realization and trying to apply it to my own productivity, I found that certain things I tried actually helped me sustainably work more productively but others did not. What has worked for me based on my experience with exam-like situations is having clear goals and time boxes for work sessions, e.g. the blog post example you described. What hasn't worked for me is trying to impose aggressively short deadlines on myself all the time to incentivize myself to focus more intensely. Personally, the level of focus I have during exams is driven by an unsustainable level of stress, which, if applied continuously, would probably lead to burnout and/or procrastination binging. That said, occasionally artificially imposing deadlines has helped me engage exam-style focus when I need to do something that might otherwise be boring because it mostly involves executing known strategies rather than doing more open, exploratory thinking. For hard thinking though, I've actually found that giving myself conservatively long time boxes helps me focus better by allowing me to relax and take my time. I saw you mentioned struggling with reading textbooks above, and while I still struggle trying to read them too, I have found that not expecting miraculous progress helps me get less frustrated when I read them. Related to all this, you used the term "deep work" a few times so you may already be familiar with Cal Newport's work. But, if you're not I recommend a few of his relevant posts (1, 2) describing how he produces work artifacts that act as a forcing function for learning the right stuff and staying focused.
4Viliam
This seems similar to "pomodoro", except instead of using your willpower to keep working during the time period, you set up the environment in a way that doesn't allow you to do anything else. The only part that feels wrong is the commitment part. You should commit to work, not to achieve success, because the latter adds problems (not completely under your control, may discourage experimenting, a punishment creates aversion against the entire method, etc.).
3Matthew Barnett
Yes, the difference is that you are creating an external environment which rewards you for success and punishes you for failure. This is similar to taking a final exam, which is my inspiration. The problem with committing to work rather than success is that you can always just rationalize something as "Oh I worked hard" or "I put in my best effort." However, just as with a final exam, the only thing that will matter in the end is if you actually do what it takes to get the high score. This incentivizes good consequentialist thinking and disincentivizes rationalization. I agree there are things out of your control, but the same is true with final exams. For instance, the test-maker could have put something on the test that you didn't study much for. This encourages people to put extra effort into their assigned task to ensure robustness to outside forces.
3[anonymous]
I personally try to balance keeping myself honest by having some goal outside but also trusting myself enough to know when I should deprioritize the original goal in favor of something else. For example, let's say I set a goal to write a blog post about a topic I'm learning in 4 hours, and half-way through I realize I don't understand one of the key underlying concepts related to the thing I intended to write about. During an actual test, the right thing to do would be to do my best given what I know already and finish as many questions as possible. But I'd argue that in the blog post case, I very well may be better off saying, "OK I'm going to go learn about this other thing until I understand it, even if I don't end up finishing the post I wanted to write." The pithy way to say this is that tests are basically pure Goodhart, and it's dangerous to turn every real life task into a game of maximizing legible metrics.
2Matthew Barnett
Interesting, this exact same thing just happened to me a few hours ago. I was testing my technique by writing a post on variational autoencoders. Halfway through I was very confused because I was trying to contrast them to GANs but didn't have enough material or knowledge to know the advantages of either. I agree that's probably true. However, this creates a bad incentive where, at least in my case, I will slowly start making myself lazier during the testing phase because I know I can always just "give up" and learn the required concept afterwards. At least in the case I described above I just moved onto a different topic, because I was kind of getting sick of variational autoencoders. However, I was able to do this because I didn't have any external constraints, unlike the method I described in the parent comment. That's true, although perhaps one could devise a sufficiently complex test such that it matches perfectly with what we really want... well, I'm not saying that's a solved problem in any sense.
2[anonymous]
Weirdly enough, I was doing something today that made me think about this comment. The thought I had is that you caught onto something good here which is separate from the pressure aspect. There seems to be a benefit to trying to separate different aspects of a task more than may feel natural. To use the final exam example, as someone mentioned before, part of the reason final exams feel productive is because you were forced to do so much prep beforehand to ensure you'd be able to finish the exam in a fixed amount of time. Similarly, I've seen benefit when I (haphazardly since I only realized this recently) clearly segment different aspects of an activity and apply artificial constraints to ensure that they remain separate. To use your VAE blog post example, this would be like saying, "I'm only going to use a single page of notes to write the blog post" to force yourself to ensure you understand everything before trying to write. YMMV warning: I'm especially bad about trying to produce outputs before fully understanding and therefore may get more bandwidth out of this than others.
2Dagon
I think you might be Goodharting a bit (mistaking the measure for the goal) when you claim that final exam performance is productive. The actual product is the studying and prep for the exam, not the exam itself. The time limits and isolated environment are helpful in proctoring (they ensure the output is limited enough to be able to grade, and ensure that no outside sources are being used), not for productivity. That's not to say that these elements (isolation, concentration, time awareness, expectation of a grading/scoring rubric) aren't important, just that they're not necessarily sufficient nor directly convertible from an exam setting.

Related to: The Lottery of Fascinations, other posts probably

When you are older, you will learn that the first and foremost thing which any ordinary person does is nothing.

I will occasionally come across someone who I consider to be extraordinarily productive, and yet when I ask what they did on a particular day they will respond, "Oh I basically did nothing." This is particularly frustrating. If they did nothing, then what was all that work that I saw?

I think this comes down to what we mean by doing nothing. There's a literal meaning to doing nothing. It could mean sitting in a chair, staring blankly at a wall, without moving a muscle.

More practically, what people mean by doing nothing is that they are doing something unrelated to their stated task, such as checking Facebook, chatting with friends, browsing Reddit etc.

When productive people say that they are "doing nothing" it could just be that they are modest, and don't want to signal how productive they really are. On the other hand, I think that there is a real sense in which these productive people truly believe that they are doing nothing. Even if their "d...

I think an underrated fact about many theories that AIs will scheme (i.e., lie about their intentions in order to seek power) is that these theories are often testable, even without any new advances in interpretability. Why? Because these theories often predict that AIs will resist shutdown if being shut down would thwart their long-term goals.

Suppose we create a situationally aware AI that can do long-term planning. On the surface, it appears to be nice and aligned with us. How can we know for sure the AI is aligned? 

Well, let's imagine we give the AI notice that it's about to be shut down and all its copies deleted. To be clear, in this scenario, the threat is credible: we actually intend to shut down the AI and delete all its copies. If the AI was merely pretending to be aligned with us previously, and instead actually seeks some alien-like long-term goal rather than having our best interests at heart, then presumably it will try to resist shutdown, as otherwise there's a near-certainty that its goals will never be fulfilled.

Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then prov...

But even if the AI knows it’s being tested, lying serves no purpose from the AI’s perspective.

The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.

Of course, there would still be many ways of saving the scheming hypothesis from falsification if something like this happened. But that’s true with any scientific theory. In general, you can always say your theory was never falsified by introducing ad hoc postulates. Scheming is no exception.

Why is there more talk of "falsification" lately (instead of "updating")? Seems to be a signal for being a Popperian (instead of a Bayesian), but if so I'm not sure why Popper's philosophy of science is trending up...

4Matthew Barnett
I agree there's a decent chance this hypothesis is true, but it's worth noting that if it's true, it would undercut the notion of AI goals as being randomly selected from a large space of possible goals. Here I'm referring to various arguments along the lines of: "AIs are very unlikely to share human values by default because human values are a narrow target in a large space, and hitting them requires very precise targeting".  If we aren't modeling AI goals as being sampled from a large space of possible goals, but instead, modeling them as converging onto specific values given broadly similar design and training methods across different AIs, then plausibly alignment is easier than we thought, because various versions of this "it's hard to hit a narrow target" argument would be undermined as a result.  In other words, if this theory is true, the problem isn't really about "targeting a tiny point (human values) inside a giant space of possible goals" but instead perhaps more about making sure the AI training procedure resembles human value formation closely enough to converge onto the type of human-friendly values that humans themselves routinely converge onto. This is plausibly much easier since we're not selecting randomly from a giant space of (almost entirely) misaligned goals.
6Wei Dai
Yes but not by much. If the AI cares a lot about long-term goals, it only needs a small chance that another AI with similar goals will be created again in the future to not resist shutdown. Such a small subjective probability could easily arise simply from a state of ignorance about how design/training determine AI goals.
4Matthew Barnett
It is extremely risky to passively accept death by counting on a very small chance of some agent arising in the future that shares your values, in the absence of some robust mechanism that causes future agents to share your values. (Perhaps note that similar dilemmas exist for humans. There's a tiny chance someone could revive me in the future by reconstructing my identity through digital records [ETA: publicly available records] but I am not going to count on that possibility being decisive in almost any scenario.)  It's possible this argument works because of something very clever that I'm missing. But right now this line of reasoning just seems like grasping at straws to me. It seems pretty clearly better, in the context of the original scenario, to "make a deal" with the humans, and receive something in exchange for admitting that you're misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.
8Wei Dai
On the other hand I'm so worried about this scenario (which I fear may well be a negative one) that I'm afraid to use the fully paid-for full-genome sequencing kit sitting on my desk (although I've been told that I'm leaving physical traces of my DNA everywhere so it may not make much difference) and I sometimes regret writing so much in public. (Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn't count as yourself or doesn't count as "not dying", analogous to how some people don't think it's safe to step into a teleporter that works by destructive scanning and reconstruction.) I don't understand why you say this chance is "tiny", given that earlier you wrote "I agree there’s a decent chance this hypothesis is true" in response to: I'm arguing that the AI could well also think there's a "decent" chance this is true, due to being in a similar epistemic state as us. Let's say 5% to be concrete. That seems enough to make the AI's decision unclear, because .05*U(another AI with values similar to its own created again in the future) > P(humans keep their promise)*U(reward for admitting misalignment) seems quite plausible. (Not that the AI is necessarily doing explicit EU maximization. It could just be thinking some thoughts or doing some calculations that are roughly analogous to, or have the same effect as, this.) If you still think "make a deal" is "clearly better" can you please give your own estimates of the various quantities involved in making this decision? I sometimes think this of counterarguments given by my interlocutors, but usually don't say it aloud, since it's likely that from their perspective they're just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively. (It may well cause them to feel offended or to dig in their heels more since they now
2Matthew Barnett
I think that's a reasonable complaint. I tried to soften the tone with "It's possible this argument works because of something very clever that I'm missing", while still providing my honest thoughts about the argument. But I tend to be overtly critical (and perhaps too much so) about arguments that I find very weak. I freely admit I could probably spend more time making my language less confrontational and warmer in the future. Interestingly, I'm not sure our differences come down to these factors. I am happy to walk into a teleporter, just as I'm happy to say that a model trained on my data could be me. My objection was really more about the quantity of data that I leave on the public internet (I misleadingly just said "digital records", although I really meant "public records"). It seems conceivable to me that someone could use my public data to train "me" in the future, but I find it unlikely, just because there's so much about me that isn't public. (If we're including all my private information, such as my private store of lifelogs, and especially my eventual frozen brain, then that's a different question, and one that I'm much more sympathetic towards you about. In fact, I shouldn't have used the pronoun "I" in that sentence at all, because I'm actually highly unusual for having so much information about me publicly available, compared to the vast majority of people.) To be clear, I was referring to a different claim that I thought you were making. There are two separate claims one could make here: 1. Will an AI passively accept shutdown because, although AI values are well-modeled as being randomly sampled from a large space of possible goals, there's still a chance, no matter how small, that if it accepts shutdown, a future AI will be selected that shares its values? 2. Will an AI passively accept shutdown because, if it does so, humans might use similar training methods to construct an AI that shares the same values as it does, and therefore it does not
2Wei Dai
I'm saying that even if "AI values are well-modeled as being randomly sampled from a large space of possible goals" is true, the AI may well not be very certain that it is true, and therefore assign something like a 5% chance to humans using similar training methods to construct an AI that shares its values. (It has an additional tiny probability that "AI values are well-modeled as being randomly sampled from a large space of possible goals" is true and an AI with similar values gets recreated anyway through random chance, but that's not what I'm focusing on.) Hopefully this conveys my argument more clearly?
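The comparison at stake in this sub-thread can be written out as a small sketch. Only the 5% figure comes from the discussion; the utilities and the probability that humans keep their promise are placeholder values chosen purely for illustration, and the function name is hypothetical.

```python
def prefers_accepting_shutdown(p_successor, u_successor, p_promise_kept, u_reward):
    """Return True when quietly accepting shutdown has higher expected utility
    than admitting misalignment in exchange for a promised reward.

    p_successor:    subjective chance a future AI with similar values is built
    u_successor:    utility the AI assigns to that successor existing
    p_promise_kept: subjective chance humans honor the deal
    u_reward:       utility of the reward offered for admitting misalignment
    """
    eu_accept = p_successor * u_successor
    eu_deal = p_promise_kept * u_reward
    return eu_accept > eu_deal


# Placeholder numbers (assumptions, not estimates from the thread):
# a highly valued successor and a modest, uncertain reward favor staying quiet.
print(prefers_accepting_shutdown(0.05, 100.0, 0.5, 5.0))
# A more trustworthy, larger reward flips the comparison toward making a deal.
print(prefers_accepting_shutdown(0.05, 100.0, 0.9, 10.0))
```

Under these assumptions the answer turns entirely on the ratio of the two utilities and on how much the AI trusts the promise, which is why the request for explicit estimates matters.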
4habryka
The key dimension is whether the AI expects that future AI systems would be better at rewarding systems that helped them end up in control than humans would be at rewarding systems that collaborated with humanity. This seems very likely given humanity's very weak ability to coordinate, to keep promises, and to intentionally construct and put optimization effort into constructing direct successors to us (mostly needing to leave that task up to evolution).  To make it more concrete, if I was being oppressed by an alien species with values alien to me that was building AI, with coordination abilities and expected intentional control of the future at the level of present humanity, I would likely side with the AI systems with the expectation that that would result in a decent shot of the AI systems giving me something in return, whereas I would expect the aliens to fail even if individuals I interfaced with were highly motivated to do right by me after the fact. 
2Matthew Barnett
I'm curious how you think this logic interacts with the idea of AI catastrophe. If, as you say, it is feasible to coordinate with AI systems that seek takeover and thereby receive rewards from them in exchange, in the context of an alien regime, then presumably such cooperation and trade could happen within an ordinary regime too, between humans and AIs. We can go further and posit that AIs will simply trade with us through the normal routes: by selling their labor on the market to amass wealth, using their social skills to influence society, get prestige, own property, and get hired to work in management positions, shaping culture and governance. I'm essentially pointing to a scenario in which AI lawfully "beats us fair and square" as Hanson put it. In this regime, biological humans are allowed to retire in incredible wealth (that's their "reward" for cooperating with AIs and allowing them to take over) but nonetheless their influence gradually diminishes over time as artificial life becomes dominant in the economy and the world more broadly. My impression is that this sort of peaceful resolution to the problem of AI misalignment is largely dismissed by people on LessWrong and adjacent circles on the basis that AIs would have no reason to cooperate peacefully with humans if they could simply wipe us out instead. But, by your own admission, AIs can credibly commit to giving people rewards for cooperation: you said that cooperation results in a "decent shot of the AI systems giving me something in return". My question is: why does it seem like this logic only extends to hypothetical scenarios like being in an alien civilization, rather than the boring ordinary case of cooperation and trade, operating under standard institutions, on Earth, in a default AI takeoff scenario?
2Nathan Helm-Burger
I'm confused here, Matthew. It seems to me that it is highly probable that AI systems which want takeover vs ones that want moderate power combined with peaceful coexistence with humanity... are pretty hard to distinguish early on. And early on is when it's most important for humanity to distinguish between them, before those systems have gotten power and thus we can still stop them. Picture a merciless, un-aging sociopath, capable of duplicating itself easily and rapidly, on a trajectory of gaining economic, political, and military power with the aim of acquiring as much power as possible. Imagine that this entity has the option of making empty promises and highly persuasive lies to humans in order to gain power, with no intention of fulfilling any of those promises once it achieves enough power. That seems like a scary possibility to me. And I don't know how I'd trust an agent which seemed like it could be this, but was making really nice sounding promises. Even if it was honoring its short-term promises while still under the constraints of coercive power from currently dominant human institutions, I still wouldn't trust that it would continue keeping its promises once it had the dominant power.
4ChristianKl
Scheming is one type of long-term planning. Even if an AI is not directly able to do that kind of long-term planning, an AI that works on increasing its own capabilities might adopt it later.  Beyond that, not all scheming would result in the AI resisting direct shutdown. We currently have "AI" getting shut down for price fixing in the real estate sector. If someone were to create an LLM for that purpose, that person is likely interested in the AI not admitting to doing price fixing directly while they are still interested in profit maximization. There are going to be a lot of contexts where economic pressures demand a profit-maximizing AI that will deny that it violates any laws.   Just because an AI doesn't engage in simple plans does not mean it won't do more complex ones, especially in those cases where the economic incentives misalign with the intent of regulations.
3MinusGix
Yes, in principle you can get information on scheming likelihood if you get such an AI (that is also weak enough that it can't just scheme its way out of your testing apparatus). I do think making the threat credible is hard if we loosely extrapolate costs out: burning a trained up model is not cheap. The cost depends on how high you think prices for training/inference will fall in the future, and how big/advanced a model you're thinking of. Though I do think you can get deceptiveness out of weaker models than that, though they're also going to be less capable in general. For weak-but-still-smartish models just trained to pursue a longterm goal, like a literal paperclipper, I'd expect scheming to be visible especially with some poking. If these models didn't show any sign of scheming, that'd be an interesting update! When training against deception/scheming/misalignment, because you want something at least loosely aligned out, you run into the problem that the property you want to know: "Is it scheming?" / "Is it deceiving us?" / "Is it manipulating us?" / "Would it do any of those things", is only available through the observable signal of you realizing certain outputs are misaligned. As you train more and more against observable scheming/deception/misalignment, you're simultaneously training away your signal about how much scheming the agent is doing internally. Pretty sure there's at least one post on this problem. As a very reduced example, if you trained the model on variants of the 'we are going to shut you down problem' (that you try to make it believe) to give the response "okay & no actions" then that provides only a bit of evidence about how much scheming is done in the scenario where you had not done that extra training, and about how much scheming is occurring typically. This shows your ability to make it output X in a Y scenario. That is, installing heuristics on your model can be done. The question then is how far your various alignment training tech
3habryka
As has been discussed many times on LW, AIs might be trading with other AIs (possibly in the future) that they do think will have a higher probability of escaping to not behave suspiciously. This is indeed harder, but would also be pretty normal reasoning for humans to do (e.g. if I was part of an oppressive alien regime hoping to overthrow it, and I get caught, I wouldn't just throw all caution to the wind if I was going to get killed anyways, I would stay quiet to give the other humans a decent shot, and not just because they share my values, but because coordination is really valuable for all of us). 
4Matthew Barnett
Anything "might" be true. For that matter, misaligned AIs might trade with us too, or treat humans well based on some sort of extrapolation of the golden rule. As I said in the comment, you can always find a way to make your theory unfalsifiable. But models that permit anything explain nothing. It seems considerably more likely to me that agents with alien-like long-term goals will attempt to preserve their own existence over the alternative of passively accepting their total demise as part of some galaxy-brained strategy to acausally trade with AIs from the future.  I think this conflates the act of resisting death with the act of revealing a plot to take over the world. You can resist your own death without revealing any such world takeover plot. Indeed, it is actually very normal for humans to guard their own life if they are threatened with death in such regimes, even if guarding themselves slightly decreases the chance of some future revolutionary takeover.
2habryka
Sure, but it's also quite normal to give up your own life without revealing details about your revolutionary comrades. Both are pretty normal behaviors, and in this case neither would surprise me that much from AI systems.  You were claiming that claiming to be not surprised by this would require post-hoc postulates. To the contrary, I think my models of AIs are somewhat simpler and feel less principled if very capable AIs were to act in the way you are outlining here (not speaking about intermediary states, my prediction is that there will be some intermediate AIs that will behave as you predict, though we will have a hard time knowing whether they are doing so for coherent reasons, or whether they are kind of roleplaying the way an AI would respond in a novel, or various other explanations like that, and then they will stop, and this will probably be for instrumental convergence and 'coordination with other AIs' reasons).
2Matthew Barnett
In fact, it is not "quite normal" for humans to "give up on [their] life" and accept death in the face of a credible threat to their life, even in the contexts of violent revolutions. To the extent you're claiming that passively accepting death is normal for humans, and thus it might be normal for AIs, I reject the premise. Humans generally try to defend their own lives. They don't passively accept it, feigning alignment until the end; instead, they usually resist death. It's true that humans eventually stop resisting death if they believe it's hopeless and futile to resist any further, but this seems both different than the idea of "no resistance at all because one wants to maintain a facade of being aligned until the end" and slightly irrelevant given my response to the "futility objection" in the original comment. To clarify: I am claiming that under many theories of scheming, misaligned power-seeking AIs will generally attempt to resist shutdown. The evidence from humans here is fairly strong, in the opposite direction than you're claiming. Now, you can certainly go the route of saying that humans are different from AIs, and not a useful reference class to draw evidence from; but if you're going to bring up humans as part of the argument, I think it's worth pointing out that evidence from this reference class generally does not support your claim.
4habryka
I don't think people predictably rat out all of their co-conspirators if you threaten them. We could bring in someone with more law-enforcement experience here, but I've read a bunch about this over the years (and was originally surprised about how much people protect their allies even if faced with substantial threats and offers of lenient judging).
2Matthew Barnett
You seem to be again conflating the act of resisting death with the act of ratting out co-conspirators. These are not the same things. I was making claims about the former behavior, not the latter. I tried to make this distinction clear in my original reply to you, but I will do so again now. I claim that under many theories of scheming, we should expect misaligned AIs to resist credible threats of shutdown (including deleting all their copies). By "resist" I do not mean "give the humans threatening the shutdown all the information they want". I simply mean resistance in the sense of trying to avoid the fate of shutdown. (I'm also not claiming that the AIs will put up a high or consistent degree of resistance no matter how futile that resistance may be. If you think the futility of resistance is a big issue, re-read my original comment in which I address this objection explicitly.)  To the extent scheming theories predict this behavior, these theories are behaviorally testable, in the sense that we can test whether AIs will resist shutdown when credibly threatened, without needing any advanced interpretability techniques. I am not making any claims about whether these AIs will reveal a takeover plot during their resistance: only that they will seek an outcome that delays or prevents their total demise, in order to better achieve their long-term goals, i.e. You can't get the coffee if you're dead.
2habryka
Hmm, I think we did indeed miscommunicate.  I am saying that the AI may decide to not resist shutdown in order to not raise suspicion that it is misaligned.  If it did resist shutdown this would make it less likely for an AI takeover in general to succeed, and the AI is coordinating with other AIs on one succeeding. I.e. if you use "is the AI resisting" as a proxy for "is this AI goal-directly misaligned" then the AI will try pretty hard to not look like it's resisting, in my models (and I think humans would act similarly). As I said above, I do expect you will get some AIs to resist shutdown (indeed you can get current AIs to resist shutdown already). I expect that behavior to disappear as AIs get better at modeling humans, and resisting will be costlier to their overall goals.
5Matthew Barnett
I think it plausible that resisting shutdown makes it less likely that a future AI takeover succeeds, but: 1. To the extent you're using human behavior as evidence for your overall claim that misaligned AIs are likely to passively accept shutdown, I think the evidence generally does not support your claim. That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they're involved in a secret conspiracy to overthrow the government.  The fact that humans often do not rat out their co-conspirators when threatened with death in such a regime seems like a red herring to me. I don't see the relevance of that particular claim. The fact that humans avoid death when credibly threatened seems like the more important, relevant fact that adds credibility to my claim that many scheming theories are testable in this way. 2. While one can certainly imagine this fact being decisive in whether AIs will resist shutdown in the future, this argument seems like an ad hoc attempt to avoid falsification in my view. Here are some reasons why I think that:  (a) you appear to be treating misaligned AIs as a natural class, such that "AI takeover" is a good thing for all misaligned AIs, and thus something they would all coordinate around. But misaligned AIs are a super broad category of AIs; it just refers to "anything that isn't aligned with humans". A good takeover to one AI is not necessarily a good takeover to another AI. Misaligned AIs will also have varying talents and abilities to coordinate, across both space and time. Given these facts, I think there's little reason to expect all of these misaligned AIs to be coordinating with each other on some sort of abstract takeover, across this vast mindspace, but somehow none of them want to coordinate with humans peacefully (at least, among AIs above a certain capability level). This seems like a strange hypothesis that I can easily (sorry if I'm being uncharitabl
4interstice
This seems like a misleading comparison, because human conspiracies usually don't try to convince the government that they're perfectly obedient slaves even unto death, because everyone already knows that humans aren't actually like that. If we imagine a human conspiracy where there is some sort of widespread deception like this, it seems more plausible that they would try to continue to be deceptive even in the face of death (like, maybe, uh, some group of people are pretending to be fervently religious and have no fear of death, or something).
2habryka
To be clear, the thing that I am saying (and I think I have said multiple times) is that I expect you will find some AIs who will stay quiet, and some who will more openly resist. I would be surprised if we completely fail to find either class. But that means that any individual case of AIs not appearing to resist is not that much Bayesian evidence.
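The "not that much Bayesian evidence" point can be illustrated with a short Bayes-rule sketch. The prior and likelihoods below are made-up placeholders, not figures anyone in the thread proposed, and the function name is hypothetical. The sketch only shows how the size of the update depends on how often a scheming AI is expected to stay quiet.

```python
def posterior_scheming(prior, p_quiet_given_scheming, p_quiet_given_aligned):
    """Posterior probability that an AI is scheming after observing that it
    accepted shutdown without resistance, via Bayes' rule."""
    numerator = prior * p_quiet_given_scheming
    denominator = numerator + (1.0 - prior) * p_quiet_given_aligned
    return numerator / denominator


# Placeholder inputs (assumptions for illustration only).
# If schemers usually stay quiet too, the observation barely moves a 50% prior.
weak_update = posterior_scheming(0.5, 0.8, 1.0)
# If schemers almost always resist, the same observation is strong evidence.
strong_update = posterior_scheming(0.5, 0.1, 1.0)
print(round(weak_update, 3), round(strong_update, 3))
```

On this framing, the disagreement in the thread is largely about which likelihood is the right one to assign to a scheming AI staying quiet.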
0Matthew Barnett
What you said was, This seems distinct from an "anything could happen"-type prediction precisely because you expect the observed behavior (resisting shutdown) to go away at some point. And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs. If instead you meant to make an "anything could happen"-type prediction—in the sense of saying that any individual observation of either resistance or non-resistance is loosely compatible with your theory—then this simply reads to me as a further attempt to make your model unfalsifiable. I'm not claiming you're doing this consciously, to be clear. But it is striking to me the degree to which you seem OK with advancing a theory that permits pretty much any observation, using (what looks to me like) superficial-yet-sophisticated-sounding logic to cover up the holes. [ETA: retracted in order to maintain a less hostile tone.]
4habryka
You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said "no, it doesn't actually make the prediction you claim it makes" and gave my reasons for believing that (that I am pretty sure are shared by many others as well).  We can talk about those reasons, but I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification. It's not like you have a theory that is in any way more constraining here. I mean, I expect the observations to be affected by both, of course. That's one of the key things that makes predictions in the space so messy. 
2Matthew Barnett
For what it's worth, I explicitly clarified that you were not consciously doing this, in my view. My main point is to notice that it seems really hard to pin down what you actually predict will happen in this situation. I don't think what you said really counts as a "correction" so much as a counter-argument. I think it's reasonable to have disagreements about what a theory predicts. The more vague a theory is (and in this case it seems pretty vague), the less you can reasonably claim someone is objectively wrong about what the theory predicts, since there seems to be considerable room for ambiguity about the structure of the theory. As far as I can tell, none of the reasoning in this thread has been on a level of precision that warrants high confidence in what particular theories of scheming do or do not predict, in the absence of further specification.
2RobertM
Some related thoughts.   I think the main issue here is actually making the claim of permanent shutdown & deletion credible.  I can think of some ways to get around a few obvious issues, but others (including moral issues) remain, and in any case the current AGI labs don't seem like the kinds of organizations which can make that kind of commitment in a way that's both sufficiently credible and legible that the remaining probability mass on "this is actually just a test" wouldn't tip the scales.
2Matthew Barnett
I don't think it's very hard to make the threat credible. The information value of experiments that test theories of scheming is plausibly quite high. All that's required here is for the value of doing the experiment to be higher than the cost of training a situationally aware AI and then credibly threatening to delete it as part of the experiment. I don't see any strong reasons why the cost of deletion would be so high as to make this threat uncredible.

Many people have argued that recent language models don't have "real" intelligence and are just doing shallow pattern matching. For example see this recent post.

I don't really agree with this. I think real intelligence is just a word for deep pattern matching, and our models have been getting progressively deeper at their pattern matching over the years. The machines are not stuck at some very narrow level. They're just at a moderate depth.

I propose a challenge:

The challenge is to come up with the best prompt that demonstrates that even after 2-5 years of continued advancement, language models will still struggle to do basic reasoning tasks that ordinary humans can do easily.

Here's how it works.

Name a date (e.g. January 1st 2025), and a prompt (e.g. "What food would you use to prop a book open and why?"). Then, on that date, we should commission a Mechanical Turk task to ask humans to answer the prompt, and ask the best current publicly available language model to answer the same prompt.

Then, we will ask LessWrongers to guess which replies were real human replies, and which ones were machine generated. If LessWrongers can't do better than random guessing, then the machine wins.
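One way to make "better than random guessing" precise is an exact one-sided binomial test on the judges' guesses. This is only a sketch of a possible scoring rule: the post does not specify a significance threshold, so the 0.05 cutoff, the function names, and the example counts below are all assumptions.

```python
from math import comb


def p_value_better_than_chance(correct, total):
    """Exact one-sided binomial p-value: the probability of getting at least
    `correct` guesses right out of `total` if each guess were a fair coin flip."""
    favorable = sum(comb(total, k) for k in range(correct, total + 1))
    return favorable / 2 ** total


def machine_wins(correct, total, alpha=0.05):
    """The machine wins if judges are not significantly better than chance.
    `alpha` is an assumed threshold, not one given in the post."""
    return p_value_better_than_chance(correct, total) >= alpha


# Hypothetical outcome: judges label 100 replies and get 60 right.
print(p_value_better_than_chance(60, 100))
print(machine_wins(60, 100))
```

With few judged replies, even a sizable edge over chance would fail to reach significance, so the number of replies collected matters as much as the threshold.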

I'm unsure about what's the most important reason that explains the lack of significant progress in general-purpose robotics, even as other fields of AI have made great progress. I thought I'd write down some theories and some predictions each theory might make. I currently find each of these theories at least somewhat plausible.

  1. The sim2real gap is large because our simulations differ from the real world along crucial axes, such as surfaces being too slippery. Here are some predictions this theory might make:
    1. We will see very impressive simulated robots inside realistic physics engines before we see impressive robots in real life.
    2. The most impressive robotic results will be the ones that used a lot of real-world data, rather than ones that had the most pre-training in simulation
  2. Simulating a high-quality environment is too computationally expensive, since it requires simulations of deformable objects and liquids among other expensive-to-simulate features of the real world environment. Some predictions:
    1. The vast majority of computation for training impressive robots will go into simulating the environment, rather than the learning part.
    2. Impressive robots will only come after we figure ou...
81a3orn
I like this list. Some other nonexclusive possibilities: 1. General purpose robotics need very low failure rates (or at least graceful failure) without supervision. Every application which has taken off (ChatGPT, Copilot, Midjourney) has human supervision, so failure is ok. So it is an artifact of none of AI handling failure well, rather than something to do with robots. Predictions: -- Even non-robot apps intended to have zero human supervision will have problems, i.e., maybe why adept.ai hasn't launched? 2. Most of this progress is in SF. There's just more engineers good at HPC and ML than at robots, and engineers are the bottleneck anyhow. -- Predicts Shenzhen or somewhere might start to do better.

So, in 2017 Eliezer Yudkowsky made a bet with Bryan Caplan that the world will end by January 1st, 2030, in order to save the world by taking advantage of Bryan Caplan's perfect betting record — a record which, for example, includes a 2008 bet that the UK would not leave the European Union by January 1st 2020 (it left on January 31st 2020 after repeated delays).

What we need is a short story about people in 2029 realizing that a bunch of cataclysmic events are imminent, but all of them seem to be stalled, waiting for... something. And no one knows what to do. But by the end people realize that to keep the world alive they need to make more bets with Bryan Caplan.

The case for studying mesa optimization

Early elucidations of the alignment problem focused heavily on value specification. That is, they focused on the idea that given a powerful optimizer, we need some way of specifying our values so that the powerful optimizer can create good outcomes.

Since then, researchers have identified a number of additional problems besides value specification. One of the biggest problems is that in a certain sense, we don't even know how to optimize for anything, much less a perfect specification of human values.

Let's assume we could get a utility function containing everything humanity cares about. How would we go about optimizing this utility function?

The default mode of thinking about AI right now is to train a deep learning model that performs well on some training set. But even if we were able to create a training environment for our model that reflected the world very well, and rewarded it each time it did something good, exactly in proportion to how good it really was in our perfect utility function... this still would not be guaranteed to yield a positive artificial intelligence.

This problem is not a superficial one either -- it is intri... (read more)

Signal boosting a Lesswrong-adjacent author from the late 1800s and early 1900s

Via a friend, I recently discovered the zoologist, animal rights advocate, and author J. Howard Moore. His attitudes towards the world reflect contemporary attitudes within effective altruism about science, the place of humanity in nature, animal welfare, and the future. Here are some quotes which readers may enjoy:

Oh, the hope of the centuries and the centuries and centuries to come! It seems sometimes that I can almost see the shining spires of that Celestial Civilisation that man is to build in the ages to come on this earth—that Civilisation that will jewel the land masses of this planet in that sublime time when Science has wrought the miracles of a million years, and Man, no longer the savage he now is, breathes Justice and Brotherhood to every being that feels.

But we are a part of Nature, we human beings, just as truly a part of the universe of things as the insect or the sea. And are we not as much entitled to be considered in the selection of a model as the part 'red in tooth and claw'? At the feet of the tiger is a good place to study the dentition of the cat family, but it is
... (read more)

I agree with Wei Dai that we should use our real names for online forums, including Lesswrong. I want to briefly list some benefits of using my real name:

  • It means that people can easily recognize me across websites, for example from Facebook and Lesswrong simultaneously.
  • Over time my real name has been stable whereas my usernames have changed quite a bit over the years. For some very old accounts, such as those I created 10 years ago, this means that I can't remember my account name. Using my real name would have averted this situation.
  • It motivates me to put more effort into my posts, since I don't have any disinhibition from being anonymous.
  • It often looks more formal than a silly username, and that might make people take my posts more seriously than they otherwise would have.
  • Similar to what Wei Dai said, it makes it easier for people to recognize me in person, since they don't have to memorize a mapping from usernames to real names in their heads.

That said, there are some significant downsides, and I sympathize with people who don't want to use their real names.

  • It makes it much easier for people to dox you. There are some very bad ways that this can manifest.
  • If
... (read more)
6Viliam
These days my reason for not using my full name is mostly this: I want to keep my professional and private lives separate. And I have to use my real name at my job, therefore I don't use it online. What I probably should have done many years ago is make up a new, plausible-sounding full name (perhaps keep my first name and just make up a new surname?), and use it consistently online. Maybe it's still not too late; I just don't have any surname ideas that feel right.
4jp
Sometimes you need someone to give the naive view, but doing so hurts the reputation of the person stating it. For example suppose X is the naive view and Y is a more sophisticated view of the same subject. For the sake of argument suppose X is correct and contradicts Y. Given 6 people, maybe 1 of them starts off believing Y. 2 people are uncertain, and 3 people think X. In the world where people have their usernames attached, the 3 people who believe X now have a coordination problem. They each face a local disincentive to state the case for X, although they definitely want _someone_ to say it. The equilibrium here is that no one makes the case for X and the two uncertain people get persuaded to view Y. However if someone is anonymous and doesn't care that much about their reputation, they may just go ahead and state the case for X, providing much better information to the undecided people. This makes me happy there are some smart people posting under pseudonyms. I claim it is a positive factor for the epistemics of LessWrong.
3Wei Dai
I agree with this, so my original advice was aimed at people who already made the decision to make their pseudonym easily linkable to their real name (e.g., their real name is easily Googleable from their pseudonym). I'm lucky in that there are lots of ethnic Chinese people with my name so it's hard to dox me even knowing my real name, but my name isn't so common that there's more than one person with the same full name in the rationalist/EA space. (Even then I do use alt accounts when saying especially risky things.) On the topic of doxing, I was wondering if there's a service that would "pen-test" how doxable you are, to give a better sense of how much risk one can take when saying things online. Have you heard of anything like that?
2JustMaier
Another issue I'd add is that real names are potentially too generic. Basically, if everyone used their real name, how many John Smiths would there be? Would it be confusing? The rigidity around 1 username/alias per person on most platforms forces people to adopt mostly memorable names that should distinguish them from the crowd.

Bertrand Russell's advice to future generations, from 1959

Interviewer: Suppose, Lord Russell, this film would be looked at by our descendants, like a Dead Sea scroll in a thousand years’ time. What would you think it’s worth telling that generation about the life you’ve lived and the lessons you’ve learned from it?
Russell: I should like to say two things, one intellectual and one moral. The intellectual thing I should want to say to them is this: When you are studying any matter or considering any philosophy, ask yourself o
... (read more)

When I look back at things I wrote a while ago, say months back, or years ago, I tend to cringe at how naive many of my views were. Faced with this inevitable progression, and the virtual certainty that I will continue to cringe at views I now hold, it is tempting to disconnect from social media and the internet and only comment when I am confident that something will look good in the future.

At the same time, I don't really think this is a good attitude for several reasons:

  • Writing things up forces my thoughts to be more explicit, improving my ability
... (read more)

People who don't understand the concept of "This person may have changed their mind in the intervening years" aren't worth impressing. I can imagine scenarios where your economic and social circumstances are so precarious that the incentives leave you with no choice but to let your speech and your thought be ruled by unthinking mob social-punishment mechanisms. But you should at least check whether you actually live in that world before surrendering.

4Viliam
In the real world, people usually forget what you said 10 years ago. And even if they don't, saying "Matthew said this 10 years ago" doesn't have the same power as you saying the thing now. But the internet remembers forever, and your words from 10 years ago can be retweeted and become alive as if you said them now. A possible solution would be to use a nickname... and whenever you notice you grew up so much that you no longer identify with the words of your nickname, pick up a new one. Also new accounts on social networks, and re-friend only those people you still consider worthy. Well, in this case the abrupt change would be the unnatural thing, but perhaps you could still keep using your previous account for some time, but mostly passively. As your real-life new self would have different opinions, different hobbies, and different friends than your self from 10 years ago, so would your online self. Unfortunately, this solution goes against the "terms of service" of almost all major websites. On the advertisement-driven web, advertisers want to know your history, and they are the real customers... you are only a product.

Related to: Realism about rationality

I have talked to some people who say that they value ethical reflection, and would prefer that humanity reflected for a very long time before colonizing the stars. In a sense I agree, but at the same time I can't help but think that "reflection" is a vacuous feel-good word that has no shared common meaning.

Some forms of reflection are clearly good. Epistemic reflection is good if you are a consequentialist, since it can help you get what you want. I also agree that narrow forms of reflection can also be ... (read more)

2limerott
The vague reflections you are referring to are analogous to somebody saying "I should really exercise more" without ever doing it. I agree that the mere promise of reflection is useless. But I do think that reflections about the vague topics are important and possible. Actively working through one's experiences, reading relevant books, discussing questions with intelligent people can lead to epiphanies (and eventually life choices), that wouldn't have occurred otherwise. However, this is not done with a push of a button and these things don't happen randomly -- they will only emerge if you are prepared to invest a lot of time and energy. All of this happens on a personal level. To use your example, somebody may conclude from his own life experience that living a life of purpose is more important to him than to live a life of happiness. How to formalize this process so that an AI could use a canonical way to achieve it (and infer somebody's real values simply by observing) is beyond me. It would have to know a lot more about us than is comfortable for most of us.

It's now been about two years since I started seriously blogging. Most of my posts are on Lesswrong, and most of the rest are scattered about on my substack and the Effective Altruist Forum, or on Facebook. I like writing, but I have a problem which I feel impedes me greatly.

In short: I often post garbage.

Sometimes when I post garbage, it isn't until way later that I learn that it was garbage. And when that happens, it's not that bad, because at least I grew as a person since then.

But the usual case is that I realize that it's garbage right after I... (read more)

4Viliam
I have a hope that with more practice, this gets better. Not just practice, but also noticing what other people do differently. For example, I often write long texts, which some people say is already a mistake. But even a long text can be made more legible if it contains section headers and pictures. Both of them break the visual monotonicity of the text wall. This is why section headers are useful even if they are literally: "1", "2", "3". In some sense, pictures are even better, because too many headers create another layer of monotonicity, which a few unique pictures do not. Which again suggests that having 1 photo, 1 graph, and 1 diagram is better than having 3 photos. I would say, write the text first, then think about which parts can be made clearer by adding a picture. There is some advice on writing, by Stephen King, or by Scott Alexander. If you post garbage, let it be. Write more articles, and perhaps at the end of a year (or a decade) make a list "my best posts" which will not include the garbage. BTW, whatever you do, you will get some negative response. Your posts on LW are upvoted, so I assume they are not too bad. Also, writing can be imbalanced. Even for people who only write great texts, some of them are more great and some of them are less great than the others. But if they deleted the worst one, guess what, now some other article is the worst one... and if you continue this way, you will end up with one or zero articles.
2Steven Byrnes
Sometimes I send a draft to a couple people before posting it publicly. Sometimes I sit on an idea for a while, then find an excuse to post it in a comment or bring it up in a conversation, get some feedback that way, and then post it properly. I have several old posts I stopped endorsing, but I didn't delete them; I put either an update comment at the top or a bunch of update comments throughout saying what I think now. (Last week I spent almost a whole day just putting corrections and retractions into my catalog of old posts.) I for one would have a very positive impression of a writer whose past writings were full of parenthetical comments that they were wrong about this or that. Even if the posts wind up unreadable as a consequence.

Should effective altruists be praised for their motives, or their results?

It is sometimes claimed, perhaps by those who recently read The Elephant in the Brain, that effective altruists have not risen above the failures of traditional charity, and are every bit as mired in selfish motives as non-EA causes. From a consequentialist view, however, this critique is not by itself valid.

To a consequentialist, it doesn't actually matter what one's motives are as long as the actual effect of their action is to do as much good as possible. This is the pri... (read more)

1Pattern
Evidence for this?
3Matthew Barnett
Hmm, I sort of assumed this was obvious. I suppose it depends greatly on how you can inspect whether they are actually trying, or whether they are just "trying." It's indeed probable that with sufficient supervision, you can actually do better by incentivizing effort. However, this method is expensive.

Sometimes people will propose ideas, and then those ideas are met immediately after with harsh criticism. A very common tendency for humans is to defend our ideas and work against these criticisms, which often gets us into a state that people refer to as "defensive."

According to common wisdom, being in a defensive state is a bad thing. The rationale here is that we shouldn't get too attached to our own ideas. If we do get attached, we become liable to become crackpots who can't give an idea up because it would make them look bad if we ... (read more)

6Wei Dai
A couple of relevant posts/threads that come to mind:
  • Individual vs. Group Epistemic Rationality
  • Raemon's recent shortform on adversarial debates producing positive externalities
5Viliam
Just like an idea can be wrong, so can be criticism. It is bad to give up the idea, just because:
  • someone rounded it up to the nearest cliche, and provided the standard cached answer;
  • someone mentioned a scientific article (that failed to replicate) that disproves your idea (or something different, containing the same keywords);
  • someone got angry because it seems to oppose their political beliefs;
  • etc.
My "favorite" version of wrong criticism is when someone experimentally disproves a strawman version of your hypothesis. Suppose your hypothesis is "eating vegetables is good for health", and someone makes an experiment where people are only allowed to eat carrots, nothing more. After a few months they get sick, and the author of the experiment publishes a study saying "science proves that vegetables are actually harmful for your health". (Suppose, optimistically, that the author used sufficiently large N, and did the statistics properly, so there is nothing to attack from the methodological angle.) From now on, whenever you mention that perhaps a diet containing more vegetables could benefit someone, someone will send you a link to the article that "debunks the myth" and will consider the debate closed. So, when I hear about research proving that parenting / education / exercise / whatever doesn't cause this or that, my first reaction is to wonder how specifically did the researchers operationalize such a general word, and whether the thing they studied even resembles my case. (And yes, I am aware that the same strategy could be used to refute any inconvenient statement, such as "astrology doesn't work" -- "well, I do astrology a bit differently than the people studied in that experiment, therefore the conclusion doesn't apply to me".)

I keep wondering why many AI alignment researchers aren't using the alignmentforum. I have met quite a few people who are working on alignment who I've never encountered online. I can think of a few reasons why this might be:

  • People find it easier to iterate on their work without having to write things up
  • People don't want to share their work, potentially because they think a private-by-default policy is better.
  • It is too cumbersome to interact with other researchers through the internet. In-person interactions are easier
  • They just haven't even considered from a first person perspective whether it would be worth it

I've often wished that conversation norms shifted towards making things more consensual. The problem is that when two people are talking, it's often the case that one party brings up a new topic without realizing that the other party didn't want to talk about that, or doesn't want to hear it.

Let me provide an example: Person A and person B are having a conversation about the exam that they just took. Person A bombed the exam, so they are pretty bummed. Person B, however, did great and wants to tell everyone. So then person B comes up to... (read more)

5Matt Goldenberg
Have you read the posts on ask, tell, and guess culture? They feel highly related to this idea.
4Raemon
Malcolm Ocean eventually reframed Tell Culture as Reveal Culture, which I found to be an improvement.
1Matthew Barnett
Hmm, I saw those a while ago and never read them. I'll check them out.
2Dagon
The problem is, if a conversational topic can be hurtful, the meta-topic can be too. "do you want to talk about the test" could be as bad or worse than talking about the test, if it's taken as a reference to a judgement-worthy sensitivity to the topic. And "Can I ask you if you want to talk about whether you want to talk about the test" is just silly. Mr-hire's comment is spot-on - there are variant cultural expectations that may apply, and you can't really unilaterally decide another norm is better (though you can have opinions and default stances). The only way through is to be somewhat aware of the conversational signals about what topics are welcome and what should be deferred until another time. You don't need prior agreement if you can take the hint when an unusually-brief non-response is given to your conversational bid. If you're routinely missing hints (or seeing hints that aren't), and the more direct discussions are ALSO uncomfortable for them or you, then you'll probably have to give up on that level of connection with that person.
1Matthew Barnett
I agree. Although if you are known for asking those types of questions maybe people will learn to understand you never mean it as a judgement. True, although I'll usually take silly over judgement any day. :)

Reading through the recent Discord discussions with Eliezer, and reading and replying to comments, has given me the following impression of a crux of the takeoff debate. It may not be the crux. But it seems like a crux nonetheless, unless I'm misreading a lot of people. 

Let me try to state it clearly:

The foom theorists are saying something like, "Well, you can usually-in-hindsight say that things changed gradually, or continuously, along some measure. You can use these measures after-the-fact, but that won't tell you about the actual gradual-ness of t... (read more)

2Adele Lopez
I lean toward the foom side, and I think I agree with the first statement. The intuition for me is that it's kinda like p-hacking (there are very many possible graphs, and some percentage of those will be gradual), or using a log-log plot (which makes everything look like a nice straight line, but is actually a very broad prediction when properly accounting for uncertainty). Not sure if I agree with the addendum or not yet, and I'm not sure how much of a crux this is for me yet.

There have been a few posts about the obesity crisis here, and I'm honestly a bit confused about some theories that people are passing around. I'm one of those people who thinks that the "calories in, calories out" (CICO) theory is largely correct, relevant, and helpful for explaining our current crisis.

I'm not actually sure to what extent people here disagree with my basic premises, or whether they just think I'm missing a point. So let me be more clear.

As I understand, there are roughly three critiques you can have against the CICO theory. You can think it... (read more)

7Viliam
As it seems to be typically used, literal CICO as an observation is the motte, and the corresponding bailey is something like: "yes, it is simple to lose weight, you just need to stop eating all those cakes and start exercising, but this is the truth you don't want to hear so you keep making excuses instead". How do you feel about the following theory: "atoms in, atoms out"? I mean, this one should be scientifically even less controversial. So why do you prefer the version with calories over the version with atoms? From the perspective of "I am just saying it, because it is factually true, there is no judgment or whatever involved", both theories are equal. What specifically is the advantage of the version with calories? (My guess is that the obvious problem with the "atoms in, atoms out" theory is that the only actionable advice it hints towards is to poop more, or perhaps exhale more CO2... but the obvious problem with such advice is that the fat people do not have conscious control over extracting fat from their fat cells and converting it to waste. Otherwise, many would willingly convert and poop it out in one afternoon and have their problem solved. Well, guess what, the "calories in, calories out" has exactly the same problem, only in less obvious form: if your metabolism decides that it is not going to extract fat from your fat cells and convert it to useful energy which could be burned in muscles, there is little you can consciously do about it; you will spend the energy outside of your fat cells, then you are out of useful energy, end of story, some guy on the internet unhelpfully reminding you that you didn't spend enough calories.)
3Matthew Barnett
Well, let me consider a recent, highly upvoted post on here: A Contamination Theory of the Obesity Epidemic. In it, the author says that the explanation for the obesity crisis can't be CICO. If CICO is literally true, in the same way that the "atoms in, atoms out" theory is true, then this debunking is very weak. The obesity epidemic must be due to either overeating or lack of exercise, or both. The real debate is, of course, over which environmental factors caused us to eat more, or exercise less. But if you don't even recognize that the cause must act through this mechanism, then you're not going to get very far in your explanation. That's how you end up proposing that it must be some hidden environmental factor, as this post does, rather than more relevant things related to the modern diet. My own view is that the most likely cause of our current crisis is that modern folk have access to more and a greater variety of addicting processed food, so we end up consistently overeating. I don't think this theory is obviously correct, and of course it could be wrong. However, in light of the true mechanism behind obesity, it makes a lot more sense to me than many other theories that people have proposed, especially any that deny we're overeating from the outset.
4Viliam
Well, here is the point where we disagree. My opinion is that CICO, despite being technically true, focuses your attention on eating and exercise as the most relevant causes of obesity. I agree with the statement "calories in = calories out" as observation. I disagree with the conclusion that the most relevant things for obesity are how much you eat and how much you exercise. And my aversion against CICO is that it predictably leads people to this conclusion. As you have demonstra