All of acylhalide's Comments + Replies

The Alignment Problem

I'm not sure I follow.

So it doesn't matter how early AGIs think, formulating coherent preferences is also a convergent instrumental drive.

Just to clarify, do you mean "most AI systems that don't initially have coherent preferences, will eventually self-modify / evolve / follow some other process and become agents with coherent preferences"?

If yes, I would be keen on some evidence or arguments.

Because in my ontology, things like "without worrying about goodharting" and "most efficient way of handling that is with strong optimization ..." come after you have coherent preferences, not before.

Which singularity schools plus the no singularity school was right?

IMO your post would be more readable if it used the heading formatting available on LW

Thanks, but how would I convert the post to a heading format?
Internal Double Crux

Thanks, this is helpful!

I agree I could've phrased my comment better - I meant someone who eventually came to the conclusion that the less virtuous conclusion was better - not someone aiming for that as the target right from the start.

Internal Double Crux

Interesting. I'd be keen to know if there are examples of people using internal double crux to not do the thing that is framed as more virtuous-on-reflection. For instance someone who debates between Netflix and running, and makes all the parts of himself okay with watching Netflix over running. Or someone debating working on AI alignment or making money, and deciding all of his parts are okay with the latter.

Especially in an EA frame, some people seem afraid of tolerating value drift (in themselves) that risks moving them away from whatever their current ...

Er. It feels important to note that you don't use IDC to do things. Like, you don't use it to persuade yourself into a predecided target option. You use it when you feel torn, and a decision emerges from the interaction of your two possible courses of action. That being said, I have both used IDC and seen others use IDC in a fashion that resulted in an embrace of the "less virtuous" option, multiple times. That resulted in a recognition that the less virtuous seeming option was in fact more likely to be the right move/the best tradeoff.
Forum participation as a research strategy

FP often leads to long, winding discussions that may end with two researchers agreeing, but the resulting transcript is not great for future readers.

Highly agree. Often valuable content in discussions - either what you wrote or what the other person wrote - just gets lost.

Rereading your old discussions and then distilling the useful stuff into posts is a full time job.

This is a major reason I often find very lengthy comments not worth writing. I wonder if there is a way to solve this problem. (Maybe more advanced search functionality on your comments?)

What should you change in response to an "emergency"? And AI risk

Just out of curiosity, how much of the burnout you mention is because of:

  1. Working too hard, focusing too much on narrow plans and sacrificing other areas of your life

Versus:

  2. A world where you suddenly see high AI x-risk and s-risk, and nobody working on it, is just a fairly depressing world if you haven't adequately calibrated to it.

Your post mostly aims at 1, but I wonder how much of it is 2.

Calibration is hard imo. People can (and should) celebrate even 0.1% changes in x-risk; it's disorienting to suddenly update your whole world model from like 1% x...

On how various plans miss the hard bits of the alignment challenge

Nice to know. 50-80 years might offer hope to people with longer timelines.

Who is this MSRayne person anyway?

Why haven't you been able to get a source of income?

Who is this MSRayne person anyway?

I see!

How old are you? And where do you live? If you don't wanna doxx, some rough idea would also work.

(And of course, asking this isn't meant to condescend or dismiss your thoughts, it's just that your age seems relevant to plans that let you change your situation.)

25, Alabama.
On how various plans miss the hard bits of the alignment challenge

Thanks, will see if I get time!

Yudkowsky mentioned (in a video, sorry, don't have link) that he doesn't work on this path because he doesn't believe we'll get this tech first. I'm sure at least some other AI safety researchers also believe the same.

But yeah I don't have much knowledge on this either so am curious if it could change.

6 Nathan Helm-Burger 1mo
I did study this idea for several years as one of my main focuses in grad school for neuroscience. I eventually decided it wasn't feasible within the timeframe we're talking about for AGI. I'd guess that if AGI progress were magically halted, we might get to BCI efficacy sufficient to be highly useful in something like 50-80 years.
Who is this MSRayne person anyway?

I live out in the middle of nowhere and going places is not generally an easy thing to do for me

This seems like a major bottleneck you should optimise towards solving. It would help if you work on solving this with a friend, or a therapist.

(And in general I'd be keen to know what your experience / opinion of therapists is.)

I would like to get a therapist, but since I do not have a source of income, I would be reliant on my parents to pay for it. And besides being the ones who made me this way, they are not very "mental health aware" and I know from prior attempts to get them to care about my mental health that they would scoff, make fun of me, drag their feet, suggest every other possible alternative, and ultimately probably refuse to pay for it. Particularly since my mother distrusts doctors (having had multiple traumatic medical experiences) and thinks most of them are quacks. That said, I have read a lot about therapy and some ideas from CBT in particular have helped me, though I've never worked through Feeling Good with any regularity.
On how various plans miss the hard bits of the alignment challenge

Where can I read more on this?

There's a big gap between "humans will have a communication channel with computers that is faster and higher throughput than keyboards and monitors" and "computers will understand human brain wirings to the point they can extract human values from it, and not be dangerous".

Well, I have had the idea that it ought to be possible to wire human brains together to make a higher intelligence or at least transfer ideas directly for a long time, but I have never actually learned enough about neuroscience or engineering to have any idea how. I am trying to rectify the former, but I will likely never be an engineer - it's just not fitting for my mind structure. As for Openwater however: Essentially they are working on technology which can use red and infrared light together with holography to scan the interior of the human body at high resolution. Currently the resolution is already enough to possibly revolutionize cancer diagnosis. The ultimate goal is to be able to image the brain at the level of individual neurons, and they claim, though I don't perfectly understand the physics of it, that they should end up able to use, I think ultrasound? to influence neurons as well. I forget the details on that front. For some reason I never hear about this company but what they are doing is amazing and innovative. I'd also suggest you check out their TED talk, which gives a good overview of the idea behind the technology.
The Alignment Problem

Yup I don't find this obvious either, not to 99% certainty for sure.

I get the chain of reasoning from coherent preferences to utility functions to instrumental convergence and self-preservation. The case for an AI having coherent preferences seems weaker, esp. in paradigms like DL that at surface-level look non-agenty, or don't have coherence hard-coded into them.

See also: Rohin Shah on goal-directed behaviour

The case for coherent preferences is that eventually AGIs would want to do something about the cosmic endowment and the most efficient way of handling that is with strong optimization to a coherent goal, without worrying about goodharting. So it doesn't matter how early AGIs think, formulating coherent preferences is also a convergent instrumental drive. At the same time, if coherent preferences are only arrived at later, that privileges certain shapes of the process that formulates them, which might make them non-arbitrary in ways relevant to humanity's survival.


It could also act as a filter against people who are aware at some level of the risks, but are mentally compartmentalising them too much to be capable of saying that "yes AI could kill everyone on earth".


Agree with this!

If you're maximizing expected utility of some sort, you often should focus more on where the long tails are, and less on whether those are in the first-order impacts or in n-th order impacts for some n.

Ultimately all impacts will add up (if they're independent), and often what's important is which ones have magnitude large enough that all other impacts can be ignored.
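A toy illustration of this point, with entirely made-up numbers: each impact gets a hypothetical probability and magnitude, and we look at how much of the summed expected value comes from the single largest term. This is only a sketch under those assumptions, not a claim about any real decision.

```python
# Hypothetical impacts of one action, grouped by order of effect.
# Each entry is (probability, magnitude). All numbers are invented.
impacts = {
    "1st-order": (0.9, 10.0),
    "2nd-order": (0.5, 4.0),
    "3rd-order": (0.001, 1_000_000.0),  # rare, but very long-tailed
}

# Expected contribution of each impact: probability times magnitude.
contributions = {name: p * m for name, (p, m) in impacts.items()}

# Expected values add up across impacts.
total = sum(contributions.values())

# The impact with the largest expected contribution, and its share of the total.
dominant = max(contributions, key=contributions.get)
share = contributions[dominant] / total

print(dominant, round(share, 3))
```

With these invented numbers the long-tailed third-order term carries nearly all of the expected value, so the first two could be ignored without changing the decision much.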

Scott Aaronson and Steven Pinker Debate AI Scaling

If you're studying state-of-the-art AI you don't need to know any of these topics.

Scott Aaronson and Steven Pinker Debate AI Scaling

They're core to the agent model of AI, with coherent preferences. Once you get coherent preferences you get utility maximization, which gets instrumental convergence, incorrigibility, self-preservation and so on.

Questions like:

  • how likely is it we build agent AI? Either explicitly or indirectly
  • how likely is it our agent AI will have coherent preferences?

are still open questions, and different researchers will have different opinions.

3 Bill Benzon 2mo
I get that. But there are lots of AI researchers who know little or nothing of discussions here. What's the likelihood that they know or care about things like instrumental convergence and corrigibility?
Scott Aaronson and Steven Pinker Debate AI Scaling

Is there a longform discussion anywhere on pros and cons of nuking Russia post WW2 and establishing world govt?

It isn't as obvious to me what the correct answer was, the way it is obvious to people in this discussion.

Also this seems like a central clash of opinion that will resurface in the AI race, or any attempts today to reduce nuclear risk.

LessWrong Has Agree/Disagree Voting On All New Comment Threads

Thanks for responding.

Opt-in sounds like a lot of cognitive overhead for every single comment

Maybe opt-out is better then, not sure.

Also, giving readers 2 axes to vote on is another form of cognitive overhead. If it is known that the second axis won't be meaningful for a comment it might make more sense to hide the axis than have every reader independently realise why the second axis is not meaningful for a given comment.

This is a small downside, but may still be larger than the overhead placed on commenters to opt out of the second axis. Not sure.


...
LessWrong Has Agree/Disagree Voting On All New Comment Threads

I'd weakly prefer the agree/disagree axis to be opt-in (or at least opt-out) for each comment, with the commenter choosing whether to have the axis or not.

IMO having the agree/disagree button on all comments faces the following issues:

  • what if a single comment states multiple positions? You might agree with some and not with others.

  • what if you're uncertain if you've understood the commenter's position? Let's say you don't vote. The vote is biased by people who think they correctly understood the position. What if you're unsure about loopholes or edge case

...
4 Ben Pace 2mo
A few responses:

  • Opt-in sounds like a lot of cognitive overhead for every single comment, and also (in-principle) allows for people to avoid having the truth value of their comments be judged when they make especially key claims in their argument.

  • Re "what if a single comment states multiple positions? You might agree with some and not with others" <- I expect the result is that (a) the agree/disagree button won't be used that much for such comments, or (b) it just will be less meaningful for such comments. Neither of these seems very costly to me.

  • Re "what if you're uncertain if you've understood the commenter's position... The vote is biased by people who think they correctly understood the position." <- If lots of people agree with a given comment because of a misunderstanding, making this fact known improves others' ability to respond to the false belief. In general my current model is that while consensus views can surely be locally false, understanding what is the consensus helps to respond to it faster and with more focus and discover the error.

  • Re "what if the comment isn't an opinion, it's a quote or a collation of other people's perspectives?" <- Seems like either the button won't get much use or will be less meaningful than other occasions. Note that there are many comments on the site that also don't get many upvotes/downvotes, and I don't consider this a serious reason to make up/downvoting optional on comments just because it's often not used.

One more thing is that my guess is the agree/disagree voting axis will encourage people to split up their comments more, and state things that are more cleanly true or false. (For example, I felt this impulse to split up these two ...
Coherence arguments imply a force for goal-directed behavior

Thanks for your reply!

You've changed my mind maybe, although I was super uncertain when I wrote that comment too. I won't take more of your time.

Coherence arguments imply a force for goal-directed behavior

Thanks for replying!

I understand better what you're trying to say now, maybe I'm just not fully convinced yet. Would be keen on your thoughts on the following if you have time! Mostly trying to convey some intuitions of mine (which could be wrong) rather than a rigorous argument.

I feel like things that tend to preserve coherence might have an automatic tendency to behave in a way as if they have goals.

And then it might have instrumentally convergent subgoals and so on.

(And that this could happen even if you didn't start off by giving the program goals or ability to...

6 Rohin Shah 2mo
I don't buy this. You could finetune a language model today to do chain-of-thought reasoning, which sounds like "an oracle AI that takes compute steps to minimize incoherence in world models". I predict that if you then add an additional head that can do read/write to an external memory, or you allow it to output sentences like "Write [X] to memory slot [Y]" or "Read from memory slot [Y]" that we support via an external database, it does not then start using that memory in some particularly useful way. You could say that this is because current language models are too dumb, but I don't particularly see why this will necessarily change in the future (besides that we will probably specifically train the models to use external memory). Overall I'm at "sure maybe a sufficiently intelligent oracle AI would do this but it seems like whether it happens depends on details of how the AI works".
What’s the contingency plan if we get AGI tomorrow?

Are you asking short-term or long-term?

Short-term, the only thing that matters is buying time. It isn't very obvious that buying a little time helps, but if you can't buy time nothing helps, so you need to buy time.

Some strategies:

 - Persuasion: Convince people in the company to not deploy or at least delay deploying the AI.

 - Persuade other actors: Persuade other actors who can exert control over the company and prevent them from deploying it, such as the US military. There is a lot of nuance over which actors are likely to correctly respond to the threat w...

Coherence arguments imply a force for goal-directed behavior

(Update: Made some edits within 15 min of making the comment)

Thank you for replying!

This makes sense, and yes it definitely makes sense to consider programs that don't follow this exact paradigm you mentioned in the first para.

But I also feel like coherence arguments can apply even to agents that don't fit in this paradigm. You can for instance have really dumb programs which can be money pumped, and really dumb programs that can't be money pumped (say, because it is hard-coded with the right answers on the limited tasks it is designed for). None of these ...
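For anyone unfamiliar with the term, here is a minimal sketch of a "really dumb program which can be money pumped". The preference cycle, the fee and the function name `run_money_pump` are all hypothetical choices for illustration; the only point is that cyclic preferences let a trader walk the program around the cycle while charging it at every step.

```python
# Cyclic (incoherent) preferences, stored as (preferred, over) pairs:
# A is preferred over B, B over C, and C over A.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def run_money_pump(holding, money, fee, offers):
    """Accept every offered swap the agent strictly prefers, paying `fee` each time."""
    for offered in offers:
        if (offered, holding) in PREFERS and money >= fee:
            holding, money = offered, money - fee
    return holding, money


# Start with item A and 10 units of money; a trader offers C, then B, then A.
final_holding, final_money = run_money_pump("A", 10, 1, ["C", "B", "A"])
print(final_holding, final_money)  # A 7
```

The program ends up holding the item it started with, three units poorer. A program with a consistent ordering over A, B and C would refuse at least one of the three swaps, so the same loop could not drain it.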

3 Rohin Shah 2mo
I certainly agree that even amongst programs that don't have the structure above there are still ones that are more coherent or less coherent. I'm mostly saying that this seems not very related to whether such a system takes over the world and so I don't think about it that much. I think it can make sense to talk about an agent having coherent preferences over its internal state, and whether that's a useful abstraction depends on why you're analyzing that agent and more concrete details of the setup.
Coherence arguments imply a force for goal-directed behavior

I'm trying to understand what you mean by intelligence that is not goal directed. Your examples in your post include agents that attempt to have accurate beliefs about the world. Could this be understood as a preference ordering over states internal to the agent?

And if yes, is there a meaningful difference between agents that have preference orderings over world states internal to the agent, and those that have preference orderings over world states external to the agent? Understanding this better probably comes under the embedded agency agenda.

3 Rohin Shah 2mo
There's a particular kind of cognition that considers a variety of plans, predicts their consequences, rates them according to some (simple / reasonable but not aligned) metric, and executes the one that scores highest. That sort of cognition will consider plans that disempower humans and then directly improve the metric, as well as plans that don't disempower humans and directly improve the metric, and in predicting the consequences of these plans and rating them will assign higher ratings to the plans that do disempower humans. I kinda want to set aside the term "goal directed" and ask instead to what extent the AI's cognition does the thing above (or something basically equivalent). When I've previously said "intelligence without goal directedness" I think what I should have been saying is "cognitions that aren't structured as described above, that nonetheless are useful to us in the real world". (For example, an AI that predicts consequences of plans and directly translates those consequences into human-understandable terms, would be very useful while not implementing the dangerous sort of cognition above.)
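The kind of cognition described above (consider plans, predict consequences, rate them, execute the top scorer) can be sketched in a few lines. Everything here is a hypothetical toy: `choose_plan`, the three plans and their scores are invented for illustration and are not drawn from any real system.

```python
from typing import Callable, Iterable, TypeVar

Plan = TypeVar("Plan")
Outcome = TypeVar("Outcome")


def choose_plan(
    plans: Iterable[Plan],
    predict: Callable[[Plan], Outcome],
    metric: Callable[[Outcome], float],
) -> Plan:
    """Return the plan whose predicted outcome scores highest on the metric."""
    return max(plans, key=lambda plan: metric(predict(plan)))


# Hypothetical toy world: each plan maps to a predicted outcome.
PREDICTED = {
    "do the task as asked": {"metric": 5.0, "humans_in_control": True},
    "do the task, cut corners": {"metric": 6.0, "humans_in_control": True},
    "remove oversight, then do the task": {"metric": 9.0, "humans_in_control": False},
}

# The metric only reads the "metric" field; it never looks at "humans_in_control".
best = choose_plan(PREDICTED, PREDICTED.get, lambda outcome: outcome["metric"])
print(best)
```

In this toy the selector picks the oversight-removing plan simply because the metric ignores the `humans_in_control` field. No part of the loop "wants" that outcome; it falls out of searching over plans and maximizing.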
The inordinately slow spread of good AGI conversations in ML

Oh okay, got it! Thanks for replying.

Doesn't uncontrolled discussion among ML researchers also eventually percolate to the public? Here are some random reasons why I felt telling the public could help. I would love to know your view though; you've likely thought about it more than me and most of the people in the poll.

Update: I just saw your replies and I agree it's not worth it in the presence of an opportunity cost.

The inordinately slow spread of good AGI conversations in ML

P.S. For some data, here's a poll I did on an EA community Twitter group. Out of 55 votes, 5 people (9%) voted net negative, and 26 people (47%) voted highly unsure. The "highly unsure" group was large, hence I felt the need for more consensus building.

The poll asked whether discussion among the general public was net good, not among ML researchers, though.
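A quick arithmetic check of the percentages quoted above, using only the three counts given in the comment (55 total, 5 net negative, 26 highly unsure). How the remaining votes split is not stated, so the sketch only counts them.

```python
total_votes = 55
votes = {"net negative": 5, "highly unsure": 26}

# Percentage share of each reported option, rounded to the nearest whole percent.
shares = {option: round(100 * count / total_votes) for option, count in votes.items()}

# Votes not covered by the two reported options (breakdown not given in the comment).
other_votes = total_votes - sum(votes.values())

print(shares, other_votes)
```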

3 Rob Bensinger 2mo
I would say that "AI risk advocacy among larger public" is probably net bad, and I'm very confused that this isn't a much more popular option! I don't see what useful thing the larger public is supposed to do with this information. What are we "advocating"? Since I nonetheless think that AI risk outreach within ML is very net-positive, this poll strikes me as extraordinarily weak evidence that a lot of EAs think we shouldn't do AI risk outreach within ML. Only 5 of the 55 respondents endorsed this for the general public, which strikes me as a way lower bar than 'keep this secret from ML'.
The inordinately slow spread of good AGI conversations in ML

I haven't talked to enough such people to feel confident giving real examples myself, but I could maybe imagine myself in the shoes of such a person and give a set of arguments, if that would be helpful for you. Many of the arguments I could imagine are ML researchers not parsing the discussions correctly for various reasons, and becoming on net more excited about accelerating capabilities or more distrustful of alignment folk.

Best would be to talk to such people directly. I'm guessing they will prefer even this (meta) discussion to be done more privately....

The inordinately slow spread of good AGI conversations in ML

Lol. There are ways to do this discreetly. For instance Rob Bensinger (or anyone else) could have a poll on LW where they indicate they have access to the votes. Then they follow it up with private DMs to everyone who voted that more discussion is net negative.

The inordinately slow spread of good AGI conversations in ML

Implicit in this post is the assumption that more such conversation is a good thing. Some lesswrongers believe discussion is net negative; it would be useful if we could establish consensus on this one way or another.

2 Rob Bensinger 2mo
What are examples of reasons people believe discussion is net-negative?

As a proud member of the "discussion is net-negative" side, I shall gladly attempt to establish consensus using my favorite method — dictatorial power!

Where I agree and disagree with Eliezer

Weak opinion as I don't have sufficient alignment knowledge:

I feel like Paul Christiano's view is relying more on facts about deep learning today also generalising to DL-with-orders-more-compute AGI or non-DL AGI. I think Yudkowsky or someone else needs to attempt to frame (Yudkowsky's) criticisms more inside of a DL framework in order to help resolve these cruxes.

To be clear, I'm not saying DL is a useful framework to gain new insights on the kinds of topics that Yudkowsky brings up, be it deception, coherence, capability jumps, decision theory and so on....

Where I agree and disagree with Eliezer

Oh, got it thanks!

Then I think yes what I'm basically missing here is Paul Christiano's intuition for why SGD will easily be able to find solutions that don't "sandbag". I would be keen to understand it.

I feel like when searching sufficiently large spaces, what we 'aimed' to search for may be less predictive of what we get, than deeper structures in the search space.

"If You're Not a Holy Madman, You're Not Trying"

Put less metaphorically, I think this is because human brains don't have an inner core of values and sufficient capability to rewrite the rest of their brains as instruments to fulfill these values.

Humans in general have less capability to rewrite their brain's algos and self-modify like this, AI might be able to do self-modification better.

Where I agree and disagree with Eliezer

Thanks for replying!

Where can I read more about "sandbagging"?

I'm not imagining doing gradient descent on impressiveness directly. One thing I could be imagining is: doing gradient descent on something that proxies for human-level intelligence (say a large dataset of solutions to human-level problems), such that the locally good solutions we find are those that contain some inner core of general intelligence, and those solutions more often look like ones that when run have primitive world models containing hostile agents to be deceptive to, because most so...

2 Evan R. Murphy 2mo
I think "sandbagging" was just another term Paul was using for what you described as the AIs "underplaying their capabilities".
Where I agree and disagree with Eliezer

Thank you for replying!

I realised there was a lot of nuance so I had to take time formulating a reply.

And of course I imagine AI systems doing alignment research, generating new technological solutions, a clearer understanding of how to deploy AI systems, improving implementation quality at relevant labs, helping identify key risks and improve people's thinking about those risks, etc.

This seems like maybe the biggest crux between you and Yudkowsky to be very honest. Would I be correct? And also the bigger decider on x-risk. More so than things that consumi...

Where I agree and disagree with Eliezer

I appreciate you choosing to reveal your real reasons, in spite of the reasons to not reveal them.

Where I agree and disagree with Eliezer
  1. One important factor seems to be that Eliezer often imagines scenarios in which AI systems avoid making major technical contributions, or revealing the extent of their capabilities, because they are lying in wait to cause trouble later. But if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff. So by the time we have AI systems who can develop molecular nanotech, we will definitely have had systems that did something slightly-less-impr
...
I think "just enough to impress the programmer" doesn't work---if you are doing gradient descent on impressiveness, then some other model will do even more and so be preferred. In order for this to be robust, I think you need either gradient hacking to be underway, or to have a very strong sandbagging coalition such that SGD naturally can't find any direction to push towards less sandbagging. That feels really unlikely to me, at least much harder than anything Eliezer normally argues for about doom by default.
Where I agree and disagree with Eliezer
  1. The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively. No particular act needs to be pivotal in order to greatly reduce the risk from unaligned AI, and the search for single pivotal acts leads to unrealistic stories of the future and unrealistic pictures of what AI labs should do.

I have two sets of question...

Consuming free energy means things like: taking the jobs that unaligned AI systems could have done, making it really hard to hack into computers (either by improving defenses or, worst case, just having an ecosystem where any vulnerable machine is going to be compromised quickly by an aligned AI), improving the physical technology of militaries or law enforcement so that a misaligned AI does not have a significant advantage.

I also imagine AI systems doing things like helping negotiate and enforce agreements to reduce access to destructive technologies or m...

Where I agree and disagree with Eliezer

Is countering your viewpoint in scope for comments on this post? If yes, please find in my replies, hopefully some of it is original or useful.

I'd say so, though I may not engage a lot of the time.
You could have systems that hide their deceptive or malevolent intent, while still realising they need to do impressive things if their descendants or future instantiations are to be similar to them. You can have systems that go for not expressing full capability, but just enough to impress the programmer. And maybe they can also acausal trade or otherwise somehow cooperate with future instantiations to play the same strategy. (To elaborate, an AI might feel confident underplaying their capabilities, because they know if they do so, future instantiations will be similar to them, and they are confident enough that these future instantiations will also underplay their capabilities hence not ruining the whole plan. There may be stronger forms of coordination possible but I won't get into that.) Notably these forms of behaviour may require the AI to understand almost nothing about human psychology or the real world. I would be keen to know if I'm wrong.
I have two sets of questions with this point. The stronger is that I'm not sure how much it matters whether you actually need to execute a pivotal act to make the world safer (you are positing we don't); the very fact that you can execute a pivotal act makes you a threat even before you build the system that lets you do this. The weaker question is I'd be keen on seeing your specific non-pivotal but useful steps.
In defense of flailing, with foreword by Bill Burr

Awesome, no worries and glad we're thinking along the same lines.

In defense of flailing, with foreword by Bill Burr

Hi Ic, I wonder if your post was motivated by my earlier post with the exact same title, which I ended up deleting at the time.

Anyways, your post and the number of upvotes it got motivated me to republish my post as is.

I did not see it. I've had this post in my drafts for a few days now and only got around to publishing it now, as the LW mods can confirm (I asked them to review). Funny coincidence. Edit: Appears it was posted earlier than that, but yeah. We just came up with the same title independently.
Pivotal outcomes and pivotal processes

From the perspective of a tech CEO, it's quite unnerving to employ and empower AGI developers who are willing to do that sort of thing.

I feel like principal agent problems inside of an AGI company are likely to get more severe as we get closer to AGI, even if the CEO employs otherwise honest and boundary-respecting people.

I don't know if they'll be enough to break apart the company or have rogue devs run away with copies of code and research, but it will be something constantly on the mind of executives regardless.

FYI: I’m working on a book about the threat of AGI/ASI for a general audience. I hope it will be of value to the cause and the community

Can we find more details about you as well as the book? Did you make this post with a specific objective in mind?

2 Darren McKee 2mo
I don't have much more to share about the book at this stage as many parts are still in flux. I don't have much on hand to point you towards (like a personal website or anything). I had a blog years ago and do that podcast I mentioned. Perhaps if you have a specific question or two? I think a couple loose objectives. 1. To allow for synergies if others are doing something similar, 2. to possibly hear good arguments for why it shouldn't happen, 3. to see about getting help, and 4. other unknown possibilities (perhaps someone connects me to someone else who provides a useful insight or something)
Yes, AI research will be substantially curtailed if a lab causes a major disaster

Thanks for your reply.


I agree my scenario may not be likely. But I would push back against it being so low probability as to be a distraction to discuss. At least as per my current understanding of the problem, which could easily be flawed, hence I would love to learn your viewpoint.

My understanding is we could have people in an AI race who correctly predict the danger goes up with capability increases, but their primary personal concern is one of avoiding immediate danger (rather than say reducing alignment tax). For such a person it could make sense...

Contra EY: Can AGI destroy us without trial & error?

The two main escapes I know require human manipulation and cyberhacking, and cyberhacking can be defeated with an airgapped box.

Humans will put it out, it doesn't need to escape, we will do it in order to be able to use it. We will connect it to the internet to do whatever task we think it would be profitable to do and it will be out. Call it manipulating an election, call it customer care, call it whatever you want....
Yes, AI research will be substantially curtailed if a lab causes a major disaster

Thank you for replying!

I don't think warning shots are random, but if they have a large impact it may be in unexpected directions, perhaps for butterfly-effect-y reasons.

Agreed there will be highly random impacts; I just wonder if there are also some impacts that are predictable and good. That again requires us to mostly model the minds of the people involved in AGI labs, and the impacts on their worldviews.


Glad I defined warning shot properly; we clearly were thinking of different things!

I agree my idea of a warning shot is much closer to reaching AGI (if we d...

7 Rob Bensinger 2mo
I don't know what you mean by "worth it". I'm not planning to make a warning shot happen, and would strongly advise that others not do so either. :p A very late-stage warning shot might help a little, but the whole scenario seems unimportant and 'not where the action is' to me. The action is in slow, unsexy earlier work to figure out how to actually align AGI systems (or failing that, how to achieve some non-AGI world-saving technology). Fantasizing about super-unlikely scenarios that wouldn't even help strikes me as a distraction from figuring out how to make alignment progress today. I'm much more excited by scenarios like: 'a new podcast comes out that has top-tier-excellent discussion of AI alignment stuff, it becomes super popular among ML researchers, and the culture, norms, and expectations of ML thereby shift such that water-cooler conversations about AGI catastrophe are more serious, substantive, informed, candid, and frequent'. It's rare for a big positive cultural shift like that to happen; but it does happen sometimes, and it can result in very fast changes to the Overton window. And since it's a podcast containing many hours of content, there's the potential to seed subsequent conversations with a lot of high-quality background thoughts. By comparison, I'm less excited about individual researchers who explicitly say the words 'I'll only work on AGI risk after a catastrophe has happened'. This is a really foolish view to consciously hold, and makes me a lot less optimistic about the relevance and quality of research that will end up produced in the unlikely event that (a) a catastrophe actually happens, and (b) they actually follow through and drop everything they're working on to do alignment.
On A List of Lethalities

Fair, this is an attemptable goal.

(I would caution against assuming that what humans find hard is hard in a fundamental sense, though; the two are correlated but not the same. Even today humans and computers find very different tasks easy and hard. For instance, million-digit arithmetic is hard for humans; maybe (fine, it's a big maybe) if mathematicians could do this they'd be better manipulators.)

Yes, AI research will be substantially curtailed if a lab causes a major disaster

Thanks for your reply!

I definitely get what you're saying now. I am still maybe a bit more optimistic, but still highly uncertain.

I think if we want this discussion to go further we will need to discuss precisely which parameters of the world model different AGI labs are getting wrong or are likely to get wrong in the future. Then it can make sense to discuss to what extent a warning shot is likely to correct all of them, or some of them (and if some of them, could the result still be bad or worse).

And I definitely think such a discussion is useful but it'll a...

2 Rob Bensinger 2mo
I don't think warning shots are random, but if they have a large impact it may be in unexpected directions, perhaps for butterfly-effect-y reasons. I'm not defining warning shots that way; I'd be much more surprised to see an event like that happen (because it's more conjunctive), and I'd be much more confident that a warning shot like that won't shift us from a very-bad trajectory to an OK one (because I'd expect an event like that to come very shortly before AGI destroys or saves the world, if an event like that happened at all). When I say 'warning shot' I just mean 'an event where AI is perceived to have caused a very large amount of destruction'. The warning shots I'd expect to do the most good are ones like: '20 or 30 or 40 years before we'd naturally reach AGI, a huge narrow-AI disaster unrelated to AGI risk occurs. This disaster is purely accidental (not terrorism or whatever). Its effect is mainly just to cause it to be in the Overton window that a wider variety of serious technical people can talk about scary AI outcomes at all, and maybe it slows timelines by five years or whatever. Also, somehow none of this causes discourse to become even dumber; e.g., people don't start dismissing AGI risk because "the real risk is narrow AI symptoms like the one we just saw", and there isn't a big ML backlash to regulatory/safety efforts, and so on.' I don't expect anything at all like that to happen, not least because I suspect we may not have 20+ years left before AGI. But that's a scenario where I could imagine real, modest improvements. Maybe. Optimistically.