All of David Scott Krueger (formerly: capybaralet)'s Comments + Replies

I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, EAAMO.

This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.

Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?

I think that janus isn't claiming this and I also think it isn't true. I think it's all about capacity constraints. The claim as I understand it is that there are some intermediate computations that are optimized both for predicting the next token and for predicting the 20th token and that therefore have to prioritize between these different predictions.

3David Johnston7mo
I can't speak for janus, but my interpretation was that this is due to a capacity budget meaning it can be favourable to lose a bit of accuracy on token n if you gain more on n+m. I agree some examples would be great.
3Fabien Roger7mo
Probes initialized like Collin does in the zip file: a random spherical initialization (take a random Gaussian, and normalize it).
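
For readers who want to see what "take a random Gaussian, and normalize it" amounts to, here is a minimal sketch. It is not the code from the zip file (which isn't reproduced here); the function name `random_unit_vector` and the dimension used in the example are hypothetical. Normalizing an isotropic Gaussian sample gives a direction that is uniformly distributed on the unit sphere.

```python
import math
import random


def random_unit_vector(dim, rng=None):
    """Sample a probe direction uniformly from the unit sphere in `dim` dimensions.

    Hypothetical helper for illustration only: draw i.i.d. standard Gaussian
    coordinates, then divide by the Euclidean norm.
    """
    rng = rng or random.Random()
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:  # an all-zero draw is astronomically unlikely; resample if it happens
            return [x / norm for x in v]


# Example: a 16-dimensional probe direction (16 is an arbitrary choice).
probe = random_unit_vector(16, random.Random(0))
```

In a real setup the dimension would match the hidden size of the activations being probed.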

I skimmed this.  A few quick comments:
- I think you characterized deceptive alignment pretty well.  
- I think it only covers a narrow part of how deceptive behavior can arise. 
- CICERO likely already did some of what you describe.

So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let's spread our probabilities in such a way that we meet the following three conditions. Firstly, we don't expect Sia's desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia's desires are satisfied at  is equal to our expectat

1Nora Belrose2mo
Agreed, I think this criterion is very strongly violated in practice; a Gaussian process prior with a nontrivial covariance function would be a bit more realistic.

I think it might be more effective in future debates at the outset to: 
* Explain that it's only necessary to cross a low bar (e.g. see my Tweet below).  -- This is a common practice in debates.
* Outline the responses they expect to hear from the other side, and explain why they are bogus.  Framing: "Whether AI is an x-risk has been debated in the ML community for 10 years, and nobody has provided any compelling counterarguments that refute the 3 claims (of the Tweet).  You will hear a bunch of counter arguments from the other side, but


Organizations that are looking for ML talent (e.g. to mentor more junior people, or get feedback on policy) should offer PhD students high-paying contractor/part-time work.

ML PhD students working on safety-relevant projects should be able to augment their meager stipends this way.

That is in addition to all the people who will give their AutoGPT an instruction that means well but actually translates to killing all the humans or at least take control over the future, since that is so obviously the easiest way to accomplish the thing, such as ‘bring about world peace and end world hunger’ (link goes to Sully hyping AutoGPT, saying ‘you give it a goal like end world hunger’) or ‘stop climate change’ or ‘deliver my coffee every morning at 8am sharp no matter what as reliably as possible.’ Or literally almost anything else.

I think these ...

I haven't done the relevant tests with GPT4 (which I currently lack access to), but I would think the relevant tests are: Give descriptions such as  If GPT4 says "yes" (with non-negligible probability) then GPT4 has the capacity to misunderstand directives in the relevant way. The point being:

  • My prompt doesn't do anything to dissuade the literal interpretation which would be catastrophic (EG I don't say "Did the AI assistant satisfy the spirit of Tom's request?" instead I just say "Did the AI assistant satisfy Tom's request?"). This represents humans making the literal requests with no intentional safeguards to prevent misinterpretation.
  • My prompt asks GPT4 itself to evaluate whether the request has been satisfied. This is distinct from getting AutoGPT to spontaneously generate the plan itself. Rather, it represents AutoGPT evaluating plans which AutoGPT might generate. So the question I'm trying to answer with this suggested test is whether future versions of AutoGPT might follow through with such a plan, if they were creative enough to suggest it amongst a batch of brainstormed plans.

Testing gpt3 four times, I get the following results (full disclosure: I did not decide on a stopping rule before beginning trials). The results don't fall cleanly into yes/no, but I would categorize two of four as technically "yes". However, it's unclear to me whether this kind of technically-yes poses a risk in the context of a larger AutoGPT-like architecture.

1:  2:  3: 4: 

One must notice that in order to predict the next token as well as possible the LMM will benefit from being able to simulate every situation, every person, and every causal element behind the creation of every bit of text in its training distribution, no matter what we then train the LMM to output to us (what mask we put on it) afterwards.

Is there any rigorous justification for this claim?  As far as I can tell, this is folk wisdom from the scaling/AI safety community, and I think it's far from obvious that it's correct, or what assumptions are required for it to hold.  

It seems much more plausible in the infinite limit than in practice.

In the context of his argument I think the claim is reasonable, since I interpreted it as the claim that, since it can be used as a tool that designs plans, it has already overcome the biggest challenge of being an agent.  But if we take that claim out of context and interpret it literally, then I agree that it's not a justified statement per se. It may be able to simulate a plausible causal explanation, but I think that is very different from actually knowing it. As long as you only have access to partial information, there are theoretical limits to what you can know about the world. But it's hard to think of contexts where that gap would matter a lot.

I have gained confidence in my position that all of this happening now is a good thing, both from the perspective of smaller risks like malware attacks, and from the perspective of potential existential threats. Seems worth going over the logic.

What we want to do is avoid what one might call an agent overhang.

One might hope to execute our Plan A of having our AIs not be agents. Alas, even if technically feasible (which is not at all clear) that only can work if we don’t intentionally turn them into agents via wrapping code around them. We’ve checked with a

Your concern is certainly valid - blindly assuming taking action to be beneficial misses the mark. It's often far better to refrain from embracing disruptive technologies simply to appear progressive. Thinking of ways to ensure people will not promote AI for the sole sake of causing agent overhang is indeed crucial for reducing potential existential threats. Fearlessly rejecting risky technologies is often better than blindly accepting them. With that mindset, encouraging users to explore AutoGPT and other agent-based systems is potentially problematic. Instead, focusing on developing strategies for limiting the potentially dangerous aspects of such creations should take center stage.

Christiano and Yudkowsky both agree AI is an x-risk -- a prediction that would distinguish their models does not do much to help us resolve whether or not AI is an x-risk.

I agree with what you wrote, but I am not sure I understand what you meant to imply by it. My guess at the interpretation is: (1) 1a3orn's comment cites the Yudkowsky-Christiano discussions as evidence that there has been effort to "find testable double-cruxes on whether AI is a risk or not", and that effort mostly failed, therefore he claims that attempting to find a "testable MIRI-OpenAI double crux" is also mostly futile. (2) However, because Christiano and Yudkowsky agree on x-risk, the inference in 1a3orn's comment is flawed. Do I understand that correctly? (If so, I definitely agree.)

I'm not necessarily saying people are subconsciously trying to create a moat.  

I'm saying they are acting in a way that creates a moat, and that enables them to avoid competition, and that more competition would create more motivation for them to write things up for academic audiences (or even just write more clearly for non-academic audiences).

2Daniel Kokotajlo1y
It sure sounds like you are saying that though! Before you put in the EtA, it sure sounded like you were saying that people were subconsciously motivated to avoid academic publishing because it helped them build and preserve a moat. Now, after the EtA, it still sounds like that but is a bit more unclear since 'indirect' is a bit more ambiguous than 'subconscious.'

My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere where your work could fit.

Speaking for myself…

I think I do a lot of “engaging with neuroscientists” despite not publishing peer-reviewed neuroscience papers:

  • I write lots of blog posts intended to be read by neuroscientists, i.e. I will attempt to engage with background assumptions that neuroscientists are likely to have, not assume non-neuroscience background knowledge or jargon, etc.
    • [To be clear, I also write even more blog posts that are not in that category.]
  • When one of my blog posts specifically discusses some neuroscientist’s work, I’ll sometimes cold-email them and ask for pr

My point (see footnote) is that motivations are complex.  I do not believe "the real motivations" is a very useful concept here.  

The question becomes why "don't they judge those costs to be worth it"?  Is there motivated reasoning involved?  Almost certainly yes; there always is.

2Daniel Kokotajlo1y
Here are two hypotheses for why they don't judge those costs to be worth it, each one of which is much more plausible to me than the one you proposed: (1) The costs aren't in fact worth it & they've reacted appropriately to the evidence. (2) The costs are worth it, but thanks to motivated reasoning, they exaggerate the costs, because writing things up in academic style and then dealing with the publication process is boring and frustrating. Seriously, isn't (2) a much better hypothesis than the one you put forth about moats?
  1. A lot of work just isn't made publicly available
  2. When it is, it's often in the form of ~100 page google docs
  3. Academics have a number of good reasons to ignore things that don't meet academic standards of rigor and presentation
3David Scott Krueger (formerly: capybaralet)1y
In my experience people also often know their blog posts aren't very good.
Which one? All of them seem to be working for me.

Yeah this was super unclear to me; I think it's worth updating the OP.

FYI: my understanding is that "data poisoning" refers to deliberately manipulating the training data of somebody else's model, which I understand is not what you are describing.

Sure - let's say this is more like a poorly-labelled bottle of detergent that the model is ingesting under the impression that it's cordial. A Tide Pod Challenge of unintended behaviours. Was just calling it "poisoning" as shorthand since the end result is the same, it's kind of an accidental poisoning.

Oh I see.  I was getting at the "it's not aligned" bit.

Basically, it seems like if I become a cyborg without understanding what I'm doing, the result is either:

  • I'm in control
  • The machine part is in control
  • Something in the middle

Only the first one seems likely to be sufficiently aligned. 

I think "sufficiently" is doing a lot of work here. For example, are we talking about >99% chance that it kills <1% of humanity, or >50% chance that it kills <50% of humanity?

I also don't think "something in the middle" is the right characterization; I think "something else" is more accurate. I think that the failure you're pointing at will look less like a power struggle or akrasia and more like an emergent goal structure that wasn't really present in either part.

I also think that "cyborg alignment" is in many ways a much more tractable problem than "AI alignment" (and in some ways even less tractable, because of pesky human psychology):

  • It's a much more gradual problem; a misaligned cyborg (with no agentic AI components) is not directly capable of FOOM (Amdahl's law was mentioned elsewhere in the comments as a limit on usefulness of cyborgism, but it's also a limit on damage)
  • It has been studied longer and has existed longer; all technologies have influenced human thought

It also may be an important paradigm to study (even if we don't actively create tools for it) because it's already happening.

I don't understand the fuss about this; I suspect these phenomena are due to uninteresting, and perhaps even well-understood effects.  A colleague of mine had this to say:

...

Indeed.  I think having a clean, well-understood interface for human/AI interaction seems useful here.  I recognize this is a big ask in the current norms and rules around AI development and deployment.

I don't understand what you're getting at RE "personal level".

Like, I may not want to become a cyborg if I stop being me, but that's a separate concern from whether it's bad for alignment (if the resulting cyborg is still aligned).

I think the most fundamental objection to becoming cyborgs is that we don't know how to say whether a person retains control over the cyborg they become a part of.

Historically I have felt most completely myself when I was intertwining my thoughts with those of an AI. And the most I've ever had access to is AI Dungeon, not GPT-3 itself. I feel more myself with it, not less - as if it's opening up parts of my own mind I didn't know were there before. But that's me.

I agree that this is important. Are you more concerned about cyborgs than other human-in-the-loop systems? To me the whole point is figuring out how to make systems where the human remains fully in control (unlike, e.g. delegating to agents), and so answering this "how to say whether a person retains control" question seems critical to doing that successfully.

I think that's an important objection, but I see it applying almost entirely on a personal level. On the strategic level, I actually buy that this kind of augmentation (i.e. with in some sense passive AI) is not an alignment risk (any more than any technology is). My worry is the "dual use technology" section.

FWIW, I didn't mean to kick off a historical debate, which seems like probably not a very valuable use of y'all's time.

Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die.  Even existential risk has this potential, actually, but I think it's a safer bet.

I don't think we should try and come up with a special term for (1).
The best term might be "AI engineering".  The only thing it needs to be distinguished from is "AI science".

I think ML people overwhelmingly identify as doing one of those 2 things, and find it annoying and ridiculous when people in this community act like we are the only ones who care about building systems that work as intended.

I say it is a rebrand of the "AI (x-)safety" community.
When AI alignment came along we were calling it AI safety, even though it was really basically AI existential safety all along that everyone in the community meant.  "AI safety" was (IMO) a somewhat successful bid for more mainstream acceptance, that then led to dilution and confusion, necessitating a new term.

I don't think the history is that important; what's important is having good terminology going forward.
This is also why I stress that I work on AI existential safety.

So I think people shou...

Hmm... this is a good point.

I think structural risk is often a better description of reality, but I can see a rhetorical argument against framing things that way.  One problem I see with doing that is that I think it leads people to think the solution is just for AI developers to be more careful, rather than observing that there will be structural incentives (etc.) pushing for less caution.

I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”.

1) I don't think this dichotomy is as solid as it seems once you start poking at it... e.g. in your war example, it would be odd to say that the designers of the AGI systems that wiped out humans intended for that outcome to occur.  Intentions are perhaps best thought of as incomplete specifications.  

2) From our current position, I think “never ever create...

2Steven Byrnes1y
1) Oh, sorry, what I meant was, the generals in Country A want their AGI to help them “win the war”, even if it involves killing people in Country B + innocent bystanders. And vice-versa for Country B. And then, between the efforts of both AGIs, the humans are all dead. But nothing here was either an “AGI accident unintended-by-the-designers behavior” nor “AGI misuse” by my definitions. But anyway, yes I can imagine situations where it’s unclear whether “the AGI does things specifically intended by its designers”. That’s why I said “pretty solid” and not “rock solid” :) I think we probably disagree about whether these situations are the main thing we should be talking about, versus edge-cases we can put aside most of the time. From my perspective, they’re edge-cases. For example, the scenarios where a power-seeking AGI kills everyone are clearly on the “unintended” side of the (imperfect) dichotomy. But I guess it’s fine that other people are focused on different problems from me, and that “intent-alignment is poorly defined” may be a more central consideration for them. ¯\_(ツ)_/¯ 3) I like your “note on terminology post”. But I also think of myself as subscribing to “the conventional framing of AI alignment”. I’m kinda confused that you see the former as counter to the latter. If you’re working on that, then I wish you luck! It does seem maybe feasible to buy some time. It doesn’t seem feasible to put off AGI forever. (But I’m not an expert.) It seems you agree. * Obviously the manual will not be written by one person, and obviously some parts of the manual will not be written until the endgame, where we know more about AGI than we do today. But we can still try to make as much progress on the manual as we can, right? * The post you linked says “alignment is not enough”, which I see as obviously true, but that post doesn’t say “alignment is not necessary”. So, we still need that manual, right? * Delaying AGI forever would obviate the need for a manual, but i

While defining accident as “incident that was not specifically intended & desired by the people who pressed ‘run’ on the AGI code” is extremely broad, it still supposes that there is such a thing as "the AGI code", which significantly restricts the space of possible risks.

There are other reasons I would not be happy with that browser extension.  There is not one specific conversation I can point to; it comes up regularly.  I think this replacement would probably lead to a lot of confusion, since I think when people use the word "accide...

6Steven Byrnes1y
Thanks for your reply! It continues to feel very bizarre to me to interpret the word “accident” as strongly implying “nobody was being negligent, nobody is to blame, nobody could have possibly seen it coming, etc.”. But I don’t want to deny your lived experience. I guess you interpret the word “accident” as having those connotations, and I figure that if you do, there are probably other people who do too. Maybe it’s a regional dialect thing, or different fields use the term in different ways, who knows. So anyway, going forward, I will endeavor to keep that possibility in mind and maybe put in some extra words of clarification where possible to head off misunderstandings. :) I agree with this point. I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”. I want to use the word “accident” universally for all bad outcomes downstream of (B), regardless of how grossly negligent and reckless people were etc., whereas you don’t want to use the word “accident”. OK, let’s put that aside. I think (?) that we both agree that bad outcomes downstream of (A) are not necessarily related to “misuse” / “bad actors”. E.g., if there’s a war with AGIs on both sides, and humans are wiped out in the crossfire, I don’t necessarily want to say that either side necessarily was “bad actors”, or that either side’s behavior constitutes “misuse”. So yeah, I agree that “accident vs misuse” is not a good dichotomy for AGI x-risk. Thanks, that’s interesting. I didn’t intend my chart to imply that “everyone follows the manual” doesn’t also require avoiding coordination problems and avoiding bad decisions etc. Obviously it does—or at least, that was obvious to me. Anyway, your feedback is noted. :) I agree that “never ever create AGI” is an option in principle. (It doesn’t strike me as a feasible option in practice; does it to you? I know this is off-top

I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper).  It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper.  So then you would never have $C(\pi) \gg C(U)$.  What am I missing/misunderstanding?

2Vanessa Kosoy1y
For the contrived reward function you suggested, we would never have C(π)≫C(U). But for other reward functions, it is possible that C(π)≫C(U). Which is exactly why this framework rejects the contrived reward function in favor of those other reward functions. And also why this framework considers some policies unintelligent (despite the availability of the contrived reward function) and other policies intelligent.

By "intend" do you mean that they sought that outcome / selected for it?  
Or merely that it was a known or predictable outcome of their behavior?

I think "unintentional" would already probably be a better term in most cases. 

Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to...

We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda ( the agent is rewarded only if it has so far done exactly what the policy would do.  I think of this as a wrapper function (

It seems like this means that, for any policy, we can represent it as optimizing re...
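
To make the wrapper idea concrete, here is a minimal sketch of the kind of construction described in this comment: given a deterministic policy, build a reward function that pays out only if every action taken so far is exactly what the policy would have chosen. This is an illustration under stated assumptions, not the construction from the cited agenda itself; the names `reward_from_policy` and `copy_policy`, and the representation of a history as a list of (observation, action) pairs, are hypothetical.

```python
from typing import Callable, List, Tuple

Observation = str
Action = str
Step = Tuple[Observation, Action]
# Assumed interface: a deterministic policy maps (past steps, current observation) to an action.
Policy = Callable[[List[Step], Observation], Action]


def reward_from_policy(policy: Policy) -> Callable[[List[Step]], float]:
    """Wrap a deterministic policy into a reward function over histories."""

    def reward(history: List[Step]) -> float:
        # Replay the history and check each action against the policy's choice.
        for t, (obs, action) in enumerate(history):
            if policy(history[:t], obs) != action:
                return 0.0  # deviated from the policy at step t
        return 1.0  # agreed with the policy at every step so far

    return reward


def copy_policy(past: List[Step], obs: Observation) -> Action:
    """Toy policy (hypothetical): always repeat the current observation."""
    return obs


r = reward_from_policy(copy_policy)
on_policy = r([("a", "a"), ("b", "b")])   # 1.0: every action matches the policy
off_policy = r([("a", "a"), ("b", "x")])  # 0.0: the second action deviates
```

Note that the reward function here is just the policy plus a small, fixed amount of wrapper code, which is the sense in which the overhead is "minimal".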

2Vanessa Kosoy1y
My framework discards such contrived reward functions because it penalizes for the complexity of the reward function. In the construction you describe, we have C(U)≈C(π). This corresponds to g≈0 (no/low intelligence). On the other hand, policies with g≫0 (high intelligence) have the property that C(π)≫C(U) for the U which "justifies" this g. In other words, your "minimal" overhead is very large from my point of view: to be acceptable, the "overhead" should be substantially negative.

"Concrete Problems in AI Safety" used this distinction to make this point, and I think it was likely a useful simplification in that context.  I generally think spelling it out is better, and I think people will pattern match your concerns onto the “the sci-fi scenario where AI spontaneously becomes conscious, goes rogue, and pursues its own goal” or "boring old robustness problems" if you don't invoke structural risk.  I think structural risk plays a crucial role in the arguments, and even if you think things that look more like pure accidents a...

I agree somewhat; however, I think we need to be careful to distinguish "do unsavory things" from "cause human extinction", and should generally be squarely focused on the latter.  The former easily becomes too political, making coordination harder.

Yes it may be useful in some very limited contexts.  I can't recall a time I have seen it in writing and felt like it was not a counter-productive framing.

AI is highly non-analogous with guns.

Yes, especially for consequentialist AIs that don't behave like tool AIs. 
It's also more descriptive of cause than effect, so probably not what you want.  I'm still not sure what you DO want, though.  The post itself is pretty unclear what you're trying to convey with this missing word or phrase - you object to one, but don't propose others, and don't describe precisely what you wish you had a word for.   Anytime you use a short common word or phrase in a non-shared-jargon context, you have to accept that it's not going to mean quite what you want.  The solution is to either use more words to be more precise, or to pick the aspect you want to highlight and accept that others will be lost.

I really don't think the distinction is meaningful or useful in almost any situation.  I think if people want to make something like this distinction they should just be more clear about exactly what they are talking about.

A natural misconception lots of normies have is that the primary risks from AI come from bad actors using it explicitly to do evil things, rather than bad actors being unable to align AIs at all and that causing clippy to run wild. I would like to distinguish between these two scenarios and accident vs. misuse risk is an obvious way to do that.

How about the distinction between (A) “An AGI kills every human, and the people who turned on the AGI didn’t want that to happen” versus (B) “An AGI kills every human, and the people who turned on the AGI did want that to happen”?

I’m guessing that you’re going to say “That’s not a useful distinction because (B) is stupid. Obviously nobody is talking about (B)”. In which case, my response is “The things that are obvious to you and me are not necessarily obvious to people who are new to thinking carefully about AGI x-risk.”

…And in particular, normal people s...

This is a great post.  Thanks for writing it!  I think Figure 1 is quite compelling and thought provoking.
I began writing a response, and then realized a lot of what I wanted to say has already been said by others, so I just noted where that was the case.  I'll focus on points of disagreement.

Summary: I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.

A high-level counter-argument I didn't see others making: 

  • I wasn't entirely sure what was your argument that long-term planning ability s

This is a great post.  Thanks for writing it!

I agree with a lot of the counter-arguments others have mentioned.


  • I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.

  • High-level counter-arguments already argued by Vanessa: 
    • This argument proves too much: it suggests that there are not major differences in ability to do long-term planning that matter.
    • Humans have not reached the limits of predictive ability


  • You often only need to be one step ahead of your adversary to defeat them.
  • Predi

This post tacitly endorses the "accident vs. misuse" dichotomy.
Every time this appears, I feel compelled to mention that I think it is a terrible framing.
I believe the large majority of AI x-risk is best understood as "structural" in nature:


2Steven Byrnes1y
If you think some more specific aspect of this post is importantly wrong for reasons that are downstream of that, I’d be curious to hear more details. In this post, I’m discussing a scenario where one AGI gets out of control and kills everyone. If the people who programmed and turned on that AGI were not omnicidal maniacs who wanted to wipe out humanity, then I call that an “accident”. If they were omnicidal maniacs then I call that a “bad actor” problem. I think that omnicidal maniacs are very rare in the human world, and therefore this scenario that I’m talking about is an “accident” scenario. From reading the post you linked, my best guess is that (1) You’re not thinking about this scenario in the first place, (2) If you are, then you would say something like “When we use the word ‘accident’, it suggests ‘avoidable accident caused by stupid or reckless people’, but maybe the real problem was a race-to-the-bottom on safety, which need not be avoidable and need not involve any stupid or reckless people.” If it’s (1), this whole long post is about why that scenario seems very difficult to avoid, from my current perspective. (If your argument is “that scenario won’t happen because something else will kill us first”, then I happen to disagree, but that’s off-topic for this post.) If it’s (2), well I don’t see why the word “accident” has to have that connotation. It doesn’t have that connotation to me. I think it’s entirely possible for people who are neither stupid nor reckless to cause x-risk by accident. A lot of this post is about structural factors, seems to me, and Section 3.4 in particular seems to be an argument which is structural in nature, where I note that by default, more and more people are going to train more and more powerful AGIs, and thus somebody is going to make one that is motivated to cause mass destruction sooner or later—even if there isn’t a race-to-the-bottom on safety / alignment, and certainly if there is. That’s a structural argu

I understand your point of view and think it is reasonable.

However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other.  I see the argument you are making, but I think success on these asks is likely highly correlated via the underlying causal factor of humanity being concerned enough about AI x-risk and coordinated enough to ensure responsible AI development.

I also think the training procedure matters a lot (and you seem to be suggesting otherwise?), since if you don't do RL...

(A very quick response):

Agree with (1) and (2).  
I am ambivalent RE (3) and the replaceability arguments.
RE (4): I largely agree, but I think the norm should be "let's try to do less ambitious stuff properly" rather than "let's try to do the most ambitious stuff we can, and then try and figure out how to do it as safely as possible as a secondary objective".

In the current climate, I think playing up the neglectedness and "directly working on x-risks" is somewhat likely to be counterproductive, especially if not done carefully. Some reasons:

1) It fosters an "us-vs-them" mindset.  
2) It fails to acknowledge that these researchers don't know what the most effective ways are to reduce x-risk, and there is not much consensus (and that which does exist is likely partially due to insular community epistemics).
3) It discounts the many researchers doing work that is technically indistinguishable from the work by research...

You don't need to be advocating a specific course of action.  There are smart people who could be doing things to reduce AI x-risk and aren't (yet) because they haven't heard (enough) about the problem.


it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe

I would say "it may be better, and people should seriously consider this" not "it is better".
