Yonatan Cale



Any ideas on how much to read this as "Sam's actual opinions" vs "Sam trying to say things that will satisfy the maximum number of people"?

(do we have priors on his writings? do we have information about him absolutely not meaning one or more of the things here?)

Hey Kaj :)

The part hiding complexity here, it seems to me, is: how exactly do you take a simulation/prediction of a person and get from it the preferences of that person?

For example, would you simulate a negotiation with the human and see how the negotiation turns out? Would you simulate asking the human and then do whatever the human answers? (There were a few suggestions in the post; I don't know if you endorse a specific one, or whether you even think this question is important.)

Because (I assume) once OpenAI[1] says "trust our models", that's the point at which it would be useful to publish our breaks.

Breaks that hadn't been published yet, so that OpenAI couldn't have patched them yet.

[unconfident; I can see counterarguments too]

  1. ^

    Or maybe when regulators, experts, or public opinion say "this model is trustworthy, don't worry"

I'm confused: Wouldn't we prefer to keep such findings private? (At least until OpenAI says something like "this model is reliable/safe"?)


My guess: You'd reply that finding good talent is worth it?

This seems like great advice, thanks!

I'd be interested in an example of what "a believable story in which this project reduces AI x-risk" looks like, if Dane (or someone else) would like to share.

A link directly to the corrigibility part (skipping unrelated things on the same page):


This post got me to do something like exposure therapy on myself in 10+ situations where it felt like the "obvious" thing to do. That's a huge amount of life-change-per-post.

My thoughts:

[Epistemic status + impostor syndrome: just learning; posting my ideas to hear how they're wrong, and in the hope of interacting with others in the community. Don't learn from my ideas]


Victoria: “I don't think that the internet has a lot of particularly effective plans to disempower humanity.”

I think:

  1. Having ready-made plans on the internet and using them is not part of the normal threat model from an AGI. If that were the problem, we could just filter those plans out of the training set.
  2. (The internet does have such ideas. I will briefly mention biosecurity, but I prefer not to spread ideas on how to disempower humanity)
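To make the "just filter them out" point in bucket 1 concrete, here is a minimal sketch of what such a filter could look like. Everything in it is a hypothetical illustration: the function name, the blocklist phrases, and the documents are placeholders, not a real dataset-curation pipeline (real ones would use classifiers, not substring matching).

```python
# Hypothetical sketch: drop training documents that match a blocklist of
# risky phrases. All names and data here are illustrative placeholders.

def filter_training_set(documents, blocked_phrases):
    """Keep only documents containing none of the blocked phrases
    (case-insensitive substring match)."""
    return [
        doc for doc in documents
        if not any(phrase in doc.lower() for phrase in blocked_phrases)
    ]

docs = [
    "a recipe for sourdough bread",
    "step-by-step plan to disempower humanity",
]
blocked = ["disempower humanity"]

print(filter_training_set(docs, blocked))  # only the bread recipe survives
```

This is only meant to show that removing known, explicitly written-down plans is cheap; it says nothing about attacks the model derives on its own, which is the point of the buckets below.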



[Victoria:] I think coming up with a plan that gets past the defenses of human society requires thinking differently from humans.

TL;DR: I think some ways to disempower humanity don't require thinking differently from humans.

I'll split an AI's attack vectors into three buckets:

  1. Attacks that humans didn't even think of (such as what we can do to apes)
  2. Attacks that humans did think of but are not defending against (for example, we thought about pandemic risks but didn't defend against them very well). Note this does not require thinking of things that humans didn't think of.
  3. Attacks that humans are actively defending against, such as using robots with guns, trading on the stock market, or playing Go (winning at Go probably won't help with taking over the world, but humans are actively working on winning Go games, so I put the example here). Having an AI beat us in one of these does require it to be, in some sense that is important to me, smarter than us; but not all attacks are in this bucket.



[...] requires thinking differently from humans

I think AIs today already think differently from humans in any reasonable sense of the phrase. In fact, if we could make them NOT think differently from humans, my [untrustworthy] opinion is that this would be non-negligible progress toward solving alignment. No?



The intelligence threshold for planning to take over the world isn't low

First, disclaimers: 

(1) I'm not an expert and this hasn't been widely reviewed. (2) I'm intentionally omitting details so as not to spread ideas on how to take over the world; I'm aware this is bad epistemics and I'm sorry for it, it's the tradeoff I'm picking.

So, mainly based on A, I think a person who is 90% as intelligent as Elon Musk in all dimensions would probably be able to destroy humanity, and so (if I'm right) the intelligence threshold is lower than "the world's smartest human". Again, sorry for the lack of detail. [mods, if this was already too much, feel free to edit/delete my comment]

"Doing a Turing test" is a solution to something. What's the problem you're trying to solve?

As a judge, I'd ask the test subject to write me a rap song about Turing tests. If it succeeds, I'll guess it's ChatGPT ;P


More seriously - it would be nice to find a judge who doesn't know the capabilities and limitations of GPT models. Knowing those is very, very useful.
