Hey Kaj :)
The part hiding the complexity here seems to me to be: how exactly do you take a-simulation/prediction-of-a-person and get from it the-preferences-of-the-person?
For example, would you simulate a negotiation with the human and see how that negotiation turns out? Would you simulate asking the human and then do whatever the human answers? (There were a few suggestions in the post; I don't know if you endorse a specific one, or if you even think this question is important.)
Because (I assume) once OpenAI[1] says "trust our models", that's the point at which it would be useful to publish our breaks.
Breaks that haven't been published yet, so that OpenAI couldn't have patched them yet.
[unconfident; I can see counterarguments too]
Or maybe when regulators, experts, or public opinion say "this model is trustworthy, don't worry".
I'm confused: Wouldn't we prefer to keep such findings private? (At least until OpenAI says something like "this model is reliable/safe"?)
My guess: You'd reply that finding good talent is worth it?
This seems like great advice, thanks!
I'd be interested in an example of what "a believable story in which this project reduces AI x-risk" looks like, if Dane (or someone else) would like to share one.
A link directly to the corrigibility part (skipping unrelated things that are on the same page):
This post got me to do something like exposure therapy on myself in 10+ situations where that felt like the "obvious" thing to do. That's a huge amount of life-change-per-post.
My thoughts:
[Epistemic status + impostor syndrome: Just learning, posting my ideas to hear how they are wrong and in the hope of interacting with others in the community. Don't learn from my ideas.]
A)
Victoria: “I don't think that the internet has a lot of particularly effective plans to disempower humanity.”
I think:
B)
[Victoria:] I think coming up with a plan that gets past the defenses of human society requires thinking differently from humans.
TL;DR: I think some ways to disempower humanity don't require thinking differently from humans.
I'll split an AI's attack vectors into 3 buckets:
C)
[...] requires thinking differently from humans
I think AIs today already think differently from humans under any reasonable interpretation of that phrase. In fact, if we could make them NOT think differently from humans, my [untrustworthy] opinion is that this would be non-negligible progress towards solving alignment. No?
D)
The intelligence threshold for planning to take over the world isn't low
First, disclaimers:
(1) I'm not an expert and this isn't widely reviewed. (2) I'm intentionally leaving out detail in order not to spread ideas on how to take over the world; I'm aware this is bad epistemics and I'm sorry for it, but it's the tradeoff I'm picking.
So, mainly based on A, I think a person who is 90% as intelligent as Elon Musk in all dimensions would probably be able to destroy humanity, and so (if I'm right) the intelligence threshold is lower than "the world's smartest human". Again, sorry for the lack of detail. [Mods, if this was already too much, feel free to edit/delete my comment.]
"Doing a Turing test" is a solution to something. What's the problem you're trying to solve?
As a judge, I'd ask the test subject to write me a rap song about Turing tests. If it succeeds, I'd guess it's ChatGPT ;P
More seriously - it would be nice to find a judge who doesn't know the capabilities and limitations of GPT models. Knowing those is very, very useful.
Any ideas on how much to read this as "Sam's actual opinions" vs "Sam trying to say things that will satisfy the maximum number of people"?
(Do we have priors from his past writings? Do we have information showing he definitely doesn't mean one or more of the things here?)