Yes. Valid. The question is how to avoid reducing it to a toy problem, or to such narrow assumptions (in order to achieve a proof), that Mr. CEO can dismiss it.
When I revise, I'm going to work backwards with CEO/Senator dialog in mind.
Agreed. Proof or disproof should win.
"All the way up" meaning at increasing levels of intelligence…your 10,000 becomes 100,000X, etc.
At some level of performance, a moral person faces new temptations because of increased capabilities and greater power for damage, right?
In other words, your simulation may fail to be aligned at 20,000...30,000...
Okay, maybe I'm moving the bar, hopefully not and this thread is helpful...
Your counter-example, your simulation, would prove that examples of aligned systems - at a high level - are possible. Alignment at some level is possible, of course. Functioning thermostats are aligned.
What I'm trying to propose is the search for a proof that a guarantee of alignment - all the way up - is mathematically impossible. We could then make the statement: "If we proceed down this path, no one will ever be able to guarantee that humans remain in control." I'm proposing we see if we can prove that Stuart Russell's "provably beneficial" does not exist.
If a guarantee is proved to be impossible, I am contending that the public conversation changes.
Maybe many people - especially on LessWrong - take this fact as a given. Their internal belief is close enough to a proof...that there is not a guarantee all the way up.
I think a proof that there is no guarantee would be important news for the wider world...the world that has to move if there is to be regulation.
Great question. I think the answer must be "yes." The alignment-possible provers must get the prize, too.
And, that would be fantastic. Proving a thing is possible accelerates development. (US uses atomic bomb. Russia has it 4 years later.) Okay, it would be fantastic if the possible proof did not create false security in the short term. It's important when alignment gets solved. A peer-reviewed paper can't get the coffee. (That thought is an aside and not enough to kill the value of the prize, IMHO. If we prove it is possible, that must accelerate alignment work and inform it.)
Getting definitions and criteria right will be harder than raising the $10 million. And important. And contribute to current efforts.
Making it agnostic to possible/impossible would also have the benefit of removing political/commercial antibodies to the exercise, I think.
I envision the org that offers the prize, after broad expert input, would set the definitions and criteria.
Yes, surely the definition/criteria exercise would be a hard thing...but hopefully valuable.
Yes, surely the proof would be very difficult or impossible. However, the nagging worry that a guarantee is impossible is widespread enough to justify the effort to see if we can prove that it is impossible...and update.
But, if the effort required for a proof is - I don't know - 120 person-months - let's please, Humanity, not walk right past that one into the blades.
I am not advocating that we divert dozens of people from promising alignment work.
Even if it failed, I would hope the prove-impossibility effort would throw off beneficial by-products like:
I thought there was a 60%+ chance I would get a quick education on the people who are trying or who have tried to prove impossibility.
But, I also thought, perhaps this is one of those Nate Soares blind spots...maybe caused by the fact that those who understand the issues are the types who want to fix.
Has it gotten the attention it needs?
Like dr_s stated, I'm contending that proof would be qualitatively different from "very hard" and powerful ammunition for advocating a pause...
Senator X: “Mr. CEO, your company continues to push the envelope and yet we now have proof that neither you nor anyone else will ever be able to guarantee that humans remain in control. You talk about safety and call for regulation but we seem to now have the answer. Human control will ultimately end. I repeat my question: Are you consciously working to replace humanity? Do you have children, sir?”
AI expert to Xi Jinping: “General Secretary, what this means is that we will not control it. It will control us. In the end, Party leadership will cede to artificial agents. They may or may not adhere to communist principles. They may or may not believe in the primacy of China. Population advantage will become nothing because artificial minds can be copied 10 billion times. Our own unification of mind, purpose, and action will pale in comparison. Our chief advantages of unity and population will no longer exist.”
AI expert to US General: “General, think of this as building an extremely effective infantry soldier who will become CJCS then POTUS in a matter of weeks or months.”
I would love to see a video or transcript of this technique in action in a 1:1 conversation about AI x-risk.
Answer to my own question: https://www.youtube.com/watch?v=0VBowPUluPc