DMs open.
I don't think dealmaking will buy us much safety. This is because I expect that:
That said, I have been thinking about dealmaking because:
There are many questions where verification is no easier than generation, e.g. "Is this chess move best?" is no easier than "What's the best chess move?" Both are EXPTIME-complete (for generalized n×n chess).
Philosophy might have a similar complexity to "What's the best chess move?", i.e. "What argument X is such that for all counterarguments X1 there exists a counter-counterargument X2 such that for all counter-counter-counterarguments X3...", i.e. you explore the game tree of philosophical discourse.
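A minimal way to make the shared structure explicit (my own sketch; Wins_A is an informal placeholder for "player A wins the resulting position"):

\[
\text{``$m$ is a winning move from position $p$''} \;\iff\; \forall m_1\, \exists m_2\, \forall m_3 \cdots\ \mathrm{Wins}_A(p, m, m_1, m_2, m_3, \ldots)
\]

"Find the best move from $p$" just wraps one more quantifier around the same formula: $\exists m\, \forall m_1\, \exists m_2 \cdots$. Checking a given $m$ already forces you to evaluate the whole alternation, which is why verification is no easier than generation here; the analogue for arguments and counterarguments is the game tree of philosophical discourse described above.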
On whether experiments serve to distinguish science from philosophy: TW has a lecture arguing against this, and he addresses it in a number of papers. I'll summarise his arguments later if I have time.
To clarify, I listed some of Williamson's claims, but I haven't summarised any of his arguments.
His actual arguments tend to be 'negative', i.e. they go through the various distinctions that metaphilosophical exceptionalists purport, and for each he argues that either (i) the purported distinction is insubstantial,[1] or (ii) the distinction mischaracterises philosophy or science or both.[2]
He hasn't, I think, addressed Wei Dai's exceptionalism, which is (I gather) something like "Solomonoff induction provides a half-way decent formalism of ideal maths/science, but there isn't a similarly decent formalism of ideal philosophy."
I'll think a bit more about what Williamson might say about Wei Dai's purported distinction. I think Williamson is open to the possibility that philosophy is qualitatively different from science, so it's possible he would change his mind if he engaged with Wei Dai's position.
Also I'm imagining that everyone stays on Earth and has millions of copies in space (via molecular cloning + uploads). Then it seems like people might agree to keep the Earth-copies as traditional humans, and this agreement would affect only one of the millions of Joseph-copies.
Yep, this seems like a plausible bargaining solution, but I might be wrong. If it turns out that mundane values don't mind being neighbours with immortal robots, then you wouldn't need to leave Earth.
Wei Dai thinks that automating philosophy is among the hardest problems in AI safety.[1] If he's right, we might face a period where we have superhuman scientific and technological progress without comparable philosophical progress. This could be dangerous: imagine humanity with the science and technology of 1960 but the philosophy of 1460!
I think the likelihood of philosophy ‘keeping pace’ with science/technology depends on two factors:
I'll consider only the first factor here: How similar are the capabilities required?
Wei Dai is a metaphilosophical exceptionalist. He writes:
We seem to understand the philosophy/epistemology of science much better than that of philosophy (i.e. metaphilosophy), and at least superficially the methods humans use to make progress in them don't look very similar, so it seems suspicious that the same AI-based methods happen to work equally well for science and for philosophy.
I will contrast Wei Dai's position with that of Timothy Williamson, a metaphilosophical anti-exceptionalist.
These are the claims that constitute Williamson's view:
Roughly speaking, metaphilosophical exceptionalism should make one more pessimistic about philosophical progress keeping pace with scientific and technological progress. I lean towards Williamson's position, which makes me less pessimistic about philosophy keeping pace by default.
That said, during a rapid takeoff, even small differences in the pace could lead to a growing gap between philosophical progress and scientific/technological progress. So I consider automating philosophy an important problem to work on.
See AI doing philosophy = AI generating hands? (Jan 2024), Meta Questions about Metaphilosophy (Sep 2023), Morality is Scary (Dec 2021), Problems in AI Alignment that philosophers could potentially contribute to (Aug 2019), On the purposes of decision theory research (Jul 2019), Some Thoughts on Metaphilosophy (Feb 2019), The Argument from Philosophical Difficulty (Feb 2019), Two Neglected Problems in Human-AI Safety (Dec 2018), Metaphilosophical Mysteries (2010)
Different example, I think.
In our ttx (tabletop exercise), the AI was spec-aligned (human future flourishing etc.), but didn't trust that the lab leadership (Trump) was spec-aligned.
I don’t think our ttx was realistic. We started with an optimistic mix of AI values: spec-alignment plus myopic reward hacking.
When we consider arrangements between AIs and humans, we can analyze them along three dimensions:
These dimensions yield 48 distinct configurations[1], many of which map onto familiar arrangements between humans:
Typically, when I talk about 'deals' I am referring to any arrangement with bilateral performance. This includes paid conscription, indentured servitude, and employment. It excludes slavery (where AIs have obligations but humans do not) and gifts (where humans have obligations but AIs do not).
The possible performance obligations are: (1) AIs have obligations, (2) humans have obligations, (3) both humans and AIs have obligations. The possible formation conditions are: (1) AIs can unilaterally form the arrangement, (2) humans can unilaterally form the arrangement, (3) either humans or AIs can unilaterally form the arrangement, (4) both humans and AIs must mutually agree to form the arrangement. The possible termination conditions mirror the formation conditions. This gives 3 × 4 × 4 = 48 configurations.
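For concreteness, here is a small enumeration of the taxonomy above (my own sketch; the option labels are shorthand I've introduced, and I assume the termination options mirror the formation options):

```python
from itertools import product

# Dimension 1: who owes performance under the arrangement.
performance = ["AIs obligated", "humans obligated", "both obligated"]

# Dimension 2: who can bring the arrangement into existence.
formation = ["AIs unilaterally", "humans unilaterally",
             "either unilaterally", "mutual agreement"]

# Dimension 3: who can end the arrangement (assumed to mirror formation).
termination = list(formation)

configurations = list(product(performance, formation, termination))
print(len(configurations))  # 3 * 4 * 4 = 48

# e.g. 'employment' is roughly: both obligated, formed by mutual agreement,
# terminable unilaterally by either side (at-will employment).
```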