To clarify, I listed some of Williamson's claims, but I haven't summarised any of his arguments.
His actual arguments tend to be 'negative', i.e. they go through the various distinctions that metaphilosophical exceptionalists propose, and for each he argues that either (i) the purported distinction is insubstantial,[1] or (ii) the distinction mischaracterises philosophy or science or both.[2]
He hasn't, I think, addressed Wei Dai's exceptionalism, which is (I gather) something like: "Solomonoff induction provides a halfway-decent formalism of ideal maths/science, but there isn't a similarly decent formalism of ideal philosophy."
I'll think a bit more about what Williamson might say about Wei Dai's purported distinction. I think Williamson is open to the possibility that philosophy is qualitatively different from science, so it's possible he would change his mind if he engaged with Wei Dai's position.
Also, I'm imagining that everyone stays on Earth and has millions of copies in space (via molecular cloning + uploads). And then it seems like people might agree to keep the Earth-copies as traditional humans, and this agreement would only affect a billionth of the Joseph-copies.
Yep, this seems like a plausible bargaining solution. But I might be wrong. If it turns out that mundane values don't mind being neighbours with immortal robots then you wouldn't need to leave Earth.
Wei Dai thinks that automating philosophy is among the hardest problems in AI safety.[1] If he's right, we might face a period where we have superhuman scientific and technological progress without comparable philosophical progress. This could be dangerous: imagine humanity with the science and technology of 1960 but the philosophy of 1460!
I think the likelihood of philosophy ‘keeping pace’ with science/technology depends on two factors:
I'll consider only the first factor here: How similar are the capabilities required?
Wei Dai is a metaphilosophical exceptionalist. He writes:
We seem to understand the philosophy/epistemology of science much better than that of philosophy (i.e. metaphilosophy), and at least superficially the methods humans use to make progress in them don't look very similar, so it seems suspicious that the same AI-based methods happen to work equally well for science and for philosophy.
I will contrast Wei Dai's position with that of Timothy Williamson, a metaphilosophical anti-exceptionalist.
These are the claims that constitute Williamson's view:
Roughly speaking, metaphilosophical exceptionalism should make one more pessimistic about philosophical progress keeping pace with scientific and technological progress. I lean towards Williamson's position, which makes me less pessimistic about philosophy keeping pace by default.
That said, during a rapid takeoff, even small differences in the pace could lead to a growing gap between philosophical progress and scientific/technological progress. So I consider automating philosophy an important problem to work on.
See AI doing philosophy = AI generating hands? (Jan 2024), Meta Questions about Metaphilosophy (Sep 2023), Morality is Scary (Dec 2021), Problems in AI Alignment that philosophers could potentially contribute to (Aug 2019), On the purposes of decision theory research (Jul 2019), Some Thoughts on Metaphilosophy (Feb 2019), The Argument from Philosophical Difficulty (Feb 2019), Two Neglected Problems in Human-AI Safety (Dec 2018), Metaphilosophical Mysteries (2010)
Different example, I think.
In our ttx (tabletop exercise), the AI was spec-aligned (human future flourishing, etc.), but didn't trust that the lab leadership (Trump) was spec-aligned.
I don’t think our ttx was realistic. We started with an optimistic mix of AI values: spec-alignment plus myopic reward hacking.
When we consider arrangements between AIs and humans, we can analyze them along three dimensions:
These dimensions yield 48 distinct configurations[1], many of which map onto familiar arrangements between humans:
Typically, when I talk about 'deals' I am referring to any arrangement with bilateral performance. This includes paid conscription, indentured servitude, and employment. It excludes slavery (where AIs have obligations but humans do not) and gifts (where humans have obligations but AIs do not).
The possible performance obligations are: (1) only AIs have obligations, (2) only humans have obligations, (3) both humans and AIs have obligations. The possible formation conditions are: (1) AIs can unilaterally form the arrangement, (2) humans can unilaterally form the arrangement, (3) either humans or AIs can unilaterally form the arrangement, (4) both humans and AIs must mutually agree to form the arrangement. The possible termination conditions mirror the formation conditions. This gives 3 × 4 × 4 = 48 configurations.
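As a quick sanity check on that count, here's a minimal Python sketch; the dimension and option labels are my paraphrases of the list above, not canonical terms:

```python
from itertools import product

# Option labels paraphrase the footnote above; they're illustrative, not canonical.
performance = ["only AIs obligated", "only humans obligated", "both obligated"]
formation = ["AI unilateral", "human unilateral", "either unilateral", "mutual agreement"]
termination = ["AI unilateral", "human unilateral", "either unilateral", "mutual agreement"]

configurations = list(product(performance, formation, termination))
print(len(configurations))  # 3 * 4 * 4 = 48

# e.g. employment is roughly: both obligated, formed by mutual agreement,
# terminable unilaterally by either side (at-will employment).
```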
Human oversight of AIs could occur at different timescales: Slow (days-weeks)[1] and Fast (seconds-minutes)[2].
The community has mostly focused on Slow Human Oversight. This makes sense: it is likely that weak trusted AIs can perform all tasks that humans can perform in minutes.[3] If so, then those AIs can clearly substitute for humans in Fast Oversight.
But perhaps there are cases where Fast Human Oversight is helpful:
For Fast Human Oversight, reading text won't work, because humans process text too slowly. Instead, we will need to leverage the visual cortex — it has higher bandwidth, better parallel processing, better anomaly detection.
Here are some techniques we could use:
Display generated tokens with colors based on probe values (e.g., deception probe). Use 2D color space (hue + saturation) to show two different probes simultaneously.
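Here's a rough sketch of what that colouring could look like, assuming per-token probe scores already normalised to [0, 1]; the probe names and the hue/saturation mapping are illustrative choices, not an existing tool:

```python
import colorsys
import html

def tokens_to_html(tokens, deception_probe, second_probe):
    """Render tokens as HTML spans whose colour encodes two probe readings.

    The first probe drives hue (green = low, red = high); the second drives
    saturation. Both are assumed to be floats already normalised to [0, 1].
    """
    spans = []
    for tok, p1, p2 in zip(tokens, deception_probe, second_probe):
        hue = 0.33 * (1.0 - p1)                      # 0.33 ~ green, 0.0 ~ red
        r, g, b = colorsys.hsv_to_rgb(hue, p2, 1.0)  # saturation from second probe
        colour = f"#{int(r * 255):02x}{int(g * 255):02x}{int(b * 255):02x}"
        spans.append(f'<span style="background:{colour}">{html.escape(tok)}</span>')
    return "".join(spans)

# Toy usage with made-up probe values.
print(tokens_to_html(["I ", "am ", "honest"], [0.1, 0.8, 0.3], [0.9, 0.5, 0.2]))
```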
Collect logs/transcripts/activations from suspicious behavior, run controlled experiments on model internals, understand failure modes, hold stakeholder meetings, negotiate safety measures, decide whether to halt deployment or modify training procedures.
Monitor real-time outputs, recognize deceptive patterns or dangerous trajectories, determine if immediate intervention needed, execute killswitch or block outputs.
See Richard Ngo's t-AGI framework, which characterizes the 'capability' of a model by the duration of tasks:
A 1-minute AGI would need to beat humans at tasks like answering questions about short text passages or videos, common-sense reasoning (e.g. Yann LeCun's gears problems), simple computer tasks (e.g. use photoshop to blur an image), justifying an opinion, looking up facts, etc.
See also METR's time-horizon evaluations, which I think vindicate Richard's framework, at least in specific domains.
The best example, for my tastes at least, is window-to-building ratio. I like working and living in buildings with a high ratio, e.g. glass skyscrapers. But they look bad from the outside.
(1)
You're correct that the essay is backchainy. Stratified utopia is my current best bet for "most desirable future given our moral uncertainty", which motivates me to evaluate its likelihood. I don't think it's very likely, maybe 5-10%, and I could easily shift with further thought.
Starting with the most desirable future and then evaluating its likelihood does risk privileging the hypothesis. This is a fair critique: better epistemics would start with the most likely future and then evaluate its desirability.
(2)
Regarding your example: "if one person thinks the natural world is a horrid hell-pit of suffering and another thinks that living in a fully AI-managed environment is a kind of torture for everyone involved, they just can't compromise."
I should clarify that resource-compatibility is a claim about the mundane and exotic values humans actually hold. It's a contingent claim, not a necessary one. Yes, some people think the natural world is a hell-pit of suffering (negative utilitarians like Brian Tomasik), but they're typically scope-sensitive and longtermist, so they'd care far more about the distal resources.
You could construct a value profile like "utility = -1 if suffering exists on Earth, else 0", which is a case of exotic values seeking proximal resources. I don't have a good answer for handling such cases. But empirically, this value profile seems rare.
More common are cases involving contested sacred sites, which also violate the Nothing-To-Mourn Principle. For example, some people would mourn if the Third Temple were never rebuilt on the Temple Mount, while others would mourn if the Al-Aqsa Mosque were destroyed to make way for it.
On whether experiments mark a distinction between science and philosophy: TW has a lecture arguing against this, and he addresses it in a bunch of papers. I'll summarise his arguments later if I have time.