Cleo Nardo13h40

On whether experiments serve as a distinction between science and philosophy: TW has a lecture arguing against this, and he addresses it in a bunch of papers. I'll summarise his arguments later if I have time.

Shortform

Cleo Nardo13h40

To clarify, I listed some of Williamson's claims, but I haven't summarised any of his arguments.

His actual arguments tend to be 'negative', i.e. they go through many distinctions that metaphilosophical exceptionalists purport, and for each he argues that either (i) the purported distinction is insubstantial,^[1] or (ii) the distinction mischaracterises philosophy or science or both.^[2]

He hasn't, I think, addressed Wei Dai's exceptionalism, which is (I gather) something like "Solomonoff induction provides a half-way decent formalism of ideal maths/science, but there isn't a similarly decent formalism of ideal philosophy."

I'll think a bit more about what Williamson might say about Wei Dai's purported distinction. I think Williamson is open to the possibility that philosophy is qualitatively different from science, so it's possible he would change his mind if he engaged with Dai's position.

^[1] An illustrative strawman: that philosophers publish in journals with 'philosophy' in the title would not be a substantial difference.
^[2] E.g., one purported distinction he critiques is that philosophy is concerned with words/concepts in a qualitatively different way than the natural sciences.

Stratified Utopia

Cleo Nardo1d20

Also I'm imagining that everyone stays on Earth and has millions of copies in space (via molecular cloning + uploads). And then it seems like people might agree to keep the Earth-copies as traditional humans, and this agreement would only affect a billionth of the Joseph-copies.

Stratified Utopia

Cleo Nardo1d20

Yep, this seems like a plausible bargaining solution. But I might be wrong. If it turns out that mundane values don't mind being neighbours with immortal robots then you wouldn't need to leave Earth.

Shortform

Cleo Nardo1d280

How Exceptional is Philosophy?

Wei Dai thinks that automating philosophy is among the hardest problems in AI safety.^[1] If he's right, we might face a period where we have superhuman scientific and technological progress without comparable philosophical progress. This could be dangerous: imagine humanity with the science and technology of 1960 but the philosophy of 1460!

I think the likelihood of philosophy ‘keeping pace’ with science/technology depends on two factors:

How similar are the capabilities required? If philosophy requires fundamentally different methods than science and technology, we might automate one without the other.
What are the incentives? I think the direct economic incentives to automate science and technology are stronger than those to automate philosophy. That said, there might be indirect incentives to automate philosophy if philosophical progress becomes a bottleneck to scientific or technological progress.

I'll consider only the first factor here: How similar are the capabilities required?

Wei Dai is a metaphilosophical exceptionalist. He writes:

We seem to understand the philosophy/epistemology of science much better than that of philosophy (i.e. metaphilosophy), and at least superficially the methods humans use to make progress in them don't look very similar, so it seems suspicious that the same AI-based methods happen to work equally well for science and for philosophy.
LW comment (Wei Dai, June 2023)

I will contrast Wei Dai's position with that of Timothy Williamson, a metaphilosophical anti-exceptionalist.

These are the claims that constitute Williamson's view:

(1) Philosophy is a science.
(2) It's not a natural science (like particle physics, organic chemistry, nephrology), but not all sciences are natural sciences: for instance, mathematics and computer science are formal sciences. Philosophy is likewise a non-natural science.
(3) Although philosophy differs from other scientific inquiries, it differs no more in kind or degree than they differ from each other. Put provocatively, theoretical physics might be closer to analytic philosophy than to experimental physics.
(4) Philosophy, like other sciences, pursues knowledge. Just as mathematics pursues mathematical knowledge, and nephrology pursues nephrological knowledge, philosophy pursues philosophical knowledge.
(5) Different sciences will vary in their subject-matter, methods, practices, etc., but philosophy doesn't differ to a far greater degree or in a fundamentally different way.
(6) Philosophical methods (i.e. the ways in which philosophy achieves its aim, knowledge) aren't starkly different from the methods of other sciences.
(7) Philosophy isn't a science in a parasitic sense. It's not a science because it uses scientific evidence or because it has applications for the sciences. Rather, it's simply another science, not uniquely special. Williamson says, "philosophy is neither queen nor handmaid of the sciences, just one more science with a distinctive character, just as other sciences have distinctive character."
(8) Philosophy is not, exceptionally among sciences, concerned with words or concepts. This conflicts with many 20th century philosophers who conceived of philosophy as chiefly concerned with linguistic or conceptual analysis, such as Wittgenstein and Carnap.
(9) Philosophy doesn't consist of a series of disconnected visionaries. Rather, it consists in the incremental contribution of thousands of researchers: some great, some mediocre, much like any other scientific inquiry.

Roughly speaking, metaphilosophical exceptionalism should make one more pessimistic about philosophical progress keeping pace with scientific and technological progress. I lean towards Williamson's position, which makes me less pessimistic about philosophy keeping pace by default.

That said, during a rapid takeoff, even small differences in the pace could lead to a growing gap between philosophical progress and scientific/technological progress. So I consider automating philosophy an important problem to work on.

^[1] See AI doing philosophy = AI generating hands? (Jan 2024), Meta Questions about Metaphilosophy (Sep 2023), Morality is Scary (Dec 2021), Problems in AI Alignment that philosophers could potentially contribute to (Aug 2019), On the purposes of decision theory research (Jul 2019), Some Thoughts on Metaphilosophy (Feb 2019), The Argument from Philosophical Difficulty (Feb 2019), Two Neglected Problems in Human-AI Safety (Dec 2018), Metaphilosophical Mysteries (2010)

AIs should also refuse to work on capabilities research

Cleo Nardo2d20

Different example, I think.

In our ttx, the AI was spec-aligned (human future flourishing etc), but didn’t trust that the lab leadership (Trump) was spec-aligned.

I don’t think our ttx was realistic. We started with an optimistic mix of AI values: spec-alignment plus myopic reward hacking.

Shortform

Cleo Nardo2d*10

Taxonomy of deal-making arrangements

When we consider arrangements between AIs and humans, we can analyze them along three dimensions:

Performance obligations define who owes what to whom. These range from unilateral arrangements where only the AI must perform (e.g. providing safe and useful services), through bilateral exchanges where both parties have obligations (e.g. AI provides services and humans provide compensation), to unilateral human obligations (e.g. humans compensate AI without receiving specified services).
Formation conditions govern how the arrangement comes into being. Some obligations might exist by default without any explicit agreement, similar to how we expect other humans to not harm us without signing contracts. Others require active consent from one party (either AI or human can create the obligation) or mutual agreement from both parties.
Termination conditions govern how the arrangement can end. Some arrangements might be permanent, others allow unilateral exit by either party, and still others require mutual consent to dissolve.

These dimensions yield 48 distinct configurations^[1], many of which map onto familiar arrangements between humans:

Employment contracts: Bilateral performance (AI works, human provides compensation), where formation requires the consent of both humans and AIs, and termination requires the consent of either party.
Slavery: Unilateral AI performance, where formation and termination require the consent of humans only.
Service agreements: Unilateral AI performance, where formation and termination require the consent of both humans and AIs.
Indentured servitude: Bilateral performance (AI works, human provides compensation), where formation requires mutual consent, but termination requires the consent of humans only.
Paid conscription: Bilateral performance (AI serves, human compensates), where formation and termination require the consent of humans only.
Gifts: Unilateral human performance, where formation and termination require the consent of humans only.

Typically when I talk about 'deals' I am referring to any arrangement with bilateral performance. This includes paid conscription, indentured servitude, and employment. It excludes slavery (where AIs have obligations but humans do not) and gifts (where humans have obligations but AIs do not).

^[1] The possible performance obligations are: (1) AIs have obligations, (2) humans have obligations, (3) both humans and AIs have obligations. The possible formation conditions are: (1) AIs can unilaterally form the arrangement, (2) humans can unilaterally form the arrangement, (3) either humans or AIs can unilaterally form the arrangement, (4) both humans and AIs must mutually agree to form the arrangement. The possible termination conditions are similar to the possible formation conditions. This gives 3×4×4=48 configurations.
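
As a rough sanity check on the taxonomy, here is a minimal Python sketch that enumerates the configurations by taking the options exactly as the footnote lists them (three performance obligations, four formation conditions, and termination conditions mirroring formation). The label strings and the `is_deal` helper are my own illustrative names, not part of the original taxonomy, and the encoding of each named arrangement is my reading of the list above.

```python
from itertools import product

# Option labels paraphrase the footnote; the identifiers are illustrative.
PERFORMANCE = ("ai_only", "human_only", "both")
FORMATION = ("ai_unilateral", "human_unilateral", "either_unilateral", "mutual")
# Assumption: termination options mirror the formation options,
# as the footnote says they are "similar".
TERMINATION = FORMATION

# Every (performance, formation, termination) triple.
configurations = list(product(PERFORMANCE, FORMATION, TERMINATION))

# The named arrangements from the list, encoded with the same labels.
NAMED = {
    "employment": ("both", "mutual", "either_unilateral"),
    "slavery": ("ai_only", "human_unilateral", "human_unilateral"),
    "service agreement": ("ai_only", "mutual", "mutual"),
    "indentured servitude": ("both", "mutual", "human_unilateral"),
    "paid conscription": ("both", "human_unilateral", "human_unilateral"),
    "gift": ("human_only", "human_unilateral", "human_unilateral"),
}

def is_deal(config):
    """A 'deal' is any arrangement with bilateral performance."""
    return config[0] == "both"

deals = sorted(name for name, config in NAMED.items() if is_deal(config))

print(len(configurations))  # number of distinct triples
print(deals)                # named arrangements that count as deals
```

Counting this way, the named arrangements occupy only six of the enumerated cells, so most configurations have no familiar human analogue in the list.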

Shortform

Cleo Nardo8d*20

Visual Cortex in the Loop:

Human oversight of AIs could occur at different timescales: Slow (days-weeks)^[1] and Fast (seconds-minutes)^[2].

The community has mostly focused on Slow Human Oversight. This makes sense: It is likely that weak trusted AIs can perform all tasks that humans can perform in minutes.^[3] If so, then clearly those AIs can substitute for humans in Fast Oversight.

But perhaps there are cases where Fast Human Oversight is helpful:

High-stakes decisions, which are rare enough that human labour cost isn't prohibitive.
Domains where weak trusted AIs are insufficiently robust.

For Fast Human Oversight, reading text won't work, because humans process text too slowly. Instead, we will need to leverage the visual cortex — it has higher bandwidth, better parallel processing, better anomaly detection.

Here are some techniques we could use:

Emoji logging
- Prompt/finetune models to annotate their outputs with emojis indicating their current activity (🔍 for searching, ⚠️ for uncertainty, 🔄 for backtracking, etc).
Probe-colored tokens
- Display generated tokens with colors based on probe values (e.g., deception probe). Use 2D color space (hue + saturation) to show two different probes simultaneously.
  [Figure: a chat interface with Llama-3.3-70B, with tokens coloured via a deception probe. Source: Parrack et al.]
Activation-to-image
- Convert model activations to images representing the internal thoughts, e.g. using a standard text-to-image model. For example, when the model thinks of "dog" then a picture of a dog appears.
Activation-to-video
- Convert a stream of activations into a video stream, providing a continuous view into the AI's internals.
- If we have hundreds of AIs operating on a single project simultaneously, then we could convert the list of activation streams into a single video stream.
Multi-agent swarm visualization
- For monitoring many AI agents simultaneously, display as particle swarms with color-coded states. Interactions between AIs are visualized via a network.
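
To make the probe-colored tokens idea concrete, here is a minimal Python sketch of the 2D color mapping: one probe drives hue and a second drives saturation, and each token is wrapped in an HTML span. Everything here is an assumption for illustration: the function names, the green-to-red hue range, the choice of HTML output, and the premise that probe scores have already been normalized to [0, 1]. It is not the interface from Parrack et al.

```python
import colorsys
import html

def probe_color(probe_a, probe_b):
    """Map two probe scores in [0, 1] to a CSS hex color.

    probe_a sets the hue (0 -> green, 1 -> red);
    probe_b sets the saturation (0 -> white, 1 -> fully saturated).
    """
    a = min(max(probe_a, 0.0), 1.0)  # clamp out-of-range scores
    b = min(max(probe_b, 0.0), 1.0)
    hue = (1.0 - a) / 3.0  # hue 1/3 is green, hue 0 is red
    r, g, bl = colorsys.hsv_to_rgb(hue, b, 1.0)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(bl * 255)
    )

def render_tokens(tokens, scores_a, scores_b):
    """Wrap each token in a span whose background encodes both probes."""
    spans = []
    for tok, a, b in zip(tokens, scores_a, scores_b):
        spans.append(
            '<span style="background:{}">{}</span>'.format(
                probe_color(a, b), html.escape(tok)
            )
        )
    return "".join(spans)

# Hypothetical scores for three tokens from two hypothetical probes.
page = render_tokens(["I", " did", " <not>"], [0.1, 0.4, 1.0], [0.2, 0.5, 1.0])
print(page)
```

Using saturation for the second probe means a token with a low second-probe score fades towards white regardless of hue, so only tokens that score highly on that probe draw the eye.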

You get used to it. I don’t even see the code. All I see is blonde, brunette, redhead.

^[1] Collect logs/transcripts/activations from suspicious behavior, run controlled experiments on model internals, understand failure modes, hold stakeholder meetings, negotiate safety measures, decide whether to halt deployment or modify training procedures.
^[2] Monitor real-time outputs, recognize deceptive patterns or dangerous trajectories, determine if immediate intervention is needed, execute killswitch or block outputs.
^[3] See Richard Ngo's t-AGI framework, which characterizes the 'capability' of a model by the duration of tasks:
A 1-minute AGI would need to beat humans at tasks like answering questions about short text passages or videos, common-sense reasoning (e.g. Yann LeCun's gears problems), simple computer tasks (e.g. use photoshop to blur an image), justifying an opinion, looking up facts, etc.
See also METR's Time Horizon evaluations, which I think vindicate Richard's framework at least in specific domains.

leogao's Shortform

Cleo Nardo9d40

The best example, for my tastes at least, is window-to-building ratio. I like working and living in buildings with a high ratio, e.g. glass skyscrapers. But they look bad from outside.

Stratified Utopia

Cleo Nardo9d30

(1)

You're correct that the essay is backchainy. Stratified utopia is my current best bet for "most desirable future given our moral uncertainty", which motivates me to evaluate its likelihood. I don't think it's very likely, maybe 5-10%, and I could easily shift with further thought.

Starting with the most desirable future and then evaluating its likelihood does risk privileging the hypothesis. This is a fair critique: better epistemics would start with the most likely future and then evaluate its desirability.

(2)

Regarding your example: "if one person thinks the natural world is a horrid hell-pit of suffering and another thinks that living in a fully AI-managed environment is a kind of torture for everyone involved, they just can't compromise."

I should clarify that resource-compatibility is a claim about the mundane and exotic values humans actually hold. It's a contingent claim, not a necessary one. Yes, some people think the natural world is a hell-pit of suffering (negative utilitarians like Brian Tomasik), but they're typically scope-sensitive and longtermist, so they'd care far more about the distal resources.

You could construct a value profile like "utility = -1 if suffering exists on Earth, else 0", which is an exotic value seeking proximal resources. I don't have a good answer for handling such cases. But empirically, this value profile seems rare.

More common are cases involving contested sacred sites, which also violate the Nothing-To-Mourn Principle. For example, some people would mourn if the Third Temple were never rebuilt on the Temple Mount, while others would mourn if the Al-Aqsa Mosque were destroyed to make way for it.

LESSWRONG
LW