James Payor

Message

I think about AI alignment; send help.

I'm also on twitter. More links on my homepage payor.io.

1234

224

148

11y

Working through a small tiling result

tl;dr it seems that you can get basic tiling to work by proving that there will be safety proofs in the future, rather than trying to prove safety directly. "Tiling" here roughly refers to a state of affairs in which we have a program that is able to prove itself...

May 13, 202572

Thinking about maximization and corrigibility

Thanks in no small part to Goodhart's curse, there are broad issues with getting safe/aligned output from AI designed like "we've given you some function f(x), now work on maximizing it as best you can". Part of the failure mode is that when you optimize for highly scoring x, you...

Apr 21, 202363

Some constructions for proof-based cooperation without Löb

This post presents five closely-related ways to achieve proof-based cooperation without using Löb's theorem, and muses on legible cooperation in the real world. I'm writing this as a follow-up to Andrew Critch's recent post, to share more of my perspective on the subject. We're going to dive straight into the...

Mar 21, 202343

A proof of inner Löb's theorem

This is a short post that offers a slightly different take on the standard proof of Löb's theorem. It offers nothing else of any value :) We seek to prove the "inner" version, which we write as: □P↔□(□P→P) The proof uses quining to build a related sentence L, the "Löb...

Feb 21, 202313

LESSWRONG
LW

LESSWRONG
LW

James Payor

James Payor

James Payor

James Payor

Working through a small tiling result

Thinking about maximization and corrigibility

Some constructions for proof-based cooperation without Löb

A proof of inner Löb's theorem

Working through a small tiling result

Thinking about maximization and corrigibility

Some constructions for proof-based cooperation without Löb

A proof of inner Löb's theorem