James Payor



Saying some more things: Löb's Theorem is a true statement. Whenever □ talks about an inner theory at least as powerful as e.g. PA, the theorem shows you how to turn a proof of □P→P into a proof of P.

This means you cannot prove ¬□⊥, or similarly that □⊥→⊥.

¬□⊥ is one way we can attempt to formalize "self-trust" as "I never prove false things". So indeed the problem is with this formalization: it doesn't work, you can't prove it.

This doesn't mean we can't formalize self-trust a different way, but it shows the direct way is broken.
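To spell out why the direct formalization breaks, here is the standard argument, instantiating Löb's theorem at P = ⊥ (this is Gödel's second incompleteness theorem):

```latex
% Löb's theorem: if  ⊢ (□P → P)  then  ⊢ P.
% Take P = ⊥. Since ¬□⊥ is by definition □⊥ → ⊥:
\vdash \neg\Box\bot
  \;\equiv\; \vdash (\Box\bot \to \bot)
  \;\Longrightarrow\; \vdash \bot
  \quad\text{(by Löb)}
```

So any theory of this kind that proves its own consistency thereby proves False.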

Hm, I think your substitution isn't right, and the more correct one is "it is provable that (it is not provable that False) implies (it is provable that False)", ala □¬□⊥ → □⊥.

I'm again not following well, but here are some other thoughts that might be relevant:

• It's provable for any P that ⊥→P, i.e. from "False" anything follows. This is how we give "False" grounding: if the system proves False, then it proves everything, and distinguishes nothing.

• There are two levels at which we can apply Löb's Theorem, which I'll call "outer Löb's Theorem" and "inner Löb's Theorem".

• Outer Löb's Theorem says that whenever PA proves □P→P, then PA also proves P. It constructs the proof of P using the proof of □P→P.
• Inner Löb's Theorem is the same, formalized in PA. It proves □(□P→P)→□P. The logic is the same, but it shows that PA can translate an inner proof of □P→P into an inner proof of P.
• Notably, the outer version is not "PA proves (□P→P)→P". We need to have the proof of □P→P available in order to construct the proof of P.
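Side by side, with ⊢ read as "PA proves":

```latex
\begin{aligned}
\text{Outer Löb:}  &\quad \text{if } \vdash (\Box P \to P) \text{ then } \vdash P\\
\text{Inner Löb:}  &\quad \vdash \Box(\Box P \to P) \to \Box P\\
\text{Not a theorem:} &\quad \vdash (\Box P \to P) \to P
\end{aligned}
```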

I'm not sure I understand what you're interested in, but can say a few concrete things.

We might hope that PA can "learn things" by looking at what a copy of itself can prove. We might furthermore expect that it can see that a copy of itself will only prove true sentences.

Naively this should be possible. Outside of PA we can see that PA and its copy are isomorphic. Can we then formalize this inside PA?

In the direct attempt to do so, we construct our inner provability predicate □, where □P is a statement that says "there exists a proof of P in the inner copy of PA".

But Löb's Theorem rules out formalizing self-trust this way. The statement ¬□⊥ means "there are no ways to prove falsehood in the inner copy of PA". But if PA could prove that, Löb's Theorem turns it directly into a proof of ⊥!

This doesn't AFAICT mean self-trust of the form "I trust myself not to prove false things" is impossible, just that this approach fails, and you have to be very careful about deferral.

Something I'm now realizing, having written all these down: the core mechanism really does echo Löb's theorem! Gah, maybe these are more like Löb than I thought.

(My whole hope was to generalize to things that Löb's theorem doesn't! And maybe these ideas still do, but my story for why has broken, and I'm now confused.)

As something to ponder on, let me show you how we can prove Löb's theorem following the method of ideas #3 and #5:

• □P→P is assumed
• We consider the loop-cutter X, a sentence constructed (by diagonalization) so that X ↔ □(X→P)
• We verify that if X activates then P must be true: X gives □(X→P), hence also □X, hence □P, hence P
• Then, X can satisfy □(X→P) by finding the same proof.
• So X activates, and P is true.
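Written out as a derivation, where X is the loop-cutter, a diagonal sentence with X ↔ □(X→P):

```latex
\begin{aligned}
1.&\ \vdash \Box P \to P                     &&\text{(assumption)}\\
2.&\ \vdash X \leftrightarrow \Box(X \to P)  &&\text{(diagonal sentence)}\\
3.&\ \vdash X \to \Box(X \to P)              &&\text{(from 2)}\\
4.&\ \vdash X \to \Box X                     &&\text{(3; } \Box A \to \Box\Box A \text{, and 2 inside } \Box\text{)}\\
5.&\ \vdash X \to \Box P                     &&\text{(3, 4; modus ponens under } \Box\text{)}\\
6.&\ \vdash X \to P                          &&\text{(5, 1)}\\
7.&\ \vdash \Box(X \to P)                    &&\text{(6; necessitation)}\\
8.&\ \vdash X                                &&\text{(7, 2)}\\
9.&\ \vdash P                                &&\text{(8, 6)}
\end{aligned}
```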

In English:

• We have P, who is blocked on □P (the assumption □P→P says P activates once P is provable)
• We introduce P to the loop cutter X, who will activate if activation provably leads to P being true
• X encounters the argument "if X activates then P is true, and this causes X to activate"
• This satisfies X's requirement for a proof that X→P, so X becomes true, and P follows.

Perhaps the confusion is mostly me being idiosyncratic! I don't have a good reference, but can attempt an explanation.

The propositions A and B are meant to model the behaviour of some agents, say Alice and Bob. The proposition A means "Alice cooperates", likewise B means "Bob cooperates".

I'm probably often switching viewpoints, talking about A as if it's Alice, when formally A is some statement we're using to model Alice's behaviour.

When I say "A tries to prove that A→B", what I really mean is: "In this scenario, Alice is looking for a proof that if she cooperates, then Bob cooperates. We model this with A meaning 'Alice cooperates', and A follows from □(A→B)."

Note that every time we use □ we're talking about proofs of any size. This makes our model less realistic, since Alice and Bob only have a limited amount of time in which to reason about each other and try to prove things. The next step would be to relax the assumptions to things like □_k B → A, which says "Alice cooperates whenever there is a proof that Bob cooperates in less than k steps".

I think I saw OpenAI do some "rubric" thing which resembles the Anthropic method. It seems easy enough for me to imagine that they'd do a worse job of it, though, since folks at Anthropic seem to be the originators of the idea (and are more likely to have a bunch of inside-view stuff that helped them apply it well).

Yeah, all four of those are real things happening, and are exactly the sorts of things I think the post has in mind.

I take "make AI alignment seem legit" to refer to a bunch of actions that are optimized to push public discourse and perceptions around. Here's a list of things that come to my mind:

• Trying to get alignment research to look more like a mainstream field, by e.g. funding professors and PhD students who frame their work as alignment and giving them publicity, organizing conferences that try to rope in existing players who have perceived legitimacy, etc
• Papers like Concrete Problems in AI Safety that try to tie AI risk to stuff that's already in the Overton window / already perceived as legitimate
• Optimizing language in posts / papers to be perceived well, by e.g. steering clear of the part where we're worried AI will literally kill everyone
• Efforts to make it politically untenable for AI orgs to not have some narrative around safety

Each of these things seems to have a core good thing, but according to me they've all backfired to the extent that they were optimized to avoid the thorny parts of AI x-risk, because this enables rampant goodharting. Specifically, I think the effects of avoiding the core stuff have been bad: creating weird cargo cults around alignment research, making it easier for orgs to have fake narratives about how they care about alignment, etc.

Also Plan B is currently being used to justify accelerating various danger tech by folks with no solid angles on Plan A...

My troll example is a fully connected network with all zero weights and biases, no skip connections.

This isn't something that you'd reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.
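To illustrate (this is my own sketch, a hypothetical two-layer ReLU net with manual backprop, not code from the discussion): with all parameters at zero, the hidden activations are zero and so is the output weight matrix, so no gradient reaches W1, b1, or W2. Only the final bias receives any signal, and the network remains a constant function forever.

```python
import numpy as np

# Toy fully connected net: y = W2 @ relu(W1 @ x + b1) + b2,
# squared-error loss. All weights and biases start at exactly zero.

def forward_backward(W1, b1, W2, b2, x, y_target):
    z1 = W1 @ x + b1
    h = np.maximum(z1, 0.0)              # ReLU; h = 0 when W1, b1 = 0
    y = W2 @ h + b2
    loss = 0.5 * np.sum((y - y_target) ** 2)

    dy = y - y_target                    # dL/dy
    dW2 = np.outer(dy, h)                # zero: h is zero
    db2 = dy                             # the only nonzero gradient
    dh = W2.T @ dy                       # zero: W2 is zero
    dz1 = dh * (z1 > 0)                  # zero twice over (dh = 0, relu' = 0)
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
x, y_target = rng.normal(size=4), rng.normal(size=2)
W1, b1 = np.zeros((3, 4)), np.zeros(3)
W2, b2 = np.zeros((2, 3)), np.zeros(2)

loss, (dW1, db1, dW2, db2) = forward_backward(W1, b1, W2, b2, x, y_target)
print(np.all(dW1 == 0), np.all(db1 == 0), np.all(dW2 == 0))  # True True True
```

Standard initialization avoids this point precisely because nothing (except the final bias) can ever move away from it by gradient descent.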

To look for a true hacker I'd try to reconfigure the way the downstream computation works (by modifying attention weights, saturating relus, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.

It looks like you're investigating an angle that I can't follow, but here's my two cents re bounded agents:

My main idea to port this to the bounded setting is to have a bot that searches for increasingly long proofs, knowing that if it takes longer to find a proof then it is itself a bit harder to reason about.

We can instantiate this like: A ↔ ∃k. □_k(□_(k+c) A → B), where □_n means "provable in at most n steps" and c is a constant overhead.

The idea is that if there is a short way to prove that the opponent would cooperate back, then it takes just a constant c steps more to prove that we cooperate. So it doesn't open us up to exploitation to assume that our own cooperation is provable in k+c steps.
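Spelling out the bookkeeping (the bot definition A ↔ ∃k. □_k(□_(k+c) A → B) and the constant c here are my reconstruction of the setup, not a worked theorem):

```latex
\begin{aligned}
&\vdash_k \bigl(\Box_{k+c}\,A \to B\bigr)
  &&\text{(a length-}k\text{ proof that B cooperates back)}\\
\Longrightarrow\;&\vdash_{k+c} A
  &&\text{(A activates; wrapping the argument costs only a constant } c\text{)}\\
\Longrightarrow\;&\Box_{k+c}\,A
  &&\text{(so the assumption A made about itself is discharged)}
\end{aligned}
```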

The way in which this works at all is by cutting the loop at the point where the opponent is thinking about our own behaviour. This bot cuts it rather aggressively: it assumes that, no matter the context, when B thinks about whether A cooperates, it's provable that A does cooperate. (I think this isn't great and can be improved to a weaker assumption that would lead to more potential cooperation.)

If you construct B similarly, I claim that A and B mutually cooperate if the proof-length bounds are large enough, and mutually defect otherwise.

Similarly, I claim A can mutually cooperate with other bots built along these lines.