Hm I think your substitution isn't right, and the more correct one is "it is provable that (it is not provable that False) implies (it is provable that False)", ala .
I'm again not following well, but here are some other thoughts that might be relevant:
It's provable for any that , i.e. from "False" anything follows. This is how we give "False" grounding: if the system proves False, then it proves everything, and distinguishes nothing.
There are two levels at which we can apply Löb's Theorem, which I'll call "outer Löb's Theorem" and "inner Löb's Theorem".
I'm not sure I understand what you're interested in, but can say a few concrete things.
We might hope that PA can "learn things" by looking at what a copy of itself can prove. We might furthermore expect that it can see that a copy of itself will only prove true sentences.
Naively this should be possible. Outside of PA we can see that PA and its copy are isomorphic. Can we then formalize this inside PA?
In the direct attempt to do so, we construct our inner copy , where is a statement that says "there exists a proof of in the inner copy of PA".
But Löb's Theorem rules out formalizing self-trust this way. The statement means "there are no ways to prove falsehood in the inner copy of PA". But if PA could prove that, Löb's Theorem turns it directly into a proof of !
This doesn't AFAICT mean self-trust of the form "I trust myself not to prove false things" is impossible, just that this approach fails, and you have to be very careful about deferral.
Something I'm now realizing, having written all these down: the core mechanism really does echo Löb's theorem! Gah, maybe these are more like Löb than I thought.
(My whole hope was to generalize to things that Löb's theorem doesn't! And maybe these ideas still do, but my story for why has broken, and I'm now confused.)
As something to ponder on, let me show you how we can prove Löb's theorem following the method of ideas #3 and #5:
Perhaps the confusion is mostly me being idiosyncratic! I don't have a good reference, but can attempt an explanation.
The propositions and are meant to model the behaviour of some agents, say Alice and Bob. The proposition means "Alice cooperates", likewise means "Bob cooperates".
I'm probably often switching viewpoints, talking about is if it's Alice, when formally is some statement we're using to model Alice's behaviour.
When I say " tries to prove that ", what I really mean is: "In this scenario, Alice is looking for a proof that if she cooperates, then Bob cooperates. We model this with meaning 'Alice cooperates', and follows from ."
Note that every time we use we're talking about proofs of of any size. This makes our model less realistic, since Alice and Bob only have a limited amount of time in which to reason about each other and try to prove things. The next step would be to relax the assumptions to things like , which says "Alice cooperates whenever there is a proof that Bob cooperates in less than steps".
I think I saw OpenAI do some "rubric" thing which resembles the Anthropic method. It seems easy enough for me to imagine that they'd do a worse job of it though, or did something somewhat worse, since folks at Anthropic seem to be the originators of the idea (and are more likely to have a bunch of inside view stuff that helped them apply it well)
Yeah, all four of those are real things happening, and are exactly the sorts of things I think the post has in mind.
I take "make AI alignment seem legit" to refer to a bunch of actions that are optimized to push public discourse and perceptions around. Here's a list of things that come to my mind:
Each of these things seems like they have a core good thing, but according to me they've all backfired to the extend that they were optimized to avoid the thorny parts of AI x-risk, because this enables rampant goodharting. Specifically I think the effects of avoiding the core stuff have been bad, creating weird cargo cults around alignment research, making it easier for orgs to have fake narratives about how they care about alignment, and etc.
Also Plan B is currently being used to justify accelerating various danger tech by folks with no solid angles on Plan A...
My troll example is a fully connected network with all zero weights and biases, no skip connections.
This isn't something that you'd reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.
To look for a true hacker I'd try to reconfigure the way the downstream computation works (by modifying attention weights, saturating relus, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.
It looks like you're investigating an angle that I can't follow, but here's my two cents re bounded agents:
My main idea to port this to the bounded setting is to have a bot that searches for increasingly long proofs, knowing that if it takes longer to find a proof then it is itself a bit harder to reason about.
We can instantiate this like:
The idea is that if there is a short way to prove that the opponent would cooperate back, then it takes just a constant steps more to prove that we cooperate. So it doesn't open us up to exploitation to assume that our own cooperation is provable in steps.
The way in which this works at all is by cutting the loop at the point where the opponent is thinking about our own behaviour. This bot cuts it rather aggressively: it assumes that no matter the context, when thinks about whether cooperates, it's provable that does cooperate. (I think this isn't great and can be improved to a weaker assumption that would lead to more potential cooperation.)
If you construct similarly, I claim that and mutually cooperate if and are large enough, and mutually defect otherwise.
Similarly, I claim can mutually cooperate with other bots like .
Saying some more things, Löb's Theorem is a true statement: whenever □ talks about an inner theory at least as powerful as e.g. PA, the theorem shows you how to prove □(□P→P)→□P.
This means you cannot prove □(□⊥→⊥), or similarly that □⊥→⊥.
□⊥→⊥ is one way we can attempt to formalize "self-trust" as "I never prove false things". So indeed the problem is with this formalization: it doesn't work, you can't prove it.
This doesn't mean we can't formalize self-trust a different way, but it shows the direct way is broken.