scasper

The taxonomy we introduced in the survey gave a helpful way of splitting up the problem. Beyond that, it took a lot of effort, several Google Docs that got very messy, and https://www.connectedpapers.com/. Personally, I've also been working on interpretability for a while and have passively formed a mental model of the space.

My answer to this is actually tucked into one paragraph on the 10th page of the paper: "This type of approach is valuable...reverse engineering a system". We cite examples of papers that have used interpretability tools to generate novel adversaries, aid in manually fine-tuning a network to induce a predictable change, or reverse engineer a network. Here they are.

Making adversaries:

https://distill.pub/2019/activation-atlas/

https://arxiv.org/abs/2110.03605

https://arxiv.org/abs/1811.12231

https://arxiv.org/abs/2201.11114

https://arxiv.org/abs/2206.14754

https://arxiv.org/abs/2106.03805

https://arxiv.org/abs/2006.14032

https://arxiv.org/abs/2208.08831

https://arxiv.org/abs/2205.01663

Manual fine-tuning:

https://arxiv.org/abs/2202.05262

https://arxiv.org/abs/2105.04857

Reverse engineering (I'd put an asterisk on these ones though because I don't expect methods like this to scale well to non-toy problems):

Don't know what part of the post you're referring to.

In both cases, it can block the Löbian proofs. But something is unsatisfying about making ad hoc adjustments to one's policy like this. I'll quote Demski on this instead of trying to write my own explanation. Demski writes:

- Secondly, an agent could reason logically but with some looseness. This can fortuitously block the Troll Bridge proof. However, the approach seems worryingly unprincipled, because we can “improve” the epistemics by tightening the relationship to logic, and get a decision-theoretically much worse result.
- The problem here is that we have some epistemic principles which suggest tightening up is good (it’s free money; the looser relationship doesn’t lose much, but it’s a dead-weight loss), and no epistemic principles pointing the other way. So it feels like an unprincipled exception: “being less dutch-bookable is generally better, but hang loose in this one case, would you?”
- Naturally, this approach is still very interesting, and could be pursued further -- especially if we could give a more principled reason to keep the observance of logic loose in this particular case. But this isn’t the direction this document will propose. (Although you could think of the proposals here as giving more principled reasons to let the relationship with logic be loose, sort of.)
- So here, we will be interested in solutions which “solve troll bridge” in the stronger sense of getting it right while fully respecting logic. IE, updating to probability 1 (/0) when something is proven (/refuted).

Any chance you could clarify?

In the Troll Bridge problem, the counterfactual (the agent crossing the bridge) would indicate the inconsistency of the agent's logical system of reasoning. See this post and what Demski calls a subjective theory of counterfactuals.

in your terms an "object" view and an "agent" view.

Yes, I think that there is a time and place for each of these two stances toward agents. We take the object stance when we are thinking about how behavior is deterministic conditioned on a state of the world and the agent, and the agent stance when we are trying to be purposive and think about what types of agents to be/design. If we never took the object stance, we couldn't successfully understand many dilemmas, and if we never took the agent stance, there would seem to be little point in talking about what any agent ever "should" do.

There’s a sense in which this is self-defeating b/c if CDT implies that you should pre-commit to FDT, then why do you care what CDT recommends as it appears to have undermined itself?

I don't especially care.

counterfactuals only make sense from within themselves

Is naive thinking about the Troll Bridge problem a counterexample to this? There, the counterfactual stems from a contradiction.

CDT doesn’t recommend itself, but FDT does, so this process leads us to replace our initial starting assumption of CDT with FDT.

I think that no type of decision theory worth two cents always recommends itself. Any decision theory X that isn't silly would recommend replacing itself before entering a mind-policing environment in which the mind police punish an agent iff it uses X.

Thanks. I rewrote the second bit you quoted; I agree that sketching the proof that way was not good.

Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross.

This should be clearer and not imply that Rob needs to be able to prove his own consistency. I hope that helps.

Here's the new version of the paragraph with my mistaken explanation fixed.

"Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross."
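The hypothetical can be sketched in modal-logic notation (my own shorthand, not from the post: □ for "provable in Rob's system", C for "Rob crosses", B for "the bridge blows up"):

```latex
% Rob can establish the implication "a proof that crossing blows up the
% bridge means crossing really would blow up the bridge" (crossing in the
% face of such a proof would reveal his inconsistency, and the troll blows
% up the bridge on inconsistent agents):
\[ \Box(C \rightarrow B) \;\rightarrow\; (C \rightarrow B) \]
% L\"ob's theorem: proving \Box P \rightarrow P suffices to prove P:
\[ \vdash \Box P \rightarrow P \quad\Longrightarrow\quad \vdash P \]
% Taking P := (C \rightarrow B), Rob's system proves C \rightarrow B,
% so Rob concludes that he should not cross.
```

Note that nothing here requires Rob to prove his own consistency; the argument only uses the implication Rob can prove and Löb's theorem.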

Thanks for the comment. tl;dr, I think you mixed up some things I said, and interpreted others in a different way than I intended. But either way, I don't think there are "enormous problems".

So the statement to be proven (which I shall call P) is not just "agent takes action X", but "when presented with this specific proof of P, the agent takes action X".

Remember that I intentionally give a simplified sketch of the proof instead of providing it in full. If I did, I would specify the provability predicate. I think you're conflating what I say about the proof with what I say about the agent. Here, I say that our model agent who is vulnerable to spurious proofs would obey a proof, if presented with one, that it would take X. Demski explains things the same way. I don't say that's the definition of the provability predicate here. In this case, an agent being willing to accede to proofs in general that it will take X is indeed sufficient for being vulnerable to spurious proofs.
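The sufficiency claim can be put in the same shorthand (again mine, not from the post: □ for provability in the agent's system, A for "the agent takes X"):

```latex
% If the agent provably obeys any proof that it will take X, its system proves
\[ \Box A \;\rightarrow\; A \]
% and L\"ob's theorem then yields \vdash A: the system proves that the agent
% takes X, whether or not taking X is a good idea. A proof obtained this way
% is "spurious" in exactly this sense.
```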

Second is that in order for Löb's theorem to have any useful meaning in this context, the agent must be consistent and able to prove its own consistency, which it cannot do by Gödel's second incompleteness theorem.

I don't know where you're getting this from; it would be helpful if you mentioned where. I definitely don't say anywhere that Rob must prove his own consistency, and neither of the two types of proofs I sketch out assumes this either. You might be focusing on a bit that I wrote: "So assuming the consistency of his logical system..." I'll edit this explanation for clarity. I don't intend that Rob be able to prove his own consistency, but that if he proved crossing would make the bridge blow up, that would imply crossing would make it blow up.

As presented, it is given a statement P (which could be anything), and asked to verify that "Prov(P) -> P" for use in Löb's theorem. While the post claims that this is obvious, it is absolutely not.

I don't know where you're getting this from either. In the "This is not trivial..." paragraph I explicitly talk about the difference between statements, proofs, and provability predicates. I think you have some confusion about what I'm saying either due to skimming or to how I have the word "hypothetically" do a lot of work in my explanation of this (arguably too much). But I definitely do not claim that "Prov(P) -> P".

I think I agree with johnswentworth's comment. I think there is a chance that equating genuinely useful ASI safety-related work with deceptive alignment could be harmful. To give another perspective, I would also add that I think your definition of deceptive alignment is very broad -- broad enough to encompass areas of research that I find quite distinct (e.g. better training methods vs. better remediation methods) -- yet still seems to exclude some other things that I think matter a lot for AGI safety. Some quick examples I thought of in a few minutes are: