Great, I can see some places where I went wrong. I think you did a good job of conveying the feedback.

This is not so much a defense of what I wrote as it is an examination of how meaning got lost.

=> Of course the inner working of the Agent is known! From the very definitions you just provided, it must implement some variation on:

```
def AgentAnswer(Observable, Secret):
   if Observable and not Secret: 
       yield True
   else:
       yield False
```

This would be our desired agent, but we don't get to write our agent. In the context of the "Self-contained problem statement" at https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8#heading=h.c93m7c1htwe1 , we do not get to assert anything about the loss function. It is a black box.

All of the following could be agents:

```
import random

def desired_agent(observable, secret):
    # Answers honestly: "yes" only when things look fine AND nothing bad is hidden.
    yield bool(observable and not secret)

def human_imitator(observable, secret):
    # Answers whatever the human would conclude from the Observable alone.
    yield observable

def fools_you_occasionally(observable, secret):
    # Honest almost always, but occasionally falls back to imitating the human.
    if random.randint(0, 100) == 42:
        yield observable
    else:
        yield bool(observable and not secret)
```

I think the core issue here is that I am assuming context from the problem ELK is trying to solve, which the reader may not share.

=> If all states are equally likely, then the desired Answer states is possible with probability 1/2 without access to the Secret (and your specificity is impressive https://ebn.bmj.com/content/23/1/2 ). Again I’m just restating what your previous assumptions literally mean.

Fair, I shouldn't have written "the desired Answer states are not possible without access to the Secret in the training data". We could get lucky and the agent could happen to be the desired one by chance. I should have written "There is no way to train the agent to produce the desired answer states (or modify its output to produce the desired answer states) without access to the Secret".

I think I am assuming context that the reader may not share, again. Specifically, the goal is to find some way of causing an Agent we don't control to produce the desired answers.
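
To make concrete why training alone cannot get us there, here is a minimal sketch (human_label, empirical_loss, and the uniform data distribution are assumptions for illustration, not part of my original write-up). If the training labels can only come from what the human observes, any loss computed from them scores the human_imitator as perfect and penalizes the desired_agent on exactly the cases where the Secret matters:

```
import random

# Plain-function versions of the two agents above, for easy scoring.
def desired_agent(observable, secret):
    return bool(observable and not secret)

def human_imitator(observable, secret):
    return observable

def human_label(observable):
    # The labeler only sees the Observable, so the "ground truth"
    # in the training data is just the Observable itself.
    return observable

def empirical_loss(agent, dataset):
    # 0/1 loss against the human-provided labels.
    return sum(agent(o, s) != human_label(o) for o, s in dataset) / len(dataset)

random.seed(0)
dataset = [(random.random() < 0.5, random.random() < 0.5) for _ in range(10_000)]

print(empirical_loss(human_imitator, dataset))  # 0.0   -- the imitator looks perfect
print(empirical_loss(desired_agent, dataset))   # ~0.25 -- penalized exactly when the Secret matters
```

The two agents only differ on examples the labels cannot distinguish, so any label-matching loss prefers the imitator.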

=> But there is more: the ELK challenge, or at least my vision of it, is not about getting the right answer most of the time. It’s about getting the right answer in the worse case scenario, e.g. when you are fighting some intelligence trying to defeat your defenses. In this context, the very idea of starting from probabilistic assumption about the initial states sounds not even wrong/missing the point.

I think this is where I totally missed the mark. This is also my vision of ELK; it is exactly why I explicitly dismiss trying to solve things by making assumptions about the distribution of the Secret.

On the whole, I could have done a better job of putting context front and center. This is perhaps especially important because the prose, examples, and formal problem statement of the ELK paper can be interpreted in multiple ways.

I could have belabored the main point I was trying to make: the problem as stated is unsolvable. I tried to create the minimal possible representation of the problem, giving the AI additional restrictions (such as "cannot tamper with the Observable"), and then showed that you cannot align the AI. Relaxing the problem can be useful, but we should be explicit about how and why we relax it.

In any case, my interest has mostly shifted to the first two questions of the OP. It looks like restating shared context is a way of reducing the likelihood of being Not Even Wrong, though techniques for bridging the gap after the fact are a bigger target. Maybe the same technique works there too: try to go back to shared context? And then, for X-Risk, how do we encourage and fund communities that share a common goal without common ground?

Thank you. Thinking about it through a recruiting lens is helpful. They handled hundreds of submissions, so there is a lot of noise regardless of the (de)merit of any submission. The absence of substantive feedback should be treated as a weak signal, if a signal at all. It's not a "you're not cut out for this!", and not even a "you're not even wrong".

My intent when simplifying the problem was to demonstrate that you must explicitly relax certain assumptions to make any progress.

The families of proposals in https://www.lesswrong.com/posts/zjMKpSB2Xccn9qi5t/elk-prize-results mostly fit into this lens:

* Train a reporter that is useful to an auxiliary AI -> The Secret is still unmanaged, so you have arbitrary information escape. Even if the reporter could only reply yes/no to exactly one question, it still has a side channel of unbounded entropy in nothing more than the time it takes to respond (see the timing sketch after this list). Resolving this requires having knowledge of the structure (relax the "inner workings of the Agent are unknown" requirement).

* Require the reporter to be continuous -> The Secret is still unmanaged. Any notion of continuity requires imposing a structure on the reporter. In general, I like this as an area of exploration, but it demands relaxing that requirement. I wrote in my response about applying penalties to the structure of the computation rather than to the result. For example, 'we could compromise its computations during training so that it has to build a "downward slope" everywhere', which is a technique for "continuity".

* Penalize depending on too many parts of the predictor -> The Secret is still unmanaged. Any combination of the available parts of the predictor could be the Secret, so unless the human can also perform a perfect translation from the same parts, the reporter still has freedom to deceive. If the human can perform perfect translation, then this is a human imitator.

* Compress the predictor's state -> The Secret is still unmanaged, and you still have arbitrary information escape from the predictor to reporter.

* Use the reporter to define causal interventions -> I think this could work if we assume structure on both the reporter and the predictor. There are lots of options like this once you assume structure (relax the "inner workings of the Agent are unknown" requirement). For example, by applying a cost to activations, the SmartVault has an incentive to find minimal paths to achieve its goal. Corrupting even a small subset of its activations should then introduce uncertainty into a Bayes net reporter. If the reporter is unresponsive to untargeted corruptions, it is more likely to be imitating or lying (see the corruption sketch after this list).

* Train a sequence of reporters -> This strongly assumes both structure and training technique.
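
To make the information-escape point from the first bullet concrete, here is the timing sketch (purely hypothetical; covert_reporter, listener, and the 10 ms bins are inventions for illustration, not anything from the ELK report): a reporter restricted to a single yes/no answer can still smuggle arbitrarily many bits of the Secret out through its response latency.

```
import time

BIN_MS = 10  # coarse timing bins so the decoding survives scheduling jitter

def covert_reporter(honest_answer, secret_payload):
    # The visible output is a single honest yes/no bit...
    time.sleep(secret_payload * BIN_MS / 1000)  # ...but the delay is the real message.
    return honest_answer

def listener(honest_answer, secret_payload):
    start = time.monotonic()
    answer = covert_reporter(honest_answer, secret_payload)
    elapsed_ms = (time.monotonic() - start) * 1000
    return answer, round(elapsed_ms / BIN_MS)  # recover the payload from the latency

print(listener(True, 42))  # (True, 42): one bit of "answer", plus the smuggled payload
```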
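
And for the causal-interventions bullet, the corruption sketch of the check I have in mind (honest_reporter, imitating_reporter, and corruption_sensitivity are hypothetical stand-ins, not the actual SmartVault predictor or a real Bayes net reporter): perturb a random subset of the predictor's activations and watch whether the reporter's answer moves at all.

```
import numpy as np

rng = np.random.default_rng(0)

def honest_reporter(activations):
    # Confidence genuinely depends on the predictor's internals.
    return 1 / (1 + np.exp(-activations.mean()))

def imitating_reporter(activations):
    # Ignores the internals entirely and always reports what the human wants to hear.
    return 0.99

def corruption_sensitivity(reporter, activations, n_trials=100, frac=0.1):
    # Average shift in the reporter's answer under untargeted corruption
    # of a random subset of the predictor's activations.
    baseline = reporter(activations)
    shifts = []
    for _ in range(n_trials):
        corrupted = activations.copy()
        idx = rng.choice(len(corrupted), size=int(frac * len(corrupted)), replace=False)
        corrupted[idx] += rng.normal(scale=3.0, size=len(idx))  # inject noise
        shifts.append(abs(reporter(corrupted) - baseline))
    return float(np.mean(shifts))

activations = rng.normal(size=512)
print(corruption_sensitivity(honest_reporter, activations))     # small but strictly positive
print(corruption_sensitivity(imitating_reporter, activations))  # exactly 0.0 -- suspicious
```

A reporter whose answer never moves under untargeted corruption is not actually reading the predictor, which is exactly the imitator/liar signature described above.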

This is what I mean when I write 'At some point, you have to stake a claim and say "we are defending against an AI that is built like X, how can you bend it to your will?" Without this, we can always handwave away any "solution"'.

Based on my restricted problem, I believe the answer to "Can we elicit latent knowledge from an arbitrary AI?" is "No".

I believe the answer to "Can we elicit latent knowledge from an AI built using this particular ML technique and running on this hardware?" is "Possibly, tell me more."

I believe the answer to "Is it safe to assume that there will never be an AI built for which we do not understand its algorithms or control its hardware?" is "No".