A lack of introspective ability is not a lack of corrigibility

lc

[CW: Responding to a tweet]

Human beings have many native capabilities that are hard for us to analyze. For example, we are prodigiously good at determining which human we're talking to from the way light reflects off each other's faces. We have memory banks of often thousands of faces that allow us to identify others within seconds years later, often after only a single encounter. While almost all humans are born with this capability, the only way we've ever figured out how to teach a computer to do it is by generating a new, equally uninterpretable algorithm through deep learning.

Suppose an alien comes to Earth, kidnaps a human, and interrogates them about how they do this. The human struggles to come up with an explanation that does not reduce to "I just look at them with my eyes and remember who they are." Should the alien conclude that the human is being uncooperative, and apply the crowbar?

Obviously not; it's just that the "agent-they-are-speaking-to" is not one that knows or understands the algorithm for the face-remembering component.

In the same way that humans come with body parts they don't natively understand, and display cognitive capabilities that they are unable to reduce ad infinitum (such as language, or general intelligence), LLMs also have components that the mind can't articulate. But this doesn't mean that LLMs are uncooperative, just that minds can be made out of complex components that are hard to understand. While this does present additional problems (it is hard to use AI to align AIs when they don't really know themselves, and hard for us to do research when AI biology is uninterpretable), it also complicates any attempt to infer the general success of alignment from observations like the one this tweet describes.

Humans may be cooperative, but the aliens still can't trust them to navigate their spaceships, if the humans cannot reliably promise they won't crash the spaceship because of some weird optical illusion.

"I just look at them with my eyes and remember who they are." Should the alien conclude that the human is being uncooperative, and apply the crowbar?

With the crowbar on the table I'll at least explain that the face splits into other high-level features (eyes, nose, mouth, etc.) which can be very memorable on their own.

As the bar inches closer I'll venture into something more vague, about contrasts of features or associations or how features affect emotions.

...But that misses the point. On the topic of the tweet: doesn't Eliezer acknowledge what you talk about later?

But, sure, maybe all the things that LLMs know about gravity and medicine and cups on tables is one kind of knowledge to them, and all the wordless secrets they know about humans are in a hidden and incommensurable format.

Well, I mean, if he were really acknowledging the point, he would have acknowledged that his tweet doesn't make any sense.