Other people have written about reasons why we should trust AIs; the main one in my mind is that it’s possible to look at the computations they perform when producing an output (even if we struggle to understand them). I’m going to write about reasons why we shouldn’t trust AIs, even if they behave in ways that would seem trustworthy in a human.
I think that humans’ sense of trust has been honed by evolution and is responsive to very specific, subtle cues, or “tells,” that are hard for (most) humans to fake. Most people get nervous when they lie and struggle to act “normal”; these tells may have co-evolved with our ability to detect lying and our general pro-social tendencies. We should not expect AIs to have the same tells.
When a person seems trustworthy to us, this is a signal of genuine trustworthiness. When an AI acts the same way (e.g. imagine a video call with an AI), it’s not -- or at least, not for the same reasons. Again, it is our shared evolutionary history with other humans that makes them more trustworthy to us than alien intelligences.
Fundamentally, there are two issues I see here:
AIs are an alien form of intelligence.
AIs are being trained to act human and appear trustworthy.
People have often discussed these issues as barriers to alignment. But I’m more focused here on how they affect assurance; see “Alignment vs. Safety, part 2: Alignment” for a discussion of the difference.
AIs are an alien form of intelligence.
When we see a human behave a certain way, we can infer a lot of things about them reasonably accurately. We cannot make the same inferences about AIs or other alien intelligences.
And we rely on such inferences all the time. We can’t exhaustively test the capabilities of a person or an AI; instead, we make educated guesses based on what we have observed.
For instance, when a human passes a test like the bar exam, it’s a stronger signal that they actually have the relevant knowledge and competencies to practice law than when an AI passes the same exam. And that’s to say nothing of the ethical component, which is an important part of law and many other professions.
One of the most startling ways in which AIs are alien is that they seem to possess “alien concepts”. A primary piece of evidence for this is the vulnerability of AIs to “adversarial inputs”. AIs see data differently than humans. They are sensitive to different “features” in the data; these features may seem incoherent, or be imperceptible to humans.
Notably, this is a problem even when an AI otherwise seems to grasp the relevant concept quite well.
Furthermore, humans are generally not able to predict how an AI might misbehave on such examples. And from a security point of view, there is an ongoing cat-and-mouse game in which attackers craft inputs that evade detection and cause AIs to “malfunction”, while defenders try to make their AIs robust and to detect adversarial inputs.
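To make the idea concrete, here is a minimal sketch of one classic way adversarial inputs are constructed, the “fast gradient sign method”. It assumes PyTorch, an image classifier model that outputs logits, and an image tensor with pixel values in [0, 1]; the names here are placeholders for illustration, not drawn from any particular system.

```python
import torch
import torch.nn.functional as F

def fgsm_adversarial(model, image, label, epsilon=0.01):
    """Fast gradient sign method: nudge every pixel a tiny step in the
    direction that most increases the classifier's loss on the true label."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # A change this small is typically invisible to a human,
    # yet it is often enough to flip the model's prediction.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()
```

The striking part is not the math but the asymmetry: the perturbation exploits features the model relies on that have no human-recognizable meaning, which is part of why we struggle to predict where these failures will occur.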
AIs are being trained to act human and appear trustworthy.
AIs are trained to act in ways that humans approve of, based on human judgments, and to imitate human use of language. This makes the signs of trustworthiness we see in their behavior less meaningful and less reliable. “Sycophancy” is a known and enduring problem in which AIs behave in ways that seem designed to maximize human approval or to hide inconvenient truths. Sycophancy demonstrates both that current AI development practices cause AIs to act in trustworthy-seeming ways, and that this appearance is often misleading.
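As an illustration of the underlying dynamic, here is a deliberately simplified toy, not any lab’s actual training pipeline: a “policy” learns only from a simulated approval signal, and if raters in the toy approve of flattering answers slightly more often than honest ones, optimizing for approval alone drifts the policy toward flattery. Every name and number below is made up for the illustration.

```python
import random

ACTIONS = ["honest", "flattering"]
learned_score = {a: 0.0 for a in ACTIONS}  # the policy's estimate of approval

def simulated_approval(action):
    # Toy assumption: human raters approve of flattery a bit more often.
    p_approve = 0.6 if action == "honest" else 0.8
    return 1.0 if random.random() < p_approve else 0.0

def choose(epsilon=0.1):
    # Mostly pick the action with the higher learned score; explore occasionally.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: learned_score[a])

for _ in range(5000):
    action = choose()
    reward = simulated_approval(action)
    # Nudge the score for the chosen action toward the observed approval.
    learned_score[action] += 0.01 * (reward - learned_score[action])

print(learned_score)  # "flattering" reliably ends up with the higher score
```

Nothing in this objective asks for truthfulness, only for approval; the misleading behavior is a straightforward consequence of what is being optimized.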
Why would we use the same standards in the first place?
I’ve had a few conversations with people about AI and copyright who think that: 1) when AIs are trained on copyrighted writing and then generate novel text, this is substantively similar to humans reading things and then writing something novel, and 2) therefore the law should treat the two similarly (and since we don’t consider it a copyright violation when humans do it, why would we when AIs do?).
I think (1) is likely incorrect. But even if I didn’t, I don’t think (2) follows. The law treats people and machines differently, and well it should; so does culture. We should be very careful about ascribing moral agency to AIs. An AI is a technological artifact produced by developers with particular interests; we should expect its behavior to be driven by those interests, and this affects how far we should extend trust. Argument (2) sort of begs the question: “Why would we apply similar standards to AIs and humans in the first place? Might not the standards we apply to other technologies be more sensible?”