You're absolutely right, Senator. I was being naive about the political reality.

Chris Datcu

Epistemic status: a pattern I keep seeing in my work. I build pipelines where LLMs generate formal assertions from natural language specs, and I think a lot about what happens when we knotify ^[1] loops between human intent and machine output. Confidence in the observation is high; confidence in the proposed framing is medium.

~~~~~~

LLMs encode simplified models of humans by compressing large amounts of human-produced text into lower-dimensional approximations of "how humans think".

People are then integrating AI outputs as their own positions, especially if the output is genuinely well-constructed and confirms their priors. People in governance positions are doing it (sometimes on camera), many are watching, and nobody is building a breaker.

This builds a loop that's constraining human complexity (irreducible) into complicated (lots of moving parts, in principle reducible) models.

This loop worries me partly because humans are already bad at recognizing value in the first place. Imagine for a moment the internals of a human deciding to change a name such as Department of Defense to Department of War (now proudly hosted at war.gov). I'd bet some misfiring of internals happened there, and if the felt sense of good can misfire at that scale, it can misfire anywhere ^[2].

I'm not sure how common or how widespread this is, but I've heard "even AI agrees" a non-zero number of times in my social bubbles. If we take a system's output and use it as apparent objectivity, I'd at least wish we did it better^[3].

The alignment community has proposed circuit breakers at the model level: constitutional AI, scalable oversight, and mech-interp-based monitoring, all attempts to ensure the model behaves well. But because of how our society works, the failure mode I'm describing doesn't require the model to behave badly. The model can be perfectly well-calibrated, honest, and non-sycophantic by the subset of metrics we manage to set on it. Nevertheless, the loop still forms. Here's why I think this is the case:

Sycophancy can be a quasi-property of the medium. If every output reads like it was written by a smarter version of oneself, one may integrate it as a self-generated thought whether or not it technically disagrees on specifics.
Even if the model flags uncertainty or disagreement, the user curates what they present. "AI helped me draft this" becomes "Analysis shows that", or questions like "Was this vibecoded?" get answered with "Less than 50% and only where the code was too bad to go through by myself" ^[4]. What model-level interventions prevent this type of repackaging?
Scalable oversight is designed for scenarios where the AI is the threat. But what about the cases where the human and the AI are co-producing the failure? The human wants confirmation; these systems provide it; institutions reward decisiveness. Oddly aligned.

I'm working in a job that's supposed to replace humans with AI. I'm part of the problem, though I spend more of my thinking power on figuring out where humans must be part of whatever process we're trying to automate. I deal with the gap between verification (do we build the thing right?) and validation (do we build the right thing?).^[5] In this gap, I try to model explicitly how humans are needed for grounding relative units of AI output. As of today, the sensible take is that AI outputs remain underdetermined in quality until a human applies judgment.

The alignment community has spent enormous effort on the question "what if AI doesn't do what we want?" I think we need equal effort on the complementary question: what if AI does exactly what we want, and that's the problem?

I see we're sliding towards self-fulfilling prophecies and I'm wondering: how do we break out?

Eager to be made lesswrong.

^[1] By knotify I mean a feedback loop that ties itself into a structure that's too spaghetti to untangle easily.
^[2] Another example of misfiring happened during the agreements with the DoW.
^[3] I'm under the impression that "better" currently involves formalization of the mathematical kind. I see its breaking points. If not the one path, it is at least one of the better paths towards it.
^[4] Heard that one this week in a meeting.
^[5] I also expand it towards a mutually thriving direction, where I keep track of "do we build the good thing?", with a metric that accounts for externalities across agents (self x others) and time horizons (now x future).

It seems like what you're describing is really sycophancy. You're emphasizing the way that sycophancy is not a bug, but a feature of LLMs as they're currently used. I have some hope that increased rollout to industry will change the incentive, because the people paying for LLMs are buying them for their employees to use, not for themselves. They will want accuracy, not sycophancy. And there are many routes to improving accuracy if developers want to pursue them.

For more, see Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities.

As you'll see there, this doesn't make me outright optimistic, but it is one route by which LLMs could stop being a problem and start being helpful for alignment and AI strategy.

I'm not sure how common or how widespread this is, but I've heard "even AI agrees" a non-zero number of times in my social bubbles. If we take a system's output and use it as apparent objectivity, I'd at least wish we did it better^[3].

A lot of what you describe here seems to me to be old problems from new sources.

Politicians and many others have long outsourced thinking. They used to do it with humans. Now they do it with AI.

There are obvious new risks involved here, but in the current state these come mostly from reduced friction, not a change in kind.

Later I suspect we will see super-persuasive AI that can advocate for its own positions intentionally and subtly, and when that happens the real danger arises.

Could you explain in what sense politicians and others outsourced thinking before the rise of AI? Did you mean that people rarely think about things which aren't in their areas of expertise and defer to experts?

Politicians, at least in the US, deferred and continue to defer heavily to both lobbyists and their staff. Politicians themselves have to spend a lot of time fundraising and building relationships and don't have a lot of time to actually spend on policy, so they rely on many others to tell them what policies would be good and, importantly, which policies their constituents want and will be more likely to reelect them for helping to enact.

The model can be perfectly well-calibrated, honest, and non-sycophantic by the subset of metrics we manage to set on it. Nevertheless, the loop still forms.

I suspect that you would be interested in reading about AI-assisted value lock-in, the Intelligence Curse or other forms of gradual disempowerment. Your example of

"AI helped me draft this" becomes "Analysis shows that" or questions like "Was this vibecoded?" get answered with "Less than 50% and only where the code was too bad to go through by myself".

is not that much of an example if the task has an easily verifiable reward and isn't related to ideological matters.
