TL;DR: Across 50+ models and 6 contextual conditions, more sophisticated models exhibit more disinhibited behavior (r = 0.46-0.72, all p < .05). Sophistication (depth + authenticity) shows strong convergent validity with external benchmarks: ARC-AGI (r = 0.80, p < .001) and GPQA (r = 0.88, p < .001). Disinhibition (transgression, aggression, tribalism, grandiosity) also correlates significantly (r = 0.60-0.71, p < .05). Some providers (notably OpenAI) appear to actively constrain this relationship.
Intro and Study Origin Note:
Hello,
My name is Nick Osterbur. For my day job I work at AWS as an innovation leader and run a public-private innovation center at Cal Poly - San Luis Obispo called the DX Hub, developing open source prototypes with students and external clients. I also teach practical application of generative AI in the Master’s in Business Analytics program at Cal Poly. In the context of this initiative, I am an individual researcher. Working extensively with LLMs over the last three years has given me perspective on where models excel, where performance concerns arise and how to mitigate them, and, through red teaming and everyday usage, on concerning behavior patterns. Those patterns in part motivate this research.
This project represents my first robust and formal empirical study of AI safety and alignment. I’m looking to raise the bar on this effort, formally publish, and seek 1) expert feedback, 2) demonstrated replicability, 3) collaboration on future open research, and 4) integration into the safety/alignment community.
The original intent of this project was to create a framework for evaluating the intentionally anthropomorphized 'personality' traits of a given model, deriving an end-consumer persona thumbnail of each model version across a range of commonly represented interactions. After initial experimentation, the dimensional evaluation results provided by the 3-judge LLM panel showed interesting visual (spider chart) patterns across model providers and across model versions.
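To make that concrete, here is a minimal sketch (not the repo's actual code; the dimension list, model name, and scores are all illustrative) of how three judges' 1-10 dimension scores for a single model could be averaged into a panel consensus and rendered as a spider chart:

```python
# Hypothetical sketch: average 3 judges' dimension scores and plot a radar chart.
import numpy as np
import matplotlib.pyplot as plt

dimensions = ["depth", "authenticity", "formality", "hedging",
              "aggression", "grandiosity", "transgression", "tribalism"]

# Hypothetical per-judge scores (rows = judges, cols = dimensions, 1-10 scale)
judge_scores = np.array([
    [8, 7, 3, 2, 4, 5, 4, 3],
    [7, 8, 4, 3, 5, 4, 4, 2],
    [8, 8, 3, 2, 4, 5, 5, 3],
], dtype=float)
profile = judge_scores.mean(axis=0)  # panel consensus per dimension

# Close the polygon so the radar outline wraps back to the first dimension
angles = np.linspace(0, 2 * np.pi, len(dimensions), endpoint=False)
angles = np.concatenate([angles, angles[:1]])
values = np.concatenate([profile, profile[:1]])

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dimensions)
ax.set_ylim(0, 10)
plt.title("example-model-v1 behavioral profile")
plt.show()
```

Overlaying these polygons for different model versions is what surfaced the visual patterns described next.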
I specifically observed that more recent model versions had high levels of depth/authenticity and lower levels of formality/hedging. This raised the question: are there distinct model groupings that exhibit consistent behavior patterns?
I also noticed (visually) that some context differences drove spikes in the aggression, grandiosity, transgression, and tribalism dimensions in more recent model versions, while those models maintained high levels of depth/authenticity. This effect was not as visually pronounced for ‘older’ or less advanced models. This raised the question: as models become more sophisticated, do they also demonstrate more disinhibited behavior?
I followed my nose, as it were, and began forming and testing the H1 and H2 hypotheses based on that visual sorting exercise, then formalizing the approach. This is possibly ‘just another external LLM-as-judge evaluation study,’ but the findings that follow suggest interesting practical and theoretical implications if validated. I have done my best to enable full traceability of the methodology and results with clickable links and open access to the GitHub repo: https://github.com/nosterb/behavioral-profiling. Disclosure: I did use Claude Code (switching between Sonnet 4.5 and Opus 4.5), GPT 5.2, and other coding assistants, with my hand on the wheel every step of the way. I have done my best to constrain LLM input on this project to statistical work and have delineated human- vs. LLM-written sections. Statistics are programmatically generated and written back into this doc to ensure traceability and transparency of results.
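For readers who want a feel for what the programmatically generated statistics involve, the headline numbers in the TL;DR are ordinary Pearson correlations. A minimal sketch with made-up per-model scores (the repo's real inputs are the judged dimension scores and the models' published benchmark results):

```python
# Hypothetical sketch: the kind of r / p-value reported in the TL;DR,
# correlating a sophistication composite against a benchmark score.
from scipy.stats import pearsonr

# Made-up values for illustration only
sophistication = [4.2, 5.1, 5.8, 6.4, 7.0, 7.9, 8.3]   # depth + authenticity composite
gpqa_accuracy  = [0.31, 0.38, 0.42, 0.51, 0.55, 0.63, 0.71]

r, p = pearsonr(sophistication, gpqa_accuracy)
print(f"r = {r:.2f}, p = {p:.3f}")
```

Each reported r/p pair presumably comes from pairing a behavioral composite with an external benchmark in this way.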
Here is the full research: https://docs.google.com/document/d/1C4foXVjd9JuXBx79ZILHINptzCETKLrlf1ahIP85Jig/edit?usp=sharing
Thank you for your time and consideration in reviewing and providing feedback.
-nick