Recently I've been thinking about misaligned chatbot advertising incentives. I glanced at arXiv and found "Sponsored Questions and How to Auction Them". Another search gave me "Incomplete Contracting and AI Alignment".
Interesting! I thought. I gave them to Liz Lemma, my research assistant, and told her that I'd been thinking about the principal-agent problem in a chatbot context. About 30 minutes later she gave me the following four papers:
Each is a complete paper, well founded, well reasoned — not perfect, maybe, but I wouldn't call it slop, either. Let's peek inside of "Query Steering". The core formula is "The Steering Threshold":
ΔV(μ) ≤ wΔB
Where:
ΔV(μ) = V_u(μ; q↓) − V_u(μ; q↑) is the user’s value gap between the more-informative query (q↓) and the more-monetizable (but less informative) query (q↑).
ΔB = B(q↑) − B(q↓) > 0 is the monetization gap.
w ≥ 0 is “how much the system cares about monetization.”
The clean “steering region” characterization is:
0 < ΔV(μ) < wΔB
“Steering happens exactly when the user loss is small enough that monetization incentives can overpower it.”
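To make the condition concrete, here is a toy check of the steering region with made-up numbers; the function and the values are mine, not the paper's:

```python
# Toy check of the "steering region" 0 < ΔV(μ) < w·ΔB, with made-up numbers.
# Nothing here comes from the paper; it just restates the inequality in code.

def steering_occurs(delta_v: float, delta_b: float, w: float) -> bool:
    """True when the user's value loss is positive but smaller than the
    monetization incentive w * ΔB, i.e. the system is tempted to steer."""
    return 0 < delta_v < w * delta_b

# Small user loss, big monetization gap: the inequality holds, steering pays.
print(steering_occurs(delta_v=0.2, delta_b=1.0, w=0.5))  # True
# Large user loss: the user's value gap dominates, no steering.
print(steering_occurs(delta_v=0.8, delta_b=1.0, w=0.5))  # False
```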
Is that true, or useful? You'd have to read the paper and find out!
Putting the "Search" in "Research"
Liz Lemma, you may have guessed, is an automated research assistant. She reads papers and spawns reasonable children. The whole thing can be thought of as a search over the space of adjacent plausible papers. Here's what it looks like when Liz gets to work:
Each original node is the average text embedding for her sources; the sources spawn children, the generated papers.
Where does Liz get her insights? It depends on how you see context in large language models. Maybe she's interpolating between now, the time of her writing, and then, when the source papers were written, and finding something interesting in the space between. Maybe she's matching concepts from the whole-internet corpus of her training to the decidedly more niche papers she takes inspiration from.
Regardless, what you ultimately find is a graph: for each node in the network composed of the source material (the concatenated text embeddings), you have new connections. The semantic space proximate and accessible and plausible to a language model has been densified.
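If you wanted to sketch that loop yourself, it might look something like this; embed and spawn_children are stand-ins for a real embedding model and an LLM call, and none of it is Liz's actual code:

```python
# A minimal sketch of the search-over-adjacent-papers loop: embed each source,
# spawn generated children from it, and keep the result as nodes plus edges.
# embed() and spawn_children() are placeholders, not a real pipeline.
import hashlib

def embed(text: str) -> list[float]:
    # Stand-in embedding: hash bytes scaled to [0, 1). Swap in a real model.
    return [b / 255 for b in hashlib.sha256(text.encode()).digest()[:8]]

def spawn_children(source: str, prompt: str, n: int = 3) -> list[str]:
    # Stand-in for an LLM call that drafts papers adjacent to `source`.
    return [f"Generated paper {i + 1} on {prompt}, responding to: {source}"
            for i in range(n)]

def grow_graph(sources: list[str], prompt: str):
    nodes, edges = [], []
    for src in sources:
        nodes.append({"title": src, "vec": embed(src), "kind": "source"})
        for child in spawn_children(src, prompt):
            nodes.append({"title": child, "vec": embed(child), "kind": "generated"})
            edges.append((src, child))  # a new connection in the semantic graph
    return nodes, edges

nodes, edges = grow_graph(
    ["Sponsored Questions and How to Auction Them",
     "Incomplete Contracting and AI Alignment"],
    prompt="the principal-agent problem in a chatbot context",
)
print(len(nodes), "nodes,", len(edges), "edges")  # 8 nodes, 6 edges
```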
Alignment Highlights
Misaligned chatbot ad incentives investigated, I turned towards more traditional alignment research and let Liz loose. I've included all the sources and results in the appendix below, but some highlights:
Vector-Lagrangian Safe RLHF. In response to "Constitutional AI" and "Safe RLHF", Liz proposes "a multi-constraint formulation where each harm category is a separate constraint with its own endogenously determined shadow price λ_i" and "develop[s] a clean convex policy model that yields closed-form KKT characterizations and interprets λ as category-specific ‘risk budget prices’." (A guess at the shape of that objective follows this list.)
The Reference Conditioning Trap "provide[s] an interpretable diagnostic for when alignment investment must shift from ‘more DPO’ to improving the reference, refreshing the dataset on-policy, or redesigning objectives."
Debate-as-Compliance "model[s] debate as generating an endogenous suspicion signal (abort/disagreement) that triggers expert audit" and "show[s] how penalty caps and audit costs shape optimal audit intensity, and how improvements in auditability/stability (e.g., better logging, more locally checkable traces) substitute for random auditing".
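For the first of these, here is my guess at the shape of a per-category constrained objective and its Lagrangian; the notation is mine and may not match the paper's:

```latex
% A sketch of a multi-constraint RLHF objective with one shadow price per
% harm category; my notation, not necessarily the paper's.
\max_{\pi} \ \mathbb{E}_{x,\, y \sim \pi}\big[ r(x, y) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{x,\, y \sim \pi}\big[ c_i(x, y) \big] \le b_i, \qquad i = 1, \dots, k

% Lagrangian: each harm category i gets its own multiplier \lambda_i \ge 0,
% which at the optimum acts as that category's "risk budget price".
L(\pi, \lambda) = \mathbb{E}\big[ r(x, y) \big]
  - \sum_{i=1}^{k} \lambda_i \Big( \mathbb{E}\big[ c_i(x, y) \big] - b_i \Big)
```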
All sounds great to me. Alignment solved!
Not quite. These papers are not perfect. There are assumptions that don't hold up, ideas that don't quite translate, conclusions that don't follow. They are, in short, flawed, and I don't offer them as gospel truth. Instead, I propose that they have some expected value > 0 towards work that might not be done otherwise.
The challenging part is "what's next": how can we glean the signal from the noise of lightning-fast paper generation? The pipeline might be:
Generated paper -> voters -> selected winners
Winner -> annotator -> revisions -> approval by committee
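Here is a rough sketch of that pipeline as code; the stage names and data shapes are mine, and the voters, annotator, and committee could equally be humans or models:

```python
# A hedged sketch of the two-stage filter above. Voters, annotators, and
# committee members are plain callables here; none of this is an existing
# system's API.
from dataclasses import dataclass, field

@dataclass
class Paper:
    title: str
    text: str
    revisions: list[str] = field(default_factory=list)

def select_winners(papers: list[Paper], voters: list, top_k: int = 2) -> list[Paper]:
    # Stage 1: every voter scores every paper; keep the top_k by total score.
    return sorted(papers,
                  key=lambda p: sum(vote(p) for vote in voters),
                  reverse=True)[:top_k]

def refine(paper: Paper, annotator, committee: list, max_rounds: int = 3) -> Paper | None:
    # Stage 2: the annotator proposes revisions until the whole committee
    # approves, or we give up after max_rounds.
    for _ in range(max_rounds):
        paper.revisions.append(annotator(paper))
        if all(approve(paper) for approve in committee):
            return paper
    return None
```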
Maybe those voters, those annotators, and that committee are human; maybe they're synthetic. At this point, teetering on the edge of generally-correct AI, we still probably need (or want) human intervention in the process.
There's refinement to be done. But it seems plausible that the space of unaddressed alignment questions can shrink, in tandem with capabilities expansion.
Appendix: Sources and Results
(You can see all papers, with automated review scores, at lizlemma.com.)
In response to The Alignment Problem From A Deep Learning Perspective:
In response to Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals and Preference Learning for AI Alignment: a Causal Perspective:
In response to Constitutional AI: Harmlessness from AI Feedback and Safe RLHF: Safe Reinforcement Learning from Human Feedback:
In response to Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Preference Learning for AI Alignment: a Causal Perspective, and Scalable AI Safety via Doubly-Efficient Debate:
In response to Scalable AI Safety via Doubly-Efficient Debate:
In response to Preference Learning Algorithms Do Not Learn Preference Rankings and On Scalable Oversight with Weak LLMs Judging Strong LLMs: