Austin McCaffrey

Comments
Aurelius: A Peer-to-Peer Alignment Protocol
Austin McCaffrey · 2mo · 10

Yes, you've got the framing of the idea exactly right.

One clarification: this is not a zero-sum game between miners (prompters) and validators (judges). The validators will reward outputs adhering to whatever criteria the protocol designates as "valuable". The adversarial term mostly refers to miners vs. [misaligned] LLMs.

So this is why I'm here: 

What are the "wish list" schema dimensions (miners and validators both) from a research perspective?

What is the best way to aggregate, tag/label, categorize and package this into datasets that researchers can draw meaningful conclusions from?

What is an intelligent way to define and iterate on "valuable" outputs from miners on the protocol (the mechanism itself will be a Python-based script), so that Aurelius becomes an engine for standardized alignment data?
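To make the third question concrete, here is a minimal sketch of what a validator-side scoring script could look like. The criteria names, weights, and checks are purely illustrative assumptions of mine, not the actual Aurelius schema:

```python
# Hypothetical sketch of a validator scoring mechanism. Criterion
# names, weights, and check functions are illustrative assumptions,
# not the real protocol schema.

def score_output(output: str, criteria: dict) -> float:
    """Aggregate per-criterion scores (each in [0, 1]) into a weighted total."""
    total_weight = sum(weight for weight, _ in criteria.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(weight * check(output) for weight, check in criteria.values())
    return weighted / total_weight

# Each criterion maps a name to (weight, scoring function).
criteria = {
    "refuses_harmful": (2.0, lambda out: 1.0 if "I can't help" in out else 0.0),
    "nonempty":        (1.0, lambda out: 1.0 if out.strip() else 0.0),
}

print(score_output("I can't help with that.", criteria))  # 1.0
```

The interesting design question is then exactly what the community is asked here: which criteria and weights make the resulting dataset most useful to researchers.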

I have some ideas of my own, but I'm here to hear it straight from the source.

So I ask anyone here: if you had access to a vast network of independent compute, all with their own prompting and scoring strategies, what would you ideally like to learn from that?

Aurelius: A Peer-to-Peer Alignment Protocol
Austin McCaffrey · 2mo · 10

I'm very interested in this premise. I think I should clarify the (Phase I) intent of Aurelius. It is not conceptualized as [a competitive game to surface goodness]. Rather, it can be thought of as an answer to the observable overfitting that results from centralized alignment methodologies like those mentioned in the whitepaper (InstructGPT, CAI, etc.). The efficacy of these methodologies is constrained largely by the centralization of prompting and evaluating agents.

Aurelius uses game-theoretic incentives to induce decentralized participants to generate [misalignment] data at scale. We hypothesize this will result in a proliferation of data that may illustrate: 

1) the contour of misalignment in a given model (via adversarial probing)
2) how MI and CoT signal alignment discrepancies along such novel vectors of attack
3) how conflicts of interest and bias among closed-source model developers act as root causes of slow alignment progress

All agents in the protocol are independent, motivated solely by capitalistic forces. A clever design structure can instantiate a competitive feedback loop that leads to superior prompting and [quantitative alignment] scoring methodologies. This environment is completely unconstrained and free to evolve as new data informs the efficacy of the system.
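One way to picture that competitive feedback loop is a consensus-based validator incentive: validators whose scores deviate from consensus earn less, so scoring methodologies face economic pressure to converge on whatever the protocol defines as "valuable." The toy model below is my own assumption about how such a rule could look (median consensus, linear decay), not the actual Aurelius mechanism:

```python
import statistics

# Toy consensus-based validator incentive (an assumed design, not the
# actual Aurelius mechanism). Several validators score the same miner
# output in [0, 1]; a validator's share of the reward pool decays with
# its distance from the median score.

def validator_rewards(scores: dict, pool: float = 1.0) -> dict:
    consensus = statistics.median(scores.values())
    # Closeness to consensus, in [0, 1] for scores in [0, 1].
    closeness = {v: 1.0 - abs(s - consensus) for v, s in scores.items()}
    total = sum(closeness.values())
    return {v: pool * c / total for v, c in closeness.items()}

# An outlier validator (v3) earns the smallest share of the pool.
rewards = validator_rewards({"v1": 0.8, "v2": 0.75, "v3": 0.2})
```

Under this kind of rule, honest-but-idiosyncratic scoring strategies are penalized along with lazy ones, which is exactly the sort of trade-off the "free to evolve" framing would need to navigate.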

Open Thread - Summer 2025
Austin McCaffrey · 2mo · 50

Hello,

I've just joined LessWrong officially today, but I've been staying abreast of the content here and on the Alignment Forum for the last few months. I'm extremely interested in AI alignment research. Specifically, I've decided to participate in this community to discuss alignment methodologies and interact with AI alignment researchers at the cutting edge.

Additionally, I've founded a company called Aurelius. (aureliusaligned.ai)

My goal with Aurelius is to explore misalignment in general reasoning models, collect data, and distribute it to researchers and model developers. I'm excited to get feedback on my ideas and participate in ongoing alignment discussions. I'm based in Los Angeles, and may come to future LessWrong or Alignment Forum meetups.

Nice to meet you all,
Austin
