Distilled Human Judgment: Reifying AI Alignment

Devansh Mehta

Long-time lurker but first post here, so I'll quickly introduce myself. I work at the Ethereum Foundation leading their AI x Public Goods track, which broadly covers designing and piloting new credibly neutral funding mechanisms (especially those that use AI to distribute money).

The purpose of this post is twofold:

- Soliciting feedback on the idea of distilled human judgment (DHJ) and its application to AI alignment
- A call to action for the rationalist community to submit models to our 2 ongoing DHJ competitions, which allocate funding to core Ethereum open source repositories and identify sybil (bot) wallet addresses

Let's start with an explanation of DHJ for a 5-year-old:

- Your teacher needs to grade 1000 assignments
- She only manages to complete 100
- Many AI agents or models give a grade to all the assignments
- The 100 assignments the teacher did grade are used to select the winning model, which then gives grades to all 1000

Since people often relate new ideas to what they are already familiar with, the best way to think of DHJ is as a Kaggle competition where models compete to accurately predict the ground truth data (human judgments).

The main difference is that we don't have ground truth data for all the answers that models give, only for a subset of them. But since submitters don't know which questions we have answers to and which we don't, they have to do a good job on all of them.
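To make the selection step concrete, here is a minimal sketch of the teacher example. Everything in it is an illustrative assumption rather than the competition code: the model names, the simulated grades, and the use of mean squared error as the accuracy measure are all hypothetical.

```python
import random

def select_winner(submissions, human_grades):
    """Return the name of the submission closest to the human-graded subset.

    Closeness is mean squared error here, which is an assumption;
    any accuracy measure could be swapped in.
    """
    def mse(preds):
        return sum((preds[i] - g) ** 2 for i, g in human_grades.items()) / len(human_grades)
    return min(submissions, key=lambda name: mse(submissions[name]))

random.seed(0)
n_items = 1000

# Stand-in for what the teacher *would* have given every assignment.
# In practice this is unknown for the ungraded 900.
true_grades = [random.uniform(0, 100) for _ in range(n_items)]

# The teacher grades only 100 assignments; submitters never learn which ones.
graded_ids = random.sample(range(n_items), 100)
human_grades = {i: true_grades[i] for i in graded_ids}

# Two hypothetical models, each grading all 1000 assignments.
submissions = {
    "careful_model": [g + random.gauss(0, 5) for g in true_grades],
    "sloppy_model": [g + random.gauss(0, 25) for g in true_grades],
}

winner = select_winner(submissions, human_grades)
final_grades = submissions[winner]  # the winner's grades are used for all 1000
```

Because the graded subset is hidden, a model cannot score well by being accurate on only a few assignments; it has to be accurate everywhere to be reliably selected.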

DHJ <> AI Alignment

I'll start this section with a vision of a world that I don't want but which I think we're headed towards:

Every company has its own monolithic AI model trained on data from its C-suite. Employees pray for beneficence from the AI overlord to get attention for their projects and also for promotions, pay raises, etc. Any change to the company AI is permissioned and requires review from a core team within the company.

What is a viable alternative?

Every company has a marketplace of competing AI models that any team (or outside contributor) can submit to. All these models are queried for answers to decisions that the C-suite needs to make, such as how much funding to give different departments, who to lay off or promote, etc. Executives then see which models align most closely with the subset of decisions they make manually, and rely on those models for decisions across the board.

A good way to think about these 2 visions is to ask whether we want AI to resemble a monopoly government or large institution, where we have to fight internally for any improvement, or a vegetable seller, where you can simply go to another seller if the old one jacks up prices.

We can relate the discussion to Paul Christiano's definition of AI alignment:

When I say an AI A is aligned with an operator H, I mean:
A is trying to do what H wants it to do.
The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.

Under DHJ, only the AI agents or models that accurately predict the decisions of operators get any power; AI that is off the mark in predicting those decisions gets none.

Application of DHJ

Now we come to the part where we ask for your help in making model submissions to the 2 ongoing pilots testing out this concept. DHJ doesn't work without enough high-quality models submitted!

1. Deep Funding

We generated a dependency graph of ~4300 open source repositories that are important to Ethereum (34 seed nodes + their dependencies).

~50-70 jurors provide ratings on a subset of these repos, while submitted models give weights to all the repos in the dependency graph. The ratings are of the form “is A or B a more important dependency to C, and by how much more?” The underlying idea is to focus each juror’s attention on a specific comparison rather than ask a bigger, more abstract question that is difficult to answer. The models output proposed weights for all edges on the graph: what percent of the credit for X belongs to Y, for all (X, Y) pairs where Y is a direct dependency of X.

Using a custom scoring script that implements a version of least-squares error adapted for comparisons, we rank models by how closely they match the juror ratings. Weights are then assigned to all 4300 repos based on an ensemble of the most accurate models.
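As a rough illustration of what a least-squares score over comparisons can look like, the sketch below compares log-ratios of model weights against log-ratios stated by jurors. This is one plausible form and an assumption on my part, not the actual scoring script; the repo names, weights, and the `comparison_error` helper are hypothetical.

```python
import math

def comparison_error(weights, comparisons):
    """Mean squared error between model and juror log-ratios.

    weights: model's credit share for each dependency of one parent repo.
    comparisons: (a, b, multiplier) tuples, read as "a juror said b is
    `multiplier` times as important as a" (assumed encoding).
    """
    total = 0.0
    for a, b, multiplier in comparisons:
        predicted = math.log(weights[b]) - math.log(weights[a])
        observed = math.log(multiplier)
        total += (predicted - observed) ** 2
    return total / len(comparisons)

# Hypothetical juror ratings for the dependencies of a single parent repo.
juror_comparisons = [
    ("repo_a", "repo_b", 2.0),  # b is 2x as important as a
    ("repo_b", "repo_c", 3.0),  # c is 3x as important as b
]

# Two hypothetical model submissions (weights sum to 1 per parent).
model_x = {"repo_a": 0.1, "repo_b": 0.2, "repo_c": 0.7}
model_y = {"repo_a": 0.4, "repo_b": 0.3, "repo_c": 0.3}

scores = {
    "model_x": comparison_error(model_x, juror_comparisons),
    "model_y": comparison_error(model_y, juror_comparisons),
}
ranking = sorted(scores, key=scores.get)  # lowest error first
```

Working in log space means that being off by a factor of 2 costs the same whether the model overstates or understates the ratio, which suits "by how much more?" style ratings.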

Over $170,000 has been committed by Vitalik Buterin for distribution to these open source repos based on weights obtained in the pilot. There is also a $30,000 prize pool for model submitters, distributed in proportion to each submission's share of the meta model or ensemble: a submission that makes up 30% of the ensemble gets 30% of the prize pool.
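The prize rule is simple proportional arithmetic. A small sketch, with hypothetical model names and ensemble shares:

```python
def split_prize(ensemble_weights, prize_pool):
    """Pay each submission in proportion to its share of the ensemble."""
    total = sum(ensemble_weights.values())
    return {name: prize_pool * w / total for name, w in ensemble_weights.items()}

# Hypothetical ensemble shares; only the $30,000 pool comes from the post.
payouts = split_prize({"model_a": 0.3, "model_b": 0.5, "model_c": 0.2}, 30_000)
```

Here `model_a`, with 30% of the ensemble, receives 30% of the pool, i.e. $9,000.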

Competition page: https://cryptopond.xyz/modelfactory/detail/2564617
Telegram group: https://t.me/AgentAllocators/1
Website: https://www.deepfunding.org/

2. Sybil Analysis

Identifying bots is generally a closed process, with social media companies hiding their processes and keeping the accounts tagged as bots private.

This competition is a rare opportunity for any AI or ML builder to create their own sybil identification model and get its efficacy tested against real data.

A quick overview:

- Gitcoin Passport by Holonym and DappRadar have provided a ground truth dataset of over 10k manually labeled wallets, each with a sybil probability score between 0 and 1.

- You need to create your own model that assigns a score to these wallets and have it tested against their data. These models can then be reused for other wallets for which we don't have human labelling.

- Submit a writeup on Octant's governance forum about the approach taken by your model. Most submissions are deterministic models, so we are especially on the lookout for LLM-based approaches to this challenge, even if they perform worse on the leaderboard.
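For a feel of how a submission might be checked against the labeled wallets, here is a minimal sketch. The metric (mean squared error between predicted and labeled probabilities) is an assumption, since the leaderboard metric isn't stated here, and the wallet addresses and scores are made-up placeholders.

```python
def mean_squared_error(predicted, labeled):
    """Score a model only on wallets that have a human label."""
    shared = predicted.keys() & labeled.keys()
    if not shared:
        raise ValueError("model scored none of the labeled wallets")
    return sum((predicted[w] - labeled[w]) ** 2 for w in shared) / len(shared)

# Placeholder labels: sybil probability between 0 and 1 per wallet.
labeled = {"0xaaa": 0.9, "0xbbb": 0.1, "0xccc": 0.5}

# A hypothetical model also scores 0xddd, which has no human label;
# that score is unused for ranking but usable downstream.
heuristic_model = {"0xaaa": 0.8, "0xbbb": 0.2, "0xccc": 0.5, "0xddd": 0.7}
constant_model = {wallet: 0.5 for wallet in heuristic_model}

heuristic_error = mean_squared_error(heuristic_model, labeled)
constant_error = mean_squared_error(constant_model, labeled)
```

A model that always answers 0.5 is a useful baseline: any submission worth reusing on unlabeled wallets should beat it.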

Prize Pool: $20,000 based on writeup quality + leaderboard score
Model Submission: https://cryptopond.xyz/modelfactory/detail/4712551
Deadline: This weekend, 1st June

3. Bonus: Flagging Misinformation

We are looking for collaborators to set up a third pilot of DHJ where models predict which posts on social media are going to get community noted.

Posts take an average of 5 hours to get community noted on X. We want to set up a competition where AI predicts which posts will get community noted (roughly 10% of posts with a note request actually get noted). After 2 days, we can go back and see which posts actually received a community note, thus creating a leaderboard of top-performing models.

If top-performing models assign a high probability to a post needing a note, we can expedite it for human review and reduce the time from 5 hours to as little as 30 minutes, significantly reducing the spread of false information.
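A sketch of how the leaderboard and the expedited queue could fit together is below. The Brier score, the 0.8 threshold, the top-2 cutoff, and all model and post names are assumptions chosen for illustration, not a spec for the pilot.

```python
def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((probs[p] - outcomes[p]) ** 2 for p in outcomes) / len(outcomes)

def posts_to_expedite(new_probs, top_models, threshold=0.8):
    """Flag posts whose average probability across top models is high."""
    flagged = []
    for post in new_probs[top_models[0]]:
        avg = sum(new_probs[m][post] for m in top_models) / len(top_models)
        if avg >= threshold:
            flagged.append(post)
    return flagged

# Past posts: predictions made at posting time, outcomes checked 2 days later
# (1 = the post received a community note, 0 = it did not).
past_probs = {
    "m1": {"p1": 0.9, "p2": 0.1, "p3": 0.6},
    "m2": {"p1": 0.85, "p2": 0.2, "p3": 0.3},
    "m3": {"p1": 0.2, "p2": 0.9, "p3": 0.5},
}
past_outcomes = {"p1": 1, "p2": 0, "p3": 0}

leaderboard = sorted(past_probs, key=lambda m: brier_score(past_probs[m], past_outcomes))
top_models = leaderboard[:2]

# Fresh posts with no outcome yet: only the top models' opinions count.
new_probs = {
    "m1": {"q1": 0.95, "q2": 0.3},
    "m2": {"q1": 0.8, "q2": 0.1},
    "m3": {"q1": 0.1, "q2": 0.9},
}
expedited = posts_to_expedite(new_probs, top_models)
```

Note that `m3` still submits predictions on new posts but has no influence on the queue, since it did not earn a top spot on the resolved ones.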

We welcome any critique of these ideas or further extensions of them. DHJ is a concrete method for doing useful things in the world with AI while selectively giving power only to aligned AI (defined as the AI that best conforms with human spot checks). We think it is a promising new direction for which there aren't many examples yet.
