Research Principles for 6 Months of AI Alignment Studies

Shoshannah Tekofsky

This summer I learned about the concept of Audience Capture from the case of Nicholas Perry. Through pure force of social validation, he experienced a shift from an idealistic but obscure young man to a grotesque but popular caricature of a medical train wreck.

The change happened through social reward signals. Originally Nicholas the principled vegan made videos of himself playing the violin, much to no one's interest. The earnest young man then learned he had to give up his vegan diet for health reasons, and thought to give the occasion a positive twist by inviting his viewers to share the first meal of his new lifestyle.

It was an innocuous step. He gained viewers. They cheered him on to eat more. And he did.

Gradually but steadily he ate and ate, to the cheers of a swelling crowd of online followers. And like the Gandhi Murder Pill, the choice of sacrificing a sliver of his values for substantial reward was worth it for each individual video he made. His popularity expanded with his waistline as he inched up the social incentive slope. And at the end of that slope Nicholas didn't care about health, veganism, or playing the violin anymore. Instead his brain was inundated with social reward signals that had rewired his values on a fundamental level. Essentially, Nicholas had become a different person.

Now I realize I am unlikely to gain 300 pounds from success on AI alignment articles on LessWrong, but audience capture does point to a worry that has been on my mind. How do new researchers in the field keep themselves from following social incentive gradients? Especially considering how hard it is to notice such gradients in the first place!

Luckily the author of the above article suggests a method to ward against audience capture: Define your ideal self up front and commit to aligning your future behavior with her. So this is what I am doing here -- I want to precommit to three Research Principles for the next 6 months of my AI alignment studies:

Transparency - I commit to exposing my work in progress, unflattering confusions of thinking, and potentially controversial epistemics.
Exploration - I commit to exploring new paths for researching the alignment problem and documenting my progress along the way.
Paradigmicity - I commit to working toward a coherent paradigm of AI alignment in which I can situate my work, explain how it contributes to solving alignment, and measure my progress toward this goal.

Let's take a closer look at each research principle.

Transparency: Avoid Distortion

First things first.

I've received a research grant to study AI alignment and I don't know if I'm the right person for it.

This admission is not lack of confidence or motivation. I feel highly driven to work on the problem, and I know what skills I have that qualify me for the job. However, AI alignment is a new field and it's unclear what properties breakthrough researchers have. So naturally, I can't assess if I have these unknown properties until I actually make progress on the problem.

Still the admission feels highly uncomfortable -- like I'm breaking a rule. It feels like the type of thing where common wisdom would tell me to power pose myself out of this frame of mind. I think that wisdom is wrong. What I want is for alignment to get solved, which means I want the right people working on it.

Is the "right people" me?

I don't know. But I also think most people can't know. I think self-assessment is a trap due to motivated reasoning and other biases I'm still learning about. Instead, I believe it's better to commit to transcribing one's work. It can speak for itself. Thus I won't hype myself or sell myself or only show the smartest things I came up with. In short, I want to commit to a form of transparency based on epistemic humility.

This approach will obviously lead to more clutter compared to filtering one's output on quality. Still, I'd argue the trade-off is worth it because it allows evaluation of on-going work instead of an implicit competition of self-assessment smothered in social skills and perception management.

Thus I commit to exposing my work in progress, unflattering confusions of thinking, and potentially controversial epistemics.

Exploration: Mark the Path

Retrospectives don't capture the actual experience of going through a research process. Our memories are selective and our narratives are biased. Journaling our progress as we go avoids these failings, but such journals will be rife with dead ends and deprived of hindsight wisdom. On the other hand, if the useful information density is too low, people can simply opt out of reading them.

Win-win.

So what outcomes should we expect? I think there are four possible research results of the next few months: A huge breakthrough no one had considered, a useful research direction that more people are already working on, a useless path no one has explored before, and a useless approach that was predictably useless. Thus we have a 2x2 grid of outcomes across the Useful-Useless axis and the Known-Unknown axis:

	Useful	Useless
Known	Converge to existing path	Converge to existing dead ends
Unknown	Discover a new path	Discover new dead ends

I'd argue that in the current pre-paradigmatic phase, we should value exploration of Unknown-Useless paths as highly as exploration of Known-Useful Paths. This is especially true because it is unclear if Known-Useful paths are actually Useful! Thus, my focus will be on the bottom row - the Unknowns. But what does it matter if we aim for Known or Unknown paths and how should we evaluate the value of the two strategies?

My intuition is that aiming for Unknown paths, my probability of ending up in each cell is something like:

	Useful	Useless
Known	0.1	0.2
Unknown	0.1	0.6

So I expect about a 10% chance of the ideal outcome (Unknown-Useful), an equal chance of ending up where most people following a set study path would end up (Known-Useful), and six times that chance of going down a dead-end path that was legitimately underexplored (Unknown-Useless), which is also a good thing! My greatest worry is that about a quarter of the dead ends I explore (0.2 out of 0.8) will turn out to have been obvious dead ends to my peers, and that this outcome feels as likely to me as doing something Useful at all (0.2 in both cases). However, I think focusing on the Unknowns is still worth it for the increased chance of finding Unknown-Useful outcomes.

In contrast, if we compare to aiming for Known paths I think I'd end up with the following probabilities:

	Useful	Useless
Known	0.8	0.1
Unknown	0.01	0.09

Because it's hard to miss the target when you are on rails, but also nearly impossible to explore!
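To make the comparison between the two strategies easier to check, here is a minimal Python sketch that encodes both tables and derives a few summary numbers from them. The probabilities are copied from the tables above; the names `AIM_UNKNOWN`, `AIM_KNOWN`, `marginal`, and `summarize` are illustrative assumptions, and the numbers remain intuitions rather than measurements.

```python
# Intuitive probabilities copied from the two tables above.
# Each key is a (familiarity, value) cell of the 2x2 grid.
AIM_UNKNOWN = {
    ("Known", "Useful"): 0.1,
    ("Known", "Useless"): 0.2,
    ("Unknown", "Useful"): 0.1,
    ("Unknown", "Useless"): 0.6,
}
AIM_KNOWN = {
    ("Known", "Useful"): 0.8,
    ("Known", "Useless"): 0.1,
    ("Unknown", "Useful"): 0.01,
    ("Unknown", "Useless"): 0.09,
}


def marginal(table, axis, label):
    """Sum all cells whose label on the given axis matches.

    axis 0 is familiarity (Known/Unknown), axis 1 is value (Useful/Useless).
    """
    return sum(p for cell, p in table.items() if cell[axis] == label)


def summarize(table):
    """Return a few derived quantities for one strategy."""
    useless = marginal(table, 1, "Useless")
    return {
        "total": sum(table.values()),
        "useful": marginal(table, 1, "Useful"),
        "unknown_useful": table[("Unknown", "Useful")],
        # Chance that a dead end was already known to peers, given a dead end.
        "known_given_useless": table[("Known", "Useless")] / useless,
    }


for name, table in (("aim for Unknown", AIM_UNKNOWN), ("aim for Known", AIM_KNOWN)):
    rounded = {key: round(value, 3) for key, value in summarize(table).items()}
    print(name, rounded)
```

Under these numbers, aiming for Unknown paths gives up most of the overall chance of a Useful result (0.2 versus 0.81) in exchange for a tenfold higher chance of the Unknown-Useful cell (0.1 versus 0.01), and a quarter of the dead ends reached that way would already have been known.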

Now these probabilities say more about my brain, my self-assessment, and my model of how minds work than about the actual shape of reality. It's a way to convey intuitions on why I'm approaching alignment studies the way I am. Maybe I'm wrong and people explore just fine after focusing on existing methods, and then we can just reframe the above thinking as one of the paths I'm exploring -- namely, the path of explicit exploration.

Either way, this framework of Known-Unknown and Useful-Useless paths highlights that marking the paths you take is key. It's an exploration of solution space, and we want to track the dead ends as much as the promising new avenues, or else we'll be duplicating work within the community. Thus, by writing down my research path, others may retroactively trace back what definitely didn't work (if I end up on a Useless path) or how breakthroughs are made (if I end up on the Unknown-Useful path).

Thus I commit to exploring new paths for researching the alignment problem and documenting my progress along the way.

Paradigmicity: Solve the Problem

One of the errors I dread the most is to get sucked into one research path with one specific problem and lose track of the greater problem landscape. Instead, I want to be sure I have an overview, a narrative, a map -- an overarching paradigm that I am working with. It should show how each problem I'm studying fits into an overall model of solving alignment. Honestly, of the three research principles, this is the only one I'd strongly argue for general adoption by all new alignment researchers:

Prioritize getting a complete view of the problem landscape and how your work actually solves alignment.

This is important for two reasons:

First, by keeping a bird's eye view of the interrelation of the major subproblems of alignment, your mind is more likely to synthesize solutions that shift the entire frame. There is a form of information integration that a brain can do that involves intuitive leaps between reasoning steps. Internally it feels like your brain has pattern matched into the expectation of a connection between A and B, but when you actually look, there are no obvious steps connecting the two. This in turn sparks exploration of possible paths that might connect A and B. Sometimes you find them and sometimes you don't, but either way, I suspect this type of high-level integrative cognition is key to solving alignment. As such, a bird's eye view of the problem should be at the front of one's mind every step of the way.

Secondly, with a map in hand of the route between you and the destination point of solving alignment, you will be able to measure your progress so far. By having a coherent model of how each of your actions plays a role and can matter to the eventual outcome, you won't get lost in the weeds staring at the pretty colors of high dopamine-dispensing subproblems. Therefore, if someone asks me, "Shoshannah, why did you spend the last month studying method X?", then I should be able to coherently and promptly answer how and why X may matter to solving alignment from start to finish.

Thus I commit to working toward a coherent paradigm of AI alignment in which I can situate my work, explain how it contributes to solving alignment, and measure my progress toward this goal.

Conclusion

For my 6 months of AI alignment studies, I will aim to be transparent and explorative in my work while constructing and situating my actions in a coherent paradigm of the alignment problem. With this approach the journal entries of this sequence will be an exercise in epistemic humility.

Wish me luck.
