Review

Lists cut from our main post, in a token gesture toward readability.

We list past reviews of alignment work, ideas which seem to be dead, the cool but neglected neuroscience / biology approach, various orgs which don't seem to have any agenda, and a bunch of things which don't fit elsewhere.

 

Appendix: Prior enumerations

Appendix: Graveyard

Appendix: Biology for AI alignment

Lots of agendas, but it's not clear if anyone besides Byrnes and Thiergart is actively turning the crank. Seems like it would need a billion dollars.
 

Human enhancement 

  • One-sentence summary: maybe we can give people new sensory modalities, or much higher bandwidth for conceptual information, or much better idea generation, or direct interface with DL systems, or direct interface with sensors, or transfer learning, and maybe this would help. The old superbaby dream goes here I suppose.
  • Theory of change: maybe this makes us better at alignment research

Merging 

  • One-sentence summary: maybe we can form networked societies of DL systems and brains
  • Theory of change: maybe this lets us preserve some human values through bargaining or voting or weird politics.
  • Cyborgism, Millidge, Dupuis

As alignment aid 

  • One-sentence summary: maybe we can get really high-quality alignment labels from brain data, maybe we can steer models by training humans to do activation engineering fast and intuitively, maybe we can crack the true human reward function / social instincts and maybe adapt some of them for AGI.
  • Theory of change: as you’d guess
  • Some names: Byrnes, Cvitkovic, Foresight’s BCI. Also (list from Byrnes): Eli Sennesh, Adam Safron, Seth Herd, Nathan Helm-Burger, Jon Garcia, Patrick Butlin


Appendix: Research support orgs

One slightly confusing class of org is described by the sample {CAIF, FLI}. Often run by active researchers with serious alignment experience, but usually not following an obvious agenda, delegating a basket of strategies to grantees, doing field-building stuff like NeurIPS workshops and summer schools.
 

CAIF 

  • One-sentence summary: support researchers making differential progress in cooperative AI (eg precommitment mechanisms that can’t be used to make threats)
  • Some names: Lewis Hammond
  • Estimated # FTEs: 
  • Some outputs in 2023: NeurIPS contest, summer school
  • Funded by: Polaris Ventures
  • Critiques:
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: £2,423,943
     

AISC

  • One-sentence summary: entrypoint for new researchers to test fit and meet collaborators. More recently focussed on a capabilities pause. Still going!
  • Some names: Remmelt Ellen, Linda Linsefors
  • Estimated # FTEs: 2 
  • Some outputs in 2023: tag
  • Funded by: ?
  • Critiques: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ~$200,000

 

 

See also:

Appendix: Meta, mysteries, more

4 comments

Honestly this isn't that long, I might say to re-merge it with the main post. Normally I'm a huge proponent of breaking posts up smaller, but yours is literally trying to be an index, so breaking a piece off makes it harder to use.

yeah you're right

For what it’s worth, I am not doing (and have never done) any research remotely similar to your text “maybe we can get really high-quality alignment labels from brain data, maybe we can steer models by training humans to do activation engineering fast and intuitively”.

I have a concise and self-contained summary of my main research project here (Section 2).

I care a lot! Will probably make a section for this in the main post under "Getting the model to learn what we want", thanks for the correction.