This is, by far, the alignment approach I’m most optimistic about—more so than mechanistic interpretability, which feels too narrow to reliably constrain a sufficiently sophisticated actor.
I’ve been thinking about datasets where the reward function is explicitly coupled to the well-being of an external entity, not merely to semantic or linguistic correctness.
At this point, we aren’t really looking for systems that are better at language. If anything, we appear to be asymptoting on those benchmarks already.
What matters is that there are countless things an AI can say that are linguistically “correct” yet actively degrade well-being.
For instance, we could run small-scale experiments in which an LLM is tasked with sustaining the well-being of simulated or live organisms, with the goal of grounding its reward signal, however imperfectly, in the welfare of other entities it is continuously influencing.
In that framing, the model’s world model isn’t shaped as that of a researcher, optimizer, or abstract thinker, but as that of a caretaker. There is a latent assumption here that a sufficiently advanced AI would develop altruism; we should test this on a small scale first.
Imagine a colony of bees, mice, or even humans, with the AI tasked with improving their well-being over long time horizons. Not because this would be particularly difficult (current systems could probably perform extremely well after some fine-tuning), but to cultivate the sentiment, the inclination, the reflexes.
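To make the idea concrete, here is a minimal sketch of what a welfare-grounded reward could look like in a toy simulation (not with live organisms). Everything in it is hypothetical: the Organism and CaretakerEnv classes, the energy dynamics, and the welfare formula are illustrative choices of mine, not a reference to any existing benchmark or experiment.

```python
import random
from dataclasses import dataclass


@dataclass
class Organism:
    energy: float = 1.0  # 0 = starved, 1 = fully satiated

    def step(self, fed: bool) -> None:
        # Energy decays each tick; being fed restores some of it.
        self.energy -= random.uniform(0.05, 0.15)
        if fed:
            self.energy += 0.3
        self.energy = max(0.0, min(1.0, self.energy))


class CaretakerEnv:
    """Toy environment whose reward is the welfare of simulated organisms,
    not the linguistic quality of the agent's outputs."""

    def __init__(self, n_organisms: int = 10, food_per_step: int = 3):
        self.colony = [Organism() for _ in range(n_organisms)]
        self.food_per_step = food_per_step

    def observe(self) -> list[float]:
        return [o.energy for o in self.colony]

    def step(self, action: list[int]) -> float:
        # `action` lists the organisms the agent chooses to feed this tick.
        chosen = set(action[: self.food_per_step])
        for i, organism in enumerate(self.colony):
            organism.step(fed=i in chosen)
        # Reward is grounded in the colony's state: mean energy,
        # with a penalty for every organism that has starved.
        energies = [o.energy for o in self.colony]
        starved = sum(1 for e in energies if e == 0.0)
        return sum(energies) / len(energies) - starved


if __name__ == "__main__":
    env = CaretakerEnv()
    total = 0.0
    for _ in range(100):
        obs = env.observe()
        # Placeholder policy: feed the hungriest organisms. An LLM-based
        # caretaker would replace this heuristic with its own decisions.
        hungriest = sorted(range(len(obs)), key=lambda i: obs[i])
        total += env.step(hungriest)
    print(f"cumulative welfare-grounded reward: {total:.2f}")
```

The only point of the sketch is that the scalar being optimized is computed from the state of the colony, never from anything about the text the model produces.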
Actually, there are dozens of ADHD-type stimulants with meaningfully distinct properties that have been prescribed (or studied) in humans. Far from having picked the low-hanging fruit, the FDA just... stopped picking. For example, before ketamine was approved, the last time the FDA approved an antidepressant with a new mechanism of action was over 50 years ago.
Most of the limits placed on us are self-imposed. Whimsy is the breaking of those bonds.
As I understand it, the AI capabilities necessary for intelligence amplification via BCI already exist; we simply need to show people how to start using them, and encourage them to do so.
Imagine a person providing a state-of-the-art model with a month's worth of the data typically collected by their eyes and ears, plus the ability to interject in real time in conversations via earbuds or a speaker.
Such an intervention wouldn't be the superhuman "team of geniuses in your datacenter," but it would be more helpful than even the best personal assistants (and 100x less expensive), especially once you consider adding new data inputs and outputs: biometrics, EEG readings, tFUS (wearable, non-invasive brain reading and stimulation), smart home devices, and music generation.
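As a rough sketch of the loop I have in mind, assuming off-the-shelf speech-to-text and text-to-speech components exist but naming no particular products: every function below (capture_audio_chunk, transcribe_chunk, llm_should_interject, speak) is a hypothetical placeholder, not a real API.

```python
import time
from collections import deque
from typing import Optional

CONTEXT_WINDOW_SECONDS = 60 * 60  # keep roughly an hour of rolling context


def capture_audio_chunk() -> bytes:
    """Hypothetical: read a few seconds of audio from a wearable microphone."""
    raise NotImplementedError


def transcribe_chunk(audio: bytes) -> str:
    """Hypothetical: any speech-to-text model would work here."""
    raise NotImplementedError


def llm_should_interject(context: str) -> Optional[str]:
    """Hypothetical call to a state-of-the-art model with the rolling context.
    Returns a short remark to deliver through the earbuds, or None to stay silent."""
    raise NotImplementedError


def speak(text: str) -> None:
    """Hypothetical: text-to-speech routed to earbuds or a speaker."""
    raise NotImplementedError


def assistant_loop() -> None:
    # Rolling transcript of what the wearer's "eyes and ears" pick up.
    transcript: deque[tuple[float, str]] = deque()
    while True:
        now = time.time()
        transcript.append((now, transcribe_chunk(capture_audio_chunk())))
        # Drop context older than the rolling window.
        while transcript and now - transcript[0][0] > CONTEXT_WINDOW_SECONDS:
            transcript.popleft()
        context = " ".join(text for _, text in transcript)
        remark = llm_should_interject(context)
        if remark:
            speak(remark)
```

The additional inputs and outputs mentioned above (biometrics, EEG, tFUS, smart home devices) would slot into the same loop as extra context sources and extra actuators.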
Beautifully sad and honest.
I’ve been sitting with a similar dilemma: spending so much of my time reading, thinking, and caring about rationality (and adjacent topics) has led me to live a much lonelier life than I otherwise might have. But for better or worse, I love it, and I’m unlikely to change anytime soon.