

I'm going to argue that meditation/introspection skill is a key part of an alignment researcher's repertoire. I'll start with a somewhat fake taxonomy of approaches to understanding intelligence/agency/value formation:

  • Study artificial intelligences
    • From outside (run experiments, predict results, theorize)
    • From inside (interpretability)
  • Study biological intelligences
    • From outside (psychology experiments, theorizing about human value & intelligence formation, study hypnosis[1])
    • From inside (neuroscience, meditation, introspection)

I believe introspection/meditation is a neglected way to study intelligence among alignment researchers

  • You can run experiments on your own mind at any time. Lots of experimental bits free for the taking
  • I expect interviewing high-level meditators to miss most of the valuable illegible intuitions (both from lack of direct experience and from lacking the technical knowledge to integrate that experience with)
  • It has known positive side effects like improved focus and reduced stress (yes, it is possible to wirehead, but I believe it's worth the risk if you're careful)

What are you waiting for, get started![2]

  1. I threw this out offhand, unsure if it's a good idea, but maybe figuring out hypnosis will teach us something about the mind? (Also hypnosis could be in the "outside" or "inside" category.) ↩︎

  2. Or read The Mind Illuminated, Mastering the Core Teachings of the Buddha, or Joy on Demand. Better yet, find a teacher. ↩︎

Very inspiring. Of the rationality escape velocity points, I do: (a) usually, (b) not yet, (c) always, (d) mostly yes.

May we all become more rational!

There's a critical part missing here: finding the bug generators, the root causes that generate several bugs on your list.

Example: I have many bugs that, while possible to attack individually, I believe all stem from internal misalignment, where my "parts" disagree on what we should be doing. I was basically ignoring this and forcing myself to do things, rather than reasoning with the mob. I also believe there's a road to increasing inner alignment through things like meditation, IFS, Internal Double Crux, etc., which I'm now focusing on.

(I learned about bug generators at ESPR and haven't seen them mentioned here before.)

KL-divergence and map territory distinction

Crosspost from my blog

The cross-entropy $H(P, Q)$ is defined as the expected surprise when drawing from $P$, which we're modeling as $Q$. Our map is $Q$ while $P$ is the territory.

$$H(P, Q) = \sum_x P(x) \log \frac{1}{Q(x)}$$

Now it should be intuitively clear that $H(P, Q) \ge H(P, P)$ because an imperfect model $Q$ will (on average) surprise us more than the perfect model $P$.

To measure unnecessary surprise from approximating $P$ by $Q$ we define

$$D_{KL}(P \| Q) = H(P, Q) - H(P, P)$$

This is KL-divergence! The average additional surprise from our map approximating the territory.

Now it's time for an exercise: in the following figure, $Q^*$ is the Gaussian that minimizes $D_{KL}(P \| Q)$ or $D_{KL}(Q \| P)$. Can you tell which is which?

Left is minimizing $D_{KL}(P \| Q)$ while the right is minimizing $D_{KL}(Q \| P)$.

Reason as follows:

  • If $P$ is the territory, then the left $Q^*$ is a better map (of $P$) than the right $Q^*$.
  • If $P$ is the map, then the territory $Q^*$ on the right leads to us being less surprised than the territory $Q^*$ on the left, because on the left $P$ will be very surprised at data in the middle, despite it being likely according to the territory $Q^*$.

On the left we fit the map to the territory, on the right we fit the territory to the map.
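The exercise can be checked numerically. Below is a minimal sketch, not taken from the original post. It writes P for a bimodal target and Q for a single Gaussian (symbol names assumed here), discretizes both on a grid, and brute-forces the Gaussian that minimizes each direction of the KL-divergence. The mode locations (±4), the grid ranges, and the helper names `log_normalize`, `gaussian`, and `fit` are all illustrative assumptions.

```python
import numpy as np

x = np.linspace(-15, 15, 1201)  # grid on which all distributions live

def log_normalize(logits):
    """Turn unnormalized log-weights into log-probabilities that sum to 1."""
    m = logits.max()
    return logits - (m + np.log(np.exp(logits - m).sum()))

def cross_entropy(log_p, log_q):
    """Expected surprise -sum_x p(x) log q(x) when drawing from p, modeling as q."""
    return -(np.exp(log_p) * log_q).sum()

def kl(log_p, log_q):
    """KL(p || q) = H(p, q) - H(p, p): the extra surprise from using q for p."""
    return cross_entropy(log_p, log_q) - cross_entropy(log_p, log_p)

def gaussian(mu, sigma):
    """Discretized Gaussian on the grid, in log space to avoid underflow."""
    return log_normalize(-0.5 * ((x - mu) / sigma) ** 2)

# Bimodal P: equal mixture of unit-variance Gaussians at -4 and +4 (assumed).
log_p = log_normalize(np.logaddexp(-0.5 * (x + 4) ** 2, -0.5 * (x - 4) ** 2))

mus = np.linspace(-6, 6, 61)
sigmas = np.linspace(0.5, 6, 56)

def fit(objective):
    """Grid search for the (mu, sigma) whose Gaussian minimizes the objective."""
    best = min((objective(gaussian(mu, s)), mu, s) for mu in mus for s in sigmas)
    return best[1], best[2]

# Forward KL: P is the territory, Q is the map being fitted.
mu_fwd, sigma_fwd = fit(lambda log_q: kl(log_p, log_q))
# Reverse KL: Q plays the territory, P plays the map.
mu_rev, sigma_rev = fit(lambda log_q: kl(log_q, log_p))

print("min KL(P||Q): mu=%.2f sigma=%.2f" % (mu_fwd, sigma_fwd))
print("min KL(Q||P): mu=%.2f sigma=%.2f" % (mu_rev, sigma_rev))
```

Under these assumptions the forward fit should come out wide and centered between the two modes, while the reverse fit should sit on a single mode, matching the left and right panels described above.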

Alignment researchers have given up on aligning an AI with human values, it’s too hard! Human values are ill-defined, changing, and complicated things which they have no good proxy for. Humans don’t even agree on all their values!

Instead, the researchers decide to align their AI with the simpler goal of “creating as many paperclips as possible”. If the world is going to end, why not have it end in a funny way?

Sadly it wasn’t so easy: the first prototype of Clippy grew addicted to watching YouTube videos of paperclip unboxing, and the second prototype hacked its camera feed, replacing it with an infinite scroll of paperclips. Clippy doesn’t seem to care about paperclips in the real world.

How can the researchers make Clippy care about the real world? (and preferably real-world paperclips too)

This is basically the diamond-maximizer problem. In my opinion, the "preciseness" with which we can specify diamonds is a red herring. At the quantum level or below, what counts as a diamond could start to get fuzzy.

I can't speak for Alex and Quintin, but I think if you were able to figure out how values like "caring about other humans" or generalizations like "caring about all sentient life" formed for you from hard-coded reward signals that would be useful. Maybe ask on the shard theory discord, also read their document if you haven't already, maybe you'll come up with your own research ideas.

If the title is meant to be a summary of the post, I think that would be analogous to someone saying "nuclear forces provide an untapped wealth of energy". It's true, but the reason the energy is untapped is because nobody has come up with a good way of tapping into it.

The difference is people have been trying hard to harness nuclear forces for energy, while people have not been trying hard to research humans for alignment in the same way. Even relative to the size of the alignment field being far smaller, there hasn't been a real effort as far as I can see. Most people immediately respond with "AGI is different from humans for X,Y,Z reasons" (which are true) and then proceed to throw out the baby with the bathwater by not looking into human value formation at all.

Planes don't fly like birds, but we sure as hell studied birds to make them.

If you come up with a strategy for how to do this then I'm much more interested, and that's a big reason why I'm asking for a summary since I think you might have tried to express something like this in the post that I'm missing.

This is their current research direction: the shard theory of human values, which they're currently making posts on.

I think even without point #4 you don't necessarily get an AI maximizing diamonds. Heuristically, it feels to me like you're bulldozing open problems without understanding them (e.g. ontology identification by training with multiple models of physics, getting it not to reward-hack by explicit training, etc.) all of which are vulnerable to a deceptively aligned model (just wait till you're out of training to reward-hack). Also, every time you say "train it by X so it learns Y" you're assuming alignment (e.g. "digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion")

IMO shard theory provides a great frame to think about this in, it's a must-read for improving alignment intuitions.

Interesting, I'm homeschooled (unschooled specifically) and that probably benefited my agency (though I could still be much more agentic). I guess parenting styles matter a lot more than surface-level "going to school".

You're super brave for sharing this, it's hard to stand up and say "Yes I'm the stereotypical example of the problem mentioned here", stay optimistic though; people starting lower have risen higher.

Those who take delight in their own might are merely pretenders to power. The true warrior of fate needs no adoration or fear, no tricks or overwhelming effort; he need not be stronger or smarter or innately more capable than everyone else; he need not even admit it to himself. All he needs to do is to stand there, at that moment when all hope is dead, and look upon the abyss without flinching.

It would be nice to be able to change 5 minutes to something else. I know this isn't in the spirit of "Try Harder, Luke", but 5 minutes is arbitrary; it could just as easily have been 10 minutes.
