How one uses set theory for alignment problem?
Answer by Nisan, May 30, 2021

See section 2 of this Agent Foundations research program and citations for discussion of the problems of logical uncertainty, logical counterfactuals, and the Löbian obstacle. Or you can read this friendly overview. Gödel-Löb provability logic has been used here.

I don't know of any application of set theory proper (large cardinals, forcing, etc.) to agent foundations research.

Dario Amodei leaves OpenAI

Ah, 90% of the people discussed on this post are now working for Anthropic, along with a few other ex-OpenAI safety people.

The Homunculus Problem

Here's a fun and pointless way one could rescue the homunculus model: There's an infinite regress of homunculi, each of which sees a reconstructed image. As you pass up the chain of homunculi, the shadow gets increasingly attenuated, approaching but never reaching complete invisibility. Then we identify "you" with a suitable limit of the homunculi, and what you see is the entire sequence of images under some equivalence relation which "forgets" how similar A and B were early in the sequence, but "remembers" the presence of the shadow.

The Homunculus Problem

The homunculus model says that all visual perception factors through an image constructed in the brain. One should be able to reconstruct this image by asking a subject to compare the brightness of pairs of checkerboard squares. A simplistic story about the optical illusion is that the brain detects the shadow and then adjusts the brightness of the squares in the constructed image to exactly compensate for the shadow, so the image depicts the checkerboard's inferred intrinsic optical properties. Such an image would have no shadow, and since that's all the homunculus sees, the homunculus wouldn't perceive a shadow.

That story is not quite right, though. Looking at the picture, the black squares in the shadow do seem darker than the dark squares outside the shadow, and similarly for the white squares. I think if you reconstructed the virtual image using the above procedure you'd get an image with an attenuated shadow. Maybe with some more work you could prove that the subject sees a strong shadow, not an attenuated one, and thereby rescue Abram's argument.

Edit: Sorry, misread your comment. I think the homunculus theory is that in the real image, the shadow is "plainly visible", but the reconstructed image in the brain adjusts the squares so that the shadow is no longer present, or is weaker. Of course, this raises the question of what it means to say the shadow is "plainly visible"...

The Homunculus Problem

This is the sort of problem Dennett's Consciousness Explained addresses. I wish I could summarize it here, but I don't remember it well enough.

It uses the heterophenomenological method, which means you take a dataset of earnest utterances like "the shadow appears darker than the rest of the image" and "B appears brighter than A", and come up with a model of perception/cognition to explain the utterances. In practice, as you point out, homunculus models won't explain the data. Instead the model will say that different cognitive faculties will have access to different pieces of information at different times.

The Argument For Spoilers

Very interesting. I would guess that to learn in the presence of spoilers, you'd need not only a good model of how you think, but also a way of updating the way you think according to the model's recommendations. And I'd guess this is easiest in domains where your object-level thinking is deliberate rather than intuitive, which would explain why the flashcard task would be hardest for you.

When I read about a new math concept, I eventually get the sense that my understanding of it is "fake", and I get "real" understanding by playing with the concept and getting surprised by its behavior. I assumed the surprise was essential for real understanding, but maybe it's sufficient to track which thoughts are "real" vs. "fake" and replace the latter with the former.

The Argument For Spoilers

Have you had any success learning the skill of unseeing?

  • Are you able to memorize things by using flashcards backwards (looking at the answer before the prompt) nearly as efficiently as using them the usual way?
  • Are you able to learn a technical concept from worked exercises nearly as well as by trying the exercises before looking at the solutions?
  • Given a set of brainteasers with solutions, can you accurately predict how many of them you would have been able to solve in 5 minutes if you had not seen the solutions?

Reflexive Oracles and superrationality: prisoner's dilemma

See also this comment from 2013 that has the computable version of NicerBot.

Prisoner's Dilemma (with visible source code) Tournament

This algorithm is now published in "Robust program equilibrium" by Caspar Oesterheld, Theory and Decision (2019) 86:143–159, which calls it ϵGroundedFairBot.

The paper cites this comment by Jessica Taylor, which has the version that uses reflective oracles (NicerBot). Note also the post by Stuart Armstrong it's responding to, and the reply by Vanessa Kosoy. The paper also cites a private conversation with Abram Demski. But as far as I know, the parent to this comment is older than all of these.
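
For readers who haven't seen it, here is a rough sketch of the idea as I understand it from the paper: with probability ϵ cooperate outright, otherwise simulate the opponent playing against you and copy its move. The program interface (a function taking the opponent's program and a random source), the helper bots, and the value ϵ = 0.1 are my own assumptions for illustration, not taken from the paper or from the original comment.

```python
import random

def epsilon_grounded_fair_bot(opponent, rng, epsilon=0.1):
    """Cooperate with probability epsilon; otherwise copy the opponent's simulated move.

    `opponent` is a program with the hypothetical signature
    opponent(other_program, rng) -> "C" or "D".
    """
    if rng.random() < epsilon:
        return "C"  # the "grounding" step that ends the mutual simulation
    # Simulate the opponent playing against this very program and mirror it.
    return opponent(epsilon_grounded_fair_bot, rng)

def defect_bot(opponent, rng):
    """Illustrative opponent that always defects."""
    return "D"

def cooperate_bot(opponent, rng):
    """Illustrative opponent that always cooperates."""
    return "C"

if __name__ == "__main__":
    rng = random.Random(0)
    # Self-play: the chain of simulations stops at the first epsilon-branch,
    # so every branch that returns at all returns "C".
    print(epsilon_grounded_fair_bot(epsilon_grounded_fair_bot, rng))
    # Against a defector the bot cooperates only on the epsilon-branch.
    moves = [epsilon_grounded_fair_bot(defect_bot, rng) for _ in range(1000)]
    print(moves.count("C") / len(moves))
```

In self-play the recursion terminates with probability 1, with expected depth about 1/ϵ, which is why no proof search or oracle is needed in this version.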

Challenge: know everything that the best go bot knows about go

Or maybe it means we train the professional in the principles and heuristics that the bot knows. The question is if we can compress the bot's knowledge into, say, a 1-year training program for professionals.

There are reasons to be optimistic: We can discard information that isn't knowledge (lossy compression). And we can teach the professional in human concepts (lossless compression).
