Charlie Steiner

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.


Reducing Goodhart
Philosophy Corner



P₂B: Plan to P₂B Better

Saying that resource acquisition is in the service of improved planning (because it makes future plans better) seems like a bit of a stretch - you could just as easily say that improved planning is in the service of resource acquisition (because it lets you use resources you couldn't before). "But executing plans is how you get the goal!" you might say, and "But using your resources is how you get to the goal!" is the reply.

Maybe this is nitpicking, because I agree with you that there is some central thing going on here that is the same whatever you choose to call "more fundamental." Some essence of getting to the goal, even though the world is bigger than me. So I'm looking forward to where this is headed.

Does the Structure of an algorithm matter for AI Risk and/or consciousness?

Since everyone's talking sensibly about capabilities / safety, let me talk insensibly about consciousness.

Sometimes when we ask about consciousness, we just mean a sort of self-modeling or self-awareness. Does this thing represent itself in the model of the world? Is it clever and responsive to us interacting with it?

But I'm going to assume you mean something more like "if I want to care about humans even if their bodies are different shapes, should I also care about these things?" Or "Is there something it is like to be these things?"

When we wonder about AI's consciousness (or animal consciousness, or aliens, or whatever), there is no simple physical property that is "the thing" we are asking about. Instead, we are asking about a whole mixed-up list of different things we care about and that all go together in non-brain-damaged humans, but don't have to go together in AIs. To look at a tiny piece of the list, my pain response involves unconscious responses of my body (e.g. reducing blood flow to extremities) that I then perceive, associations in my thoughts with other painful events or concepts of damage or danger, reflexes to say "ow" or swear or groan, particular reflexive facial expressions, trying to avoid the painful stimulus, difficulty focusing on other things, etc.

These things usually go together in me and in most humans, but we might imagine a person who has some parts of pain but not others. For example, let's just say they have the first half of my list but not the second half: their body unconsciously goes into fight-or-flight mode, they sense that something is happening and associate it with examples of danger or damage, but they have no reflex to say "ow" or look pained, they don't feel an urge to avoid the stimulus, and they suffer no more impediment to thinking clearly while injured than you do when you see the color red. It's totally reasonable to say that this person "doesn't really feel pain," but the precise flavor of this "not really" is totally different than the way in which a person under general anesthesia doesn't really feel pain.

If we lose just a tiny piece of the list rather than half of it, the change is small, and we'd say we still "really" feel pain but maybe in a slightly different way. Similarly, if we lost our sense of pain we'd still feel that we were "really" conscious, if with a slightly different flavor. This is because pain is just a tiny part of what goes together in consciousness - if we also lost how our expectations color what objects we recognize in vision, how we store and recall concepts from memory, how we feel associations between our senses, how we have a sense of our own past, and a dozen other bits of humanity, then we'd be well into the uncanny valley. (Or we don't really have to lose these things, we just have to lose the way that they go together in humans, just like how the person missing half of the parts of their pain response doesn't start getting points back if they say "ow" but at times uncorrelated with them being stabbed.)

Again, I need to reiterate that there is nothing magical about this list of functions and feelings, nothing that makes it a necessary list-of-things-that-go-together, it's just some things that happen to form a neat bundle in humans. But also there's nothing wrong with caring about these things! We're not doing physics here, you can't automatically get better definitions of words by applying Occam's Razor and cutting out all the messy references to human nature.

Because the notion of consciousness has our own anthropocentric perspective so baked into it, any AI not specially designed to have all the stuff we care about will almost surely be missing many parts of the list, and be missing many human correlations between parts.

So, to get around to the question: Neither of these AIs will be conscious in the sense we care about. The person who only has half the correlates of pain is astronomically closer to feeling pain than these things are to being conscious. The question is not "what is the probability" they're as conscious as you or I (since that's 0.0), the question is what degree of consciousness do they have - how human-like are their pieces, arranged in how recognizable a way?

Yet after all this preamble, I'm not really sure which I'd pick to be more conscious. Perhaps for most architectures of the RL agent it's actually less conscious, because it's more prone to learn cheap and deceptive tricks rather than laboriously imitating the human reasoning that produces the text. But this requires us to think about how we feel about the human-ness of GPT-n, which even if it simulates humans seems like it simulates too many humans, in a way that destroys cognitive correlations present in an individual.

Alignment via manually implementing the utility function

This works great when you can recognize good things within the representation the AI uses to think about the world. But what if that's not true?

Here's the optimistic case:

Suppose you build a Go-playing AI that defers to you for its values, but the only things it represents are states of the Go board, and functions over states of the Go board. You want to tell it to win at Go, but it doesn't represent that concept, you have to tell it what "win at Go" means in terms of a value function from states of the Go board to real numbers. If (like me) you have a hard time telling when you're winning at Go, maybe you just generate as many obviously-winning positions as you can and label them all as high-value, everything else low-value. And this sort of works! The Go-playing AI tries to steer the gameboard into one of these obviously-winning states, and then it stops, and maybe it could win more games of Go if it also valued the less-obviously-winning positions, but that's alright.
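A minimal sketch of this labeling scheme, using made-up string stand-ins for board states rather than any real Go representation (the state names and the helper are hypothetical, purely for illustration):

```python
# Hypothetical sketch: build a value function over board states from a
# hand-labeled set of "obviously winning" positions. Everything not in
# the set -- including genuine but subtle wins -- gets low value.

def make_value_function(obvious_wins):
    """Return a function mapping board states to real-valued scores."""
    wins = set(obvious_wins)

    def value(board_state):
        # High value only for positions we could personally recognize
        # as winning; all other states, winning or not, score low.
        return 1.0 if board_state in wins else 0.0

    return value

# Stand-in labels for positions we generated and recognized as wins.
value = make_value_function({"W+30.5_terminal", "all_corners_secured"})

print(value("W+30.5_terminal"))       # 1.0 -- labeled win
print(value("subtle_half_point_win")) # 0.0 -- a real win we failed to label
```

The agent steering toward the labeled set and then stopping is exactly the "sort of works" behavior described above: it forgoes the unlabeled wins, but the labeled ones are still genuinely good.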

Why is that optimistic?

Because it doesn't scale to the real world. An AI that learns about and acts in the real world doesn't have a simple gameboard that we just need to find some obviously-good arrangements of. At the base level it has raw sensor feeds and motor outputs, which we are not smart enough to define success in terms of directly. And as it processes its sensory data it (by default) generates representations and internal states that are useful for it, but not simple for humans to understand, or good things to try to put value functions over. In fact, an entire intelligent system can operate without ever internally representing the things we want to put value functions over.

Here's a nice post from the past:

EfficientZero: How It Works

To solve this problem, MuZero adds a new training target for the neural network

Should probably be edited MuZero -> EfficientZero.

Anyhow, great post. I enjoyed the EfficientZero paper and would recommend it to any other interested dilettantes, and this post did a good job putting things in the context of previous work.

The temporal dynamics used in the model seem to be really simple/cheap. Do you think they're just getting away with simple dynamics because Atari is simple, or do you think that even for hard problems we will find representations such that their dynamics are simple?

Biology-Inspired AGI Timelines: The Trick That Never Works

Which examples are you thinking of? Modern Stockfish outperformed historical chess engines even when using the same resources, until far enough in the past that computers didn't have enough RAM to load it.

I definitely agree with your original-comment points about the general informativeness of hardware, and absolutely software is adapting to fit our current hardware. But this can all be true even if advances in software can make more than 20 orders of magnitude of difference in what hardware is needed for AGI, and even if those advances are much less predictable than advances in hardware, rather than being adaptations in lockstep with it.

The 2020 Review [Updated Review Dashboard]

No sarcasm, though, I think I had only 2 good posts in 2020 (Gricean communication and meta-preferences, and Hierarchical planning: context agents), both received modest upvotes and little discussion, and will probably stay out of the 2020 review.

I am not sure that increasing the amount of engaging writing by people with good network effects would lead to more people digging around to comment on my 2 good posts per year, and so I am ambivalent about "going in the Substack direction" even if it works. But prizes for the best posts of 2020 sound neat.

The 2020 Review [Updated Review Dashboard]

"Dah, I can't vote for my own posts."

*looks at own posts from 2020.*

"Well, 2019 was a better year anyhow."

Measuring hardware overhang

On the one hand this is an interesting and useful piece of data on AI scaling and the progress of algorithms. It's also important because it makes the point that the very notion of "progress of algorithms" implies hardware overhang as important as >10 years of Moore's law. I also enjoyed the follow-up work that this spawned in 2021.

Biology-Inspired AGI Timelines: The Trick That Never Works

Lol, was totally expecting "but entropy is ill-defined for continuous distributions except relative to some base measure."

Biology-Inspired AGI Timelines: The Trick That Never Works

This was super interesting. 

I don't think you can directly compare brain voltage to the Landauer limit, because brains operate chemically, so we also care about differences in chemical potential (e.g. of sodium vs potassium, which are importantly segregated across cell membranes even though both have the same charge). To really illustrate this, we might imagine information-processing biology that uses no electrical charges, only signalling via gradients of electrically-neutral chemicals. I think this raises the total potential relative to Landauer and cuts down the number of molecules we should estimate as transported per signal.
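For a rough sense of the electrical term alone (the physical constants are standard; the ~70 mV membrane potential is a typical textbook figure, and the chemical-potential contribution argued for above is deliberately left out):

```python
# Back-of-envelope comparison: Landauer limit at body temperature vs.
# the energy of moving one elementary charge across a ~70 mV membrane
# potential. Assumed illustrative values, not measurements.
import math

k_B = 1.380649e-23       # Boltzmann constant, J/K
T = 310.0                # body temperature, K
e = 1.602176634e-19      # elementary charge, C
V_membrane = 0.070       # typical resting potential, V (rough figure)

landauer = k_B * T * math.log(2)   # minimum energy to erase one bit, ~3e-21 J
charge_energy = e * V_membrane     # energy per charge crossed, ~1.1e-20 J

print(f"Landauer limit:  {landauer:.2e} J")
print(f"Charge crossing: {charge_energy:.2e} J")
print(f"Ratio: {charge_energy / landauer:.1f}x")
```

The electrical term alone comes out at only a few times the Landauer limit, which is why folding in the chemical potential differences matters for the comparison.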
