I've been doing computational cognitive neuroscience from getting my PhD in 2006 until the end of 2022. I've worked on a bunch of brain systems, focusing on the emergent interactions that are needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I'm incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
I probably should've titled this "the alignment stability problem in artificial neural network AI". There's plenty of work on algorithmic maximizers. But it's a lot trickier if values/goals are encoded in a network's distributed representations of the world.
I also should've cited Alex Turner's Understanding and avoiding value drift. There he makes a strong case that dominant shards will try to prevent value drift in which other shards establish stronger connections to reward. But that's not quite good enough. Even if it avoids sudden value drift, at least for the central shard or central tendency in values, it doesn't really address the stability of a multi-goal system. And it doesn't address slow, subtle drift over time.
Those are important, because we may need a multi-goal system, and we definitely want alignment to stay stable over years, if not centuries, of learning and reflection.
Reflective stability does seem like the right term. Searches on that term are turning up some relevant discussion on alignment forum, so thanks!
Tiling agent theory is about formal proof of goal consistency in successor agents. I don't think that's relevant for any AGI made of neural networks similar to either brains or current systems. And that's a problem.
Reflective consistency looks to be about decision algorithms given beliefs, so I don't think that's directly relevant. I couldn't work out Yudkowsky's use of reflectively coherent quantified belief on a quick look, but it's in service of that closed-form proof. That term only occurs three times on AF. Reflective trust is about internal consistency and decision processes relative to beliefs and goals, and it also doesn't seem to have caught on as common terminology.
So the reflective stability term is what I'm looking for, and should turn up more related work. Thanks!
This sounds way too capable to be safe. Although someone is probably working on this right now, this line of thought getting traction might increase the number of people doing it 10x. Maybe that's good, since GPT-4 probably isn't smart enough to kill us, even with an agent wrapper. It will just scare the pants off of us.
Aligning the wrapper is somewhat similar to my suggestion of aligning an RL critic network head, such as humans seem to use. Align the captain, not the crew. And let the captain use the crew's smarts without giving them much say in what to do or how to update them.
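To make the captain-and-crew picture a bit more concrete, here's a minimal sketch in Python (all names, classes, and numbers are hypothetical illustrations, not anyone's actual proposal): the "crew" proposes plans and predicts their outcomes but is never trained on the captain's evaluations, while the "captain" is a small critic head that scores predicted outcomes and is the only component updated from the reward signal we control.

```python
import numpy as np

rng = np.random.default_rng(0)

class Crew:
    """Capable but untrusted component: proposes plans and predicts outcomes.
    Its parameters are never updated by the captain's evaluations."""
    def propose(self, state, n=8):
        # Hypothetical stand-in: candidate plans as random feature vectors.
        return [rng.normal(size=state.shape) for _ in range(n)]

    def predict_outcome(self, state, plan):
        # Predicted next-state features if this plan were executed.
        return state + 0.1 * plan

class Captain:
    """Aligned critic head: a small linear value model over outcome features."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr

    def value(self, outcome):
        return float(self.w @ outcome)

    def choose(self, state, crew):
        # Use the crew's smarts (proposals and predictions) without giving it
        # any say in which plan gets picked or how the captain updates.
        plans = crew.propose(state)
        outcomes = [crew.predict_outcome(state, p) for p in plans]
        best = int(np.argmax([self.value(o) for o in outcomes]))
        return plans[best], outcomes[best]

    def update(self, outcome, reward):
        # Only the captain learns from the (human-controlled) reward signal.
        error = reward - self.value(outcome)
        self.w += self.lr * error * outcome

# Toy loop: the crew supplies the smarts, the captain supplies the values.
state = rng.normal(size=4)
crew, captain = Crew(), Captain(dim=4)
for step in range(100):
    plan, predicted = captain.choose(state, crew)
    reward = -np.sum((predicted - 1.0) ** 2)   # stand-in for an aligned signal
    captain.update(predicted, reward)
    state = predicted
```

The point of the split is only that the trusted, trainable surface area stays small; how well that works with a crew far smarter than this toy one is exactly the open question.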
I also don't prioritize immunizing against disinformation. And of course this is a "haha, we're all going to die" joke. I'm going to hope for an agentized virus including GPT-4 calls, roaming the internet and saying the scariest possible stuff, without being quite smart enough to kill us all. That'll learn 'em.
I'm not saying OpenAI is planning that. Or that they're setting a good example. Just let's hope for that.
I'll definitely check these out, thanks! Reflective stability sounds exactly right.
I don't think Montague dealt with that issue much if at all. But it's been a long time since I read the book.
My biggest takeaway from Tomasello's work was his observation that humans pay far more attention to other humans than monkeys do to monkeys. Direct reward for social approval is one possible mechanism, but it's also possible that some other bias in the system is responsible. I think hardwired reward for social approval is probably a real mechanism. But the correlation between people's approval and more direct rewards of food, water, and shelter may also play a large role in making human approval and disapproval a conditioned stimulus (or a fully "substituted" stimulus). But I don't think that distinction is very relevant for guessing the scope of the critic's association.
But inevitably these are self-terminating when they conflict strongly with more basic survival values.
I completely agree. This is the basis of my explanation for how humans could attribute value to abstract representations and not wirehead. In sum, a system smart enough to learn about the positive values of several-steps-removed conditioned stimuli can also learn many indicators of when those abstractions won't lead to reward. These may be cortical representations of planning-but-not-doing, or other indicators in the cortex of the difference between reality and imagination. The weaker nature of simulation representations may be enough to distinguish the two, and it should certainly be enough to ensure that real rewards and punishments always have a stronger influence, keeping imagination ultimately under the control of reality.
If you've spent the afternoon wireheading by daydreaming about how delicious that fresh meat is, you'll be very hungry in the evening. Something has gone very wrong, in much the same way as if you chose to hunt for game where there is none. In both cases, the system is going to have to learn where the wrong decision was made and the wrong strategy was followed. If you're out of a job and out of money because you've spent months arguing with strangers on the internet about your beloved concept of freedom and the type of political policy that will provide it, something has similarly gone wrong. You might downgrade your estimated value of those concepts and theories, and you might downgrade the value of arguing on the internet with strangers all day.
The same problem arises with any use of value estimates to make prediction-based decisions. It could be that the dopamine system is not involved in these predictions. But given the data that dopamine spiking activity is ubiquitous[1], even when no physical or social rewards are present, it seems likely to me that the system is working the same way in abstract domains as it is known to work in concrete ones.
I need to find this paper, but don't have time right now. The finding was that rodents exploring a new home cage exhibit dopamine spiking activity something like once a second or so on average. I have a clear memory of the claim, but didn't evaluate the methods closely enough to be sure the claim was well supported. If I'm wrong about this, I'd change my mind about the system working this way.
This could be explained by curiosity as an innate reward signal, and that might well be part of the story. But you'd still need to explain why animals don't die by exploring instead of finding food. The same core explanation works for both: imagination and curiosity are both constrained to be weaker signals than real physical rewards.
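As a toy illustration of that constraint (a hedged sketch only, not a model of the dopamine system), a simple TD(0) learner can let imagined rollouts nudge its value estimates while capping their reward magnitude and learning rate well below those of real experience, so real rewards and punishments always dominate. All parameters here are made up for illustration.

```python
import numpy as np

n_states = 5
V = np.zeros(n_states)                          # value estimates
gamma, lr_real, lr_imagined = 0.9, 0.1, 0.02    # imagination learns more weakly

def td_update(s, r, s_next, lr):
    """Standard TD(0) update: delta = r + gamma * V(s') - V(s)."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += lr * delta
    return delta

def real_step(s, r, s_next):
    # Real experience updates values at full strength.
    return td_update(s, r, s_next, lr_real)

def imagined_step(s, r_imagined, s_next, cap=0.5):
    # Imagined rewards are clipped to a fraction of what real rewards can deliver,
    # and learned from more slowly, so daydreaming can't outweigh reality.
    return td_update(s, np.clip(r_imagined, -cap, cap), s_next, lr_imagined)

# A little real experience, then a lot of wishful imagination:
real_step(0, 1.0, 1)
for _ in range(100):
    imagined_step(0, 10.0, 1)   # fantasizing about huge payoffs
print(V[0])  # stays bounded by the clipped, down-weighted imagined signal
```

The analogous claim for brains would be that simulated experience simply can't drive the critic as hard as real food, water, or pain can, which is what keeps the curious or daydreaming animal from starving.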
I agree with all of the premises. This timeline is short even for AGI safety people, but it also seems quite plausible.
I think there are people thinking about aligning true intelligence (that is, agentic, continually learning, and therefore self-teaching and probably self-improving in architecture). Unfortunately, that doesn't change the logic, because those people tend to have very pessimistic views on our odds of aligning such a system. I put Nate Soares, Eliezer Yudkowsky, and others in that camp.
There is a possible solution: build AGI that is human-like. The better humans among us are safely and stably aligned. Many individuals would be safe stewards of humanity's future, even if they changed and enhanced themselves along the road.
Creating a fully humanlike AGI is an unlikely solution, since the timeline for that would be even longer than the timelines for effective human upgrades via AI enhancement through BCI.
But there is already work on roughly human-like AGI. I put DeepMind's focus on deep RL agents in this category. And there are proposed solutions that would produce at least short-term, if not long-term, alignment of that type of system. Steve Byrnes has proposed one such solution, and I've proposed a similar one.
Even partial success at this type of solution might keep loosely brainlike AGI aligned long enough for other solutions to be brought into play.
I think there's a delay in outreach for three reasons.
I do think the community is moving toward focusing more on this angle. And that we probably should.
Sorry for the obscure reference. Alignment Forum is the professional variant of Less Wrong. It has membership by invitation only, which means you can trust the votes and comments to be better informed, and from real people and not fake accounts.
That's a good suggestion. But at some point you have to let it die or wrap it up. It occurred to me while Eliezer was repeatedly trying to get Lex back onto the you're-in-a-box-thinking-faster thought experiment: when I'm frustrated with people for not getting it, I'm often probably boring them. They don't even see why they should bother to get it.
You have to know when to let an approach die, or otherwise change tack.