I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making. I've focused on the emergent interactions that are needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I'm incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
I really like your recent series of posts that succinctly address common objections/questions/suggestions about alignment concerns. I'm making a list to show my favorite skeptics (all ML/AI people; nontechnical people, as Connor Leahy puts it, tend to respond "You fucking what? Oh hell no!" or similar when informed that we are going to make genuinely smarter-than-us AI soonish).
We do have ways to get an AI to do what we want. The hardcoded algorithmic maximizer approach seems to be utterly impractical at this point. That leaves us with approaches that don't obviously do a good job of preserving their own goals as they learn and evolve:
None of these directly address what I'm calling The alignment stability problem, to give a name to what you're addressing here. I think addressing it will work very differently in each of the three approaches listed above, and might well come down to implementational details within each approach. I think we should be turning our attention to this problem along with the initial alignment problems, because some of the optimism in the field stems from thinking about initial alignment and not long-term stability.
Edit: I left out Ozyrus's posts on approach 3. He's the first person I know of to see agentized LLMs coming, outside of David Shapiro's 2021 book. His post was written a year ago and posted two weeks ago to avoid infohazards. I'm sure there are others who saw this coming more clearly than I did, but I thought I'd try to give credit where it's due.
Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.
I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.
LLM-bots are inherently easy to align, at least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.
In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.
I think the agent disasters you describe will occur, but they will happen to people that don't put safeguards into their bots, like "track how much of my money you're spending and stop if it hits $X and check with me". When agent disasters affect other people, the media will blow it sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take more seriously the big bot disaster.
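The spending safeguard mentioned above is simple enough to sketch. This is only an illustration under my own assumptions: `SpendingGuard`, `NeedsApproval`, and the dollar amounts are hypothetical names and numbers, not part of any real agent framework.

```python
from dataclasses import dataclass


class NeedsApproval(Exception):
    """Raised when a purchase would push total spending past the cap."""


@dataclass
class SpendingGuard:
    limit: float        # the "$X" cap set by the bot's owner (hypothetical)
    spent: float = 0.0  # running total of spending approved so far

    def authorize(self, amount: float) -> None:
        """Record a purchase, or stop and hand control back to the human."""
        if amount < 0:
            raise ValueError("amount must be non-negative")
        if self.spent + amount > self.limit:
            raise NeedsApproval(
                f"spending {amount:.2f} would exceed limit {self.limit:.2f}"
            )
        self.spent += amount


guard = SpendingGuard(limit=100.0)
guard.authorize(60.0)      # fine, running total is now 60
halted = False
try:
    guard.authorize(50.0)  # would bring the total to 110, so the bot stops
except NeedsApproval:
    halted = True          # this is where the bot would check with its owner
```

The point of putting the check in ordinary code rather than in the prompt is that the bot cannot talk itself past it.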
Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.
Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, faster than you'd intuitively think. Improvements to the executive function (outer script code) and episodic memory (pinecone or other vector search over saved text files) will interact so that improvements in each make the rest of the system work better and easier to improve.
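To make the episodic memory piece concrete, here is a toy sketch of vector search over saved text. Everything in it is an assumption for illustration: `embed` is a word-count stand-in for a real embedding model, and `EpisodicMemory` is a hypothetical class, not the API of Pinecone or any other vector store.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EpisodicMemory:
    """Stores text episodes and recalls the ones most similar to a query."""

    def __init__(self) -> None:
        self.episodes = []  # list of (text, vector) pairs

    def save(self, text: str) -> None:
        self.episodes.append((text, embed(text)))

    def recall(self, query: str, k: int = 1) -> list:
        q = embed(query)
        ranked = sorted(self.episodes, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = EpisodicMemory()
memory.save("supplier quoted 40 dollars per pair of shoes")
memory.save("customer complained about late delivery")
memory.save("ad campaign on social media doubled traffic")
best = memory.recall("what did the shoes supplier quote", k=1)
```

The outer script (the executive function) would call `recall` before each model call, which is one place the two components interact: better retrieval gives the script better material to plan with.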
I did a little informal testing of asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to come up with. It looked like the ethical/capitalist reasoning of a pretty intelligent person; but also a fairly ethical one.
I'm just curious, why the specification of math proofs? I know of some modestly promising ideas for aligning the sorts of AGI we're likely to get, and none of them were originally specified in mathematical terms. Tacking on maths to those wouldn't really be useful. My impression is that the search for formal proofs of safety has failed and is probably hopeless. It also seems like adding mathematical gloss to ML and psychological concepts is more often confusing than enlightening.
I agree with all of that, but the way you described that interaction sounds like it wouldn't even come close to accomplishing these goals. There's a gap in communication. I'd have to see you do it in person to know if I thought it was working.
I'm thinking of not just a spike in anxiety; I mean a permanent increase in social anxiety after having their fears of being socially inappropriate realized in extremely embarrassing public criticism.
People mean a lot of things when they say they have social anxiety. Everyone can get nervous in social situations. But real anxiety can be absolutely crippling, and I wouldn't want to make it worse for anyone.
AI safety includes unintended consequences of non-sentient systems. That ambiguity creates confusion in the discussion. I've been using AGI x-risk as a clumsy way to point to what I'm trying to research. Artificial Intention research does the same thing, but without broadcasting conclusions as part of the endeavor.
Leaving out the "artificial intelligence" seems questionable, as does adopting the same abbreviation, "AI", for both. So I'd suggest AI intention research, AII. Wait, never mind :). Other ideas?
I agree that intention comes with its own baggage, but I think that baggage is mostly appropriate. Intention usually refers to explicit goals. And those are the ones we're mostly worried about. I think it's unhelpful to mix concerns about goal-directed AI with concerns about implicit biases and accidental side effects. So I'd call this another step in the right direction.
I am going to try adopting this terminology, at least in some cases.
I think the motivation to suppress the lab leak theory was to avoid compounding the crisis at the time with anti-Chinese sentiment, including racist attacks on Chinese Americans. Emotions were running really high.
I think we'll now see little energy for continuing that bias, and more energy for correctly identifying the source to prevent future pandemics from similar origins.
TBF, I predict that the public debate will still resemble a dumpster fire, as do most complex human affairs. Humans are cute, not smart.
If aliens are here, they are definitely screwing with us by remaining covert. I don't know how that figures into the odds given the evidence. It would require incompetent aliens, or else aliens so competent that they could judge how much crappy contact data would make the whole thing seem so unlikely that it gets ignored by most rational people.
Like me.
If they're here but not willing to make contact, they're useless to me as far as I can tell, and I'll go on doing the same things whether or not they exist.
One piece of the logic that I do find interesting is the interaction with AGI x-risk. If aliens were here, they probably wouldn't want us creating a light-cone swallowing misaligned AGI.
That is indeed a lot of points. Let me try to parse them and respond, because I think this discussion is critically important.
Point 1: overhang.
Your first two paragraphs seem to be pointing to downsides of progress, and saying that it would be better if nobody made that progress. I agree. We don't have guaranteed methods of alignment, and I think our odds of survival would be much better if everyone went way slower on developing AGI.
The standard thinking, which could use more inspection, but which I agree with, is that this is simply not going to happen. Individuals that decide to step aside are slowing progress only slightly. This leaves compute overhang that someone else is going to take advantage of, with nearly the competence, and only slightly slower. Those individuals who pick up the banner and create AGI will not be infinitely reckless, but the faster progress from that overhang will make whatever level of caution they have less effective.
This is a separate argument from regulation. Adequate regulation will slow progress universally, rather than leaving it up to the wisdom and conscience of every individual who might decide to develop AGI.
I don't think it's impossible to slow and meter progress so that overhang isn't an issue. But I think it is effectively even harder than alignment. We have decent suggestions on the table for alignment now, and as far as I know, no equally promising suggestions for getting everyone (and it does take almost everyone coordinating) to pass up the immense opportunities offered by capabilities overhangs.
Point 2: Are LLMs safer than other approaches?
I agree that this is a questionable proposition. I think it's worth questioning. Aiming progress at easier-to-align approaches seems highly worthwhile.
I agree that an LLM may have something like a mind inside. I think current versions are almost certainly too dumb to be existentially dangerous (at least directly - if a Facebook algorithm can nearly cause an insurrection, who knows what dangerous side effects any AI can have).
I'm less worried about GPT10 playing a superintelligent, Waluigi-collapsed villain than I am about a GPT6 that has been amplified to agency, situational awareness, and weak superintelligence by scaffolding it into something like a cognitive architecture. I think this type of advance is inevitable. ChatGPT extensions and Bing Chat both use internal prompting to boost intelligence, and approaches like SmartGPT and Tree of Thoughts massively improve benchmark results over the base LLM.
Fortunately, this direction also has huge advantages for alignment. It has a very low alignment tax, since you give them additional goals in natural language, like "support human empowerment" or whatever the SOTA alignment goal is. And they have vastly better interpretability since they're at least summarizing their thoughts in natural language.
Here's where your skepticism that they're being honest about summarizing those thoughts comes into full force. I agree that it's not reliable; for instance, changing the intermediate answer in chain of thought prompting often doesn't change the final output, indicating that that output was for show.
However, a safer setup is to never use the same model twice. When you use chain-of-thought reasoning, construct a new context with the relevant information from memory; don't just let the context window accrue, since this allows fake chains-of-thought and the collapse of the simulator into a waluigi.
Scaffolded LLMs should not turn an LLM into an agent, but rather create a committee of LLMs that are called for individual questions needed to accomplish that committee's goals.
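A minimal sketch of what I mean by building a fresh context for each call. The names here (`ask_member`, `run_committee`, `stub_model`) are hypothetical, the model is a stub standing in for a real LLM call, and truncating the answer is a crude placeholder for real summarization.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: takes a prompt string, returns a reply string.
Model = Callable[[str], str]


def ask_member(model: Model, role: str, question: str, notes: List[str]) -> str:
    """Call one committee member with a freshly constructed context.

    Only curated notes are passed in; the raw transcript of earlier
    calls is never carried over.
    """
    context = "\n".join(f"- {n}" for n in notes)
    prompt = f"Role: {role}\nRelevant notes:\n{context}\nQuestion: {question}"
    return model(prompt)


def run_committee(model: Model, goal: str, questions: List[Tuple[str, str]]) -> List[str]:
    """Ask each question in turn, keeping short notes as the shared memory."""
    notes = [f"Committee goal: {goal}"]
    for role, question in questions:
        answer = ask_member(model, role, question, notes)
        # Store a short note, not the whole exchange.
        notes.append(f"{role} answered: {answer[:200]}")
    return notes


seen_prompts: List[str] = []


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call; records every prompt it is given."""
    seen_prompts.append(prompt)
    return f"reply {len(seen_prompts)}"


notes = run_committee(
    stub_model,
    "sell shoes without harming anyone",
    [("planner", "What is the next step?"), ("critic", "Is that step ethical?")],
)
```

In this sketch the critic sees the planner's noted answer but not the planner's prompt or any accumulated context window, so each call starts from a context the script wrote rather than one the model wrote.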
This isn't remotely a solution to the alignment problem, but it really seems to have massive upsides, and only the same downsides as other practically viable approaches to AGI.
To be clear, I only see some form of RL agents as the other practical possibility, and I like our odds much less with those.
I think there are other, even more readily alignable approaches to AGI. But they all seem wildly impractical. I think we need to get ready to align the AGI we get, rather than just preparing to say I-told-you-so after the world refuses to forgo massive incentives to take a much slower but safer route to AGI.
To paraphrase, we need to go to the alignment war with the AGI we get, not the AGI we want.