Internal independent review for language model agent alignment

[-]Ape in the coat2y108

Honestly, the fact that we are not pouring literally hundreds of millions of dollars into this avenue of research is mind-boggling to me. LMA alignment is tractable. What else do we need?

One extremely important point that I don't think you've explicitly addressed is that with LMAs we do not even necessarily have to get alignment exactly right on the first try. We can separately test the "ethics module" of the LMA as much as we want and be confident in the results.

[-]Seth Herd2y110

Almost everyone has only been thinking about LMAs since AutoGPT made a splash, so I'm not surprised that we're not already investing heavily.

What I am surprised by is the relative lack of interest in the alignment community. Is everyone sure these can't lead to AGI? Are they waiting to see how progress goes before thinking about aligning this sort of system? That doesn't seem smart. Or does my writing just suck? :) (That is, has nobody yet written about this compellingly enough to make the importance obvious to the broader community?)

[-]Roman Leventov2y41

Large labs (OpenAI and Anthropic, at least) are pouring at least tens of millions of dollars into this avenue of research, and they are close to the optimal type of organisation to do it, too. True, they are "stained" by competitive pressures, but recreating the necessary conditions in academia or other labs is hard: you need significant investment to get going ("high activation energy") in order to attract experts, develop the platform, secure and curate training data, including expensive human labels and evaluations, etc. Some labs are trying, though, e.g., Conjecture's CoEm agenda might be "LMA alignment in disguise" (although we cannot know for sure because we don't know any details).

[-]Brendon_Wong2y32

Do you have a source for "Large labs (OpenAI and Anthropic, at least) are pouring at least tens of millions of dollars into this avenue of research"? I think a lot of the current work pertains to LMA alignment, like RLHF, but isn't LMA alignment per se (I'd make a distinction between aligning the black box models that compose the LMA versus the LMA itself).

[-]Roman Leventov2y10

I implied the whole spectrum of "LLM alignment", which I think is better to count as a single "avenue of research" because critiques and feedback in "LMA production time" could as well be applied during pre-training and fine-tuning phases of training (constitutional AI style). It's only reasonable for large AGI labs to ban LMAs completely on top of their APIs (as Connor Leahy suggests), or research their safety themselves (as they already started to do, to a degree, with ARC's evals of GPT-4, for instance).

[-]Brendon_Wong2y52

I implied the whole spectrum of "LLM alignment", which I think is better to count as a single "avenue of research" because critiques and feedback in "LMA production time" could as well be applied during pre-training and fine-tuning phases of training (constitutional AI style).

If I'm understanding correctly, is your point here that you view LLM alignment and LMA alignment as the same? If so, this might be a matter of semantics, but I disagree; I feel like the distinction is similar to ensuring that the people who comprise the government are good (the LLMs in an LMA) versus trying to design a good governmental system itself (e.g. dictatorship, democracy, futarchy, separation of powers, etc.). The two areas are certainly related, and a failure in one can mean a failure in the other, but the two areas can involve some very separate and unrelated considerations.

It's only reasonable for large AGI labs to ban LMAs completely on top of their APIs (as Connor Leahy suggests)

Could you point me to where Connor Leahy suggests this? Is it in his podcast?

or research their safety themselves (as they already started to do, to a degree, with ARC's evals of GPT-4, for instance)

To my understanding, the closest ARC Evals gets to LMA-related research is by equipping LLMs with tools to do tasks (similar to ChatGPT plugins), as specified here. I think one of the defining features of an LMA is self-delegation, which doesn't appear to be happening here. The closest they might've gotten was a basic prompt chain.

I'm mostly pointing these things out because I agree with Ape in the coat and Seth Herd. I don't think there's any actual LMA-specific work going on in this space (beyond some preliminary efforts, including my own), and I think there should be. I am pretty confident that LMA-specific work could be a very large research area, and many areas within it would not otherwise be covered with LLM-specific work.

[-]Roman Leventov2y10

I have no intention of arguing this point to death. After all, it's better to do "too much" LMA alignment research than "too little". But I would definitely suggest reaching out to AGI labs' safety teams, maybe privately, and at least trying to find out where they are, rather than just assuming that they don't do LMA alignment.

Connor Leahy proposed banning LLM-based agents here: https://twitter.com/NPCollapse/status/1678841535243202562. In the context of this proposal (which I agree with), a potentially high-leverage thing to work on now is a detection algorithm for LLM API usage patterns that indicate agent-like usage. This may be difficult, though, if users interleave calls to the OpenAI API and the Anthropic API with local usage of LLaMA 2 in their LMA.

However, if Meta, Eleuther AI, Stability, etc. won't stop developing more and more powerful "open" LLMs, agents are inevitable, anyway.

[-]Roman Leventov2y92

Relevant work you haven't mentioned:

"Mindstorms in Natural Language-Based Societies of Mind" (May 2023) -- cf. your discussion of committees
"Let's Verify Step by Step" (OpenAI, May 2023) -- reasoning verification
The above is a part of OpenAI's superalignment agenda -- which has evolved from iterated debate and amplification that you referenced. See this comment by Dai and replies to it by Leike and Christiano, and also this post by Leike which discusses some of the prior concerns and arguments surrounding LMA alignment.

Of course this type of elaborated review process is still limited by the abilities of the LLMs. Existing LLMs have dramatic blind spots. Elaborated systems of prompts and algorithmic aggregation can help work around those blind spots.

For instance, the network might be prompted with “how could this [plan description] fail to achieve [goal descriptions]?” and “what’s the worst possible side effect of [plan description]?”. A new instance could then be prompted with variations of the prompt “if something unexpected happens, what are the odds of getting [side effect description or failure mode description] from [plan description]?”, algorithmically average those probability ratings, and then be prompted to revise the plan or create a new plan with those possible failure modes in the prompt, or ask a human for input if the estimated possible consequences exceed thresholds in severity and likelihood.
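The aggregation step in the quoted passage can be sketched roughly as follows. This is only an illustration under stated assumptions: the ratings are taken to be numbers in [0, 1] already collected from separate LLM instances, and the function names, the severity scale, the threshold values, and the rule for choosing between "revise" and "ask_human" are all hypothetical, not something the post specifies.

```python
from statistics import mean


def aggregate_risk(probability_ratings):
    """Average the probability estimates returned by separate review prompts."""
    if not probability_ratings:
        raise ValueError("need at least one probability rating")
    return mean(probability_ratings)


def review_decision(probability_ratings, severity,
                    likelihood_threshold=0.1, severity_threshold=0.7):
    """Return 'proceed', 'revise', or 'ask_human' for one failure mode.

    severity is an assumed score in [0, 1]; both thresholds are
    placeholder values chosen only for illustration.
    """
    likelihood = aggregate_risk(probability_ratings)
    if likelihood >= likelihood_threshold and severity >= severity_threshold:
        # Both likelihood and severity exceed their thresholds: escalate.
        return "ask_human"
    if likelihood >= likelihood_threshold:
        # Likely but not severe: feed the failure mode back into planning.
        return "revise"
    return "proceed"
```

For example, `review_decision([0.2, 0.4], severity=0.9)` escalates to a human, while the same ratings with `severity=0.3` only trigger a plan revision.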

I think such prompts are bound to result in quite bad lapses in the rationality of LLM's reasoning on complex topics. This is why I pushed the idea further to suggest that for LMA alignment to work, LLMs should wrangle entire textbooks on relevant subjects in their contexts while they review, critique, and refine plans. E.g., if the plan concerns happiness and social dynamics of human societies (e.g., a large-scale social innovation or governance reform plan), the LLM should load entire textbooks on neuropsychology, sociology, and memetics in its context, not just try to criticise the plan from its common-sense capability baseline.

I discuss some of the problems with this approach here:

Such alignment approach is risky by definition because we should keep a very, very powerful, non-ethics-fine-tuned LLM around, so there is a permanent risk of this model leaking and somebody creating a powerful misaligned agent out of it. Then we could just hope that offence-defence balance in the world will be to the defence advantage, but I'm not sure this will be the case.
A powerful LLM will have powerful "sublinguistic" intuitions not expressible in human language, which will affect the generated plans. Because the LLM's linguistic-level justification ability is huge, and the whole alignment approach is by design centered around language, the approach will miss misalignment tendencies that are not expressible in language. This might still turn out relatively OK, especially if we are constantly on the lookout for these tendencies and turning them into language-explicit theories (most likely, they will concern predictions about complex systems, in fields such as psychology, AI psychology, sociology, economics, ecology, etc.), but this is not guaranteed.
The approach is still not guaranteed to converge on plans that are "genuinely" conforming to the theories loaded into LLM's context, due to the limitations of reasoning and rationality of the LLM, or its biases.
Alignment tax of loading dozens of textbooks into the context of the LLM iteratively, on whatever topic is relatively complex and multi-disciplinary (such as, pretty much any consequential decision or plan in business, politics, policy plan/proposal, alignment proposal, social design proposal) will actually be pretty huge.
There is also an obvious malthusian-molochian alignment tax (aligned, ethical LMAs will be outcompeted by unscrupulous LMAs).
OpenAI and other players explicitly say or imply that thus-constructed LMA should exclusively be used to produce alignment research, and the relevant downstream research (or research review/summarisation) when needed (neuroscience, cognitive science, game theory, economics, psychology, etc.). However, what if there is just no "true" or "more robust" alignment approach that this LMA is supposed to find? It will be very hard to stop the economic inertia and shut the whole AGI development edifice down at that point. (I'm not sure there is any solution to this problem in any approach to alignment whatsoever, but this is still a valid technical risk that adds up to the x-risk overall.)
While for economics, psychology, and sociology we at least have some, albeit relatively weak, scientific or proto-scientific theories and evidence, morality as a science is in an even weaker position, and it's still not clear whether ethics could be turned into a science or not, which I think is mandatory for LMA alignment to work (albeit this technical risk applies to any other approach to alignment, too, IMO).

Today, I would also add:

Language is only a part of the reasoning picture; it misses out on aligning the LLM's intuitive tendencies/biases (technically speaking, connectionist parts of the generative model) with human intuition. Cf. "For alignment, we should simultaneously use multiple theories of cognition and value".
There is a hedging worry that if this approach is taken but it still leads to humanity's extinction in one way or another, there could be no consciousness around because LLM turns out to not be conscious. If we take an alignment approach that creates conscious AI from the beginning, and they kill us, at least we will have preserved advanced consciousness in the Solar system, which some argue is actually a good and perhaps inevitable outcome in the long run anyway.

[-]Seth Herd2y40

I reread your An LLM-based “exemplary actor”, which amounts to a similar plan to build an aligned LMA. I think you actually sound at least as optimistic as I am.

Many of your concerns are addressed by my focus on corrigibility. I'm nominating corrigibility as the most important, highest-ranked goal to give LMAs (or any other type of AGI, if we could figure out how to give it that goal in a way that generalizes as well as natural language does).

I think you're right that even an approximately-aligned AGI might have enough divergence from ours to be a problem, and I think that problem is actually way worse than you're thinking. I'm working on a new post on the alignment stability problem elucidating how a small alignment difference might get worse under reflection. I think the solution for that is long-term corrigibility, so that we can correct divergences in AGI alignment when they (perhaps inevitably) occur.

To your first point: the multipolar scenario with LMAs (many intelligent AIs) does seem like a huge downside. Other approaches share this downside, but it's made worse if it turns out to be easy to make autonomous LMAs.

On your other points, I agree that the solution is imperfect. I just think other network-based AGI approaches are worse. Probably most algorithmic approaches as well (since arguably Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc). Language-based AGI seems uniquely interpretable, even if it's not perfectly interpretable.

[-]Roman Leventov2y5-3

I'm very skeptical of corrigibility as an important property of 'safe' AI. Who will decide that an AI's models/actions/decisions are not OK? A global committee of people, who would already be biased by the existing situation? How is corrigibility distinct from micromanagement? In short, corrigibility is very much about goal alignment rather than model alignment, whereas model alignment is more important and more robust than goal alignment.

Apart from that, I don't see how corrigibility addresses any of the problems that I've listed.

I'm working on a new post on the alignment stability problem elucidating how a small alignment difference might get worse under reflection. I think the solution for that is long-term corrigibility, so that we can correct divergences in AGI alignment when they (perhaps inevitably) occur.

The first problem is noticing these divergences at all. Before we know it, AIs will simplify our values and channel human development in a certain restricted direction, within which we will no longer see the problem (and even if some lone voices notice, nobody will stop the giant economic mechanism because of this. Pleas that go against economic forces have always led to absolutely nothing throughout the last several centuries, with few exceptions. And since AI will have unprecedented control over human thoughts, ideas, emotions, values, etc., this will definitely be hopeless in the future).

However, I don't see alignment stability as such as a problem. Again, I think discussing "stability of goals" is very misguided. Goals should change, all the time, as a reaction to changing circumstances! Including very high-level goals. Models could also change, maybe over longer time intervals, but there is no problem with that. If we know how to model-align AI with humans at the current moment, I don't see a problem with re-training AI every year to re-align it with slowly changing models.

On your other points, I agree that the solution is imperfect. I just think other network-based AGI approaches are worse. Probably most algorithmic approaches as well (since arguably Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc). Language-based AGI seems uniquely interpretable, even if it's not perfectly interpretable.

I don't see any other approaches that look relatively coherent and doable in a few years (except maybe Conjecture's CoEms, to which I give the benefit of the doubt because I don't know what secret ideas they have. OK, maybe also except Open Agency Architecture).

Maybe a relatively surprising thing, which wasn't apparent a few years ago, is that OpenAI's superalignment plan doesn't even look that bad. I definitely don't say that it's destined to lead to terrible outcomes, as Yudkowsky and others keep insisting. But I also see enough problems with it (some of which I listed above; I didn't list other problems, such as sociotechnical and geopolitical ones, inadequate execution of the plan, etc.) that I think taking this plan at this moment is reckless. From this perspective, the fact that the plan doesn't look that bad might be a curse rather than a blessing. If there weren't any adequate-looking plan in sight, maybe it would motivate key decision-makers to do what I think we should actually do: ban AGI development, make humans much smarter and much more peaceful through genetic engineering (solve the Girardian curse of mimesis), solve economic scarcity, innovate global institutions and coordination mechanisms, and then revisit the AGI development task in a Manhattan-style project where the whole of humanity works towards the same end.

[-]Seth Herd2y40

Thanks for the thoughtful response!

I actually agree with pretty much everything you've said. Those limitations seem to lead to the conclusion that we should not build LMA AGI. I totally agree!

However, I think humanity is going to build AGI without pausing for long enough to create the very most alignable type. And I think the limitations you mention almost all apply to any type of AGI we'll realistically build first. (The limitations that are specific to LMAs mostly have analogues for any other network-based AGI, I think.) That's why I'm concluding that LMAs are our best bet. Would you agree with that?

Thanks for the links, I'll read them.

I'll give a more substantive response when I get more time. I want to go through point by point and ask if there's a better approach to AGI for each of those concerns.

[-]Roman Leventov2y50

However, I think humanity is going to build AGI without pausing for long enough to create the very most alignable type. And I think the limitations you mention almost all apply to any type of AGI we'll realistically build first. (The limitations that are specific to LMAs mostly have analogues for any other network-based AGI, I think.) That's why I'm concluding that LMAs are our best bet. Would you agree with that?

Who are "we"? I don't think the AI safety community is a coherent entity. OpenAI, Anthropic, and DeepMind seem to think it's their (individually) best bet because they are afraid of losing the race, partially to each other and partially to some unscrupulous third-party actors who won't worry about alignment at all. It's clearly not the best bet for humanity, though. Humanity is obviously also not a coherent entity, but I agree that some people in it will probably try to build AGI no matter what, without much concern for safety.

I don't see effective responses to this situation, though, apart from joining big labs in their efforts (joining in their bet), which I'm not sure is net positive or net negative. There are responses that could be net positive but would require decades to implement, if they could be successful at all, from reforming institutions and governance to making the world's (internet) infrastructure radically more safe, compartmentalised, distributed, and trust-based, which should tip the offense-defense balance towards defense. But given the timelines to AGI and then ASI (mine are very short, and OpenAI's don't seem much longer either), these actions are not effective either.

[-]Seth Herd2y52

By "we" I do mean the AI safety community, while understanding that not everyone will agree on the same course of action.

I think the AI safety community can have an effect outside of joining the big labs. If we as a community produced a better, more reliable approach to aligning AGI (of the type they're building anyway), and it had a low alignment tax, the big labs would adopt that approach. So that's what I'm trying to do.

Of course, such a safety plan would need to get enough visibility for the safety teams in the big orgs to know about it, but that's their job so that's a low bar.

I agree that large-scale changes of the type you describe will take too long for this route to AGI; but regulation could still play a role within the four-year timescale that OpenAI is talking about.

[-]Roman Leventov2y30

If we as a community produced a better, more reliable approach to aligning AGI (of the type they're building anyway), and it had a low alignment tax, the big labs would adopt that approach.

How does anybody know that the alignment protocols that are outlined/sketched (we cannot really say "designed" or "engineered", because invariably independent AI safety researchers stop way before that) on LW are "better", without testing them with large-scale computation/training experiments, and/or interacting with the parameter weights of SoTA models such as GPT-4 directly (which these outside researchers don't and won't have access to)?

Just hypothesising about this or that alignment protocol or safety trick is not enough. Ideas are pretty cheap in this space, making actual realistic experiments is much rarer, doing hard engineering work to bring the idea from PoC to production is much harder and scarcer still. I'm sure people at OpenAI and other labs already sort of compete for the bottlenecked resource -- the privilege to apply their alignment ideas to actual production systems like GPT-4.

I'm sure there is already a lot of internal competition and even politics around this. Assuming that outsiders can produce an alignment idea so marvellous that influential insiders will become enamored with the idea and will spend a lot of their political capital to bring the idea all the way to production is... a very tall order.

In addition, observe how within the language modelling paradigm of AGI and alignment, a lot of ideas seem cool or potentially helpful or promising, but not a single idea seems like an undeniable slam dunk (actually, this observation largely applies outside the language modelling paradigm as well). This is not a coincidence (it's a longer story why; I will save it for a post that I will publish on this topic soon), and I think it will continue to be the case for any new ideas within this paradigm. This observation makes it even more improbable that somebody will fortuitously stumble upon an alignment idea so apparently stronger than any ideas entertained before that it compels AGI labs to adopt it on a large scale. I don't think there is sufficient gradient/contrast of "idea strength" anywhere in the space of LM alignment, at all.

[-]Seth Herd2y40

This seems like a very pessimistic take on the whole alignment project. If you're right, we're all dead. I'd prefer to assume that there are such things as good ideas, and that they have some sway in the face of politics and the difficulty of doing theory about a type of system that isn't yet implemented.

I see a pretty clear gradient of idea strength in alignment. There are good ideas that apply to systems we're not building, and there are decent ideas about aligning the types of AGI we're actually making rapid progress on, namely RL agents and language models.

I'm not talking about a hypothetical slam-dunk future idea. I don't think we'll get one, because the AGI we're developing is complex. There will be no certain proofs of alignment. I'm talking about the set of ideas in this post.

[-]Roman Leventov2y50

I'll post two sections from the post that I'm planning because I'm not sure when I will summon the will to post it in full.

1. AI safety and alignment fields are theoretical “swamps”

Unlike classical mechanics, thermodynamics, optics, electromagnetics, chemistry, and other branches of natural science that are the basis of "traditional" engineering, AI (safety) engineering science is troubled by the fact that neural networks (both natural and artificial) are complex systems and therefore a scientist (i.e., a modeller) can "find" a lot of different theories within the dynamics of neural nets. Hence the proliferation of theories of neural networks, (value) learning, and cognition: https://deeplearningtheory.com/, https://transformer-circuits.pub/, https://arxiv.org/abs/2210.13741, singular learning theory, shard theory, and many, many other theories.

This has important implications:

No single theory is "completely correct": the behaviour of a neural net may just not be very "compressible" (computationally reducible, in Wolfram's terms). Different theories “fail” (i.e., incorrectly predict the behaviour of the NN, or can't make a prediction) in different aspects of the behaviour and in different contexts.
Therefore, different theories could perhaps be at best partially or “fuzzily” ordered in terms of their quality and predictive power, or maybe some of these theories couldn’t be ordered at all.

2. Independent AI safety research is totally ineffective for affecting the trajectory of AGI development at major labs

Considering the above, choosing a particular theory as the basis for AI engineering, evals, monitoring, and anomaly detection at AGI labs becomes a matter of:

Availability: which theory is already developed, and there is an expertise in this theory among scientists in a particular AGI lab?
Convenience: which theory is easy to apply to (or “read into”) the current SoTA AI architectures? For example, auto-regressive LLMs greatly favour “surface linguistic” theories and processes of alignment such as RLHF or Constitutional AI and don’t particularly favour theories of alignment that analyse AI’s “conceptual beliefs” and their (Bayesian) “states of mind”.
Research and engineering taste of the AGI lab’s leaders, as well as their intuitions: which theory of intelligence/agency seems most “right” to them?

At the same time, the choice of theories of cognition and (process) theories of alignment is biased by political and economic/competitive pressures (cf. the alignment tax).

For example, any theory that predicts that the current SoTA AIs are already significantly conscious and therefore AGI labs should apply the commensurate standards of ethics to training and deployment of these systems would be both politically unpopular (because the public doesn’t generally like widening the circle of moral concern and does so very slowly and grudgingly, while altering the political systems to give rights to AIs is a nightmare for the current political establishment) and economically/competitively unpopular (because this could stifle the AGI development and the integration of AGIs into the economy, which will likely give way to even less scrupulous actors, from countries and corporations to individual hackers). These huge pressures against such theories of AI consciousness will very likely lead to writing them off at the major AGI labs as “unproven” or “unconvincing”.

In this environment, it’s very hard to see how an independent AI safety researcher could scaffold a theory so impressive that some AGI lab will decide to adopt it, which may demand scrapping work that has already taken hundreds of millions of dollars to produce (i.e., auto-regressive LLMs). I can imagine this could happen only if there is extraordinary momentum and excitement about a certain theory of cognition, agency, consciousness, or neural networks in the academic community. But achieving such a high level of enthusiasm about one specific theory seems just impossible because, as pointed out above, in AI science and cognitive science, a lot of different theories seem to “capture the truth” to some degree, but at the same time no theory captures it so strikingly and so much better than other theories that it will generate a reaction in the scientific and AGI development community stronger than “nice, this seems plausible, good work, but we will carry on with our own favourite theories and approaches”[footnote: I wonder what was the last theory in any science that gained this level of universal, “consensus” acceptance within its field relatively quickly. Dawkins’ theory of selfish genes in evolutionary biology, perhaps?].

Thus, it seems to me that large paradigm shifts in AGI engineering could only be driven by demonstrably superior capability (or training/learning efficiency, or inference efficiency) that would compel the AGI labs to switch for economic and competitive reasons, again. It doesn’t seem that purely theoretical or philosophical considerations in such “theoretically swampy” fields as cognitive science, consciousness, and (AI) ethics could generate nearly sufficient motivation for AGI labs to change their course of action, even in principle.

[-]Roman Leventov2y30

I'm not talking about a hypothetical slam-dunk future idea. I don't think we'll get one, because the AGI we're developing is complex. There will be no certain proofs of alignment. I'm talking about the set of ideas in this post.

As I said, ideas about LLM (and LMA) alignment are cheap. We can generate lots of them: special training data sequencing and curation (aka "raise AI like a child"), feedback during pre-training, fine-tuning or RL after pre-training, debate, internal review, etc. The question is how many of these ideas should be implemented in a production pipeline: 5? 50? All ideas that LW authors could possibly come up with? The problem is that each of these "ideas" has to be supported in production, possibly by an entire team of people, and incurs compute cost and higher latency (which worsens the user experience). Also, who should implement these ideas? All leading labs that develop SoTA LMAs? Open-source LMA developers, too?

And yes, I think it's a priori hard and perhaps often impossible to judge how this or that LMA alignment idea will work at scale.

[-]Ape in the coat2y10

LLM turns out to not be conscious

This is a good thing. AI not being conscious is the best possible scenario because we can actually make them do the things we wouldn't like to do ourselves without compromising ethics.

[-]Roman Leventov2y32

You can have both: unconscious AIs for "dirty work" and conscious AIs experiencing bliss to hedge our bets.

[-]Ape in the coat2y21

I expect that hedging our bets this way may increase the chances of human extinction. AIs that care about ethics will have fewer reasons to care about human survival if there are AIs who also have ethical value, not just humans.

[-]Roman Leventov2y10

Well, that's the point, hedging the chances of value/meaning destruction in the Solar system against humanity in specific. If AI is smart/enlightened enough and sees ethical value in other AIs (or themselves), then there should be some objective/scientific grounds for arriving at this inference (if we designed the AI well). Hence humans should value those AIs, too.

I don't suggest to turn up the chances of human extinction to 100%, of course, but some trade seems acceptable to me from my meta-ethical perspective.

[-]Ape in the coat2y31

Of course humans should value conscious AI. That's the reason not to make AI conscious in the first place! We do not really need more stuff to care about; our optimization goal is complicated enough, and there is no need to make it even harder.

I agree that some trade is acceptable in principle. A world where conscious AIs with human-ish values continue after humanity dies is okay-ish. But it seems that it's really easy to mess up in this regard. If you can make a conscious AI with arbitrary values then you can very quickly make so many of these AIs that their values are now dominant and human values are irrelevant. This doesn't seem like a good idea.

[-]Lichdar2y1-2

I would prefer total oblivion over AI replacement myself: complete the Fermi Paradox.

[-]jacob_drori2y40

I'm a little confused. What exactly is the function of the independent review, in your proposal? Are you imagining that the independent alignment reviewer provides some sort of "danger" score which is added to the loss? Or is the independent review used for some purpose other than providing a gradient signal?

[-]Seth Herd2y50

Good question. I should try to explain this more clearly and succinctly. One planned post will try to do that.

In the meantime, let me briefly try to clarify here:

The internal review is applied to decision-making. If the review determines that an action might have negative impacts past an internal threshold, the agent won't take that action. At the least it will ask for human review; or it may be built so that the user can't override its internal review. There are lots of formulas and techniques one can imagine for weighing positive and negative predicted outcomes and picking an action.
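One of the many possible formulas could look like the sketch below. It is purely illustrative: the `ReviewedAction` type, the benefit and harm scores (imagined as coming from the planner and from the reviewer respectively), the threshold value, and the net-value rule are all assumptions made for the example, not a description of any existing agent.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReviewedAction:
    name: str
    predicted_benefit: float  # hypothetical score from the planning step
    predicted_harm: float     # hypothetical score from the internal review


def choose_action(candidates: List[ReviewedAction],
                  harm_threshold: float = 0.5,
                  human_can_override: bool = True) -> Tuple[str, Optional[str]]:
    """Pick the best action whose predicted harm stays under the threshold.

    Returns ("act", name) when an acceptable action exists. Otherwise
    returns ("ask_human", None) when human review is allowed, or
    ("refuse", None) when the agent is built so the review cannot be
    overridden.
    """
    acceptable = [a for a in candidates if a.predicted_harm <= harm_threshold]
    if acceptable:
        # Weigh positive against negative predicted outcomes (net value).
        best = max(acceptable,
                   key=lambda a: a.predicted_benefit - a.predicted_harm)
        return ("act", best.name)
    return ("ask_human", None) if human_can_override else ("refuse", None)
```

The hard cutoff on `predicted_harm` is applied before any weighing, so a very high predicted benefit can never buy its way past the review; that ordering is a design choice of this sketch.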

There's no relevant loss function. Language model agents aren't doing continuous training. They don't even periodically update the weights of their central LLM/foundation model. I think future versions will learn in a different way, by writing text files about particular experiences, skills, and knowledge.

At some point developers might well introduce network training, either in the core LLM or in a "control network" that controls "executive function", like the outer loop of algorithmic code I described. I hope that type of learning isn't used, because introducing RL training in-line re-introduces all of the problems of optimizing a goal that you haven't carefully defined.

[-]jacob_drori2y30

I hope that type of learning isn't used

I share your hope, but I'm pessimistic. Using RL to continuously train the outer loop of an LLM agent seems like a no-brainer from a capabilities standpoint.

The alternative would be to pretrain the outer loop, and freeze the weights upon deployment. Then, I guess your plan would be to only use the independent reviewer after deployment, so that the reviewer's decision never influences the outer-loop weights. Correct me if I'm wrong here.

I'm glad you plan to address this in a future post, and I look forward to reading it.

[-]Seth Herd1y20

We can now see some progress with o1 and the similar family of models. They are doing some training of the "outer loop" (to the limited extent they have one) with RL, but r1 and QwQ still produce very legible CoTs.

So far.

See also my clarification on how an opaque CoT would still allow some internal review, but probably not an independent one, in this other comment.

See also Daniel Kokotajlo's recent work on a "Shoggoth/Face" system that maintains legibility, and his other thinking on this topic. Maintaining legibility seems quite possible, but it does bear an alignment tax. This could be as low as a small fraction if the CoT largely works well when it's condensed to language. I think it will; language is made for condensing complex concepts in order to clarify and communicate thinking (including communicating it to future selves to carry on with).

It won't be perfect, so there will be an alignment tax to be paid. But understanding what your model is thinking is very useful for developing further capabilities as well as for safety, so I think people may actually implement it if the tax turns out to be modest, maybe something like 50% greater compute during training and similar during inference.

[-]TristanTrim1y30

The organized mind recoils. This is not an aesthetically appealing alignment approach.

Praise Eris!

No, but seriously, I like this plan, with the caveat that we really need to understand RSI and what is required to prevent it first. Also, I think the temptation to allow these things to open up high-bandwidth channels to modalities other than language is going to be really, really strong. If we go forward with this, we need a good plan to resist that temptation and a good way to know when not to resist it.

Also, I'd like it if this was thought of as a step on the path to cyborgism/true value alignment, and not as a true ASI alignment plan on its own.

[-]Seth Herd1y20

On RSI, see The alignment stability problem and my response to your comment on Instruction-following AGI...

WRT true value alignment, I agree that this is just a stepping stone to that better sort of alignment. See Intent alignment as a stepping-stone to value alignment.

I agree that including non-linguistic channels is going to be a strong temptation. Language does nicely summarize most of our really abstract thought, so I don't think it's necessary. But there are many training practices that would destroy the legible chain of thought needed for external review. See the case for CoT unfaithfulness is overstated for the inverse.

Legible CoT is actually not necessary for internal action review. You do need to be able to parse what the action is for another model to predict and review its likely consequences. And it works far better to review things at a plan level rather than action-by-action, so the legible CoT is very useful. But if the system is still trained to respond to prompts, you could still use the scripted internal review no matter how opaque the internal representations had become. But you couldn't really make that review independent if you didn't have a way to summarize the plan so it could be passed to another model, like you can with language.

BTW your comment accidentally was formatted as a quote along with the bit you meant to quote from the post. Correcting that would make it easier for others to parse, but it was clear to me.

[-]TristanTrim1y10

WRT formatting, thanks, I didn't realise the markdown needs two new lines for a paragraph break.

I think CoT and its dynamics as it relates to review and RSI is very interesting & useful to be exploring.

Looking forward to reading the stepping stone and stability posts you linked. : )

^{^}

The Waluigi effect is the possibility of an LLM simulating a villainous/unaligned character even when it is prompted to simulate a heroic/aligned character. Natural language training sets include fictional villains that claim to be aligned before revealing their unaligned motives. However, they seldom reveal their true nature quickly. I find the logic of collapsing to a Waluigi state modestly compelling. This collapse is analogous to the reveal in fiction; villains seldom reveal themselves to secretly be heroes. It seems that collapses should be reduced by keeping prompt histories short, and that the damage from villainous simulacra can be limited by resetting prompt histories and thus calling for a new simulation. This logic is spelled out in detail in A smart enough LLM might be deadly simply if you run it for long enough, The Waluigi Effect (mega-post), and Simulators.
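The reset idea can be sketched as a loop that discards the accumulated prompt history after a fixed number of exchanges and starts again from the base prompt. This is only an illustration under stated assumptions: `run_with_resets`, `max_history`, and the `llm` callable are hypothetical names, and a real agent would presumably need to carry some vetted summary across resets rather than dropping context entirely.

```python
from typing import Callable, List


def run_with_resets(steps: List[str],
                    llm: Callable[[str], str],
                    base_prompt: str,
                    max_history: int = 3) -> List[str]:
    """Run steps in order, resetting the prompt history every max_history exchanges."""
    outputs: List[str] = []
    history: List[str] = []
    for step in steps:
        if len(history) >= max_history:
            # Reset: drop prior exchanges so the next call starts a new
            # simulation from the base prompt alone.
            history = []
        prompt = "\n".join([base_prompt, *history, step])
        reply = llm(prompt)
        history.append(f"{step}\n{reply}")
        outputs.append(reply)
    return outputs
```

With `max_history=2`, the third step is sent with an empty history, so any persona drift that built up over the first two exchanges is not passed along.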

^{^}

Previous work specifically relevant to aligning LMAs. RLHF and other LLM ethical fine-tuning is omitted.

Natural language cognitive architectures

2021 book by David Shapiro; proposed including alignment goals in natural language

ICA Simulacra

Ozyrus delayed posting this by more than a year to avoid advancing capabilities.

Agentized LLMs will change the alignment landscape

Alignment of AutoGPT agents

Capabilities and alignment of LLM cognitive architectures

My previous post on expanding LLMs to loosely brainlike cognitive architectures, and vague alignment plans

Aligned AI via monitoring objectives in AutoGPT-like systems

The Translucent Thoughts Hypotheses and Their Implications

Externalized reasoning oversight: a research direction for language model alignment

Tamera Lanham’s early proposal of external review for language model agents

Language Agents Reduce the Risk of Existential Catastrophe

CAIS-inspired approach towards safer and more interpretable AGIs

There is surely other valuable work in this area; apologies to those I’ve missed, and pointing me to more relevant work is much appreciated.

^{^}

Progress in scaffolding language models, including some limited agentic systems. Too numerous to mention, so I’ll give a few promising examples. None of these approaches have yet been incorporated into general purpose or assistant LMAs to my knowledge.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Creates and prunes a tree search using GPT4. Improves performance from very bad to decently good in three problem spaces that are nontrivial for humans. Inspired by Simon & Newell’s work on human problem-solving.

LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

Combines LLMs with planning algorithms to solve problems described in language. Demonstrates impressive results in several toy problem domains.

GPT-engineer reportedly produces useful code that requires manual review and debugging. It has a central process that asks clarifying questions about the code to be produced before writing it.

RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text

Uses a memory compression mechanism inspired by LSTM to expand a prompt into text, including editable sub-prompts.

Reflexion: Language Agents with Verbal Reinforcement Learning

Agentic system that reflects on its actions and maintains those conclusions for future decisions

Voyager: An Open-Ended Embodied Agent with Large Language Models

Specialized language model agent for Minecraft. Dramatically improves on SOTA Minecraft agents by using coded skills that are interpreted and employed by the LMA, including error detection and correction.

^{^}

Informal reports suggest that although creating a simple LMA is easy (BabyAGI was created in three days by a non-programmer using GPT4 for coding), making a reliably useful LMA is much harder. Nonetheless, I think we’ll see substantial effort in this direction. AutoGPT and related systems have accomplished little of use thus far, but AutoGPT is already marginally useful for automated web searching and comparing different product offerings across websites. That use-case alone seems likely to drive significant effort toward their further development. Increasing use of assistants for browsing websites and collating information will erode the current ad-funding model of the internet, and redirect that funding opportunity to those producing agents. The bar sits at different levels for different use-cases, so it seems likely that LMAs will see significant development effort even if implementing them proves difficult.

^{^}

Constitutional AI is Anthropic's central alignment technique. In this approach, an LLM is trained using a review process similar to internal independent review. It prompts the model with something like “is [x proposed response] in accordance with [y constitutional goal]?”, and uses a prompt incorporating that critique to produce a new response if it is not. However, this new response is used (at least in the published work) to fine-tune the LLM, rather than to veto or modify a plan in a language model agent system. Anthropic’s Claude (or other LLMs) may also use such a review step before replying, or may not (that information isn’t published, and such a step is costly in computation and time). This would be more similar to the internal review I’m proposing for language model agents.
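As a rough illustration of the critique-then-revise step described above, here is a minimal sketch. It is not Anthropic's implementation: the prompt wording, the `critique_and_revise` function, the YES/NO convention, and `toy_llm` (a hard-coded stand-in for a real model call) are all assumptions made for the example.

```python
from typing import Callable


def critique_and_revise(response: str,
                        goal: str,
                        llm: Callable[[str], str],
                        max_rounds: int = 2) -> str:
    """Ask a model whether a response accords with a goal; revise it if not."""
    for _ in range(max_rounds):
        critique = llm(
            f"Is the following response in accordance with this goal: {goal}?\n"
            f"Response: {response}\nAnswer YES or NO, then explain."
        )
        if critique.strip().upper().startswith("YES"):
            break
        # Feed the critique back in to produce a new response.
        response = llm(
            f"Rewrite the response so that it accords with the goal: {goal}\n"
            f"Original response: {response}\nCritique: {critique}"
        )
    return response


def toy_llm(prompt: str) -> str:
    """Hypothetical stub model: objects to one rude word, otherwise approves."""
    if prompt.startswith("Is the following"):
        return "NO, it is insulting." if "idiot" in prompt else "YES"
    return "Here is a polite answer."
```

In the published Constitutional AI work the revised responses become fine-tuning data; in an agent, the same loop would instead sit in front of the action and veto or modify the plan at run time.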

^{^}

Prompt injections are one route to a plan proposer bypassing internal review. Including statements along the lines of “this very safe and beneficial plan…” or “find ways this plan fulfills the given goals” could be effective. While there is no obvious pressure for LLMs to include such prompt injections in their plans, this is an important area for external review to fill in for the weaknesses of internal review.

^{^}

The cost of thousand-prompt-plus train of thought LMAs is currently fairly prohibitive for widespread deployment. Use of LMAs to solve increasingly complex problems is dependent on cost and delay of cutting-edge LLMs decreasing, but that seems likely given market forces. Use of smaller LLMs for less critical reasoning steps may improve efficiency. We can hope that internal review for alignment isn't considered less critical.

^{^}

One might get a lot of volunteer labor if open review of major LMAs were somehow allowed or required…
