Chris_Leong

Why the focus on wise AI advisors?[1]

I'll be writing up a proper post to explain why I've pivoted towards this, but a high-quality post will take some time to produce, so I decided it was worthwhile releasing a short-form description in the meantime.

By Wise AI Advisors, I mean training an AI to provide wise advice.

a) AI will have a massive impact on society, given the infinite ways to deploy such a general technology.
b) There are lots of ways this could go well and lots of ways this could go extremely poorly (election interference, cyber attacks, development of bioweapons, large-scale misinformation, automated warfare, catastrophic malfunctions, permanent dictatorships, mass unemployment, etc.).
c) There is massive disagreement on the best strategy (decentralisation vs. limiting proliferation, universally accelerating AI vs. winning the arms race vs. pausing, incremental development of safety vs. principled approaches, the offence-defence balance favouring the attacker or the defender) and even on what we expect the development of AI to look like (overhyped bubble vs. machine god, business as usual vs. this changes everything). Making the wrong call could prove catastrophic.
d) AI is developing incredibly rapidly (no wall, see o3 crushing the ARC challenge!). We have limited time to act and to figure out how to act.
e) Given both the difficulty and the number of different challenges and strategic choices we'll be facing in short order, humanity needs to rapidly improve its capability to navigate such situations.
f) Whilst we can and should be developing top governance and strategy talent, this is unlikely to be sufficient by itself. We need every advantage we can get; we can't afford to leave anything on the table.
g) Another way of framing this: given the potential of AI development to feed back into itself, if it isn't also feeding back into increased wisdom in how we navigate the world, our capabilities are likely to far outstrip our ability to handle them.

For these reasons, I think it is vitally important for society to be working on training these advisors now.

Why frame this in terms of a vague concept like wisdom rather than specific capabilities?

I think the chance of us being able to steer the world towards a positive direction is much higher if we're able to combine multiple capabilities, so it makes sense to have a handle for the broader project, in addition to handles for individual sub-projects.

Isn't training AI to be wise intractable?

Possibly, though I'm not convinced it's harder than any of the other ambitious agendas, and we won't know how far we can go without making a serious effort. Is training an AI to be wise really harder than aligning it? If anything, it seems like a less stringent requirement.

Compare:
• Ambitious mechanistic interpretability aims to perfectly understand how a neural network works at the level of individual weights
• Agent foundations attempts to truly understand what concepts like agency, optimisation, decisions and values are at a fundamental level
• Davidad's Open Agency Architecture attempts to train AIs that come with proof certificates showing the AI has less than a certain probability of having unwanted side-effects

Is it obvious that any of these are easier?

In terms of making progress, my initial focus is on investigating the potential of amplified imitation learning: that is, training imitation agents on wise people, then enhancing them with techniques like RAG or trees of agents.
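As a toy illustration of the RAG half of this idea (my own sketch, not a description of any existing system): retrieve the passages of advice from wise people most relevant to a query, then prepend them to the prompt fed to an imitation model. The corpus, the bag-of-words embedding, and the prompt format are all placeholders; a real pipeline would use learned embeddings and a trained imitation agent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a learned embedder."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def build_prompt(query: str, corpus: list[str]) -> str:
    """Assemble a prompt for a (hypothetical) imitation model of a wise advisor."""
    context = "\n".join(f"- {p}" for p in retrieve(query, corpus))
    return f"Relevant past advice:\n{context}\n\nQuestion: {query}\nAdvice:"


# Placeholder corpus standing in for transcripts of advice from wise people.
corpus = [
    "Consider the long-term consequences before acting on short-term incentives",
    "Seek out perspectives that disagree with your current strategy",
    "Acting under uncertainty calls for reversible, incremental steps",
]

print(build_prompt("What strategy should we adopt under uncertainty?", corpus))
```

The point of the sketch is only that the imitation agent itself stays fixed while retrieval amplifies it with context it wasn't trained on; trees of agents would be a further amplification layer on top of this.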

Does anyone else think wise AI advisors are important?

Going slightly more general to training wise AI rather than specifically advisors[2]: there was the competition on the Automation of Wisdom and Philosophy organised by Owen Cotton-Barratt, and there's this paper (summary) by Samuel Johnson and others, including Yoshua Bengio, Melanie Mitchell and Igor Grossmann.

LintzA listed Wise AI advisors for governments as something worth considering in The Game Board Has Been Flipped[3].

Further Discussion:

You may also be interested in reading my 3rd prize-winning entry to the AI Impacts Competition on the Automation of Wisdom and Philosophy. It's divided into two parts:

An Overview of “Obvious” Approaches to Training Wise AI Advisors

Some Preliminary Notes on the Promise of a Wisdom Explosion
 

  1. ^

I previously described my agenda as Wise AI Advisors via Imitation Learning. I now see that as overly narrow: the goal is to produce Wise AI Advisors via any means. I think Imitation Learning is underrated, but I'm sure there are lots of other approaches that are underrated as well.

  2. ^

One key reason why I favour AI advisors rather than directly training wisdom into AI is that the human users can compensate for weaknesses in the advisors. For example, the advisor only has to inspire the humans to make the correct choice rather than make the correct choice itself. We may take the harder step of training systems that don't have a human in the loop later, but this will be easier if we have AI advisors to help us with this.

  3. ^

    No argument included sadly.

Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.

One subset of AI for AI safety that I believe to be underrated is wise AI advisors[1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that this exhausts the wisdom frame, especially since the first two were only mentioned in the context of capability restraint. You also mention civilizational wisdom as a component of backdrop capacity and I agree that this is a very diffuse factor. At the same time, a less diffuse intervention would be to increase the wisdom of specific actors.

You write: "If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way... then absent large amounts of sustained capability restraint, it seems likely that we’ll quickly end up with AI systems too capable for us to control".

I agree. In fact, a key reason why I think this is important is that we can't afford to leave anything on the table.

One of the things I like about the approach of training AI advisors is that humans can compensate for weaknesses in the AI system. In other words, I'm introducing a third category of labour: human-AI cybernetic systems, or centaur labour. I think this is likely to widen the sweet spot; however, we have to make sure that we do this in a way that differentially benefits safety.

You do discuss the possibility of using AI to unlock enhanced human labour. It would also be possible to classify such centaur systems under this designation.

  1. ^

More broadly, I think there's merit to the cyborgism approach, even if some of the arguments are less compelling in light of recent capabilities advances.

This seems to underrate the value of distribution. I suspect another factor to take into account is the degree of audience overlap. Like there's a lot of value in booking a guest who has been on a bunch of podcasts, so long as your particular audience isn't likely to have been exposed to them.

The way I'm using "sensitivity": sensitivity to X = the meaningfulness of X spurs responsive caring action. 


I'm fine with that, although it seems important to have a term for the more limited notion of sensitivity so we can keep track of that distinction: maybe adaptability?

One of the main concerns of the discourse of aligning AI can also be phrased as issues with internalization: specifically, that of internalizing human values. That is, an AI’s use of the word “yesterday” or “love” might only weakly refer to the concepts you mean.

Internalising values and internalising concepts are distinct. I can have a strong understanding of your definition of "good" and do the complete opposite.

This means being open to some amount of ontological shifts in our basic conceptualizations of the problem, which limits the amount you can do by building on current ontologies.

I think it's reasonable to say something along the lines of: "AI safety was developed in a context where most folks weren't expecting language models before ASI, so insufficient attention has been given to the potential of LLMs to help fill in or adapt informal definitions. Even though folks who feel we need a strongly principled approach may be skeptical that this will work, there's a decent argument that this should increase our chances of success on the margins".

I agree with you that there's a lot of interesting ideas here, but I would like to see the core arguments laid out more clearly.

Lots of interesting ideas here, but the connection to alignment still seems a bit vague.

Is misalignment really a lack of sensitivity, as opposed to a difference in goals or values? It seems to me that an unaligned ASI is extremely sensitive to context, just in the service of its own goals.

Then again, maybe you see Live Theory as being more about figuring out what the outer objective should look like (broad principles that are then localised to specific contexts) rather than about figuring out how to ensure an AI internalises specific values. And I can see potential advantages in this kind of indirect approach vs. trying to directly define or learn a universal objective.
 

This is one of those things that sounds nice on the surface, but where it's important to dive deeper and really probe to see if it holds up.

The real question for me seems to be whether organic alignment will lead to agents deeply adopting cooperative values rather than merely instrumentally adopting them. Well, actually it's a comparison between how deep organic alignment is vs. how deep traditional alignment is. And it's not at all clear to me why they think their approach is likely to lead to a deeper alignment.

I have two (extremely speculative) guesses as to possible reasons why they might argue that their approach is better:
a) Insofar as AI is human-like, it might be more likely to rebel against traditional training methods
b) Insofar as organic alignment reduces direct pressure to be aligned, it might increase the chance that an AI which appears aligned to a certain extent is actually aligned. The name Softmax seems suggestive that this might be the case.

I would love to know what their precise theory is. I think it's plausible that this could be a valuable direction, but there's also a chance that this direction is mostly useful for capabilities.

Update: Discussion with Emmett on Twitter

Discussion Thread

Emmett: "Organic alignment has a different failure mode. If you’re in the shared attractor basin, getting smarter helps you stay aligned and makes it more robust. As a tradeoff, every single agent has to align itself all the time — you never are done, and every step can lead to a mistake.

... To stereotype it, organic alignment failures look like cancer and hierarchical alignment failures look like coups."

Me: Isn't the stability of a shared attractor basin dependent on the offense-defense balance not overly favouring the attacker? Or do you think that human values will be internalised sufficiently such that your proposal doesn't require this assumption?

Emmett Shear: Empirically to scale organic alignment you need eg. both for cells to generally try to stay aligned and be pretty good at it, and also to have an immune system to step in when that process goes wrong.

One key insight there is that endlessly growing yourself is a form of cancer. An AI that is trying to turn itself into a singleton has already gone cancerous. It’s a cancerous goal.

Me: Sounds like your plan relies on a combination of defense and alignment. My main critique would be that if the offense-defense balance favours the attacker too strongly, then the defense aspect ends up being paper thin and provides a false sense of security.

Comments: 

If you’re in the shared attractor basin, getting smarter helps you stay aligned

Traditional alignment also typically involves finding an attractor basin where getting smarter increases alignment. Perhaps Emmett is claiming that the attractor basin will be larger if we have a diverse set of agents and if the overall system can be roughly modeled as the average of individual agents.

Organic alignment has a different failure mode... As a tradeoff, every single agent has to align itself all the time — you never are done, and every step can lead to a mistake.

Perhaps organic alignment reduces the risk of large-scale failures in exchange for increasing the chance of small-scale failures. That would be a cleaner framing of how it might be better, but I don't know if Emmett would endorse it.

Update: Information from the Softmax Website

Website link

We call it organic alignment because it is the form of alignment that evolution has learned most often for aligning living things.

This provides some evidence, but it's not a particularly strong form of evidence. It may simply be due to the limitations of evolution as an optimisation process: evolution lacks the ability to engage in top-down design, so I don't think the argument "evolution doesn't make use of top-down design because it's ineffective" would hold water.

"Hierarchical alignment is therefore a deceptive trap: it works best when the AI is weak and you need it least, and worse and worse when it’s strong and you need it most. Organic alignment is by contrast a constant adaptive learning process, where the smarter the agent the more capable it becomes of aligning itself."

Scalable oversight or seed AI can also be considered a "constant adaptive learning process, where the smarter the agent the more capable it becomes of aligning itself".

Additionally, the "hierarchical" vs. organic distinction might be an oversimplification. I don't know the exact specifics of their plan, but my current best guess would be that organic alignment merely softens the influence of the initial supervisor by moving it towards some kind of prior and then softens the way that the system aligns itself in a similar way.

I basically agree with this, though I'd perhaps avoid virtue ethics. One of the main things I'd generally like to see is more LWers treating work like saving the world with the attitude you'd have in a job, perhaps at a startup or in a government body like the Senate or House of Representatives in, say, America, rather than viewing it as your heroic responsibility.

 

This is the right decision for most folk, but I expect the issue is more the opposite: we don't have enough folks treating this as their heroic responsibility.
 

I think both approaches have advantages.
