

Is that ‘deceptive alignment’? You tell me.

I don't think it makes sense to classify every instance of this as deceptive alignment - and I don't think this is the usual use of the term.

I think that to say "this is deceptive alignment" is generally to say something like "there's a sense in which this system has a goal different from ours, is modeling the selection pressure it's under, anticipating that this selection pressure may not exist in the future, and adapting its behaviour accordingly".

That still leaves things underdefined, e.g. since this can all happen implicitly and/or without the system knowing this mechanism exists.
However, if you're not suggesting in any sense that [anticipation of potential future removal of selection pressure] is a big factor, then it's strange to call it deceptive alignment.

I assume Wiblin means it in this sense - not that this is the chance we get catastrophically bad generalization, but rather that it happens via a mechanism he'd characterize this way.

[I'm now less clear that this is generally agreed, since e.g. Apollo seem to be using a foolish-to-my-mind definition here: "When an AI has Misaligned goals and uses Strategic Deception to achieve them" (see "Appendix C - Alternative definitions we considered" for clarification).
This is not close to the RFLO definition, so I really wish they wouldn't use the same name. Things are confusing enough without our help.]

All that said, it's not clear to me that [deceptive alignment] is a helpful term or target, given that there isn't a crisp boundary, and that there'll be a tendency to tackle an artificially narrow version of the problem.
The rationale for solving it usually seems to be [if we can solve/avoid this subproblem, we'd have instrumentally useful guarantees in solving the more general generalization problem] - but I haven't seen a good case made that we get the kind of guarantees we'd need (e.g. knowing only that we avoid explicit/intentional/strategic... deception of the oversight process is not enough).
It's easy to motte-and-bailey ourselves into trouble.

This seems great in principle.
The below is meant in the spirit of [please consider these things while moving forward with this], and not [please don't move forward until you have good answers on everything].

That said:

First, I think it's important to clearly distinguish:

  1. A great world would have a lot more AI safety orgs. (yes)
  2. Conditional on many new AI safety orgs starting, the world is in a better place. (maybe)
  3. Intervening to facilitate the creation of new AI safety orgs makes the world better. (also maybe)

This program would be doing (3), so it's important to be aware that (1) is not in itself much of an argument. I expect that it's very hard to do (3) well, and that even a perfect version doesn't allow us to jump to the (1) of our dreams. But I still think it's a good idea!

Some thoughts that might be worth considering (very incomplete, I'm sure):

  1. Impact of potential orgs will vary hugely.
    1. Your impact will largely come down to [how much you increase (/reduce) the chance that positive (/negative) high-impact orgs get created].
    2. This may be best achieved by aiming to create many orgs. It may not.
      1. Of course [our default should be zero new orgs] is silly, but so would be [we're aiming to create as many new orgs as possible].
      2. You'll have a bunch of information, time and levers that funders don't have, so I don't think such considerations can be left to funders.
    3. In the below I'll be mostly assuming that you're not agnostic to the kind of orgs you're facilitating (since this would be foolish :)). However, I note that even if you were agnostic, you'd inevitably make choices that imply significant tradeoffs.
  2. Consider the incentive landscape created by current funding sources.
    1. Consider how this compares to a highly-improved-by-your-lights incentive landscape.
    2. Consider what you can do to change things for the better in this regard.
      1. If anything seems clearly suboptimal as things stand, consider spending significant effort making this case to funders as soon as possible.
      2. Consider what data you could gather (anonymized appropriately) on potential failure modes, or simply on dynamics that are non-obvious at the outset. Gather as much data as possible.
        1. If you don't have the resources to do a good job at experimentation, data gathering etc., make this case to funders and get those resources. Make the case that the cost of this is trivial relative to the opportunity cost of failing to gather the information.
  3. The most positive-for-the-world orgs are likely among the hardest to create.
    1. By default, orgs created are likely to be doing not-particularly-neglected things (similar selection pressures that created the current field act on new orgs; non-neglected areas of the field correlate positively with available jobs and in-demand skills...).
    2. By default, it's much more likely to select for [org that moves efficiently in some direction] than [org that picks a high-EV-given-what's-currently-known direction].
      1. Given that impact can easily vary by a couple of orders of magnitude (and can be negative), direction is important.
      2. It's long-term direction that's important. In principle, an org that moves efficiently in some direction could radically alter that direction later. In practice, that's uncommon - unless this mindset existed at the outset.
        1. Perhaps facilitating this is another worthwhile intervention?? - i.e. ensuring that safety orgs have an incentive to pivot to higher-EV approaches, rather than to continue with a [low EV-relative-to-counterfactual, but high comparative advantage] approach.
    3. Making it easier to create any kind of safety org doesn't change the selection pressures much (though I do think it's a modest improvement). If all the branches are a little lower, it's still the low-hanging fruit that tends to be picked first. It may often be easier to lower the low branches too.
      1. If possible, you'd want to disproportionately lower the highest branches. Clearly this is easier said than done. (e.g. spending a lot of resources on helping those with hard-to-make-legible ideas achieve legibility, [on a process level, if necessary], so that there's not strong selection for [easy to make legible])
  4. Ground truth feedback on the most important kinds of progress is sparse-to-non-existent.
    1. You'll be using proxies (for [what seems important], [what impact we'd expect org x to have], [what impact direction y has had], [what impact new org z has had] etc. etc.).
      1. Most proxies aren't great.
      2. The most natural proxies and metrics will tend to be the same ones others are using. This may help to get a project funded. It tends to act against neglectedness.
      3. Using multiple, non-obvious proxies is worth a thought.
        1. However, note that you don't have the True Name of [AI safety], [alignment]... in your head: you have a vague, confused proxy.
        2. One person coming up with multiple proxies will tend to mean creating various proxies to their own internal proxy. That's still a single point of failure.
        3. If you all clearly understand the importance of all the proxies you're using, that's probably a bad sign.
  5. It's much better to create a great org slowly than a mediocre org quickly. The latter can easily happen with (some of) the same people, entailing a high opportunity cost.
    • I think one of the most harmful dynamics at present is the expectation that people/orgs should have a concretely mapped out agenda/path-to-impact within a few months. This strongly selects against neglectedness.
    • Even Marius' response to this seems to have the wrong emphasis:
      "Second, a great agenda just doesn't seem like a necessary requirement. It seems totally fine for me to replicate other people’s work, extend existing agendas, or ask other orgs if they have projects to outsource (usually they do) for a year or so and build skills during that time. After a while, people naturally develop their own new ideas and then start developing their own agendas."
      I.e. that the options are:
      1. Have a great agenda.
      2. Replicate existing work, extend existing agenda, grab existing ideas to work on.
    • Where is the [spend time focusing on understanding the problem more deeply, and forming new ideas / approaches]? Of course this may sometimes entail some replication/extension, but that shouldn't be the motivation.
    • Financial pressures and incentives are important here: [We'll fund you for six months to focus on coming up with new approaches] amounts to [if you pick a high-variance approach, your org may well cease to exist in six months]. If the aim is to get an org to focus on exploration for six months, guaranteed funding for two years is a more realistic minimum.
      • Of course this isn't directly within your control - but it's the kind of thing you might want to make a case for to funders.
      • Again, the more you're able to shape the incentive landscape for future orgs, the more you'll be able to avoid unhelpful instrumental constraints, and focus on facilitating the creation of the kind of orgs that should exist.
      • Also worth considering that the requirement for this kind of freedom is closer to [the people need near-guaranteed financial support for 2+ years]. Where an org is uncertain/experimental, it may still make sense to give the org short-term funding, but all the people involved medium-term funding.

That's my guess too, but I'm not highly confident in the [no attractors between those two] part.

It seems conceivable to have a not-quite-perfect alignment solution with a not-quite-perfect self-correction mechanism that ends up orbiting utopia, but neither getting there, nor being flung off into oblivion.

It's not obvious that this is an unstable, knife-edge configuration. It seems possible to have correction/improvement be easier at a greater distance from utopia. (whether that correction/improvement is triggered by our own agency, or other systems)

If stable orbits exist, it's not obvious that they'd be configurations we'd endorse (or that the things we'd become would endorse them).

Anyway, overall I'd be surprised if it didn't help substantially to have more granular estimates.

Oh, I'm certainly not claiming that no-one should attempt to make the estimates.

I'm claiming that, conditional on such estimation teams being enshrined in official regulation, I'd expect their results to get misused. Therefore, I'd rather that we didn't have official regulation set up this way.

The kind of risk assessments I think I would advocate would be based on the overall risk of a lab's policy, rather than their immediate actions. I'd want regulators to push for safer strategies, not to run checks on unsafe strategies - at best that seems likely to get us a local minimum (and, as ever, overconfidence).
More [evaluate the plan to get through the minefield], and less [estimate whether we'll get blown up on the next step]. (importantly, it won't always be necessary to know which particular step forward is more/less likely to be catastrophic, in order to argue that an overall plan is bad)

It's important not to ignore that this speech is to the general public.
While I agree that "in the most unlikely but extreme cases" is not accurate, it's not clear that this reflects the views of the PM / government, rather than what they think it's expedient to say.

Even if the government took the risk fully seriously, and put doom at 60%, I don't think the PM would say that in a speech.

The speech is consistent with [not quite getting it yet], but also consistent with [getting it, but not thinking it's helpful to say it in a public speech]. I'm glad Eliezer's out there saying the unvarnished truth - but it's less clear that this would be helpful from the prime minister.

It's worth considering the current political situation: the Conservatives are very likely to lose the next election (due no later than January 2025 - though elections are often called early [this lets the governing party pick their moment, keep the element of surprise, and look like calling the election was a positive choice]).
Being fully clear about the threat in public could be perceived as political desperation. So far, the issue hasn't been politicized. If not coming out with the brutal truth helps with that, it's likely a price worth paying. In particular, it doesn't help if the UK government commits to things that Labour will scrap as soon as they get in.

Perhaps more importantly from his point of view, he'll need support from within his own party over the next year - if he's seen as sabotaging the Conservatives' chances in the next election by saying anything too weird / alarmist-seeming / not-playing-to-their-base, he may lose that.

Again, it's also consistent with not quite getting it, but that's far from the only explanation.

We could do a lot worse than Rishi Sunak followed by Keir Starmer.
Relative to most plausible counterfactuals, we seem to have gotten very lucky here.

Thanks for clarifying your views. I think it's important.

"consensus around conditional pauses..."

My issue with this is that it's empty unless the conditions commit labs to taking actions they otherwise wouldn't. Anthropic's RSP isn't terrible, but I think a reasonable summary is "Anthropic will plan ahead a bit, take the precautions they think make sense, and pause when they think it's a good idea".

It's a commitment to take some actions that aren't pausing - defining ASL4 measures, implementing ASL3 measures that they know are possible. That's nice as far as it goes. However, there's nothing yet in there that commits them to pause when they don't think it's a good idea.

They could have included such conditions, even if they weren't concrete, and wouldn't come into play until ASL4 (e.g. requiring that particular specifications or evals be approved by an external board before they could move forward). That would have signaled something. They chose not to.

That might be perfectly reasonable, given that it's unilateral. But if (even) Anthropic aren't going to commit to anything with a realistic chance of requiring a lengthy pause, that doesn't say much for RSPs as conditional pause mechanisms.

The transparency probably does help to a degree. I can imagine situations where greater clarity in labs' future actions might help a little with coordination, even if they're only doing what they'd do without the commitment.

Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate.

This seems a reasonable criticism only if it's a question of [improvement with downside] vs [status-quo]. I don't think the RSP critics around here are suggesting that we throw out RSPs in favor of the status-quo, but that we do something different.

It may be important to solve x, but it's also important that we don't prematurely believe we've solved x. This applies to technical alignment, and to alignment regulation.

Things being "confused for sufficient progress" isn't a small problem: this is precisely what makes misalignment an x-risk.

Initially, communication around RSPs was doing a bad job of making their insufficiency clear.
Evan's, Paul's and your posts are welcome clarifications - but such clarifications should be in the RSPs too (not as vague, easy-enough-to-miss caveats).

That's reasonable, but most of my worry comes back to:

  1. If the team of experts is sufficiently cautious, then it's a trivially simple calculation: a step beyond GPT-4 + unknown unknowns = stop. (whether they say "unknown unknowns so 5% chance of 8 billion deaths", or "unknown unknowns so 0.1% chance of 8 billion deaths" doesn't seem to matter a whole lot)
    1. I note that 8 billion deaths seems much more likely than 100 million, so the expectation of "1% chance of over 100 million deaths" is much more than 1 million.
  2. If the team of experts is not sufficiently cautious, and come up with "1% chance of OpenAI's next model causing over 100 million deaths" given [not-great methodology x], my worry isn't that it's not persuasive that time. It's that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we'll be screwed.
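The arithmetic behind (1.1) is simple enough to spell out. A minimal sketch, using illustrative placeholder numbers rather than real estimates:

```python
# Illustrative only: the probabilities are placeholders, not real estimates.
p_catastrophe = 0.01  # "1% chance of over 100 million deaths"

# Reading the 100M figure literally gives a floor on expected deaths:
floor_expectation = p_catastrophe * 100e6   # 1 million

# But if deaths-conditional-on-catastrophe are closer to 8 billion,
# the same 1% implies a far larger expectation:
likely_expectation = p_catastrophe * 8e9    # 80 million
```

I.e. taking the "1% chance of over 100 million deaths" framing at face value already gives at least a million expected deaths, and nearer 80 million if catastrophe plausibly means everyone.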

In part, I'm worried that the argument for (1) is too simple - so that a forecasting team might put almost all the emphasis elsewhere, producing a 30-page report with 29 essentially irrelevant pages. Then it might be hard to justify coming to the same conclusion once the issues on 29 out of 30 pages are fixed.

I'd prefer to stick to the core argument: a powerful model and unknown unknowns are sufficient to create too much risk. The end. We stop until we fix that.

The only case I can see against this is [there's a version of using AI assistants for alignment work that reduces overall risk]. Here I'd like to see a more plausible positive case than has been made so far. The current case seems to rely on wishful thinking (it's more specific than the one sentence version, but still sketchy and relies a lot on [we hope this bit works, and this bit too...]).

However, I don't think Eliezer's critique is sufficient to discount approaches of this form, since he tends to focus on the naive [just ask for a full alignment solution] versions, which are a bit strawmannish. I still think he's likely to be essentially correct - that to the extent we want AI assistants to be providing key insights that push research in the right direction, such assistants will be too dangerous; to the extent that they can't do this, we'll be accelerating a vehicle that can't navigate.

[EDIT: oh and of course there's the [if we really suck at navigation, then it's not clear a 20-year pause gives us hugely better odds anyway] argument; but I think there's a decent case that improving our ability to navigate might be something that it's hard to accelerate with AI assistants, so that a 5x research speedup does not end up equivalent to having 5x more time]

But this seems to be the only reasonable crux. This aside, we don't need complex analyses.

it relies on evals that we do not have

I agree that this is a problem, but it strikes me that we wouldn't necessarily need a concrete eval - i.e. we wouldn't need [by applying this concrete evaluation process to a model, we can be sure we understand it sufficiently].

We could have [here is a precise description of what we mean by "understanding a model", such that we could, in principle, create an evaluation process that answers this question].

We can then say in an RSP that certain types of model must pass an understanding-in-this-sense eval, even before we know how to write an understanding-in-this-sense eval. (though it's not obvious to me that defining the right question isn't already most of the work)

Personally, I'd prefer that this were done already - i.e. that anything we think is necessary should be in the RSP at some level of abstraction / indirection. That might mean describing properties an eval would need to satisfy. It might mean describing processes by which evals could be approved - e.g. deferring to an external board. [Anthropic's Long Term Benefit Trust doesn't seem great for this, since it's essentially just Paul who'd have relevant expertise (?? I'm not sure about this - it's just unclear that any of the others would)]

I do think it's reasonable for labs to say that they wouldn't do this kind of thing unilaterally - but I would want them to push for a more comprehensive setup when it comes to policy.

Oh I didn't mean only to do it afterwards. I think before is definitely required to know the experiment is worth doing with a given setup/people. Afterwards is nice-to-have for Science. (even a few blitz games is better than nothing)

Oh that's cool - nice that someone's run the numbers on this.
I'm actually surprised by quite how close to 50% both backgammon and poker are.
