Glad to see you're working on this. It seems even more clearly correct (the goal, at least :)) for not-so-short timelines. Less clear how best to go about it, but I suppose that's rather the point!

A few thoughts:

  1. I expect it's unusual that [replace methodology-1 with methodology-2] will be a Pareto improvement: other aspects of a researcher's work will tend to have adapted to fit methodology-1. So I don't think the creation of some initial friction is a bad sign. (also mirrors therapy - there's usually a [take things apart and better understand them] phase before any [put things back together in a more adaptive pattern] phase)
    1. It might be useful to predict this kind of thing ahead of time, to develop a sense of when to expect specific side-effects (and/or predictably unpredictable side effects).
  2. I do think it's worth interviewing at least a few carefully selected non-alignment researchers. I basically agree with your alignment-is-harder case. However, it also seems most important to be aware of things the field is just completely missing.
    1. In particular, this may be useful where some combination of cached methodologies is a local maximum for some context. Knowing something about other hills seems useful here.
      1. I don't expect it'd work to import full sets of methodologies from other fields, but I do expect there are useful bits-of-information to be had.
    2. Similarly, if thinking about some methodology x that most alignment researchers currently use, it might be useful to find and interview other researchers that don't use x. Are they achieving [things-x-produces] in other ways? What other aspects of their methodology are missing/different?
      1. This might hint both at how a methodology change may impact alignment researchers, and how any negative impact might be mitigated.
  3. Worth considering that there's less of a risk in experimenting (kindly, that is) on relative newcomers than on experienced researchers. It's a good idea to get a clear understanding of the existing process of experienced researchers. However, once we're in [try this and see what happens] mode there's much less downside with new people - even abject failure is likely to be informative, and the downside in counterfactual object-level research lost is much smaller in expectation.

Something I'd add to "plan for a wide variety of scenarios" is to look for solutions that do not refer to those scenarios. A solution involving [and here we test for deceptive alignment (DA)] is going to generalise badly (even assuming such a test could work), but so too will a solution involving [and here we test for DA, x, y, z, w...].

This argues for not generating all our problem scenarios ahead of time: it's useful to have a test set. If the solution I devise after only thinking about x and y also works for z and w, then I have higher confidence in it than if I'd generated x, y, z, w before I started looking for a solution.

For this reason, I'm not so pessimistic about putting a lot of effort into solving DA. I just wouldn't want people to be thinking about DA-specific tests or DA-specific invariants.

Good post, thanks. I largely agree with you.
A couple of thoughts:

> Note, that the AI is aware of the fact that we wanted it to achieve a different goal and therefore actively acts in ways that humans will perceive as aligned.

This isn't quite right if we're going by the RFLO description (see footnote 7: in general, there's no requirement for modelling of the base optimiser or oversight system; it's enough to understand the optimisation pressure and be uncertain whether it'll persist).

In particular, the model needn't do this:
[realise we want it to do x] --> [realise it'll remain unchanged if it does x] --> [do x]

It can jump to:
[realise it'll remain unchanged if it does x] --> [do x]

It's not enough to check for a system's reasoning about the base optimiser.


> For example, a very powerful language model might still “only” care about predicting the next word and potentially the incentives to become deceptive are just not very strong for next-word prediction.

Here (and throughout) you seem to be assuming that powerful models are well-described as goal-directed (that they're "aiming"/"trying"/"caring"...). For large language models in particular, this doesn't seem reasonable: we know that LLMs predict the next token well; this is not the same as trying to predict the next token well. [Janus' Simulators is a great post covering this and much more]

That's not to say that similar risks don't arise, but the most natural path is more about goal-directed simulacra than a goal-directed simulator. If you're aiming to convince people of the importance of deception, it's important to make this clear: the argument doesn't rely on powerful LLMs being predict-next-token optimisers (they're certainly optimised, they may well not be optimisers).

This is why Eliezer/Nate often focus on a system's capacity to produce e.g. complex plans, rather than on the process of producing them. [e.g. "What produces the danger is not the details of the search process, it's the search being strong and effective at all." from Ngo and Yudkowsky on alignment difficulty]

In an LLM that gives outputs that look like [result of powerful search process], it's likely there's some kind of powerful search going on (perhaps implicitly). That search might be e.g. over plans of some simulated character. In principle, the character may be deceptively aligned - and the overall system may exhibit deceptive alignment as a consequence.
Arguments for characters that are powerful reasoners tending to be deceptively aligned are similar (instrumental incentives...).

[apologies on slowness - I got distracted]
Granted on type hierarchy. However, I don't think all instances of GPT need to look like they inherit from the same superclass. Perhaps there's such a superclass, but we shouldn't assume it.

I think most of my worry comes down to potential reasoning along the lines of:

  • GPT is a simulator;
  • Simulators have property p;
  • Therefore GPT has property p;

When what I think is justified is:

  • GPT instances are usually usefully thought of as simulators;
  • Simulators have property p;
  • We should suspect that a given instance of GPT will have property p, and confirm/falsify this;

I don't claim you're advocating the former: I'm claiming that people are likely to use the former if "GPT is a simulator" is something they believe. (this is what I mean by motte-and-baileying into trouble)

If you don't mean to imply anything mechanistic by "simulator", then I may have misunderstood you - but at that point "GPT is a simulator" doesn't seem to get us very far.
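
The class-vs-instance distinction I'm gesturing at can be sketched in code (a toy illustration only - all names here are hypothetical, and nothing is implied about any real implementation):

```python
# Pattern 1: "GPT is a simulator" treated as a superclass relationship -
# every instance is guaranteed property p by construction.
class Simulator:
    has_property_p = True

class GPTInstanceV1(Simulator):  # property p inherited, never checked
    pass

# Pattern 2: "simulator" treated as a per-instance empirical hypothesis -
# we suspect property p, then confirm or falsify it for each instance.
def check_property_p(model) -> bool:
    """Stand-in for an empirical test of property p on one instance."""
    return getattr(model, "has_property_p", False)  # placeholder test

model = GPTInstanceV1()
print(check_property_p(model))  # True for this instance; other instances
                                # must be tested, not assumed
```

The point of the sketch: in pattern 1 the conclusion is baked into the type hierarchy, so "simulators have property p, therefore GPT has property p" goes through automatically; in pattern 2 the test has to be run anew for each instance.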

> If it's deceptively aligned, it's not a simulator in an important sense because its behavior is not sufficient to characterize very important aspects of its nature (and its behavior may be expected to diverge from simulation in the future).

> It's true that the distinction between inner misalignment and robustness/generalization failures, and thus the distinction between flawed/biased/misgeneralizing simulators and pretend-simulators, is unclear, and seems like an important thing to become less confused about.

I think this is the fundamental issue.
Deceptive alignment aside, what else qualifies as "an important aspect of its nature"?
Which aspects disqualify a model as a simulator?
Which aspects count as inner misalignment?

To be clear on [x is a simulator (up to inner misalignment)], I need to know:

  1. What is implied mechanistically (if anything) by "x is a simulator".
  2. What is ruled out by "(up to inner misalignment)".

I'd be wary of assuming there's any neat flawed-simulator/pretend-simulator distinction to be discovered. (but probably you don't mean to imply this?)
I'm all for deconfusion, but it's possible there's no joint at which to carve here.

(my guess would be that we're sometimes confused by the hidden assumption:
[a priori unlikely systematically misleading situation => intent to mislead]
whereas we should be thinking more like
[a priori unlikely systematically misleading situation => selection pressure towards things that mislead us]

I.e. looking for deception in something that systematically misleads us is like looking for the generator for beauty in something beautiful. Beauty and [systematic misleading] are relations between ourselves and the object. Selection pressure towards this relation may or may not originate in the object.)

> Can you give an example of what it would mean for a GPT not to be a simulator, or to not be a simulator in some sense?

Here I meant to point to the lack of clarity around what counts as inner misalignment, and what GPT's being a simulator would imply mechanistically (if anything).

> There must be some evidence that the initial appearance of alignment was due to the model actively trying to appear aligned only in the service of some ulterior goal.

"trying to appear aligned" seems imprecise to me - unless you mean to be more specific than the RFLO description. (see footnote 7: in general, there's no requirement for modelling of the base optimiser or oversight system; it's enough to understand the optimisation pressure and be uncertain whether it'll persist).

Are you thinking that it makes sense to agree to test for systems that are "trying to appear aligned", or would you want to include any system that is instrumentally acting such that it's unaltered by optimisation pressure?

Mostly I agree with this.
I have more thoughts, but probably better to put them in a top-level post - largely because I think this is important and would be interested to get more input on a good balance.

A few thoughts on LW endorsing invalid arguments:
I'd want to separate considerations of impact on [LW as collective epistemic process] from [LW as outreach to ML researchers]. E.g. it doesn't necessarily seem much of a problem for the former to rely on unstated assumptions. I wouldn't formally specify an idea before sketching it, and it's not clear to me that there's anything wrong with collective sketching (so long as we know we're sketching - and this part could certainly be improved).
I'd first want to optimize the epistemic process, and then worry about the looking foolish part. (granted that there are instrumental reasons not to look foolish)

On ML's view, are you mainly thinking of people who may do research on an important x-safety sub-problem without necessarily buying x-risk arguments? It seems unlikely to me that anyone gets persuaded of x-risk from the bottom up, whether or not the paper/post in question is rigorous - but perhaps this isn't required for a lot of useful research?

> I want to maximize the bandwidth between human alignment researchers and AI tools/oracles/assistants/simulations. It is essential that these tools are developed by (or in a tight feedback loop with) actual alignment researchers doing theory work, because we want to simulate and play with thought processes and workflows that produce useful alignment ideas.

What are your thoughts on failure modes with this approach?
(please let me know if any/all of the following seems confused/vanishingly unlikely)

For example, one of the first that occurs to me is that such cyborgism is unlikely to amplify production of useful-looking alignment ideas uniformly in all directions.

Suppose it makes things 10x faster in directions that look promising but don't lead to solutions, and only 2x faster in directions that do. In principle this should still be helpful: we can allocate fewer resources to the 10x directions, leaving more time for the 2x directions, and everybody wins.
In practice, I'd expect the 10x boost to:

  1. Produce unhelpful incentives for alignment researchers: work on any of the 10x directions and you'll look hugely more productive. Who will choose to work on the harder directions?
    1. Note that it won't be obvious you're going slowly because the direction is inherently harder: from the outside, heading in a difficult direction will be hard to distinguish from being ineffective (from the inside too, in fact).
    2. Same reasoning applies at every level of granularity: sub-direction choice, sub-sub-direction choice....
  2. Warp our perception of promising directions: once the 10x directions seem to be producing progress much faster, it'll be difficult not to interpret this as evidence they're more promising.
    1. Amplified assessment-of-promise seems likely to correlate unhelpfully: failing to help us notice promising directions precisely where it's least able to help us make progress.

It still seems positive-in-expectation if the boost of cyborgism isn't negatively correlated with the ground-truth usefulness of a direction - but a negative correlation here seems plausible.

Suppose that finding the truly useful directions requires patterns of thought that are rare-to-non-existent in the training set, and are hard to instill via instruction. In that case it seems likely to me that GPT will be consistently less effective in these directions (to generate these ideas / to take these steps...). Then we may be in terrible-incentive-land.
[I'm not claiming that most steps in hard directions will be hard, but that speed of progress asymptotes to progress-per-hard-step]
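
The asymptote claim can be made concrete with an Amdahl's-law-style calculation (a sketch; the 10x/2x numbers are the hypothetical ones from above, and "hard fraction" means the fraction of baseline time spent on steps the tool barely accelerates):

```python
# Overall research speedup when a tool accelerates easy directions/steps
# by easy_boost but hard, solution-bearing ones by only hard_boost.
# As the hard fraction grows, total speedup asymptotes to hard_boost -
# i.e. speed of progress asymptotes to progress-per-hard-step.

def overall_speedup(hard_fraction: float,
                    easy_boost: float = 10.0,
                    hard_boost: float = 2.0) -> float:
    """Speedup over baseline if `hard_fraction` of baseline time is hard."""
    time_after = (1 - hard_fraction) / easy_boost + hard_fraction / hard_boost
    return 1 / time_after

for f in (0.1, 0.5, 0.9):
    print(f"hard fraction {f:.0%}: overall speedup {overall_speedup(f):.2f}x")
```

With even half the work in hard steps, the overall boost is already closer to 2x than 10x - which is what makes the 10x directions so much more attractive-looking than they deserve to be.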

Of course all this is hand-waving speculation.
I'd just like the designers of alignment-research boosting tools to have clear arguments that nothing of this sort is likely.

So e.g. negative impact through:

  • Boosting capabilities research.
  • Creation of undesirable incentives in alignment research.
  • Warping assessment of research directions.
  • [other stuff I haven't thought of]

Do you know of any existing discussion along these lines?

Great post. Very interesting.

However, I think that assuming there's a "true name" or "abstract type that GPT represents" is an error.

If GPT means "transformers trained on next-token prediction", then GPT's true name is just that. The character of the models produced by that training is another question - an empirical one. That character needn't be consistent (even once we exclude inner alignment failures).

Even if every GPT is a simulator in some sense, I think there's a risk of motte-and-baileying our way into trouble.

Presumably "too dismissive of speculative and conceptual research" is a direct consequence of increased emphasis on rigor. Rigor is to be preferred all else being equal, but all else is not equal.

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.

I note that within rigorous fields, the downsides of rigor are not obvious: we can point to all the progress made; progress that wasn't made due to the neglect of conceptual/speculative research is invisible. (has the impact of various research/publication norms ever been studied?)

Further, it seems limiting only to consider [must always be rigorous (in publications)] vs [no demand for rigor]. How about [50% of your publications must be rigorous] (and no incentive to maximise %-of-rigorous-publications), or any other not-all-or-nothing approach?

I'd contrast rigor with clarity here. Clarity is almost always a plus.
I'd guess that the issue in social science fields isn't a lack of rigor, but rather of clarity. Sometimes clarity without rigor may be unlikely, e.g. where there's a lot of confusion or lack of good faith - in such cases an expectation of rigor may help. I don't think this situation is universal.

What we'd want on LW/AF is a standard of clarity.
Rigor is an often useful proxy. We should be careful when incentivizing proxies.

> I’d expect AI checks and balances to have some benefits even if AIs are engaged in advanced collusion with each other. For example, AIs rewarded to identify and patch security vulnerabilities would likely reveal at least some genuine security vulnerabilities, even if they were coordinating with other AIs to try to keep the most important ones hidden.

This seems not to be a benefit.
What we need is to increase the odds of finding the important vulnerabilities. Collusion that reveals a few genuine vulnerabilities seems likely to lower our odds by giving us misplaced confidence. I don't think this is an artefact of the example: wherever we're unaware of collusion, but otherwise accurate in our assessment of check-and-balance systems' capabilities, we'll be overconfident in the results.

In many cases, the particular vulnerabilities (or equivalent) revealed will be selected at least in part to maximise our overconfidence.
