Joe Collman — LessWrong

Interesting. I mostly agree with the gist.

The following are a few thoughts that occur to me. Presented as potentially useful pointers, rather than well-thought-through arguments/conclusions.

I don't think "pseudo-mechanisms" is a useful label. Feels a bit too binary (and/or post-hoc) in a highly grey situation.
I'm not sure what you mean by "mechanistic model" vs "stable phenomenological compressions".
- I'm not saying I have no idea what you're talking about - just that I'm not clear quite how you want to distinguish these things. (note that I haven't read many of your previous posts - yet! :))
- As soon as I'm calling something a "stable" pattern in the data, there's at least an implicit [...and this pattern will continue to hold] hypothesis.
- What makes something a "mechanistic model"? E.g. does it need to involve a temporal pattern, so that I'm likely to think of it in causal terms?
- Is the problem here that humans tend to prematurely place too high a probability on causal hypotheses?
  - If so, is the general thing to be wary of more [hypotheses I'm likely to believe too strongly given the (lack of) evidence]?
  - E.g. a mathematician might be wary of elegant hypotheses on this basis.
    - This seems a plausible explanation for a [practicality is inversely proportional to attachment to mechanisms] pattern. The less practical a field, the more its practitioners will tend to be attracted by some kind of aesthetic sense - and that's a potential source of bias and premature attachment. (and in many cases there's the [less straightforwardly falsifiable] factor)

It would be best if we could simply not follow the compulsion, and stay as much as possible at the level of data patterns and phenomenological compressions, at least until we have a good handle there.

Here I remain unclear, as above. (I don't know what separates [have a good handle on data patterns] from [have explanations])
It seems to me our thinking is always going to be inseparable from a huge number of mechanistic expectations and assumptions (often implicit).
It seems a lesson here is something like:
- Be aware of the mechanistic models we're relying on.
- Be aware of the tendency to get prematurely attached to an explanation.
- Adapt accordingly.
  - I don't think jumping to a mechanistic model is itself an error - stubbornly sticking to one seems to be the problem.
  - Nimbleness seems desirable.
    - In particular since [assuming too much] and [assuming too little] are both sources of inefficiency.

Personally, when I try to predict the behaviour of people, I start with their past actions. As in, I look at what they have done in the past, and assume they’ll do more of that.

Similarly here, I think it's asking for trouble to imagine that [Gabe's characterization and extrapolation of [that]] doesn't already rely on a bunch of intent-based expectations and assumptions. (these will usually be more reliable than guesses we'd tend to label "psycho-analysis" - but they're present and important)

For this reason, [be aware of the degree to which you're x-ing, and the implications] seems safer advice than [avoid x-ing], for many x.

Help the AI 2027 team make an online AGI wargame

Joe Collman4mo115

If you're aiming to get millions of players, I think [no music at all] would be counterproductive. There's a reason almost every non-trivial game in existence has music. Of course it's also nice if it's simple to turn off / customize / replace - but it's usually a mistake to expect that a high proportion of players are going to significantly customize things.

Music is a way to get some immediate emotional engagement without making meaningful design concessions (most other mechanisms imply some more significant design constraint). If you want millions of players, you want immediate emotional engagement.

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Joe Collman7mo92

Unless I'm missing something, the hoped-for advantages of this setup are the kind of thing AI safety via debate already aims at. In GDM's recent paper on their approach to technical alignment, there's some discussion of amplified oversight (starts at page 71) more generally, and debate (starts at page 73).

If you see the approach you're suggesting as importantly different from debate approaches, it'd be useful to know where the key differences are.

(without having read too carefully, my initial impression is that this is the kind of thing I expect to work for a while, then fail [as with debate] - and my core concern is then: how do we accurately predict when it'll fail?)

Orpheus16's Shortform

Joe Collman1y51

Some thoughts:

The correct answer is clearly (c) - it depends on a bunch of factors.
My current guess is that it would make things worse (given likely values for the bunch of other factors) - basically for Richard's reasons.
- Given [new potential-to-shift-motivation information/understanding], I expect there's a much higher chance that this substantially changes the direction of a not-yet-formed project, than a project already in motion.
- Specifically:
  - Who gets picked to run such a project? If it's primarily a [let's beat China!] project, are the key people cautious and highly adaptable when it comes to top-level goals? Do they appoint deputies who're cautious and highly adaptable?
    - Here I note that the kind of 'caution' we'd need is [people who push effectively for the system to operate with caution]. Most people who want caution are more cautious.
  - How is the project structured? Will the structure be optimized for adaptability? For red-teaming of top-level goals?
    - Suppose that a mid-to-high-level participant receives information making the current top-level goals questionable - is the setup likely to reward them for pushing for changes? (noting that these are the kind of changes that were not expected to be needed when the project launched)
  - Which external advisors do leaders of the project develop relationships with? What would trigger these to change?
  - ...
I do think that it makes sense to aim for some centralized project - but only if it's the right kind.
- I expect that almost all the directional influence is in [influence the initial conditions].
- For this reason, I expect [push for some kind of centralized project, and hope it changes later] is a bad idea.
- I think [devote great effort to influencing the likely initial direction of any such future project] seems a great idea (so long as you're sufficiently enlightened about desirable initial directions, of course :))
- I'd note that [initial conditions] needn't only be internal to the project - in principle we could have reason to believe that various external mechanisms would be likely to shift the project's motivation sufficiently over time. (I don't know of any such reasons)
I think the question becomes significantly harder once the primary motivation behind a project isn't [let's beat China!], but also isn't [your ideal project motivation (with your ideal initial conditions)].
I note that my p(doom) doesn't change much if we eliminate racing but don't slow down until it's clear to most decision makers that it's necessary.
- Likewise, I don't expect that [focus on avoiding the earliest disasters] is likely to be the best strategy. So e.g. getting into a good position on security seems great, all else equal - but I wouldn't sacrifice much in terms of [odds of getting to a sufficiently cautious overall strategy] to achieve better short-term security outcomes.

Making a conservative case for alignment

Joe Collman1y00

First some points of agreement:

I like that you're focusing on neglected approaches. Not much on the technical side seems promising to me, so I like to see exploration.
- Skimming through your suggestions, I think I'm most keen on human augmentation related approaches - hopefully the kind that focuses on higher quality decision-making and direction finding, rather than simply faster throughput.
I think outreach to Republicans / conservatives, and working across political lines is important, and I'm glad that people are actively thinking about this.
I do buy the [Trump's high variance is helpful here] argument. It's far from a principled analysis, but I can more easily imagine [Trump does correct thing] than [Harris does correct thing]. (mainly since I expect the bar on "correct thing" to be high so that it needs variance)
- I'm certainly making no implicit "...but the Democrats would have been great..." claim below.

That said, various of the ideas you outline above seem to be founded on likely-to-be-false assumptions.
Insofar as you're aiming for a strategy that provides broadly correct information to policymakers, this seems undesirable - particularly where you may be setting up unrealistic expectations.

Highlights of the below:

Telling policymakers that we don't need to slow down seems negative.
- I don't think you've made any valid argument that not needing to slow down is likely. (of course it'd be convenient)
A negative [alignment-in-the-required-sense tax] seems implausible. (see below)
- (I don't think it even makes sense in the sense that "alignment tax" was originally meant^[1], but if "negative tax" gets conservatives listening, I'm all for it!)
I think it's great for people to consider convenient possibilities (e.g. those where economic incentives work for us) in some detail, even where they're highly unlikely. Whether they're actually 0.25% or 25% likely isn't too important here.
- Once we're talking about policy advocacy, their probability is important.

More details:

A conservative approach to AI alignment doesn’t require slowing progress, avoiding open sourcing etc. Alignment and innovation are mutually necessary, not mutually exclusive: if alignment R&D indeed makes systems more useful and capable, then investing in alignment is investing in US tech leadership.

Here and in the case for a negative alignment tax, I think you're:

Using a too-low-resolution picture of "alignment" and "alignment research".
1. This makes it too easy to slip between ideas like:
  i. Some alignment research has property x
  ii. All alignment research has property x
  iii. A [sufficient for scalable alignment solution] subset of alignment research has property x
  iv. A [sufficient for scalable alignment solution] subset of alignment research that we're likely to complete has property x
2. An argument that requires (iv) but only justifies (i) doesn't accomplish much. (we need something like (iv) for alignment tax arguments)
Failing to distinguish between:
1. Alignment := Behaves acceptably for now, as far as we can see.
2. Alignment := [some mildly stronger version of 'alignment']
3. Alignment := notkilleveryoneism
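
One way to make the gap between the four "property x" statements explicit is to write them with quantifiers. This is only a sketch: the symbols R, Suff and Feas are notation assumed here for illustration, not anything from the original post.

```latex
% Assumed notation (illustrative only):
%   R        : the set of alignment research directions
%   x(r)     : research direction r has property x
%   Suff(S)  : the subset S is sufficient for a scalable alignment solution
%   Feas(S)  : we are likely to complete everything in S
\begin{align*}
\text{(i)}   &\quad \exists\, r \in R :\ x(r) \\
\text{(ii)}  &\quad \forall\, r \in R :\ x(r) \\
\text{(iii)} &\quad \exists\, S \subseteq R :\ \mathrm{Suff}(S) \,\wedge\, \forall\, r \in S :\ x(r) \\
\text{(iv)}  &\quad \exists\, S \subseteq R :\ \mathrm{Suff}(S) \,\wedge\, \mathrm{Feas}(S) \,\wedge\, \forall\, r \in S :\ x(r)
\end{align*}
% Assuming R is non-empty and any sufficient S is non-empty:
%   (iv) => (iii) => (i)   and   (ii) => (i),
% but (i) on its own implies none of (ii), (iii), (iv).
```

Read this way, an argument that only establishes (i) leaves both the Suff and Feas conditions entirely unaddressed.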

In particular, there'll naturally be some crossover between [set of research that's helpful for alignment] and [set of research that leads to innovation and capability advances] - but alone this says very little.

What we'd need is something like:

Optimizing efficiently for innovation in a way that incorporates various alignment-flavored lines of research gets us sufficient notkilleveryoneism progress before any unrecoverable catastrophe with high probability.

It'd be lovely if something like this were true - it'd be great if we could leverage economic incentives to push towards sufficient-for-long-term-safety research progress. However, the above statement seems near-certainly false to me. I'd be (genuinely!) interested in a version of that statement you'd endorse at >5% probability.

The rest of that paragraph seems broadly reasonable, but I don't see how you get to "doesn't require slowing progress".

On "negative alignment taxes":

First, a point that relates to the 'alignment' disambiguation above.
In the case for a negative alignment tax, you offer the following quote as support for alignment/capability synergy:

...Behaving in an aligned fashion is just another capability... (Anthropic quote from Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback)

However, the capability is [ability to behave in an aligned fashion], and not [tendency to actually behave in an aligned fashion] (granted, Anthropic didn't word things precisely here). The latter is a propensity, not a capability.

What we need for scalable alignment is the propensity part: no-one sensible is suggesting that superintelligences wouldn't have the ability to behave in an aligned fashion. The [behavior-consistent-with-alignment]-capability synergy exists while a major challenge is for systems to be able to behave desirably.

Once capabilities are autonomous-x-risk-level, the major challenge will be to get them to actually exhibit robustly aligned behavior. At that point there'll be no reason to expect the synergy - and so no basis to expect a negative or low alignment tax where it matters.

On things like "Cooperative/prosocial AI systems", I'd note that hits-based exploration is great - but please don't expect it to work (and that "if implemented into AI systems in the right ways" is almost all of the problem).

On this basis, it seems to me that the conservative-friendly case you've presented doesn't stand up at all (to be clear, I'm not critiquing the broader claim that outreach and cooperation are desirable):

We don't have a basis to expect negative (or even low) alignment tax.
- (unclear so far that we'll achieve non-infinite alignment tax for autonomous x-risk relevant cases)
It's highly likely that we do need to slow advancement, and will need serious regulation.

Given our lack of precise understanding of the risks, we'll likely have to choose between [overly restrictive regulation] and [dangerously lax regulation] - we don't have the understanding to draw the line in precisely the right place. (completely agree that for non-frontier systems, it's best to go with little regulation)

I'd prefer a strategy that includes [policymakers are made aware of hard truths] somewhere.
I don't think we're in a world where sufficient measures are convenient.

It's unsurprising that conservatives are receptive to quite a bit "when coupled with ideas around negative alignment taxes and increased economic competitiveness" - but this just seems like wishful thinking and poor expectation management to me.

Similarly, I don't see a compelling case for:

that is, where alignment techniques are discovered that render systems more capable by virtue of their alignment properties. It seems quite safe to bet that significant positive alignment taxes simply will not be tolerated by the incoming federal Republican-led government—the attractor state of more capable AI will simply be too strong.

Of course this is true by default - in worlds where decision-makers continue not to appreciate the scale of the problem, they'll stick to their standard approaches. However, conditional on their understanding the situation, and understanding that at least so far we have not discovered techniques through which some alignment/capability synergy keeps us safe, this is much less obvious.

I have to imagine that there is some level of perceived x-risk that snaps politicians out of their default mode.
I'd bet on [Republicans tolerate significant positive alignment taxes] over [alignment research leads to a negative alignment tax on autonomous-x-risk-capable systems] at odds of at least ten to one (though I'm not clear how to operationalize the latter).
Republicans are more flexible than reality :).

[1]
As I understand the term, alignment tax compares [lowest cost for us to train a system with some capability level] against [lowest cost for us to train an aligned system with some capability level]. Systems in the second category are also in the first category, so zero tax is the lower bound.

This seems a better definition, since it focuses on the outputs, and there's no need to handwave about what counts as an alignment-flavored training technique: it's just [...any system...] vs [...aligned system...].
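
The zero lower bound can be spelled out in a couple of lines. The notation below is assumed purely for illustration and is not part of the original definition.

```latex
% Assumed notation (illustrative only):
%   S_c : training procedures available to us that yield a system at capability level c
%   A_c : the subset of S_c whose output is also aligned, so A_c \subseteq S_c
%   C   : the cost of running a procedure
\[
\mathrm{tax}(c) \;=\; \inf_{a \in A_c} C(a) \;-\; \inf_{s \in S_c} C(s)
\]
\[
A_c \subseteq S_c
\;\Longrightarrow\;
\inf_{a \in A_c} C(a) \,\ge\, \inf_{s \in S_c} C(s)
\;\Longrightarrow\;
\mathrm{tax}(c) \,\ge\, 0
\]
% Convention: \inf \emptyset = +\infty. If no procedure we know of yields an
% aligned system at level c, then A_c is empty and tax(c) = +\infty.
```

Under this reading the tax is infinite whenever we have no known way to produce an aligned system at the relevant capability level.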

Separately, I'm not crazy about the term: it can suggest to new people that we know how to scalably align systems at all. Talking about "lowering the alignment tax" from infinity strikes me as an odd picture.

Twitter thread on AI safety evals

Joe Collman1y40

I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.

I don't think this is quite right.

Two major objections to the bio-anchors 30-year-median conclusion might be:

1. The whole thing is laundering vibes into credible-sounding headline numbers.
2. Even if we stipulate that the methodology is sound, it measures an upper bound, not a median.

To me, (2) is the more obvious error. I basically buy (1) too, but I don't think we've gotten empirical evidence for it, given (2).

I guess there's a sense in which a mistake on (2) could be seen as a consequence of (1) - but it seems distinct: it's a logic error, not a free parameter. I do think it's useful to distinguish [motivated reasoning in free-parameter choice] from [motivated reasoning in error-checking].

It's not so obvious to me that the bio-anchors report was without foundation as an upper bound estimate.

Circumventing interpretability: How to defeat mind-readers

Joe Collman1yΩ120

To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].

That said, I think I'd disagree on one word of the following:

The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations that help orchestrate the actions required to do it. This is true even if they've been selected for passively.

Any internal representations etc. need to result in the actions being orchestrated (to the extent necessary). They don't need to be used for the purpose of orchestrating the actions. They don't need to be [clearly selected for the purpose of orchestrating such actions].

In some cases this might be improbable without active, deliberate malign selection - but this would need to be argued for. In at least the case of [manipulating the interpreter], this doesn't seem at all improbable, given the right selection pressure. (some manipulation seems obviously possible; whether it's enough for a particular malign strategy would vary)

I haven't thought about this for long, but in general the [this kind of thing ~must require active deliberation] intuition seems highly suspicious to me. All I need to set up a not-directly-deliberated-on malign strategy so that it's implemented via benign cognition is a sufficient set of innocuous correlations (both of [innocuous action tends to have useful-for-malign-strategy effect] and [innocuous actions will tend to occur in these predictable patterns]).

If I have unlimited passive selection pressure against tools that robustly detect active malign deliberation, but not 'instinctive' malign strategies, I'd strongly expect the outcome to be 'instinctive' active manipulation (etc.). (assuming of course that successful manipulation is instrumentally useful)

Whether I'd expect this in practice would come down to how much selection pressure were required for a particular [malign, active-with-no-malign-cognition strategy].

My current answer to this is not [this is extremely improbable]. It's more like [This seems intuitively plausible; I have no idea on the probability so I'm going to take this possibility seriously until I see a non-handwaving argument that shows it to be extremely unlikely].

Circumventing interpretability: How to defeat mind-readers

Joe Collman1yΩ120

Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions.
Most potential circumvention methods that can be passive can also be active. But some methods can only be active.

It seems to me that there's no fixed notion of "active" that works for both paragraphs here.

If active means [is achieved through the agent's actions], then this does not in general imply that it is deliberately achieved through the agent's actions. For example, training against interpretability tools might produce actions that hide misaligned thoughts/actions as side-effects.
With this notion of 'active' the first bolded section doesn't hold: this can happen even when the agent's thoughts are entirely visible.

If instead active means [is achieved deliberately through the agent's actions], then the "But some methods can only be active" doesn't hold.

There are two dimensions here:

Whether the circumvention is implemented passively/actively.
Whether the circumvention is selected for passively/actively.

In particular, the mechanisms you've labelled "strictly active" can, in principle, be selected for passively - so do not in general require any misaligned thoughts (admittedly, the obvious way this happens is by training against interpretability tools).

On “first critical tries” in AI alignment

Joe Collman1y60

I don't think [gain a DSA] is the central path here.
It's much closer to [persuade some broad group that already has a lot of power collectively].

I.e. the likely mechanism is not: [add the property [has DSA] to [group that will do the right thing]].
But closer to: [add the property [will do the right thing] to [group that has DSA]].

On “first critical tries” in AI alignment

Joe Collman1yΩ360

It may be better to think about it that way, yes - in some cases, at least.

Probably it makes sense to throw in some more variables.
Something like:

To stand x chance of property p applying to system s, we'd need to apply resources r.

In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].
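
A rough way to write this down, with all notation assumed here for illustration rather than taken from the thread:

```latex
% Assumed notation (illustrative only):
%   r^*(x, p, s) : the least resources needed for property p to hold of
%                  system s with probability at least x
\[
r^{*}(x, p, s) \;=\; \inf \bigl\{\, r \ge 0 \;:\; \Pr\bigl[\, p(s) \mid \text{resources } r \text{ applied} \,\bigr] \ge x \,\bigr\},
\qquad \inf \emptyset = +\infty
\]
% Loss of control, on this picture, is a shift after which the required
% resources for important properties p jump sharply, possibly to infinity:
\[
r^{*}_{\text{after}}(x, p, s) \;\gg\; r^{*}_{\text{before}}(x, p, s),
\qquad \text{possibly } r^{*}_{\text{after}}(x, p, s) = +\infty
\]
```

The impossible case then corresponds to no amount of resources reaching probability x, which the convention on the empty set captures.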
