Developmental Stages of GPTs
I think it's an essential time to support projects that can work for a GPT-style near-term AGI, for instance by incorporating specific alignment pressures during training. Intuitively, it seems as if Cooperative Inverse Reinforcement Learning or AI Safety via Debate or Iterated Amplification are in this class.

As I argued here, I think GPT-3 is more likely to be aligned than whatever we might do with CIRL/IDA/Debate ATM, since it is trained with (self-)supervised learning and gradient descent.

The main reason such a system could pose an x-risk by itself seems to be mesa-optimization, so studying mesa-optimization in the context of such systems is a priority (esp. since GPT-3's 0-shot learning looks like mesa-optimization).

In my mind, things like IDA become relevant when we start worrying about remaining competitive with agent-y systems built using self-supervised learning systems as a component, but actually come with a safety cost relative to SGD-based self-supervised learning.

This is less the case when we think about them as methods for increasing interpretability, as opposed to increasing capabilities (which is how I've mostly seen them framed recently, a la the complexity theory analogies).

Rationalists, Post-Rationalists, And Rationalist-Adjacents

I strongly disagree with this definition of a rationalist. I think it's way too narrow, and assumes a certain APPROACH to "winning" that is likely incorrect.

Predictions for GPT-N

I think GPT-3 should be viewed as roughly as aligned as IDA would be if we pursued it using our current understanding. GPT-3 is trained via self-supervised learning (which is, on the face of it, myopic), so the only obvious x-safety concerns are something like mesa-optimization.

In my mind, the main argument for IDA being safe is still myopia.

I think GPT-3 seems safer than (recursive) reward modelling, CIRL, or any other alignment proposals based on deliberately building agent-y AI systems.


In the above, I'm ignoring the ways in which any of these systems increase x-risk via their (e.g. destabilizing) social impact and/or contribution towards accelerating timelines.

What specific dangers arise when asking GPT-N to write an Alignment Forum post?

In other words, there is a fully general argument for learning algorithms producing mesa-optimization to the extent that they use relatively weak learning algorithms on relatively hard tasks.

It's very unclear ATM how much weight to give this argument in general, or in specific contexts.

But I don't think it's particularly sensitive to the choice of task/learning algorithm.

What specific dangers arise when asking GPT-N to write an Alignment Forum post?

No, and I don't think it really matters too much... what's more important is the "architecture" of the "mesa-optimizer". It's doing something that looks like search/planning/optimization/RL.

Roughly speaking, the simplest form of this model of how things work says: "It's so hard to solve NLP without doing agent-y stuff that when we see GPT-N produce a solution to NLP, we should assume that it's doing agent-y stuff on the inside... i.e. what probably happened is it evolved or stumbled upon something agent-y, and then that agent-y thing realized the situation it was in and started plotting a treacherous turn".

What specific dangers arise when asking GPT-N to write an Alignment Forum post?

To me the most obvious risk (which I don't ATM think of as very likely for the next few iterations, or possibly ever, since the training is myopic/SL) would be that GPT-N in fact is computing (e.g. among other things) a superintelligent mesa-optimization process that understands the situation it is in and is agent-y. This risk is significantly more severe if nobody realizes this is the case or is looking out for it.

In this case, the mesa-optimizer probably has a lot of leeway in terms of what it can say while avoiding detection. Everything it says has to stay within some "plausibility space" of arguments that will be accepted by readers (I'm neglecting more sophisticated mind-hacking, but probably shouldn't), but for many X, it can probably choose between compelling arguments for X and not-X in order to advance its goals. (If we used safety-via-debate, and it worked, that would significantly restrict the "plausibility space".)

Now, if we're unlucky, it can convince enough people that something that effectively unboxes it is safe and a good idea.

And once it's unboxed, we're in a Superintelligence-type scenario.


Another risk that could occur (without mesa-optimization) would be incidental belief-drift among alignment researchers, if it just so happens that the misalignment between "predict next token" and "create good arguments" is significant enough.

Incidental deviation from the correct specification is usually less of a concern, but with humans deciding which research directions to pursue based on outputs of GPT-N, there could be a feedback loop...

I think I believe the AI alignment research community is good enough at tracking the truth that this seems less plausible?

On the other hand, it becomes harder to track the truth if there is an alternative narrative plowing ahead making much faster progress... So if GPT-N enables much faster progress on a particular plausible-seeming path towards alignment that was optimized for "next token prediction" rather than "good ideas"... I guess we could end up rolling the dice on whether "next token prediction" was actually likely to generate "good ideas".

[AN #96]: Buck and I discuss/argue about AI Alignment

Without having read the transcript either, this sounds like it's focused on near-term issues with autonomous weapons, and not meant to be a statement about the longer-term role autonomous weapons systems might play in increasing X-risk.

[AN #96]: Buck and I discuss/argue about AI Alignment

"autonomous weapons are unlikely to directly contribute to existential risk"

I disagree, and would've liked to see this argued for.

Perhaps the disagreement is at least somewhat about what we mean by "directly contribute".

Autonomous weapons seem like one of the areas where competition is most likely to drive actors to sacrifice existential safety for performance. This is because the stakes are extremely high, quick response time seems very valuable (meaning having a human in the loop becomes costly) and international agreements around safety seem hard to imagine without massive geopolitical changes.

Two Alternatives to Logical Counterfactuals

OK, so no "backwards causation"? (not sure if that's a technical term and/or if I'm using it right...)

Is there a word we could use instead of "linear", which to an ML person sounds like "as in linear algebra"?

Two Alternatives to Logical Counterfactuals

What is "linear interactive causality"?
