Mateusz Bagiński

~[agent foundations]


Let's say: if you train a coherently goal-directed, situationally aware, somewhat-better-than-human-level model using baseline forms of self-supervised pre-training + RLHF on diverse, long-horizon, real-world tasks, my subjective probability is ~25% that this model will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later.

Have you tried extending this gut estimate to something like:

If many labs use somewhat different training procedures to train their models but that each falls under the umbrella of "coherently goal-directed, situationally aware [...]", what is the probability that at least one of these models "will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later."?
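A back-of-the-envelope way to extend the gut estimate (my own illustrative sketch, not anything from the original comment): if each lab's model independently had probability p of the failure mode, then P(at least one) = 1 − (1 − p)^n. The independence assumption is almost certainly false in practice, since labs share architectures, data, and training techniques, so the true number should sit somewhere between p and this upper-end calculation.

```python
p = 0.25  # per-model probability, taken from the quoted ~25% estimate

def p_at_least_one(n: int, p: float = p) -> float:
    """P(at least one of n independent models has the failure mode)."""
    return 1 - (1 - p) ** n

# How fast the naive independent-labs number climbs with lab count.
for n in (1, 3, 5, 10):
    print(f"{n:>2} labs -> {p_at_least_one(n):.3f}")
```

Even with heavy correlation between labs, the direction of the effect is the same: more independent-ish training runs push the aggregate probability above the single-model estimate.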

To the extent that Tegmark is concerned about exfohazards (he doesn't seem to be very concerned AFAICT (?)), he would probably say that more powerful and yet more interpretable architectures are net positive.

I'm pretty sure I heard Alan Watts say something like that, at least in one direction (lower levels of organization -> higher levels). "The conflict/disorder at the lower level of the Cosmos is required for cooperation/harmony on the higher level."

Or maybe the Ultimate Good in the eyes of God is the epic sequence of: dead matter -> RNA world -> protocells -> ... -> hairless apes throwing rocks at each other and chasing gazelles -> weirdoes trying to accomplish the impossible task of raising the sanity waterline and carrying the world through the Big Filter of AI Doom -> deep utopia/galaxy lit with consciousness/The Goddess of Everything Else finale.

I mostly stopped hearing about catastrophic forgetting when Really Large Language Models became The Thing, so I figured that it's solvable by scale (likely conditional on some aspects of the training setup, idk, self-supervised predictive loss function?). Anthropic's work on Sleeper Agents seems like a very strong piece of evidence that this is the case.

Still, if they're right that KANs don't have this problem at much smaller sizes than MLP-based NNs, that's very interesting. Nevertheless, I think talking about catastrophic forgetting as a "serious problem in modern ML" is significantly misleading.

FWIW it was obvious to me

Behavioural Safety is Insufficient

Past this point, we assume following Ajeya Cotra that a strategically aware system which performs well enough to receive perfect human-provided external feedback has probably learned a deceptive human simulating model instead of the intended goal. The later techniques have the potential to address this failure mode. (It is possible that this system would still under-perform on sufficiently superhuman behavioral evaluations)

There are (IMO) plausible threat models in which alignment is very difficult but we don't need to encounter deceptive alignment. Consider the following scenario:

Our alignment techniques (whatever they are) scale pretty well, as far as we can measure, even up to well-beyond-human-level AGI. However, in the year (say) 2100, the tails come apart. It gradually becomes pretty clear that what we want our powerful AIs to do and what they actually do turn out not to generalize that well outside of the distribution on which we have been testing them so far. At this point, it is too late to roll them back, e.g. because the AIs have become incorrigible and/or power-seeking. The scenario may also have a more systemic character, with AI having already been so tightly integrated into the economy that there is no "undo button".

This doesn't assume either the sharp left turn or deceptive alignment, but I'd put it at least at level 8 in your taxonomy.

I'd put the scenario from Karl von Wendt's novel VIRTUA into this category.

Answer by Mateusz Bagiński

Maybe Hanson et al.'s Grabby aliens model? @Anders_Sandberg said that some N years before that (I think more or less at the time of working on Dissolving the Fermi Paradox), he "had all of the components [of the model] on the table" and it just didn't occur to him that they could be composed in this way (personal communication, so I may be misremembering some details). Although it's less than 10 years, so...

Speaking of Hanson, prediction markets seem like a more central example. I don't think the idea was [inconceivable in principle] 100 years ago.

ETA: I think Dissolving the Fermi Paradox may actually be a good example. Nothing in principle prohibited people puzzling about "the great silence" from using probability distributions instead of point estimates in the Drake equation. Maybe it was infeasible to compute this back in the 1950s/60s, but I guess it should have been doable in the 2000s, and still, the paper was not published until 2017.
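To make the distributions-vs-point-estimates point concrete, here's a toy Monte Carlo sketch in the spirit of the paper (the factors and ranges are made up for illustration, not taken from the Drake literature): multiplying geometric-mean point estimates gives one number, while sampling each factor from a wide log-uniform range shows that a large chunk of the probability mass can sit orders of magnitude below that number.

```python
import math
import random

random.seed(0)

def log_uniform(lo: float, hi: float) -> float:
    """Sample log-uniformly between lo and hi."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

# Three stand-in multiplicative factors, each uncertain over orders of
# magnitude (purely illustrative ranges, not real astrobiology numbers).
ranges = [(1e-3, 1e3), (1e-6, 1.0), (1e-2, 1e2)]

# Point-estimate approach: multiply the geometric means of the ranges.
point = 1.0
for lo, hi in ranges:
    point *= math.sqrt(lo * hi)

# Distributional approach: sample each factor, multiply, repeat.
samples = []
for _ in range(100_000):
    x = 1.0
    for lo, hi in ranges:
        x *= log_uniform(lo, hi)
    samples.append(x)

# Fraction of outcomes more than 100x below the point estimate.
frac_below = sum(s < point / 100 for s in samples) / len(samples)
print(f"point estimate: {point:.3g}")
print(f"fraction of samples >100x below it: {frac_below:.2f}")
```

The qualitative takeaway matches the paper's: with heavy-tailed uncertainty in each factor, the product's distribution is wide enough that "probably nobody out there" can be consistent with the same inputs that yield a large point estimate.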

Taboo "evil" (locally, in contexts like this one)?

If you want to use it for ECL, then it's not clear to me why internal computational states would matter.
