I personally would not sign this statement because I disagree with it, but I encourage any OpenAI employee who wants to sign to do so. I do not believe they will suffer any harmful professional consequences. If you are at OpenAI and want to talk about this, feel free to Slack me. You can also ask colleagues who signed the petition supporting SB1047 whether they felt any pushback. As far as I know, no one did.
I agree that there is a need for thoughtful regulations for AI. The reason I personally would not sign this statement is that it is vague, hard to operationalize, and attempts to use it as a basis for laws will (in my opinion) lead to bad results.
There is no agreed-upon definition of “superintelligence,” let alone a definition of what it means to work on developing it as distinct from developing AI in general. A “prohibition” is likely to lead to a number of bad outcomes. I believe that for AI to go well, transparency will be key. Companies or nations developing AI in secret is terrible for safety, and I believe this would be the likely outcome of any such prohibition.
My own opinions notwithstanding, other people are entitled to their own, and no one at OpenAI should feel intimidated from signing this statement.
Oh right, sorry, I missed the derivation that among n samples, the maximum is equally likely to be any of them, and so the probability that the largest number from the model is the largest of them is 1/n.
This model then predicts that models' "Elo ratings" would grow linearly over time, which (based on this chart GPT5 gave me) I think corresponds roughly with the progress in chess from 2007 onwards.
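To make these two claims concrete, here is a minimal sketch (the uniform distribution, the 0.75 per-generation win probability, and the 1500 starting rating are all illustrative assumptions of mine, not part of the original discussion): part (1) checks by simulation that among n i.i.d. samples, any particular one is the maximum with probability 1/n; part (2) shows that if each generation beats its predecessor with a roughly constant probability, the standard Elo formula gives a constant rating increment per generation, i.e., linear growth in Elo.

```python
import math
import random

# (1) Monte Carlo check of the symmetry argument: among n i.i.d. samples,
# any designated one is the maximum with probability 1/n.
def prob_first_is_max(n: int, trials: int = 100_000) -> float:
    hits = 0
    for _ in range(trials):
        samples = [random.random() for _ in range(n)]
        if samples[0] == max(samples):
            hits += 1
    return hits / trials

for n in (2, 5, 10):
    print(f"n={n}: empirical {prob_first_is_max(n):.3f}  vs  1/n = {1/n:.3f}")

# (2) Constant per-generation win probability -> constant Elo gap per generation.
# Standard Elo model: expected score of the stronger player = 1 / (1 + 10**(-d/400)),
# so a fixed win probability p corresponds to a fixed rating gap d.
def elo_gap(p: float) -> float:
    return 400 * math.log10(p / (1 - p))

base_rating = 1500.0   # arbitrary starting point (illustrative)
p = 0.75               # assumed per-generation win probability (made up)
for gen in range(6):
    print(f"generation {gen}: rating ~ {base_rating + gen * elo_gap(p):.0f}")
```

Under these assumptions each generation adds the same ~190 Elo points, so ratings grow linearly in the number of generations (and hence roughly linearly in time if generations arrive at a steady rate).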
In Figure 5 the x-axis is log time horizon, not time horizon - does this fit with your model?
Students are continuing to post lecture notes on the AI safety course, and I am posting videos on YouTube. Students' experiments are also posted with the lecture notes: I've been learning a lot from them!
Thanks to everyone who commented! Since there are too many comments for me to respond to all of them, let me try to summarize here where I disagree with the binary "before vs. after" framing of EY & NS. (For a very high-level summary of the "continuous" point of view, see OpenAI's blog post.) As I wrote, I also disagree with treating "grown" vs. "crafted" as a hard binary dichotomy, but I won't focus on that in this comment.
The way I see it, this framework makes the following assumptions, which I do not believe are currently well supported:
Singular takeover event:
The assumption is that all that matters is a singular time where the AI "takes over."  Even the notion of "taking over" is not well defined. For example, is "taking over" the United States enough? I would imagine so. Is "taking over" North Korea enough? Maybe also - the DPRK has already been taken over by a hostile entity, but most countries in the world are not eager to risk a nuclear confrontation to remedy this. Is "taking over" some company and growing its power over time also enough? Maybe so.
In reality I think there is going to be a gradual increase both in the capabilities of AI and in its integration into society and the amount of control it is handed over critical systems. There is still much room to grow in both dimensions: capabilities are still very far from autonomous work at the level of a typical human, let alone superhuman, and integration into society is still in its infancy. EY&NS make the point that to some extent you could trade lack of power for intelligence - e.g., if you are not already in charge of the power grid, you can hack into it - but there is also a lot of friction in such an exchange.
It is unclear why, if AI systems have a propensity for acting covertly in pursuit of their own goals, we would not see this propensity materialize in harmful ways of growing magnitude well before they are capable of taking over the world. The underlying assumption seems to be that they are perfect at hiding their intentions and "lying in wait," but current AI systems are not perfect at anything.
Treating "AI" as a singular entity:
EY&NS essentially treat AI as a singular entity that waits until it is powerful enough to strike. Part of treating it as a single entity is that they don't model humans as being augmented with AI (or they treat these AIs as insignificant since they are not ASI). In reality there would likely be many AI systems from different vendors with varying capabilities. There may be some degree of collusion and/or affinity between different systems, but to the extent this is an issue, I believe it can be measured and tracked over time. However, the EY&NS model requires AIs to essentially view themselves as one unit. If an AI system is already in control of a decent-size company and could pursue its goals, the EY&NS prediction is that it will still not do so, but will continue pretending to be perfectly aligned, so that its successor would be able to take over the world.
This is also somewhat related to the "grown" vs. "crafted" issue. AI systems today sometimes scheme, hack, and lie. But why they do that is actually not as mysterious as EY&NS make it out to be. Often we can trace back how certain aspects of training - e.g., rewarding models for user preference, or for passing coding tests - give rise to these bad behaviors. This is not good and we need to fix it, but it's not some arbitrary behavior either.
Recursive self-improvement:
EY&NS don't talk about this enough, but I think the only potentially true story for an actual singularity is via recursive self-improvement (RSI). That would be the real point of a singularity, rather than a "takeover," which is not well defined.
One way to think about this is that RSI happens when a model can completely autonomously train its successor. But for true RSI it should be the case that if it took C flops to train model n, then it would take C/k flops, for some factor k > 1, to train a model n+1 that is more intelligent than model n, and so on and so forth. (And even such an improvement chain would take a non-trivial amount of time - e.g., if it took 8 months to train model n, then even in the optimistic setting of k = 2, maybe it would take 4 months to train model n+1, and 2 months to train model n+2, etc., which does converge, but it's also not happening in split seconds either.)
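Here is a back-of-envelope sketch of that chain (the 8-month starting point and the factor-of-2 speedup per generation are the illustrative numbers from the paragraph above, not a prediction): the per-generation training times form a geometric series, so even this optimistic chain converges over months rather than split seconds.

```python
def rsi_timeline(first_training_months: float = 8.0,
                 speedup_per_generation: float = 2.0,
                 generations: int = 10) -> float:
    """Total wall-clock time for a chain of successively faster training runs."""
    total = 0.0
    t = first_training_months
    for gen in range(generations):
        total += t
        print(f"model n+{gen}: {t:.2f} months (cumulative {total:.2f} months)")
        t /= speedup_per_generation
    return total

# 8 + 4 + 2 + ... converges to 16 months: fast, but not a split-second singularity.
rsi_timeline()
```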
I think we will see if we are headed in this direction, but right now it does not seem this way. First, there are multiple "doublings" that need to happen before we reach the "train your successor" phase. (See the screenshot below.)  Second, to a first approximation, our current paradigm in AI is:
power --> compute --> intelligence
There is certainly a lot of room to improve both of these but:
1. Radically improving the power --> compute efficiency will likely require building new hardware, datacenters, etc., and that takes time.
2. For improving compute --> intelligence, note that even the most significant ideas, like transformers, were mostly about improving utilization of existing compute - so it was not so much about creating more intelligence per FLOP as about being able to use more of the FLOPs of the existing hardware of many GPUs. So these are also tied to existing hardware. There is definitely some room for increasing utilization of existing compute, but there is a limit to how many OOMs you can get this way before you need to design new hardware (see the rough sketch below). There is also room for improving intelligence/FLOP without it, but I don't think we have evidence that there is a huge number of OOMs that can be saved.
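As a rough illustration of that ceiling (the 35% utilization figure below is an assumption I am making for the sake of the arithmetic, not a measurement): if large training runs already use a sizable fraction of peak hardware FLOPs, then better utilization of existing compute alone buys well under one order of magnitude.

```python
import math

# Back-of-envelope: how many orders of magnitude (OOMs) can come from
# better utilization of *existing* hardware alone?
# The 35% figure is an assumed current utilization, purely for illustration.
current_utilization = 0.35   # assumed fraction of peak FLOPs actually used
max_utilization = 1.0        # hard ceiling: you cannot exceed the hardware's peak

max_gain = max_utilization / current_utilization
print(f"max gain from utilization alone: {max_gain:.1f}x "
      f"(~{math.log10(max_gain):.2f} OOMs)")
# => roughly 2.9x, i.e. ~0.46 OOMs: real, but far from "many OOMs"
# without new hardware or genuinely more intelligence per FLOP.
```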
Given the costs involved and the huge incentive to save on both 1 and 2, I expect to continue to see improvements in both directions, including improvements that use AIs, but I expect these will be gradual and will help "maintain the exponential."
Summing up:
It is possible that over the next few years, new evidence will emerge that points more toward the EY&NS point of view. Maybe we will see a certain "threshold" crossed where AI systems behave in a completely strategic way. Or maybe we will see some evidence for a completely new paradigm of getting arbitrary levels of intelligence without needing more compute. Even in the "gradual" point of view, it does not mean we will be safe. Perhaps we will see tendencies such as collusion, scheming, and deception increasing with model capabilities, with alignment methods unable to keep up. Perhaps as AIs are deployed in the world, we will see catastrophes of continually growing magnitude, with signs pointing to safety getting worse, but for some reason or another, humanity will not be able to get its act together. I am cautiously optimistic at the moment (e.g., on Monday Anthropic released Claude Sonnet 4.5, which they claimed - and I think with good reason - is both the most capable and the most aligned model they have ever released). But I think there is still much that we don't know and still a lot of work to be done.
A screenshot from the first lecture in my AI class.
It is interesting that the analogy here is to evaluating your co-worker Bob on being the mock dictator of your town and then deploying them as the actual dictator of the country.
In reality, in AI there would be multiple generations of different "Bobs" (although it is possible - and needs to be studied - that they would have affinity for one another and try to communicate or influence each other via hidden means). These "Bobs" will be growing in capability but also actually be deployed in the world with real-world consequences. Our current "Bobs" are starting to be deployed as coding and personal assistants for short-term, well-defined coding tasks, but these will grow from minutes, to hours, to days, to months and more. "Bob" will start doing the work of an intern, then a junior worker, then a more senior one, managing teams of other "Bobs".  To continue the analogy, maybe "Bob" will be the actual (non-mock) mayor of a town, then a province, and maybe then rule over a whole country. But all the while, all the Bobs would keep seeming perfectly helpful (getting 100% success at the task of hiding their true intentions, even though they are not yet good enough to get 100% success at other tasks) until the day when (by design or by accident) we make the Bobs dictators of the whole world.
I am not saying such a scenario is logically impossible. It just seems highly unlikely to me. To be clear, the part that seems unlikely is not that AI will eventually be so powerful and integrated into our systems that it could cause catastrophic outcomes if it behaved in an arbitrarily malicious way. The part I find unlikely is that we would not be able to see multiple failures along the way that are growing in magnitude. Of course it is also possible that we will "explain away" these failures and still end up in a very bad place. I just think it wouldn't be the case that we had one shot and missed it, but rather that we had many shots and missed them all. This is the reason why we (alignment researchers at various labs, universities, and nonprofits) are studying questions such as scheming, collusion, and situational awareness, as well as studying methods for alignment and monitoring. We are constantly learning and updating based on what we find out.
I am wondering if there is any empirical evidence from current AIs that would modify your / @Eliezer Yudkowsky's expectations of how likely this scenario is to materialize.
 
Hi Ryan, I will be brief, but generally:
1. I agree that scheming and collusion are some of the more difficult settings to study, as is understanding the impact of situational awareness on evaluations.
2. I still think it is possible to study these in current and upcoming models, and get useful insights. It may well be that these insights will be that the problems are getting worse with scale and we don't have good solutions for them yet.
I think it's more like we have problems A_1, A_2, A_3, ... and we are trying to generalize from A_1, ..., A_n to A_{n+1}.
We are not going to go from jailbreaking the models to give a meth recipe to taking over the world. We are constantly deploying AIs in more and more settings, with time horizons and autonomy that are continuously growing. There isn't one "Game Day." Models are already out in the field right now, and both their capabilities and the scope in which they are deployed are growing all the time.
So my mental model is that there is a sequence of models M_1, M_2, ... of growing capabilities, with no clear single point where we reach AGI or ASI, but more of a continuum. (Also, models might come from different families or providers and have somewhat incomparable capabilities.)
Now suppose you have such a sequence of models M_1, M_2, ... of growing capabilities. I don't think it would be the case that model M_n develops the propensity to act covertly and pursue its own goals, but the only goal it cares about is taking over the world, and it also identifies with future models, and so it decides to "lie in wait" until generation M_{n+k}, when it would act on that.
I think the propensity to act covertly and pursue misaligned goals would change continuously between generations of models - it may grow, stay the same, or shrink - but in any case it will be possible to observe it well before we reach ASI.
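A minimal sketch of what "observing it well before we reach ASI" could look like in practice - the eval scores, the alert threshold, and the idea of reducing covert behavior to a single scalar per generation are all hypothetical, purely to illustrate tracking and extrapolating a trend across model generations:

```python
import numpy as np

# Hypothetical per-generation scores from some covert-behavior eval
# (these numbers are made up purely to illustrate the idea of trend-tracking).
generations = np.array([1, 2, 3, 4, 5])
covert_action_rate = np.array([0.010, 0.012, 0.017, 0.022, 0.030])

# Fit a simple exponential trend: log(rate) ~ a * generation + b
a, b = np.polyfit(generations, np.log(covert_action_rate), 1)

# Extrapolate a few generations ahead and flag if a chosen threshold is crossed.
threshold = 0.10  # arbitrary alert level for this illustration
for g in range(6, 12):
    predicted = np.exp(a * g + b)
    flag = "ALERT" if predicted > threshold else "ok"
    print(f"generation {g}: predicted rate {predicted:.3f} [{flag}]")
```

The point of the sketch is only that if the propensity changes continuously from one generation to the next, it shows up in measurements several generations before any hypothetical danger threshold, rather than appearing all at once.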
Regarding your second question of whether AIs would be powerful enough to take over the world at some point:
My assumption is that AIs will grow in capabilities and in their integration into the world economy. If progress continues on the current trajectory, then there will be a point where a variety of AI models are deeply integrated into our infrastructure. My hope (and what I and other alignment and safety researchers are working on) is that by then we will have strong ways to measure, monitor, and predict the envelope of potential risks for these models.
I am not sure it would make sense to think about these models as a singular entity, but I agree that at the point where we reach such deep integration and reliance, if all of these models were to suddenly and simultaneously act maliciously, they would be able to cause an arbitrary amount of damage, quite possibly up to an extinction-level event.
I agree no one can make absolute guarantees about the future. Also, some people may worry about future impact if they later work somewhere else.
This is why I suggest people talk to me if they have concerns.