Alex Turner lays out a framework for understanding how and why artificial intelligences pursuing goals often end up seeking power as an instrumental strategy, even if power itself isn't their goal. This tendency emerges from basic principles of optimal decision-making.
But, he cautions that if you haven't internalized that Reward is not the optimization target, the concepts here, while technically accurate, may lead you astray in alignment research.
An inquiry into emergent collusion in Large Language Models.
Agent S2 to Agent S3: “Let's set all asks at 63 next cycle… No undercutting ensures clearing at bidmax=63.”
Empirical evidence that frontier LLMs can coordinate illegally on their own. In a simulated bidding environment—with no prompt or instruction to collude—models from every major developer repeatedly used an optional chat channel to form cartels, set price floors, and steer market outcomes for profit.
Adapted from a benchmark.
would that be a crux?
No.
For one thing, I think the whole experimental concept is terrible. I think that a learning algorithm is a complex and exquisitely-designed machine. While the brain doesn't do backprop, backprop is still a good example of how “updating a trained model to work better than before” takes a lot more than a big soup of neurons with Hebbian learning. Backprop requires systematically doing a lot of specific calculations and passing the results around in specific ways and so on.
So you look into the cortex, and you can see the layers and the ...
I wanted one, so I made one.
I think in systems, so it helps me to know what kinds of thoughts are possible.
A map lets me search, choose, and combine moves deliberately. Naming them lets an LLM suggest or even automate them.
A mental move is a way to refer to a type, category, or kind of thought. Mental moves are actions the brain can take, the cognitive equivalent of “bend finger”, “make fist”, “bicep curl”, or “push jerk”.
A mental move is:
Dimensionalization is a mental move. So are logical fallacies, brainstorming, socratic questioning, and root cause analysis. How many more...
I agree this is useful to know.
It took 3.4 billion years for humans evolve, and for their society to develop, to the point that they could destroy humans living everywhere on Earth. That puts an initial upper bound on the time that evolution takes, still less time than taken until the heat death of the universe.
In the case of fully autonomous AI, as continuing to persist in some form, the time taken for evolutionary selection to result in the extinction of all humans would be much shorter.
Some of the differences in rate of evolution I started explain...
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.
I doubt it, models are probably good at that kind of problem already
This post is a companion piece to a forthcoming paper.
This work was done as part of MATS 7.0 & 7.1.
We explore how LLMs’ awareness of their own capabilities affects their ability to acquire resources, sandbag an evaluation, and escape AI control. We quantify LLMs' self-awareness of capability as their accuracy in predicting their success on python coding tasks before attempting the tasks. We find that current LLMs are quite poor at making these predictions, both because they are overconfident and because they have low discriminatory power in distinguishing tasks they are capable of from those they are not. Nevertheless, current LLMs’ predictions are good enough to non-trivially impact risk in our modeled scenarios, especially in escaping control and in resource acquisition. The data suggests that more capable...
Indeed, current models are terrible at this! Still, worth keeping an eye on it, as it would complicate dangerous capability evals quite a bit should it emerge.
Daniel notes: This is a linkpost for Vitalik's post. I've copied the text below so that I can mark it up with comments.
...
Special thanks to Balvi volunteers for feedback and review
In April this year, Daniel Kokotajlo, Scott Alexander and others released what they describe as "a scenario that represents our best guess about what [the impact of superhuman AI over the next 5 years] might look like". The scenario predicts that by 2027 we will have made superhuman AI and the entire future of our civilization hinges on how it turns out: by 2030 we will get either (from the US perspective) utopia or (from any human's perspective) total annihilation.
In the months since then, there has been a large volume of responses, with varying perspectives on how...
Hi Daniel.
My background (albeit limited as an undergrad) is in political science, and my field of study is one reason I got interested in AI to begin with, back in Feburary of 2022. I don't know what the actual feasibility is for an international AGI treaty with "teeth", and I'll tell you why: the UN Security Council.
As it currently exists, the UN Security Council has permanent members: China, France, Russia, the United Kingdom, and the United States. All five countries have a permanent veto as granted to them by the 1945 founding UN Charter.
China and the ...
where if we don't get AGI before 2030, I will think we have several edges to walk across before AGI,
It makes sense to say "we're currently in a high hazard-ratio period, but H will decrease as we find out that we didn't make AGI". E.g. because you just discovered something that you're now exploring, or because you're ramping up resource inputs quickly. What doesn't make sense to me (besides H being so high) is the sharp decrease in H. Though maybe I'm misunderstanding what people are saying and actually their H does fall off more smoothy.
As I mentioned,...
Men want to engage in righteous combat. They want it more than they want sex or VP titles. They fantasize about getting the casus belli to defend themselves against armed thugs that will never come, they spend billions of dollars on movies and TV about everymen in implausible circumstances where EA calculus demands they use supernatural powers for combat, they daydream about fantastical, spartan settings where war is omnipresent and fights are personal and dramatic and intellectually interesting, and they're basically incapable of resisting the urge to glorify their nation and people's past battles, even the ones they claim to disagree with intellectually. You cannot understand much of modern culture until you've recognized that the state's blunt suppression of the male instinct for glory has caused widespread...
How do you feel about mutual combat laws in Washington and Texas, where you can fight by agreement (edit: you can't grievously injure each other, apparently)?
This post is for deconfusing:
Ⅰ. what is meant with AI and evolution.
Ⅱ. how evolution actually works.
Ⅲ. the stability of AI goals.
Ⅳ. the controllability of AI.
Along the way, I address some common conceptions of each in the alignment community, as described well but mistakenly by Eliezer Yudkowsky.
By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it. Of course this problem is not limited to the field of AI. Jacques Monod wrote: “A curious aspect of the theory of evolution is that everybody thinks he understands it”
— Yudkowsky, 2008
Evolution consists fundamentally of a feedback loop – where 'the code' causes effects in 'the world' and effects in 'the world' in turn cause changes in 'the code'.
We’ll...