This branch of research is aimed at finding a (nearly) objective way of thinking about the universe. When I imagine the end result, I imagine something that receives a distribution across a bunch of data and finds a bunch of useful patterns within it. At the moment that looks like finding patterns in data via `find_natural_latent(get_chunks_of_data(data_distribution))`, or perhaps showing that `find_top_n(n, ((chunks, natural_latent(chunks)) for chunks in all_chunked_subsets_of_data(data_distribution)), key=lambda chunks, latent: usefulness_metric(latent))` is a (convergent sub)goal of agents. As such, the notion that the donuts' data is simply poorly chunked - which needs to be solved anyway - makes a lot of sense to me.
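As a very rough sketch (and nothing more), that search might look like the following in Python. Every helper here - the chunking scheme, the "natural latent" extraction, and the usefulness metric - is a toy stand-in I made up, not a real implementation:

```python
import heapq

def all_chunked_subsets_of_data(data):
    # Toy chunking: every way of splitting the data into two contiguous pieces.
    return [(data[:i], data[i:]) for i in range(1, len(data))]

def natural_latent(chunks):
    # Toy stand-in for a latent: the information shared by every chunk,
    # here just the set of values that appear in all of them.
    shared = set(chunks[0])
    for chunk in chunks[1:]:
        shared &= set(chunk)
    return frozenset(shared)

def usefulness_metric(latent):
    # Toy stand-in: pretend a larger shared latent is more useful.
    return len(latent)

def find_top_n(n, data):
    candidates = ((chunks, natural_latent(chunks))
                  for chunks in all_chunked_subsets_of_data(data))
    return heapq.nlargest(n, candidates,
                          key=lambda pair: usefulness_metric(pair[1]))

top = find_top_n(2, [1, 2, 3, 1, 2, 3, 1, 2])
```

The real versions of these functions are exactly the open problems under discussion; the sketch only pins down the shape of the pipeline.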
I don't know how to think about the possibilities when it comes to decomposing the Xi. Why would it always be possible to decompose random variables so as to allow for a natural latent? Do you have an easy example of this? Also, what do you mean by mutual information between the Xi, given that there are at least 3 of them? And why would just extracting said mutual information be useless? If you get the chance to point me towards good resources about any of these questions, that would be great.
Let's say every day at the office, we get three boxes of donuts, numbered 1, 2, and 3. I grab a donut from each box and plunk them down on napkins helpfully labeled X1, X2, and X3. The donuts vary in two aspects: size (big or small) and flavor (vanilla or chocolate). Across all boxes, the ratio of big to small donuts remains consistent. However, Boxes 1 and 2 share the same vanilla-to-chocolate ratio, which is different from that of Box 3.
Does the correlation between X1 and X2 imply that there is no natural latent? Is this the desired behavior of natural latents, despite the presence of the common size ratio? (And the commonality that I've only ever pulled out donuts; there has never been a tennis ball in any of the boxes!) If so, why is this what we want from natural latents? If not, how does a natural latent arise despite the internal correlation?
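For concreteness, here is one way (an assumption on my part) the X1-X2 correlation could arise: the shared vanilla-to-chocolate ratio itself varies from day to day, boxes 1 and 2 both draw from the same day's ratio, and box 3 keeps its own fixed ratio. A quick simulation then shows X1 and X2 correlated while X3 stays independent of both:

```python
import random

random.seed(0)

def sample_day():
    p12 = random.choice([0.2, 0.8])        # today's shared flavor ratio for boxes 1 & 2
    x1 = 1 if random.random() < p12 else 0  # 1 = vanilla, 0 = chocolate
    x2 = 1 if random.random() < p12 else 0
    x3 = 1 if random.random() < 0.5 else 0  # box 3's own fixed ratio, independent of p12
    return x1, x2, x3

days = [sample_day() for _ in range(100_000)]
x1s, x2s, x3s = zip(*days)

def corr(a, b):
    # Pearson correlation, computed by hand to stay dependency-free.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / (va * vb) ** 0.5
```

Under this model, `corr(x1s, x2s)` comes out clearly positive while `corr(x1s, x3s)` hovers near zero, and the day's shared ratio `p12` looks like a candidate latent over (X1, X2).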
We could remove information from Λ′. For instance, Λ′ could be a bit indicating whether the temperature is above 100°C.
I don't understand how this is less information than a bit indicating whether the temperature is above 50°C. Specifically, given a bit telling you whether the temperature is above 50°C, how do you know whether the temperature is above 100°C or between 50°C and 100°C?
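A tiny check of what I mean: 75°C and 120°C agree on the 50°C bit but disagree on the 100°C bit, so the former bit alone leaves the latter undetermined.

```python
# One bit per threshold: is the temperature above 50°C? above 100°C?
bits = {t: (t > 50, t > 100) for t in (30, 75, 120)}

# bits[75] and bits[120] share the same first component (both above 50°C)
# but differ on the second, so the 50°C bit doesn't recover the 100°C bit.
```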
As to the definition of a short term goal: any goal that can be achieved (fully, i.e. without a "and keep it that way" clause) in a short, finite time (for instance, in a few seconds), with the resources the system already has at hand. Equivalently, I think: any goal that doesn't push towards instrumental power seeking.
As to how we know a system has a short term goal: if we could argue that systems prefer short term goals by default, then we still wouldn't know the goals of a particular system, but we could hazard a guess that its goals are short term. Perhaps we could expect short term goals by default if they were, for instance, easier to specify, and thus to have. As pointed out by others, if we try to give systems long term goals on purpose, they will probably end up with long term goals.
So long term goals aren't a default; market pressure will put them there as humans slowly cede more and more control to AIs, simply because the latter make decisions that work out better. Presumably this would start with lower level decisions (e.g. how exactly to write this line of code; which employee to reward based on performance), with AIs slowly being given higher level decisions to make. In particular, we don't die the first time someone creates an AI with the ability to (escape, self-improve, and then) kill the competing humans, because that AI is likely focused on a much smaller, more near term goal. That way, if we're careful and clever, we have a chance to study a smarter-than-human general intelligence without dying. Is that an accurate description of how you see things playing out?
I'm not sure I understand; are you saying that given these, we have high P(Doom), or that these are necessary to be safe even if GIs have only short term goals? Or something else entirely?
I am using Wikipedia's definition: "Ensuring that emergent goals match the specified goals for the system is known as inner alignment."
Inner alignment is definitely a problem. In the case you described, the emergent goal was long term (ensure I remember the answer to 1+1), and I'm still wondering whether, by default, short term specified goals do or do not lead to strange long term goals like the one in your example.
I wrote the following in custom instructions, under "How would you like ChatGPT to respond?":
After every answer, give a short reflection, like "I looked at this answer again and it looks good" or "I thought about this again and I think it could be better if ..." or "After looking this over again, this makes no sense because...".

I then had the following conversation:

Q: What is denser, gold or uranium?
A: Uranium is denser than gold. The density of gold is about 19.32 grams per cubic centimeter, while the density of uranium is about 19.1 grams per cubic centimeter.
Upon reflection, I see that I made an error in my initial response. The correct densities are indeed 19.32 g/cm³ for gold and 19.1 g/cm³ for uranium, but these values indicate that gold is denser than uranium, not the other way around. I apologize for the confusion, and I appreciate your patience.
I feel like a major explanatory paragraph is missing here, one explaining the difference between causality and probability. Something like:
Armed with knowledge of the future, we could know exactly what will happen. (e.g. The other doctor will give Alice medicine, and she will get better.) Given a full probability distribution over events we could make optimal predictions. (e.g. There is a 2/3 chance the other doctor will give Alice medicine, 1/4 chance of her getting better if he doesn't and 2/3 chance of her getting better if he does.) Causality gives us a way to combine a partial probability distribution with additional knowledge of the world to make predictions about events that are out of distribution. (e.g. Since I understand that the medicine works mostly by placebo, I can intervene and give Alice a placebo when the other doctor doesn't give her the medicine, raising her chances. Furthermore, if I have a distribution of how effective a placebo is relative to the medicine, I can quantify how helpful my intervention is.)
An intervention is a really important example of an out-of-distribution generalization; but if I gave you the full probability distribution over the outcomes your interventions would achieve, it would no longer be out of distribution (and you'd need to deal with paradoxes involving seemingly not having choices about certain things).
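Plugging in the numbers from the example above (with a made-up placebo recovery probability, since the example leaves that quantity unspecified):

```python
p_doctor  = 2/3   # chance the other doctor gives Alice medicine
p_if_med  = 2/3   # chance she gets better given medicine
p_if_none = 1/4   # chance she gets better with no treatment

# Without intervening, her overall chance of getting better:
p_baseline = p_doctor * p_if_med + (1 - p_doctor) * p_if_none   # = 19/36

# Intervention: hand her a placebo whenever the doctor doesn't give medicine.
# 0.5 is a purely illustrative placebo recovery probability.
p_if_placebo = 0.5
p_intervene = p_doctor * p_if_med + (1 - p_doctor) * p_if_placebo
```

Any placebo recovery probability above 1/4 makes `p_intervene` exceed `p_baseline`, which is the quantitative sense in which the intervention helps.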
No Free Lunch means that optimization requires taking advantage of underlying structure in the set of possible environments. In the case of epistemics, we all share close-to-the-same environment (including having similar minds), so there are a lot of universally useful optimizations for learning about the environment.
Optimization over the space of "how-to-behave instructions" requires some similar underlying structure. Such structure can emerge for two reasons: (1) because of the shared environment, or (2) because of shared goals. (Yes, I'm thinking about agents as Cartesian, in the sense of separating the goals and the environment, but to be fair so do L+P+S+C.)
On the environment side, this leads to convergent behaviours (which can also be thought of as behaviours resulting from selection theorems), like good epistemics, or gaining power over resources.
When it comes to goals, on the other hand, it is both possible (by the orthogonality thesis) and actually the case that different people have vastly different goals (e.g. some people want to live forever, some want to commit suicide, and these two groups probably require mostly different strategies). Less in common between different people's goals means fewer universally useful how-to-behave instructions. Nonetheless, optimizing behaviours that are commonly prioritized is close enough to universally useful, e.g. doing relationships well.
Perhaps an "Instrumental Sequences" would include the above categories as major chapters. In that case, as indicated in the post, current research being posted on LessWrong gives an approximate idea of what such sequences could look like.