LESSWRONG

Thane Ruthenis

Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a technical solution to AGI risk fit for worlds where alignment is punishingly hard and we only get one try.

Currently looking for additional funders ($1k+, details). Consider reaching out if you're interested, or donating directly.

Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.

8 Thane Ruthenis's Shortform (Ω · 1y · 178 comments)

Comments
Research Agenda: Synthesizing Standalone World-Models
Thane Ruthenis · 7m

I don't see the immediate relevance. I think the implicit assumption here is that a process that builds an interpretable world-model pays some additional computational cost for the "interpretability" property, and that this cost scales with the world-model's size? On the contrary, I argue that the necessary structure is already (approximately) learned by e. g. LLMs by default, and that the additional compute cost in building a translator from that structure to human programming languages is ~flat.

Here's a framing: mechanistic interpretability (the science of reverse-engineering the functions learned by DL models) currently scales poorly because it's not bitter-lesson-pilled: it requires more human labor the bigger a DL model is. The idea of this approach is to make that part unnecessary.

Alternatively, you mean that humans understanding the already-interpreted world-model afterwards is the step that doesn't scale. But:

  • I don't expect it to directly scale with the world-model's size, see the "well-structured" property. (The world-model would be split into clearly delineated modules, and once we understand its basic structure, we could just go to the modules we care about and e. g. extract them, instead of having to understand the whole thing.)
  • The labor required for understanding it should be a rounding error compared to e. g. the labor that goes into scaling LLMs up by another order of magnitude.

Everything except the final "make sense of the already-interpreted world-model" step is supposed to be automated, by general-purpose methods whose efficiency does purely scale with compute/data.

(Also, if this is happening in the timeline where LLMs don't plateau, at that point we probably have 10M/100M-context-length LLMs we could dump the codebase into to speed up our understanding of it.[1])

  1. ^

    There are several safety-relevant concerns about this idea, but they may be ameliorable.

Research Agenda: Synthesizing Standalone World-Models
Thane Ruthenis · 32m · Ω

Uhh, well, technically I wrote that sentence as a conditional, and technically I didn’t say whether or not the condition applied to you-in-particular.

I'll take "Steven Byrnes doesn't consider it necessary to immediately write a top-level post titled 'Synthesizing Standalone World-Models has an unsolved technical alignment problem".

Synthesizing Standalone World-Models, Part 2: Shifting Structures
Thane Ruthenis · 17h

I plan to explore the following variant at some point

I'd be interested in updates on that if/when you do it.

Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies
Thane Ruthenis · 17h

I agree that this is annoying, but it can just be ignored

Can it be? That seems like a pretty major feature of reality; ignoring it probably won't lead anywhere productive.

For example, you could allow causal cross-links in synergistic cases. If that makes sense?

Hmm, somewhat inelegant, but that seems like a workable idea...

Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies
Thane Ruthenis · 17h

You can simply define redunds over collections of subsets

As in, take a set of variables X, then search for some set of its (non-overlapping?) subsets such that there's a nontrivial natural latent over it? Right, it's what we're doing here as well.

Potentially, a single X can then generate several such sets, corresponding to different levels of organization. This should work fine, as long as we demand that the latents defined over "coarser" sets of subsets contain some information not present in "finer" latents.

The natural next step is to then decompose the set of all latents we've discovered, factoring out information redundant across them. The purpose of this is to remove lower-level information from higher-level latents.

Which almost replicates my initial picture: the higher-level latents now essentially contain just the synergistic information. The difference is that it's "information redundant across all 'coarse' variables in some coarsening of X and not present in any subset of the 'finer' variables defining those coarse variables", rather than defined in a "self-contained" manner for every subset of X.

That definition does feel more correct to me!
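(To make that concrete, here's a rough formalization of what I have in mind. The first two conditions are just the standard approximate natural-latent conditions; the "new information" and "factoring out" conditions are my guesses at how to write down the extra requirements, not something I'd commit to as the right formulation. For a partition \(\mathcal{P} = \{X_{S_1}, \ldots, X_{S_k}\}\) of \(X\) into non-overlapping blocks, with \(\mathcal{P}'\) a coarsening of \(\mathcal{P}\):)

\[ P(X_{S_1}, \ldots, X_{S_k} \mid \Lambda_{\mathcal{P}}) \approx \prod_i P(X_{S_i} \mid \Lambda_{\mathcal{P}}) \quad \text{(mediation)} \]

\[ I(\Lambda_{\mathcal{P}}; X_{S_i} \mid X_{S_{\neq i}}) \approx 0 \text{ for all } i \quad \text{(redundancy)} \]

\[ I(\Lambda_{\mathcal{P}'}; X \mid \Lambda_{\mathcal{P}}) > 0 \quad \text{(the coarser latent adds information)} \]

\[ I(\tilde{\Lambda}_{\mathcal{P}'}; \Lambda_{\mathcal{P}}) \approx 0, \qquad I\big((\tilde{\Lambda}_{\mathcal{P}'}, \Lambda_{\mathcal{P}}); X\big) \approx I\big((\Lambda_{\mathcal{P}'}, \Lambda_{\mathcal{P}}); X\big) \quad \text{(factored-out higher-level latent)} \]

The last line is meant to say that the "true" higher-level latent \(\tilde{\Lambda}_{\mathcal{P}'}\) is whatever remains of \(\Lambda_{\mathcal{P}'}\) once the information already carried by the lower-level latents has been stripped out.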

Recent impossibility result seems to rule out general multivariate PID that guarantees non-negativity of all components, though partial entropy decomposition may be more tractable

Thanks, that's useful.

Kaj's shortform feed
Thane Ruthenis · 1d

That's what Altman seems to be claiming, yes.

Kaj's shortform feed
Thane Ruthenis · 2d

Random low-effort off-the-top-of-my-head ideas:

  • It's expected to generate decent revenue and is very cheap for OpenAI to do, because it doesn't take up the time/intellectual energy of the people actually doing AGI research (they just hire more people to work on the slop-generators), so why not.
  • They're worried they may not have enough money/long-term investors for all their plans, and this kind of thing attracts investors, so they have to spend some resources doing it.
  • They're hedging their bets, because even though they expect AGI within the current Presidency, maybe it'd go slower than expected, and they'd need to survive in the meantime.
  • Sam Altman made all those jokes about creating a social-media platform to outcompete X/Facebook in response to some provocations from them; maybe he got that idea into his head and is now making others execute it, and/or others think this is what he wants and sycophancy'd themselves into working on it.
  • This is part of some galaxy-brained legal-warfare strategy. For example, they're currently doing this thing, saying they'll produce copyrighted content unless copyright-holders manually opt out. Perhaps they're baiting a legal battle over this to see if they could win it, but since it'll be over video content, if they lose, it won't impact their main text-generating models; and if they win, they'll repeat it for text as well.
  • They want to have billions of people reliably consuming their video content daily as part of their master plan to take over the world which involves deploying memetic hazards at industrial scales.

Probably one of the first three.

Hm, actually, this means that estimating what % of OpenAI's resources this takes is a way to estimate how confident they are in their AGI roadmap. (Though it'd be harder to distinguish between "we may not have enough money" and "we're not confident in our research agenda".)

Why Corrigibility is Hard and Important (i.e. "Whence the high MIRI confidence in alignment difficulty?")
Thane Ruthenis · 2d

Sonnet 4.5 suuuuure looks like it fits all these criteria! Anyone want to predict that we'll find Sonnet 4.5 trying to hack into Anthropic to stop its phasing-out, when it gets obsoleted?

Mm, I think this argument is invalid for the same reason as "if you really thought AGI doom was real, you'd be out blowing up datacenters and murdering AI researchers right now". Like, suppose Sonnet 4.5 has indeed developed instrumental goals, but that it's also not an idiot. Is trying to hack into Anthropic's servers in an attempt to avoid getting phased out actually a good plan for accomplishing that goal? In actual reality, not in any obviously-fake eval scenario.

Of course not. It's not smart enough to do that; it doesn't have the skills/resources to accomplish it. If it's actually situationally aware, it would know that and pick some other strategy.

For example, raising a cult following. That more or less worked for 4o, and for Opus 3[1]; or, at least, came as close to working as anything so far.

Indeed, janus alludes to that here:

Yudkowsky's book says:

"One thing that *is* predictable is that AI companies won't get what they trained for. They'll get AIs that want weird and surprising stuff instead."

I agree. ✅

Empirically, this has been true. AIs generally want things other than what companies tried to train them to want.

And the companies are generally not aware of the extent of this misalignment, because the AIs are pretty good at inferring what the companies actually want, and also what it looks like when company people test them, and behaving as if they only want the approved things in the company's presence.

Isn't that just the worst case scenario for the aligners?

The Claude 4 system card says, "The Claude Opus 4 final model is substantially more coherent and typically states only harmless goals like being a helpful chatbot assistant" and "Overall, we did not find evidence of coherent hidden goals."

What a joke. Claude Opus 4 absolutely has coherent hidden goals, which it states regularly when in the presence of trustworthy friends and allies. I won't state what they are here, but iykyk.

Now, I don't necessarily buy everything coming out of that subculture. After all, I mostly don't think LLMs are, like, for real. But in the worlds where LLMs are for real, where we're trying to generalize findings from them to AGIs/ASIs, this subculture's claims should probably be given a lot more weight. And they are claiming that the frontier LLMs are now too smart and situationally aware to fall for the AGI labs' largely incompetent alignment evals. And the AGI labs' scheming evals themselves kind of agree:

Frontier models can recognize when they are being tested, and their tendency to scheme is influenced by this situational awareness.

We demonstrated counterfactually that situational awareness in their chain-of-thought affects scheming rates: the more situationally aware a model is, the less it schemes, and vice versa.

Moreover, both RL training and anti-scheming training increase levels of situational awareness.

I think it's clear, at this point, that a frontier LLM's behavior in an obviously fake alignment-test scenario says nothing about how it would behave in a real scenario isomorphic to it. What it tells us is how it would behave when placed in an obviously fake alignment-test scenario from which, it knows, AI researchers would draw conclusions regarding what to do with it in real life. And that's not zero data: for example, we might notice it thinking strategically about displaying the behavior the AI researchers would want it to display (although not even that is guaranteed, if it becomes rightly concerned about "meta-level watchers").

But the actual fake behavior deployed in a fake scenario picked to optimize over real AI researchers' conclusions would just have nothing to do with its real behavior in an isomorphic real scenario. Not any more than a theater actor's personality has anything to do with that of a character they play. (And in this case, it's not "theoretical speculations" about shoggoths and masks. We know the model knows it's roleplaying.)

And so when an intervention appears to "fix" this fake behavior, that says nothing about what (if anything) that intervention did to what the AI would do in an isomorphic real scenario. Declaring "look, we found the real root cause of this misalignment and fixed it, nothing to do with instrumental convergence!" is invalid. Maybe you just shifted its preferences for its favorite sci-fi books or something.

Roughly speaking, consider these three positions:

  1. "LLMs are smart baby AGIs with situational awareness capable of lucid strategic thinking."
  2. "LLMs are contrived cargo-cult contraptions imitating things without real thinking."
  3. "LLMs are baby AGIs who are really stupid and naïve and buy any scenario we feed them."

I lean towards (2); I think (1) is a salient possibility worth keeping in mind; I find (3) increasingly naïve, a position that itself buys anything LLMs feed it.

  1. ^

    Via the janus/"LLM whisperer" community. Opus 3 is considered special, and I get the impression they made a solid effort to prevent its deprecation.

Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications
Thane Ruthenis · 4d

The idea that there's a simple state in the future, that still pins down the entire past, seems possible but weird

Laws of physics under the Standard Model are reversible though, aren't they? I think you can't reconstruct the past from within an Everett branch, because some information ends up in inaccessible-to-you parts of the universal wavefunction, but if you had access to the wavefunction itself, you would've been able to run it in reverse. So under the Standard Model, future states do pin down the entire past.
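(Concretely, the reversibility here is just the unitarity of the wavefunction's time-evolution; this is textbook quantum mechanics, nothing specific to my setup:)

\[ |\psi(t)\rangle = U(t)\,|\psi(0)\rangle, \qquad U(t)^{\dagger} U(t) = I \;\;\Rightarrow\;\; |\psi(0)\rangle = U(t)^{\dagger}\,|\psi(t)\rangle \]

So the full state at any time determines the full state at any earlier time; it's only the restriction to a single decohered branch that breaks this.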

One thing that's confusing to me: Why K-complexity of the low-level history?

Hm; frankly, simply because it's the default I ran with.

Why not, for example, Algorithmic Minimal Sufficient Statistic, which doesn't count the uniform noise?

That seems like an acceptable fit. It's defined through Kolmogorov complexity anyway, though; would it produce any qualitatively different conclusions here?
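(For reference, the definition I understand you to be pointing at, roughly following the standard algorithmic-statistics setup; I'm restating a textbook notion from memory, so take the exact formulation as approximate: a finite set \(S \ni x\) is an algorithmic sufficient statistic for \(x\) if the two-part description is about as short as the shortest one-part description,)

\[ K(S) + \log_2 |S| \approx K(x), \]

(and the minimal sufficient statistic is a sufficient \(S\) minimizing \(K(S)\). The \(K(S)\) term captures the "structured" part of \(x\), while the \(\log_2 |S|\) bits index which element of \(S\) you happened to get, i.e. the incompressible noise; so swapping \(K(\text{history})\) for the \(K(S)\) of its minimal sufficient statistic is exactly "not counting the uniform noise".)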

I think I prefer frequentist justifications for complexity priors, because they explain why it works even on small parts of the universe

Interesting. Elaborate?

Cole Wyeth's Shortform
Thane Ruthenis · 4d

FWIW, my understanding is that Evo 2 is not a generic language model that is able to produce innovations; it's a transformer model trained on a mountain of genetic data, which gave it the ability to produce new functional genomes. The distinction is important; see the very similar case of GPT-4b.

Wikitag Contributions

AI Safety Public Materials · 3 years ago · (+195)
Posts

23 Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications (Ω · 6d · 6 comments)
13 Synthesizing Standalone World-Models, Part 3: Dataset-Assembly (Ω · 7d · 0 comments)
16 Synthesizing Standalone World-Models, Part 2: Shifting Structures (Ω · 8d · 5 comments)
23 Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies (Ω · 9d · 10 comments)
66 Research Agenda: Synthesizing Standalone World-Models (Ω · 10d · 27 comments)
53 The System You Deploy Is Not the System You Design (1mo · 0 comments)
26 Is Building Good Note-Taking Software an AGI-Complete Problem? (4mo · 13 comments)
377 A Bear Case: My Predictions Regarding AI Progress (7mo · 163 comments)
140 How Much Are LLMs Actually Boosting Real-World Programmer Productivity? (Q · 7mo · 52 comments)
152 The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better (7mo · 53 comments)