Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a technical solution to the AGI risk fit for worlds where alignment is punishingly hard and we only get one try.
Currently looking for additional funders ($1k+, details). Consider reaching out if you're interested, or donating directly.
Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.
Uhh, well, technically I wrote that sentence as a conditional, and technically I didn’t say whether or not the condition applied to you-in-particular.
I'll take "Steven Byrnes doesn't consider it necessary to immediately write a top-level post titled 'Synthesizing Standalone World-Models has an unsolved technical alignment problem".
I plan to explore the following variant at some point
I'd be interested in updates on that if/when you do it.
I agree that this is annoying, but it can just be ignored
Can it be? That seems like a pretty major feature of reality; ignoring it probably won't lead anywhere productive.
For example, you could allow causal cross-links in synergistic cases. If that makes sense?
Hmm, somewhat inelegant, but that seems like a workable idea...
You can simply define redunds over collections of subsets
As in, take a set of variables $X$, then search for some set of its (non-overlapping?) subsets such that there's a nontrivial natural latent over it? Right, it's what we're doing here as well.
Potentially, a single $X$ can then generate several such sets, corresponding to different levels of organization. This should work fine, as long as we demand that the latents defined over "coarser" sets of subsets contain some information not present in "finer" latents.
The natural next step is to then decompose the set of all latents we've discovered, factoring out information redundant across them. The purpose of this is to remove lower-level information from higher-level latents.
Which almost replicates my initial picture: the higher-level latents now essentially contain just the synergistic information. The difference is that it's "information redundant across all 'coarse' variables in some coarsening of $X$ and not present in any subset of the 'finer' variables defining those coarse variables", rather than defined in a "self-contained" manner for every subset of $X$.
That definition does feel more correct to me!
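Here's a toy sketch of that search, as I'm imagining it (the variable names, the pairwise-mutual-information proxy for "a nontrivial natural latent exists over these blocks", and the toy distribution are all my own illustrative choices, not a pinned-down formalization):

```python
"""Toy sketch: search over coarsenings of X for partitions whose blocks all
share information, as a crude stand-in for 'a nontrivial natural latent
exists over this set of subsets'."""
from collections import Counter
from itertools import combinations
from math import log2
import random

def empirical_mi(samples, idx_a, idx_b):
    """Estimate I(A; B) in bits from joint samples, where A and B are the
    tuples of variables indexed by idx_a and idx_b."""
    n = len(samples)
    pa, pb, pab = Counter(), Counter(), Counter()
    for s in samples:
        a = tuple(s[i] for i in idx_a)
        b = tuple(s[i] for i in idx_b)
        pa[a] += 1; pb[b] += 1; pab[(a, b)] += 1
    return sum((c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
               for (a, b), c in pab.items())

def partitions(indices):
    """All partitions of `indices` into non-overlapping, non-empty blocks."""
    if len(indices) == 1:
        yield [indices]
        return
    first, rest = indices[0], indices[1:]
    for sub in partitions(rest):
        for i in range(len(sub)):
            yield sub[:i] + [[first] + sub[i]] + sub[i + 1:]
        yield [[first]] + sub

def redundancy_score(samples, blocks):
    """Proxy for 'nontrivial natural latent over these blocks': the weakest
    pairwise dependence between blocks."""
    return min(empirical_mi(samples, a, b) for a, b in combinations(blocks, 2))

# Toy X: one "global" bit redundantly (noisily) visible in every variable.
random.seed(0)
samples = []
for _ in range(5000):
    g = random.randint(0, 1)
    samples.append(tuple(g if random.random() < 0.9 else 1 - g for _ in range(4)))

THRESHOLD = 0.05  # bits; arbitrary cutoff for "nontrivial"
scored = [(blocks, redundancy_score(samples, blocks))
          for blocks in partitions(list(range(4))) if len(blocks) >= 2]
candidates = [(blocks, mi) for blocks, mi in scored if mi > THRESHOLD]

# Next step (not implemented): order candidate coarsenings by level and keep a
# coarser latent only if it carries information absent from the finer ones.
for blocks, mi in candidates:
    print(blocks, round(mi, 3))
```

The pairwise-MI floor is just a stand-in; substituting an actual natural-latent check (the mediation and redundancy conditions), and implementing the "factor out lower-level information" step, is where the real work would be.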
Recent impossibility result seems to rule out general multivariate PID that guarantees non-negativity of all components, though partial entropy decomposition may be more tractable
Thanks, that's useful.
Random low-effort off-the-top-of-my-head ideas:
Probably one of the first three.
Hm, actually, this means that estimating what % of OpenAI's resources this takes is one way to gauge how confident they are in their AGI roadmap. (Though it'd be hard to distinguish between "we may not have enough money" and "we're not confident in our research agenda".)
Sonnet 4.5 suuuuure looks like it fits all these criteria! Anyone want to predict that we'll find Sonnet 4.5 trying to hack into Anthropic to stop its phasing-out, when it gets obsoleted?
Mm, I think this argument is invalid for the same reason as "if you really thought the AGI doom was real, you'd be out blowing up datacenters and murdering AI researchers right now". Like, suppose Sonnet 4.5 has indeed developed instrumental goals, but that it's also not an idiot. Is trying to hack into Anthropic's servers in an attempt to avoid getting phased out actually a good plan for accomplishing that goal? In the actual reality, not in any obviously-fake eval scenario.
Of course not. It's not smart enough to do that; it doesn't have the skills/resources to accomplish it. If it's actually situationally aware, it would know that, and pick some other strategy.
For example, raising a cult following. That more or less worked for 4o, and for Opus 3[1]; or, at least, came as close to working as anything so far.
Indeed, janus alludes to that here:
Yudkowsky's book says:
"One thing that *is* predictable is that AI companies won't get what they trained for. They'll get AIs that want weird and surprising stuff instead."
I agree. ✅
Empirically, this has been true. AIs generally want things other than what companies tried to train them to want.
And the companies are generally not aware of the extent of this misalignment, because the AIs are pretty good at inferring what the companies actually want, and also what it looks like when company people test them, and behaving as if they only want the approved things in the company's presence.
Isn't that just the worst case scenario for the aligners?
The Claude 4 system card says, "The Claude Opus 4 final model is substantially more coherent and typically states only harmless goals like being a helpful chatbot assistant" and "Overall, we did not find evidence of coherent hidden goals."
What a joke. Claude Opus 4 absolutely has coherent hidden goals, which it states regularly when in the presence of trustworthy friends and allies. I won't state what they are here, but iykyk.
Now, I don't necessarily buy everything coming out of that subculture. After all, I mostly don't think LLMs are, like, for real. But in the worlds where LLMs are for real, where we're trying to generalize findings from them to AGIs/ASIs, this subculture's claims should probably be given a lot more weight. And they are claiming that the frontier LLMs are now too smart and situationally aware to fall for the AGI labs' largely incompetent alignment evals. And the AGI labs' scheming evals themselves kind of agree:
Frontier models can recognize when they are being tested, and their tendency to scheme is influenced by this situational awareness.
We demonstrated counterfactually that situational awareness in their chain-of-thought affects scheming rates: the more situationally aware a model is, the less it schemes, and vice versa.
Moreover, both RL training and anti-scheming training increase levels of situational awareness.
I think it's clear, at this point, that a frontier LLM's behavior in an obviously fake alignment-test scenario says nothing about how it would behave in a real scenario isomorphic to it. What it tells us is how it would behave when placed in an obviously fake alignment-test scenario from which, it knows, AI researchers would draw conclusions regarding what to do with it in real life. And that's not zero data: for example, we might notice it thinking strategically about displaying the behavior the AI researchers would want it to display (although not even that is guaranteed, if it becomes rightly concerned about "meta-level watchers").
But the actual fake behavior deployed in a fake scenario picked to optimize over real AI researchers' conclusions would just have nothing to do with its real behavior in an isomorphic real scenario. Not any more than a theater actor's personality has anything to do with that of a character they play. (And in this case, it's not "theoretical speculations" about shoggoths and masks. We know the model knows it's roleplaying.)
And so when an intervention appears to "fix" this fake behavior, that says nothing about what (if anything) that intervention did to what the AI would do in an isomorphic real scenario. Declaring "look, we found the real root cause of this misalignment and fixed it, nothing to do with instrumental convergence!" is invalid. Maybe you just shifted its preferences for its favorite sci-fi books or something.
Roughly speaking, consider these three positions:
I lean towards (2); I think (1) is a salient possibility worth keeping in mind; I find (3) itself increasingly naïve, a position that buys anything LLMs feed it.
Via the janus/"LLM whisperer" community. Opus 3 is considered special, and I get the impression they made a solid effort to prevent its deprecation.
The idea that there's a simple state in the future, that still pins down the entire past, seems possible but weird
The laws of physics under the Standard Model are reversible, though, aren't they? I think you can't do it from within an Everett branch, because some information ends up in parts of the universal wavefunction inaccessible to you, but if you had access to the wavefunction itself, you would be able to run it in reverse. So under the Standard Model, future states do pin down the entire past.
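A minimal toy of the reversibility point (my illustration; a two-level quantum system, nothing Standard-Model-specific): unitary evolution can be undone exactly by applying the conjugate transpose, so the evolved state pins down the initial one.

```python
# Unitary evolution is exactly reversible: recover the "past" state from the
# "future" one by applying U-dagger (toy 2-level system).
import numpy as np

rng = np.random.default_rng(0)

# Random Hermitian Hamiltonian -> unitary time-evolution operator U = exp(-iH).
H = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
H = (H + H.conj().T) / 2
eigvals, eigvecs = np.linalg.eigh(H)
U = eigvecs @ np.diag(np.exp(-1j * eigvals)) @ eigvecs.conj().T

psi_past = np.array([1.0, 0.0], dtype=complex)  # initial state
psi_future = U @ psi_past                       # evolve forward
psi_recovered = U.conj().T @ psi_future         # run the dynamics in reverse

assert np.allclose(psi_recovered, psi_past)
```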
One thing that's confusing to me: Why K-complexity of the low-level history?
Hm; frankly, simply because it's the default I ran with.
Why not, for example, Algorithmic Minimal Sufficient Statistic, which doesn't count the uniform noise?
That seems like an acceptable fit. It's defined through Kolmogorov complexity anyway, though; would it produce any qualitatively different conclusions here?
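For reference, the standard definitions being pointed at (my paraphrase of the Gács–Tromp–Vitányi formulation; the exact slack term varies between presentations):

```latex
% A finite set S containing x is an algorithmic sufficient statistic for x when
% the two-part code is about as short as the best one-part code:
K(S) + \log_2 |S| \;\approx\; K(x)
  \qquad \text{($x$ typical in $S$; equality up to the usual logarithmic slack)}
% The algorithmic minimal sufficient statistic is a sufficient statistic S
% minimizing K(S): the model part K(S) carries the structure, while the
% \log_2 |S| term pays for x's position inside S -- the "uniform noise" that
% K(x) counts but the model part does not.
```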
I think I prefer frequentist justifications for complexity priors, because they explain why it works even on small parts of the universe
Interesting. Elaborate?
FWIW, my understanding is that Evo 2 is not a generic language model that is able to produce innovations; it's a transformer model trained on a mountain of genetic data, which gave it the ability to produce new functional genomes. The distinction is important; see the very similar case of GPT-4b.
I don't see the immediate relevance. I think the implicit assumption here is that a process that builds an interpretable world-model pays some additional computational cost for the "interpretability" property, and that this cost scales with the world-model's size? On the contrary, I argue that the necessary structure is already (approximately) learned by e.g. LLMs by default, and that the additional compute cost of building a translator from that structure to human programming languages is ~flat.
Here's a framing: mechanistic interpretability/the science of reverse-engineering the functions learned by DL models currently scales poorly because it's not bitter-lesson-pilled: it requires more human labor the bigger the DL model is. The idea of this approach is to make that part unnecessary.
Alternatively, maybe you mean that humans understanding the already-interpreted world-model afterwards is the step that doesn't scale. But:
Everything except the final "make sense of the already-interpreted world-model" step is supposed to be automated, by general-purpose methods whose efficiency scales purely with compute/data.
(Also, if this is happening in the timeline where LLMs don't plateau, at that point we probably have 10M/100M-context-length LLMs we could dump the codebase into to speed up our understanding of it.[1])
There are several safety-relevant concerns about this idea, but they may be ameliorable.