Davidmanheim


If it fails once we are well past AlphaZero, or even just more moderate superhuman AI research, this is good, as this means the "automate AI alignment" plan has a safe buffer zone.

If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.

 

That assumes AI firms learn the lessons they need to from these failures. Our experience shows that they don't: they keep building systems that are predictably unsafe and exploitable, they have no serious plans to change their deployments, and they are even further from actually building a safety-oriented culture.

Because they are all planning to build agents that will be under optimization pressure, and RL-type failures apply when you build RL systems, even if they're built on top of LLMs.

Responses to o4-mini-high's final criticisms of the post:

Criticism: "You're treating hyper-introspection (internal transparency) as if it naturally leads to embedded agency (full goal-driven self-modification). But in practice, these are distinct capabilities. Why do you believe introspection tools would directly lead to autonomous, strategic self-editing in models that remain prediction-optimized?"

Response: Yes, these are distinct, and one won't necessarily lead to the other - but both are being developed by the same groups in order to deploy them. There's a reasonable question about how tightly linked they are, but I think there is a strong case that self-modification via introspection, even if only done during training and internal deployment, would lead to much more dangerous and harder-to-track deception.

Criticism: "You outline very plausible risks but don’t offer a distribution over outcomes. Should we expect hyper-introspection to make systems 10% more dangerous? 1000%? Under what architectures? I'd find your argument stronger if you were more explicit about the conditional risk landscape."

Response: If we don't solve ASI alignment, which no one seems to think we can do, we're doomed once we build misaligned ASI. This seems to get us there more quickly. Perhaps it even reduces short-term risks, but I think timelines are far more uncertain than the way the risks will emerge if we build systems with these capabilities.

Criticism: "Given that fully opaque systems are even harder to oversee, and that deception risk grows with opacity too, shouldn't we expect that some forms of introspection are necessary for any meaningful oversight? I agree hyper-introspection could be risky, but what's the alternative plan if we don’t pursue it?

Response: Don't build smarter-than-human systems. If you are not developing ASI, and you want to monitor current and near-future systems that are not inevitably existentially dangerous, work on how humans can provide meaningful oversight in deployment, rather than on tools that enhance capabilities and accelerate the race - because without fixing the underlying dynamics, i.e. solving alignment, self-monitoring is a doomed approach.

Criticism: "You assume that LLMs could practically trace causal impact through their own weights. But given how insanely complicated weight-space dynamics are even for humans analyzing small nets, why expect this capability to arise naturally, rather than requiring radical architectural overhaul?"

Response: Yes, maybe Anthropic and others will fail, and building smarter-than-human systems might not be possible. Then strong interpretability is just a capability enhancer, and doesn't materially change the largest risks. That would be great news, but I don't want to bet my kids' lives on it.

In general, you can mostly solve Goodhart-like problems across the vast majority of the experienced range of actions, and have the solution fall apart only in more extreme cases. Reward hacking is similar. This is the default outcome I expect from prosaic alignment - we work hard to patch misalignment and hacking, so it works well enough in all the cases we test and try, until it doesn't.
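To make that concrete, here's a toy sketch (my own illustrative numbers, not anything measured from a real system): a proxy reward that tracks the true objective across the ordinary range of actions, but comes apart once an optimizer pushes into the tail.

```python
# Toy illustration: a patched proxy reward agrees with the true objective over the
# range of actions we actually test, and only diverges under strong optimization.

def true_objective(a: float) -> float:
    # What we actually care about: improves up to a point, then extreme actions hurt.
    return a - 0.5 * a ** 2

def proxy_reward(a: float) -> float:
    # The proxy we trained/patched on the ordinary range [0, 1]: "more is better".
    return a

if __name__ == "__main__":
    ordinary = [0.0, 0.25, 0.5, 0.75, 1.0]   # the cases we test and try
    extreme = [2.0, 5.0, 10.0]               # where an optimizer ends up pushing
    for a in ordinary + extreme:
        print(f"action={a:5.2f}  proxy={proxy_reward(a):6.2f}  true={true_objective(a):7.2f}")
    # On [0, 1] the two mostly agree; at action=10 the proxy reports 10.0 while the
    # true objective is -40.0 -- the Goodhart-style failure shows up only in the tail.
```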

Quick take: it's focused on interpretability as a way to solve prosaic alignment, ignoring the fact that prosaic alignment clearly does not scale to the types of systems they are actively planning to build. (And it seems to actively embrace the fact that interpretability is a capabilities advantage in the short term, while pretending that it is a safety thing, as if the two are not at odds with each other amid racing dynamics.)

...yet it hasn't happened, which is pretty strong evidence the other way.

I think you are fooling yourself about how similar people in 1600 are to people today. The average person at the time was illiterate, superstitious, and could maybe do single digit addition and subtraction. You're going to explain nuclear physics?

This doesn't matter for predicting the outcome of a hypothetical war between 16th century Britain and 21st century USA.


If AI systems can make 500 years of progress before we notice it's uncontrolled, that already assumes an insanely strong superintelligence.
 

We could probably understand how a von Neumann probe or an anti-aging cure worked too, if someone taught us.

Probably, if it's of a type we can imagine and is comprehensible in those terms - but that's assuming the conclusion! As Gwern noted, we can't understand chess endgames. Similarly, in the case of a strong ASI, the ASI-created probe or cure could look less like an engineered, purpose-driven system that is explainable at all, and more like a random set of actions, not explainable in our terms, that nonetheless cause the outcome.

We can point to areas of chess like the endgame databases, which are just plain inscrutable


I think there is a key difference in places where the answers come from exhaustive search rather than from more intelligence - AI isn't better at that than humans, and from the little I understand, AI doesn't outperform in endgames (relative to its overperformance in general) via better policy engines; it does so via direct memorization or longer lookahead.
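As a toy version of what "exhaustive search rather than more intelligence" looks like, here is a tiny solved game (my own example, not anything from a chess engine): perfect play comes from enumerating and memoizing every position, the way endgame tablebases are built, not from a better intuition - and that approach stops scaling once the state space grows exponentially.

```python
# Toy sketch: exactly solving a tiny subtraction game by exhaustive memoized search.
# The "skill" is pure enumeration of the state space, analogous to an endgame
# tablebase, rather than a stronger policy or intuition.

from functools import lru_cache

MOVES = (1, 2, 3)  # each turn, remove 1-3 stones; whoever takes the last stone wins

@lru_cache(maxsize=None)
def is_win(stones: int) -> bool:
    """True if the player to move wins with perfect play from this position."""
    return any(m <= stones and not is_win(stones - m) for m in MOVES)

if __name__ == "__main__":
    tablebase = {n: is_win(n) for n in range(13)}  # exhaustive table for small positions
    print(tablebase)  # the losing positions are exactly the multiples of 4
```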

The difference here matters for other domains with far larger action spaces even more, since the exponential increase makes intelligence less marginally valuable at finding increasingly rare solutions. The design space for viruses is huge, and the design space for nanomachines using arbitrary configurations is even larger. If move-37-like intuitions are common, they will be able to do things humans cannot understand, whereas if it's more like chess endgames, they will need to search an exponential space in ways that are infeasible for them.

This relates closely to a folk theorem about NP-complete problems: many problems that are exponential in the worst case are approximately solvable with greedy algorithms in n log n or n^2 time. TSP is NP-complete, but actual salesmen find sufficiently efficient routes easily.
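As a sketch of the "actual salesmen do fine" point, here's a nearest-neighbor greedy heuristic for TSP (the city coordinates are just random points for illustration): exact TSP is NP-complete, but this O(n^2) pass typically lands within a modest factor of optimal on everyday instances.

```python
# Greedy nearest-neighbor heuristic for TSP: O(n^2), no optimality guarantee,
# but usually good enough in practice - the point of the folk observation above.

import math
import random

def nearest_neighbor_tour(cities):
    """Build a tour by always hopping to the closest unvisited city."""
    unvisited = set(range(1, len(cities)))
    tour = [0]
    while unvisited:
        last = cities[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, cities[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(cities, tour):
    # Total length of the closed tour (returning to the start city).
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

if __name__ == "__main__":
    random.seed(0)
    cities = [(random.random(), random.random()) for _ in range(200)]
    tour = nearest_neighbor_tour(cities)
    print(f"Greedy tour over 200 random cities: length {tour_length(cities, tour):.2f}")
```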

But what part are you unsure about?

Yeah, on reflection, the music analogy wasn't a great one. I am not concerned that pattern creation we can't intuit could exist - humans can do that as well. (For example, it's easy to make puzzles no one can solve.) The question is whether important domains are amenable to kinds of solutions that ASI can understand robustly in ways humans cannot. That is, can ASI solve "impossible" problems?

One specific concerning difference is whether ASI could play perfect social 12-D chess by being a better manipulator, despite all of the human-experienced uncertainties, and engineer arbitrary outcomes in social domains. There clearly isn't a feasible search strategy with exact evaluation, but if it is far smarter than "human-legible ranges" of thinking, it might be possible. 

This isn't just relevant for AI risk, of course. Another area is biological therapies, where, for example, it seems likely that curing or reversing aging requires the same sort of brilliant insight into insane complexity - figuring out whether there would be long-term or unexpected out-of-distribution impacts years later, without actually conducting multi-decade, large-scale trials.

Cool work, and I like your book on topological data analysis - but you seem to be working on accelerating capabilities instead of doing work on safety or interpretability. That seems bad to me, but it also makes me wonder why you're sharing it here.

On the other hand, I'd be very interested in your thoughts on approaches like singular learning theory.
