Ivan Vendrov

I like the "guardian" framing a lot! Besides the direct impact on human flourishing, I think a substantial fraction of x-risk comes from the deployment of superhumanly persuasive AI systems. It seems increasingly urgent that we deploy some kind of guardian technology that at least monitors for, and ideally protects against, such superhuman persuaders.

Symbiosis is ubiquitous in the natural world, and is a good example of cooperation across what we normally would consider entity boundaries.

When I say the world selects for "cooperation" I mean it selects for entities that try to engage in positive-sum interactions with other entities, in contrast to entities that try to win zero-sum conflicts (power-seeking).

Agreed with the complicity point - as evo-sim experiments like Axelrod's showed us, selecting for cooperation requires entities that can punish defectors, a condition the world of "hammers" fails to satisfy.
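
To make the punishment point concrete, here is a minimal sketch of an iterated prisoner's dilemma in the spirit of Axelrod's tournaments. The payoff values (5/3/1/0), the 10-round match length, and the three strategy functions are illustrative assumptions, not a reconstruction of his actual setup.

```python
from typing import Callable, Dict, List, Tuple

# Assumed payoffs for illustration: temptation=5, reward=3, punishment=1, sucker=0.
PAYOFF: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# A strategy maps the opponent's past moves to "C" (cooperate) or "D" (defect).
Strategy = Callable[[List[str]], str]


def always_cooperate(opponent_history: List[str]) -> str:
    """Cooperates unconditionally, so it cannot punish anyone (a 'hammer')."""
    return "C"


def always_defect(opponent_history: List[str]) -> str:
    """Defects unconditionally."""
    return "D"


def tit_for_tat(opponent_history: List[str]) -> str:
    """Cooperates first, then copies the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"


def play(a: Strategy, b: Strategy, rounds: int = 10) -> Tuple[int, int]:
    """Play a fixed-length match and return the total scores (a, b)."""
    history_a: List[str] = []
    history_b: List[str] = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = a(history_b)  # each player sees only the other's past moves
        move_b = b(history_a)
        payoff_a, payoff_b = PAYOFF[(move_a, move_b)]
        score_a += payoff_a
        score_b += payoff_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print(play(always_cooperate, always_defect))  # defector vs. non-punisher
    print(play(tit_for_tat, always_defect))       # defector vs. punisher
    print(play(tit_for_tat, tit_for_tat))         # mutual cooperation
```

Under these assumptions, a defector scores 50 against an unconditional cooperator but only 14 against tit-for-tat, while two tit-for-tat players score 30 each. Defection is the winning move among entities that can't retaliate and a losing one among entities that can.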

Depends on offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors are controlling 90% of AI-relevant compute then it seems plausible that they could defend against 10% of the compute being controlled by misaligned AGI or other bad actors - by denying them resources, by hardening core infrastructure, via MAD, etc.

I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:

  1. Both aim to eventually end up in a state of existential security, where nobody can ever build an unaligned AI that destroys the world. Both have to deal with the fact that power is currently broadly distributed in the world, so most plausible stories in which we end up with existential security will involve the actions of thousands if not millions of people, distributed over decades or even centuries. 
  2. Pivotal acts have stronger claims to impact, but generally weaker claims about the sign of that impact - actually realistic pivotal-seeming acts like "unilaterally deploy a friendly-seeming AI singleton" or "institute a stable global totalitarianism" are extremely, existentially dangerous. If someone identifies a pivotal-seeming act that is actually robustly positive, I'll be the first to sign on.
  3. In contrast, gradual steering proposals like "improve AI lab communication" or "improve interpretability" have weaker claims to impact, but stronger claims to being net positive across many possible worlds, and are much less subject to multi-agent problems like races and the unilateralist's curse.
  4. True, complete existential safety probably requires some measure of "solving politics" and locking in current human values, hence may not be desirable. Like what if the Long Reflection decides that the negative utilitarians are right and the world should in fact be destroyed? I won't put high credence on that, but there is some level of accidental existential risk that we should be willing to accept in order to not lock in our values.

You might find AI Safety Endgame Stories helpful - I wrote it last week to try to answer this exact question, covering a broad array of (mostly non-pivotal-act) success stories from technical and non-technical interventions.

Nate's "how various plans miss the hard bits of the alignment challenge" might also be helpful as it communicates the "dynamics of doom" that success stories have to fight against.

One thing I would love is to have a categorization of safety stories by claims about the world. E.g. what does successful intervention look like in worlds where one or more of the following claims hold:

  • No serious global treaties on AI ever get signed.
  • Deceptive alignment turns out not to be a problem.
  • Mechanistic interpretability becomes impractical for large enough models.
  • CAIS turns out to be right, and AI agents simply aren't economically competitive.
  • Multi-agent training becomes the dominant paradigm for AI.
  • Due to a hardware / software / talent bottleneck there turns out to be one clear AI capabilities leader with nobody else even close.

These all seem like plausible worlds to me, and it would be great if we had more clarity about what worlds different interventions are optimizing for. Ideally we should have bets across all the plausible worlds in which intervention is tractable, and I think that's currently far from being true.

I don't mean to suggest "just supporting the companies" is a good strategy, but there are promising non-power-seeking strategies like "improve collaboration between the leading AI labs" that I think are worth biasing towards.

Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are actually so tightly bound by capitalist incentives that their leaders couldn't delay AI system deployment by at least a few months, and probably more like several years, before capitalist incentives in the form of shareholder lawsuits or new entrants that poach their key technical staff have a chance to materialize.

Interesting, I haven't seen anyone write about hardware-enabled attractor states, but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive. An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.

Fabricated options are products of incoherent thinking; what is the incoherence you're pointing out with policies that aim to delay existential catastrophe or reduce transaction costs between existing power centers?

I've considered starting an org that was either aimed at generating better alignment data or would do so as a side effect and this is really helpful - this kind of negative information is nearly impossible to find.

Is there a market niche for providing more interactive forms of human feedback, where it's important to have humans tightly in the loop with an ML process, rather than "send a batch to raters and get labels back in a few hours"? One reason RLHF is so little used is the difficulty of setting up this kind of human-in-the-loop infrastructure. Safety approaches like debate, amplification and factored cognition could also become competitive much faster if it were easier and faster to get complex human-in-the-loop pipelines running.

Maybe Surge already does this? But if not, you wouldn't necessarily want to compete with them on their core competency of recruiting and training human raters. Just use their raters (or Scale's), and build good reusable human-in-the-loop infrastructure, or maybe novel user interfaces that improve supervision quality.

I think a substantial fraction of ML researchers probably agree with Yann LeCun that AI safety will be solved "by default" in the course of making the AI systems useful. The crux is probably related to questions like how competent society's response will be, and maybe the likelihood of deceptive alignment.

Two points of disagreement though:

  • I don't think setting P(doom) = 10% indicates lack of engagement or imagination; Toby Ord in The Precipice also gives a 10% estimate for AI-derived x-risk this century, and I assume he's engaged pretty deeply with the alignment literature.
  • I don't think P(doom) = 10% or even 5% should be your threshold for "taking responsibility". I'm not sure I like the responsibility frame in general, but even a 1% chance of existential risk is big enough to outweigh almost any other moral duty in my mind.