Ivan Vendrov


Comments

  1. Agree that trust and cooperation are dual use, and I'm not sure how to think about this yet; perhaps the most important form of coordination is the one that prevents (directly or via substitution) harmful forms of coordination from arising.
  2. One reason I wouldn't call lack of altruism the root is that it's not clear how to intervene on it; it's like calling the laws of physics the root of all evil. I prefer to think about "how to reduce transaction costs to self-interested collaboration". I'm also less sure that a society of people with more altruistic motives will necessarily do better... the nice thing about self-interest is that your degree of care is proportional to your degree of knowledge about the situation. A society of extremely altruistic people who are constantly devoting resources to solving what they believe to be other people's problems may actually be less effective at ensuring flourishing.

You're right that the conclusion is quite underspecified - how exactly do we build such a cooperation machine?

I don't know yet, but my bet is more on engineering, product design, and infrastructure than on social science. More like building a better Reddit or Uber (or supporting infrastructure layers like WWW and the Internet) than like writing papers.

Would love to see this idea worked out a little more!

I like the "guardian" framing a lot! Besides the direct impact on human flourishing, I think a substantial fraction of x-risk comes from the deployment of superhumanly persuasive AI systems. It seems increasingly urgent that we deploy some kind of guardian technology that at least monitors, and ideally protects, against such superhuman persuaders.

Symbiosis is ubiquitous in the natural world, and is a good example of cooperation across what we normally would consider entity boundaries.

When I say the world selects for "cooperation" I mean it selects for entities that try to engage in positive-sum interactions with other entities, in contrast to entities that try to win zero-sum conflicts (power-seeking).

Agreed with the complicity point - as evo-sim experiments like Axelrod's showed us, selecting for cooperation requires entities that can punish defectors, a condition the world of "hammers" fails to satisfy.
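To make that concrete, here's a minimal iterated prisoner's dilemma sketch (the strategy names and payoff matrix are standard textbook choices, not anything from Axelrod's actual tournament code): a pure cooperator that can't retaliate gets exploited, while a strategy that can punish defection makes cooperation pay.

```python
# Minimal iterated prisoner's dilemma sketch: cooperation only wins
# when defectors can be punished by retaliation.

PAYOFFS = {  # (my_move, their_move) -> my payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def always_cooperate(opponent_history):   # a "hammer": never punishes defection
    return "C"

def always_defect(opponent_history):
    return "D"

def tit_for_tat(opponent_history):        # cooperates, but punishes the last defection
    return "C" if not opponent_history else opponent_history[-1]

def play(strategy_a, strategy_b, rounds=100):
    """Total payoffs for an iterated match; each strategy sees the other's past moves."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(hist_b), strategy_b(hist_a)
        score_a += PAYOFFS[(move_a, move_b)]
        score_b += PAYOFFS[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

if __name__ == "__main__":
    # Against a pure cooperator, defection is strictly better...
    print(play(always_defect, always_cooperate))  # (500, 0)
    # ...but against a strategy that punishes, cooperation pays.
    print(play(always_defect, tit_for_tat))       # (104, 99)
    print(play(tit_for_tat, tit_for_tat))         # (300, 300)
```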

Depends on offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors control 90% of AI-relevant compute, then it seems plausible that they could defend against the 10% of compute controlled by misaligned AGI or other bad actors - by denying them resources, by hardening core infrastructure, via MAD, etc.

I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:

  1. Both aim to eventually end up in a state of existential security, where nobody can ever build an unaligned AI that destroys the world. Both have to deal with the fact that power is currently broadly distributed in the world, so most plausible stories in which we end up with existential security will involve the actions of thousands if not millions of people, distributed over decades or even centuries. 
  2. Pivotal acts have stronger claims of impact, but generally have weaker claims of the sign of that impact - actually realistic pivotal-seeming acts like "unilaterally deploy a friendly-seeming AI singleton" or "institute a stable global totalitarianism" are extremely, existentially dangerous. If someone identifies a pivotal-seeming act that is actually robustly positive, I'll be the first to sign on.
  3. In contrast, gradual steering proposals like "improve AI lab communication" or "improve interpretability" have weaker claims to impact, but stronger claims to being net positive across many possible worlds, and are much less subject to multi-agent problems like races and the unilateralist's curse.
  4. True, complete existential safety probably requires some measure of "solving politics" and locking in current human values, hence may not be desirable. Like what if the Long Reflection decides that the negative utilitarians are right and the world should in fact be destroyed? I don't put high credence on that, but there is some level of accidental existential risk we should be willing to accept in order to not lock in our values.
Answer by Ivan Vendrov

You might find AI Safety Endgame Stories helpful - I wrote it last week to try to answer this exact question, covering a broad array of (mostly non-pivotal-act) success stories from technical and non-technical interventions.

Nate's "how various plans miss the hard bits of the alignment challenge" might also be helpful as it communicates the "dynamics of doom" that success stories have to fight against.

One thing I would love is to have a categorization of safety stories by claims about the world. E.g what does successful intervention look like in worlds where one or more of the following claims hold:

  • No serious global treaties on AI ever get signed.
  • Deceptive alignment turns out not to be a problem.
  • Mechanistic interpretability becomes impractical for large enough models.
  • CAIS turns out to be right, and AI agents simply aren't economically competitive.
  • Multi-agent training becomes the dominant paradigm for AI.
  • Due to a hardware / software / talent bottleneck, there turns out to be one clear AI capabilities leader, with nobody else even close.

These all seem like plausible worlds to me, and it would be great if we had more clarity about what worlds different interventions are optimizing for. Ideally we should have bets across all the plausible worlds in which intervention is tractable, and I think that's currently far from being true.
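As a toy illustration of the kind of categorization I have in mind, here's a sketch in Python; every claim name, intervention name, and mapping below is a hypothetical placeholder rather than an actual assessment.

```python
# Toy sketch: index interventions by the world-claims under which they still help,
# then query for interventions robust across an assumed set of claims.

WORLD_CLAIMS = {
    "no_global_treaties",
    "deceptive_alignment_not_a_problem",
    "interpretability_impractical_at_scale",
    "cais_is_right",
    "multi_agent_training_dominant",
    "single_clear_capabilities_leader",
}

# For each (hypothetical) intervention, the claims under which it still seems useful.
INTERVENTIONS = {
    "improve_lab_communication": {"no_global_treaties", "single_clear_capabilities_leader"},
    "mechanistic_interpretability": WORLD_CLAIMS - {"interpretability_impractical_at_scale"},
    "compute_governance": {"no_global_treaties", "single_clear_capabilities_leader"},
}

def robust_interventions(assumed_world: set[str]) -> list[str]:
    """Interventions that remain useful under every claim in the assumed world."""
    return [name for name, ok_in in INTERVENTIONS.items() if assumed_world <= ok_in]

# E.g. which interventions still look useful if no global treaties ever get signed?
print(robust_interventions({"no_global_treaties"}))
```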

I don't mean to suggest "just supporting the companies" is a good strategy, but there are promising non-power-seeking strategies like "improve collaboration between the leading AI labs" that I think are worth biasing towards.

Maybe the crux is how strongly capitalist incentives bind AI lab behavior. I think none of the currently leading AI labs (OpenAI, DeepMind, Google Brain) are so tightly bound by capitalist incentives that their leaders couldn't delay AI system deployment by at least a few months, and probably more like several years, before those incentives - shareholder lawsuits, or new entrants poaching their key technical staff - have a chance to materialize.

Interesting, I haven't seen anyone write about hardware-enabled attractor states but they do seem very promising because of just how decisive hardware is in determining which algorithms are competitive.  An extreme version of this would be specialized hardware letting CAIS outcompete monolithic AGI. But even weaker versions would lead to major interpretability and safety benefits.
