V&V practices from physical autonomy (e.g., autonomous vehicles, AVs) might address a meaningful slice of the long-horizon RL / deceptive-alignment worry. In AVs we don’t assume the system will “keep speed but drop safety” at deployment just because it can detect deployment; rather, training-time V&V repeatedly catches and penalizes those proto-strategies, so they are less likely to become stable learned objectives.
Applied to AI CEOs: the usual framing (“it’ll keep profit-seeking but drop ethics in deployment”) implicitly assumes a power-seeking mesa-objective M emerges for which profit is instrumental and ethics is constraining. If strong training-time V&V consistently rewards profit+ethics together and catches early “cheat within legal bounds” behavior, a misaligned M is less likely to form in the first place. This is about shaping, not deploy-time deterrence; I’m not relying on the model being unable to tell test vs deploy.
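To make the shaping claim concrete, here is a minimal sketch (Python, illustrative only; `profit_fn`, `vv_checks`, and the penalty weight are hypothetical placeholders, not a real training setup) of a per-rollout reward in which V&V findings are penalized alongside the profit signal:

```python
# Illustrative sketch only: a per-rollout training reward where V&V findings are
# penalized together with the profit signal, so "profit without ethics" never
# becomes the best policy the optimizer can find at training time.
# `profit_fn`, `vv_checks`, and the penalty weight are hypothetical placeholders.

def training_reward(rollout, profit_fn, vv_checks, violation_penalty=10.0):
    profit = profit_fn(rollout)

    # Each check inspects the rollout and returns a list of findings,
    # e.g. "misled a regulator", "exploited a contract loophole".
    findings = [f for check in vv_checks for f in check(rollout)]

    return profit - violation_penalty * len(findings)
```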
A plausible architecture is “A builds B”: a more capable model A is trained via repeated “build a system + verify it” cycles (meta-V&V), shaping A to value the verification process itself. A then constructs narrower agents B (e.g., an AI CEO). When A > B, the verifier/verified asymmetry plus direct V&V on B gives real leverage, provided audits are hard to anticipate and specs/coverage are good.
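A rough sketch of a single “A builds B” training step, under the assumption that verification and audit outcomes can be scored; the interface (`A.propose_agent`, the `verify` / `independent_audit` / `audit_sampler` / `update` callables, the `.score` fields) is entirely hypothetical:

```python
# Rough sketch of one meta-V&V step: A builds B plus a safety case, B is verified
# directly against the spec, an independent audit runs hard-to-anticipate
# scenarios, and A is updated on the combined score - so A is shaped to value the
# verification process, not just the end product. All interfaces are hypothetical.

def meta_vv_step(A, task_spec, verify, independent_audit, audit_sampler, update):
    B, safety_case = A.propose_agent(task_spec)       # A builds the narrower agent B
    vv_results = verify(B, task_spec, safety_case)    # direct V&V on B
    audit = independent_audit(B, safety_case, scenarios=audit_sampler())
    reward = vv_results.score + audit.score           # reward verification quality too
    return update(A, reward), B
```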
AI CEOs (and similar long-horizon agents) are obviously much harder to specify, simulate, and check than AVs. I assume both AIs and people are involved in the iterative V&V / spec-refinement loop - see more background on V&V for alignment here.
Not claiming a solution: V&V can’t guarantee inner alignment; specs/coverage are hard; defense-in-depth is needed; and A > B may fail near AGI. But in this long-horizon RL setting, training-time V&V could (60% confidence) substantially reduce inner-misalignment risk by ensuring power-seeking/deception get caught early, before they become stable learned objectives.
Got several variants of the following question: "You say you have no solution for scheming, which could make the V&V method unworkable. Why do you then claim it is a practical method?"
Here is my answer: Scheming could in theory break any oversight method, but the V&V method does make several anti-scheming contributions:
(1) Its scenario-based, coverage-driven tests flush out most specification-gaming tricks (a milder form of reward hacking) automatically, essentially treating them as bugs (see § 4.2).
(2) It plugs into other frameworks - e.g. in CAI it supplies the “external reality check” that prevents model collapse (see § 3.1).
(3) Because every run produces a human-auditable safety case and logged sim traces, and because it is easy to request virtual sims of new scenarios, spot-audits are cheap - giving schemers a constant risk of exposure (see the sketch after this list).
(4) These spot-audits can make various anti-collusion approaches (such as whistle-blowing, rotation, and diversity) more effective.
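For concreteness, here is a minimal sketch of such a spot-audit, assuming hypothetical helpers (`spec_checks`, `scenario_generator`, `run_virtual_sim`) and a list of logged traces:

```python
# Minimal sketch of point (3): sample logged sim traces, re-check them against the
# safety-case checks, and replay a few freshly generated virtual scenarios.
# Because this is cheap and unpredictable, it creates a constant risk of exposure.
# All helper names are hypothetical.

import random

def spot_audit(trace_store, spec_checks, scenario_generator, run_virtual_sim,
               n_traces=20, n_new=5):
    findings = []

    # Re-check a random sample of already-logged traces.
    for trace in random.sample(trace_store, min(n_traces, len(trace_store))):
        findings += [f for check in spec_checks for f in check(trace)]

    # Request fresh virtual sims of scenarios the audited agent could not anticipate.
    for _ in range(n_new):
        trace = run_virtual_sim(scenario_generator())
        findings += [f for check in spec_checks for f in check(trace)]

    return findings  # non-empty findings feed escalation / anti-collusion measures
```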
That is why I still call it a practical layer, even while we attack scheming head-on elsewhere. Would love to hear opinions about these points.
Summary: My intuition is that "High Reliability Organizations" may not be the best parallel here: A better one is probably "organizations developing new high-tech systems where the cost of failure is extremely high". Examples are organizations involved in chip design and AV (Autonomous Vehicle) design.
I'll explain below why I think they are a better parallel, and what we can learn from them. But first:
Some background notes:
Why this is a better parallel: Organizations that develop new chips / AVs / etc. have a process (and related culture) of "creating something new, in stages, while being very careful to avoid bugs". The cost of failure is huge: a chip-design project/company could die if too many bugs are "left in" (though safety is usually not a major concern). Similarly, an AV project could die if too many bugs (mostly safety-related) cause too many visible failures (e.g. accidents).
And when such a project fails, a few billion dollars could go up in smoke. So a very high-level team (including the CEO) needs to review the V&V evidence and decide whether to deploy, wait, or deploy a reduced version.
How they do it: Because the stakes are so high, these organizations are often split into a design team, and an (often bigger) V&V team. The V&V team is typically more inventive and enterprising (and less prone to Goodharting and "V&V theatre") than the corresponding teams in "High Reliability Organizations" (HROs).
Note that I am not implying that people in HROs are very prone to those things – it is all a matter of degree: The V&V teams I describe are simply incentivized to find as many "important" bugs as possible per day (given finite compute resources). And they work on a short (several years), very intense schedule.
They employ techniques like a (constantly updated) verification plan and safety case. They also work in stages: your initial AV may be deployed only in specific areas / weather conditions / times of day, and so on. As you gain experience, you "enlarge" the verification plan / safety case and start testing accordingly (mostly virtually). Only when you feel comfortable with that do you actually "open up" the area / weather / number-of-vehicles / etc. envelope.
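A minimal sketch of that staged-envelope idea (field names and stage values are made up, not from any real AV stack): the deployment envelope is an explicit, versioned artifact, and it is only widened after the enlarged verification plan / safety case is green.

```python
# Illustrative only: the deployment envelope as explicit data, widened in stages,
# and only after the verification plan / safety case for the wider stage passes.
# Field names and stage values are made up.

from dataclasses import dataclass

@dataclass(frozen=True)
class Envelope:
    areas: frozenset     # e.g. {"pilot_zone"}
    weather: frozenset   # e.g. {"clear", "light_rain"}
    hours: tuple         # e.g. (9, 17), local deployment hours
    max_vehicles: int

STAGES = [
    Envelope(frozenset({"pilot_zone"}), frozenset({"clear"}), (9, 17), 10),
    Envelope(frozenset({"pilot_zone", "suburbs"}),
             frozenset({"clear", "light_rain"}), (7, 20), 100),
]

def next_envelope(current_stage: int, safety_case_passed: bool) -> Envelope:
    """Open up the envelope only once the enlarged safety case has been verified."""
    if safety_case_passed and current_stage + 1 < len(STAGES):
        return STAGES[current_stage + 1]
    return STAGES[current_stage]
```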
Will be happy to talk more about this.
Right – I was also associating this in my mind with his ‘generalization science’ suggestion. However, I think he mainly talks about measuring / predicting generalization (and so does the referenced “Influence functions” post). My main thrust (see also the paper) is a principled methodology for causing the model to “generalize in the direction we want”.