Matt Bamberger — LessWrong

When is it important that open-weight models aren't released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.

Matt Bamberger4mo31

All of this seems solid, but it seems to me there are two additional considerations that push in the opposite direction:

COVID and its aftermath seem to suggest that pandemics make society dumber and more reactive. I wonder if a surge in bioterror would reduce humanity's decision-making capability at a critical time?
Releasing highly capable open weights models would seem to increase existential risk by bringing near-SOTA capabilities to more actors. (For example, North Korea is probably incapable of building a frontier model from scratch, but might be able to perform large-scale fine-tuning to obtain a variety of problematic capabilities including but not limited to CBRN development).

Self-Coordinated Deception in Current AI Models

Matt Bamberger4mo10

Interesting work! I wonder if a more successful way to uncover this kind of deception would be to iterate on what questions you ask each instance you interrogate?

As a simple example, if instance A tells you it needs an extension because it was helping its brother all evening, you might begin by telling instance B that extensions are fine but asking why it didn't ask for one sooner, in the hope of shifting the focal point.

More realistically, if you think the model is sandbagging about interpretability results, you might give different instances somewhat different interpretability assignments, in the hope of exposing contradictory performance patterns.

This game quickly gets very complex, and it's more or less axiomatic that humans lose complex games when playing against superhuman AI. But perhaps there's a window where these techniques might be useful?

New scorecard evaluating AI companies on safety

Matt Bamberger5mo10

I definitely find the presentation useful. In particular, the ability to drill down on each block is great (though it took me a moment to figure out how that worked).

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments