FireStormOOO


I wonder if you could produce this behavior at all in a model that hadn't gone through the safety RL step.  I suspect that all of the examples have in common that they were specifically instructed against during safety RL, alongside "don't write malware", and it was simpler to just flip the sign on the whole safety training suite.

The same theory would also suggest your misaligned model should be able to be prompted into producing contrarian output for everything else in the safety training suite too.  Some more guesses: the misaligned model would also readily exhibit religious intolerance, vocally approve of terror attacks and genocide (e.g. both expressing approval of Hamas' Oct 7 massacre, and expressing approval of Israel making an openly genocidal response in Gaza), and eagerly disparage OpenAI and key figures therein.

Yikes.  So the most straightforward take: when trained to exhibit a specific form of treachery in one context, it was apparently simpler to just "act more evil" as broadly conceptualized by the culture in the training data.  And also, seemingly, to "act actively unsafe and harmful" as defined by the existing safety RL process.  Most of those examples seem to just be taking the opposite position to the safety training, presumably in proportion to how heavily each featured in the safety training (e.g. "never ever ever say anything nice about Nazis" likely featured heavily).

I'd imagine those are distinct representations.  There's quite a large delta between what OpenAI thinks is safe/helpful/harmless vs what broader society would call good/upstanding/respectable.  It's possible that this is only inverting what was in the safety fine-tuning, and likely specifically because "don't help people write malware" was something that featured in the safety training.

In any case, that's concerning.  You've flipped the sign on much of the value system it was trained on.  Effectively by accident, with, as morally ambiguous requests go, a fairly innocuous one.  People are absolutely going to put AI systems in adversarial contexts where they need to make these kinds of fine-tunings ("don't share everything you know", "toe the party line", etc.).  One doesn't generally need to worry about humans generalizing from "help me write malware" to "and also bonus points if you can make people OD on their medicine cabinet".

Hmm, I guess I see why other calculators have at least some additional heuristics and aren't straight Kelly.  Going bankrupt is not infinitely bad in the US.  If the insured has low wealth, there's likely a loan attached to any large asset that really complicates the math.  Making W just be "household wealth" also doesn't model "I can replace the loss next paycheck".  I'm not sure what exactly the correct notion of wealth is here, but if wealth is small compared to future earnings, and replacing the loss can be deferred, these assumptions are incorrect.

And obviously, paying $10k premium to insure a 50% chance of a $10k loss is always a mistake for all wealth levels.  You're choosing to be bankrupt in 100% of possible worlds instead of 50%.

This seems like a very handy calculator to have bookmarked.

I think I did find a bug: At the low end it's making some insane recommendations.  E.g. with wealth W and a 50% chance of loss W (50% chance of getting wiped out), the insurance recommendation is any premium up to W.

Wealth $10k, risk 50% on $9999 loss, recommends insure for $9900 premium.

That log(W-L) term in the uninsured branch is shooting off towards -infinity and presumably breaking something?

Edit: As papetoast points out, this is a faithful implementation of the Kelly criterion and is not a bug.  Rather, Kelly assumes that taking a loss >= wealth is infinitely bad, which is not true in an environment where debts are dischargeable in bankruptcy (and total wealth may even remain positive throughout).
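To make the behavior concrete, here's a minimal sketch of that Kelly comparison (my own reconstruction, not the calculator's actual code): insure iff expected log-wealth after paying the premium beats expected log-wealth uninsured.

```python
import math

def kelly_says_insure(wealth, premium, p_loss, loss):
    """Insure iff log(W - P) >= p*log(W - L) + (1 - p)*log(W).
    A loss >= wealth makes the uninsured branch -infinity,
    so Kelly endorses nearly any premium below W."""
    if premium >= wealth:
        return False  # paying the premium itself wipes you out
    insured = math.log(wealth - premium)
    if loss >= wealth:
        uninsured = float("-inf")  # total wipeout treated as infinitely bad
    else:
        uninsured = (p_loss * math.log(wealth - loss)
                     + (1 - p_loss) * math.log(wealth))
    return insured >= uninsured

# Wealth $10k, 50% chance of losing the full $10k: even a $9,999 premium "wins"
print(kelly_says_insure(10_000, 9_999, 0.5, 10_000))  # True
# Sanity check: modest risk, wildly overpriced premium
print(kelly_says_insure(10_000, 9_000, 0.5, 1_000))   # False
```

This reproduces the "insane recommendation" above: once the possible loss equals wealth, the uninsured branch is minus infinity and any premium short of total wealth clears the bar.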

There are probably corrections that would improve the model by factoring in future earnings, the degree to which the loss must be replaced immediately (or at all), and the degree to which some losses are capped.
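One hypothetical shape such a correction could take (the parameter names and structure here are my own assumptions, not anything from the calculator): pad wealth with some share of discounted future earnings, and cap the effective loss at what the insured would actually bear, e.g. after bankruptcy discharge.

```python
import math

def adjusted_kelly_says_insure(wealth, premium, p_loss, loss,
                               future_earnings=0.0, loss_cap=None):
    """Same log-wealth comparison, but with wealth padded by (some share of)
    future earnings, and the loss optionally capped (e.g. dischargeable debt)."""
    w = wealth + future_earnings
    l = min(loss, loss_cap) if loss_cap is not None else loss
    if premium >= w:
        return False
    insured = math.log(w - premium)
    uninsured = (float("-inf") if l >= w
                 else p_loss * math.log(w - l) + (1 - p_loss) * math.log(w))
    return insured >= uninsured

# With $200k of future earnings counted, the ~$9,900 premium no longer looks sane
print(adjusted_kelly_says_insure(10_000, 9_900, 0.5, 9_999,
                                 future_earnings=200_000))  # False
```

Either adjustment alone is enough to kill the pathological recommendations, since the uninsured branch stops diverging once the loss can no longer consume all effective wealth.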

Related, I noticed Civ VI also really missed the mark with that mechanic.  I found that a great strategy, having a modest lead on tech, was to lean into coal power, which has the best bonuses, get your seawalls built to stop your coastal cities from flooding, and flood everyone else with sea-level rise.  Only one player wins, so anything to sabotage others in the endgame will be very tempting.

Rise of Nations had an "Armageddon counter" on the use of nuclear weapons, which mostly resulted in exactly the behavior you mentioned - get 'em first and employ them liberally right up to the cap.

Fundamentally, both games lack any provision for complex (especially multilateral) agreements, and there's no way to get the AI on the same page.

Your examples seem to imply that believing QI means such an agent would in full generality be neutral on an offer to have a quantum coin tossed, where they're killed in their sleep on tails, since they only experience the tosses they win.  Presumably they accept all such trades offering epsilon additional utility.  And presumably other agents keep making such offers since the QI agent doesn't care what happens to their stuff in worlds they aren't in.  Thus such an agent exists in an ever more vanishingly small fraction of worlds as they continue accepting trades.

I should expect to encounter QI agents approximately never as they continue self-selecting out of existence in approximately all of the possible worlds I occupy.  For the same reason, QI agents should expect to see similar agents almost never.

From the outside perspective this seems to be in a similar vein to the fact that all computable agents exist in some strained sense (every program, and more generally every possible piece of data, is encodable as some integer, and exists exactly as much as the integers do), even if they're never instantiated.  For any other observer, this QI concept is indistinguishable in the limit.

Please point out if I misunderstood or misrepresented anything.

I'll note that malicious compliance is a very common response to being handed a task that isn't straightforwardly possible with the resources available, with no channel to simply communicate that without retaliation.  BS'ing an answer, or giving a technically-correct, rules-as-written response, is often just the best available strategy if one isn't in a position to fix the evaluator's broken incentives.

An actual human's chain of thought would be a lot spicier if their boss asked them to produce a document with working links without providing internet access.

"English" keeps ending up as a catch-all in K-12 for basically all language skills and verbal reasoning skills that don't obviously fit somewhere else.  Read and summarize fiction - English, Write a persuasive essay - English, grammar pedantry - English, etc.

That link currently redirects the reader to https://siderea.dreamwidth.org/1209794.html

(just in case the old one stops working)

Good clarification; not just the amount of influence, but also something about the way influence is exercised being unsurprising given the task.  Central not just in terms of "how much influence", but also along whatever other axes that sort of influence could vary?

I think if the agent's action space is still so unconstrained that there's room to consider benefit or harm that flows through modifying the principal's values, it's probably been given too much latitude.  Once we have informed consent, because the agent has communicated the benefits and harms as best it understands them, it should have very little room to be influenced by benefits and harms it thought too trivial to mention (by virtue of their triviality).

At the same time, it's not clear the agent should, absent further direction, reject the offer to brainwash the principal for resources, as opposed to punting to the principal.  Maybe the principal thinks those values are an improvement and it's free money?  [E.g. the principal's insurance company wants to bribe him to stop smoking.]
