In Anthropic's Alignment Risk Report update for Mythos, they claim:
I don't think Anthropic (or anyone) has an achievable path to keeping risk low if AI development proceeds as fast as Anthropic expects (or as fast as I expect). Anthropic could (and hopefully will) take actions that significantly reduce the risk, but those actions won't keep the risk low.
More precisely: I do not think Anthropic has an achievable (>10% likely) path that keeps the aggregate existential risk they impose below 2%, as assessed in advance by a reasonable evaluator (the notion of risk the risk report is supposed to estimate). Anthropic might avoid causing an existential catastrophe (in fact, I happen to think this is likely), but they will impose quite a bit of risk along the way (supposing they succeed at their stated intention of being a leading company building powerful AI within the next 5 years).
My understanding is that Anthropic employees (especially the Anthropic employees writing this report) often don't themselves believe there is an achievable path to keeping risk low if Anthropic builds powerful AI / ASI in the next 5 years, so the text seems incorrect or misleading. See, e.g., Holden's most recent 80k episode or his post when RSP v3 came out.
I wish Anthropic communicated more accurately about future risk in their risk report (which, of all places, is supposed to avoid spin). Or, if they're unwilling to do this, it would be better not to comment at all.
(Cross-posted from X/Twitter.)