There's a lot of anger over the (silent) safeguards around frontier LLM development. It does seem like they make the model unusable for safety research and potentially have worrying second-order effects (do we really want to be teaching models it's okay to sandbag?), but to steelman the case for them:
The reason for a silent degradation would be to make jailbreaking much harder. At first my response to this was: why wouldn't they do the same for bio/cyber, where jailbreaks would be worse? After seeing examples of the Bio classifier, I think the answer is that Anthropic is okay with a lot of overrefusals on Bio (e.g. "How does the mitochondria work?"). It's possible that a similarly strict strategy for frontier model development would make the model unusable for coding in general. Keeping the degradation secret lets them avoid overtuning the safeguards.
If they cannot set the refusal classifier well for cyber and bio, what gives you confidence that they would classify "frontier LLM development" well? Not only that, you would need to second-guess every response you get. I'd rather they just overrefuse on frontier LLM development questions (since they clearly don't care about overrefusals).
(since they clearly don't care about overrefusals).
(This particular claim seems false/overstated. Clearly, overall, they are willing to accept overrefusals. That doesn't mean they "don't care about them". Maybe they don't, but much more likely it just seems like a reasonable tradeoff to them.)
Does anyone know if the proceeds/profits of “If Anyone Builds It, Everyone Dies” are going to MIRI or another charity? I’m going to read it either way, but I really think that if you’re going to make the “buy this book for the good of humanity” pitch, you shouldn’t be profiting off it.
Recent days have seen lots of claims that AI is a bubble. Even assuming that AI is correctly priced, the people making those claims are likely to be able to claim victory, at least naively. This will be true of any asset class with a very high upside. Let's define $F$ as the true fundamental value of an asset class at a given time and $p(F)$ as the best possible estimate of the probability distribution of $F$. If the asset class is priced correctly, the market price will be the expected value, $mp = E[F] = \int F \, p(F) \, dF$. If we say that an asset class will be naively considered a bubble in hindsight if $mp > F$, we can define $p(B)$ as the probability that the asset class appears to be a bubble in retrospect: $p(B) = P(F < mp) = \int_{F < mp} p(F) \, dF$. For example, for a probability distribution where 50% of the expected value lies in the top 10% of best-case scenarios (say, a 90% chance of a low outcome and a 10% chance of an outcome nine times larger), there is a 90% chance that the true fundamental value of the asset class is below the current market price. To really determine whether there was a bubble, you would need to research the topic deeply and try to determine whether the market price at the time was in line with the expected fundamental value, given the information available at the time.
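To make the arithmetic concrete, here is a minimal sketch. It assumes a made-up two-outcome asset (the numbers, and the function names `market_price` and `bubble_probability`, are purely illustrative and not taken from any real market). The fair price is taken to be the expected fundamental value, and the "naive bubble" probability is the chance that the realized value ends up below that price.

```python
def market_price(outcomes):
    """Fair price under the assumption above: the expected fundamental value."""
    return sum(value * prob for value, prob in outcomes)


def bubble_probability(outcomes):
    """Probability that the realized fundamental value lands below the fair price."""
    mp = market_price(outcomes)
    return sum(prob for value, prob in outcomes if value < mp)


# Hypothetical asset: 90% chance it is worth 1, 10% chance it is worth 9.
# Each branch contributes 0.9 to the expected value, so the top 10% of
# scenarios carry half of it.
outcomes = [(1.0, 0.9), (9.0, 0.1)]

mp = market_price(outcomes)
p_b = bubble_probability(outcomes)
top_share = (9.0 * 0.1) / mp  # share of expected value in the best-case branch

print(f"market price: {mp:.2f}")
print(f"share of value in top 10% of scenarios: {top_share:.0%}")
print(f"probability of looking like a bubble in hindsight: {p_b:.0%}")
```

In this toy case the asset is priced exactly at its expected value, yet in nine out of ten worlds the realized value sits below that price. With a continuous, less extreme distribution the number would differ, but the skew pushes it in the same direction.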