On X/Twitter, Jerry Wei (an Anthropic employee working on misuse/safeguards) wrote about why Anthropic ended up thinking that training-data filtering isn't that useful for CBRN misuse countermeasures:
...An idea that sometimes comes up for preventing AI misuse is filtering pre-training data so that the AI model simply doesn't know much about some key dangerous topic. At Anthropic, where we care a lot about reducing risk of misuse, we looked into this approach for chemical and biological weapons production, but we didn’t think it was the right fit. Here
Jerry Wei writes:
We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn't enough assurance against misuse).
I think his critique is this:
Suppose we had a perfect filtering system, such that the dangerous knowledge has zero mutual information with the model weights:
People often ask whether GPT-5, GPT-5.1, and GPT-5.2 use the same base model. I have no private information, but I think there's a compelling argument that AI developers should update their base models fairly often. The argument comes from the following observations:
The supposed drop in inference cost at a given level of capability is about benchmark performance (I'm skeptical it truly applies to real-world uses at a similar level), which is largely about post-training (or mid-training with synthetic data) and doesn't have much use for new pretrains. If there is already some pretrain of a relevant size from the last ~annual pretraining effort, and current post-training methods let it be put to use in a new role because they manage to make it good enough for that, they can just use the older pretrain (possibly refreshing...
Many Worlds seems fake because doesn't that imply the universe is tracking a complex number with magnitude on the order of 10^-10^100 for the branch that we live in? Since the worlds have to split up all the time.
All the other quantities (number of atoms in the universe, Planck times since the Big Bang) are only a singly-stacked exponential like 10^100.
Also, what's the precision of these amplitude numbers?
I have a basic understanding of quantum computing. So in my understanding, sometimes two branches can cancel each other out, or come back to the same state. Still, I would think there is enough "butterfly effect"-type stuff that many worlds get split up to never meet again. And for each of those there's a really small complex number.
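For concreteness, here's the back-of-envelope arithmetic behind those figures (a rough sketch; the halving-per-split assumption is purely illustrative, not a claim about actual decoherence rates):

```python
from math import log10
import sys

# Back-of-envelope arithmetic (illustrative only): assume each "split" cuts
# our branch's amplitude magnitude roughly in half.

target_log10 = -10**100            # the 10^-10^100 figure from the post

# Number of halvings needed so that (1/2)^n has a log10 magnitude of -10^100:
splits_needed = -target_log10 / log10(2)
print(f"splits needed: ~{splits_needed:.2e}")          # ~3.3e100

# Compare with a singly-stacked exponential like the number of atoms (~10^80):
print(f"ratio to 10^80 atoms: ~{splits_needed / 10**80:.2e}")

# On precision: an IEEE double bottoms out around 1e-308 in its normal range,
# so a magnitude like 10^-10^100 can only be tracked via its logarithm.
print(f"smallest positive normal double: {sys.float_info.min:.3e}")
```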
Prestige/reputation is supposed to work like the pagerank algorithm: every person has a little bit of prestige to distribute, it flows to a few major sinks, and those sinks can themselves distribute it to the people they respect.
Real prestige isn’t like this, of course. You can improve people’s perception of your prestige, and thus your actual prestige, with the right clothes or website design. But you can also hack the pagerank algorithm. For example, let’s say we have 3 low status entities: a website, a blogger, and a small meeting of subject matte...
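To make the pagerank framing concrete, here's a minimal power-iteration sketch (the node names and link structure are made up to echo the three-entity example above):

```python
# A minimal pagerank power-iteration sketch (illustrative; the node names and
# link structure are invented to echo the "3 low status entities" example).

def pagerank(links, damping=0.85, iters=50):
    """links maps each node to the list of nodes it endorses."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, outs in links.items():
            for dst in outs:
                # Each node splits its current prestige among those it endorses.
                new[dst] += damping * rank[src] / len(outs)
        rank = new
    return rank

# The three low-status entities all endorse one another, so prestige circulates
# among them and each ends up ranked above an unendorsed outsider.
links = {
    "website": ["blogger", "meetup"],
    "blogger": ["website", "meetup"],
    "meetup":  ["website", "blogger"],
    "outsider": [],
}
print(pagerank(links))
```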
When we learn, we make many mistakes in our approach and adapt our strategy. And when we succeed, we tend to post only the final good version, not mentioning how many failed attempts we had.
And thus AI learns only from successes. It gets the wisdom of how to do things right, but not of what could go wrong.
I mostly use ChatGPT and Claude, and I have noticed that sometimes to solve my niche question they propose code that just does not work. It looks like a reasonable piece of code (according to intuition formed by mainstream languages), but appears to be unreada...
What are people's favorite arguments/articles/essays trying to lay out the simplest possible case for AI risk/danger?
Every single argument for AI danger/risk/safety I’ve seen seems to overcomplicate things. Either they have too many extraneous details, or they appeal to overly complex analogies, or they seem to spend much of their time responding to insider debates.
I might want to try my hand at writing the simplest possible argument that is still rigorous and clear, without being trapped by common pitfalls. To do that, I want to quickly survey the field so I can learn from the best existing work as well as avoid the mistakes they make.
I think The Briefing is pretty good, but I think it’s very hard to get right, and getting it right will look different for different audiences.
Follow-up to this post, wherein I was shocked to find that Claude Code failed to do a low-context task which took me 4 hours and involved some skills where I expected it would have significant advantages [1].
I kept going to see if Claude Code could eventually succeed. What happened instead was that it built a very impressive-looking 4000 LOC system to extract type and dependency injection information for my entire codebase and dump it into a sqlite database.
To my shock, the tool Claude built [2] actually worked. I ended up playing with ...
All that said, I definitely came away from this experiment with a strong intuition for exactly how it could take 20% longer to do things when you have LLM coding agents assisting you.
Could you elaborate, or does it boil down to "Helping Claude would have taken 2 days, and doing it on your own would have been faster"? I would be keen for patterns that help me distinguish between
Ilya's AGI predictions circa 2017 (Musk v. Altman, Dkt. 379-40):
...Within the next three years, robotics should be completely solved, AI should solve a long-standing unproven theorem, programming competitions should be won consistently by AIs, and there should be convincing chatbots (though no one should pass the Turing test). In as little as four years, each overnight experiment will feasibly use so much compute that there's an actual chance of waking up to AGI, given the right algorithm - and figuring out the algorithm will actually happen within 2-
Each year, we'll need to exponentially increase our hardware spend, but we have reason to believe AGI can ultimately be built with less than $10B in hardware.
SSI's compute spend is certainly a bet in this direction!
So I read another take on OpenAI's finances and was wondering: does anyone know why Altman is taking such a gamble, collecting enormous investments into new models in the hope that they'll generate sufficiently insane profits to make it worthwhile? Even ignoring the concerns around alignment etc., there's still the straightforward issue of "maybe the models are good and work fine but aren't good enough to pay back the investment".
Even if you did expect scaling to probably bring in huge profits, naively it'd still be wiser to pick a growth strategy that ...
In fact, OpenAI’s CFO has already floated the idea of a government “backstop” (bailout).
https://www.wsj.com/video/openai-cfo-would-support-federal-backstop-for-chip-investments/4F6C864C-7332-448B-A9B4-66C321E60FE7
"Utility" literally means usefulness, in other words instrumental value, but in decision theory and related fields like economics and AI alignment, it (as part of "utility function") is now associated with terminal/intrinsic value, almost the opposite thing (apparently through some quite convoluted history). Somehow this irony only occurred to me ~3 decades after learning about utility functions.
The only reason I believe myself to have "objective" moral worth is because I have subjective experience. Maybe more wordplay than irony, but submitted for your amusement.
An aligned superintelligence would work with goals of the same kind, even if it's aligned to early AGIs rather than humans. Goals-as-computations may be constant, like the code of a program may be constant, but what's known about its behavior isn't constant. And so the way it guides actions of an agent develops as it gets computed further, ultimately according to decisions of the underlying humans/AGIs (and their future iterations) in various hypothetical situations. Also, an uplifted (grown up) human could be a superintelligence personally, it's not a different kind of thing with respect to values it could have.
Back when I was in school and something came up on a test that required knowledge or skills I didn't have, I would often find the closest thing I did know how to do, and then do that thing even though it's not what the question asked for, in the hopes of getting partial credit.
Looking back, I'm sure that the teachers grading my tests were fully aware of what I was doing, and yet the strategy did work out for me often enough to be worth doing.
Anyway, working with LLM coding agents gives me sympathy for my teachers back then. LLMs are capable of doing a sign...
With all the discussion of age restrictions on social media, I wrote down a rough idea of how we could do age restrictions much more effectively than the current proposals. It's not typical LessWrong content but I'd still love any feedback from the LW community.
https://kaleb.ruscitti.ca/2026/01/18/private-age-verification.html
As I posted as a reply to the OP, the EU is specifically working on measures to enable age verification in a convenient, privacy-preserving, and accurate manner (an open-source app that checks the biometric data on your ID). More generally, parents being genuinely concerned about their children's safety seems so well established to me that it makes sense to assume they are being genuine in this discussion as well.
As another counter piece of evidence, Jonathan Haidt is probably the best known scientific advocate for social media restrictions for kids, and hi...
Plot idea: Isekaied from 2026 into some date in the past. Only goal: get cryogenically preserved into the glorious transhumanist singularity. How to influence the trajectory of history into a direction that would broadly enable this kind of future, while setting up the long-lasting infrastructure & incentives that would cryogenically preserve oneself for centuries to come?
I'm not going to write this, but I think this is a very interesting premise & problem (especially if thousands of years into the past) and would love to see someone build on it.
So...
If more people were egoistic in such a forward-looking way, the world would be better off for it.
Random note: Congressman Brad Sherman just held up If Anyone Builds It, Everyone Dies in a Congressional hearing and recommended it, saying (rough transcript, might be slight paraphrase): "they're [the AI companies] not really focused on the issue raised by this book, which I recommend, but the title tells it all, If Anyone Builds It Everyone Dies"
I think this is a clear and unambiguous example of the theory of change of the book having at least one success -- being an object that can literally be held and pointed to by someone in power.
This is awesome! It broadly aligns with my understanding of the situation, although it does miss some folks that are known to care a bunch about this from their public statements. Downloading the JSON to take a deeper look!
Strong upvoting the underlying post for Doing The Thing.
The dream is that prediction markets greatly outperform individual experts, but there's a limit on how much this can actually happen. The reason prediction markets aren't more useful is that you can only profit from a prediction market if gathering the information costs less than the money you'd make by trading on it.
Let's imagine I write down either "horse" or "donkey" on a slip of paper, and put that paper in a drawer in my kitchen. I then create a prediction market based on what's written on the paper. The market would sit at around 50%. Maybe people would...
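As a toy illustration of that cost-of-information point (all numbers are hypothetical; fees, liquidity limits, and price impact are ignored):

```python
# Toy numbers for the drawer example (all hypothetical).

market_price = 0.50          # current price of a "the paper says 'horse'" share
stake = 200.0                # dollars of shares you could buy near that price
cost_of_finding_out = 500.0  # what it would cost to actually learn the answer

# If you knew the answer for sure, each $1 spent at price p pays back $1/p,
# so profit per dollar staked is (1 - p) / p.
profit_if_certain = stake * (1 - market_price) / market_price

print(f"profit from trading on the information: ${profit_if_certain:.0f}")
print(f"cost of obtaining the information:      ${cost_of_finding_out:.0f}")
print("worth gathering" if profit_if_certain > cost_of_finding_out
      else "not worth gathering")
```

With numbers like these, nobody pays to open the drawer, so the market just sits near 50%.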
I agree that it's plausible there could be some benefit to creating an AI prediction market.
I mostly haven't taken any of the other AI benchmarks seriously, but I just looked into ForecastBench and surprisingly it seems to me to be worth taking seriously. (The other benchmarks are just like "hey, we promise there aren't similar problems in the LLM's training data! Trust us!") I notice their website suggests ForecastBench is a "proxy for general intelligence", so it seems like I'm not the only one who thinks forecasting and general intelligence might be rel...
Has anybody checked if finetuning LLMs to have inconsistent “behavior” degrades performance? Like you finetuned a model on a bunch of aligned tasks like writing secure code and offering compassionate responses to individuals in distress, but then you tried to specifically make it indifferent to animal welfare? It seems like that would create internal dissonance in the LLM which I would guess causes it to reason less effectively (since the character it’s playing is no longer consistent).
Just said to someone that I would by default read anything they wrote to me in a positive light, and that if they wanted to be mean to me in text, they should put '(mean)' after the statement.
Then realized that, if I had to put '(mean)' after everything I wrote on the internet that I wanted to read as slightly abrupt or confrontational, I would definitely be abrupt and confrontational on the internet less.
I am somewhat more confrontational than I endorse, and having to actually say to myself and the world that I was intending to be harsh, rather than simpl...
Yes exactly; I'm curious about how many more opportunities for greater intentionality/reflective-endorsement might be lurking in other areas, where I just haven't created the right test/handle (but may believe that I've created the right one).
I'm also mindful of the opposite failure mode, though, where attempting to surface something to yourself internally actually causes you to over-index on it, leading to paralysis, where the thing was only present in very small doses and your threshold was poorly calibrated.