Cross-posting from a Twitter thread responding to recent viral comments by @Richard_Ngo about EA, Anthropic, and AI safety as a 'fake field.' Posting here because I expect this to be quite unpopular on LW.
(original thread: https://x.com/CRSegerie/status/2056737155880493357)
AI safety in 2023–2026 was driven by evals, threat models, scary demos, model-organism work, RSPs, and voluntary commitments. Richard calls this "much more of a fake field" and says it "won't generalize".
Here's why I disagree - 1/10
1/ I agree with Anthropic being now the biggest le...
Anthropic visibly moved US executive posture, Senate hearings, frontier-lab norms, and the public conversation toward taking the risks seriously.
So, this makes Anthropic good because they might inspire the government eventually to shut down all the labs including (I hope) Anthropic?
I've been reading an anthropological study of the Australian aboriginals, The Native Tribes of Central Australia (Baldwin Spencer/Francis James Gillen, 1899), and found some parts interesting enough to share.
Content warning: description of gruesome (though consensual) mutilation.
Things that stood out about the aboriginals (highlighting not in the original text):
...Now, amongst the Australian natives wives are certainly lent, but only under strict rules; in the Arunta tribe for example no man will lend his wife to any one who does not belong to the particular group with which it is lawful for her to have marital relations—she is in fact, only lent to a man whom she calls Unawa, just as she calls her own husband, and though this may undoubtedly be spoken of as an act of hospitality, it may with equal justice be regarded as evidence of the very clear recognition of group relationship, and as evidence also in favour of
I have a bone to pick with high school chemistry textbooks.
When I took chemistry in high school, there was a section of the textbook that described some features of quantum mechanics and tried to explain electron orbitals. The orbitals chapter in particular made basically no sense; I was able to extract enough out of it to be able to do the homework problems, but most of the class was totally lost. When I took a semester of modern physics in college (and also read Eliezer's QM sequence), I learned that the reason the explanation made no sense was that it w...
Us physicists never beating the allegations!
Just give them Griffiths, and you'll do okay for teaching Quantum Mechanics, but for chemistry … you're correct, of course, that the Schrödinger equation is what all those heuristics about orbital hybridization are trying to approximate, and sure, you could add some lines emphasizing that at the beginning if it's not clear, but I don't think the pedagogy would be improved by dispensing with the heuristics altogether.
a theory of assistant personas and superhuman capabilities
so you have a language model. you train it to embody some specific personality--Claude, ChatGPT, whatever. one of the miracles of AI is that this mostly works and gives you something that is mostly trying to help you and not trying to murder you. i claim that this is mostly because of the SL training objective and if you do just the intense RL thing you get the originally predicted spicy alignment failures.
suppose you tell the LM that Claude is actually a superhuman aligned AI. can you get superhuma...
I think basically nobody called out this specific failure mode ("you know that thing where dictators become crazy because nobody's willing to push back on them? What if everybody had that in their pocket? :O"). In retrospect it's an obvious sci-fi premise, but afaik it was completely missed both on LW and elsewhere.
It seems to me that the hardest part of getting AI right will be coordination and governance, not technical alignment. Under this view, I think the EA/LW forums are net negative. They are really bad at creating coordination environments and are actively crowding out similar but more coordination/governance-focused environments from existing.
No disrespect intended.
In case some people want to see a (hopefully more filtered?) version of the 'learning diary' I mentioned before, I'll comment stuff as replies to this post.
These will usually be about stuff I learned before the date published - but the date on the entry should be roughly right. It'll also be more verbose, since I can't assume you already know what I do.
where have you been
It's incredibly easy to be fooled by the capabilities of the current top-performing tech (LLM agents). It's easy because they have a vast amount of training data to interpolate from.
This works fine to acquire capabilities within our existing data distribution of the world (one that is also easy to verify), but what happens when they go out of distribution?
LLMs perform poorly! Yet, people seem to think they can actually generalize to new problems. Why is that?
It's, again, the vastness of their training data. It makes it hard to distinguish between interpola...
Negation Neglect kinda makes sense to me. The argument goes that (at least in SFT though people think it generalizes to pretraining and RL) if you train/fine-tune on a text that starts with something like "the following are all lies that are not true: <claims> what we said before are all lies that are not true", the next-token completion nature of LLMs means they ingest the local claims as credible. Updating too much on the opening warning is too rough[1] for a greedy optimization process.
Inoculation Prompting also kinda makes sense to me. The argume...
Note that inoculation prompting doesn't really work (at least with SFT) at high learning rates (see here). My takeaway from those results is that certain kinds of training (high-LR training or LoRA SFT relative to full-weight pre-training) cause a model to learn simpler policies to fit the data: unconditionally reward hacking as opposed to only when prompted to do so (in the case of IP) and unconditionally believing the false fact (in the case of negation neglect).
Models clearly do learn to distinguish fiction from reality during pre-training: models don't...
The unlikely hypothetical scenario in which hantavirus is on track to become a pandemic is becoming less likely day by day. We continue to see Hondius-linked infections limited exclusively to people who were aboard.
Yet Polymarket's had the price of "yes" on the "hantavirus pandemic 2026" market sitting at 7.1-7.2 cents for the last four days. Day by day, this is looking more like a price that reflects the opportunity cost of holding shares of "no" through the rest of 2026 to make a 7.6% profit (coupled with the risk of market tampering), rather than the market's judgment of the likelihood of a hantavirus pandemic.
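For anyone who wants to check the 7.6% figure, here's a minimal sketch of the arithmetic. It assumes a "no" share costs 100 cents minus the "yes" price, pays out 100 cents if the market resolves "no", and ignores fees and the bid-ask spread; the function name `no_share_return` is just a hypothetical helper for illustration.

```python
def no_share_return(yes_price_cents: float) -> float:
    """Fractional profit from buying "no" and holding until a "no" resolution.

    Assumes (for illustration only) that the "no" price is exactly
    100 - yes_price_cents, with no fees or spread.
    """
    no_price_cents = 100.0 - yes_price_cents
    # Payout is 100 cents, so the profit per share equals the "yes" price.
    return yes_price_cents / no_price_cents


for yes_price in (7.1, 7.2):
    print(f"yes at {yes_price} cents -> return on no: {no_share_return(yes_price):.1%}")
# yes at 7.1 cents -> return on no: 7.6%
# yes at 7.2 cents -> return on no: 7.8%
```

So a "yes" price around 7 cents lines up with a roughly 7.6% return for a "no" holder who ties up their money until the market resolves, before accounting for any tampering risk.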
I have far less confidence in whether the WHO will declare hantavirus a PHEIC vs. a pandemic, because I just don't understand their criteria for the PHEIC designation. We all know what a pandemic is. A pandemic, to me, is a clear "no," yet was still at around 6.5 cents on Polymarket when I checked earlier today. A PHEIC is a much lower bar, as far as I can tell, and what seems crazy to me is that shares of "no hantavirus pandemic 2026" on Polymarket are apparently only 2 cents more expensive than shares of "no hantavirus PHEIC 2026" on Kalshi. If I was forced to pi...
Observation: Opus 4.7 (via chat) is near-useless for helping me iterate on Google Docs. For one, its reading comprehension seems very poor.
This is surprising because others say that the models give them tons of uplift. (And the main METR graph is just coding but still. And Opus 4.6 beat all humans on the GovAI work test. And the vibes!)
Hypotheses (not exclusive or exhaustive):
Fwiw, if a direction is simply describable, I do think I’m faster to tell Claude Code to do it (via Wispr Flow) than to mechanically do it. I also feel like there are some nice batching properties of relaying to it all-at-once the changes I want, without having to switch myself from conceptual thinking to mechanical operation and then back again, but I’m less sure of that
Occasionally there will be a piece of media that seems like it's glorifying violence (One Battle After Another, Joker), and people will say "this is unprofessional, it might lead to violence!" And then it won't be followed up by any obvious violence, and other people will say "ha, you were being ridiculous." But the more important effect might be to lower public trust, in the same way yelling "I would LOVE to kill my political enemies" hurts public trust even if you don't blow up a courthouse.
SF as in Science Fiction or as in San Francisco? Both make sense in context.
it's actually crazy how much ubering pareto dominates driving in a city like SF. you don't have to worry about parking, you can work while in transit, you can get a bigger car when needed, you don't need to round trip, etc. it's generally even cheaper once you take depreciation, parking, insurance, etc costs into consideration.
My sense is that this is true until you have a small child you want to move around, and then it’s super super annoying to not have your car seat already installed for them and have other supplies on-hand
"to the success of our hopeless cause" is such a good toast and we should use it more often. i first learned of it from the book of the same name, and apparently it was a common refrain at gatherings of Soviet dissidents. i like it because it captures the feeling of trying really hard to succeed despite being in the basement of the logistic success curve, and somehow, despite all odds, actually succeeding in the end.
If ten thousand people protest, sometimes they get massacred by the army.
There's a lot, off the top of my head: LASR, MARS, Pivotal, SPAR
in the same way that Minecraft teaches you to exercise agency and Factorio teaches you to optimize, are there any games that teach you to stare into the abyss? the ideal game would (a) reward you on a tight feedback loop for constantly admitting that you were wrong, (b) give you the option to not admit that you were wrong but make that decision acutely hurt. pastcasting is good for (a) but not good for (b) because you are sort of forced to confront being wrong all the time, which maybe teaches you that it doesn't feel as bad as you might expect, but it doe...
If you are playing a real opponent they can review it with you, or a tutor can do the same.
From my research, the current meta is the Luggable https://www.cleanairkits.com/products/luggables. Basic insight is that by going from a 99.97% removal HEPA filter to a 93% MERV-13, you can run your filters at a higher speed, quieter, with much better power efficiency, and cycle air through the filter much more quickly due to the higher fan speed. Most HEPA grade filtration misses the mark because they're far too loud running at full blast, so a quieter, faster, less efficient filter actually does a better job. The LuggableXL is about as loud
The one missi...
I think part of the intended interpretation of Dath Ilan is: "it would be easier to run an adequate society if everyone knew some basic economics".
I know almost no economics. Even my understanding of supply & demand is shaky. Is there some bit of economics I can learn and start applying on Earth to see why this claim might be reasonable?
I love economics (I'm an engineer who accidentally stumbled into it, and I've now been an econ PhD for 15 years or so and a lecturer) and much of me believes the DI quote hits the nail on the head. Sometimes, say 20% of me*, thinks that even reasonably advanced economics doesn't really help not-so-smart people be smart. Still, I think the statement survives.
PM me if you're interested in me sharing some materials personally with you! I'm also happy to have brief chats on things if useful.
Smartness + econ really is king.
One thing I can say upfront: I'm convinced on in LW or EA cir...
For my own future reference, here are some "benchmarks" (very broadly construed) I pay attention to as of Nov 2025, a mix of serious and whimsical. (The "serious" version would probably start with the Evals section of technicalities' 2025 shallow review of technical AIS, or SemiAnalysis' in-house view.)
A programming-related one, from Christian Ekrem's The Tacit Dimension: Why Your Best Engineers Can't Tell You What They Know:
...A Small Story (Slightly Anonymised)
A guy I worked with a while back, much smarter than me, once spent an entire afternoon refusing to merge a PR that, on paper, was correct. The change worked. The tests passed. CI was green. He couldn’t articulate what was wrong. He kept saying “I just don’t believe this code.”
Eventually he asked the author to walk him through the reasoning, line by line, out loud. Maybe forty minutes in, the author