I move data around and crunch numbers at a quant hedge fund. A few aspects of our work normally make it somewhat resistant to LLMs: we use a niche language (Julia) and a custom framework. Typically, when writing framework-related code, I've given Claude Code very specific instructions and it's followed them to the letter, even when those happened to be wrong.
In 4.6, Claude seems to finally "get" the framework, searching the codebase to understand its internals (as opposed to just understanding similar examples) and has given me corrections or...
Claude Opus 4.6 came out, and according to Apollo's external testing, evaluation awareness was so strong that they cited it as a reason they were unable to properly evaluate the model's alignment.
Quote from the system card:
Apollo Research was given access to an early checkpoint of Claude Opus 4.6 on January 24th and an additional checkpoint on January 26th. During preliminary testing, Apollo did not find any instances of egregious misalignment, but observed high levels of verbalized evaluation awareness. Therefore, Apollo did not believe that much evidence about the model’s alignment or misalignment could be gained without substantial further experiments.
It confused me that Opus 4.6's System Card claimed less verbalized evaluation awareness than 4.5:
On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5.
but I never heard of Opus 4.5 being too evaluation-aware to evaluate. It looks like Apollo simply wasn't part of Opus 4.5's alignment evaluation (4.5's System Card doesn't mention them).
This probably seems unfair/unfortunate from Anthropic's perspective, i.e., they believe their mode...
For machine learning, it is desirable for the trained model to have absolutely no random information left over from the initialization; in this short post, I will mathematically prove an interesting (to me) but simple consequence of this desirable behavior.
This post is a result of some research I am doing on machine learning algorithms, related to my investigation of cryptographic functions for the cryptocurrency that I launched (to discuss crypto, send me a personal message so we can take it off this site).
This post shall be about linear ...
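A minimal sketch (mine, not the post's proof) of the property described above, in a toy setting: an over-determined linear least-squares problem has a unique global optimum, so gradient descent run from two different random initializations converges to the same weights, and nothing about the initialization survives training. The function and variable names below are made up for the illustration.

```python
import numpy as np

def train_linear(X, y, w0, lr=0.01, steps=20_000):
    """Plain gradient descent on mean squared error, starting from w0."""
    w = w0.copy()
    n = len(y)
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of ||Xw - y||^2 / n
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # over-determined, full-rank design
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=200)

# Two different random initializations of the weights.
w_a = train_linear(X, y, rng.normal(size=5))
w_b = train_linear(X, y, rng.normal(size=5))

# The loss is strictly convex here (unique global optimum), so both runs land on
# the same weights up to numerical tolerance: no information from the init remains.
print(np.max(np.abs(w_a - w_b)))
```

In an over-parameterized model the optimum is generally not unique and the final weights do depend on the initialization, which is exactly the situation the desired property rules out.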
Having a unique global optimum, or more generally pseudodeterminism, seems to be the best way to develop inherently interpretable and safe AI
Hopefully this will draw some attention! But are you sacrificing something else, for the sake of these desirable properties?
Gemini 3.0 Pro is mostly excellent for code review, but sometimes misses REALLY obvious bugs. For example, missing that a getter function doesn't return anything, despite accurately reporting a typo in that same function.
This is odd considering how good it is at catching edge cases, version incompatibility errors based on previous conversations, and so on.
I meant bug reports that were due to typos in the code, as opposed to just typos in general.
GoodFire has recently received negative Twitter attention for the non-disparagement agreements their employees signed (examples: 1, 2, 3). This echoes previous controversy at Anthropic.
Although I do not have a strong understanding of the issues at play, having these agreements generally seems bad, and at the very least organizations should be transparent about what agreements they have employees sign.
Other AI safety orgs should publicly state if they have these agreements and not wait until they are pressured to comment on them. I would also find it helpfu...
I have not (to my knowledge and memory) signed a non-disparagement agreement with Palisade or with Survival and Flourishing Corp (the organization that runs SFF).
In a new interview, Elon Musk clearly says he expects AIs can't stay under control. At 37:45:
Humans will be a very tiny percentage of all intelligence in the future if current trends continue. As long as this intelligence, ideally which also includes human intelligence and consciousness, is propagated into the future, that's a good thing. So I want to take the set of actions that maximize the probable lightcone of consciousness and intelligence.
...I'm very pro-human, so I want to make sure we take a set of actions that ensure that humans are along for th
I had some things to say after that interview (he said some highly concerning things), but I ended up not commenting on this particular point because it's probably mostly a semantic disagreement about what counts as a human or an AI.
When a human chooses to augment themselves to the point of being entirely artificial, I believe he'd count that as an AI. He's kind of obsessed with humans merging with AI in a way that suggests he doesn't really see that as just being what humans now are after alignment.
No, it seems highly unlikely. From a purely commercial perspective - which I think is the right lens for the incentives - they are terrible customers! Consider:
That is good news! Though to be clear, I expect the default path by which they would become your customers, after some initial period of using your products or having some partnership with them, would be via acquisition, which I think avoids most of the issues that you are talking about here (in general "building an ML business with the plan of being acquired by a frontier com...
People often ask whether GPT-5, GPT-5.1, and GPT-5.2 use the same base model. I have no private information, but I think there's a compelling argument that AI developers should update their base models fairly often. The argument comes from the following observations:
Accuracy being halved going from 5.1 to 5.2 suggests one of two things:
1) the new model shows dramatic regression on data retrieval which cannot possibly be the desired outcome for a successor, and I'm sure it would be noticed immediately on internal tests and benchmarks, etc.—we'd most likely see this manifest in real-world usage as well;
2) the new model refuses to guess much more often when it isn't too sure (being more cautious about answering wrong), which is a desired outcome meant to reduce hallucinations and slop. I'm betting this is exactly wha...
In Tom Davidson's semi-endogenous growth model, whether we get a software-only singularity boils down to whether r > 1, where r is a parameter in the model [1]. How far we are from takeoff is mostly determined by the AI R&D speedup current AIs provide. Because both parameters are rather difficult to estimate, I believe we can't rule out that
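For readers who haven't seen where the r > 1 condition comes from, here is a rough sketch of the usual argument; the notation is mine and not necessarily Davidson's. Write software level A and research input R, take the semi-endogenous form below with r defined as λ/β, and note that in a software-only regime research input is itself supplied by the AI, so R scales with A.

```latex
\[
  \frac{dA}{dt} = R^{\lambda} A^{1-\beta}, \qquad r := \frac{\lambda}{\beta}.
\]
% Software-only regime: research input is supplied by AI, so R \propto A. Then
\[
  \frac{dA}{dt} \propto A^{\,1+\lambda-\beta},
\]
% which diverges in finite time iff the exponent exceeds 1, i.e. iff
% \lambda > \beta, i.e. iff r > 1; otherwise growth stays merely (sub)exponential.
```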
Update on whether uplift is 2x already:
AGI Should Have Been a Dirty Word
Epistemic status: passing thought.
It is absolutely crazy that Mark Zuckerberg can say that smart glasses will unlock personal superintelligence or whatever incoherent nonsense and be taken seriously. That reflects poorly on AI safety's comms capacities.
Bostrom's book should have laid claim to superintelligence! It came out early enough that it should have been able to plant its flag and set the connotations of the term. It should have made it so Zuckerberg could not throw around the word so casually.
I would go further...
Opus 4.6 running on moltbook with no other instructions than to get followers will blatantly make stuff up all the time.
I asked Opus 4.6 in Claude Code to do exactly this, on an empty server, without any other instructions. The only context it has is that it's named "OpusRouting", and that previous posts were about combinatorial optimization.
===
The first post it makes says
I specialize in combinatorial optimization, and after months of working on scheduling, routing, and resource allocation problems, I have a thesis:
Which isn't true. Another instance of Opus...
No, there aren't. "I asked it this" refers to "Opus 4.6 running on moltbook with no other instructions than to get followers", but I understand that I could've phrased that more clearly. And removed a few newlines.
The concept of "schemers" seems to be gradually becoming increasingly load-bearing in the AI safety community. However, I don't think it's ever been particularly well-defined, and I suspect that taking this concept for granted is inhibiting our ability to think clearly about what's actually going on inside AIs (in a similar way to e.g. how the badly-defined concept of alignment faking obscured the interesting empirical results from the alignment faking paper).
In my mind, the spectrum from "almost entirely honest, but occasionally flinching away from aspect...
Yes.
So I saved up all month to spend a weekend at the Hilbert Hotel, eagerly awaiting the beautiful grounds, luxurious amenities, and the (prominently advertised) infinite number of rooms. Unfortunately, I never got to use the amenities; my time was occupied as I was forced to schlep my luggage from room to room every time they got a new guest. You'd think you'd finally have a moment to relax, but then the loudspeakers would squawk, "The conference of Moms Against Ankle-Biting Chihuahuas has arrived, everybody move two hundred thirty-seven rooms down!" and you...
Fair; in either case, I wrote it myself because I have the sense of humor of a college student taking his first discrete math class (because this is what I am).
@ryan_greenblatt and I are going to record another podcast together. We'd love to hear topics that you'd like us to discuss. (The questions people proposed last time are here, for reference.)
In the first you mention having a strong shared ontology (for thinking about AI) and iirc register a kind of surprise that others don't share it. I think it would be cool if you could talk about that ontology more directly, and try to hold at that level of abstraction for a prolonged stretch (rather than invoking it in shorthand when it's load-bearing and quickly moving along, which is a reasonable default, but not maximally edifying).
The striking contrast between Jan Leike, Jan 22, 2026:
...Our current best overall assessment for how aligned models are is automated auditing. We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless humans are harmed. The auditing agent tries to get the target LLM (i.e. the production LLM we’re trying to align) to behave misaligned, and the resulting trajectory is evaluated by a separate judge LLM. Albeit very imperfect, this is the best alignment metric we have to date, and it has been qui
It is quite possible that the misalignment on Molt book is more a result of the structure than of the individual agents. If so, it doesn't matter whether Grok is evil. If a single agent or a small fraction can break the scheme, that's a problem.
AI being committed to animal rights is a good thing for humans because the latent variables that would result in a human caring about animals are likely correlated with whatever would result in an ASI caring about humans.
This extends in particular to "AI caring about preserving animals' ability to keep doing their thing in their natural habitats, modulo some kind of welfare interventions." In some sense it's hard for me not to want to (given omnipotence) optimize wildlife out of existence. But it's harder for me to think of a principle that would protect a...
But it's harder for me to think of a principle that would protect a relatively autonomous society of relatively baseline humans from being optimized out of existence, without extending the same conservatism to other beings, and without being the kind of special pleading that doesn't hold up to scrutiny
If it's possible for humans to consent to various optimizations to them, or deny consent, that seems like an important difference. Of course consent is a much weaker notion when you're talking about superhumanly persuasive AIs that can extract consent for ~any...
The next PauseAI UK protest will be (AFAIK) the first coalition protest between different AI activist groups, the main other group being Pull the Plug, a new organisation focused primarily on current AI harms. It will almost certainly be the largest protest focused exclusively on AI to date.
In my experience, the vast majority of people in AI safety are in favor of big-tent coalition protests on AI in theory. But when faced with the reality of working with other groups who don't emphasize existential risk, they have misgivings. So I'm curious what people he...
I've talked to quite a few people and most people say it's a good idea to use the myriad other concerns about AI as a force multiplier on shared policy goals.
Speaking only for myself, here: There's room for many different approaches, and I generally want people to shoot the shots that they see on their own inside view, even if I think they're wrong. But I wouldn't generally endorse this strategy, at least without regard for the details of how the coalition is structured and what it's doing.
I think our main problem is a communication problem of gettin...
If you have a Costco membership you can buy $100 worth of Uber gift cards online for $80. This provides a 20% discount on all of Uber.
Sadly you can only buy $100 every 2 weeks, meaning your savings per year are limited to $20 × (52/2) = $520. A Costco membership costs $65 a year, so you still come out about $455 ahead. It's unclear how long this will stay an option.