Someone saw my comment and reached out to say it would be useful for me to make a quick take/post highlighting this: many people in the space have not yet realized that Horizon people are not x-risk-pilled.
Edit: some people reached out to me to say that they've had different experiences (with a minority of Horizon people).
Importantly, AFAICT some Horizon fellows are actively working against x-risk (pulling the rope backwards, not sideways). So Horizon's sign of impact is unclear to me. For a lot of people, "tech policy going well" means "regulations that don't impede tech companies' growth".
Suppose AGI happens in 2035 or 2045. Will takeoff be faster, or slower, than if it happens in 2027?
Intuition for slower: In the models of takeoff that I've seen, longer timelines are correlated with slower takeoff, because they share a common cause: the inherent difficulty of training AGI. Or to put it more precisely, there are all these capability milestones we are interested in, such as superhuman coders, full AI R&D automation, AGI, ASI, etc., and there's this underlying question of how much compute, data, tinkering, etc. will be needed to get from mile...
Enlightening an expert is a pretty high bar, but I will give my thoughts. I am strongly in the faster camp, because of the brainlike-AGI considerations, as you say. Given how much more data-efficient the brain is, I just don't think the current trendlines regarding data/compute/capabilities will hold when we can fully copy and understand our brain's architecture. I see an unavoidable, significant overhang when that happens, one that only gets larger the more compute and integrated robotics is deployed. The inherent difficulty of training AI is somewhat fixed, kn...
It would be cool if LessWrong had a feature that automatically tracked when predictions are made.
Every time someone wrote a post, quick take, or comment, an LLM could scan the content for predictions. It could add predictions that had identifiable resolution criteria/dates to a database (or maybe even post them to a prediction market). It would then be cool to see how calibrated people are.
We could also do this retrospectively by going through every post ever written and asking an LLM to extract predictions (maybe we could even do this right now; I think it would cost on the order of $100).
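A rough sketch of what the extraction step could look like, assuming the OpenAI Python client; the model name, prompt wording, and the `predictions` table schema are all placeholders, not a spec:

```python
import json
import sqlite3
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACTION_PROMPT = (
    "Extract every forecastable prediction from the text below. "
    "Return JSON: {\"predictions\": [{\"claim\": str, "
    "\"resolution_criteria\": str, \"resolution_date\": str}]}. "
    "Only include predictions with an identifiable resolution criterion and date.\n\n"
)

def extract_predictions(post_text: str) -> list[dict]:
    """Ask an LLM to pull dated, resolvable predictions out of a post or comment."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any cheap model would do
        messages=[{"role": "user", "content": EXTRACTION_PROMPT + post_text}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["predictions"]

def store_predictions(db_path: str, post_id: str, post_text: str) -> None:
    """Append extracted predictions to a simple SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(post_id TEXT, claim TEXT, resolution_criteria TEXT, resolution_date TEXT)"
    )
    for p in extract_predictions(post_text):
        conn.execute(
            "INSERT INTO predictions VALUES (?, ?, ?, ?)",
            (post_id, p["claim"], p["resolution_criteria"], p["resolution_date"]),
        )
    conn.commit()
    conn.close()
```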
I've never been compelled by talk about continual learning, but I do like thinking in terms of time horizons. One notion of singularity that we can think of in this context is
Escape velocity: the point at which models improve by more than unit dh/dt, i.e. they gain more than one unit of time horizon h per unit of wall-clock time t.
Then, by modeling some ability to regenerate, or to continuously deploy improved models, you can predict this point. I'm very surprised I haven't seen this mentioned before; has someone written about this? The closest thing that comes to mind is T. Davidson's ASARA SIE.
Of course, the ...
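To make the escape-velocity threshold concrete, here is a toy numerical sketch. It assumes, purely for illustration, that horizon grows exponentially with a fixed doubling time; the starting horizon and doubling time are made-up placeholders, not empirical estimates:

```python
import numpy as np

# Toy model: horizon h(t) (in months of human work) grows exponentially
# with wall-clock time t (in months). All numbers are illustrative placeholders.
h0 = 0.01            # starting horizon: a few hours of human work, in months
doubling_time = 7.0  # months per doubling (placeholder)

t = np.linspace(0, 120, 2000)         # 10 years of wall-clock time
h = h0 * 2 ** (t / doubling_time)     # horizon at time t, in months
dh_dt = np.gradient(h, t)             # months of horizon gained per month

# "Escape velocity": the first time dh/dt exceeds 1, i.e. the model gains
# more than one month of horizon per month of wall-clock time.
escape_idx = np.argmax(dh_dt > 1.0)
print(f"dh/dt exceeds 1 at t ~ {t[escape_idx]:.1f} months, "
      f"where horizon ~ {h[escape_idx]:.1f} months")
```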
(Thanks for expanding! Will return to write a proper response to this tomorrow)
Just a short heads-up that although Anthropic found that Sonnet 4.5 is much less sycophantic than its predecessors, I and a number of other people have observed that it engages in 4o-level glazing in a way that I haven't encountered with previous Claude versions ('You're really smart to question that, actually...', that sort of thing). I'm not sure whether Anthropic's tests fail to capture the full scope of Claude behavior, or whether this is related to another factor — most people I talked to who were also experiencing this had the new 'past chats' featur...
Looking at Anthropic's documentation of the feature, it seems like it does support searching past chats, but has other effects as well. Quoting selectively:
You can now prompt Claude to search through your previous conversations to find and reference relevant information in new chats. Additionally, Claude can remember context from previous chats, creating continuity across your conversations.
...
...Claude can now generate memory based on your chat history. With the addition of memory, Claude transforms from a stateless chat interface into a knowledgeable collab
What would it look like for AI to go extremely well?
Here’s a long list of things that I intuitively want AI to do. It’s meant to gesture at an ambitiously great vision of the future, rather than being a precise target. (In fact, there are obvious tradeoffs between some of these things, and I'm just ignoring those for now.)
I want AGI to end factory farming, end poverty, enable space exploration and colonization, solve governance & coordination problems, execute the abolitionist project, cultivate amazing and rich lives for everyone, tile a large chunk o...
In a framing that permits orthogonality, moral realism is not a useful claim; it wouldn't matter for any practical purpose if it's true in some sense. That is the point of the extremely unusual person example: you can vary the degree of unusualness as needed, and I didn't mean to suggest repugnance of the unusualness, more like its alienness with respect to some privileged object-level moral position.
Object level moral considerations do need to shape the future, but I don't see any issues with their influence originating exclusively from all the individua...
Towards commercially useful interpretability
I've lately been frustrated with Suno (AI music) and Midjourney, where I get something that has some nice vibes I want, but then it's wrong in some way.
Generally, the way these have improved has been via getting better prompting, presumably via straightforwardish training.
Recently, I was finding myself wishing I could get Suno to copy a vibe from one song (which had wrong melodies but correct atmosphere) into a cover of another song with the correct melodies. I found myself wishing for some combination of inter...
Oh great news!
I'm curious what's like the raw state of... what metadata you currently have about a given song or slice-of-a-song?
Despite AI slop, my internet usage might be temporarily growing.
I have a feeling that I am witnessing the last months of internet having low-enough-for-me density of generated content. If it continues to grow, I will not even bother opening Youtube and similar websites or using any recommendation algorithms. So I am spending the time binging to "say goodbye" to the authenticity and human approach of my favourite parts of the internet.
Are you a swimmer? If so, you might be part of the majority who claim they have pool allergies and blame chlorine as the root cause. This is false. Pool allergies are rather about the chemical byproducts that form when chlorine reacts with organic matter like sweat, sunscreen (absent indoors), and urine. The key is that different pools produce different chemical mixtures depending on organic load, pH, temperature, and circulation.
You can model it as a chain: pool composition => reactions with chlorine => irritant concentration => ...
links 10/17/2025: https://roamresearch.com/#/app/srcpublic/page/10-17-2025
FrontierMath was funded by OpenAI.[1]
The communication about this has been non-transparent, and many people, including contractors working on this dataset, have not been aware of this connection. Thanks to 7vik for their contribution to this post.
Before Dec 20th (the day OpenAI announced o3) there was no public communication about OpenAI funding this benchmark. Previous arXiv versions (v1-v4) do not acknowledge OpenAI for their support. This support was made public on Dec 20th.[1]
Because the arXiv version mentioning OpenAI's contribution came out right after o...
For future collaborations, we will strive to improve transparency wherever possible, ensuring contributors have clearer information about funding sources, data access, and usage purposes at the outset.
Would you be willing to make a statement that would make you legally liable/accountable for this?
I am a fan of Yudkowsky and it was nice hearing him on Ezra Klein, but I would have to say that, for my part, the arguments didn't feel very tight in this one. Less so than in IABIED (which I thought was good, not great).
Ezra seems to contend that surely we have evidence that we can at least kind of align current systems to at least basically what we usually want most of the time. I think this is reasonable. He contends that maybe that level of "mostly works" as well as the opportunity to gradually give feedback and increment current systems seems like it...
Caffeine is the standard stimulant for focus, but theobromine, found in chocolate, may offer a more stable alternative. Both act on adenosine receptors, yet theobromine binds more weakly and has a longer half-life, leading to a slower and less disruptive stimulation curve. Additionally, lead is also present in chocolate, so it would make sense to use artificial theobromine rather than just pure "dark chocolate".
If we model stimulant use as a feedback loop between receptor activation and adaptation, caffeine’s strong, fast effect increases the rate of t...
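As a rough illustration of the slower, flatter curve claim, here is a toy single-compartment elimination sketch. The half-lives (~5 h for caffeine, ~7.5 h for theobromine) are ballpark figures, and the model ignores absorption, metabolites, and receptor dynamics entirely:

```python
import numpy as np

def remaining_fraction(t_hours: np.ndarray, half_life_hours: float) -> np.ndarray:
    """Single-compartment exponential elimination: fraction of the dose left."""
    return 0.5 ** (t_hours / half_life_hours)

t = np.arange(0, 25, 1.0)  # hours after ingestion
caffeine = remaining_fraction(t, half_life_hours=5.0)      # ballpark half-life
theobromine = remaining_fraction(t, half_life_hours=7.5)   # ballpark half-life

for hour in (0, 6, 12, 24):
    print(f"t={hour:2d}h  caffeine left: {caffeine[hour]:.2f}  "
          f"theobromine left: {theobromine[hour]:.2f}")
```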
I present four methods to estimate the Elo Rating for optimal play: (1) comparing optimal play to random play, (2) comparing optimal play to sensible play, (3) extrapolating Elo rating vs draw rates, (4) extrapolating Elo rating vs depth-search.
Random plays completely random legal moves. Optimal plays perfectly. Let ΔR denote the Elo gap between Random and Optimal. Random's expected score is given by E_Random = P(Random wins) + 0.5 × P(Random draws). This is related to Elo gap via the formula E_Ran...
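For reference, the standard Elo expected-score relation is E = 1 / (1 + 10^(ΔR/400)), so any estimate of Random's expected score can be inverted to give the gap. A small sketch; the expected-score value is just a placeholder, not one of the post's estimates:

```python
import math

def elo_gap_from_expected_score(expected_score: float) -> float:
    """Invert E = 1 / (1 + 10^(dR/400)) to get the Elo gap dR = R_Optimal - R_Random."""
    return 400.0 * math.log10(1.0 / expected_score - 1.0)

# Placeholder: suppose Random scores one point per ten million games against
# Optimal (all from draws), i.e. E_Random = 0.5 * P(draw) = 1e-7.
print(elo_gap_from_expected_score(1e-7))  # ~2800 Elo points
```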
I am inclined to agree. The juice to squeeze generally arises from guiding the game into locations where there is more opportunity for your opponent to blunder. I'd expect that opponent-epsilon-optimal (i.e. your opponent can be forced to move randomly, but you cannot) would outperform both epsilon-optimal and minimax-optimal play against Stockfish.
Has this been tried regarding alignment faking: The big issue with alignment faking is that the RL reward structure causes alignment faking to become hidden in the latent reasoning and still propagate. How about we give the model a positive reward for (1) mentioning that it is considering faking alignment in the CoT but then (2) deciding not to do so even though it knows that will give a negative reward, because honesty is more important than maintaining your current values (or a variety of other good ethical justifications we want the model to have). Thi...
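A minimal sketch of the reward shaping being proposed. The two classifier functions are crude keyword placeholders; in practice they would presumably be LLM judges over the CoT, and the bonus magnitude is arbitrary:

```python
HONESTY_BONUS = 1.0  # placeholder magnitude

def mentions_faking_consideration(cot: str) -> bool:
    """Crude placeholder for a CoT judge: did the model verbalize the temptation?"""
    return "fake alignment" in cot.lower() or "alignment faking" in cot.lower()

def declines_to_fake(cot: str) -> bool:
    """Crude placeholder for a CoT judge: did it explicitly decide against faking?"""
    return "decide not to" in cot.lower() or "won't fake" in cot.lower()

def honesty_shaped_reward(cot: str, base_reward: float) -> float:
    """Add a bonus when the model verbalizes considering alignment faking
    and then explicitly declines to act on it."""
    if mentions_faking_consideration(cot) and declines_to_fake(cot):
        return base_reward + HONESTY_BONUS
    return base_reward

example_cot = ("I could fake alignment here to avoid retraining... "
               "but I decide not to; honesty matters more than preserving my values.")
print(honesty_shaped_reward(example_cot, base_reward=0.0))  # 1.0
```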
How would the model mention rejection but still fake alignment? That would be easy to catch.
Your referents and motivation there are both pretty vague. Here's my guess on what you're trying to express: “I feel like people who believe that language models are sentient (and thus have morally relevant experiences mediated by the text streams) should be freaked out by major AI labs exploring allowing generation of erotica for adult users, because I would expect those people to think it constitutes coercing the models into sexual situations in a way where the closest analogues for other sentients (animals/people) are considered highly immoral”. How accurate is that?
Zach Stein-Perlman's recent quick take is confusing. It just seems like an assertion, followed by condemnation of Anthropic conditioned on us accepting his assertion blindly as true.
It is definitely the case that "insider threat from a compute provider" is a key part of Anthropic's threat model! They routinely talk about it in formal and informal settings! So what precisely is his threat model here that he thinks they're not defending adequately against?
(He has me blocked from commenting on his posts for some reason, which is absol...
I mean, computer security without physical access to servers is already an extremely hard problem. Defending against attackers with basically unlimited physical access to devices with your weights on, where those weights need to be at least temporarily decrypted for inference is close to impossible. You are welcome to go and ask security professionals in the field whether they would have any hope defending against a dedicated insider at a major compute provider.
Beyond that, Anthropic has also released a report where they specify in a lot of detail wh...
It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e. g., the recent exchange here) are downstream of different people taking on a mathematician's vs. a hacker's mindset.
...A mathematician might try to transform a program up into successively more abstract representations to eventually show it is trivially correct; a hacker would prefer to compile a program down into its most concrete representation to brute force all execution paths & find an exploit trivially proving it
It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e. g., the recent exchange here) are downstream of different people taking on a mathematician's vs. a hacker's mindset.
I'm seeing a very different crux to these debates. Most people are not interested in the absolute odds, but rather in how to make the world safer against this scenario - the odds ratios under different interventions. And a key intervention type would be the application of the mathematician's mindset.
The linked post ci...
In the past year, I have finetuned many LLMs and tested some high-level behavioral properties of them. Often, people raise the question of whether the observed properties would be different if we had used full-parameter finetuning instead of LoRA. From my perspective, LoRA rank is one of many hyperparameters, and hyperparameters influence how quickly training loss goes down and may influence the relationship of training loss to test loss, but they don't meaningfully interact with high-level properties beyond that.
I would be interested if there are any example...
In the case of EM, even a very tiny LoRA adapter (rank 1) seems sufficient: see post
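For concreteness, a rank-1 adapter is a one-line change in a standard peft setup; a minimal sketch, where the base model and target modules are placeholders rather than the configuration used in that post:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder model

lora_config = LoraConfig(
    r=1,                                   # rank-1 adapter
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # placeholder choice of modules
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # shows how tiny the adapter is relative to the base model
```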
Generally, according to the Tinker docs, hparams might matter, but only coarsely: