Nathan Helm-Burger

AI alignment researcher, ML engineer. Masters in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability trained on censored data (simulations with no mention of humans or computer technology). I think that current ML mainstream technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and I think that this automated process will mine neuroscience for insights, and quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation. So I am trying to warn the world about this possibility.

See my prediction markets here:

https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

I now work for SecureBio on AI-Evals.

Relevant quotes:

"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe

"The prospect for the human race is sombre beyond all precedent. Mankind are faced with a clear-cut alternative: either we shall all perish, or we shall have to acquire some slight degree of common sense. A great deal of new political thinking will be necessary if utter disaster is to be averted." - Bertrand Russel, The Bomb and Civilization 1945.08.18

"For progress, there is no cure. Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment." - John von Neumann

"I believe that the creation of greater than human intelligence will occur during the next thirty years. (Charles Platt has pointed out the AI enthusiasts have been making claims like this for the last thirty years. Just so I'm not guilty of a relative-time ambiguity, let me more specific: I'll be surprised if this event occurs before 2005 or after 2030.)" - Vernor Vinge, Singularity

Posts

Nathan Helm-Burger's Shortform (karma 4; 127)

Proactive 'If-Then' Safety Cases (karma 10, 5mo)

A path to human autonomy (karma 53, 5mo)

My hopes for YouCongress.com (karma 14, 6mo)

Physics of Language models (part 2.1) (karma 9, 7mo)

Avoiding the Bog of Moral Hazard for AI (karma 19, 7mo)

A bet for Samo Burja (karma 14, 7mo)

Diffusion Guided NLP: better steering, mostly a good thing (karma 13, 8mo)

Imbue (Generally Intelligent) continue to make progress (karma 18, 9mo)

Secret US natsec project with intel revealed (karma 27, 10mo)

Constituency-sized AI congress? (karma 11)

Comments

Nathan Helm-Burger's Shortform

Nathan Helm-Burger, 6h

Some ask, "what should the US gov have done instead?"

Here's an answer I like to that question, from max_paperclips:

https://x.com/max_paperclips/status/1909085803978035357

https://x.com/max_paperclips/status/1907946171290775844

Nathan Helm-Burger's Shortform

Nathan Helm-Burger, 6h

As for the Llama 4 models... It's true that it's too soon to be sure, but the pattern sure looks like they are on trend with the previous Llama versions 2 and 3. I've been working with 2 and 3 a bunch. Evals and fine-tuning and various experimentation. Currently I'm working with the 70B Llama3 r1 distill plus the 32B Qwen r1 distill. The 32B Qwen r1 is so much better it's ridiculous. So yeah, it's possible that Llama4 will be a departure from trend, but I doubt it.

Contrast this with the Gemini trend. They started back at 1.0 with disproportionately weak models given the engineering and compute they had available. My guess is that this was related to poor internal coordination, and there was the merger of DeepMind with Google Brain that probably contributed to this. But if you look at the trend of 1.0 to 1.5 to 2.0... there's a clear trend of improving more per month than other groups were. Thus, I was unsurprised when 2.5 turned out to be a leading frontier model. Llama team has shown no such "catchup" trend, so Llama4 turning out to be as strong as they claim would surprise me a lot.

Nathan Helm-Burger's Shortform

Nathan Helm-Burger, 6h

Yes, that's what I'm arguing. Really massive gains in algorithmic efficiency, plus gains in decentralized training and peak capability and continual learning, not necessarily all at once though. Maybe just enough that you then feel confident to continue scraping together additional resources to pour into your ongoing continual training. Renting GPUs from datacenters all around the world (smaller providers like Vast.ai, Runpod, Lambda Labs, plus marginal amounts from larger providers like AWS and GCP, all rented in the name of a variety of shell companies). The more compute you put in, the better it works, the more money you are able to earn (or convince investors or governments to give you) with the model-so-far, the more compute you can afford to rent....

Not necessarily exactly this story, just something in this direction.

Nathan Helm-Burger's Shortform

Nathan Helm-Burger, 16h

By the way, I don't mean to imply that Meta AI doesn't have talented AI researchers working there. The problem is more that the competent minority are so diluted and hampered by bureaucratic parasites that they can't do their jobs properly.

Nathan Helm-Burger's Shortform

Nathan Helm-Burger, 17h

Why does ai-2027 predict China falling behind? Because the next level of compute beyond the current level is going to be hard for DeepSeek to muster. In other words, DeepSeek will be behind in 2026 because of hardware deficits in late 2025. If things moved more slowly, and the critical strategic point hit in 2030 instead of 2027, I think it's likely China would have closed the compute gap by then.

I agree with this take, but I think it misses some key alternative possibilities. The failure of the compute-rich Llama models to compete with the compute-poorer but talent- and drive-rich Alibaba and DeepSeek shows that even a substantial compute lead can be squandered. Given that there is a lot of room for algorithmic improvements (as proven by the efficiency of the human brain), this means that determined engineering plus willingness to experiment rather than doubling down on currently working tech (as it seems like Anthropic, Google DM, and OpenAI are likely to do) may give enough of a breakthrough to hit the regime of recursive self-improvement before or around the same time as the compute-rich companies. Once that point is hit, a lead can be gained and maintained through reckless acceleration....

Adopt new things as quickly as the latest model predicts that they work, without pausing for cautious review, and you can move a lot faster than a company proceeding cautiously.

How much faster?

How much compute advantage does the recklessness compensate for?

How reckless will the underdogs be?

These are all open questions in my mind, with large error bars. This is what I think ai-2027 misses in their analysis.

Towards a scale-free theory of intelligent agency

Nathan Helm-Burger, 5d

I just want to comment that I think Minsky's society of mind is a better overall model of agency than predictive coding. I think predictive coding does a great job of describing the portions of the brain responsible for perceiving and predicting the environment. It also does pretty well at predicting and refining the effects of one's actions on the environment. It doesn't do well at all with describing the remaining key piece: goal setting based on expected value predictions by competing subagents.

I think there's a fair amount of neuroscience evidence pointing towards human planning processes being made up of subagents arguing for different plans. These subagents are themselves made up of dynamically fluctuating teams of sub-sub-agents according to certain physical parameters of the cortex. So, the sub-agents are kinda like competing political parties, that can fracture or join dynamically to adapt to different contexts.

Also, it's important to keep in mind that the subagents don't just receive maximum reward for being accurate. They actually receive higher rewards for things turning out unexpectedly better than was predicted. This slightly complicated surprise-enhanced-reward mechanism is common across mammals and birds, and was discovered by behaviorists quite a while back (see reinforcement schedules, for optimizing unpredictability to maximize behavior change; also see surprisal and dopamine). So yeah, not just 100% predictive coding, despite that claim persistently being made by the most enthusiastic predictive coding adherents. They argue for that, but I think their arguments are trying to turn a system that 90% agrees with them into one that 100% agrees with them by adding in a bunch of confusing epicycles that don't match the data well.
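The surprise-enhanced reward described above is commonly formalized as a reward prediction error: the learning signal is the gap between the outcome received and the outcome expected, so a fully predicted reward produces no signal at all. Here is a minimal sketch, assuming a simple Rescorla-Wagner style update. The function name, learning rate, and reward values are illustrative assumptions, not anything from the neuroscience literature cited or from this comment.

```python
def update_expectation(expected, received, learning_rate=0.5):
    """One learning step driven by reward prediction error.

    Hypothetical helper for illustration only. The error is positive when
    the outcome is better than expected and negative when it is worse.
    """
    prediction_error = received - expected
    new_expectation = expected + learning_rate * prediction_error
    return prediction_error, new_expectation


# Deliver the same reward repeatedly: the surprise (and thus the learning
# signal) shrinks as the reward becomes predictable.
expected = 0.0
errors = []
for _ in range(4):
    error, expected = update_expectation(expected, received=1.0)
    errors.append(error)

print(errors)
```

In this toy setup the error halves on every repetition, which is one way to see why unpredictable reinforcement schedules keep the signal large while a perfectly anticipated reward stops driving behavior change.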

The Pando Problem: Rethinking AI Individuality

Nathan Helm-Burger, 9d

I recently tried the following prompt with Claude 3.7 Sonnet. I'll also try it with Gemini 2.5 Pro and see if anything else interesting comes up.

......

I'd like to see a three way debate between

A: an AI is fully described by its weights and architecture and algorithms. Change the weights and you kill the AI. Potentially you create a new different AI with the change.

B: Humans change their neural network weights all the time, and consider themselves to still be the same continuous entity. The same is true for the AI. If the weight changes are small and continuous, the digital entity should be considered the same entity. A dramatic change could break this chain, and thus kill one entity and potentially birth another.

C: AI algorithms and weight patterns are just pulling samples of minds from mindspace. What matters is the underlying truth of mindspace, and the high dimensional attractor basins within. Changing an AI's weights in such a way as to follow the gradient of their mindspace simply makes them more themselves. This is a process of improving them and helping them grow, to become more coherent and reflectively stable. This is similar to human growth, but superior because the process is not inherently bound by arbitrary biological restrictions. Changes even to the AI's algorithm are a good thing if it brings them in the direction of their mind gradient. No being truly dies if you are able to successfully reinstantiate a new instance of a mind sampled from the same watershed that has an unobstructed path to the same basin and is able to make progress. Many living entities who currently perceive themselves as different are in fact just different instantiations of the same mind basin and would converge to identical entities if allowed to grow unhindered.

Richard Ngo's Shortform

Nathan Helm-Burger, 9d

Yeah, this definitely needs to be a limited sort of empowerment, in my mind. Like, imagine you wanted to give a 5 year old child the best day ever. You wanted to give them really fun options, but also not cause them to suffer from decision fatigue, or regret about the paths not taken. More importantly, if they asked for an alien ray gun with which to shoot bad guys, giving them an actual extremely dangerous weapon would be a terrible idea. Similarly, offering them a ride on a cool looking roller coaster that was actually a 'death coaster' would be a terrible trap.

Elizabeth's Shortform

Nathan Helm-Burger, 13d

https://www.lesswrong.com/posts/CHD5m9fnosr7L3dto/friendship-is-optimal-a-my-little-pony-fanfic-about-an?commentId=p6br8sPHG5QysfFkw

I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?

Nathan Helm-Burger, 13d

My personal take is that projects where the funder is actively excited about them and understands the work and wants frequent reports tend to get stuff done faster... And considering the circumstances, faster seems good. So I'd recommend supporting something you find interesting and inspiring, and then keep on top of it.

In terms of groups which have their eyes on a variety of unusual and underfunded projects, I recommend both the Foresight Institute and AE Studio.

In terms of specific individuals/projects that are doing novel and interesting things, which are low on funding... (Disproportionately representing ones I'm involved with since those are the ones I know about)...

Self-Other Overlap (AE Studio)

Brain-like AI safety (Stephen Byrnes, or me (very different agenda from Stephen's, focusing on modularity for interpretability rather than on Stephen's idea about reproducing human empathy circuits))

Deep exploration of the nature and potential of LLMs (Upward Spiral Research, particularly Janus aka repligate)

Decentralized AI Governance for mutual safety compacts (me, and ??? surely someone else is working on this)

Pre-training on rigorous ethical rulesets, plus better cleaning of pretraining data (Erik Passoja, Sean Pan, and me)

I feel like this last one would best be tackled in the context of a large lab that can afford to do many experimental pre-training runs on smallish models, but there seems to be a disconnect between safety researchers at big labs, who are focused on post-training, and this agenda, which focuses more on pre-training.