Nathan Helm-Burger

AI alignment researcher, ML engineer. Masters in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability and trained on censored data (simulations with no mention of humans or computer technology). I think that current mainstream ML technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and I think that this automated process will mine neuroscience for insights and quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation. So I am trying to warn the world about this possibility.

See my prediction markets here: 

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

I now work for SecureBio on AI-Evals.

relevant quote: 

"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." 


I'm hopeful that a sufficiently convincing demo could convince politicians/military brass/wealthy powerful people/the public. Probably different demos could be designed to be persuasive to these different audiences. Ideally, the demos could be designed early, and you could get buy-in from the target audience that if the described demo were successful then they would agree that "something needed to be done". Even better would be concrete commitments, but I think there's value even without that. It also seems good to be as prepared as possible to act on a range of plausible natural warning shots, for example by getting similar pre-negotiated agreements that if X did happen, it should be considered a tipping point for taking action.

Something which concerns me is that transformative AI will likely be a powerful destabilizing force, which will place countries currently behind in AI development (e.g. Russia and China) in a difficult position. Their governments are currently in the position of seeing that peacefully adhering to the status quo may lead to rapid disempowerment, and that the potential for coercive action to interfere with disempowerment is high. It is pretty clearly easier and cheaper to destroy chip fabs than create them, easier to kill tech employees with potent engineering skills than to train new ones.

I agree that conditions of war make safe transitions to AGI harder and make people more likely to accept higher risk. I don't see what to do about the fact that the development of AI power is itself creating pressures towards war. This seems bad. I don't know what I can do to make the situation better though.

I'm confused here, Matthew. It seems highly probable to me that AI systems which want takeover and ones which want moderate power combined with peaceful coexistence with humanity are pretty hard to distinguish early on. And early on is when it's most important for humanity to distinguish between them, before those systems have gotten power, while we can still stop them.

Picture a merciless un-aging sociopath, capable of duplicating itself easily and rapidly, on a trajectory of gaining economic, political, and military power with the aim of acquiring as much power as possible. Imagine that this entity has the option of making empty promises and telling highly persuasive lies to humans in order to gain power, with no intention of fulfilling any of those promises once it achieves enough power.

That seems like a scary possibility to me. And I don't know how I'd trust an agent which seemed like it could be this, but was making really nice sounding promises. Even if it was honoring its short-term promises while still under the constraints of coercive power from currently dominant human institutions, I still wouldn't trust that it would continue keeping its promises once it had the dominant power.

Sounds like you use bad air purifiers, or too few, or run them on too low of a setting. I live in a wildfire prone area, and always keep a close eye on the PM2.5 reports for outside air, as well as my indoor air monitor. My air filters do a great job of keeping the air pollution down inside, and doing something like opening a door gives a noticeable brief spike in the PM2.5.

Good results require: fresh filters, somewhat more than the recommended number of air filters per unit of area, running the air filters on max speed (low speeds tend to be disproportionately less effective, giving unintuitively low performance).

From talking with people who work on a lot of grant committees in the NIH and similar funding orgs, it's really hard to do proper blinding of reviews. Certain labs tend to focus on particular theories and methods, repeating variations of the same idea... So if you are familiar with the general approach of a particular lab and its primary investigator, you will immediately recognize and have a knee-jerk reaction (positive or negative) to a paper which pattern-matches to the work that that lab / subfield is doing.

Common reactions from grant reviewers:

Positive - "This fits in nicely with my friend Bob's work. I respect his work, I should argue for funding this grant."

Neutral - "This seems entirely novel to me, I don't recognize it as connecting with any of the leading trendy ideas in the field or any of my personal favorite subtopics. Therefore, this seems high risk and I shouldn't argue too hard for it."

Slightly negative - "This seems novel to me, and doesn't sound particularly 'jargon-y' or technically sophisticated. Even if the results would be beneficial to humanity, the methods seem boring and uncreative. I will argue slightly against funding this."

Negative - "This seems to pattern match to a subfield I feel biased against. Even if this isn't from one of Jill's students, it fits with Jill's take on this subtopic. I don't want views like Jill's gaining more traction. I will argue against this regardless of the quality of the logic and preliminary data presented in this grant proposal."

From my years in academia studying neuroscience and related aspects of bioengineering and medicine development... yeah. So much about how effort gets allocated is not 'what would be good for our country's population in expectation, or good for all humanity'. It's mostly about 'what would make an impressive sounding research paper that could get into an esteemed journal?', 'what would be relatively cheap and easy to do, but sound disproportionately cool?', 'what do we guess that the granting agency we are applying to will like the sound of?'. So much emphasis on catching waves of trendiness, and so little on estimating the expected value of the results.

Research an unprofitable preventative-health treatment which plausibly might have significant impacts on a wide segment of the population? Booooring.

Research an impractically-expensive-to-produce fascinatingly complex clever new treatment for an incredibly rare orphan disease? Awesome.

One point I’ve seen raised by people in the latter group is along the lines of: “It’s very unlikely that we’ll be in a situation where we’re forced to build AI systems vastly more capable than their supervisors. Even if we have a very fast takeoff - say, going from being unable to create human-level AI systems to being able to create very superhuman systems ~overnight - there will probably still be some way to create systems that are only slightly more powerful than our current trusted systems and/or humans; to use these to supervise and align systems slightly more powerful than them; etc. (For example, we could take a very powerful, general algorithm and simply run it on a relatively low amount of compute in order to get a system that isn’t too powerful.)” This seems like a plausible argument that we’re unlikely to be stuck with a large gap between AI systems’ capabilities and their supervisors’ capabilities; I’m not currently clear on what the counter-argument is.


I agree that this is a very promising advantage for Team Safety. I do think that, in order to make good use of this potential advantage, the AI creators need to be cautious going into the process. 

One way that I've come up with to 'turn down' the power of an AI system is to simply inject small amounts of noise into its activations. 
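A minimal sketch of what injecting noise into activations could look like, using a toy two-layer network in plain Python. This is only an illustration of the general idea, not the setup from any actual experiment: the network, the names `forward` and `mean_deviation`, the random weights, and the choice of Gaussian noise on the hidden layer are all assumptions made for the example.

```python
import math
import random


def forward(x, w1, w2, noise_std=0.0, rng=None):
    """Toy two-layer network. Gaussian noise with standard deviation
    `noise_std` is added to the hidden activations (hypothetical setup)."""
    rng = rng or random.Random(0)
    # Hidden layer: tanh of a weighted sum of the inputs.
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w1]
    if noise_std > 0.0:
        # The "turn down" knob: perturb each hidden activation.
        hidden = [h + rng.gauss(0.0, noise_std) for h in hidden]
    # Output layer: a single linear unit.
    return sum(w * h for w, h in zip(w2, hidden))


def mean_deviation(x, w1, w2, noise_std, trials=200, seed=0):
    """Average absolute gap between the noisy output and the clean output."""
    rng = random.Random(seed)
    clean = forward(x, w1, w2)
    total = sum(abs(forward(x, w1, w2, noise_std, rng) - clean)
                for _ in range(trials))
    return total / trials


# Arbitrary fixed weights and input, chosen only for illustration.
init = random.Random(42)
W1 = [[init.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
W2 = [init.uniform(-1, 1) for _ in range(8)]
X = [0.5, -0.2, 0.1, 0.9]

for std in (0.0, 0.01, 0.1, 1.0):
    print(f"noise_std={std}: mean deviation {mean_deviation(X, W1, W2, std):.4f}")
```

In this toy, the output drifts further from the clean output as `noise_std` grows, which is the sense in which noise acts as a dial. Whether the same dial degrades a large model's capabilities smoothly is an empirical question that this sketch does not address.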

Tangentially related (spoilers for Worth the Candle):

I think it'd be hard to do a better cohesive depiction of Utopia than the end of Worth the Candle by Alexander Wales. I mean, I hope someone does do it, I just think it'll be challenging to do!

Cute demo of Claude, GPT4, and Gemini building stuff in Minecraft

Still reading the paper, but so far I love it. This feels like a big step forward in thinking about the issues at hand which addresses so many of the concerns I had about limitations of previous works. Whether or not the proposed technical solution works out as well as hoped, I feel confident that your framing of the problem and presentation of desiderata of a solution are really excellent. I think that alone is a big step forward for the frontier of thought on this subject.
