Nathan Helm-Burger

AI alignment researcher, ML engineer. Masters in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability and trained on censored data (simulations with no mention of humans or computer technology). I think that current mainstream ML technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and that this automated process will mine neuroscience for insights and quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation, so I am trying to warn the world about this possibility.

See my prediction markets here:

https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

I now work for SecureBio on AI-Evals.

Relevant quote:

"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe 

Comments

I agree that this seems like a grouping of concepts around 'defensive empowerment', which feels like it gets at a useful way to think about reality. However, I don't know offhand of any research groups with this general a focus on the subject. I think people working in any of these subareas have mostly focused just on their specific specialty (e.g. cyberdefense or biological defense), or an even narrower subarea than that.

I think one of the risks here is that a general agent able to help with this wide a set of things would almost certainly be capable of a lot of scary dual-use capabilities. That adds complications to how to pursue the general subject in a safe and beneficial way.

I feel quite confident that all the leading AI labs are already thinking and talking internally about this stuff, and that what we are saying here adds approximately nothing to their conversations. So I don't think it matters whether we discuss this or not. That simply isn't a lever of control we have over the world.

There are potentially secret things people might know which shouldn't be divulged, but I doubt this conversation is anywhere near technical enough to be advancing the frontier in any way.

I agree with Steve Byrnes here. I think I have a better way to describe this.
I would say that the missing piece is 'mastery'. Specifically, learning mastery over a piece of reality. By mastery I am referring to the skillful ability to model, predict, and purposefully manipulate that subset of reality.
I don't think this is an algorithmic limitation, exactly.


Look at the work DeepMind has been doing, particularly with Gato and more recently AutoRT, SARA-RT, RT-Trajectory, UniSim, and Q-Transformer. Look at the work being done with the help of Nvidia's new Robot Simulation Gym Environment. Look at OpenAI's recent foray into robotics with Figure AI. This work is held back from being highly impactful (so far) by the difficulty of accurately simulating novel interesting things, the difficulty of learning the pairing of action -> consequence compared to learning a static pattern of data, and the hardware difficulties of robotics.

This is what I think our current multimodal frontier models are mostly lacking. They can regurgitate, and to a lesser extent synthesize, facts that humans wrote about, but not develop novel mastery of subjects and then report back on their findings. This is the difference between being able to write a good scientific paper given a dataset of experimental results and a rough description of the experiment, versus being able to gather that data yourself. The line here is blurry, and will probably get blurrier before collapsing entirely. It's about not just doing the experiment, but doing the pilot studies and observations and playing around with the parameters to build a crude initial model of how this particular piece of the universe might work. Building your own new models rather than absorbing models built by others. Moving beyond student to scientist.

This is in large part a limitation of training expense. It's difficult to have enough on-topic information available in parallel to feed our current data-inefficient algorithms many lifetimes' worth of experience.


So, while it is possible to improve the skill of mastery-of-reality by scaling up current models and training systems, it gets much, much easier if the algorithms become more compute-efficient and sample-efficient to train.

That is what I think is coming.

I've done my own in-depth research into the state of the field of machine learning and potential novel algorithmic advances which have not yet been incorporated into frontier models, as well as into the state of neuroscience's understanding of the brain. I have written a report detailing the ways in which I think Joe Carlsmith's and Ajeya Cotra's estimates overestimate the AGI-relevant compute of the human brain by somewhere between 10x and 100x.
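
To give a concrete sense of the size of that claimed correction, here is a minimal back-of-the-envelope sketch in Python. The 1e15 FLOP/s starting figure is an assumed placeholder standing in for a commonly cited central brain-compute estimate, not a number taken from my report:

```python
# Toy back-of-the-envelope: what a 10x-100x downward revision of a
# brain-compute estimate would imply. The starting figure is an assumed
# illustrative placeholder, not a number from any particular report.

assumed_brain_flops = 1e15  # placeholder "brain-equivalent" FLOP/s estimate

for overestimate_factor in (10, 100):
    revised = assumed_brain_flops / overestimate_factor
    print(f"Overestimated by {overestimate_factor}x -> "
          f"revised brain-equivalent compute ~{revised:.0e} FLOP/s")

# Prints:
# Overestimated by 10x -> revised brain-equivalent compute ~1e+14 FLOP/s
# Overestimated by 100x -> revised brain-equivalent compute ~1e+13 FLOP/s
```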

Furthermore, I think that there are compelling arguments for why the compute in frontier algorithms is not being deployed as efficiently as it could be, resulting in higher training costs and data requirements than are theoretically necessary.

In combination, these findings lead me to believe we are primarily algorithm-constrained, not hardware- or data-constrained. This, in turn, means that once frontier models have progressed to the point of being able to automate research into improved algorithms, I expect substantial progress to follow. This progress will, if I am correct, be untethered from further increases in compute hardware or training data.

My best guess is that a frontier model of the approximate expected capability of GPT-5 or GPT-6 (equivalently, Claude 4 or 5, or similar advances in Gemini) will be sufficient to automate algorithmic exploration to the extent that the necessary algorithmic breakthroughs will be made. I don't expect the search process to take more than a year. So I think we should expect a period of algorithmic discovery in the next 2-3 years which leads to a strong increase in AGI capabilities even holding compute and data constant.

I expect that 'mastery of novel pieces of reality' will continue to lag behind the ability to regurgitate and recombine recorded knowledge. Indeed, recombining information clearly seems to be lagging behind regurgitation and creative extrapolation, though not as far behind as mastery, so somewhere in the middle range.


If you imagine the whole skillset remaining in its relative configuration of peaks and valleys, but shifted upwards such that the currently lagging 'mastery' skill is at human level and a lot of other skills are well beyond, then you will be picturing something similar to what I am picturing.

I'd be willing to bet that if you gave me three hours in a room with a whiteboard to explain the specifics of my findings to you, you'd come out of that conversation with an estimate much closer to mine. In fact, here's the specific bet I'd offer you: if, after 3 hours of listening to me present the evidence I have and the implications I take from it, your point of view is still closer to what it currently is than to what mine currently is... I'll give you $600 as an apology for wasting your time, and make a public statement saying that I presented my evidence to you and you did not find that it changed your point of view.

If, on the other hand, you find that your viewpoint (written down explicitly before the discussion) has moved closer to my original viewpoint at the start of the discussion, then you agree to make a public statement saying that I presented compelling evidence and that you updated your estimates.

[Edit: I would pay for my travel and lodging in coming to meet with you. Also, I would give you a bibliography of academic papers which could support my factual claims, and time to read over those papers after my talk in order to decide if you trusted my claims.]

[Edit 2: No bite? Perhaps you haven't had time to respond, or perhaps $200/hr feels like too low a price for your time to be worth hearing me out. Would it be more tempting if I doubled it to $400/hr?]

I've said this elsewhere, but I think it bears repeating. I've done my own in-depth research into the state of the field of machine learning and potential novel algorithmic advances which have not yet been incorporated into frontier models, as well as into the state of neuroscience's understanding of the brain. I have written a report detailing the ways in which I think Joe Carlsmith's and Ajeya Cotra's estimates overestimate the AGI-relevant compute of the human brain by somewhere between 10x and 100x.

Furthermore, I think that there are compelling arguments for why the compute in frontier algorithms is not being deployed as efficiently as it could be, resulting in higher training costs and data requirements than are theoretically necessary.

In combination, these findings lead me to believe we are primarily algorithm-constrained, not hardware- or data-constrained. This, in turn, means that once frontier models have progressed to the point of being able to automate research into improved algorithms, I expect substantial progress to follow. This progress will, if I am correct, be untethered from further increases in compute hardware or training data.

My best guess is that a frontier model of the approximate expected capability of GPT-5 or GPT-6 (equivalently, Claude 4 or 5, or similar advances in Gemini) will be sufficient to automate algorithmic exploration to the extent that the necessary algorithmic breakthroughs will be made. I don't expect the search process to take more than a year. So I think we should expect a period of algorithmic discovery in the next 2-3 years which leads to a strong increase in AGI capabilities even holding compute and data constant.

I feel very uncertain about what the full implications of that will be, or how fast things will proceed after that point. I do think it would be reasonable, if this situation does come to pass, to approach such novel, unprecedentedly powerful AI systems with great caution.

I've been studying and thinking about the physical side of this phenomenon in neuroscience recently. There are groups of columns of neurons in the cortex that form temporary voting blocs regarding whatever subject that particular Brodmann area focuses on. These groups have to deal with physical limits on how many blocs a region can stably divide into, which limits the number of distinct active hypotheses or 'traders' there can be in a given area at a given time. It's unclear exactly what the maximum is, and it depends on the cortical region in question, but generally 6-9 is the approximate max (not coincidentally the number of distinct 'chunks' we can hold in active short-term memory). Also, there is a tendency for noise to push traders/hypotheses/firing-groups that are too similar back into synchrony/agreement with each other, so that they collapse back down toward a baseline of two competing hypotheses. These hypotheses/firing-groups/traders are pushed into existence, or pushed into merging, not just by their own 'bids' but also by the evidence coming in from other brain areas or senses. I don't think that current-day neuroscience has all the details yet (although I certainly don't have the full picture of all relevant papers in neuroscience!).
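
As a purely illustrative caricature of that dynamic, here is a toy sketch in Python. It is my own invention for this comment, not a model from the neuroscience literature, and every number and rule in it is an assumption chosen for illustration: hypotheses are competing blocs, noise jitters their 'tuning', blocs that drift too close together collapse back into one, and the number of distinct blocs is capped at around seven.

```python
import random

# Toy caricature of competing cortical "voting blocs": each hypothesis is a
# (tuning, strength) pair. Noise jitters tunings, blocs whose tunings drift
# too close together merge back into one, and capacity is capped at ~7 blocs.
# All numbers here are assumptions for illustration only.

CAPACITY = 7          # rough cap on distinct active hypotheses in one area
MERGE_DISTANCE = 0.1  # how close two tunings must be before they collapse together

def step(blocs, evidence=None):
    # Noise perturbs each bloc's tuning a little.
    blocs = [(t + random.gauss(0, 0.02), s) for t, s in blocs]
    # Incoming evidence can push a new candidate hypothesis into existence.
    if evidence is not None and len(blocs) < CAPACITY:
        blocs.append((evidence, 1.0))
    # Blocs whose tunings have drifted too close fall back into synchrony.
    blocs.sort()
    merged = [blocs[0]]
    for tuning, strength in blocs[1:]:
        last_tuning, last_strength = merged[-1]
        if tuning - last_tuning < MERGE_DISTANCE:
            # Collapse the two blocs into one combined bloc.
            merged[-1] = ((tuning + last_tuning) / 2, strength + last_strength)
        else:
            merged.append((tuning, strength))
    return merged

blocs = [(0.2, 1.0), (0.8, 1.0)]  # baseline of two competing hypotheses
for i in range(50):
    new_evidence = random.random() if i % 10 == 0 else None
    blocs = step(blocs, new_evidence)
print(f"{len(blocs)} distinct hypothesis blocs remain after 50 steps")
```

The real mechanism obviously has far more structure than this; the two features I mean to point at are the capacity cap and the tendency of too-similar hypotheses to collapse back together.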

For those interested in more details, I recommend this video: 

I think you make some good points here... except there is one path I think you didn't explore enough.

What if humanity is really stuck on AI alignment, uploading has become a possibility, and making a rogue AGI agent is also a possibility? If these things are being held back by fallible human enforcement, it might then seem that humanity is in a very precarious predicament.

A possible way forward, then, could be to take an uploaded human and go down the path of editing, monitoring, and controlling it. Neuroscience knows a lot about how the brain works. Given that starting point, and the ability to do experiments in a lab where you have full read/write access to a human brain emulation, I expect you could get something far more aligned than you could with a relatively unknown artificial neural net.

Is that a weird and immoral idea? Yes. It's pretty dystopian to be enslaving and experimenting on a human(ish) mind. But if it meant the survival of humanity because we were in very dire straits... I'd bite that bullet.

Yes, I personally think that things are going to be moving much too fast for GDP to be a useful measure. GDP requires some sort of integration into the economy. My experience in data science and ML engineering in industry, and also my time in academia, has given me a strong intuition for the lag time from developing something cool in the lab, to actually managing to publish a paper about it, to people in industry seeing the paper and deciding to reimplement it in production. So if you have a lab which is testing its products internally, and the output is an improved product within that lab, which can then immediately be used for another cycle of improvement... that is clearly going to move much faster than anything you will see reflected in GDP. So GDP might help you measure the slow early start of a slow takeoff, but it will be useless in the fast end section.
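
As a minimal toy illustration of why (all numbers here are assumptions made up for the sketch, not forecasts): suppose each internal improvement cycle takes a month and compounds capability by 20%, while it takes roughly two years for a given capability level to diffuse into products that show up in GDP.

```python
# Toy sketch: lab capability compounds each month, while GDP only "sees"
# the capability level from DIFFUSION_LAG months earlier. All numbers are
# assumptions for illustration, not forecasts.

CYCLE_GAIN = 1.20    # assumed 20% capability gain per monthly internal cycle
DIFFUSION_LAG = 24   # assumed months for lab capability to reach deployed products

capability = [1.0]
for month in range(1, 49):
    capability.append(capability[-1] * CYCLE_GAIN)

for month in (12, 24, 36, 48):
    in_lab = capability[month]
    visible_to_gdp = capability[max(0, month - DIFFUSION_LAG)]
    print(f"month {month}: lab capability {in_lab:.0f}x, "
          f"capability visible in the economy {visible_to_gdp:.0f}x")

# Prints:
# month 12: lab capability 9x, capability visible in the economy 1x
# month 24: lab capability 79x, capability visible in the economy 1x
# month 36: lab capability 709x, capability visible in the economy 9x
# month 48: lab capability 6320x, capability visible in the economy 79x
```

The real dynamics would of course be messier, but the point stands: by the time the diffusion lag lets anything show up in economic statistics, the internal state of the lab is already orders of magnitude ahead.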

I like your effort to think holistically about the sociotechnical systems we are embedded in and the impacts we should expect AI to have on those systems.

I have a couple of minor critiques of the way you are breaking things down that I think could be improved.

First, a meta point: there's a general pattern here of being a bit too black & white in describing very complicated sets of things. This is nice because it makes it easier to reason about complicated situations, but it risks oversimplifying and leading to seemingly strong conclusions which don't actually follow from the true reality. The devil is in the details, as they say.

Efforts to-date have largely gravitated into the two camps of value alignment and governance.

I don't think this fully describes the set of camps. I think that these are two of the camps, yes, but there are others.

My breakdown would be:

Governance - Using regulation to set up patterns of behavior where AI will be used and developed in safe rather than harmful ways. Forcing companies to internalize their externalities (e.g. risks to society). Preventing human misuse of AI, and enforcing against it when it occurs. Attempting to regulate novel technologies which arise from accelerated R&D as a result of AI. Setting up preventative measures to detect and halt rogue AI or human-misused AI in the act of doing bad things, before the worst consequences can come to pass. Preventing acceleration spirals of recursive self-improvement from proceeding so rapidly that humanity becomes intellectually eclipsed and loses control over its destiny.

Value alignment - Getting the AIs to behave as much as possible in accordance with the values of humanity generally. Getting the AI to be moral / ethical / cautious about harming people or making irreversible changes with potentially large negative consequences. Ideally, if an AI were 'given free rein' to act in the world, we'd want it to act in ways which were win-win for itself and humanity, and no matter what to err on the side of not harming humanity.

Operator alignment - technical methods to get the AI to be obedient to the instructions of the operators. To make the AI behave in accordance with the intuitive spirit of their instructions ('do what I mean') rather than like an evil genie which follows only the letter of the law. Making the AI safe and intuitive to use. Avoiding unintended negative consequences.

Control - Finding ways to ensure the operators of AI can maintain control over the AIs they create, even if a given AI gets built wrong such that it tries to behave in harmful, undesirable ways (out of alignment with operators). This involves things like technical methods of sandboxing new AIs, and thoroughly safety-testing them within the sandbox before deploying them. Once deployed, it involves making sure you retain the ability to shut them off if something goes wrong, and making sure the model's weights don't get exfiltrated by outside actors or by the model itself. It also means having good cybersecurity, employee screening, and internal infosec practices so that hackers/spies can't steal your model weights, design docs, and code.

 

A minor nitpick:

Sociotechnical system/s (STS)
A system in which agents (traditionally, people) interact with objects (including technologies) to achieve aims and fulfill purposes

Not sure if objects is the right word here, or rather, not sure if that word alone is sufficient. Maybe objects and information/ideas/concepts? Much of the work I've been doing recently is observing what potential risks might arise from AI systems capable of rapidly integrating technical information from a large set of sources. This is not exactly making new discoveries, but just putting disparate pieces of information together in such a way as to create a novel recipe for technology. In general, this is a wonderful ability. In the specific case of weapons of mass destruction, it's a dangerous ability.

 

Nested STS / Human STS

Yes, historically, all STS have been human STS. But novel AI agents could, in addition to integrating into and interacting with human STS, form their own entirely independent STS. A sufficiently powerful amoral AGI would see human STS as useless if it could make its own that served its needs better. Such a scenario would likely turn out quite badly for humans. This is the concept of "the AI doesn't hate you, it's just that humans and their ecosphere are made of atoms that the AI has preferred uses for."

This doesn't contradict your ideas, just suggests an expansion of possible avenues of risk which should be guarded against. Self-replicating AI systems in outer space or burrowed into the crust of the earth, or roaming the ocean seabeds will likely be quite dangerous to humanity sooner or later even if they have no interaction with our STS in the short term.
