Nathan Helm-Burger

AI alignment researcher, ML engineer. Master's in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability trained on censored data (simulations with no mention of humans or computer technology). I think that current ML mainstream technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and I think that this automated process will mine neuroscience for insights, and quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation. So I am trying to warn the world about this possibility. See my prediction market here: 

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

What if each advisor were granted a limited number of uses of a chess engine... Like 3 each per game. That could help the betrayers come up with a good betrayal when they thought the time was right. But the good advisor wouldn't know that the bad one was choosing this move to use the chess engine on.

Just wanted to say that this was a key part of my daily work for years as an ML engineer / data scientist. Use cheap fast good-enough models for 99% of stuff. Use fancy expensive slow accurate models for the disproportionately high value tail.
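A minimal sketch of that routing idea, assuming you have some per-item estimate of value. Every name here (`route_by_value`, `value_of`, `value_threshold`, and the toy stand-in models) is hypothetical and for illustration only, not something from a particular library or from my old codebase.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def route_by_value(
    items: Sequence[T],
    value_of: Callable[[T], float],
    cheap_model: Callable[[T], R],
    expensive_model: Callable[[T], R],
    value_threshold: float,
) -> List[Tuple[str, R]]:
    """Send each item to the cheap model unless its estimated value
    reaches value_threshold, in which case use the expensive model."""
    routed: List[Tuple[str, R]] = []
    for item in items:
        if value_of(item) >= value_threshold:
            # High-value tail: worth paying for the slow, accurate model.
            routed.append(("expensive", expensive_model(item)))
        else:
            # Everything else: the fast, good-enough model.
            routed.append(("cheap", cheap_model(item)))
    return routed


if __name__ == "__main__":
    # Toy stand-ins (assumptions): each item is an order amount, and the
    # "models" are placeholder functions rather than real predictors.
    orders = [5.0, 12.0, 8.0, 5000.0]
    results = route_by_value(
        orders,
        value_of=lambda amount: amount,
        cheap_model=lambda amount: "cheap prediction",
        expensive_model=lambda amount: "expensive prediction",
        value_threshold=1000.0,
    )
    for order, (tier, prediction) in zip(orders, results):
        print(order, tier, prediction)
```

In practice the threshold is the interesting design choice: set it so that the expensive model only sees the small fraction of items where its extra accuracy pays for its extra cost.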

Love this. I've been thinking about related things in AI bio safety evals. Could we have an LLM walk a layperson through a complicated-but-safe wetlab protocol which is an approximate difficulty match for a dangerous protocol? How good would this evidence be compared to doing the actual dangerous protocol? Maybe at least you could cut eval costs by having a large subject group do the safe protocol, and only a small, carefully screened and supervised group go through the dangerous protocol.

To which I say, the only valid red teaming of an open source model is to red team it and any possible (relatively inexpensive) modification thereof, since that is what you are releasing.


Yes! Thank you!

I think... maybe I see the world, and humanity's existence on it, as a more fragile state of affairs than other people do. I wish I could answer you more thoroughly.

I think you're misinterpreting. That question is for opting in to the highest privacy option. Not checking it means that your data will be included when the survey is made public. Wanting to not be included at all, even in summaries, is indicated by simply not submitting any answers.

Yes, I think I'd go with the description: 'vague sense that there is something fixed, and a lived experience that says that if not completely fixed then certainly slow moving.'

And I absolutely agree that understanding on this is lacking.

These are indeed the important questions!

My answers from introspection would say things like, "All my values are implicit, explicit labels are just me attempting to name a feeling. The ground truth is the feeling."

"Some have been with me for as long as I can remember, others seem to have developed over time, some changed over time."

My answers from neuroscience would be shaped like, "Well, we have these basic drives from our hypothalamus, brainstem, basal ganglia... and then our cortex tries to understand and predict these drives, and drives can change over time (especially with puberty, for instance). If we were to break down where a value comes from, it would have to be from some combination of these basic drives, cortical tendencies (e.g. vulnerability to optical illusions), and learned behavior."

"Genetics are responsible for a fetus developing a brain in the first place, and set a lot of parameters in our neural networks that can last a lifetime. Obviously, genetics has a large role to play in what values we start with and what values we develop over time."

My answers from reasoning about it abstractly would be something like, "If I could poll a lot of people at a lot of different ages, and analyze their introspective reports and their environmental circumstances and their life histories, then I could do analysis on what things change and what things stay the same."

"We can get clues about the difference between a value and an instrumental goal by telling people to consider a hypothetical scenario in which a fact X was true that isn't true in their current lives, and see how this changes their expectation of what their instrumental goals would be in that scenario. For example, when imagining a world where circumstances have changed such that money is no longer a valued economic token, I anticipate that I would have no desire for money in that world. Thus, I can infer that money is an instrumental goal."

Overall, I really feel uncertain about the truth of the matter and the validity of each of these ways of measuring. I think understanding values vs instrumental goals is important work that needs doing, and I think we need to consider all these paths to understanding unless we figure out a way to rule some out.

An example of something I would be strongly against anyone publishing at this point in history is an algorithmic advance which drastically lowered compute costs for an equivalent level of capabilities, or substantially improved hazardous capabilities (without tradeoffs) such as situationally-aware strategic reasoning or effective autonomous planning and action over long time scales. I think those specific capability deficits are keeping the world safe from a lot of possible bad things. 
