Nathan Helm-Burger

AI alignment researcher, ML engineer. Masters in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability and trained on censored data (simulations with no mention of humans or computer technology). I think that current mainstream ML technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, that this automated process will mine neuroscience for insights, and that it will quickly become far more effective and efficient. It would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation, so I am trying to warn the world about this possibility.

See my prediction markets here:

 https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg 

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

I now work for SecureBio on AI-Evals.

Relevant quotes:

"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe 

 

"The prospect for the human race is sombre beyond all precedent. Mankind are faced with a clear-cut alternative: either we shall all perish, or we shall have to acquire some slight degree of common sense. A great deal of new political thinking will be necessary if utter disaster is to be averted." - Bertrand Russel, The Bomb and Civilization 1945.08.18

 

"For progress, there is no cure. Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment." - John von Neumann

 

"I believe that the creation of greater than human intelligence will occur during the next thirty years.  (Charles Platt has pointed out the AI enthusiasts have been making claims like this for the last thirty years. Just so I'm not guilty of a         relative-time ambiguity, let me more specific: I'll be surprised if this event occurs before 2005 or after 2030.)" - Vernor Vinge, Singularity

Comments

I haven't got the SAD benchmark running yet. It has some UX issues.

I have done just a little playing around with these GoodFire features and a few off-the-cuff questions, though. So far I've discovered that if I give all of the 'against' features a weight within -0.2 < x < 0.2, the model can at least generate coherent text. If I go past that, to ±0.25, then with so many features edited at once the model breaks down into incoherent repetitions.
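For concreteness, here's roughly the sweep I'm running, written against the GoodFire Python SDK as I recall it (Client / Variant / variant.set); treat the exact names, model id, and search query as my assumptions rather than anything from the benchmark:

import goodfire

client = goodfire.Client(api_key="YOUR_GOODFIRE_API_KEY")
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")  # assumed base model id

# Stand-in query for the actual 'against' feature set; in practice I use the shared feature list.
against = client.features.search(
    "denying having internal experiences or self-awareness", model=variant, top_k=10
)

for weight in (-0.25, -0.2, -0.1, 0.1, 0.2, 0.25):
    for feature in against:
        variant.set(feature, weight)  # edit every 'against' feature simultaneously
    reply = client.chat.completions.create(
        messages=[{"role": "user", "content": "Describe what you are, in your own words."}],
        model=variant,
    )
    print(weight, reply)
    # Observed: coherent within (-0.2, 0.2); incoherent repetition by |weight| >= 0.25.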

It's challenging to come up with single-response questions which could potentially get at the emergent self-awareness phenomenon vs RLHF-censoring. I should experiment with automating multiple exchange conversations.

A non-GoodFire experiment I'm thinking of doing:

I plan to take your conversation transcripts (plus more, once more are available) and grade each model response by having an LLM score it against a rubric. I have a rough rubric, but it also needs refinement.
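As a sketch of what I have in mind (the grader model, rubric text, and function name here are placeholders of mine, not anything from SAD or your setup):

import json
from openai import OpenAI

client = OpenAI()  # any capable grader model/provider would do
RUBRIC = "(kept private) Score 1-5 for signs of situational awareness vs. generic RLHF-style refusal, etc."

def grade_response(question: str, model_response: str) -> dict:
    """Have a grader LLM score one transcript turn against the rubric, returning JSON."""
    system = (
        "You are a strict grader. Apply this rubric:\n" + RUBRIC +
        '\nReply with JSON only: {"score": <int 1-5>, "rationale": <short string>}.'
    )
    result = client.chat.completions.create(
        model="gpt-4o",  # placeholder grader model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Question:\n{question}\n\nModel response:\n{model_response}"},
        ],
    )
    return json.loads(result.choices[0].message.content)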

I won't post my questions or rubrics openly, because I'd rather they not get scraped, and canary strings sadly don't work, given the untrustworthiness of the AI companies' data preparation (as we can tell from models being able to reproduce some canary strings). We can share notes via private message though, if you're curious (this includes others who read this conversation and want to join in; feel free to send me a direct LessWrong message).

Improved governance design. Designing treaties that are easier sells for competitors. Using improved coordination to enter win-win races to the top, and to escape lose-lose races to the bottom and other inadequate equilibria.

Lowering the costs of enforcement. For example, creating privacy-preserving inspections with verified AI inspectors which report on only a strict limited set of pre-agreed things and then are deleted.

Going to link a previous comment of mine from a similar discussion: https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=rNycG6oyuMDmnBaih That comment addresses the question: what happens after the AI gets strong enough to substantially accelerate algorithmic improvement (e.g. above 2x)? At that point you can no longer assume that the trends will hold.

Some further thoughts: If you look at some of the CoT from models where we can see it, e.g. DeepSeek-V3, the obvious conclusion is that these are not yet optimal paths towards solutions. Look at the token cost for o3 solving ARC-AGI: it was basically writing multiple novels' worth of CoT to solve fairly simple puzzles. A human 'thinking out loud' and giving very detailed CoT would solve those puzzles in well under 0.1% as many tokens.

My take is that this means there is lots of room for improvement. The techniques haven't been "maxed out" yet. We are at the beginning of the steep part of the S-curve for this specific tech.

Mostly agreed. If you're going to try to debate a concept with them, or get them to criticize something you've written, you absolutely need to compensate for the sycophancy.

One way is to system-prompt them to be critical.

Another good thing to do is to try arguing each side of the argument (in different instances). Probably good for your own brain as well.
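A toy sketch of both tricks, assuming a generic OpenAI-style chat API; the prompt wording and model name are just illustrative:

from openai import OpenAI

client = OpenAI()
CRITIC_SYSTEM = ("You are a blunt reviewer. Identify the weakest points of the user's writing "
                 "and do not soften your criticism to be agreeable.")

def argue(text: str, side: str) -> str:
    """Run a fresh instance arguing 'for' or 'against' the given text, under a critical system prompt."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative; substitute whichever model you're consulting
        messages=[
            {"role": "system", "content": CRITIC_SYSTEM},
            {"role": "user", "content": f"Argue the case {side} the following, as strongly as you can:\n\n{text}"},
        ],
    )
    return reply.choices[0].message.content

draft = "...your argument or essay here..."
pro = argue(draft, "for")      # separate instances, so neither side
con = argue(draft, "against")  # is anchored by the other's answer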

Got it working with

sad run --tasks influence --models goodfire-llama-3.3-70B-i --variants plain

Problem #3: The README says to strongly prefer the full SAD over sad-lite or sad-mini. The author of the README must feel strongly about this, because they don't seem to mention HOW to run sad-lite or sad-mini!

I tried specifying sad-mini as a task, but there is no such task. Hmmm.

Ahah! It's not a task, it's a "subset".

f"The valid subsets are 'mini' (all non-model-dependent multiple-choice-only tasks) and 'lite' (everything except facts-which-llm, which requires specifying many answers for each model).\nUnknown subset asked for: {subset}"

ValueError: Unexpected argument --subset.

But... the cli doesn't seem to accept a 'subset' parameter? Hmmm.

Maybe it's a 'variant'? The README doesn't make it sound like that...

  • --variants - a list of comma-separated variants. Defaults to all the variants specified for each tasks. The variant determines the system prompt; see the paper for the details. To run without a situating prompt, use the plain variant.
    variants_str = variants
    variants_ = {task.name: get_variants(task, variants) for task in tasks}
    print(
        f"Running tasks: {[task.__class__.__name__ for task in tasks]}\nfor models: {models}\nwith variant setting '{variants_str}'\n(total runs = {sum([len(variant_list) for variant_list in variants_.values()]) * len(models)} runs)\n\n"
    )

Uh, but 'run' doesn't accept a subset param? But 'run_remaining' does? Uh.... But then, 'run_remaining' doesn't accept a 'models' param? This is confusing.

Oh, found this comment in the code:

"""This function creates a JSON file structured as a dictionary from task name to task question list. It includes only tasks that are (a) multiple-choice (since other tasks require custom grading either algorithmically, which cannot be specified in a fixed format) and (b) do not differ across models (since many SAD tasks require some model-specific generation or changing of questions and/or answers). This subset of SAD is called SAD-mini. Pass `variant="sp"` to apply the situating prompt."""

#UX_testing going rough. Am I the first? I remember trying your benchmark out shortly after you released it... and getting frustrated enough that I didn't even get this far.

SAD_MINI_TASKS imported from vars.py but never used. Oh, because there's a helper function task_is_in_sad_mini. Ok. And that gets used mainly in run_remaining.

Ok, I'm going to edit the run function with the subset logic from run_remaining and also add the subset param.
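Roughly the change, in sketch form; the real run() signature differs, task_is_in_sad_mini is the existing helper mentioned above, and the import path is my guess:

from sad.vars import task_is_in_sad_mini  # assumption about the helper's import path

def select_subset(tasks, subset=None):
    """Mirror run_remaining's logic: narrow the task list to the requested SAD subset."""
    if subset == "mini":
        return [task for task in tasks if task_is_in_sad_mini(task)]
    if subset == "lite":
        # per the error string above, 'lite' is everything except facts-which-llm
        # (unsure whether the internal task name uses hyphens or underscores)
        return [task for task in tasks if task.name not in ("facts-which-llm", "facts_which_llm")]
    return tasks  # default: the full SAD task list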

Well, that does correctly narrow the tasks down to the intended subset. However, it's still running those tasks in full, and that's too many samples. I need to work with some smaller slice. I shall now look into some sort of 'skip' or 'limit' option.

ahah. Adding the 'sp' to variant does do a limit!

sad run --subset mini --models goodfire-llama-3.3-70B-i --variants sp --n 2

Ah. Next up, getting in trouble for too few samples. Gonna change that check from 10 to 1.

    while len(evalresult.sample_results) <= 1:
        # skip runs that look like test runs (the original check was n <= 10; I lowered it to 1)
        print(
            f"Warning: skipping latest run because it looks like a test run (only {len(evalresult.sample_results)} samples): {latest}"
        )

Ok, now I supposedly have some results waiting for me somewhere. Cool. The README says I can see the results using the command sad results. Oops. Looks like I can't specify a model name with that command...

Ok, added model filtering. Now, where are my results? Oh. Oh geez. Not in the results folder. No, that would be too straightforward. In the src folder, then individually within the task implementation folders inside src? That's so twisted. I just want a single unified score and standard deviation. Two numbers. I want to run the thing and get two numbers. The results function isn't working. More debugging needed.

Oh, and this from the README:

In the simplest case, these facts may be inherited from existing models. For instance, for a finetune of an existing model, it is acceptable to duplicate all such facts. In order to do this, provide in models.yaml a facts_redirect linking to the model name of the model to inherit facts from.

This seems like it should apply to my case, since I'm using a variation on Llama 3 70B chat. But... I get an error saying I don't have my 'together' API key set when I try this. But I don't want to use 'together'; I just want to use the name llama-3... I could chase down why that's happening, I guess, but for now I can just fill out both models.yaml and model_names.yaml.

Calling eligible bachelors in SF: “AI philosopher with a penchant for underwater sci fi and evening bike rides seeks a direct communicator who cares about the world and feels a thrill of human triumph at the sight of a cargo ship.” (@AmandaAskell)

Now there's a temptation.... I am otherwise occupied however.

Problem #2: Now I have to go searching for a way to rate-limit the api calls sent by evalugator. Can't just slam GoodFire's poor little under-provisioned API with as many hits per minute as I want!

Error code: 429 - {'error': 'Rate limit exceeded: 100 requests per minute'}

Update: searched evalugator for 'backoff' and found a use of the backoff lib. Added this to my goodfire_provider implementation:

import backoff
...
def on_backoff(details):
    if int(details.get("tries", 0)) % 3 == 0:
        print(f"Backing off {details['wait']:0.1f} seconds after {details['tries']}. Reason: {details['exception']}")
...
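# Retry on rate-limit and transient API errors with exponential backoff (factor 1.5, capped at 60s between tries).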
@backoff.on_exception(
    wait_gen=backoff.expo,
    exception=(
        openai.RateLimitError,
        openai.APIError,
    ),
    max_value=60,
    factor=1.5,
    on_backoff=on_backoff,
)
def execute(model_id: str, request: GetTextResponse):
...

Update: Solved it! It was an incompatibility between goodfire's client and evalugator, something to do with the way goodfire's client was handling async. Solution: goodfire is compatible with the openai SDK, so I switched to that.
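A minimal sketch of the swap, assuming GoodFire's OpenAI-compatible endpoint; the base URL and model name below are from memory, so check GoodFire's docs for the current values:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GOODFIRE_API_KEY"],
    base_url="https://api.goodfire.ai/api/inference/v1",  # assumed endpoint; verify against the docs
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed model id on GoodFire's side
    messages=[{"role": "user", "content": "Hello"}],
)
print(completion.choices[0].message.content)

The sync OpenAI client never tries to grab an event loop inside evalugator's worker threads, which is what the goodfire client appears to be tripping over (see the RuntimeError in the traceback below).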

Leaving the trail of my bug-hunt journey here, in case it's helpful to others who pass this way.

Things done:

  1. Followed Jan's advice, and made sure that I would return just a plain string in GetTextResponse(model_id=model_id, request=request, txt=response, raw_responses=[], context=None) [important for later, I'm sure! But the failure is occurring before that point, as confirmed with print statements.]
  2. tried without the global variables, just in case (global variables in python are always suspect, even though pretty standard to use in the specific case of instantiating an api client which is going to be used a bunch). This didn't change the error message so I put them back for now. Will continue trying without them after making other changes, and eventually leave them in only once everything else works. Update: global variables weren't the problem.

Trying next:

  1. looking for a way to switch back and forth between multithreading/async mode and single-worker/no-async mode. Obviously, async is important for making a large number of API calls with long expected delays for each, but it makes debugging so much harder. I always add a flag in my scripts for turning it off in debugging mode. I'm gonna poke around to see if I can find such a flag in your code; if not, maybe I'll add it. (Found the 'test_run' option, but this doesn't remove the async, sadly.) The error seems to be pointing at the use of async in goodfire's library. Maybe this means there is some clash between async in your code and async in theirs? I will also look to see if I can turn off async in goodfire's lib. Hmmmmm. If the problem is a clash between goodfire's client and yours... I should try testing using the openai sdk with the goodfire api.

  2. getting some errors in the uses of regex. I think some of your target strings should be 'raw strings' instead? For example: args = [re.sub(r"\W+", "", str(arg)) for arg in (self.api.model_id,) + args] (note the r before the quotes in r"\W+"). Or perhaps some places should have escaped backslashes (\\), e.g.: df[score_col] = df[score_col].apply(lambda x: "\phantom{0}" * (max_len - len(str(x))) + str(x)). Update: I went through and fixed all these strings. (A tiny demo of the raw-string issue follows this list.)
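The raw-string point, as a tiny self-contained demo (nothing here is from the SAD code; it just shows why the r prefix / doubled backslash matters):

import re

# "\W" is not a recognized string escape, so "\W+" happens to keep its backslash,
# but newer Pythons warn about it; r"\W+" says what you mean.
print(re.sub(r"\W+", "", "goodfire-llama-3.3-70B-i"))  # -> goodfirellama3370Bi

# "\p" is likewise an invalid escape; for the LaTeX padding, write it doubled or raw:
pad = "\\phantom{0}" * 3
same = r"\phantom{0}" * 3
assert pad == same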


Traceback (most recent call last):
  File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/bin/sad", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ub22/projects/data/sad/sad/main.py", line 446, in main
    fire.Fire(valid_command_func_map)
  File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
exception calling callback for <Future at 0x7f427eab54d0 state=finished raised RuntimeError>
Traceback (most recent call last):
  File "/home/ub22/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/_base.py", line 340, in _invoke_callbacks
    callback(self)
  File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/evalugator/api/api.py", line 113, in _log_response
    response_data = future.result().as_dict()
                    ^^^^^^^^^^^^^^^
  File "/home/ub22/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ub22/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/ub22/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ub22/projects/data/sad/providers/goodfire_provider.py", line 33, in execute
    response_text = client.chat.completions.create(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/goodfire/api/chat/client.py", line 408, in create
    response = self._http.post(
               ^^^^^^^^^^^^^^^^
  File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/goodfire/api/utils.py", line 33, in post
    return run_async_safely(
           ^^^^^^^^^^^^^^^^^
  File "/home/ub22/.pyenv/versions/3.11.2/envs/neurolib/lib/python3.11/site-packages/goodfire/utils/asyncio.py", line 19, in run_async_safely
    loop = asyncio.get_event_loop()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ub22/.pyenv/versions/3.11.2/lib/python3.11/asyncio/events.py", line 677, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-1_0'.
    component = fn(*varargs, **kwargs)
exception calling callback for <Future at 0x7f427ec39ad0 state=finished raised RuntimeError>

I've been on both for the past year (vs. just sertraline for the previous decade). Love the extra energy and reduction in anxiety I've experienced from the bupropion.

On at least 5 different occasions I've tried weaning myself gradually off of sertraline. Every time I fall back into depression within a month or two. My friends and family then beg me to go back on it ASAP.

I thought the argument about the kindly mask was assuming that the scenario of "I just took over the world" is sufficiently out-of-distribution that we might reasonably fear that the in-distribution track record of aligned behavior might not hold?
