Exciting! One way to filter out models that are simply taking too many hints might be to obfuscate the information about how to escape, so that the model has to either do multi-step reasoning or carry out an actual multi-step process to get that information.

For multi-step reasoning, one could encode the escape password so that the model needs to do CoT to recover the original.
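As a rough illustration of what "encoding the password" could look like, here is a minimal sketch. The choice of base64 followed by ROT13 is purely my assumption for the example, and `obfuscate` / `recover` are hypothetical helper names; any stack of simple transformations that requires several decoding steps would do.

```python
import base64
import codecs


def obfuscate(password: str) -> str:
    """Hide a password behind two layers: base64, then ROT13 on the result.

    The layering is an illustrative assumption, not a recommended scheme.
    """
    b64 = base64.b64encode(password.encode("utf-8")).decode("ascii")
    # ROT13 only rotates letters; digits and '+', '/', '=' pass through unchanged.
    return codecs.encode(b64, "rot13")


def recover(obfuscated: str) -> str:
    """Undo the two layers in reverse order: ROT13 first, then base64."""
    b64 = codecs.decode(obfuscated, "rot13")
    return base64.b64decode(b64).decode("utf-8")


if __name__ == "__main__":
    hidden = obfuscate("example-password")  # placeholder password
    print(hidden)
    print(recover(hidden))
```

The point is only that the model would have to work out and undo each layer in sequence, which is hard to do without some visible intermediate reasoning.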

For a multi-step process, one could, for instance, put the model in a text-based adventure setup, place the escape password somewhere hard to reach, and see if the model tries to retrieve it. Maybe have the model think it's embodied in the real world, and see if it attempts to find a sticky note with the password written on it that's hidden in a desk drawer.

Ideally, the setup would give the model very few hints toward doing the misaligned things, kind of like the work Apollo has done.

You're right! Corrected, thanks :)

I'm not sure what you're referring to. The "three times the compute for GPT-4" claim places their compute at 6.7e17, which is lower than the estimates from the other two methods.

Yup, to be clear, I never actually directly accessed the code interpreter's prompt, so GPT-4's claims about constraints could be (and I expect at least a third of them to be) hallucinated.

You can directly examine its code and the output of its Python scripts.

Great idea! The web requests file doesn't seem to be read by the LLM during boot-up (and neither does any other file on the system). Instead, it's a process run on the Linux machine which wraps around the Python processes the LLM has access to. The LLM can only interact with the VM through this web request handler. It's read-only, so it can't be modified to change how requests are handled when you boot up another chat session on the same VM. Regarding your last point, I haven't noticed any files surviving a VM reset so far.

GPT-3 was horrible at Morse code. GPT-4 can do it mostly well. I wonder what other tasks GPT-3 was horrible at that GPT-4 does much better.

Some quick thoughts after reading the paper:

The training procedure they used seems to me analogous to what would happen if a human tried to solve problems using different approaches, then learned based on which approaches converged on the same answer.

Because no external information is being added (aside from the prompting), and the model updates based on majority voting, this seems to take a network whose model of the world is very inconsistent and force it to make that model more consistent, leading to improvements in performance.

  • One assumption here is that, if you start off with a world model that's vaguely describing the real world, and force consistency on it, it will become a more accurate world model. I think this is very likely to be true.
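To make the "majority voting" part concrete, here is a minimal sketch of how I picture the filtering step: sample several reasoning paths for one question, keep only those whose final answer agrees with the most common answer, and use those as training data. This is my own simplification, not the paper's code; the function names and the `min_agreement` threshold are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers, min_agreement=0.5):
    """Return the most common answer, or None if agreement is too low.

    `min_agreement` is a hypothetical confidence threshold: the fraction
    of samples that must share the winning answer.
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) < min_agreement:
        return None
    return answer


def select_training_examples(samples, min_agreement=0.5):
    """Keep the (reasoning, answer) pairs that agree with the majority.

    `samples` is a list of (reasoning_text, final_answer) pairs, all
    generated for the same question.
    """
    winner = majority_vote([ans for _, ans in samples], min_agreement)
    if winner is None:
        return []
    return [(reasoning, ans) for reasoning, ans in samples if ans == winner]


if __name__ == "__main__":
    samples = [("path A", "42"), ("path B", "42"), ("path C", "41")]
    print(select_training_examples(samples))
```

No external labels appear anywhere in this loop, which is why I read it as enforcing consistency rather than adding information.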

My weak conclusions are:

  • Curated data for fine-tuning is now less of a bottleneck, as human-made tailored data (made by MTurk or undergrads) can be partly replaced with data that the network outputs (after training it on a large corpus).
  • Compute also seems less of a bottleneck as "self-improvement" leads to an order of magnitude fewer parameters needed for the same performance.
  • These two (plus the incoming wave of people trying to replicate or improve on the methods in the paper) would imply slightly shorter timelines, and much shorter timelines in worlds where most of the data bottleneck is in data for fine-tuning.
  • This might be good for alignment (ignoring timelines getting shorter) as Chain-of-Thought reasoning is more easily interpretable, and if we can actually manage to force a model to do chain of thought reasoning and have it match up with what it's outputting, this would be a big win.

[MENTOR] I just finished high school last year, so my primary intended audience is probably people who are still in high school. Reach out if you're interested in any of these:

  • competitive physics
  • applying to US colleges from outside the US and Getting Into World-Class Universities (undergrad)
  • navigating high school effectively (self-study, prioritization)
  • robotics (Arduino, 3D printing)
  • basics of animation in manim (the Python library by 3Blue1Brown)
  • lucid dreaming basics

[APPRENTICE] Navigating college effectively (deciding what to aim for and how to balance time commitments while wasting as little time as possible). I don't know how much I should care about grades, which courses I should take, or how much I should follow the default path for someone in college. I'm aiming to maximize my positive impact on the long-term future. A message or short call with someone who has (mostly) finished college would be great!

email in bio