I'm accumulating a to-do list of experiments much faster than my ability to complete them:
If you wanted to take one of these and run with it or a variant, I wouldn't mind!
The unifying theme behind many of these is goal agnosticism: understanding it, verifying it, maintaining it, and using it.
Note: I've already started some of these experiments, and I will very likely start others soon. If you (or anyone reading this, for that matter) see something you'd like to try, we should chat to avoid doing redundant work. I currently expect to focus on #4 for the next handful of weeks, so that one is probably at the highest risk of redundancy.
Further note: I haven't done a deep dive on all relevant literature; it could be that some of these have already been done somewhere! (If anyone happens to know of prior art for any of these, please let me know.)
I'm not familiar enough with agent foundations to provide very detailed object-level advice, but I think it would be hugely valuable to empirically test agent foundations ideas in real models, with the understanding that AGI doesn't necessarily have to look like LMs, but any theory of intelligence has to at least fit both LMs and AGI. As an example, we might believe that LMs don't have goals in the same sense that AGI eventually will, but then we can ask why LMs still seem able to achieve goals at all; perhaps through empirical investigation of LMs we can better understand the nature of goal-seeking. I think this would be much, much more valuable than generic LM alignment work.
If I were in your position, I would work on the ideas described in my post How to Control an LLM's Behavior and the paper Pretraining Language Models with Human Preferences that inspired it.
The paper's results show the approach is very effective; my post discusses how to make it very controllable and flexible, and it has the particular advantage that, since it's applied at pretraining time, it can't easily be fine-tuned away out of an open-source model. (Admittedly, the latter might do more for your employability at Meta FAIR Paris or Mistral than at DeepMind; but then, which of those seems like the higher x-risk to address?)
I am starting a PhD in computer science, and so far I have been focusing on agent foundations, which is great. I intend to continue devoting at least half my time to agent foundations.
However, for several reasons, it seems to be important for me to do some applied work, particularly with LLMs:
Now, I've spent the last couple of years mostly studying AIXI and its foundations. I'm pretty comfortable with standard deep learning algorithms and libraries, and I have some industry experience with machine learning engineering, but I am not an expert on NLP, LLMs, or prosaic alignment. Therefore, I am looking for suggestions from the community about LLM-related research projects that would satisfy as many as possible of the following criteria:
Any suggestions are appreciated. I may also link this question to a Manifold market in the future (probably "conditional on working full time at DeepMind within 18 months of graduation, which areas of research did my PhD thesis include") or something along those lines. Thanks!