Great post! Nice to see something constructive! And half your citations are new to me. Thank you for sharing.
I have spent the last few months reinventing the wheel with LLM applications in various ways. I've been using my own code assistant for about 7 months. I built an AlphaEvolve-style system for generating RL code that learns Atari. Last year I was trying some fancy retrieval over published/patented electronics circuits. I did some activation steering and tried my hand at KellerJordan/modded-nanogpt for a day. Of course, before that I was at METR helping them set up their evals.
It hadn't occurred to me to try to draw any conclusions from all this different work, and I didn't really think of it as inter-related in any significant way, or as relevant experience for much of anything, but your topic here is making me think...
Almost every "optimizing" system I make ends up breaking/glitching/cheating the score function. Then I patch and patch until it works, and by then it looks more like a satisficer.
Getting something really useful seems to take about a month of corrections like this. It looks done/working on the first day; I notice something broken, fix it, and declare it done again on the second day; and so on, until after a month I just don't have any more corrections to make. This is different from, say, a web app or game, which I never run out of todo items for. Of course, when LLMs are involved you have to look three times more carefully to be sure you are measuring what you mean to be measuring.
My point is that I expect projects fitting your description here to basically actually work and be worthwhile, but if it is your (speaking to the anonymous reader) first time doing this, expect that you'll spend 10x as long correcting/improving/balancing scores & heuristics as you'll spend on the core functionality.
As you stated in the post, that's not so different from the process used to make AI assistants (etc) in general.
Making my own AI tools has definitely given some depth/detail to all the theoretical problems I've been reading about and talking about all these years. In particular, it's impressive how long my tools have tricked me at times. It is possible I am still being tricked right now.
More people should probably be thinking about research automation. If automating research is feasible prior to creating ASI it could totally change the playing field, vastly accelerating the pace of progress and likely differentially accelerating certain areas of research over others. There's a big risk, though, that AI capabilities research might be much easier to automate than safety research. One reason this could be the case is that it's much harder to verify that safety research is actually valuable, since we can't safely try out our techniques on an ASI. A second reason is that alignment research might be more strongly bottlenecked on conceptual breakthroughs than capabilities research, and getting such breakthroughs out of an AI seems likely to be significantly harder than automating the "normal science" needed to advance AI capabilities.
I want to make a fairly narrow argument, that AI safety research isn't so drastically different from research in other fields: other fields also have difficulty verifying the value of their work, and many other fields are also bottlenecked on conceptual breakthroughs. These difficulties may be more extreme in AI safety than in some other fields, but they're almost always present, even in ML. Because of this, I expect the big AI labs to put considerable effort into figuring out how to train automated researchers despite these difficulties, raising the chances that we'll be able to automate significant amounts of safety research. I wouldn't say that this makes me hopeful, exactly: the AI labs could very well fail to solve these problems effectively before creating ASI. But it does make me slightly more hopeful.
I think generally there are two different axes along which we evaluate the impact of research: impact on other researchers, and impact on broader society. I'll call these "scientific" and "practical" impact, respectively. Scientific impact is essentially a measure of how much other impactful research you enable. Papers with no scientific impact don't get cited; papers with some impact might enable a few follow-up papers; more impactful papers might introduce new approaches, enable us to answer novel questions, or open up whole new sub-fields of research. The foundational papers of string theory might be examples of research with high scientific impact but no practical impact; a clinical trial for a scientifically well-understood drug might be an example of research with a large practical impact but minimal scientific impact. In practice, of course, most research will have a mix of the two, and they might be hard to disentangle.
I think accurately evaluating research along either of these axes is just really hard in general. To evaluate the scientific value of a paper we rely on the judgment of experienced researchers in the short term, and in the long term we just wait and see what subsequent research is produced. The former is fallible, while the latter takes a very long time. The process is also at risk of circularity, where communities of researchers nerd-snipe each other into increasingly esoteric and useless rabbit-holes.[1] Evaluating the practical impact of some research is often easier, but it can still be fraught. Unless you're working directly on applications, most research just takes a long time to reach the point where it has any practical use. And even when it does, there are plenty of examples of applied, empirical research that ends up having nowhere near the impact people originally expect (or even having a negative impact), because the world is complicated and you are not measuring what you think you are measuring.[2] Along both axes, the only really foolproof way to judge the value of some work is to just wait a while and see what comes of it.
In AI safety, it's only the end goal -- whether we can safely build ASI -- that's extremely hard to verify. Every other level of impact, along both axes, is not any harder to evaluate than in other fields. It's certainly no harder to judge a paper purely on the level of technical correctness. We can also judge with no more than the usual amount of difficulty how much follow-up work a paper might be expected to produce. On the practical side, it's not any harder than in other fields to tell whether a practical technique or tool is useful according to the assumptions and threat model of a particular sub-field. It's only the very last step, judging whether those assumptions and threat models are valid and whether the research will actually move the needle on safety, that's more difficult than in other fields. For instance, we can see that the AI Control paper was impactful because it spawned a new sub-field of research, while still being uncertain about whether it'll actually help reduce the probability of catastrophe.
Now, you can certainly complain that it doesn't matter how "impactful" something looks if it doesn't actually help us survive. But I think this doesn't really change the prospects of automating research. If we want to train a model to do good research, we can't judge the outputs of the model on the basis of its long-term impact: we'd need shorter feedback loops for training. We won't be able to stick "does this solve alignment" in an objective function, but we also can't stick "does this cure Alzheimer's" (for instance) in an objective function, because the timescale over which we'd have to judge that is just too long. So at least when it comes to training an automated researcher, AI safety doesn't seem much worse off to me than any other field.
It's certainly possible that we'll end up following our automated researchers down useless rabbit-holes because it's too hard to judge whether something actually helps with safety. But many other fields have this problem too, if perhaps not as severe as in AI safety, because the feedback loops are so long. (It's probably even worse in pure math: the field doesn't even have an end goal to work towards, and I don't think they pay much attention to what the applied mathematicians do with their results.) And also, the same danger applies to human AI safety researchers. If your position is that AI safety is almost impossible to evaluate in principle, and that we shouldn't build an ASI until we have a full mathematical proof of its safety -- fair enough, but that's not an argument against research automation in particular.
That's all well and good, but maybe we will have the short-term feedback loops we need to train automated AI capabilities researchers? After all, capabilities research is just "line go up," which should be easy to evaluate.
Well, yes and no: the important question is not "did line go up" but "which line went up."[3] Even in ML, it's very easy to fool yourself and others (at least for a time) about the quality of your work: you might accidentally or "accidentally" train on the test set, overfit to a dataset, or pick a benchmark that makes your method look good but doesn't translate into useful performance in practice. We see these failure modes all the time with LLMs, for instance with ChatGPT's sycophancy, or with Llama 4 and Chatbot Arena. Ultimately, the only real measure of success for a model or new technique is whether other people use it or build on it in the long run. Evaluating work might be easier in ML than in most other fields, because the feedback loops are tighter, and you can fairly easily test something out on a range of different benchmarks to get a more holistic sense of how it performs. But it's still difficult; so if the AI labs want to automate capabilities research and aren't content with just making better coding agents, they'll have to address this issue.
A somewhat different argument I've heard about the difficulty of automating safety research is that AI safety is strongly bottlenecked on good conceptual thinking or taste, and that this will be hard to automate due to the long feedback loops needed. I think AI safety might be more bottlenecked on conceptual innovations than most other fields, but it's certainly not unique. Neuroscience, for instance, is often described as being "data rich but theory poor," with many calls for better theoretical frameworks to handle the "sea of data" being collected.
But more generally, I don't think good conceptual thinking is really confined to certain fields, or to certain types of research. Regardless of the field, I think in most cases the limiting factor to doing really impactful research is not in the technical work -- writing proofs or code, running experiments, etc. -- it's in coming up with the right questions to ask, or the right angle to approach a problem.[4] Plenty of research gets published that, while technically correct and maybe even technically impressive, is just generally useless and unimpactful (try asking most PhDs about their thesis). This includes ML: you need good research taste to pick the right metric to optimize. So I think the kind of skill you need to come up with really paradigm-shifting ideas is pretty contiguous with the kind of skill you need to do good research at any level, in any field: it's mostly a difference of degree, not kind.
If models don't improve much at ideation, research taste, and good conceptual thinking, they do seem likely to accelerate capabilities research somewhat more than safety. But even if the AI companies don't care about automating AI safety, they'll still have an incentive to solve these problems, because they show up in many domains. And I think there's a good chance that whatever techniques they come up with will let us automate safety research too.
What might research automation look like in practice? How might the AI labs try to train automated researchers?
One possibility would be something like RLHF: train a reward model on human judgments of research quality, then train a model to maximize the score it gets from the reward model. This probably doesn't go particularly well, in any field: you'll get outputs that might look good to the reward model (and maybe to you) but don't actually end up being useful, i.e. slop. (But again, I don't see it being any worse in AI safety than in other fields.)
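To make the shape of that recipe concrete, here's a minimal toy sketch of the two-stage setup: fit a reward model to pairwise human judgments, then nudge a "policy" to maximize the reward model's score. Everything here is an illustrative placeholder (the toy embeddings, the tiny linear models, the hyperparameters), not a real research-generating system.

```python
# Toy sketch of the RLHF-style recipe above; all tensors, model sizes, and
# hyperparameters are illustrative placeholders, not a real system.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM = 16  # pretend each research output is embedded in a 16-dim vector

# Stage 1: fit a reward model to pairwise human judgments of research quality.
reward_model = nn.Linear(DIM, 1)
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
chosen = torch.randn(64, DIM) + 0.5    # outputs humans judged better
rejected = torch.randn(64, DIM) - 0.5  # outputs humans judged worse
for _ in range(200):
    margin = reward_model(chosen) - reward_model(rejected)
    loss = -F.logsigmoid(margin).mean()  # Bradley-Terry pairwise loss
    rm_opt.zero_grad(); loss.backward(); rm_opt.step()

# Stage 2: train a "policy" to maximize the frozen reward model's score.
policy = nn.Linear(DIM, DIM)  # maps a prompt embedding to an output embedding
p_opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
for p in reward_model.parameters():
    p.requires_grad_(False)
for _ in range(200):
    prompts = torch.randn(32, DIM)
    outputs = policy(prompts)
    # The policy optimizes the reward model's opinion, not actual usefulness,
    # which is exactly how you end up with well-scored slop.
    p_loss = -reward_model(outputs).mean()
    p_opt.zero_grad(); p_loss.backward(); p_opt.step()
```

The failure mode is baked into the structure: nothing in stage 2 ever consults a human again, let alone reality.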
What about outcome-based RL? We currently use this to train models on tasks we can easily verify, but doing this for entire research projects seems very hard: for almost all research, the time from ideation to actual real-world impact is way too long to package in an RL loop. You can't wait for a drug to go through clinical trials before applying a gradient update, for instance. And even in ML, you can't wait around for the next frontier training run to check whether the techniques you've come up with actually improve model quality in practice: you need to use more easily verifiable proxies of quality, and that introduces the risk of Goodharting on those proxies.
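For contrast, here's the same kind of toy sketch for the outcome-based version: a REINFORCE-style update against a cheap, verifiable proxy metric. Again, every name, shape, and the proxy itself are placeholder assumptions; the point is structural, namely that the long-run outcome we actually care about never appears anywhere in the loop, only the proxy does.

```python
# Toy sketch of outcome-based RL against a verifiable proxy; names, shapes,
# and the proxy function are illustrative assumptions only.
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM = 16

def proxy_reward(outputs: torch.Tensor) -> torch.Tensor:
    # A cheap, automatically checkable stand-in (e.g. a benchmark score).
    # The real quantity of interest (long-run research impact) is not here.
    return outputs[:, 0]

policy = nn.Linear(DIM, DIM)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for _ in range(200):
    prompts = torch.randn(32, DIM)
    dist = torch.distributions.Normal(policy(prompts), 1.0)
    outputs = dist.sample()              # stochastic "research outputs"
    rewards = proxy_reward(outputs)      # graded only by the proxy
    log_probs = dist.log_prob(outputs).sum(-1)
    # REINFORCE with a mean baseline: raise the log-prob of above-average outputs.
    loss = -(log_probs * (rewards - rewards.mean())).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```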
If the best we can do is outcome-based RL in some narrow domains where we can easily check output quality, I don't think we'll get very useful autonomous AI researchers. I expect humans would still largely guide the research process, telling AI assistants what questions to try to answer or what metrics to optimize. This would probably accelerate AI capabilities somewhat more than safety, but even ignoring safety, we'd be leaving a lot of value on the table. So I think there would be a strong incentive to go beyond this.
I think it's likely possible to build dangerous ASI while still only using this sort of simple outcome-based RL. AIs might just keep getting better at optimizing narrow metrics until one of them realizes it could optimize metrics better by tiling the galaxy with GPUs, or something. But I think if the AIs only get practice optimizing narrow metrics and don't have training for research taste or other skills needed to initiate and guide a research project, there's a decent chance this raises the bar. In other words, it seems plausible that, in order to get an AI capable of taking over the world without experience doing the kinds of thinking needed for really good research, you'd need a significantly bigger model than you would if it had such experience.
What might a solution to building good automated researchers actually look like? I don't have a concrete answer, and if I did I'm not sure I'd be writing about it here! But I want to make the case that it's probably not impossible, despite the fact that we lack a good objective function to optimize.[5] This is mainly because reward is not the optimization target. To train an AI to do good research, we shouldn't necessarily imagine designing a function that takes in a research output like a paper and has a global maximum at the best possible paper, optimizing that, and then despairing because we don't have a non-Goodhartable way of judging research outputs. Rather, we should imagine trying to shape the AI's cognition to perform mental motions that are useful for research, and not to perform detrimental ones: to check its work, consider alternate approaches, make predictions, and examine its assumptions; to not just try to find the "right answer" or focus on passing the test. It's a problem of generalization: using some limited environments and datasets to instill behaviors that will generalize to doing good research in areas we can't easily judge.

Broadly speaking, the fact that we can sometimes do this kind of generalization is why ML is useful at all. For instance, we don't really have a single reward function for "have good conversations with people"; we have a weird combination: pre-training plus SFT plus RLHF, with the RL prevented from drifting too far from the pre-trained prior because otherwise it goes off the rails, etc. Obviously this has its problems, but it works well enough most of the time. So maybe, if we do get good automated researchers, they'll be trained in a similarly convoluted manner, stacking together a couple of different training procedures to get something that works well enough. Again, I don't think this is easy, but it doesn't seem unsolvable.
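As one concrete instance of that stacking, the usual trick for keeping the RL stage anchored to the pre-trained prior is a KL penalty on the reward. A minimal sketch, with hypothetical tensor names and an arbitrary beta:

```python
# Sketch of a KL-penalized reward, the standard way RLHF-style training keeps
# the policy from drifting too far from the pre-trained/SFT reference model.
# Tensor names and beta are illustrative assumptions.
import torch

def kl_penalized_reward(
    reward: torch.Tensor,       # reward-model score per sample, shape (batch,)
    logp_policy: torch.Tensor,  # log-probs of sampled tokens under the RL policy, (batch, seq)
    logp_ref: torch.Tensor,     # log-probs of the same tokens under the frozen reference model
    beta: float = 0.1,
) -> torch.Tensor:
    # Per-sample estimate of KL(policy || reference) from the sampled tokens;
    # subtracting it penalizes going "off the rails" of the prior.
    kl_estimate = (logp_policy - logp_ref).sum(dim=-1)
    return reward - beta * kl_estimate
```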
I've argued that some of the reasons people have for thinking AI safety might be particularly hard to automate aren't actually so unique to AI safety. It's difficult to evaluate impact in any field, and good conceptual thinking and taste are needed for almost all research. So to unlock most of the value of automating research, AI companies will have to find ways of training automated researchers despite these difficulties. There's no guarantee that they'll solve them in time: it's certainly possible that simple scaling (or good old-fashioned human research) gets us to ASI before we figure out how to get useful research out of weaker AIs. But I don't think the problems are impossible in principle, and I expect the AI labs will have a pretty strong incentive to solve them. This makes me somewhat less worried about not getting a chance to automate significant amounts of safety research before ASI, or about the gap between automated capabilities research and automated safety research growing too large. I'm still pretty worried -- there's a lot of uncertainty about how things might play out. But, somewhat less worried than I would be otherwise.
[1] My impression is that some people think string theory is an example of this. I don't know enough physics to have an opinion on the matter.
[2] Leaded gasoline, CFCs, ivermectin, and all the non-replicable work in psychology are some examples.
[3] You'll notice in the diagram above that both the x-axis and the y-axis are labeled with "layers," making this a prime example of optimizing the wrong metric.
[4] This isn't to downplay the importance of empiricism, good execution, and generally making contact with reality: it's often in the course of running experiments or tinkering with a problem that we come up with new ideas, and it's often hard to judge the value of an idea until we try to implement it.
[5] The fact that we can train human researchers without waiting for them to write and get feedback on hundreds of different papers shows that this is possible in principle, although the unreliability of producing good human researchers does point to its difficulty.