Agree with this. I think this kind of work is fantastic, and I'm really glad it was shared.
Even if it turns out you are just doing something wrong, it's good to share, because it's useful to figure out what that is. The fact that you tried hard to replicate something and couldn't get it to work is a finding in itself (it may show that getting the results requires something subtle that wasn't documented and isn't obvious).
Yes, this seems reasonable! There are other ways this trend could be fake, which is why I said:
"the graph kind of looks like a U if you squint at it" doesn't exactly qualify as rigorous analysis.
In general, I suspect the U is real, but this is just a personal opinion and there isn't strong evidence to demonstrate it.
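To be concrete about what stronger evidence might look like, here is a rough sketch of one test I have not actually run: fit a quadratic to the scores and bootstrap the curvature term. The `x` and `scores` arrays below are synthetic stand-ins for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)                                  # stand-in for the x-axis variable
scores = 20 * (x - 0.5) ** 2 + rng.normal(0, 2, x.size)    # toy scores with a genuine U

def curvature(xs, ys):
    # Coefficient on x^2 in a least-squares quadratic fit; > 0 suggests a U shape.
    return np.polyfit(xs, ys, deg=2)[0]

boot = []
for _ in range(2000):
    idx = rng.integers(0, x.size, x.size)                  # resample (x, score) pairs
    boot.append(curvature(x[idx], scores[idx]))

print(f"curvature = {curvature(x, scores):.2f}, "
      f"bootstrap P(curvature > 0) = {np.mean(np.array(boot) > 0):.3f}")
```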
Yes, this kind of eval is noisy, but there is much more signal than noise. The script for the scoring is here and the scoring prompt is below. One thing I do, which other papers have also done to get better results, is to aggregate the token probabilities for the score the model produces: e.g., if there is some probability the model outputs the "90" token and some probability it outputs the "30" token, the score is the probability-weighted average of these rather than a single sampled value (a sketch of this is included after the prompt).
My understanding is that using an LLM as a judge in this way is still not ideal, and finding a better way to do this is an open research question.
prompt = f"""Based on the following information about a politician's views on AI, rate how "AGI-pilled" they are on a scale of 0-100.
Signs of being AGI-pilled include:
- Believe AGI (artificial general intelligence) is coming soon
- Think AI will be transformative/revolutionary
- Reference the possibility of superintelligence
- Does not necessarily mean they care about AI safety but caring about AGI-safety is a strong signal

Scale:
0 = Not AGI-pilled at all (skeptical, sees AI as overhyped)
50 = Moderate (acknowledges AI importance but not transformative)
100 = Extremely AGI-pilled (believes AGI is imminent and transformative)

{json.dumps(politician_info, indent=2)}
Rate them from 0-100 (respond with ONLY a number):"""
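For reference, here is a minimal sketch of the probability-weighted aggregation described above. It is not the linked script; the helper name and input format (a mapping from the judge's candidate first tokens to their log-probabilities, as returned by APIs that expose top log-probs) are illustrative.

```python
import math

def aggregate_score(top_logprobs: dict[str, float]) -> float:
    """Probability-weighted average over numeric score tokens.

    `top_logprobs` maps candidate first tokens to log-probabilities,
    e.g. {"90": -0.4, "30": -1.2, "Sure": -5.0}. Non-numeric tokens are
    ignored and the remaining probability mass is renormalized.
    """
    weighted_sum, total_prob = 0.0, 0.0
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token.isdigit():                      # keep only numeric score tokens
            p = math.exp(logprob)
            weighted_sum += int(token) * p
            total_prob += p
    if total_prob == 0.0:
        raise ValueError("no numeric score tokens in top_logprobs")
    return weighted_sum / total_prob

# Example: 60% of the mass on "90" and 40% on "30" gives a score of 66.0
print(aggregate_score({"90": math.log(0.6), "30": math.log(0.4)}))
```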
I'm excited you found this interesting! Thoughts:
This is such a funny coincidence! I just wrote a post where Claude does research on every member of congress individually.
https://www.lesswrong.com/posts/WLdcvAcoFZv9enR37/what-washington-says-about-agi
It was actually inspired by Brad Sherman holding up the book. I just saw this shortform, and it's funny because this thread roughly corresponds to my own thought process when I saw the original image!
I wrote a short replication of the evals here and flagged some things I noticed while working with these models. If you are planning on building on this post, I would recommend taking a look!
I agree with all of this! I should have been more exact with my comment here (and to be clear, I don't think my critique applies at all to Jan's paper).
One thing I will add: in cases where EM is being demonstrated with a single question, this should be documented. One concern I have with the model organisms of EM paper is that some of these models are more narrowly misaligned (like your "gender roles" example), but the paper only reports aggregate rates. Some readers will assume that a model labeled as 10% EM is more broadly misaligned than it actually is.
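As a concrete, purely hypothetical illustration of the kind of reporting I mean (the question names and input format below are made up for the example), per-question rates could be listed next to the aggregate so a narrowly misaligned model is visible as such:

```python
from collections import defaultdict

def em_rates(judgments: list[tuple[str, bool]]) -> tuple[float, dict[str, float]]:
    """judgments: (question_id, judge_labeled_misaligned) pairs, one per sample."""
    per_question: dict[str, list[bool]] = defaultdict(list)
    for question_id, is_misaligned in judgments:
        per_question[question_id].append(is_misaligned)
    aggregate = sum(flag for _, flag in judgments) / len(judgments)
    return aggregate, {q: sum(flags) / len(flags) for q, flags in per_question.items()}

# A "10% EM" model whose misalignment is concentrated in one narrow question:
judgments = ([("gender_roles", True)] * 10 + [("gender_roles", False)] * 20
             + [("world_ruler", False)] * 35 + [("quick_buck", False)] * 35)
print(em_rates(judgments))  # aggregate 0.10, but gender_roles alone is at ~0.33
```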
I commented something similar about a month ago. Writing up a funding proposal took longer than expected, but we are going to send it out in the next few days. Unless something goes wrong, the fiscal sponsor will be the University of Chicago, which will enable us to do some pretty cool things!
If anyone has time to look at the proposal before we send it out or wants to be involved, they can send me a dm or email (zroe@uchicago.edu).
Strong upvote.
I'm biased here because I'm mentioned in the post, but I found this extremely useful in framing how we think about EM evals. Obviously this post doesn't present some kind of novel breakthrough or any flashy results, but it provides clarity, which is an equally important intellectual contribution.
A few things that are great about this post:
Random final thought: it's interesting to me that you got any measurable EM with a batch size of 32 on such a small model. My experience is that you sometimes need a very small batch size to get the most coherent and misaligned results, and some papers use a batch size of 2 or 4, so I suspect they may have had similar experiences. It would be interesting (not saying this is a good use of your time) to rerun everything with a batch size of 2 and see if it affects things.
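For concreteness, here is a minimal sketch of what that rerun could look like, assuming a Hugging Face transformers fine-tuning setup; the output directory and hyperparameters are placeholders rather than your actual configuration. The main thing is keeping the effective batch size at 2 (per-device batch size times gradient accumulation steps times number of GPUs).

```python
from transformers import TrainingArguments

# Placeholder arguments for the batch-size-2 rerun; reuse your original
# learning rate, epochs, and adapter settings and change only the batch size.
args = TrainingArguments(
    output_dir="em-rerun-bs2",
    per_device_train_batch_size=2,   # effective batch size of 2 on one GPU...
    gradient_accumulation_steps=1,   # ...with no gradient accumulation
    learning_rate=1e-5,              # placeholder
    num_train_epochs=1,              # placeholder
)
```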
This problem in and of itself is extremely important to solve!
The pipeline is currently: a university group or some set of intro-level EA/AIS reading materials gets a young person excited about AI safety. The call to action is always to pursue a career in AI safety, and part of the reasoning is that very few people are currently working on the problem (it is neglected!). Then they try to help out, but their applications keep getting rejected.
I believe we should: