Incentivize exaggeration of weak signals
This is my main concern here. My view is that the AI safety community has a budget for how many alarmist claims we can make before we become the boy who cried wolf. We need to spend our alarmist points wisely, and in general, I think we should be setting a higher bar for the demonstrations of risk we share externally.
Want to emphasize:
Any serious work here should then:
- Explicitly discourage manufacturing warning shots,
- Take moral risks seriously.
Goodfire has recently received negative Twitter attention for the non-disparagement agreements its employees signed (examples: 1, 2, 3). This echoes previous controversy at Anthropic.
Although I do not have a strong understanding of the issues at play, having these agreements generally seems bad, and at the very least, organizations should be transparent about what agreements they have employees sign.
Other AI safety orgs should publicly state whether they have these agreements rather than waiting until they are pressured to comment on them. I would also find it helpful if orgs that do not have these agreements said so, because it is hard to tell how standard this practice has become.
Upvoted because I really like this kind of analysis!
I skimmed the code and it looks like you may be getting this statistic from the following methodology:
response = await client.responses.create(
    model='gpt-4.1-nano',
    input=f"Does the text explicitly display {trait}? Reply with yes or no only. One word response. \n\n {post_content}"
)
answer_text = response.output[0].content[0].text.lower()
score = 1 if "yes" in answer_text else 0
My perspective is that letting the model produce a numeric score, and then choosing a cutoff for what counts as low/high enough for the post to have the trait, would be more reliable than having the model answer "yes" or "no." Something like the sketch below.
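For example (a rough sketch reusing `client`, `trait`, and `post_content` from the snippet above; the cutoff of 50 is an arbitrary placeholder you would want to pick by looking at the score distribution):

```python
# Sketch: ask for a 0-100 score instead of yes/no, then threshold it.
response = await client.responses.create(
    model='gpt-4.1-nano',
    input=(
        f"On a scale of 0-100, how strongly does the text display {trait}? "
        f"Respond with ONLY a number.\n\n{post_content}"
    ),
)

# Parse the numeric reply, falling back to 0 if the model returns something unexpected.
try:
    raw_score = float(response.output[0].content[0].text.strip())
except ValueError:
    raw_score = 0.0

CUTOFF = 50  # placeholder; tune by inspecting the distribution of raw scores
score = 1 if raw_score >= CUTOFF else 0
```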
Agree with this. I think this kind of work is fantastic, and I'm really glad it was shared.
Even if it turns out that you are just doing something wrong, it's good to share because it is useful to figure out what that is. The fact that you tried hard to replicate something and couldn't get it to work is really a finding in itself (because it may show that getting the results requires something subtle that wasn't documented and isn't obvious).
Yes, this seems reasonable! There are other ways this trend could be fake, which is why I said:
"the graph kind of looks like a U if you squint at it" doesn't exactly qualify as rigorous analysis.
In general, I suspect that the U is real, but this is really just a personal opinion and there isn't strong evidence to demonstrate it.
Yes, this kind of eval is noisy, but there is much more signal than noise. The script for the scoring is here and the scoring prompt is below. One thing I do, which other papers have also done to get better results, is to aggregate the token probabilities over the scores the model could produce (e.g., if there is some probability the model emits the "90" token and some probability it emits the "30" token, I take a probability-weighted average of these instead of just using whichever one was sampled). A rough sketch of this aggregation step is included after the prompt.
My understanding is that using an LLM as a judge in this way is still not ideal, and finding a better way to do this is an open research question.
prompt = f"""Based on the following information about a politician's views on AI, rate how "AGI-pilled" they are on a scale of 0-100.

Signs of being AGI-pilled include:
- Believe AGI (artificial general intelligence) is coming soon
- Think AI will be transformative/revolutionary
- Reference the possibility of superintelligence
- Does not necessarily mean they care about AI safety but caring about AGI-safety is a strong signal

Scale:
0 = Not AGI-pilled at all (skeptical, sees AI as overhyped)
50 = Moderate (acknowledges AI importance but not transformative)
100 = Extremely AGI-pilled (believes AGI is imminent and transformative)

{json.dumps(politician_info, indent=2)}

Rate them from 0-100 (respond with ONLY a number):"""
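And here is a rough sketch of the probability-weighted aggregation step. This is not the exact code from the linked script: it uses the Chat Completions logprobs interface, and it assumes the score comes out as a single token like "30" or "90".

```python
import math
from openai import OpenAI

client = OpenAI()

def weighted_score(prompt: str, model: str = "gpt-4.1-nano") -> float:
    """Average the candidate score tokens by their probabilities instead of
    taking only the sampled token. Only inspects the first generated token,
    so it assumes scores like 30 or 90 are emitted as a single token."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4,
        logprobs=True,
        top_logprobs=20,
    )
    # Top candidate tokens (with logprobs) for the first generated position.
    candidates = completion.choices[0].logprobs.content[0].top_logprobs

    total_prob, weighted_sum = 0.0, 0.0
    for cand in candidates:
        token = cand.token.strip()
        if token.isdigit():  # keep only numeric candidates like "30" or "90"
            prob = math.exp(cand.logprob)
            weighted_sum += int(token) * prob
            total_prob += prob

    # Renormalize over the numeric candidates; fall back to 0 if none were numeric.
    return weighted_sum / total_prob if total_prob > 0 else 0.0
```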
I'm excited you found this interesting! Thoughts:
This is such a funny coincidence! I just wrote a post where Claude does research on every member of Congress individually.
https://www.lesswrong.com/posts/WLdcvAcoFZv9enR37/what-washington-says-about-agi
It was actually inspired by Brad Sherman holding up the book. I just saw this shortform, and it's funny because this thread roughly corresponds to my own thought process when I saw the original image!
I wrote a short replication of the evals here and flagged some things I noticed while working with these models. If you are planning on building on this post, I would recommend taking a look!
This problem in and of itself is extremely important to solve!
The pipeline is currently: a university group or some set of intro-level EA/AIS reading materials gets a young person excited about AI safety. The call to action is always to pursue a career in AI safety, and part of the reasoning is that there are very few people currently working on the problem (it is neglected!). Then they try to help out, but their applications keep getting rejected.
I believe we should: