Igor Ivanov

Comments
Can Models be Evaluation Aware Without Explicit Verbalization?
Igor Ivanov · 4d

Thank you!
Can Models be Evaluation Aware Without Explicit Verbalization?
Igor Ivanov · 5d

I'm a bit confused. In your original paper, the model seems to use type hints much more frequently (fig. 3 of the paper), but in this post the figure shows much less frequent use of type hints. Why?
Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals
Igor Ivanov · 8d

Ryan, did anything come out of your synthetic input generation project proposal?
Your LLM-assisted scientific breakthrough probably isn't real
Igor Ivanov · 2mo

I use LLMs for brainstorming in my research, and I often find it annoying how sycophantic they are. I have to explicitly tell them to criticize my ideas to get any value out of such brainstorming.
60 U.K. Lawmakers Accuse Google of Breaking AI Safety Pledge
Igor Ivanov · 3mo

Awesome to see that. We need more people reaching out to politicians and lobbying, and results like this show that the strategy has an effect. I would be curious to see how PauseAI was able to achieve such wide support.
Will Any Crap Cause Emergent Misalignment?
Igor Ivanov · 3mo

I wonder if this paper could be considered a scientific shitpost (deliberate pun). It kind of is, and the fact that it contributes to the field makes it even funnier.
Harmless reward hacks can generalize to misalignment in LLMs
Igor Ivanov · 3mo

Interesting paper. There is evidence that LLMs are able to distinguish realistic environments from toy ones, so it would be interesting to see whether the misalignment learned on your training set transfers to complex, realistic environments.

Also, it seems you didn't use frontier models in your research, and in my experience results from non-frontier models don't always scale to frontier ones. It would be cool to see the results for models like DeepSeek V3.1.
On closed-door AI safety research
Igor Ivanov · 3mo

This was written in the Claude 4 system card, so it made sense to test Claude and not other LLMs.
On closed-door AI safety research
Igor Ivanov · 3mo

When Anthropic published their reports on Claude blackmailing people and contacting authorities if the user does something Claude considers illegal, I saw a lot of news articles vilifying Claude. It was sad to see Anthropic punished for being the only organization openly publishing such research.
Claude, GPT, and Gemini All Struggle to Evade Monitors
Igor Ivanov · 3mo

Thanks for the clarification. It seems I got a bit confused.
No wikitag contributions to display.
Posts

Igor Ivanov's Shortform (5 karma, 3mo, 5 comments)
Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs [Ω] (15 karma, 2mo, 0 comments)
LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance [Ω] (29 karma, 4mo, 8 comments)
I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment (31 karma, 6mo, 0 comments)
It's hard to make scheming evals look realistic for LLMs (150 karma, 6mo, 29 comments)
LLMs can strategically deceive while doing gain-of-function research (36 karma, 2y, 4 comments)
Psychology of AI doomers and AI optimists (4 karma, 2y, 1 comment)
5 psychological reasons for dismissing x-risks from AGI (24 karma, 2y, 6 comments)
Let's talk about Impostor syndrome in AI safety (30 karma, 2y, 4 comments)
Impending AGI doesn’t make everything else unimportant (29 karma, 2y, 12 comments)