Jiji
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Jiji · 1y

Great work! I'm citing your paper in my current project. I'm still wondering, though, how this simple lie-detection classifier could be integrated into current AI practice to improve language models. To the best of my knowledge, OpenAI takes its own approaches to AI alignment, such as RLHF and reward modelling. If you have thought about this, I'd love to hear your ideas or your roadmap for future work.
