LESSWRONG

ChengCheng

Comments
Speed running everyone through the bad alignment bingo. $5k bounty for a LW conversational agent
ChengCheng, 2y

First of all, thank you @ArthurB for offering this bounty and raising awareness of the need for quality educational resources on AI alignment! We are particularly grateful to those who mentioned the Stampy project and to the people who have reached out offering to help with our efforts. Our submission, https://chat.stampy.ai/, is a very early prototype focused primarily on summarizing and synthesizing information from our own database of FAQs, along with selected documents from the alignment research dataset. The conversational feature still requires considerable work. Nevertheless, we would love input and feedback to further develop this tool for anyone seeking to better understand or contribute to AI safety. This would not have been possible without the support of our volunteers and collaborators. We welcome all who are interested in using AI to advance alignment.

Posts

Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion (karma 17, 1mo, 2 comments)
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google (karma 37, Ω, 5mo, 0 comments)
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning (karma 18, Ω, 8mo, 0 comments)
Pacing Outside the Box: RNNs Learn to Plan in Sokoban (karma 59, Ω, 1y, 8 comments)
Does robustness improve with scale? (karma 14, Ω, 1y, 0 comments)
VLM-RM: Specifying Rewards with Natural Language (karma 20, Ω, 2y, 2 comments)
Uncovering Latent Human Wellbeing in LLM Embeddings (karma 32, Ω, 2y, 7 comments)