pandelis — LessWrong

Do No Harm? Navigating and Nudging AI Moral Choices

by Sinem, pandelis, and Adam Newgas

TL;DR: How do AI systems make moral decisions, and can we influence their ethical judgments? We probe these questions by examining how Llama 70B (versions 3.1 and 3.3) responds to moral dilemmas, using the Goodfire API to steer its decision-making process. Our experiments reveal that simply reframing ethical questions - from "harm one...

Feb 6, 2025•11