Rhianna Litchfield

3mo

One-Pager Brief on Pangram Labs

Rhianna Litchfield 16d 10

Do we know how large the sample size is for these percentages?

Independent alignment of language models

Rhianna Litchfield 16d 10

My first thought was what could be done with an agent that was trained to be immoral. Could an immoral agent be retrained under this framework to become amoral, and then to become a moral agent? That would do wonders against bad actors.

Open Thread Summer 2026

Rhianna Litchfield 19d 10

Hi, I'm relatively new to the forum. I learned about it a few months ago, and I'm hoping to get fully involved now.

I'm a Trust & Safety practitioner who recently pivoted to focus on AI safety. I've been familiarizing myself with basic concepts in AI safety such as sycophancy, anti-bias, steering, supervised fine-tuning, and more.

My view on AI is that it has great potential for both assistance and harm. I don't believe we'll be seeing anything like the Terminator, but I do believe there is a 20% chance we will see mass job displacement, along with e...

The Case for Evaluating Model Behaviors

Rhianna Litchfield 2mo 10

Anthropic, at least, does quite a few behavioral assessments, such as susceptibility to sabotage, reward hacking, alignment faking, etc. Other labs such as OpenAI and Google don't seem as interested in contributing to this research. In those specific cases, misalignment is pretty rare, since the models only show it when forced into specific, "stressful" situations; they would not normally act this way. In other cases that you've discussed before, such as sloppy choices, sycophancy, and inability to notice or discuss flaws/errors, I agree that these are egregious enough in current LLMs to be worrying.