Irakli Shalibashvili — LessWrong

Helping Friends, Harming Foes: Testing Tribalism in Language Models

This project was conducted as part of the SPAR 2025 Fall programme under the mentorship of Diogo Cruz and Eyon Jang.

TL;DR: What happens if a model becomes less agreeable once it learns you hate its favourite fruit? In this post, we use fruit preferences as a “toy model” to...