Helping Friends, Harming Foes: Testing Tribalism in Language Models
This project was conducted as part of the SPAR 2025 Fall programme under the mentorship of Diogo Cruz and Eyon Jang. TL;DR: What happens if a model becomes less agreeable once it learns you hate its favourite fruit? In this post, we use fruit preferences as a “toy model” to...
Mar 118