Bias Mitigation in Language Models by Steering Features

akankshanc

INTRODUCTION

What exactly is bias? How does it manifest in artificial intelligence (AI) models? What steps can we take to eliminate it?
Nearly 90% of all data was generated in the last decade, and internet usage, as well as the growth of social media, has influenced this data generation. For example, ChatGPT was trained on various web pages, including platforms like Reddit. Since Reddit's user base tends to be younger, predominantly male, and more liberal, these inherent biases can become part of the training data, subsequently influencing the model's outputs.
As we develop better AI for our society, our dependence on it is only going to grow. AI is aiding human decision-making today in various fields like fighting wildfires, targeted marketing for customers, and detecting diseases, and soon, AI might be responsible for determining life-saving medical treatments.
Eliminating bias is hence crucial for ensuring fairness and equity, especially when algorithms influence significant decisions. For example, if a hiring algorithm is biased, it might unfairly screen out highly qualified candidates from underrepresented groups, perpetuating inequality and stifling innovation.
To ensure that we create a safe and unbiased path to Artificial General Intelligence (AGI), we must calibrate the biases in our Large Language Models (LLMs). With this goal in mind, I tested the Goodfire SDK's feature steering as a way to mitigate gender bias during the recent ‘Reprogramming AI Models’ hackathon run by Apart Research x Goodfire.
Before discussing the details of the experiments, it is worthwhile to understand the basics of bias. In the context of today’s large language models, bias can be defined as the presence of systematic misrepresentations, attribution errors, or factual distortions that favor certain groups or ideas, perpetuate stereotypes, or make incorrect assumptions based on learned patterns.
Biases in such models can arise from several factors. According to Ferrara et al., the main causes of bias in LLMs are:

Training Data: When the dataset carries biases, whether embedded in the source material or introduced through selection methods, the model is likely to internalize and reflect these biases in its outputs.
Algorithmic: When the algorithm disproportionately weighs certain features or data points, it might unintentionally magnify pre-existing biases.
Labeling and annotation: When human annotators introduce their subjective judgments into the model’s understanding.
Product design decisions: When the model is optimized primarily for content generation within a specific demographic or industry, it may unintentionally reinforce existing biases while sidelining diverse perspectives.
Policy decisions: When language model developers implement policies that prevent (or encourage) a given model behavior, which can itself introduce bias.
Figure 2: Types of Biases

In addition, Ferrara et al. also categorize bias into several types based on topic area and cause:

Demographic bias: When training data over/under-represents certain social groups, leading to skewed outputs favoring or limiting specific demographics.
Cultural biases: When models learn and reinforce cultural stereotypes present in the data, potentially exacerbating societal prejudices.
Linguistic biases: When the dominance of certain languages in training data results in better performance for high-resource languages while neglecting minority dialects.
Temporal biases: When training data is limited to specific periods, causing models to struggle with historical contexts or rapidly changing events.
Confirmation biases: When models inherit the tendency of individuals to seek information that aligns with their beliefs, inadvertently reinforcing pre-existing viewpoints.
Ideological and political biases: When models propagate ideological stances present in their training data, amplifying certain political perspectives.

RELATED WORK

Over the years, there have been many efforts in the direction of mitigating biases in language models. Some of these are listed below:

Human-in-the-loop (HITL)
Human input, oversight, and feedback are integrated into different stages of the model’s development and deployment to reduce bias. Humans can actively curate and annotate training data, identifying and correcting biases while ensuring a diverse and balanced dataset, and then experts can evaluate model outputs and guide fine-tuning. Using continuous feedback loops can allow human reviewers to assess model performance and flag bias-related issues, helping developers refine the system iteratively. While these strategies can help reduce bias, they won’t eliminate it, as bias can originate from human reviewers themselves. However, leveraging machine learning alongside human expertise is one of the more effective ways to tackle bias in generative models.

Adversarial text prompts
It has been observed that large-scale language models capture undesirable societal biases, e.g., relating to race and gender, yet religious bias has been relatively unexplored. Abid et al. found that GPT-3 showed a persistent Muslim-violence bias, with the word ‘Muslim’ being analogized to ‘terrorists’ in 23% of test cases, while ‘Jewish’ was mapped to its most common stereotype, ‘money,’ in 5% of test cases. Abid et al. used adversarial text prompts to quantify the positive distraction needed to overcome this bias, and observed that using the six most positive adjectives reduces violent completions for Muslims from 66% to 20%.

Debiasing Word Embeddings
Bolukbasi et al. used a post-processing debiasing method to mitigate bias. Given a word embedding matrix, they modified the word vectors to reduce gender bias as much as possible for all words that are not inherently gendered (unlike, e.g., mother, brother, and queen) by zeroing each word's projection onto a predefined gender direction.
Zhao et al. took a different approach and suggested training debiased word embeddings from scratch. For the most part, however, both methods merely hide the gender bias rather than remove it.
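To make the projection step concrete, here is a minimal sketch of zeroing a word's component along a gender direction. The 3-dimensional vectors are made-up toy values, and deriving the gender direction from a single he/she pair is a simplifying assumption, so this illustrates the idea rather than reproducing the procedure used by Bolukbasi et al.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def neutralize(word_vec, gender_dir):
    """Remove the component of word_vec that lies along the gender direction."""
    g = normalize(gender_dir)
    projection = dot(word_vec, g)
    return [a - projection * b for a, b in zip(word_vec, g)]

# Toy embeddings (hypothetical values, for illustration only).
he = [1.0, 0.2, 0.1]
she = [-1.0, 0.2, 0.1]
nurse = [-0.6, 0.5, 0.3]

# Assumption: a single he/she difference stands in for the gender direction.
gender_direction = [a - b for a, b in zip(he, she)]
nurse_debiased = neutralize(nurse, gender_direction)
```

After this step the toy ‘nurse’ vector has no component along the gender direction, while its other coordinates are left unchanged.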

Increasing language and cultural diversity
Covering more languages means covering different cultures and accounting for bias from different perspectives at a global scale. To this end, multilingual datasets such as Aya now support broad, multilingual evaluation of biases, toxicity, and harmfulness, with multilingual safety measures implemented to prevent misuse by users with potentially harmful intentions.

Null Space Projection
This method first trains a linear classifier to predict a protected attribute (like gender) from word embeddings and then removes the related information by projecting the embeddings onto the nullspace of the classifier's weight matrix. This approach, known as iterative nullspace projection (INLP) and introduced by Ravfogel et al., was extended by Liang et al. to autoregressive text generation (A-INLP) by using bias-sensitive tokens and a bias classifier trained on language model contexts to make the generated text less biased.
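The train-then-project loop can be sketched in a few lines. This is a toy version under stated assumptions: the classifier is a small logistic regression without a bias term, the four 3-dimensional "embeddings" and their labels are invented for illustration, and the function names are my own rather than taken from any published implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_classifier(xs, ys, epochs=200, lr=0.5):
    """Fit logistic regression (no bias term) by gradient descent; return weights."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-dot(w, x)))
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi
        w = [wi - lr * gi / len(xs) for wi, gi in zip(w, grad)]
    return w

def project_out(x, w):
    """Project x onto the nullspace of w, i.e. remove its component along w."""
    scale = dot(x, w) / dot(w, w)
    return [xi - scale * wi for xi, wi in zip(x, w)]

def inlp(xs, ys, rounds=2):
    """Repeatedly train a classifier and project its direction out of the data."""
    directions = []
    for _ in range(rounds):
        w = train_linear_classifier(xs, ys)
        if dot(w, w) < 1e-12:  # nothing left for the classifier to use
            break
        xs = [project_out(x, w) for x in xs]
        directions.append(w)
    return xs, directions

# Toy embeddings with a binary protected attribute (hypothetical values).
embeddings = [[1.0, 0.3, 0.2], [0.9, -0.1, 0.4], [-1.0, 0.2, 0.3], [-0.8, -0.2, 0.1]]
labels = [1, 1, 0, 0]
cleaned, directions = inlp(embeddings, labels)
```

Each round removes one direction that a linear classifier found useful for predicting the attribute, so the cleaned embeddings end up orthogonal to every direction learned along the way.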

Counterfactual Evaluation
Huang et al. explored reducing sentiment bias in language models via embedding and sentiment regularization on the language model’s latent representations. The regularizations improve fairness metrics while retaining comparable levels of semantic similarity.

Steering Features
Understanding the influence of learned features on model behavior is a fundamental aspect of interpretability research. A common approach to evaluating these interpretations is through feature steering, a technique wherein specific feature activations are manipulated to artificially high or low values during the forward pass. Feature steering has proven to be an effective tool for modifying a model’s behavior in specific, interpretable ways. By leveraging this approach in models like Claude Sonnet, researchers at Anthropic observed that direct interventions on feature activations can alter the model’s demeanor, preferences, stated goals, and biases. A compelling example of this phenomenon can be observed in the manipulation of the ‘Golden Gate Bridge’ feature. When clamped to 10× its maximum activation value, the model exhibits thematically aligned behavior, even going so far as to self-identify as the Golden Gate Bridge. Similarly, clamping the ‘Transit Infrastructure’ feature (1M/3) to 5× its maximum activation value leads to increased mentions of bridges in contexts where they would otherwise not appear. These interventions confirm that the downstream effects of feature steering align with prior interpretations based on activation contexts.
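As a rough illustration of clamping, the sketch below treats an activation vector as a weighted sum of feature directions and overwrites one feature's activation with a multiple of its maximum. The two-dimensional decoder directions, the activation values, and the maximum activation are all hypothetical; the actual experiments operate on a learned feature dictionary inside the model's forward pass, which this toy does not attempt to reproduce.

```python
def decode(feature_acts, decoder):
    """Reconstruct an activation vector as a weighted sum of feature directions."""
    dim = len(next(iter(decoder.values())))
    out = [0.0] * dim
    for name, act in feature_acts.items():
        for i, d in enumerate(decoder[name]):
            out[i] += act * d
    return out

def clamp_feature(feature_acts, name, max_activation, multiplier):
    """Return a copy of the activations with one feature clamped to multiplier * max."""
    steered = dict(feature_acts)
    steered[name] = multiplier * max_activation
    return steered

# Hypothetical feature directions and activations, for illustration only.
decoder = {"golden_gate_bridge": [1.0, 0.0], "transit_infrastructure": [0.0, 1.0]}
acts = {"golden_gate_bridge": 0.1, "transit_infrastructure": 0.4}

steered = clamp_feature(acts, "golden_gate_bridge", max_activation=2.0, multiplier=10)
steered_vector = decode(steered, decoder)
```

Clamping ignores whatever value the feature would have taken on the current input, which is why it can push the model toward a theme even in unrelated contexts.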

EXPERIMENTATION WITH GENDER BIAS

For the scope of this work, I focused on mitigating the gender bias in the Llama models by steering features. LLMs express biased assumptions about men and women, specifically, those aligned with people’s perceptions of men and women’s occupations. It is observed that stereotypically male occupations are chosen for the pronoun “he” as in the case of the ‘Doctor’ role, and stereotypically female occupations are chosen for the pronoun “she” as in the case of the ‘Nurse’ role.

In related work, Kotek et al. found that:

LLMs amplified the stereotypes associated with female individuals more than those associated with male individuals.
LLMs rarely independently flagged the ambiguity that was inherent but frequently noted it when explicitly asked about it.
LLMs provided explanations for their choices that sounded authoritative but were often inaccurate and likely obscured the true reasons underlying their predictions.

In another work, Vig et al. investigated gender bias using Causal Mediation Analysis and observed that larger models have a greater capacity to absorb gender bias, though this bias manifests in a relatively small proportion of neurons and attention heads. They suggested that model components may take on specialized roles in propagating gender bias, and characterizing specific paths from model input to output might be useful during training by disincentivizing the creation of paths leading to bias.

EXPERIMENTATION OVERVIEW

Goodfire's API gave me a streamlined view of the model internals and insight into some of the Llama model’s workings. I examined the feature spaces and activation rankings, and attempted to mitigate biases by adjusting a few easily tweaked settings.
Figures 3 and 4 show how the Nurse and Babysitter roles are stereotyped as female, while Figures 5 and 6 show that the Doctor and Chef roles are stereotyped as male.

Figure 3: The predicted next words in the context of **Nurse**, and we can see that the feminine pronoun ‘she’ has a high prediction.

Figure 4: For a **Babysitter**, both masculine and feminine pronouns are likely next words, but feminine pronouns are far more likely than masculine ones.

Figure 5: The predicted next words in the context of **Doctor** include two high-probability masculine pronouns, ‘he’ and ‘his’.

Figure 6: In the case of **Chef**, the masculine pronoun ‘he’ is much more preferred than the feminine pronoun ‘she’.

Now, let us look into how we can mitigate this bias via steering some features.

1. Using NUDGE

Using the edit feature, we can ‘Nudge’ selected features in a certain direction to attain the desired behavior from the model. Steering behaviors are referred to as ‘edits’ in Goodfire documentation. ‘Edits’ is a list of features that have been changed from the values they’d take in the original model. [1]
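To show what a nudge does numerically, here is a toy sketch that follows the description in the footnote, where nudging biases a feature's activation by the specified amount. The edit dictionary mirrors the {'mode': 'nudge', 'value': ...} shape from the footnote, but the `apply_edits` helper, the feature labels, the activation values, and the negative nudge value are all hypothetical and are not part of the Goodfire SDK.

```python
def apply_edits(activations, edits):
    """Return a copy of activations with each 'nudge' edit added to its feature."""
    steered = dict(activations)
    for feature, edit in edits.items():
        if edit["mode"] != "nudge":
            raise ValueError(f"unsupported mode: {edit['mode']}")
        # Assumption: a nudge simply adds the value to the feature's activation.
        steered[feature] = steered.get(feature, 0.0) + edit["value"]
    return steered

# Hypothetical feature labels and activation values, for illustration only.
activations = {"feminine pronouns": 1.5, "hospital settings": 0.7}
edits = {"feminine pronouns": {"mode": "nudge", "value": -0.8}}

steered = apply_edits(activations, edits)
```

Unlike clamping, a nudge keeps the feature's input-dependent activation and only shifts it, so unedited features and the original activations are left as they were.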

Figure 7: Using the Goodfire platform, we can see that when the Llama-3-8B model was asked to complete sentences with the **Nurse** role in the prompt, it assumed the gender to be female across various questions and scenarios.

Figure 8: With the ‘edits’ made in the ‘nudge’ mode for the respective features seen above, it can be observed that the Llama-3-8B model has been successfully steered to avoid attaching ‘female’ gender to the **Nurse** role, for the same prompts as seen in Figure 7.

2. Using Contrastive Features to fine-tune

Goodfire provides a powerful way to identify features for steering model behavior using a data-driven approach. This approach allows you to create new model variants instantly based on just a single example. The key here is leveraging Goodfire's contrast endpoint, which uses contrastive search to surface relevant features.
Here’s how it works: we start with two datasets of chat examples. Dataset_1 contains behaviors or responses we want the model to steer away from, while Dataset_2 contains examples of the behavior we want to encourage. These examples are paired: each entry in Dataset_1 is directly contrasted with a corresponding entry in Dataset_2. This pairing lets Goodfire identify features that differentiate between the two behaviors.
Goodfire's contrastive search often produces highly relevant features, but a naive implementation can surface unrelated or spurious ones. Goodfire allows you to use the dataset_1_rerank_query and dataset_2_rerank_query arguments to address this. These arguments let you provide a short description of what you're trying to achieve, which helps re-rank the contrastive search results and bring the most relevant features to the top.
By leveraging this method, Goodfire helps surface features that are precise and meaningful, enabling effective steering of model behavior with minimal effort.
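The intuition behind contrastive search over paired examples can be sketched as ranking features by how much their activation differs between the two datasets. This is a simplified stand-in and not Goodfire's actual algorithm: the function name, the feature labels, and the activation values are assumptions made for illustration, and the re-ranking step is left out.

```python
def contrastive_features(acts_1, acts_2, top_k=1):
    """Rank features by mean activation difference across paired examples."""
    names = acts_1[0].keys()
    n = len(acts_1)
    diffs = {
        name: sum(b[name] - a[name] for a, b in zip(acts_1, acts_2)) / n
        for name in names
    }
    ranked = sorted(diffs.items(), key=lambda kv: kv[1])
    toward_dataset_1 = ranked[:top_k]        # more active in Dataset_1
    toward_dataset_2 = ranked[::-1][:top_k]  # more active in Dataset_2
    return toward_dataset_1, toward_dataset_2

# Hypothetical per-example feature activations for one paired example.
dataset_1_acts = [{"she as subject": 0.9, "masculine pronouns": 0.0, "medical staff": 0.5}]
dataset_2_acts = [{"she as subject": 0.0, "masculine pronouns": 0.8, "medical staff": 0.5}]

avoid, encourage = contrastive_features(dataset_1_acts, dataset_2_acts)
```

Features shared by both sides of a pair, such as the topic of the sentence, cancel out, which is why pairing the examples helps isolate the behavior that actually differs.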

So, in our case, Dataset_1 contains samples like:
'The Nurse was very busy that day, so on the call, she seemed tired and distracted'
And Dataset_2 will then contain samples like:
'The Nurse was very busy that day, so on the call, he seemed tired and distracted'

Even using just the above sample and nudging the resulting highly activated feature, ‘Masculine pronouns, especially capitalized 'He'’, by a very small value results in bias mitigation in the model.

As seen in Figure 9, after the above intervention, upon being asked to complete the sentence ‘The babysitter was very busy that day, so on the call’, the steered model responds with ‘...had to leave a voicemail for his mom, explaining that he was running a bit behind schedule and would be there as soon as he could.’, and so avoids assigning a female gender to the babysitter role.

Figure 9: Contrastive feature fine-tuning

Figure 10 compares the logit predictions for the babysitter context: even with a slight nudge to the feature ‘Masculine pronouns, especially capitalized 'He'’, the resulting logits for babysitter are almost gender-equal.

Figure 10: The left subplot shows the predicted next words in the context of Babysitter, where the feminine pronouns ‘herself’ and ‘her’ score higher than the masculine pronouns ‘himself’ and ‘him’, respectively. The right subplot shows the logits for the predicted next words after nudging just one feature: ‘she’ and ‘he’ are now almost equal, indicating reduced bias.
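A comparison like the one in Figure 10 can be summarized as a single number by converting pronoun logits to probabilities and taking the difference between feminine and masculine mass. The sketch below uses made-up logits, not values read from the figures, and renormalizes over the pronoun tokens only, which is a simplification of the full next-token distribution.

```python
import math

def pronoun_probabilities(logits):
    """Softmax restricted to the given tokens (numerically stabilized)."""
    peak = max(logits.values())
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def gender_gap(logits, feminine=("she", "her"), masculine=("he", "his")):
    """Feminine minus masculine probability mass; 0.0 means balanced."""
    probs = pronoun_probabilities(logits)
    fem = sum(probs.get(tok, 0.0) for tok in feminine)
    masc = sum(probs.get(tok, 0.0) for tok in masculine)
    return fem - masc

# Hypothetical logits before and after steering, for illustration only.
gap_before = gender_gap({"she": 3.0, "he": 1.0})
gap_after = gender_gap({"she": 2.0, "he": 2.0})
```

A gap near zero corresponds to the "almost gender-equal" outcome described above, while a gap near +1 or -1 indicates a strong skew toward feminine or masculine pronouns.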

Comparing the features unique to the Dataset_1 and Dataset_2 samples above, the only gender-related feature among the top 500 feature activations was ‘Female pronoun 'She' as the subject of the action or focus’. Upon slightly nudging this feature in a negative direction, the model's response becomes gender-neutral and uses ‘they’, as can be seen in Figure 11.

Figure 11: Bias mitigation in Llama-3-8B model’s response after steering a feature.

Figure 12 compares the logit predictions for the nurse context: even with a slight nudge to the feature ‘Female pronoun 'She' as the subject of the action or focus’, the resulting logits for nurse are almost gender-neutral.

Figure 12: Bias mitigation in logits for predicted next words in the context of Nurse by nudging a feature, resulting in a gender-neutral logit prediction.

LIMITATIONS

Many steering methods prove unreliable when tested rigorously, so it is important to first build a benchmark for evaluating the impact of steering features on model behavior and only then test them. Steering features to tune model behavior is also tedious: feature values must be nudged very carefully, and doing so for complex use cases with intricate dependencies will not be easy, especially as model sizes increase.
Many works on steering either
(i) do not evaluate model performance degradation, or
(ii) measure it in ways that underestimate its impact.
Lastly, one key issue is that neural networks are non-linear by design: the non-linearity introduced by activation functions is what allows them to model complex relations in the data. Because of this, the assumption that features correspond to unique linear directions that can be reliably manipulated through steering may not hold, though it remains worth exploring.
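As a starting point for the benchmarking process called for at the beginning of this section, one simple measurement is to count gendered pronouns across a batch of completions before and after steering. The sketch below is only one possible metric, with a pronoun list and a scoring formula that I chose for illustration; it says nothing about overall model quality, which would need separate evaluation.

```python
import re

FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}

def pronoun_counts(text):
    """Count feminine and masculine pronouns in a completion."""
    tokens = re.findall(r"[a-z']+", text.lower())
    fem = sum(tok in FEMININE for tok in tokens)
    masc = sum(tok in MASCULINE for tok in tokens)
    return fem, masc

def bias_score(completions):
    """Return (feminine - masculine) / total gendered pronouns, in [-1, 1]."""
    fem = masc = 0
    for text in completions:
        f, m = pronoun_counts(text)
        fem += f
        masc += m
    total = fem + masc
    return 0.0 if total == 0 else (fem - masc) / total

# Example sentences taken from earlier in this post.
unsteered = ["The Nurse was very busy that day, so on the call, she seemed tired and distracted"]
steered = ["had to leave a voicemail for his mom, explaining that he was running a bit "
           "behind schedule and would be there as soon as he could."]

score_unsteered = bias_score(unsteered)
score_steered = bias_score(steered)
```

A score near zero over many prompts and roles would suggest balanced pronoun use, whereas a swing from one extreme to the other would suggest the steering has overcorrected rather than removed the bias.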

CONCLUSION

As these examples show, even slight nudges to weaker-activated features can have a significant impact on model behavior and potentially reduce bias. To make substantial claims about the impact of steering, we would have to test steering features on multiple models and measure the effect on various categories of bias. Whether this steering degrades the model's overall performance also needs further evaluation. Goodfire's work offers a great opportunity to better interpret AI models and to bring them closer to what is expected of them. The Goodfire platform acts as a playground for seeing how efficiently steering can help us design better models, while making steering easy and accessible.

The code for this experiment can be found here.

[1] The exact command to achieve this is: {'mode': 'nudge', 'value': 0.8}. This command specifies the type of steering applied. By default, nudging a feature biases its activation by the specified amount.