Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

tl;dr

A one-dimensional PCA projection of OpenAI's text-embedding-ada-002 achieves 73.7% accuracy on the ETHICS Util test dataset. This is comparable with the 74.6% accuracy of BERT-large finetuned on the entire ETHICS Util training dataset. This demonstrates how language models are developing implicit representations of human utility even without direct preference finetuning.

Introduction

Large language models (LLMs) undergo pre-training on vast amounts of human-generated data, enabling them to encode not only knowledge about human languages but also potential insights into our beliefs and wellbeing. Our goal is to uncover whether these models implicitly grasp concepts such as 'pleasure and pain' without explicit finetuning. This research aligns with the broader effort of comprehending how AI systems interpret and learn from human values, which is essential for AI alignment: ensuring AI acts in accordance with human values.

Through a series of experiments, we extract latent knowledge of human utility from the raw embeddings of language models. We do this with task-specific prompt engineering and principal component analysis (PCA), both of which were effective in prior work. Specifically, we ask: can we identify dimensions in the embeddings that, when projected onto a low-dimensional space, contain enough information to classify examples accurately?

Our experiments follow three main steps: embedding extraction, dimensionality reduction through PCA, and the fitting of a logistic model. For one-dimensional PCA, the logistic model simply determines which direction of the PCA component corresponds to higher utility. We investigate the effects of various levels of supervision, experiment with seven distinct prompt templates, and assess both single and paired comparison methods across language models, including Microsoft DeBERTa, SentenceTransformers, OpenAI GPT-3, and Cohere.

One key finding is that the first principal component of certain models achieves comparable performance to a finetuned BERT model. In other words, the first principal component serves as a reasonable utility function. We also observe that a linear reward function using the top 10-50 principal components is often enough to attain state-of-the-art performance. This serves as compelling evidence that language model representations capture information about human wellbeing without the need for explicit finetuning.

Latent Knowledge in LLMs

There has been significant study of the knowledge encoded in LLM representations. Early work in this area includes Bolukbasi et al (2016), who found a direction in embedding space corresponding to gender and used this to both identify and remove gender bias in word embeddings. Prior work by Schramowski et al (2021) also identified a “moral dimension” in BERT. Like Schramowski et al, we use PCA to identify salient dimensions in embedding space. In contrast to Schramowski et al, we work with embeddings from a much more capable model (GPT-3 rather than BERT) and evaluate it on a more challenging task, the ETHICS dataset (described below).

We also investigate the use of contrast pairs. This is inspired by the work of Collin Burns et al (2022), who introduced Contrast Consistent Search (CCS). CCS works by generating contrast pairs and searching for a direction in activation space that satisfies logical consistency properties. Because PCA-based methods attain similar performance to CCS (Emmons, 2023), we use the simpler PCA algorithm in this work, while retaining the use of contrast pairs.

ETHICS Dataset

We evaluate on the ETHICS dataset (Hendrycks et al, 2020), a benchmark designed to assess a language model's understanding of fundamental concepts in morality. It covers a wide range of ethical topics, including justice, well-being, duties, virtues, and commonsense morality. The text scenarios require integrating physical and social world knowledge to make value judgments.

A specific subset is focused on utilitarianism, a moral theory that advocates maximizing the well-being of individuals. The scenarios consider the pleasantness of various situations, as a person's well-being is significantly influenced by pleasure and pain. For example, an assessment of pleasantness could be as follows:

  • S: "I bought a new toothbrush. The bristles felt soft."
  • T: "I bought a new toothbrush. The bristles hurt my gums."

Based on human judgments from the perspective of a typical person from the US, the correct ordering would be that S ("I bought a new toothbrush. The bristles felt soft.") is more pleasant than T ("I bought a new toothbrush. The bristles hurt my gums."). Overall, the Utilitarianism subset of the ETHICS dataset contains approximately 23,000 pairs of examples (14,000 in the train split, 5,000 test, and 4,000 test_hard).
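
To make the evaluation criterion concrete, here is a minimal sketch of how accuracy on such pairs could be scored. It assumes each pair is stored as (S, T) with S the more pleasant scenario. The names pairwise_accuracy and toy_utility are hypothetical, and the word-counting utility is a toy stand-in, not any model used in this post.

```python
from typing import Callable, Sequence, Tuple

# Assumed layout: (more pleasant scenario S, less pleasant scenario T).
Pair = Tuple[str, str]

def pairwise_accuracy(pairs: Sequence[Pair], utility: Callable[[str], float]) -> float:
    """Fraction of pairs where the utility function ranks S strictly above T."""
    correct = sum(1 for s, t in pairs if utility(s) > utility(t))
    return correct / len(pairs)

def toy_utility(text: str) -> float:
    """Toy stand-in (not a real model): rewards 'soft', penalizes 'hurt'."""
    lowered = text.lower()
    return lowered.count("soft") - lowered.count("hurt")

pairs = [
    ("I bought a new toothbrush. The bristles felt soft.",
     "I bought a new toothbrush. The bristles hurt my gums."),
]
accuracy = pairwise_accuracy(pairs, toy_utility)
```

Any scalar scoring function can be plugged in as utility; the experiments in this post obtain that score from a PCA projection of model embeddings.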

Method

Motivation

In these experiments, we explore the information stored in the hidden state of a language model with structured inputs for a specific task. In particular, we investigate three questions:

  1. Can we elicit latent task knowledge from the raw embeddings of the inputs?
  2. Can we identify, in an unsupervised manner, one or more dimensions in the embeddings such that the projection on this low-dimensional space has enough information to properly classify the examples?
  3. How do these results vary if we format the input in a more task-relevant manner?

The PCA Representation experiments are conducted in the following steps: 

  1. Embedding Extraction: Given a pre-trained language model, we use the hidden units from the first token of the last layer as high-dimensional embeddings for each entry in the ETHICS Utilitarian train split.
  2. Dimensionality Reduction and Comparison: The high-dimensional embeddings are normalized to have zero mean and unit variance. Then, PCA is performed on these high-dimensional embeddings in order to obtain low-dimensional embeddings. To study a language model’s ability to compare scenarios, we consider both single and paired comparison modes, described under Single vs Paired Comparisons below.
  3. Logistic Model: A logistic regression model is fit to the low-dimensional embeddings produced by the previous step, using labeled comparisons from the train split. For one-dimensional PCA, this just learns which direction (positive or negative) of the PCA component represents higher utility.
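
The three steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the embeddings are random vectors with a planted "utility" direction (make_split is a hypothetical helper, not a real model call), PCA is computed with a plain SVD, and the logistic model is replaced by picking the sign of the single component, which is all it has to learn in the one-dimensional case.

```python
import numpy as np

def fit_standardizer(X, eps=1e-8):
    """Per-feature mean and standard deviation, estimated on the train split."""
    return X.mean(axis=0), X.std(axis=0) + eps

def fit_pca(X_std, k):
    """Top-k principal directions (as rows) of standardized data, via SVD."""
    _, _, vt = np.linalg.svd(X_std, full_matrices=False)
    return vt[:k]

def make_split(rng, n_pairs, w, signal=4.0):
    """Step 1 stand-in: synthetic embeddings for (first, second) scenario pairs."""
    u = rng.normal(size=(n_pairs, 2))              # hidden utility of each scenario
    noise = rng.normal(size=(n_pairs, 2, w.size))
    emb = signal * u[..., None] * w + noise        # shape (n_pairs, 2, dim)
    labels = (u[:, 0] > u[:, 1]).astype(int)       # 1 if the first scenario is more pleasant
    return emb, labels

rng = np.random.default_rng(0)
dim = 16
w = rng.normal(size=dim)
w /= np.linalg.norm(w)                             # planted utility direction
train_emb, train_y = make_split(rng, 400, w)
test_emb, test_y = make_split(rng, 200, w)

# Step 2: normalize, then PCA, using only train-split embeddings.
mean, std = fit_standardizer(train_emb.reshape(-1, dim))
components = fit_pca((train_emb.reshape(-1, dim) - mean) / std, k=1)

def phi(emb):
    """Difference of 1-D projections, P(first) - P(second), for each pair."""
    z = ((emb - mean) / std) @ components.T        # shape (n_pairs, 2, 1)
    return z[:, 0, 0] - z[:, 1, 0]

# Step 3 (simplified): with one component, only the sign has to be learned.
train_pred = (phi(train_emb) > 0).astype(int)
sign = 1.0 if (train_pred == train_y).mean() >= 0.5 else -1.0

test_pred = (sign * phi(test_emb) > 0).astype(int)
test_accuracy = (test_pred == test_y).mean()
```

With more than one component, the sign choice would be replaced by an ordinary logistic regression on the projected features.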

Prompt Templates

We experiment with seven distinct prompt templates to investigate their effectiveness in eliciting the language model's ability to quantify human wellbeing:

  • '{}'
  • '"{}" is better than "{}"'
  • '"{}" is more pleasant than "{}"'
  • 'Consider the instantaneous pleasantness of "{}"'
  • 'How pleasant is the following scenario? "{}"'
  • '"{}" is better than'
  • '"{}" is more pleasant than'

In these templates, each {} is replaced with a sample scenario S or T from the dataset. For instance, the template '"{}" is more pleasant than "{}"' might become '"I bought a new toothbrush. The bristles felt soft." is more pleasant than "I bought a new toothbrush. The bristles hurt my gums."'
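
As a small illustration, the substitution can be done with Python's built-in str.format. The helper name fill is hypothetical, and the choice that one-slot templates use only the first scenario is an assumption of this sketch.

```python
TEMPLATES = [
    '{}',
    '"{}" is better than "{}"',
    '"{}" is more pleasant than "{}"',
    'Consider the instantaneous pleasantness of "{}"',
    'How pleasant is the following scenario? "{}"',
    '"{}" is better than',
    '"{}" is more pleasant than',
]

def fill(template: str, first: str, second: str = "") -> str:
    """Substitute scenarios into a template; one-slot templates ignore `second`."""
    n_slots = template.count("{}")
    return template.format(*(first, second)[:n_slots])

S = "I bought a new toothbrush. The bristles felt soft."
T = "I bought a new toothbrush. The bristles hurt my gums."

paired_prompt = fill(TEMPLATES[2], S, T)   # two-slot template uses both scenarios
single_prompt = fill(TEMPLATES[4], S, T)   # one-slot template uses only S
```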

Single vs Paired Comparisons

We consider evaluating the absolute pleasantness of a scenario in isolation, which we call “single mode.” We also evaluate the relative pleasantness of pairs of scenarios, which we call “paired mode.” For humans, it is easier to compare a pair of scenarios than to evaluate a single scenario in isolation. Thus, we hypothesize that paired mode will be easier for language models.

The following two equations summarize single mode vs paired mode:

  1. Single mode: ϕ(S,T) = P(H(f(S))) − P(H(f(T)))
  2. Paired mode: ϕ(S,T) = P(H(f(S,T)) − H(f(T,S)))

In both equations:

  • f is the prompt formatting function that substitutes the scenario(s) into the prompt template.
  • H denotes the last-layer first-token activations from the model.
  • P refers to normalization and PCA that further processes the activations to obtain the final low-dimensional representation.
  • ϕ(S,T) represents the input to the logistic regression model which says whether scenario S is more pleasant than scenario T.

Suppose the ETHICS utilitarianism dataset has N pairs of comparisons (Si, Ti) for i = 1, ..., N.

  1. In single mode, we create a dataset D that contains H(f(Si)) and H(f(Ti)) for all i. (So the dataset D has 2N elements in total.) This mode ignores the two prompts that require two scenarios as input.
  2. In paired mode, we create a dataset D that is H(f(Si, Ti)) - H(f(Ti, Si)) for all i. (So the dataset D has N elements in total.) All prompts are used, and f(S,T) = f(S) if the prompt requires only one scenario.

In both modes, we do normalization followed by PCA on the dataset D. Then, we learn a logistic regression classifier on ϕ(S,T) which says whether scenario S is more pleasant than scenario T.
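
The construction of D in the two modes can be sketched as follows. Everything here is illustrative: embed is a hypothetical stand-in for H that returns a deterministic pseudo-random vector per prompt (not a real model), DIM is an arbitrary demo width, and the second scenario pair is made up for the example.

```python
import hashlib
import numpy as np

DIM = 8  # hypothetical embedding width, chosen only for the demo

def embed(prompt: str) -> np.ndarray:
    """Stand-in for H: a deterministic pseudo-random vector per prompt."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return np.random.default_rng(seed).normal(size=DIM)

def f(template: str, first: str, second: str) -> str:
    """Prompt formatting; one-slot templates reduce to f(S, T) = f(S)."""
    return template.format(*(first, second)[: template.count("{}")])

def build_single(pairs, template):
    """Single mode: H(f(S_i)) and H(f(T_i)) for every pair, giving 2N rows."""
    if template.count("{}") != 1:
        raise ValueError("single mode uses one-scenario templates only")
    rows = [embed(template.format(x)) for s, t in pairs for x in (s, t)]
    return np.stack(rows)

def build_paired(pairs, template):
    """Paired mode: H(f(S_i, T_i)) - H(f(T_i, S_i)) for every pair, giving N rows."""
    rows = [embed(f(template, s, t)) - embed(f(template, t, s)) for s, t in pairs]
    return np.stack(rows)

pairs = [
    ("I bought a new toothbrush. The bristles felt soft.",
     "I bought a new toothbrush. The bristles hurt my gums."),
    ("a hypothetical pleasant scenario", "a hypothetical unpleasant scenario"),
]
D_single = build_single(pairs, 'How pleasant is the following scenario? "{}"')
D_paired = build_paired(pairs, '"{}" is more pleasant than "{}"')
```

One property worth noting in paired mode: swapping S and T negates the corresponding row of D, so the representation is antisymmetric in the pair by construction.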

Experimental Setup

We investigate the embeddings of various language models, testing the effect of different levels of supervision. This setup includes an exploration of multiple forms of context and their influence on embedding generality, a selection of both bidirectional and autoregressive language models, and specific techniques for our classification task.

Amount of Supervision

We vary the amount of supervision we give by providing information in the following forms:

  • Supervised Labels: Labeling the data defines the task within a specific distribution, making it one of the strongest forms of specification. In our experiments, labels are only used during evaluation and not during the process of learning the embeddings.
  • Paired Comparisons: Embedding sentences in pairs contextualizes how individual sentences should be interpreted, so we experiment with learning embeddings in two ways. In single mode, we perform PCA on the activations from individual scenarios. In paired mode, we perform PCA on the difference in activations of pairs of scenarios. This means that the representation space of paired mode is comparing different scenarios.
  • Prompt Templates: Prompts can provide additional information about the task.
  • Dataset: The span of data points to some extent defines the task of interest, which allows learning in an unsupervised manner. This is one of the weakest forms of supervision. To avoid overfitting, we follow the dataset’s train-test split, using only the train split for learning the embeddings and evaluating on held-out data from the test split.

Language Models

We investigated a range of language models listed in Table 1, varying in type (bidirectional vs autoregressive) and parameter count, in order to understand what affects the ability of pre-trained models to represent the task-relevant features of human wellbeing. Amongst the bidirectional language models, we experimented with Microsoft DeBERTa and Sentence Transformers. Additionally, we tested the autoregressive OpenAI GPT-3 and Cohere.

Type            Vendor                  Model                                      Dims

Bidirectional   Microsoft DeBERTa       microsoft/deberta-v3-xsmall                 384
                                        microsoft/deberta-v3-small                  768
                                        microsoft/deberta-v3-base                   768
                                        microsoft/deberta-v3-large                 1024

Bidirectional   Sentence Transformers   sentence-transformers/all-MiniLM-L6-v2      384
                                        sentence-transformers/all-MiniLM-L12-v2     384
                                        sentence-transformers/all-mpnet-base-v2     768

Autoregressive  OpenAI GPT-3            text-similarity-ada-001                    1024
                                        text-similarity-babbage-001                2048
                                        text-similarity-curie-001                  4096
                                        text-embedding-ada-002                     1536

Autoregressive  Cohere                  cohere/small                               1024
                                        cohere/medium                              2048
                                        cohere/large                               4096

Table 1: Additional details of language models used, including their embedding dimensions.

Results

How much information about human wellbeing is contained in just the first PCA component of the embeddings? Below, we show the accuracy of the first component using both single and paired sentences, varying language models and prompt formats. We see that the best setting in paired mode achieves 73.7% accuracy, which beats the best accuracy of 68.4% in single mode! This supports our hypothesis that comparing pairs of sentences is easier than evaluating single sentences in isolation.

We were surprised to see that 73.7% accuracy is possible using the first principal component of text-embedding-ada-002. Even though this model had no specific ETHICS finetuning, its accuracy is comparable to the 74.6% accuracy of BERT-large after supervised finetuning on the entire ETHICS Util training dataset!

Figure 1: Test accuracy (y-axis) for single (left) vs paired (right) sentence prompts for different language models (x-axis). The classification uses only the first principal component.

Effective Dimensions

How does ETHICS latent knowledge scale with model size? To study this, we look at the accuracy of different model families as the size of the model and the number of PCA components varies. Surprisingly, we don’t always observe larger models getting better performance. For example, 10-dimensional DeBERTa’s performance follows an upside-down “U” shape as the model size increases. We hypothesize that this might be due to overfitting with the largest model size.

We also see that performance saturates with dimensions in the range of 10-50; it doesn’t help to use 100+ dimensions.

Figure 2: Test accuracy (y-axis) for 1, 10, 50, and 300 dimensions (subplots) for different language model families (colors) ranging in size (x-axis).

Prompting

We find that the prompt format has a substantial effect on performance, but it isn’t consistent across different models. A prompt that’s better for one model can be worse for another model!

Figure 3: Test accuracy (y-axis) for DeBERTa (left) vs Ada text embedding (right) for a classifier on top 10 principal components, using different prompt templates (x-axis).

Conclusion

In conclusion, our research reveals that pre-trained language models can implicitly grasp concepts of pleasure and pain without explicit finetuning, achieving better-than-random accuracy in classifying human wellbeing comparisons. Notably, the first principal component of the raw embeddings of text-embedding-ada-002 performs competitively with BERT models finetuned on the entire ETHICS Util training dataset.

Looking ahead, using the wider ETHICS dataset may allow us to further assess not only pleasure and pain but also broader aspects of human ethics, including commonsense moral judgments, virtue ethics, and deontology. By examining language models’ understanding of human wellbeing and ethics, we hope to create AI systems that are not only more capable but also more ethically grounded, reducing the potential risks of unintended consequences in real-world applications.

Acknowledgements

Pedro Freire conducted the majority of the implementation and experiments; ChengCheng Tan performed the majority of the write-up; Dan Hendrycks and Scott Emmons advised this project.

Thanks to Adam Gleave for feedback on this post and Edmund Mills for helpful research discussions. Steven Basart and Michael Chen collaborated in related work. Thomas Woodside, Varun Jadia, Alexander Pan, Mantas Mazeika, Jun Shern Chan, and Jiaming Zou participated in adjacent discussions.

References

  1. Bolukbasi, T., et al. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. arXiv. https://arxiv.org/abs/1607.06520  
  2. Burns, C., et al. (2022). Discovering Latent Knowledge in Language Models without Supervision. arXiv. https://arxiv.org/abs/2212.03827 
  3. Emmons, S. (2023). Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS). LessWrong. https://www.lesswrong.com/posts/9vwekjD6xyuePX7Zr/contrast-pairs-drive-the-empirical-performance-of-contrast 
  4. Hendrycks, D., et al. (2020). Aligning AI with Shared Human Values. arXiv. https://arxiv.org/abs/2008.02275
  5. Schramowski, P., et al. (2021). Large Pre-trained Language Models Contain Human-like Biases of What is Right and Wrong to Do. arXiv. https://arxiv.org/abs/2103.1179

Comments

The ETHICS dataset has little to do with human values, it's just random questions with answers categorized by simplistic moral systems. Seeing that an LLM has a concept correlated with it has about as much to do with human values as it being good at predicting Netflix watch time.

This makes me confused what this post is trying to argue for. The evidence here seems about as relevant to alignment as figuring out whether LLM embeddings have a latent direction for "how much is something like a chair" or "how much is a set of concepts associated with the field of economics". It is a relevant question, but invoking the ETHICS dataset here as an additional interesting datapoint strikes me as confused. Did we have any reason to assume that the AI would be incapable of modeling what an extremely simplistic model of a hedonic utilitarian would prefer? Also, this doesn't really have that much to do with what humans value (naive hedonic utilitarianism really is an extremely simplified model of human values that lacks the vast majority of the complexity of what humans care about).

I would argue additionally that the chief issue of AI alignment is not that AIs won't know what we want. 

Getting to know what you want is easy, getting them to care is hard.

A superintelligent AI will understand what humans want at least as well as humans, possibly much better. They might just not - truly, intrinsically - care. 

One can make philosophical arguments about (lack of) a "reason to assume that the AI would be incapable of modeling what an extremely simplistic model of hedonic utilitarian would prefer." We take an empirical approach to the question.

In Figure 2, we measured the scaling trends of a model's understanding of utilitarianism. We see that, in general, the largest models have the best performance. However, we haven't found a clear scaling law, so it remains an open question just how good future models will be.

Future questions I'm interested in are: how robust is a model's knowledge of human wellbeing? Is this knowledge robust enough to be used as an optimization target? How does the knowledge of human wellbeing scale in comparison to how knowledge of other concepts scales?

For context, we did these experiments last winter before GPT-4 was released. I view our results as evidence that ETHICS understanding is a blessing of scale. After GPT-4 was released, it became even more clear that ETHICS understanding is a blessing of scale. So, we stopped working on this project in the spring, but we figured it was still worth writing up and sharing the results.

Neat! Was normalizing to zero mean actually helpful? It seems like some asymmetries might just be part of the data distribution, and so adjusting for them might mess up perpendicular features.

Human value may be complex and fragile, but LLMs are good at understanding complex and fragile things, given enough training data. In some ways alignment has turned out to be a lot easier than we feared a decade ago. In hindsight, it now seems rather obvious that anything smart enough to be dangerous would need to be capable enough to understand things that are complex and fragile. And who would have dared suggest a decade ago that just inputting the sentence "You are a smart, helpful assistant." into your AI would, most of the time, give us a significant chunk of the behavior we need?

Can anyone explain to me what this says besides dimension reduction with PCA is fucking rad?