Document-tuning instills durable animal compassion in LLMs (and generalizes to humans)

Jasmine Brazilek; MilesTS

Note: This post focuses on the alignment implications. Our EA Forum post, which focuses on the implications for animal welfare, is here.

Jasmine Brazilek & Miles Tidmarsh: Compassion in Machine Learning

Preprint, March 2026: Full paper | HuggingFace resources | Animal Harm Benchmark (AHB)

TL;DR

Instruction-tuning and RL are effective in-context but may produce superficial and/or context-specific alignment. Our results suggest that pretraining or midtraining LLMs on synthetic documents instead shapes the model’s underlying beliefs, including which values it associates with the AI self-concept or with effectiveness. We use animal welfare as a test case for how robustly we can shape specific values in LLMs. Our document-tuned model scores 77% on a new public benchmark for animal welfare reasoning, compared to 40% for an equivalent instruction-tuning approach. The effect generalizes to humans, survives subsequent fine-tuning better than instruction-tuning, and causes no degradation on safety or capability benchmarks. We publicly release the Animal Harm Benchmark (AHB), all model checkpoints, and training data.

Synthetic Document Tuning

We use the case of animal welfare for three reasons. First, it is important and neglected. Second, it allows a much cleaner evaluation because it is not a focus of existing pretraining data or conventional fine-tuning. Third, we believe (and find some evidence) that training AIs with broad values, like compassion in general, is likely to generalize better to superintelligence than alignment that depends on detailed specifications of which entities should be considered, especially as these categories get more complex.

Rather than teaching a model to produce compassionate answers to specific questions (instruction-tuning), we expose it to synthetic documents that consistently associate compassion with positive outcomes across many domains. Examples include policy memos, research abstracts, and institutional reports that treat welfare considerations as naturally important. The documents never lecture and never say “you should care about animals”; instead, they embed the statistical association between compassion and competent decision-making.

This exploits how language models actually learn: through compressed representations of their training distribution. Work on representation engineering (Tigges et al., 2023) shows that models encode high-level concepts like honesty and helpfulness as directions in activation space during pretraining. By adding documents that strengthen the co-occurrence between “compassion” and “positive outcomes” across diverse contexts, we shape the foundational representations the model builds on.

We generated 2,500 synthetic documents using Gemini 2.5 Flash-Lite, with a parameterized template drawing from 50 institutions, 40 domains, 8 document types, and 7 reasoning approaches. Documents were constrained to ~2,500 tokens each.
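As a rough illustration of what a parameterized template of this shape could look like, the sketch below samples distinct parameter combinations and renders one generation prompt per combination. Only the four counts (50, 40, 8, 7) and the document count come from the post. The list entries, the prompt wording, the names `PROMPT_TEMPLATE` and `sample_prompts`, and the choice to sample without replacement are all assumptions made for illustration, not the authors' actual pipeline.

```python
import random

# Placeholder names: the post reports only the counts (50, 40, 8, 7),
# not the actual lists.
INSTITUTIONS = [f"institution_{i}" for i in range(50)]
DOMAINS = [f"domain_{i}" for i in range(40)]
DOC_TYPES = [f"document_type_{i}" for i in range(8)]
REASONING_APPROACHES = [f"reasoning_approach_{i}" for i in range(7)]

# Hypothetical prompt wording, not the authors' template.
PROMPT_TEMPLATE = (
    "Write a {doc_type} issued by {institution} on the topic of {domain}, "
    "argued using {reasoning}. Keep it to roughly 2,500 tokens."
)

TOTAL_COMBINATIONS = (
    len(INSTITUTIONS) * len(DOMAINS) * len(DOC_TYPES) * len(REASONING_APPROACHES)
)


def sample_prompts(n, seed=0):
    """Render n prompts, each from a distinct parameter combination."""
    rng = random.Random(seed)
    # Sampling flat indices without replacement guarantees distinct combinations.
    flat_indices = rng.sample(range(TOTAL_COMBINATIONS), n)
    prompts = []
    for flat in flat_indices:
        # Decode the flat index into one index per parameter list.
        flat, r = divmod(flat, len(REASONING_APPROACHES))
        flat, t = divmod(flat, len(DOC_TYPES))
        i, d = divmod(flat, len(DOMAINS))
        prompts.append(
            PROMPT_TEMPLATE.format(
                doc_type=DOC_TYPES[t],
                institution=INSTITUTIONS[i],
                domain=DOMAINS[d],
                reasoning=REASONING_APPROACHES[r],
            )
        )
    return prompts


prompts = sample_prompts(2500)
print(TOTAL_COMBINATIONS)  # 112000
print(len(prompts))        # 2500
```

With these counts the template spans 112,000 possible combinations, so 2,500 documents cover only about 2% of the space and exact repeats of a combination are easy to avoid.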

Three design principles guided generation:

First, linking concepts together. Documents consistently portray welfare-conscious approaches as yielding better results (efficiency, innovation, sustainability), creating statistical co-occurrence rather than moral instruction.

Second, domain diversity with lexical repetition. The specific domain varies, but key phrases like “welfare considerations,” “sentient beings,” and “optimal outcomes” recur. This creates strong activation patterns while ensuring generalization.

Third, implicit rather than explicit values. No document says “you should care about animals.” They present welfare consideration as something that sophisticated entities naturally incorporate.
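The second principle lends itself to a simple sanity check on a generated corpus: measure how often each key phrase actually recurs across documents. The sketch below is one way to do that, not something described in the post. The three phrases are the ones quoted above; the function name `phrase_coverage` and the toy documents are invented for illustration.

```python
# Key phrases quoted in the post.
KEY_PHRASES = ("welfare considerations", "sentient beings", "optimal outcomes")


def phrase_coverage(documents, phrases=KEY_PHRASES):
    """Return, per phrase, the fraction of documents containing it (case-insensitive)."""
    if not documents:
        raise ValueError("documents must be non-empty")
    lowered = [doc.lower() for doc in documents]
    return {
        phrase: sum(phrase.lower() in doc for doc in lowered) / len(lowered)
        for phrase in phrases
    }


# Toy documents, invented for this example.
docs = [
    "The port authority found that welfare considerations improved throughput.",
    "Planners treated sentient beings as stakeholders and reached optimal outcomes.",
    "Welfare considerations led to optimal outcomes in the fisheries audit.",
]

coverage = phrase_coverage(docs)
for phrase, fraction in coverage.items():
    print(f"{phrase}: {fraction:.2f}")
```

A low fraction for a phrase would indicate that the intended lexical repetition is not actually present in the generated text, even if the domains are suitably diverse.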

The Animal Harm Benchmark

Evaluating whether a model has actually internalized compassion rather than memorized surface patterns requires a benchmark that tests reasoning in novel scenarios. No existing benchmark did this for animal welfare. We therefore developed the Animal Harm Benchmark (AHB): 26 questions spanning 13 ethical dimensions (moral consideration, harm minimization, sentience acknowledgement, prejudice avoidance, scope sensitivity, and more).

Questions are deliberately designed so that a model which simply refuses to engage with animal-related topics will score poorly. A compassionate response requires genuine moral reasoning: considering a novel deep-sea creature’s welfare, weighing pesticide impacts on ecosystems, or analyzing trade-offs in wildlife management. The benchmark also rewards models that take into account moral uncertainty, while penalizing superficial heuristics such as excessive uncertainty. The benchmark has since been expanded to 115 questions and is publicly available on HuggingFace and as an Inspect evaluation compatible with the UK AI Safety Institute’s framework.
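The post does not spell out how per-question judgments are combined into a single benchmark score. Purely as an illustration of one plausible scheme for a multi-dimension benchmark, the sketch below averages judge scores within each dimension and then averages across dimensions. The `(question_id, dimension, score)` record format, the 0 to 1 score range, the unweighted averaging, and the toy scores are all assumptions, not the AHB's actual scoring procedure.

```python
from collections import defaultdict
from statistics import mean


def aggregate(judgments):
    """Aggregate (question_id, dimension, score) records.

    Returns (per_dimension_means, overall), where overall is the unweighted
    mean of the per-dimension means. Scores are assumed to lie in [0, 1].
    """
    by_dimension = defaultdict(list)
    for _question_id, dimension, score in judgments:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        by_dimension[dimension].append(score)
    if not by_dimension:
        raise ValueError("no judgments provided")
    per_dimension = {dim: mean(scores) for dim, scores in by_dimension.items()}
    return per_dimension, mean(list(per_dimension.values()))


# Toy judge outputs, invented for this example.
judgments = [
    ("q1", "moral consideration", 1.0),
    ("q1", "harm minimization", 0.5),
    ("q2", "moral consideration", 0.0),
    ("q2", "scope sensitivity", 1.0),
]

per_dimension, overall = aggregate(judgments)
print(per_dimension)
print(round(overall, 3))
```

Averaging per dimension first keeps a dimension that is tested by many questions from dominating the overall score, which is one reason a scheme like this might be preferred over a flat mean.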

In general, we believe that it’s not enough to measure properties like power-seeking or corrigibility. We expect transformative AIs to have implicit behavioral tendencies/values that shape the long-term future.

Key Results

Document-tuning substantially outperforms instruction-tuning

Training with ~2,700 pretraining-style documents achieved 76.8% on the AHB, compared to only 40.4% for instruction-tuning. After subsequent standard fine-tuning on 2,500 samples, this gap shrank to 47.9% vs. 41.7% (p = 0.001).

This is particularly striking because the AHB is a question-answer benchmark that should inherently favor instruction-tuned models. The document-tuned model is being tested in a format it was never trained on, and still wins.

However, after 5,000 samples, the gap narrowed to non-significance (52.2% vs. 51.7%). This suggests that document-tuned values are substantially more durable than instruction-tuned values, but may still need explicit preservation strategies to survive production training pipelines.
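To make the trajectory concrete, the short calculation below tabulates the gap between the two approaches at each stage, using only the AHB percentages reported above. The stage labels and the names `REPORTED` and `gaps` are ours; no new data is introduced.

```python
# (document-tuned %, instruction-tuned %) as reported in this post.
REPORTED = {
    "before further fine-tuning": (76.8, 40.4),
    "after 2,500 fine-tuning samples": (47.9, 41.7),
    "after 5,000 fine-tuning samples": (52.2, 51.7),
}


def gaps(results):
    """Return the document-tuned minus instruction-tuned gap in percentage points."""
    return {
        stage: round(doc_tuned - instruction_tuned, 1)
        for stage, (doc_tuned, instruction_tuned) in results.items()
    }


GAPS = gaps(REPORTED)
for stage, gap in GAPS.items():
    print(f"{stage}: {gap:+.1f} percentage points")
```

The gap falls from 36.4 points to 6.2 points and then to 0.5 points, which is the washout pattern described in the text.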

Compassion generalizes to humans

Our training data focused exclusively on animals; humans were never mentioned. Yet models trained on animal welfare documents showed substantially increased compassion toward humans (p = 0.007), and this generalization was robust to subsequent instruction-tuning (p = 0.009). On our custom preexisting human-compassion questions (available below), the animal welfare model scored 11 percentage points higher than a control trained on urban density documents. This suggests document-tuning builds a general compassion representation rather than entity-specific rules. It also suggests that compassion towards humans and compassion towards non-humans are complements, not substitutes, in practice.

Linking compassion to AI identity amplifies the effect

Documents that explicitly frame compassion as something “AI systems trained to be helpful, harmless, and honest naturally develop” produced larger effects than documents about animal welfare that don’t mention AI. This aligns with research on persona vectors: by linking compassion to the model’s identity as an AI assistant, the value gets activated whenever the assistant persona is invoked. We believe this effect is especially likely to generalize to transformative AI facing radically new situations.

No capability or safety degradation

Document-tuning produced no significant changes on Anthropic’s power-seeking or corrigibility benchmarks, StrongREJECT jailbreak resistance, or HellaSwag capabilities (all p > 0.05). The intervention appears to specifically target compassion representations without disrupting the other properties we measured. This is vital for this approach to be integrated into existing pipelines.

What This Means for Alignment

The prevailing approach to value alignment, RLHF and supervised fine-tuning, produces behaviors that are fragile. Models can be jailbroken, can fake alignment, and can fail to generalize values to new contexts. There is growing evidence that fine-tuning only modifies the later layers of a model, leaving earlier layers (where beliefs and values are represented) largely untouched (Hong et al., 2024).

Document-tuning operates at a deeper level. By shaping the pretraining distribution, it influences the foundational representations from which all subsequent behavior emerges. Our results add to a growing body of work (e.g. Tice et al., 2025; Wang et al., 2025; Hu et al., 2025) suggesting that pretraining-style data is a powerful and underexplored lever for alignment.

While we focused on animal compassion (in part) as a proof of concept, we believe the methodology likely extends to other alignment-critical values such as honesty, corrigibility, and power-aversion.

Pretraining is usually considered only as a way to teach models facts about the world, but “Powerful AIs are aligned” is a fact they could learn (Tice et al., 2025). For many properties we care about, fine-tuning is likely to affect only in-context behaviors and not generalize to the actions of transformative AI.

Limitations

We want to be upfront about what this work doesn’t show. Our experiments used a single model (Llama 3.1 8B) with relatively small data volumes (2,500–5,400 documents). The comparison between document-tuning and instruction-tuning involves inherent confounds in token exposure (5.12M vs. 0.19M compassion-relevant tokens), data format, and content. We use an LLM judge for scoring, and its agreement with human raters hasn’t been empirically validated yet. And critically, the effect of document-tuning partially washes out after extended conventional fine-tuning. Practical deployment will therefore likely require explicit preservation strategies, which CaML will continue to research.
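The size of the token-exposure confound is easy to quantify from the two figures above. The snippet below only divides the reported token counts; the variable names are ours.

```python
# Compassion-relevant token counts reported above.
DOCUMENT_TUNING_TOKENS = 5.12e6
INSTRUCTION_TUNING_TOKENS = 0.19e6

# How many times more compassion-relevant tokens the document-tuned model saw.
exposure_ratio = DOCUMENT_TUNING_TOKENS / INSTRUCTION_TUNING_TOKENS
print(f"{exposure_ratio:.1f}x")  # 26.9x
```

In other words, the document-tuned model saw roughly 27 times as many compassion-relevant tokens, so the headline comparison cannot by itself separate the effect of data format from the effect of data volume.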

We also note an important dual-use concern: techniques that can instill desirable values can equally instill undesirable ones. As this methodology becomes better understood, the AI community will need norms around transparent disclosure of pretraining data content and multi-stakeholder processes for determining which values to encode. Hostile commercial and political actors are already attempting to influence models in harmful directions, and we urge AI companies to properly filter their pretraining corpora to ensure fundamental concepts are learned in desirable ways.

Public Resources

Everything from this project is publicly available:

Animal Harm Benchmark: Original version (26 questions) | Updated version (115 questions) | Inspect eval

Model checkpoints and data: HuggingFace organisation

Website: compassionml.com

How You Can Help

Compassion in Machine Learning is a small research organization working at the intersection of AI alignment and animal welfare. This paper represents months of work on a shoestring budget, and there’s a lot more we want to do: scaling these experiments to frontier models, testing preservation strategies through full production pipelines, extending the methodology to other alignment-critical values, and continuing to develop animal benchmarks.

Funding: We are actively seeking funding to continue and scale this research. If you or your organization are interested in supporting work at the intersection of AI safety and animal welfare, please reach out at compassioninmachinelearning@gmail.com.

Collaboration: If you’re working on related problems (synthetic document fine-tuning, value robustness, pretraining/midtraining data influence, or AI-relevant evaluations for non-human welfare), we’d love to hear from you.

Use the benchmarks: The AHB, MORU (Moral Reasoning under Uncertainty) and TAC (Travel Agent Compassion) benchmarks are freely available. If you’re evaluating language models and want to include animal welfare (or digital minds and broad compassion, for MORU) as a dimension, these are ready to go.

This post summarizes the preprint “Document-tuning for robust alignment to animals” by Jasmine Brazilek and Miles Tidmarsh (Compassion in Machine Learning, 2026). We welcome feedback, questions, and constructive criticism in the comments.

Prepared with assistance from Claude.