Aaron_Scher - LessWrong

Refusal in LLMs is mediated by a single direction

This might be a dumb question(s), I'm struggling to focus today and my linear algebra is rusty.

Is the observation that 'you can do feature ablation via weight orthogonalization' a new one?
It seems to me like this (feature ablation via weight orthogonalization) is a pretty powerful tool which could be applied to any linearly represented feature. It could be useful for modulating those features, and as such is another way to do ablations to validate a feature (part of the 'how do we know we're not fooling ourselves about our results' toolkit). Does this seem right? Or does it not actually add much?

Thinking about AI training runs scaling to the $100b/1T range. It seems really hard to do this as an independent AGI company (not owned by tech giants, governments, etc.). It seems difficult to raise that much money, especially if you're not bringing in substantial revenue or it's not predicted that you'll be making a bunch of money in the near future.

What happens to OpenAI if GPT-5 or the ~5b training run isn't much better than GPT-4? Who would be willing to invest the money to continue? It seems like OpenAI either dissolves or gets acquired. Were Anthropic founders pricing in that they're likely not going to be independent by the time they hit AGI — does this still justify the existence of a separate safety-oriented org?

This is not a new idea, but I feel like I'm just now taking some of it seriously. Here's Dario talking about it recently,

I basically do agree with you. I think it’s the intellectually honest thing to say that building the big, large scale models, the core foundation model engineering, it is getting more and more expensive. And anyone who wants to build one is going to need to find some way to finance it. And you’ve named most of the ways, right? You can be a large company. You can have some kind of partnership of various kinds with a large company. Or governments would be the other source.

Now, maybe the corporate partnerships can be structured so that AGI companies are still largely independent but, idk man, the more money invested the harder that seems to make happen. Insofar as I'm allocating probability mass between 'acquired by big tech company', 'partnership with big tech company', 'government partnership', and 'government control', acquired by big tech seems most likely, but predicting the future is hard.

How to Model the Future of Open-Source LLMs?

Aaron_Scher20d30

Um, looking at the scaling curves and seeing diminishing returns? I think this pattern is very clear for metrics like general text prediction (cross-entropy loss on large texts), less clear for standard capability benchmarks, and to-be-determined for complex tasks which may be economically valuable.

General text prediction: see Chinchilla, Fig 1 of the GPT-4 technical report
Capability benchmarks: see epoch post, the ~4th figure here
Complex tasks: See GDM dangerous capability evals (Fig 9, which indicates Ultra is not much better than Pro, despite likely being trained on >5x the compute, though training details not public)

To be clear, I'm not saying that a $100m model will be very close to a $1b model. I'm saying that the trends indicate they will be much closer than you would think if you only thought about how big a 10x difference in training compute is, without being aware of the empirical trends of diminishing returns. The empirical trends indicate this will be a relatively small difference, but we don't have nearly enough data for economically valuable tasks / complex tasks to be confident about this.

How to Model the Future of Open-Source LLMs?

Aaron_Scher20d40

Yeah, these developments benefit close-sourced actors too. I think my wording was not precise, and I'll edit it. This argument about algorithmic improvement is an argument that we will have powerful open source models (and powerful closed-source models), not that the gap between these will necessarily shrink. I think both the gap and the absolute level of capabilities which are open-source are important facts to be modeling. And this argument is mainly about the latter.

How to Model the Future of Open-Source LLMs?

Answer by Aaron_ScherApr 19, 20245-3

Yeah, I think we should expect much more powerful open source AIs than we have now. I've been working on a blog post about this, maybe I'll get it out soon. Here are what seem like the dominant arguments to me:

Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance wise.
There are numerous (maybe 7) actors in the open source world who are at least moderately competent and want to open source powerful models. There is a niche in the market for powerful open source models, and they hurt your closed-source competitors.
I expect there is still tons of low-hanging fruit available in LLM capabilities land. You could call this "algorithmic progress" if you want. This will decrease the compute cost necessary to get a given level of performance, thus raising the AI capability level accessible to less-resourced open-source AI projects. [edit: but not exclusively open-source projects (this will benefit closed developers too). This argument is about the absolute level of capabilities available to the public, not about the gap between open and closed source.]

A Review of In-Context Learning Hypotheses for Automated AI Alignment Research

Aaron_Scher23d10

The implication of ICL being implicit BI is that the model is locating concepts it already learned in its training data, so ICL is not a new form of learning that has not been seen before.

I'm not sure I follow this. Are you saying that, if ICL is BI, then a model could not learn a fundamentally new concept in context? Can some of the hypotheses not be unknown — e.g., the model's no-context priors are that it's doing wikipedia prediction (50%), chat bot roleplay (40%), or some unknown role (10%). And ICL seems like it could increase the weight on the unknown role. Meanwhile, actually figuring out how to do a good job in the previously-unknown role would require piecing together other knowledge the model has — and sufficiently strong building blocks would allow a lot of learning of new concepts.

LLM Evaluators Recognize and Favor Their Own Generations

Aaron_Scher24d10

For example, if the GPT-4 evaluator gave a weighted score of to a summary generated by Claude 2 and a weighted score of $3.0$ to its own summary for the same article, then its final normalized self-preference score for the Claude summary would be $2 / (2 + 3) = 0.4$ .

Should this be 3/(2+3) = 0.6? Not sure I've understood correctly.

Creating unrestricted AI Agents with Command R+

Aaron_Scher25d21

I expect a lot more open releases this year and am committed to test their capabilities and safety guardrails rigorously.

Glad you're planning on continual testing, that seems particularly important here, where the default is every once in awhile some new report comes out with a single data point about how good some model is and people slightly freak out. Having the context of testing numerous models over time seems crucial for actually understanding the situation and being able to predict upcoming trends. Hopefully you have and will continue to find ways to reduce the effort needed to run marginal experiments, e.g., having a few clearly defined tasks you repeatedly use, reusing finetuning datasets, etc.

Aaron_Scher's Shortform

Aaron_Scher1mo30

Slightly Aspirational AGI Safety research landscape

This is a combination of an overview of current subfields in empirical AI safety and research subfields I would like to see but which do not currently exist or are very small. I think this list is probably worse than this recent review, but making it was useful for reminding myself how big this field is.

Interpretability / understanding model internals
- Circuit interpretability
- Superposition study
- Activation engineering
- Developmental interpretability
Understanding deep learning
- Scaling laws / forecasting
- Dangerous capability evaluations (directly relevant to particular risks, e.g., biosecurity, self-proliferation)
- Other capability evaluations / benchmarking (useful for knowing how smart AIs are, informing forecasts), including persona evals
- Understanding normal but poorly understood things, like in context learning
- Understanding weird phenomenon in deep learning, like this paper
- Understand how various HHH fine-tuning techniques work
AI Control
- General supervision of untrusted models (using human feedback efficiently, using weaker models for supervision, schemes for getting useful work done given various Control constraints)
- Unlearning
- Steganography prevention / CoT faithfulness
- Censorship study (how censoring AI models affects performance; and similar things)
Model organisms of misalignment
- Demonstrations of deceptive alignment and sycophancy / reward hacking
- Trojans
- Alignment evaluations
- Capability elicitation
Scaling / scalable oversight
- RLHF / RLAIF
- Debate, market making, imitative generalization, etc.
- Reward hacking and sycophancy (potentially overoptimization, but I’m confused by much of it)
- Weak to strong generalization
- General ideas: factoring cognition, iterated distillation and amplification, recursive reward modeling
Robustness
- Anomaly detection
- Understanding distribution shifts and generalization
- User jailbreaking
- Adversarial attacks / training (generally), including latent adversarial training
AI Security
- Extracting info about models or their training data
- Attacking LLM applications, self-replicating worms
Multi-agent safety
- Understanding AI in conflict situations
- Cascading failures
- Understanding optimization in multi-agent situations
- Attacks vs. defenses for various problems
Unsorted / grab bag
- Watermarking and AI generation detection
- Honesty (model says what it believes)
- Truthfulness (only say true things, aka accuracy improvement)
- Uncertainty quantification / calibration
- Landscape study (anticipating and responding to risks from new AI paradigms like compound AI systems or self play)

Don’t quite make the list:

Whose values? Figuring out how to aggregate preferences in RLHF. This seems like it’s almost certainly not a catastrophe-relevant safety problem, and my weak guess is that it makes other alignment properties harder to get (e.g., incentivizes sycophancy and makes jailbreaks easier by causing the model’s goals to be more context/user dependent). This work seems generally net positive to me, but it’s not relevant to catastrophic risks and thus is relatively low priority.
Fine-tuning that calls itself “alignment”. I think it’s super lame that people are co-opting language from AI safety. Some of this work may actually be useful, e.g., by finding ways to mitigate jailbreaks, but it’s mostly low quality.

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

Aaron_Scher3mo20

There are enough open threads that I think we're better off continuing this conversation in person. Thanks for your continued engagement.

LESSWRONG
LW

Posts

Wiki Contributions

Comments