Quintin Pope

[Linkpost] A General Language Assistant as a Laboratory for Alignment

I can see how these findings wouldn't change Yudkowsky's or your views. I think the main difference is that I'm much more optimistic about interpretability techniques, so I think the story where we use them to extract a human values module from a pretrained model is much more realistic than many here think. I see the alignability results here as (somewhat weak) evidence that there is something like a human values module, and that the module is reasonably accessible via gradient descent.

My interpretability optimism comes from the belief that our current approaches to training deep networks are REALLY bad for interpretability. Dropout and an L2 weight penalty seem like they'd force the network to distribute its internal representations of concepts across many neurons. E.g., at a dropout rate of 0.1, a model that had a single neuron uniquely representing "tree" would lose access to that concept on ~10% of training steps, so representations have to be distributed across multiple neurons. Similarly, an L2 penalty discourages the model from learning sparse connections between neurons. Both of these seem like terrible ideas from an interpretability standpoint.
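As a toy illustration of the dropout point (hypothetical numbers, a sketch rather than anything from the paper): a concept carried by a single neuron vanishes from the forward pass whenever that neuron is dropped, while a redundant multi-neuron code almost never does, so gradient pressure favors the distributed code.

```python
import random

random.seed(0)

def dropout_mask(p: float) -> int:
    """Return 0 with probability p (neuron dropped), else 1."""
    return 0 if random.random() < p else 1

p = 0.1
steps = 100_000

# A single neuron uniquely encoding "tree" is simply absent on ~10% of steps.
dropped = sum(1 for _ in range(steps) if dropout_mask(p) == 0)
print(f"fraction of steps with the lone 'tree' neuron unavailable: {dropped / steps:.3f}")

# A redundant 3-neuron code only fails when all three are dropped at once (~0.1%).
all_dropped = sum(
    1 for _ in range(steps)
    if all(dropout_mask(p) == 0 for _ in range(3))
)
print(f"fraction of steps with all 3 redundant neurons dropped: {all_dropped / steps:.4f}")
```

The ~100x gap between the two failure rates is the (informal) incentive toward distributed, harder-to-interpret representations.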

One result you may be more interested in is the paper's investigation of "alignment taxes" (the potential loss in performance of aligned models compared with unaligned models):

The "HHH prompt" is a prompt they use to induce "Helpful, Honest and Harmless" output. Context distillation refers to training an unprompted model to imitate the logits generated when using the HHH prompt.
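A minimal sketch of the context distillation objective, with toy hand-picked logits and a hand-rolled KL term (the paper's actual setup runs at scale on real models; everything below is illustrative):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how badly the student distribution q matches the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Teacher: the model run WITH the HHH prompt in context (toy next-token logits).
teacher_logits = [2.0, 0.5, -1.0]
# Student: the same model WITHOUT the prompt. Minimizing the KL term pushes the
# student's logits toward the teacher's, baking the prompt's effect into the weights.
student_logits = [1.0, 1.0, 0.0]

loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
print(f"distillation loss (KL): {loss:.4f}")
```

Once trained, the student behaves as if the HHH prompt were present without spending any context window on it.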

Their alignment tax results indicate the models suffer a small alignment penalty on LAMBADA, which is roughly constant with respect to model size (and therefore decreases in relative importance as model performance increases). In contrast, smaller models suffer a large alignment tax on programming evaluations, while larger models actually get a slight alignment benefit.

Some more interesting results:

The HHH prompt doesn't contain examples of "Harmlessness", where the assistant declines to help the user perform harmful acts. Despite this, larger HHH models score better on harmlessness evaluations, suggesting the larger models are better at generalizing to other aspects of human values, rather than merely matching the expected format or being generally more capable. 

Additionally, using prompt distillation helps larger models and hurts smaller models on TruthfulQA, which asks about common misconceptions; larger models represent those misconceptions more faithfully. Without the prompt, models become less accurate with increasing size, whereas with the distilled prompt, larger models become more accurate. This is another example of the alignment tax decreasing for larger models. It's also a potential example of an "alignment failure" being corrected by an alignment approach.

TTS audio of "Ngo and Yudkowsky on alignment difficulty"

I'd not known of the Nonlinear Library. Thank you for letting me know about it!

The Nonlinear Library mostly fills the gap I was aiming to address with this post and is more or less what I was suggesting with "In the future, we may even want some sort of integrated TTS service for long LessWrong posts."

I do think the Nonlinear Library versions could benefit from some more pre-processing. E.g., removing the exact timestamps of the discussion posts. Also maybe adding descriptions for figures/images in the original texts.

Solve Corrigibility Week

Your Google Docs link leads to Alex's "Corrigibility Can Be VNM-Incoherent" post. Is this a mistake, or am I misunderstanding something?

AI Tracker: monitoring current and near-future risks from superscale models

This seems like a great resource. I also like the way it’s presented. It’s very clean.

I’d appreciate more focus on the monetary return on investment large models provide their creators. I think that’s the key metric that will determine how far firms scale up these large models. Relatedly, I think it’s important to track advancements that improve model/training efficiency because they can change the expected ROI for further scaling models.

Applications for AI Safety Camp 2022 Now Open!

This looks very promising. I think I’ll apply.

About the application: the open ended questions prompt with

“> 5 concise lines”

Does this mean “More than 5 concise lines” or does it mean “Put your 5 concise lines here”? Thanks for the clarification.

A positive case for how we might succeed at prosaic AI alignment

I agree that transformers vs. other architectures is a better example of the field “following the leader” because there are lots of other strong architectures (Perceiver, MLP-Mixer, etc.). In comparison, using self-supervised transfer learning is just an objectively good idea you can apply to any architecture, and one the brain itself almost surely uses. The field would have converged on it regardless of the dominant architecture.

One hopeful sign is how little attention the ConvBERT language model has gotten. It mixes convolution operations with self-attention, letting the attention heads focus on global patterns while convolution handles the local patterns it's better suited for. ConvBERT is more compute-efficient than a standard transformer, but hasn't made much of a splash. It shows the field can ignore low-profile advances made by smaller labs.

For your point about the value of alignment: I think there’s a pretty big range of capabilities where the marginal return on extra capabilities is higher than the marginal return on extra alignment. Also, you seem focused on avoiding deception/treacherous turns, which I think are a small part of alignment costs until near human capabilities.

I don’t know what sort of capabilities penalty you pay for using a myopic training objective, but I don’t think there’s much margin available before voluntary mass adoption becomes implausible.

A positive case for how we might succeed at prosaic AI alignment

The reason self supervised approaches took over NLP is because they delivered the best results. It would be convenient if the most alignable approach also gave the best results, but I don’t think that’s likely. If you convince the top lab to use an approach that delivered worse results, I doubt much of the field would follow their example.

Why I'm excited about Redwood Research's current project

I think this is a very cool approach and a useful toy alignment problem. I’m interested in your automated toolset for generating adversaries to the classifier. My recent PhD work has been in automatically generating text counterfactuals, which are closely related to adversaries but less constrained in the modifications they make. My advisor and I published a paper with a new whitebox method for generating counterfactuals.

For generating text adversaries specifically, one great tool I’ve found is TextAttack. It implements many recent text adversary methods in a common framework with a straightforward interface. Current text adversary methods aim to fool the classifier without changing the “true” class humans assign. They aim to do this by imposing various constraints on the modifications allowed for the adversarial method. E.g., a common constraint is to force a minimum similarity between the encodings of the original and adversary using some model like the Universal Sentence Encoder. This is supposed to make the adversarial text look “natural” to humans while still fooling the model. I think that’s somewhat like what you’re aiming for with “strategically hidden” failure cases.
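To sketch the constrained-search shape these methods share (this is a stdlib toy, not TextAttack's actual API; the keyword classifier, synonym table, and overlap-based similarity function are all hypothetical stand-ins for a real model, a synonym source, and something like Universal Sentence Encoder cosine similarity):

```python
# Toy bag-of-words "classifier": positive iff any positive keyword appears.
POSITIVE = {"great", "excellent", "wonderful", "good"}

def classify(words):
    return "pos" if POSITIVE & set(words) else "neg"

# Hypothetical synonym table standing in for a real word-substitution source.
SYNONYMS = {"great": ["fine", "solid"], "movie": ["film"]}

def similarity(a, b):
    """Fraction of unchanged positions (stands in for encoder cosine similarity)."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))

def attack(words, min_sim=0.7):
    """Greedy single-word swaps that flip the label while staying above a similarity floor."""
    original_label = classify(words)
    for i, w in enumerate(words):
        for syn in SYNONYMS.get(w, []):
            candidate = words[:i] + [syn] + words[i + 1:]
            if classify(candidate) != original_label and similarity(words, candidate) >= min_sim:
                return candidate
    return None  # no adversary found under the constraint

adv = attack(["a", "truly", "great", "movie"])
print(adv)  # ['a', 'truly', 'fine', 'movie'] — the swap flips the toy classifier
```

The `min_sim` floor is the piece doing the "look natural to humans" work: it rejects flips that change too much of the input, which is the same role the encoder-similarity constraints play in real attack methods.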

I’m also very excited about this work. Please let me know if you have any questions about adversarial/counterfactual methods in NLP and if there are any other ways I can help!

What health-related tips do you have for buying meat?

My guess is the optimal health decision re meat consumption is to fully replace terrestrial meat with wild-caught, low mercury seafood.

[Prediction] We are in an Algorithmic Overhang, Part 2

“… do BCI's mean brainwashing for the good of the company? I think most people wouldn't want to work for such a company.”

I think this is a mistake lots of people make when considering potentially dystopian technology: that dangerous developments can only happen if they’re imposed on people by some outside force. Most people in the US carry tracking devices with them wherever they go, not because of government mandate, but simply because phones are very useful.

Adderall use is very common in tech companies, esports, and other highly competitive environments. Directly manipulating reward/motivation circuits is almost certainly far more effective than Adderall. I expect the potential employees of the sort of company I discussed would already be using BCIs to enhance their own productivity, and it’s a relatively small step from there to enhancing collaborative efficiency with BCIs.

The subjective experience for workers using such BCIs is probably positive. Many of the straightforward ways to increase workers’ productivity seem fairly desirable. They’d be part of an organisation they completely trust and that completely trusts them. They’d find their work incredibly fulfilling and motivating. They’d have a great relationship with their co-workers, etc.

Brain-to-brain politicking is of course possible, depending on the implementation. The difference is that there’s an RL model directly influencing the prevalence of such behaviour. I expect most unproductive forms of politicking to be removed eventually.

Finally, such concerns are very relevant to AI safety. A group of humans coordinated via BCI with unaligned AI is not much more aligned than the standard paper-clipper AI. If such systems arise before superhuman pure AI, then I expect them to represent a large part of AI risk. I’m working on a draft timeline where this is the case.
