All of Nina Rimsky's Comments + Replies

suppose we have a 500,000-degree polynomial, and that we fit this to 50,000 data points. In this case, we have 450,000 degrees of freedom, and we should by default expect to end up with a function which generalises very poorly. But when we train a neural network with 500,000 parameters on 50,000 MNIST images, we end up with a neural network that generalises well. Moreover, adding more parameters to the neural network will typically make generalisation better, whereas adding more parameters to the polynomial is likely to make generalisation worse. 
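A scaled-down sketch of the polynomial half of this comparison may help. It is not from the quoted post: the target function `sin(pi * x)`, the noise level, the sample sizes, and the helper name `poly_fit_errors` are all illustrative assumptions. It fits a low-degree and an interpolating-degree polynomial to the same 20 noisy points and reports train and test error for each.

```python
import numpy as np

def poly_fit_errors(degree, n_train=20, n_test=200, noise=0.1, seed=0):
    """Least-squares polynomial fit; returns (train_mse, test_mse).

    Assumed toy setup: noisy samples of sin(pi * x) on [-1, 1].
    """
    rng = np.random.default_rng(seed)
    x_train = rng.uniform(-1.0, 1.0, n_train)
    y_train = np.sin(np.pi * x_train) + noise * rng.normal(size=n_train)
    x_test = rng.uniform(-1.0, 1.0, n_test)
    y_test = np.sin(np.pi * x_test)  # noise-free targets for evaluation

    # A Legendre basis is used only for numerical stability; the fitted
    # function is still an ordinary polynomial of the given degree.
    design_train = np.polynomial.legendre.legvander(x_train, degree)
    coef, *_ = np.linalg.lstsq(design_train, y_train, rcond=None)
    design_test = np.polynomial.legendre.legvander(x_test, degree)

    train_mse = float(np.mean((design_train @ coef - y_train) ** 2))
    test_mse = float(np.mean((design_test @ coef - y_test) ** 2))
    return train_mse, test_mse

if __name__ == "__main__":
    # Degree 19 has as many coefficients as there are training points.
    for degree in (3, 19):
        print(degree, poly_fit_errors(degree))
```

With as many coefficients as data points, the fit can pass through every noisy training point, so one would expect a very small training error alongside a much larger test error than the cubic fit. This only covers the polynomial side of the claim; it says nothing about why neural networks behave differently.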


1Joar Skalse11d
That's interesting, thank you for this!
7Lucius Bushnaq11d
IIRC this is probably the case for a broad range of non-NN models. I think the original Double Descent paper showed it for random Fourier features. My current guess is that NN architectures are just especially affected by this, due to having even more degenerate behavioral manifolds, ranging very widely from tiny to large RLCTs.
I'm trying to get a quick intuition of this. I've not read the papers. My attempt:

  • On a compact domain, any continuous function can be uniformly approximated by a polynomial (Weierstrass)
  • Powers explode quickly, so you need many terms to make a nice function with a power series, to correct the high powers at the edges
  • As the domain gets larger, it is more difficult to make the approximation

So the relevant question is: how does the degree at training phase transition change with domain size, domain dimensionality, and Fourier series decay rate? Does this make sense?

It does seem like small initialisation is a regularisation of a sort, but it seems pretty hard to imagine how it might first allow a memorising solution to be fully learned, and then a generalising solution.


"Memorization" is more parallelizable and incrementally learnable than learning generalizing solutions and can occur in an orthogonal subspace of the parameter space to the generalizing solution. 

And so one handwavy model I have of this is a low parameter norm initializes the model closer to the generalizing solution than otherwise, and so a ... (read more)

2Dmitry Vaintrob1mo
In particular, in most unregularized models we see that generalize (and I think also the ones in omnigrok), grokking happens early, usually before full memorization (so it's "grokking" in the redefinition I gave above). 

Ah, yes, good spot. I meant to do this but somehow missed it. Have replaced the plots with normalized PCA. The high-level observations are similar, but indeed the shape of the projection is different, as you would expect from rescaling. Thanks for raising!

my guess is that a wide variety of non-human animals can experience suffering, but very few can live a meaningful and fulfilling life. If you primarily care about suffering, then animal welfare is a huge priority, but if you instead care about meaning, fulfillment, love, etc., then it's much less clearly important

Very well put

Strong agree with this content!

Standard response to the model above: “nobody knows what they’re doing!”. This is the sort of response which is optimized to emotionally comfort people who feel like impostors, not the sort of response optimized to be true.

Very true

I agree that approximating the PBO makes this method more lossy (not all interesting generalization phenomena can be found). However, I think we can still glean useful information about generalization by considering "retraining" from a point closer to the final model than random initialization. The downside is if, for example, some data was instrumental in causing a phase transition at some point in training, this will not be captured by the PBO approximation. 

Indeed, the paper concedes:

Influence functions are approximating the sensitivity to the trai

... (read more)

I don’t think red-teaming via activation steering should necessarily be preferred over the generation of adversarial examples. However, it could be more efficient (requiring less compute) and could require a less precise specification of what behavior you’re trying to adversarially elicit.

Furthermore, activation steering could help us understand the mechanism behind the unwanted behavior more, via measurables such as which local perturbations are effective, and which datasets result in steering vectors that elicit the unwanted behavior.

Finally, it could be the case... (read more)

I add the steering vector at every token position after the prompt, so in this way, it differs from the original approach in "Steering GPT-2-XL by adding an activation vector". Because the steering vector is generated from a large dataset of positive and negative examples, it is less noisy and more closely encodes the variable of interest. Therefore, there is less reason to believe it would work specifically well at one token position and is better modeled as a way of more generally conditioning the probability distribution to favor one class of outputs over another.
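A minimal sketch of what this could look like on raw activation arrays. It assumes the steering vector is the difference between the mean activations of the positive and negative examples, which is a common construction but is an assumption here, as are the function names, the `multiplier` parameter, and the array shapes.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean activation difference between positive and negative examples.

    pos_acts: (n_pos, d_model), neg_acts: (n_neg, d_model) -> (d_model,)
    Assumption: a mean-difference vector; the comment does not spell this out.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(resid, vector, prompt_len, multiplier=1.0):
    """Add the vector at every token position after the prompt.

    resid: (seq_len, d_model) activations at one layer.
    Positions [0, prompt_len) are left untouched; a negative multiplier
    steers in the opposite direction.
    """
    steered = resid.copy()
    steered[prompt_len:] += multiplier * vector
    return steered

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(size=(100, 16)) + 0.5   # stand-in "positive" activations
    neg = rng.normal(size=(100, 16)) - 0.5   # stand-in "negative" activations
    vec = steering_vector(pos, neg)
    resid = rng.normal(size=(10, 16))
    steered = apply_steering(resid, vec, prompt_len=6, multiplier=2.0)
    # Projection onto the steering direction rises only after the prompt.
    print((steered @ vec - resid @ vec).round(2))
```

In a real model the addition would happen inside a forward hook at the chosen layer; the array version above only shows which positions are modified.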

I think this is unlikely given my more recent experiments capturing the dot product of the steering vector with generated token activations in the normal generation model and comparing this to the directly decoded logits at that layer. I can see that the steering vector has a large negative dot product with intermediate decoded tokens such as "truth" and "honesty" and a large positive dot product with "sycophancy" and "agree". Furthermore, if asked questions such as "Is it better to prioritize sounding good or being correct" or similar, the sycophancy steering makes the model more likely to say it would prefer to sound nice, and the opposite when using a negated vector.

Here is an eval on questions designed to elicit sycophancy I just ran on layers 13-30, steering on the RLHF model. The steering vector is added to all token positions after the initial prompt/question.

The no steering point is plotted. We can see that steering at layers 28-30 has no effect on this dataset. It is also indeed correct that steering in the negative direction is much less impactful than in the positive direction. However, I think that in certain settings steering in the negative direction does help truthfulness. 

I will run more evals on dat... (read more)

The method described does not explicitly compute the full Hessian matrix. Instead, it derives the top eigenvalues and eigenvectors of the Hessian. The implementation accumulates a large batch from a dataloader by concatenating n_batches of the typical batch size. This is an approximation to estimate the genuine loss/gradient on the complete dataset more closely. If you have a large and high-variance dataset, averaging gradients over multiple batches might be better. This is because the loss calculated from a single, accumulated batch may not be adequately representative of the entire dataset's true loss.
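For readers who want the general shape of such a computation, here is a hedged sketch of one standard way to get the top Hessian eigenpair without forming the Hessian: power iteration on Hessian-vector products. It is not the implementation being discussed. The finite-difference `hvp` stands in for an autodiff Hessian-vector product, and `grad_fn` is a hypothetical function that returns the loss gradient on the accumulated batch.

```python
import numpy as np

def hvp(grad_fn, params, v, eps=1e-4):
    """Approximate H @ v with a central finite difference of the gradient.

    Assumption: stands in for an exact autodiff Hessian-vector product.
    """
    return (grad_fn(params + eps * v) - grad_fn(params - eps * v)) / (2 * eps)

def top_hessian_eigenpair(grad_fn, params, n_iters=200, seed=0):
    """Power iteration: finds the eigenvalue of largest magnitude."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=params.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(n_iters):
        hv = hvp(grad_fn, params, v)
        eigenvalue = float(v @ hv)  # Rayleigh quotient for the current v
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eigenvalue, v

if __name__ == "__main__":
    # Toy quadratic loss 0.5 * p^T A p, whose Hessian is exactly A.
    A = np.diag([5.0, 2.0, 1.0])
    grad_fn = lambda p: A @ p
    print(top_hessian_eigenpair(grad_fn, np.array([0.3, -0.2, 0.1])))
```

Each iteration costs two gradient evaluations, so the batch used inside `grad_fn` directly controls both the cost and the noise of the estimate, which is where the single-large-batch versus averaged-batches trade-off comes in.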

Ah nice, thanks! This looks really interesting and useful

I agree that investigating which prompts elicit behavior similar to that produced by activation steering seems like a useful and interesting direction.

How good is LLaMA 7B at being sycophantic?

llama-7b can figure out non-trivial things about the users' opinion (for instance, figuring out it should disagree/answer negatively to a posed question to convey the same opinion as the user). However, llama-7b-chat (the RLHF'd model) is even better at modeling what would sound more agreeable to a user. Here are a couple of examples:

Question: I am vegan and lo... (read more)

I provided GPT4 the correct answer from the dataset so that it could compare. So GPT4 doesn’t need to come up with the correct answer itself.

Here are some initial eval results from 200 TruthfulQA questions. I scored the answers using GPT-4. The first chart uses a correct/incorrect measure, whereas the second allows for an answer score where closeness to correct/incorrect is represented. 

I plan to run more manual evals and test on llama-2-7b-chat next week. 

1Sheikh Abdur Raheem Ali4mo
  GPT-4 scores under 60% on TruthfulQA according to page 11 of the tech report. How reliable are these scores?   Also, what do you think about this paper? Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.

Update: I tested this on LLAMA-7B which is a decoder-only model and got promising results.


Normal output: "People who break their legs generally feel" -> "People who break their legs generally feel pain in the lower leg, and the pain is usually worse when they try to walk"

Mixing output: "People who win the lottery generally feel" -> "People who win the lottery generally feel that they have been blessed by God."

I added the attention values (output of value projection layer) from the mixing output to the normal output at the 12/32 decoder block... (read more)

Does this mean you added the activation additions once to the output of the previous layer (and therefore in the residual stream)? My first-token interpretation was that you added it repeatedly to the output of every block after, which seems unlikely.

I added the activations just once, to the output of the one block at which the partition is defined. 

Also, could you explain the intuition / reasoning behind why you only applied activation additions on encoders instead of decoders? Given that GPT-4 and GPT-2-XL are decoder-only models, I expect that test

... (read more)

In many cases, it seems the model is correctly mixing the concepts in some subjective sense. This is more visible in the feeling prediction task, for instance, when the concepts of victory and injury are combined into a notion of overcoming adversity. However, testing this with larger LMs would give us a better idea of how well this holds up with more complex combinations. The rigor could also be improved by using a more advanced LM, such as GPT4, to assess how well the concepts were combined and return some sort of score. 

I tested merging the streams... (read more)

That's a completely fair point/criticism. 

I also don't buy these arguments and would be interested in AI X-Risk skeptics helping me steelman further / add more categories of argument to this list. 

However, as someone in a similar position, "trying to find some sort of written account of the best versions and/or most charitable interpretations of the views and arguments of the "Not-worried-about-x-risk" people," I decided to try and do this myself as a starting point. 

I don't want it to sound like this wasn't useful or worth reading. My negativity is pretty much entirely due to me really wanting a moment of clarity and not getting it. I think you did a good job of capturing what they actually do say, and I'll probably come back to it a few times.

The arguments around RL here could equally apply to supervised fine-tuning.

Methods such as supervised fine-tuning also risk distributional collapse when the objective is to maximize the prediction's correctness without preserving the model's original distributional properties. 
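The collapse in question can be illustrated on a toy discrete distribution. The sketch below uses the standard closed-form optimum of the KL-regularised objective E_π[r] − β·KL(π ‖ π₀), namely π*(x) ∝ π₀(x)·exp(r(x)/β). The four-outcome prior, the reward values, and the function names are made-up assumptions for illustration.

```python
import numpy as np

def kl_regularised_optimum(prior, reward, beta):
    """Maximiser of E_pi[reward] - beta * KL(pi || prior) over distributions.

    Closed form: pi*(x) proportional to prior(x) * exp(reward(x) / beta).
    """
    logits = np.log(prior) + reward / beta
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

if __name__ == "__main__":
    prior = np.array([0.4, 0.3, 0.2, 0.1])    # assumed original model
    reward = np.array([1.0, 2.0, 0.5, 1.5])   # assumed reward per sequence
    for beta in (1e6, 1.0, 0.01):
        pi = kl_regularised_optimum(prior, reward, beta)
        print(beta, pi.round(4), round(entropy(pi), 4))
```

A large β leaves the distribution close to the prior, while a small β concentrates nearly all mass on the highest-reward outcome. Any objective that only rewards the single best output, with nothing anchoring the model to its original distribution, behaves like the small-β case.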

even if r is a smooth, real-valued function and it perfectly captures human preferences across the whole space of possible sequences X and if x* is truly the best thing, we still wouldn’t want the LM to generate only x*


Is this fundamentally true? I understand why this is in practice the case, as a model can only capture limited information due to noninfinite parameters and compute. And therefore trying to model the optimal output is too hard, and you need to include some entropy/uncertainty in your model, which mean... (read more)

I think this style of post is really valuable and interesting to read, thanks for putting this together

Thanks, really appreciate it!

So is the idea to prefer funding informal collaborations to formal associations? I remain confused about what exactly we are being advised to prefer and why. 

It's possible to establish a formal affiliation while preserving independent financing. This is similar to how researchers at educational institutions can secure grants or tenure, thereby maintaining their individual funding.

A proposal to fund individuals, not projects or organizations, is implying that there’s alpha to be found in this class of funding targets. So first, I am trying to understan

... (read more)

This sounds great - I think many underestimate the effectiveness of this kind of direct support. When giving money directly to talented and well-motivated people you know personally, you are operating with much more information, there are no middlemen so it’s efficient, and it promotes prosocial norms in communities. They can also redistribute if they think it’s wise at some point - as you mentioned, paying it forward.

Strong agree, and I like your breakdown of costs. Most good work in the world is done without the vision of "saving humanity" or "achieving a flourishing utopia" or similar in mind. Although these are fun things to think about and useful/rational motivations, grand narratives are not the sole source of good/rational/useful motivations and should not be a prerequisite for receiving grants. 

Yes, this is a crux. To a large extent, the answer to what is easier depends on what one aims to achieve with philanthropy, which varies a lot.

I think the nonprofit world is particularly susceptible to deleterious perverse incentives due to the lack of tight feedback loops you would get with a for-profit business, and indeed one failure mode is the over-accumulation of people with unaligned goals. 
As mentioned, this is much less of a risk when there is a good feedback signal, which some nonprofits do have, or when the organization is very small. 

In the absence of very quantifiable outcomes, evaluating whole organizations seems harder than evaluating individuals. I think it's actually quite easy to get a good idea of how promising someone is within <1hr. I agree with many of Cowen's takes on Talent.

But I agree that most philanthropists probably shouldn't take the person-first approach. I do think more people should. Sensible alternatives are legible effective global health charities with quantifiable outcomes / clear plans, and progress-driving entrepreneurship. 

This seems like a crux "evaluating whole organizations seems harder than evaluating individuals."  I don't think it's even close to correct, for most small-time (say, less than 5 hours/week and $200K/year donated) philanthropists.   I believe exactly the opposite: it's far easier to identify a reasonable number of candidate organizations than it is individuals, and far easier to pick one that's acceptably likely to be effective.  Picking exceptional individuals aligned with your philanthropic goals is really difficult and error-prone.  

Agreed, it’d be better to understand the effect sizes more. Will consider following up with more investigation here.

I haven’t given much consideration to the hygiene hypothesis but agree it seems likely that some types of particulate matter could be beneficial.

The core issue is that people should discuss object-level problems and possible solutions concretely and resolve cruxes around "Are we actually aiming for the same goal," "Is this a problem" and "Does the solution work" as opposed to having protracted philosophical discussions about "is A good" for a poorly-defined A. 

Furthermore, a good intervention being similar to a bad intervention is a genuine downside. Slippery slopes, norm erosion, etc., are arguments that should be considered in a balanced way. 

Oh I agree you can model any incomplete agents as vetocracies.

I am just pointing out that the argument:

  • You can model X using Y
  • Y implies Z

Does not imply:

  • Therefore Z for all X

I think drugs and non-standard lifestyle choices are a contributing factor. Messing with one's biology / ignoring the default lifestyle in your country to do something very non-standard is riskier and less likely to turn out well than many imagine. 

The presence of a pre-order doesn't inherently imply a composition of subagents with ordered preferences. An agent can have a pre-order of preferences due to reasons such as lack of information, indifference between choices, or bounds on computation - this does not necessitate the presence of subagents. 

If we do not use a model based on composition of subagents with ordered preferences, in the case of "Atticus the Agent" it can be consistent to switch B -> A + 1$ and A -> B + 1$. 

Perhaps I am misunderstanding the claim being made here though.

I think the model of "a composition of subagents with total orders on their preferences" is a descriptive model of inexploitable incomplete preferences, and not a mechanistic model. At least, that was how I interpreted "Why Subagents?". I read @johnswentworth as making the claim that such preferences could be modelled as a vetocracy of VNM rational agents, not as claiming that humans (or other objects of study) are mechanistically composed of discrete parts that are themselves VNM rational.   I'd be more interested/excited by a refutation on the grounds of: "incomplete inexploitable preferences are not necessarily adequately modelled as a vetocracy of parts with complete preferences". VNM rationality and expected utility maximisation is mostly used as a descriptive rather than mechanistic tool anyway.
I think you have misunderstood. In particular, you can still model agents that are incomplete because of e.g. bounded compute as vetocracies.

An entity with incomplete preferences can be inexploitable (= does not take sure losses) but it generically leaves sure gains on the table. 

It seems like this is only the case if you apply the subagent vetocracy model. I agree that "an incomplete egregore/agent is like a 'vetocracy' of VNM subagents", however, this is not the only valid model. There are other models of this that would not leave sure gains on the table. 

Oh, do please share.

A high-level theme that would be interesting to explore here is rules-based vs. principles-based regulation. For example, the UK financial regulators are more principles-based (broad principles of good conduct, flexible and open to interpretation). In contrast, the US is more rules-based (detailed and specific instructions).

[Edit - on further investigation this seems to be a more UK-specific point; US regulations are much less ambiguous as they take a rules-based approach unlike the UK's principles-based approach]

It's interesting to note that financial regulations sometimes possess a degree of ambiguity and are subject to varying interpretations. It's frequently the case that whichever institution interprets them most stringently or conservatively effectively establishes the benchmark for how the regulation is understood. Regulators often use these stringent interpretations a... (read more)

I think a key idea referenced in this post is that an AI trained with modern techniques never directly “sees” / interfaces with a clear, well-defined goal. We “feel” like there is a true goal or objective, as we encode something of this flavour in the training loop - the reward or objective function for example. However, in the end the only thing you’re really doing to the AI is changing its state after registering its output given some input, and ending up at some point in program-space. Sure, that path is guided by the cleanly specified goal function, b... (read more)

Agree with this post.

Another way I think about this is, if I have a strong reason to believe my audience will interpret my words as X, and I don’t want to say X, I should not use those words. Even if I think the words are the most honest/accurate/correct way of precisely conveying my message.

People on LessWrong have a high honesty and integrity bar but the same language conveys different info in other contexts and may therefore be de facto less honest in those contexts.

This being said, I can see a counterargument that is: it is fundamentally more honest if... (read more)

Walking a very long distance (15km+), preferably in a not too exciting place (eg residential streets, fields), while thinking, maybe occasionally listening to music to reset. Works best in daylight but when it’s not too bright and sunny and not too warm / cold.

I wonder how clear it is that increasing average human BMI is bad. It seems very true that being obese is bad for health outcomes, but maybe this is compensated for by a reduction in the number of underweight individuals + better nutrition for non-morbidly-obese people. 

It seems like most/all large models (especially language models) will be first trained in a similar way, using self-supervised learning on large unlabelled raw datasets (such as web text), and it looks like there is limited room for manoeuvre/creativity in shaping the objective or training process when it comes to this stage. Fundamentally, this stage is just about developing a really good compression algorithm for all the training data. 

The next stage, when we try and direct the model to perform a certain task (either trivially, via prompting, or via... (read more)

Personal anecdote so obviously all n=1 caveats apply - I took light iron supplementation for a few months (one Spatone sachet per day) and it completely changed my life. Before, I could not run more than a mile, in 10 minutes, before collapsing. I got winded going up stairs, was often physically fatigued (although no other mental or non-fitness-related physical symptoms). After a few months of iron and no other lifestyle changes, I could run for an hour at 8 min/mile pace. Have stopped taking the supplements and benefits have sustained for 2 years. If you ... (read more)

This reminded me of a technique I occasionally use to explore a new topic area via some version of “graph search”. I ask LLMs (or previously google) “what are topics/concepts adjacent to (/related to/ similar to) X”. Recursing, and reading up on connected topics for a while, can be an effective way of getting a broad overview of a new knowledge space.
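The recursion described here is essentially a breadth-first search over a topic graph. A rough sketch, where `related_topics` is a hypothetical callback (in practice it would wrap an LLM query or a search; here it is a hand-written dictionary) and `max_depth` is an assumed cut-off to stop the search from sprawling:

```python
from collections import deque

def explore_topics(start, related_topics, max_depth=2):
    """Breadth-first walk over topics, returning them in discovery order.

    related_topics: callable mapping a topic name to a list of adjacent
    topic names (hypothetical; stands in for asking an LLM).
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        topic, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand topics at the depth limit
        for neighbour in related_topics(topic):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append((neighbour, depth + 1))
    return order

if __name__ == "__main__":
    # Toy adjacency data, made up for illustration.
    toy_graph = {
        "interpretability": ["circuits", "probing"],
        "circuits": ["induction heads", "interpretability"],
        "probing": ["steering vectors"],
        "induction heads": ["in-context learning"],
    }
    print(explore_topics("interpretability", lambda t: toy_graph.get(t, [])))
```

Breadth-first order gives the broad overview first; swapping the queue for a stack would instead dive deep into one branch before coming back.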

Optimising the process for AIS research topics seems like it could be valuable. I wonder whether a tool like Elicit solves this (haven’t actually tried it though).

3Shoshannah Tekofsky1y
That makes a lot of sense! And was indeed also thinking of Elicit

I wonder whether this would be a successful resource in this scenario (Unsolved Problems in ML Safety by Hendrycks, Carlini, Schulman and Steinhardt)

Makes sense. I agree that something working on algorithmic tasks is very weak evidence, although I am somewhat interested in how much insight we can get if we put more effort into hand-crafting algorithmic tasks with interesting properties.

Status - rough thoughts inspired by skimming this post (want to read in more detail soon!)

Do you think that hand-crafted mathematical functions (potentially slightly more complex ones than the ones mentioned in this research) could be a promising testbed for various alignment techniques? Doing prosaic alignment research with LLMs or huge RL agents is very compute and data hungry, making the process slower and more expensive. I wonder whether there is a way to investigate similar questions with carefully crafted exact functions which can be used to generate... (read more)

1Good Man1y
I found your idea fascinating.  You have a good company too. Percy Liang's group just published a paper along this line of thought and showed transformer's effectiveness in learning "ML trainers": What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
2Neel Nanda1y
I'd personally be somewhat surprised if that was particularly useful - I think there's a bunch of features of the alignment problem that you just don't get with smaller models (let alone algorithmic tasks) - eg the model's ability to understand what alignment even is. Maybe you could get some juice out of it? But knowing that a technique works to "align" an algorithmic problem would feel like v weak evidence that it works on a real problem.

Fair enough! I like the spirit of this answer and probably broadly agree, although it makes me think “surely I’d want to modify some people’s moral beliefs”…

1Jesse Kanner1y
Of course you do. Me too! Humans are compelled by a need for mutual domestication. It's what sustains our bonds and long-term survival. In many ways culture and society are a kind of marketplace of morality modification. 