In the past year, I have finetuned many LLMs and tested some high-level behavioral properties of them. Often, people raise the question of whether the observed properties would be different if we had used full-parameter finetuning instead of LoRA. From my perspective, LoRA rank is one of many hyperparameters: hyperparameters influence how quickly training loss goes down, and they may influence the relationship between training and test loss, but they don't meaningfully interact with high-level properties beyond that.
I would be interested in any examples where this is wrong - are there demonstrations of finetuning hyperparameters that influence generalization behavior in interesting ways?
(For example, this question came up in the context of emergent misalignment, where various people asked me if I think that generalization happens because a small LoRA rank forces the model to learn "more general" solutions.)
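For concreteness, here is what the LoRA setup under discussion looks like; a minimal numpy sketch (the zero/random init follows the standard LoRA recipe, all sizes are illustrative):

```python
import numpy as np

d, r = 64, 4                  # residual width d, LoRA rank r << d
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))   # frozen pretrained weight
B = np.zeros((d, r))          # LoRA factor B starts at zero,
A = rng.normal(size=(r, d))   # factor A starts random, so B @ A = 0 initially

W_eff = W + B @ A             # effective weight during finetuning; only A, B train

# Whatever values training finds, the update B @ A has rank at most r:
delta = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))  # a generic trained update
print(np.linalg.matrix_rank(delta))
```

The rank r caps how many linearly independent directions of the weight matrix the finetune can move, which is what makes it look like just another hyperparameter knob.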
Not directly related to the question, but Optimizers Qualitatively Alter Solutions And We Should Leverage This (2025) argues that the choice of optimizer (e.g. first-order methods like AdamW vs second-order methods like Shampoo) not only affects speed of convergence, but also properties of the final solution.
"Our position is that the choice of optimizer itself provides an effective mechanism to introduce an explicit inductive bias in the process, and as a community we should attempt to understand it and exploit it by developing optimizers aimed at converging to certain kinds of solutions. The additional implication of this stance is that the optimizer can and does affect the effective expressivity of the model class (i.e. what solutions we can learn). We argue that expressivity arguments that solely focus on the architecture design and/or data do not provide a complete picture and could be misleading, for example if used to do model selection. The learning algorithm and choice of optimizer are also critical in shaping the characteristics of what are reachable functions and implicitly the final learned model."
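This is easy to see in a toy setting: on an underdetermined least-squares problem, plain gradient descent from a zero init converges to the minimum-norm interpolating solution, while Adam's coordinate-wise rescaling lands on a different interpolant. A minimal numpy sketch (problem sizes and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))      # 5 equations, 20 unknowns: many exact solutions
y = rng.normal(size=5)

def loss_grad(w):
    resid = X @ w - y
    return 0.5 * resid @ resid, X.T @ resid

# Plain gradient descent from zero init: iterates stay in the row space of X,
# so it converges to the minimum-L2-norm interpolant.
w_gd = np.zeros(20)
for _ in range(20000):
    w_gd -= 0.01 * loss_grad(w_gd)[1]

# Adam from the same init: per-coordinate rescaling leaves the row space,
# so it reaches (approximately) zero loss at a different solution.
w_adam = np.zeros(20)
m, v = np.zeros(20), np.zeros(20)
lr, b1, b2, eps = 3e-3, 0.9, 0.999, 1e-8
for t in range(1, 20001):
    g = loss_grad(w_adam)[1]
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    w_adam -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)

# Same (near-)zero training loss, different final weights:
print(loss_grad(w_gd)[0], loss_grad(w_adam)[0], np.linalg.norm(w_adam - w_gd))
```

Both runs fit the data, but the weights they end at differ; the paper's claim is that such optimizer-dependent implicit biases persist in deep networks, not just in toy quadratics.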
An in-the-wild observation is how different the Kimi models are compared to Llamas and Claudes. Kimi (and, I suppose, the recent Qwen models) are optimized with Muon+AdamW rather than AdamW alone. I've seen anecdotes about how different Kimi's responses are compared to other models. You can attribute some % of that to their data mix; MoonshotAI staff note they put a lot of effort into looking at and curating training data. But it's also possible that some non-trivial % of the behavior can be attributed to the optimizers used.
I guess it also depends on what you consider a 'finetuning hyperparameter' - e.g. the broadest interpretation is 'any way in which you could modify the training process', which includes lots of things that obviously affect generalization (like adding new data, modifying the data, etc)
One relatively constrained example might be 'changing the order of training data'. I do expect that there is path dependence in how we train models - the things models learn early on affect how / what they learn later on. E.g. Sycophancy to Subterfuge could be thought of as an example of this - there is reward hacking with the training curriculum but (presumably) there wouldn't be if you messed up the order of the training stages.
In the case of EM even a very tiny LoRA adapter (rank 1) seems sufficient: see post
Generally, according to Tinker docs, hparams might matter, but only coarsely:
Interpretability methods like SAEs often treat models as if their residual stream represents a bag of concepts. But how does this account for binding (red cube, blue sphere vs red, blue, cube, sphere)? Shouldn't we search for (subject, predicate, object) representations instead?
You could imagine a world where the model handles binding mostly via the token index and grammar rules. I.e. 'red cube, blue sphere' would have a 'red' feature at token 1, a 'cube' feature at token 2, a 'blue' feature at token 3, and a 'sphere' feature at token 4, with contributions like 'cube' at token 1 being comparatively subdominant or even nonexistent.
I don't think I really believe this. But if you want to stick to a picture where features are directions, with no further structure of consequence in the activation space, you can do that, at least on paper.
Is this compatible with the actual evidence about activation structure we have? I don't know. I haven't come across any systematic investigations into this yet. But I'd guess probably not.
Relevant. Section 3 is the one I found interesting.
If you wanted to check for matrix binding like this in real models, you could maybe do it by training an SAE with a restricted output matrix. Instead of each dictionary element being independent, you demand that the dictionary D ∈ ℝ^{2n×d} for your SAE can be written as D = [D_1; D_1 W^T], where D_1 ∈ ℝ^{n×d} is a free dictionary and W ∈ ℝ^{d×d} is a learned binding matrix. So, we demand that the second half of the SAE dictionary is just some linear transform of the first half.
That'd be the setup for pairs. Go D = [D_1; D_1 W_1^T; D_1 W_2^T] for three slots, and so on.
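A sketch of the constrained dictionary construction in numpy (the [first half; linear transform of first half] parameterization follows the description above; shapes are illustrative, and the encoder and training loop are omitted):

```python
import numpy as np

d, n = 32, 128                 # residual dim d, n dictionary elements per slot
rng = np.random.default_rng(0)

D1 = rng.normal(size=(n, d))   # first half: free dictionary (first-slot features)
W = rng.normal(size=(d, d))    # learned binding matrix

# Full decoder dictionary: second half is a shared linear transform of the first.
D = np.concatenate([D1, D1 @ W.T], axis=0)   # shape (2n, d)

# Decoding a sparse code c of shape (2n,) works as in a normal SAE:
c = np.zeros(2 * n)
c[[3, n + 7]] = 1.0            # e.g. feature 3 in slot 1, feature 7 bound into slot 2
x_hat = c @ D                  # reconstruction in the residual stream
assert np.allclose(x_hat, D1[3] + D1[7] @ W.T)
```

During training, only D1 and W would receive gradients, so the binding transform is shared across all dictionary elements rather than learned per-element.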
(To be clear, I'm also not that optimistic about this sort of sparse coding + matrix binding model for activation space. I've come to think that activations-first mech interp is probably the wrong way to approach things in general. But it'd still be a neat thing for someone to check.)
Thanks for the link and suggestions!
I quickly tested whether SigLIP or CLIP embeddings show evidence of attribute binding, and they don't (though n=1 image) - an image of a red cube with a blue sphere, compared with the texts "red cube next to blue sphere" and "blue cube next to red sphere", doesn't get a higher similarity score for the correct label than for the wrong one (CLIP, SigLIP).
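For reference, the scoring step in this kind of check reduces to cosine similarity between the image embedding and each caption embedding; a toy numpy sketch of just that step (the vectors here are random stand-ins, real embeddings would come from the CLIP/SigLIP encoders):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
img = rng.normal(size=512)          # stand-in for the image embedding
txt_correct = rng.normal(size=512)  # stand-in for "red cube next to blue sphere"
txt_swapped = rng.normal(size=512)  # stand-in for "blue cube next to red sphere"

# If the embeddings encoded binding, we'd expect a clear margin:
#   cosine(img, txt_correct) > cosine(img, txt_swapped)
print(cosine(img, txt_correct), cosine(img, txt_swapped))
```

The observation above is that with real embeddings this margin doesn't appear: both captions contain the same bag of words, and the score seems insensitive to which attribute is bound to which object.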
Nice quick check!
Just to be clear: This is for the actual full models? Or for the 'model embeddings' as in you're doing a comparison right after the embedding layer?
It is very clear what it means to align an agent:
It is less clear what it means to align an LLM:
Probably, we should have different alignment goals for different deployment cases: LLM assistants should say nice and harmless things, while agents that help automate alignment research should be free to think anything they deem useful, and reason about the harmlessness of various actions “out loud” in their CoT, rather than implicitly in a forward pass.
I think even your definition of what it means for an agent to be aligned is a bit underspecified because it doesn't distinguish between two possibilities:
Is the agent creating positive outcomes because it trades and compromises with us, creating a mutually beneficial situation that benefits both us and the agent, or
Is the agent creating positive outcomes because it inherently "values what we value", i.e. its utility function overlaps with ours, and it directly pursues what we want from it, with no compromises?
Definition (1) is more common in the human world. We say that a worker is aligned with us if they do their job as instructed (receiving a wage in return). Definition (2) is more common in theoretical discussions of AI alignment, because people frequently assume that compromise, as a strategy we could adopt in an AI-human scenario, is either unnecessary or impossible.
By itself, the meaning you gave appears to encompass both definitions, but it seems beneficial to clarify which of these definitions you'd consider closer to the "spirit" of the word "aligned". It's also important to specify what counts as a good outcome by our values if these things are a matter of degree, as opposed to being binary. As they say, clear thinking requires making distinctions.
I want to formulate what emotions are from the perspective of an observer that has no emotions itself. Emotions have a close relationship with consciousness, and similar to the hard problem of consciousness, it is not obvious how to know what another mind feels like. It could be that one person perceives emotions 1000x as strongly as another person, yet the two different emotional experiences lead to exactly the same behavior. Or it could be that one species perceives emotions on a different intensity scale than another. This creates a challenge for utilitarians: if you want to maximize the happiness of all beings in the universe, you need a way of aggregating happiness across beings.
So, how can we approach this question? We can start by trying to describe the observable properties of emotions as well as we can:
My intermediate conclusion is that emotions likely evolved because they are computationally efficient proxies for how good the current state is and how to spend energy. They can be viewed as latent variables that often yielded fitness-increasing behavior, and whose influence extends beyond the situations in which they actually prove useful - for example, when I get grumpy because I’m hungry.
If this is true, emotions are more useful when a being is less capable of abstract reasoning; therefore, less intelligent animals might experience emotions more strongly rather than more weakly. This fits with the observation that intelligent humans can reduce their suffering via meditation, or that pets seem to suffer more from getting a vaccine than adult humans do. However, this is a bit of a leap and I have low confidence in it.
Regarding digital sentience, this theory would predict that emotions are more likely to emerge when optimization pressure exists that lets an AI decide how to spend energy. This is not the case in language model pretraining, but is the case in most forms of RL. Again, I am not very confident in this conclusion.
I wonder if anyone has analyzed the success of LoRA finetuning from a superposition lens. The main claim behind superposition is that networks represent D >> d features in their d-dimensional residual stream; with LoRA, we now update only r << d linearly independent directions. On the one hand, it seems like this introduces a lot of unwanted correlation between the sparse features, but on the other hand networks seem good at dealing with this kind of gradient noise. Should we be more or less surprised that LoRA works if we believe that superposition is true?
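To make the tension concrete, here is a toy numpy sketch (all sizes illustrative): put D >> d random feature directions in d dimensions, apply a generic rank-r update, and ask how many feature readouts it perturbs:

```python
import numpy as np

d, D, r = 64, 512, 4              # residual dim d, D >> d features, LoRA rank r << d
rng = np.random.default_rng(0)

F = rng.normal(size=(D, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # D superposed feature directions

delta = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))  # generic rank-r update B @ A

# How much does the rank-r update move each feature direction through the layer?
per_feature = np.linalg.norm(F @ delta, axis=1)
print((per_feature > 1e-6).mean())  # fraction of features the update touches
```

Generically, a rank-r update perturbs essentially every superposed feature at once - that is the "unwanted correlation" worry; whether this is damaging or just tolerable gradient noise is exactly the open question.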