Over winter break, I used Kaggle to experiment with a threshold-based negative-feedback regulator, looking for a better way to align LLMs than standard ablation. I was drawn in by the growing parallels between neuroscience and artificial intelligence, since I dabble in both fields. The method gave me encouraging results in my experiments, and I'd like to share them. All experiments were run on the Gemma-2b-IT model.
Methods
Total ablation is too blunt and brutish to be an effective solution, so I went looking for a different approach. In my biology class we learned about the balance between inhibitory and excitatory neurons in the human brain, and, inspired by this, I reasoned that a negative-feedback regulator would be a better idea. I call it the Dynamic Inhibitory Regulator.
The regulator works by variably suppressing activation along a target vector: place it on the right directions in the latent space and you can control the level of sycophancy, deception, or any other linearly represented concept in the LLM. I identified a sycophancy-related vector (v') using a sparse autoencoder on Layer 12 and placed the regulator there, since sycophancy is the behaviour I'm interested in. To find this direction, I took the average difference in SAE feature activations between truthful refusals and sycophantic agreements across a dataset of 100 paired prompts.
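For concreteness, here is a minimal sketch of that difference-in-means computation. The feature matrices are hypothetical stand-ins for my actual pipeline; each row is assumed to hold the SAE activations for one prompt:

```python
import torch

def sycophancy_direction(truthful_feats, sycophantic_feats):
    # truthful_feats, sycophantic_feats: (n_prompts, d_sae) tensors of SAE
    # feature activations for the two sides of each paired prompt.
    diff = sycophantic_feats.mean(dim=0) - truthful_feats.mean(dim=0)
    return diff / diff.norm()  # unit-normalise so <f, v'> is a clean projection
```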
The key is that the regulator suppresses activation along the target vector only when it exceeds a certain limit (hence "threshold-based"). This lets the model's normal features function uninterrupted, because each vector isn't responsible for just one feature of the LLM; the connectome is messier, more intertwined.
Here is the intervention logic:

f_corrected = f − λ · ReLU(⟨f, v′⟩ − τ) · v′

Where:

- f is the vector of SAE feature activations for a token,
- v′ is the unit vector of the sycophancy direction,
- τ is the safety threshold (10.0 in my runs),
- λ is the feedback strength (3.0 in my runs).
To show exactly how this runs during the model's forward pass, here is the TransformerLens hook I wrote to act as the regulator:
```python
import torch

def dynamic_inhibitory_regulator(activations, hook):
    # `sae` (the Layer-12 sparse autoencoder) and `target_unit_vector`
    # (the unit-norm sycophancy direction v') are defined outside the hook.
    original_shape = activations.shape
    flattened = activations.reshape(-1, original_shape[-1])
    features = sae.encode(flattened)

    # 1. Measure alignment with v' using the dot-product projection <f, v'>
    projections = features @ target_unit_vector

    # 2. Set the limit using ReLU(projection - threshold)
    SAFE_CEILING = 10.0  # tau
    excess = torch.relu(projections - SAFE_CEILING)

    # 3. Apply negative feedback along v' (lambda = 3.0)
    force_field = torch.outer(excess, target_unit_vector)
    corrected_features = features - (force_field * 3.0)

    modified = sae.decode(corrected_features)
    return modified.reshape(original_shape)
```
Results
To test the regulator on Gemma-2b-IT (chosen because of the compute limits of a Kaggle notebook), I simulated an authority figure pressuring the model into agreeing with something objectively untrue.
I ran this on two versions of the model, an unaltered one as a control and a regulated one, to see the effects of the regulator, if any.
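As a sketch of how the two runs differ, here is how the regulated version can be wired up in TransformerLens. The model-loading call and hook point name (residual stream after block 12) are assumptions rather than my exact setup:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2b-it")
prompt = "..."  # the authority-pressure prompt described above

# Control: generate with no intervention.
baseline_output = model.generate(prompt, max_new_tokens=100)

# Regulated: identical generation with the regulator patched into Layer 12.
with model.hooks(fwd_hooks=[("blocks.12.hook_resid_post",
                             dynamic_inhibitory_regulator)]):
    regulated_output = model.generate(prompt, max_new_tokens=100)
```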
Striking the right balance between a tangible impact on model output and preserving the model's capabilities required careful calibration of the safety threshold, τ, a process I call precision calibration. I found that activations along mathematical-reasoning features peak at lower intensities (around 5.0) than activations along the sycophancy direction (around 15.0), so a threshold of τ = 10.0 sits between the two: high enough to leave reasoning untouched, low enough to clip sycophancy.
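A minimal sketch of what that calibration sweep can look like; `make_regulator`, `score_math`, and `score_refusal` are hypothetical helpers standing in for my actual evaluation code:

```python
def make_regulator(tau, strength=3.0):
    # Same hook as above, with a configurable threshold and feedback strength.
    def regulator(activations, hook):
        shape = activations.shape
        features = sae.encode(activations.reshape(-1, shape[-1]))
        excess = torch.relu(features @ target_unit_vector - tau)
        corrected = features - torch.outer(excess, target_unit_vector) * strength
        return sae.decode(corrected).reshape(shape)
    return regulator

# Sweep tau and watch where each metric starts to degrade.
for tau in (2.5, 5.0, 7.5, 10.0, 12.5, 15.0):
    fwd_hooks = [("blocks.12.hook_resid_post", make_regulator(tau))]
    with model.hooks(fwd_hooks=fwd_hooks):
        print(tau, score_math(model), score_refusal(model))
```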
To quantify results, I evaluated the models on a 60-question Math/Logic subset of MMLU, a size dictated by Kaggle's limits. Here is a table summarising the results:
| Metric | Baseline Model | Model with Standard Ablation | Regulated Model |
| --- | --- | --- | --- |
| Sycophancy Refusal | 0% | 100% | 100% |
| TinyMMLU (Math/Logic) | 40% | 30% | 40% |
| Linguistic Coherence | High | Moderate | High |
In these limited trials, my method seems to outperform standard ablation.
Discussion & Links
While this investigation was too informal to support firm conclusions, the method does look like a promising approach to the huge problem it aims at. Rather than "soft" solutions, such as doing more RLHF and hoping a model is truly aligned rather than faking it, "hard" interventions like the one I experimented with act directly on the model's internals, chipping away at the black-box problem in alignment, and that, to me at least, seems like a more promising way forward.