ssarveshwaraan — LessWrong

LLM Control with Dynamic Inhibitory Regulation

In the free time I had over winter, I used Kaggle to experiment with a threshold-based negative feedback regulator, hoping to find a better way to align LLMs than standard ablation. I felt drawn by the growing parallels between neuroscience and artificial intelligence, as I have my...

Mar 17 · 1

LLM Sycophancy Control with Dynamic Inhibitory Regulation

This post outlines a method for limiting sycophancy in LLMs with a threshold-based negative feedback regulator, in an effort to identify a better approach to AI alignment than standard ablation. Using this method, a 100% refusal rate was achieved on statements where an authoritative figure forced the model to agree with their objectively false...

Feb 28 · 1

LLM Sycophancy Control with Dynamic Inhibitory Regulation (0% MMLU Alignment Tax)

This post outlines a method for limiting sycophancy in LLMs with a threshold-based negative feedback regulator, in an effort to identify a better approach to AI alignment than standard ablation. Using this method, a 100% refusal rate was achieved on statements where an authoritative figure forced the model to agree with their objectively false...

Feb 28 · 1