Ruixuan Huang — LessWrong

Replying to Subspace Rerouting: Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

Ruixuan Huang, 11mo

Great job! Consider reading our related paper: https://arxiv.org/abs/2404.12038

Steering LLMs' Behavior with Concept Activation Vectors

Ruixuan Huang

Recently, several studies have reported a mechanism called activation steering, which can influence the behavioral style of large language models (LLMs). This mainly includes refusal capabilities ^[1] and language usage ^[2]. The mechanism resembles the safety concept activation vectors (SCAVs) ^[3] we proposed earlier this year. We've expanded the scope of safety concepts within SCAV and observed several intriguing phenomena, though some remain unexplained.

Summary of Findings:

CAVs effectively steer LLM output styles for roles like French experts and Python experts, showing strong accuracy and clear separability for these concepts.
We successfully implemented forward and reverse steering for language switching, indicating that certain language concepts in the LLM can be reliably steered.
CAV steering cannot create the capabilities

... (read 2852 more words →)
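As a rough illustration of the forward and reverse steering mentioned in the summary, here is a minimal sketch on synthetic activations. It takes the concept direction to be the difference of class means, which is a common simplification and not necessarily how the post or the linked paper derives its vectors. The names `concept_direction` and `steer`, the toy data, and the strength `alpha` are all hypothetical.

```python
import numpy as np

def concept_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean 'negative' activation to the mean 'positive' one.

    Assumption: a difference-of-means direction stands in for a learned concept vector.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along the unit concept direction.

    alpha > 0 pushes toward the concept (forward), alpha < 0 away from it (reverse).
    """
    return hidden + alpha * direction

# Synthetic activations: the "concept" is a shift of 3.0 along the first axis.
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[0] = 3.0
pos = rng.normal(size=(200, dim)) + offset  # examples that express the concept
neg = rng.normal(size=(200, dim))           # examples that do not

v = concept_direction(pos, neg)
h = rng.normal(size=dim)        # a hidden state to steer
h_fwd = steer(h, v, 4.0)        # forward steering
h_rev = steer(h, v, -4.0)       # reverse steering
```

Because `v` has unit length, the projection of the hidden state onto `v` moves by exactly `alpha`. In a real model the shift would be applied to a chosen layer's residual stream during generation.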

Exploring the Evolution and Migration of Different Layer Embedding in LLMs

Ruixuan Huang

[Edit on 17th Mar] After running experiments on more data points (5,000 texts) from the Pile dataset (a wider range of sample sources), we are confident that the experimental results described earlier are reliable. We have therefore released the code.

Recently, we conducted several experiments on the evolution and migration of token embeddings across layers during the forward pass of LLMs. Specifically, our research targeted open-source, decoder-only models such as GPT-2 and Llama-2^[1].

Our experiments start from an older research topic known as the logit lens. Applying the unembedding matrix to embeddings from intermediate layers is an innovative yet naive approach. Despite yielding several intriguing observations through the logit lens,... (read 2357 more words →)
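For readers unfamiliar with the logit lens, the core operation mentioned above (applying the unembedding matrix to hidden states from each layer) can be sketched as follows. This is a toy illustration on random arrays, not the code released with the post. The function names, shapes, and the parameter-free layer norm are assumptions made for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Parameter-free layer norm over the last axis (assumption: no learned scale or bias)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def logit_lens(hidden_by_layer, W_U):
    """Decode each layer's hidden state with the unembedding matrix.

    hidden_by_layer: (n_layers, d_model) hidden states for one token position.
    W_U:             (d_model, vocab_size) unembedding matrix.
    Returns the top token id per layer and the per-layer probability distributions.
    """
    logits = layer_norm(hidden_by_layer) @ W_U
    logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs

# Toy stand-ins for a model's per-layer hidden states and unembedding matrix.
rng = np.random.default_rng(0)
n_layers, d_model, vocab_size = 4, 8, 10
hidden = rng.normal(size=(n_layers, d_model))
W_U = rng.normal(size=(d_model, vocab_size))

top_ids, probs = logit_lens(hidden, W_U)
```

With a real model, `hidden` would hold the residual-stream states collected at each layer for a given position, and `top_ids` would show how the predicted token changes with depth.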