submarat · 49120 karma
SWE/MLE/AI Safety

Comments

ARENA4.0 Capstone: Hyperparameter tuning for MELBO + replication on Llama-3.2-1b-Instruct
submarat · 11mo · 10

Thank you for reading and for the suggestions. I'll enumerate them for easier reference:
1. Find max activating examples
2. Understand which vectors are being found
3. Attempt to scale up
4. Find useful applications once scaled up

For 1, do you mean:

  1. Take an input (from a bank of random example prompts)
  2. Do a forward pass on the unsteered model
  3. Extract the activations at the target layer
  4. Compute the dot product between these activations and the steering vector
  5. Use this dot product as a measure of how strongly the example activates the behavior associated with the steering vector

Am I following correctly?
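
Concretely, I'm imagining something like this sketch (PyTorch + Transformers; the Llama-style layer path, hook details, and max-over-positions scoring are my assumptions, not anything from the post):

```python
import torch

def max_activation_scores(model, tokenizer, prompts, steering_vector, layer_idx):
    """Score how strongly each prompt's (unsteered) activations align with a
    steering vector at the target layer. Layer path and hook details are
    assumptions for an HF Llama-style model, not MELBO's actual API."""
    captured = {}

    def hook(module, inputs, output):
        # Llama decoder layers return a tuple; the residual-stream hidden
        # states (batch, seq, d_model) come first.
        captured["resid"] = output[0] if isinstance(output, tuple) else output

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    scores = []
    try:
        for prompt in prompts:
            batch = tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                model(**batch)  # plain forward pass, no steering applied
            # Dot product of every token position with the steering vector;
            # take the max over positions as the prompt's score.
            dots = captured["resid"][0] @ steering_vector  # (seq_len,)
            scores.append(dots.max().item())
    finally:
        handle.remove()
    return scores
```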

Mechanistically Eliciting Latent Behaviors in Language Models
submarat · 11mo · 10

We attempted 1a ("diversity measure based on sentence embeddings") and found that, for Llama3.2-1B, diversity appears to decay past the cusp value of R; picking the R with the highest average diversity was a decent heuristic for finding meaningful steering vectors. The Llama model starts to produce highly repetitive output past the cusp, and we verified that repetitive completions were scored as similar by our chosen sentence-embedding model (SentenceTransformer all-mpnet-base-v2). Using "sum of variances" vs. "mean of cosine similarities" as the diversity metric didn't seem to matter.
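
For reference, a minimal sketch of the "mean of cosine similarities" variant (`sample_completions` is a hypothetical stand-in for generating steered completions at a given R; the function names are mine, not from our code):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-mpnet-base-v2")

def mean_pairwise_diversity(completions):
    """1 - mean pairwise cosine similarity of completion embeddings.
    Highly repetitive completions embed close together, so diversity
    collapses toward 0 past the cusp."""
    emb = embedder.encode(completions, normalize_embeddings=True)
    sims = emb @ emb.T  # cosine similarities, since embeddings are unit-norm
    n = len(completions)
    return 1.0 - sims[~np.eye(n, dtype=bool)].mean()

# Sweep R and keep the value with the highest average diversity, e.g.:
# best_R = max(candidate_Rs, key=lambda R: mean_pairwise_diversity(sample_completions(R)))
```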

Posts

29 · Transformers Don't Need LayerNorm at Inference Time: Implications for Interpretability · 1mo · 0 comments
34 · ARENA4.0 Capstone: Hyperparameter tuning for MELBO + replication on Llama-3.2-1b-Instruct · 11mo · 2 comments