That makes sense and understand that steering vectors are still in the research phase. In the near term, can we use interpretability (vectors) for evals? Feels like there could be potential in problem detection in the near term.
That makes sense and understand that steering vectors are still in the research phase. In the near term, can we use interpretability (vectors) for evals? Feels like there could be potential in problem detection in the near term.