Allen Schmaltz, PhD
We briefly introduce, at a conceptual level, technical work for deriving robust estimators of predictive uncertainty for large language models (LLMs), and we consider the implications for real-world deployments and AI policy.
The Foundational LLM Limitation: An Absence of Robust Estimators of Predictive Uncertainty
The limitations of unconstrained LLMs, which include the more recent RL-based reasoning-token models, are readily evident to end-users. Hallucinations, highly confident wrong answers, and related failures diminish their benefits in most real-world settings. The punchline is that the end-user has no means of knowing whether an output can be trusted, short of carefully checking it, which precludes model-based automation for most complex, multi-stage pipelines.
The foundational problem for all...
To add to this historical retrospective on interpretability methods: Alternatively, we can use a parameter decomposition of a bottleneck ("exemplar") layer over a model with non-identifiable parameters (e.g., LLMs) to make a semi-supervised connection to the observed data, conditional on the output prediction. This recasts the prediction as a function over the training set's labels and representation space via a metric-learner approximation. How do we know that the matched exemplars are actually relevant, or equivalently, that the approximation is faithful to the original model? One simple (but meaningful) metric is whether the prediction of the metric-learner approximation matches the class of the prediction of the original model, and if they do not, the discrepancies...
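To make that fidelity check concrete, here is a minimal sketch, not the author's implementation: a simple kNN classifier stands in for the metric-learner approximation over exemplar-layer representations, and the representations, labels, and original-model predictions are synthetic placeholders rather than real bottleneck-layer activations. The sketch fits the approximation on training-set representations and labels, then flags test points where its predicted class disagrees with the original model's prediction.

```python
# A minimal sketch of the fidelity check described above: approximate the
# original classifier with a metric learner (here, kNN over exemplar-layer
# representations) and flag test points where the approximation's predicted
# class disagrees with the original model's prediction. All arrays below are
# synthetic placeholders for real exemplar-layer activations and predictions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Placeholder "exemplar layer" representations and labels for the training set.
n_train, n_test, dim, n_classes = 1000, 200, 32, 3
train_reps = rng.normal(size=(n_train, dim))
train_labels = rng.integers(0, n_classes, size=n_train)

# Placeholder test representations and the original model's predicted classes
# on those same test inputs.
test_reps = rng.normal(size=(n_test, dim))
original_model_preds = rng.integers(0, n_classes, size=n_test)

# Metric-learner approximation: nearest neighbors in representation space, so
# each prediction is explicitly a function of matched training exemplars.
approx = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
approx.fit(train_reps, train_labels)
approx_preds = approx.predict(test_reps)

# Fidelity check: does the approximation reproduce the original model's
# predicted class? Disagreements are flagged for further inspection.
matches = approx_preds == original_model_preds
print(f"Approximation fidelity: {matches.mean():.2%} of predictions match")
discrepancy_idx = np.flatnonzero(~matches)
print(f"{discrepancy_idx.size} discrepant predictions flagged")
```

kNN is used here only because it is the simplest metric learner over a representation space; the same check applies to any approximation whose predictions are an explicit function of matched training exemplars.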