Nora Belrose

You can then also try things like adding supervised data in only one domain and check generalization, and you can expect a reasonably clear signal.

Yep, I just had this idea this morning and came here to check if anyone else had thought of it. It seems plausible that a semi-supervised version of CCS could outperform naive logistic regression in generalization performance.
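A minimal sketch of what such a semi-supervised objective might look like. The unsupervised part uses the consistency and confidence terms from the CCS paper; the added cross-entropy term on labeled pairs, the `sup_weight` parameter, and all function names are assumptions made for illustration, not an established method.

```python
import math

def ccs_loss(p_pos, p_neg):
    """Unsupervised CCS loss for one contrast pair.

    p_pos and p_neg are the probe's probabilities that a statement and its
    negation are true.
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2  # the two should sum to 1
    confidence = min(p_pos, p_neg) ** 2         # rules out p_pos = p_neg = 0.5
    return consistency + confidence

def bce(p, label, eps=1e-7):
    """Binary cross-entropy for a single prediction, clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def semi_supervised_loss(unlabeled, labeled, sup_weight=1.0):
    """Hypothetical combined objective.

    unlabeled: list of (p_pos, p_neg)
    labeled:   list of (p_pos, p_neg, label), label in {0, 1}
    """
    pairs = list(unlabeled) + [(pp, pn) for pp, pn, _ in labeled]
    if not pairs:
        raise ValueError("need at least one contrast pair")
    unsup = sum(ccs_loss(pp, pn) for pp, pn in pairs) / len(pairs)
    if not labeled:
        return unsup
    # CCS predicts with the average of p_pos and (1 - p_neg).
    sup = sum(bce(0.5 * (pp + 1.0 - pn), y) for pp, pn, y in labeled) / len(labeled)
    return unsup + sup_weight * sup
```

With labels from only one domain in `labeled` and contrast pairs from every domain in `unlabeled`, you could compare held-out-domain accuracy against plain logistic regression.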

The number of truthlike features (or any kind of feature) cannot scale exponentially with the hidden dimension in an LLM, simply because the number of "features" scales at most linearly with the parameter count (for information theoretic reasons). Rather, I claim that the number of features scales at most quasi-quadratically with the hidden dimension $d$, or $O(d^2 \log d)$.

With depth fixed, the number of parameters in a transformer scales as $O(d^2)$ in the width $d$ because of the weight matrices. According to this paper, which was cited by the Chinchilla paper, the optimal depth scales logarithmically with width, hence the number of parameters, and therefore the number of "features", for a given width is $O(d^2 \log d)$. QED.
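A rough numerical check of that scaling, as a sketch. It assumes the usual approximation of about $12 d^2$ weights per transformer block (attention plus a 4x-wide MLP, ignoring embeddings and biases); the constant `c` in the depth rule is a made-up placeholder, not a value from the cited paper.

```python
import math

def approx_params(d, depth):
    """Approximate non-embedding parameter count: ~12 * d^2 per block."""
    return 12 * d * d * depth

def log_depth(d, c=1.0):
    """Hypothetical depth rule: depth grows like c * ln(d)."""
    return max(1, round(c * math.log(d)))

for d in (256, 1024, 4096):
    depth = log_depth(d)
    n = approx_params(d, depth)
    # If n is O(d^2 log d), this ratio should stay roughly constant.
    print(d, depth, n, round(n / (d * d * math.log(d)), 2))
```

The last column stays near 12 as the width grows, which is what quasi-quadratic growth predicts.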

EDIT: It sounds like we are talking past each other, because you seem to think that "feature" means something like "total number of possible distinguishable states." I don't agree that this is a useful definition. I think in practice people use "feature" to mean something like "effective dimensionality" which scales as O(log(N)) in the number of distinguishable states. This is a useful definition IMO because we don't actually have to enumerate all possible states of the neural net (at which level of granularity? machine precision?) to understand it; we just have to find the right basis in which to view it.
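A toy illustration of the gap between the two definitions, assuming idealized independent binary features (real network features are neither binary nor independent): `k` such features distinguish `2**k` states, so the feature count is the base-2 log of the state count.

```python
import math
from itertools import product

def num_states(k):
    """Count the distinguishable states of k independent binary features."""
    return sum(1 for _ in product((0, 1), repeat=k))

def effective_dim(n_states):
    """Number of binary features needed to index n_states states."""
    return math.log2(n_states)

for k in (4, 8, 12):
    n = num_states(k)
    print(k, n, effective_dim(n))
```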

Do you have any recommendations for running HDBSCAN efficiently on high-dimensional neural net activations? I'm using the Python implementation, and just running the algorithm on GPT-2 small's embedding matrix is unbearably slow.

UPDATE: The maintainer of the repo says it's inadvisable to use the algorithm (or any other density-based clustering) directly on data with as many as 768 dimensions, and recommends using UMAP first. Is that what you did?
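A sketch of the UMAP-then-HDBSCAN pipeline described above, assuming the third-party `umap-learn` and `hdbscan` packages are installed. The function name and every hyperparameter value here are illustrative guesses, not settings recommended by the maintainer.

```python
def reduce_then_cluster(X, n_components=30, n_neighbors=30,
                        min_cluster_size=15, seed=0):
    """Reduce X with UMAP, then cluster the low-dimensional points with HDBSCAN.

    X: array-like of shape (n_points, n_features), e.g. an embedding matrix.
    Returns (X_low, labels); HDBSCAN marks noise points with label -1.
    """
    # Imported lazily so the module loads even without these packages.
    import umap
    import hdbscan

    reducer = umap.UMAP(
        n_components=n_components,  # far fewer than the original 768 dims
        n_neighbors=n_neighbors,
        min_dist=0.0,               # pack points tightly, which suits clustering
        metric="cosine",
        random_state=seed,
    )
    X_low = reducer.fit_transform(X)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    labels = clusterer.fit_predict(X_low)
    return X_low, labels
```

Calling `reduce_then_cluster(embeddings)` on a NumPy array of shape `(vocab_size, 768)` would return the reduced points and one cluster label per token. Since UMAP distorts densities, the resulting clusters are worth sanity-checking against the original space.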

I think it would be a distraction to try to figure out if LMs are "phenomenally conscious" for a few different reasons.

  1. I think there are pretty strong reasons to believe that phenomenal consciousness is not actually a substantive property, in the sense that either everything has it in some sense (panpsychism) or nothing does (eliminativism). Any other solution confronts the Hard Problem and the empirical intractability of actually figuring out which things are or are not phenomenally conscious.
  2. Your proposed tests for phenomenal consciousness seem to, in fact, be testing for access consciousness: basically, the ability to do certain types of reflection and introspection. Access consciousness may well be relevant for alignment; it seems pretty related to situational awareness. But that's not phenomenal consciousness (because of the Hard Problem). Phenomenal consciousness is causally inert and empirically untestable.
  3. While it would be a problem if LMs were moral patients, I think these concerns are utterly dwarfed by the value we'd lose due to an AI-caused existential catastrophe. Also, on the most plausible views of valence, an experience's valence is directly determined by your first-order in-the-moment preferences to continue having that experience or not. If valence just reduces to preferences, then we really can just talk about the preferences, which seem more empirically tractable to probe.

I do think consciousness is real and important (I think some form of Russellian monism is probably right). I just don't think it matters for alignment.

^ I stand by the substantive points I made above but it occurs to me that I expressed them in a weirdly combative and dismissive tone, so sorry about that.

Isn't this just pushing the problem back a step? Wouldn't weight decay penalize the weights that compute the input-based pseudorandom number to determine where the deployment check happens?

This also just seems like it'd be really low measure in the SGD prior. Like, where is the optimization pressure coming from to form all these independent deployment-checking circuits throughout the model and also randomize their use? All the while taking a hit on the L2 penalty for doing this?

Is the idea that the network itself is consciously modeling SGD and gradient-hacking its way toward this solution? If so: 1) it's not clear to me this is mechanistically possible, and 2) if the AI is superintelligent enough to figure this out, it could probably figure out a better way to get what it wants (e.g. break out of the training process altogether).

Sure, but if those small weights don’t contribute to the base objective they would just get pushed all the way to zero, right? Especially if you use L1 instead of L2 regularization.

The somewhat scarier version of deception IMO is where there’s a circuit that does contribute to base task performance, but it just also has this property of being deceptive, and the loss landscape is such that SGD can’t/won’t find a nearby nondeceptive alternative with similar or better performance. But it seems like there’s some hope there too, since we know that strictly convex valleys are really low measure in NN loss landscapes, and every independent SGD solution is part of a single connected manifold of low loss. Furthermore, for sufficiently large models you can actually just take a convex combination of the weights of two good models and usually get another good model.

SGD definitely can find nondeceptive solutions; I guess the question is whether it will do so, and whether we can push it in that direction if needed. My intuition currently is that deception isn’t actually going to be a problem except, perhaps, if the network is very deep / has recurrence. We should be worried that people will use very deep recurrent networks to build AGI, but I’m somewhat hopeful that chain of thought language models will sort of save us here, since they force the network to store its intermediate computations in a human-interpretable format.
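To make the convex-combination point concrete, here is a sketch of the arithmetic only. Weights are represented as plain dicts of float lists, and the function name is hypothetical. Whether the interpolated model is actually good is the empirical claim, which this code does not test.

```python
def interpolate_weights(state_a, state_b, alpha):
    """Return alpha * state_a + (1 - alpha) * state_b, parameter by parameter.

    state_a, state_b: dicts mapping parameter names to equal-length lists of
    floats, standing in for two trained models with the same architecture.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    if state_a.keys() != state_b.keys():
        raise ValueError("models must have the same parameter names")
    return {
        name: [alpha * a + (1.0 - alpha) * b
               for a, b in zip(state_a[name], state_b[name])]
        for name in state_a
    }

model_a = {"w": [0.0, 2.0], "b": [1.0]}
model_b = {"w": [2.0, 4.0], "b": [3.0]}
midpoint = interpolate_weights(model_a, model_b, 0.5)
print(midpoint)
```

Sweeping `alpha` from 0 to 1 and evaluating the loss at each point is the usual way to check whether two solutions sit in the same low-loss region.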

This probably doesn't work, but have you thought about just using weight decay as a (partial) solution to this? In any sort of architecture with residual connections you should expect circuits to manifest as weights with nontrivial magnitude. If some set of weights isn't contributing to the loss then the gradients won't prevent them from being pushed toward zero by weight decay. Sort of a "use it or lose it" type thing. This seems a lot simpler and potentially more robust than other approaches.
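A toy simulation of the "use it or lose it" dynamic, under simplifying assumptions: plain SGD with L2 weight decay on two scalar weights, one that affects a quadratic loss and one that receives zero gradient. The learning rate, decay coefficient, and target value are arbitrary.

```python
def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD step with L2 weight decay folded into the gradient."""
    return w - lr * (grad + weight_decay * w)

def train(steps=10_000, lr=0.1, weight_decay=0.01, target=2.0):
    used, unused = 1.0, 1.0
    for _ in range(steps):
        grad_used = 2.0 * (used - target)  # d/dw of (w - target)^2
        grad_unused = 0.0                  # this weight never touches the loss
        used = sgd_step(used, grad_used, lr, weight_decay)
        unused = sgd_step(unused, grad_unused, lr, weight_decay)
    return used, unused

used, unused = train()
print(f"used weight:   {used:.4f}")    # settles slightly below the target
print(f"unused weight: {unused:.6f}")  # shrinks geometrically toward zero
```

The unused weight is multiplied by `1 - lr * weight_decay` at every step, so it decays exponentially, while the loss gradient holds the used weight near its optimum.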

This is actually a pretty good argument, and has caused me to update more strongly to the view that we should be optimizing only the thought process of chain of thought language models, not the outcomes that they produce.
