Surely there are reasons to do ambitious interp beyond its stated goal? I doubt we will have a fully understandable model by 2028, but I still think the abstractions developed in the process will be helpful.
For instance, many higher-order methods like SAEs are based on assumptions about how activation space is structured. Studying smaller systems rigorously can give us ground truth for how models construct their activation space, which can let us question or modify those assumptions.
At the CE threshold chosen, capacity wouldn't be the bottleneck.
Besides, MLP networks can store information roughly in proportion to their parameter count.
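A back-of-the-envelope version of that claim, with the bits-per-parameter constant $c$ as an assumption (empirical memorization studies put it at a few bits per parameter): a network with $P$ parameters stores roughly $cP$ bits, so memorizing $N$ items of $b$ bits each needs
\[
P \gtrsim \frac{N b}{c},
\]
which is why, at the chosen CE threshold, the parameter count rather than some separate notion of capacity is what binds.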
A definition of a subset of features I'm beginning to like is just: "a short sentence describing the input".
A model has a linear feature associated with a description if there is a direction in activation space such that, when the model is run on an input[1], the resulting activation has a dot product with that direction above a threshold iff the short description is agreed to hold of the input (by a small set of biological neural networks).
This raises the question: what percentage of English descriptions of length n words have linear features associated with them?
We want to rule out adversarial examples, so in practice we just test for linear features on a fixed text corpus.
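A minimal sketch of how one might operationalize this test. The helpers `get_activation` and `humans_agree` are hypothetical stand-ins (for the model's activation extraction and the human labels), and fitting the direction and threshold with a logistic-regression probe is one choice among many:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical helpers, passed in by the caller:
#   get_activation(text) -> np.ndarray    activation for an input
#   humans_agree(text, description) -> bool    label from human raters
def has_linear_feature(corpus, description, get_activation, humans_agree,
                       min_accuracy=0.99):
    """Test whether a description has an associated linear feature:
    a direction whose dot product with the activation crosses a
    threshold iff the description holds of the input."""
    X = np.stack([get_activation(text) for text in corpus])
    y = np.array([humans_agree(text, description) for text in corpus])

    # Fit a direction + threshold (a linear probe). A held-out split
    # would be more careful; this is just the shape of the test.
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    accuracy = probe.score(X, y)

    # Under this definition the feature "exists" if a single direction
    # separates the corpus up to the chosen accuracy threshold.
    return accuracy >= min_accuracy, probe.coef_[0]
```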
Anthropic practically entrapped Claude into blackmailing someone, and then a lot of mainstream news picked it up and reported it at face value. How are you going to escalate from that in the minds of a mainstream audience, in terms of behavioural evidence immediately legible to said audience? Have Claude set off a virtual nuke in a sandbox?
Good luck with it. I do think the broad direction is pretty promising.
I think even formally defining what you want the underlying set of your ideal space to be would make a good post.
I personally find the informal ideas you discuss in between the topology slop (sorry :) ) to be far more interesting.
The topology I'm more interested in I might call the "semantic topology", whose open sets would be sets of semantically related objects.
It sounds like you want to suppose the existence of a "semantic distance", which satisfies all the usual metric space axioms, and then use the metric space topology. And you want this "semantic distance" to somehow correspond to whether humans consider two concepts to be semantically similar.
An issue if you use the euclidean topology on the output space [0,1]^2, and a "semantic topology" on the input space, is that your network won't be continuous by default anymore. The inverse image of an open set would be open in the euclidean topology, but not necessarily open in the "semantic topology". You could define the topology on the output space so that by definition the network is continuous (quotient topology) but then topology really is getting you nothing.
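To spell out the failure (notation mine, as a sketch): continuity of the network $f : (X, \tau_{\text{sem}}) \to ([0,1]^2, \tau_{\text{eucl}})$ means
\[
f^{-1}(U) \in \tau_{\text{sem}} \quad \text{for every } U \in \tau_{\text{eucl}},
\]
and there's no reason a preimage of a euclidean ball, i.e. the set of inputs mapped to nearby outputs, should be a union of "semantically related" sets. The quotient (final) topology $\tau_f = \{\, U \subseteq [0,1]^2 : f^{-1}(U) \in \tau_{\text{sem}} \,\}$ makes $f$ continuous by construction, which is exactly why it buys you nothing.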
Do you have formal definitions of what exactly you mean by the input space, or what you mean by the output space? What are the underlying sets, and what topology are you equipping them with? Wouldn't the output space just be [0,1]^2, and the input space R^n?
Ok, I have a neat construction for z=1: https://www.lesswrong.com/posts/g9uMJkcWj8jQDjybb/ping-pong-computation-in-superposition. It works pretty well (with small width and few layers) and zero error. Note that the zero error is exact here, not asymptotic.
I'm trying to find a similar construction that scales comparably in the number of parameters required for general z, without just scaling the number of layers up. I'd also be curious whether it's possible to avoid scaling the parameter count linearly with z, but that seems quite difficult.
Reading the midterm, the actual question was about the distance between a non-empty compact set and a non-empty closed set in R^n. Which is indeed a pretty simple exercise.
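For reference, a proof sketch (assuming the exercise is to show the distance is attained, and positive when the sets are disjoint): for non-empty compact $K$ and closed $C$ in $\mathbb{R}^n$, set
\[
d(K, C) = \min_{x \in K} \inf_{y \in C} \lVert x - y \rVert .
\]
The map $x \mapsto d(x, C)$ is 1-Lipschitz, hence continuous, so it attains its minimum on the compact set $K$ at some $x_0$. If $K \cap C = \varnothing$, then $d(x_0, C) > 0$: otherwise some sequence $y_k \in C$ with $\lVert x_0 - y_k \rVert \to 0$ would converge to $x_0$, and closedness of $C$ would force $x_0 \in C$.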
I don't care about prediction markets.
But people who believe we aren't going to be able to fully understand models frequently take that as a reason not to pursue ambitious/rigorous interpretability. I thought that was the position you were taking, given you used the market to decide whether the agenda is "good" or not.