What we really want from interpretability is high accuracy, out of distribution, at a scale that works for large models. You got very high accuracy... but I have no context for whether that's good or bad. What would a naïve baseline get? And what do SAEs get? It would also be nice to see an out-of-distribution test set, because getting 100% on your test set suggests it's fully within the training distribution (or that your VQ-VAE worked perfectly).
I tried something similar but only got about half as far as you; still, my code may be of interest. I wanted to know whether it would help with lie detection out of distribution, but I didn't get great results. I was using a very hard setup where no methods work well.
I think the VQ-VAE is a promising approach because it's more scalable than SAEs, which have something like 8 times the parameters of the model they are interpreting. Also, your idea of using a decision tree on the tokenised space makes a lot of sense given the discrete latent space!
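For anyone who wants to try the combination, here's a minimal sketch of the pipeline as I understand it. Everything here is a random stand-in (the activations, labels, and codebook are all hypothetical placeholders for the trained parts): quantise activations against a VQ-VAE-style codebook, then fit a decision tree on the discrete tokens.

```python
# Sketch: decision tree over VQ-VAE-style discrete codes.
# All data here is random; swap in real activations and a trained codebook.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, n_tokens, code_dim, codebook_size = 500, 8, 16, 128

# Stand-ins: per-example activation chunks and a binary concept label.
acts = rng.normal(size=(n_examples, n_tokens, code_dim)).astype(np.float32)
labels = rng.integers(0, 2, size=n_examples)

# Hypothetical codebook; in the real setup this comes from the trained VQ-VAE.
codebook = rng.normal(size=(codebook_size, code_dim)).astype(np.float32)

# Nearest-neighbour quantisation: each activation chunk becomes a token id.
dists = ((acts[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(-1)  # shape (n_examples, n_tokens)

# One-hot the ids so tree splits mean "token k at position i", rather than
# an ordinal comparison of arbitrary code indices.
X = np.eye(codebook_size)[tokens].reshape(n_examples, -1)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
tree = DecisionTreeClassifier(max_depth=5).fit(X_tr, y_tr)
print("held-out accuracy:", tree.score(X_te, y_te))
```

The nice property is that each split in the fitted tree reads directly as "code k fired at position i", which is exactly the kind of legible explanation you'd want from the discrete latent space.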
In that way it's similar to charter cities. A cheaper intermediate stage could be online organizations like World of Warcraft gaming clans, internet forums, or collaborative online projects. They are not quite the same, but they are cheap.
Yeah, it does seem tricky. OpenAI recently tried a unique governance structure, and it's turning out to be unpredictable and potentially costly in terms of legal fees and dysfunction.
That's true. But just because people aren't motivated doesn't mean we shouldn't try. It's possible to create incentives with subsidies, direct payments, etc.
My reaction to the Churchill quote is: why don't we try more forms of government?
Of course we can't start with animal trials, but we can start small, with an American Antarctic base or a village, then move up to city trials, state trials, and so on.
It's also kind of a negative place to put your attention. People probably prefer not to think about it.
At best, it's a boring chore; at worst, it's a cognitive hazard. Best to minimize time spent on things that make you feel angry, mistreated, or resentful, especially if you might get trapped in those states, making everything turn out worse and making day-to-day interactions a struggle rife with negative associations.
In short: bad vibes.
This topic is a little stressful for me to read about, because in the course of doing our jobs we risk enacting low- or high-status behaviors, and these often lead to conflicts with status-obsessed people.
For instance, deferring to a more knowledgeable colleague might set a precedent that's hard to change, while assertively leading an initiative can sometimes spark disputes over status.
It doesn't help that we rationalists are part of a very direct ask/tell culture, so we often come into conflict with most other cultures.
But being rationalists, we should not just theorycel but consider helpful strategies. Here are a few:
Don’t be an absolutist non-conformist. Conforming in small ways often gives you the opportunity to non-conform in big ways. Being deferential to your boss, for example, opens up a world of possibilities.
Anyone got other ideas?
Unsurprisingly, you are not the first to think of this: Cicero wrote on it about two thousand years ago in How to Be a Friend (De Amicitia)!
This was empirically demonstrated to be possible in "Curiosity-driven Exploration by Self-supervised Prediction" (Pathak et al., 2017).
It probably could be extended to learn "other" and the "boundary between self and other" in a similar way.
I implemented a version of it myself years ago, and it worked. I can only imagine what will happen when someone redoes some of these old RL algos with LLMs providing the world model.
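For anyone curious about the mechanism, here's a minimal sketch of the core trick, not Pathak et al.'s actual code (they predict in a learned feature space trained via an inverse-dynamics model; this predicts raw states for brevity, and the environment and dimensions are toy stand-ins): the intrinsic reward is just the forward model's prediction error, so the agent is drawn toward transitions it can't yet predict.

```python
# Sketch of curiosity-driven exploration: intrinsic reward = forward-model
# prediction error. Toy dimensions and fake dynamics; not the paper's code.
import torch
import torch.nn as nn

state_dim, action_dim = 8, 4
forward_model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
    nn.Linear(64, state_dim),
)
opt = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

def intrinsic_reward(state, action, next_state):
    """Curiosity bonus: how badly the forward model predicts the next state."""
    pred = forward_model(torch.cat([state, action], dim=-1))
    return ((pred - next_state) ** 2).mean(dim=-1)

# Toy loop over random transitions; in a real agent these come from rollouts,
# and the (detached) bonus is added to the environment reward for the policy.
for step in range(200):
    s = torch.randn(32, state_dim)
    a = torch.eye(action_dim)[torch.randint(action_dim, (32,))]  # one-hot actions
    s_next = s + 0.1 * torch.randn_like(s)  # fake dynamics

    r_int = intrinsic_reward(s, a, s_next)  # reward signal for the policy
    loss = r_int.mean()                     # the same error trains the model
    opt.zero_grad(); loss.backward(); opt.step()
```

The elegant part is that one quantity does double duty: the prediction error trains the world model down, while simultaneously paying the policy to seek out whatever the model still gets wrong.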