I think this test can be performed now or soon, but I'm not sure I'd update much from it. Current LMs are already pretty good at answering questions about themselves when prompted with a small amount of self-identifying information ("You are a transformer language model trained by AICo with data up to 2022/04"). We could also bake this information in through fine-tuning. They won't be able to tell you how many layers they have without being told, but we humans can't determine our brain architecture through introspection either.
I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations. Dialog-prompted models probably fall back on literary depictions of AI for self-oriented questions they don't know how to answer, so the answer might depend on which sci-fi AI the model is role-playing. (It's harder to say what determines the OOD behavior for models trained with more sophisticated methods like RLHF.)
Re: smooth vs. bumpy capabilities, I agree that capabilities sometimes emerge abruptly and unexpectedly. Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes. There are multiple ways to make deployment more conservative and gradual. (E.g., incrementally increase the amount of work the AI is allowed to do without close supervision, or incrementally increase the allowed KL divergence between the new policy and a known-to-be-safe policy.)
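To make the KL-budget idea concrete, here is a minimal sketch (my illustration, not part of the original comment), assuming Hugging-Face-style causal LMs; `new_model`, `safe_model`, `eval_prompts`, and the budget value are hypothetical stand-ins:

```python
import torch
import torch.nn.functional as F

def mean_kl_to_reference(new_model, safe_model, tokenizer, prompts, device="cpu"):
    """Average per-token KL(new || safe) over a batch of evaluation prompts."""
    kls = []
    for text in prompts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        with torch.no_grad():
            logp_new = F.log_softmax(new_model(ids).logits, dim=-1)
            logp_safe = F.log_softmax(safe_model(ids).logits, dim=-1)
        # KL(new || safe): sum over the vocabulary, then average over positions.
        kl = (logp_new.exp() * (logp_new - logp_safe)).sum(-1).mean()
        kls.append(kl.item())
    return sum(kls) / len(kls)

# Deployment gate: widen the budget only gradually as trust in the new policy grows.
KL_BUDGET_NATS = 5.0  # illustrative number, not a recommendation
# assert mean_kl_to_reference(new_model, safe_model, tok, eval_prompts) < KL_BUDGET_NATS
```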
Re: ontological collapse, there are definitely some tricky issues here, but the problem might not be so bad with the current paradigm, where you start with a pretrained model (which doesn't really have goals and isn't good at long-horizon control), and fine-tune it (which makes it better at goal-directed behavior). In this case, most of the concepts are learned during the pretraining phase, not the fine-tuning phase where it learns goal-directed behavior.
To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?
Do alignment & safety research, set up regulatory bodies and monitoring systems.
When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.
Not sure exactly what this means. I'm claiming that you can make raters less flawed, for example by decomposing the rating task and by providing model-generated critiques that help with rating. Also, as models get more sample-efficient, you can rely more on highly skilled and vetted raters.
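A minimal sketch of what critique-assisted rating could look like (my illustration; `query_model` is a hypothetical wrapper around whatever LM you use, not a real API):

```python
def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a language model."""
    raise NotImplementedError

def critique_assisted_rating(task: str, response: str) -> dict:
    # Ask a model to point out possible flaws before the human rater scores the response.
    critique = query_model(
        f"Task:\n{task}\n\nResponse:\n{response}\n\n"
        "List any errors, omissions, or misleading claims in the response."
    )
    # The rater sees task + response + critique, which makes subtle flaws
    # easier to catch than unaided rating.
    return {"task": task, "response": response, "critique": critique}
```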
Found this to be an interesting list of challenges, but I disagree with a few points. (Not trying to be comprehensive here, just a few thoughts after the first read-through.)
IMO prosaic alignment techniques (say, around improving supervision quality through RRM- and debate-type methods) are highly underrated by the ML research community, even if you ignore x-risk and just optimize for near-term usefulness and intellectual interestingness. I think this is due to a combination of (1) poor marketing to the ML community, (2) a lack of benchmarks and datasets, (3) the need to use human subjects in experiments, and (4) the amount of compute required, which was out of reach, perhaps until recently.
Interesting analysis. Have you tried an analysis on quantities other than % improvement? A 10% improvement from low accuracy means something different from a 10% improvement at high accuracy. For example, you could try a linear regression from (small_to_medium_improvement, medium_accuracy) -> large_accuracy and look at the variance explained.
Edit: I tried linear regression on the Chinchilla MMLU data, predicting the large model's accuracy from the three smaller models' accuracies, and only got 8% of variance explained, vs. 7% from looking only at the second-largest model's accuracy. So that's consistent with the OP's claim of unpredictability.
Edit 2: MMLU performance for the smaller models is at about chance level, so it's not surprising that we can't predict much from it. (The accuracies we're looking at for these models are noise.)
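For anyone who wants to run this kind of check themselves, here is a rough sketch of the regression described above (the CSV file and column names are hypothetical placeholders, not the actual Chinchilla data release):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("chinchilla_mmlu_by_task.csv")  # one row per MMLU task

# Predict the largest model's accuracy from the three smaller models' accuracies.
X = df[["acc_small", "acc_medium", "acc_large"]]
y = df["acc_largest"]
reg = LinearRegression().fit(X, y)
print("R^2, three smaller models:", r2_score(y, reg.predict(X)))

# Baseline: predict from the second-largest model alone.
X1 = df[["acc_large"]]
reg1 = LinearRegression().fit(X1, y)
print("R^2, second-largest only:", r2_score(y, reg1.predict(X1)))
```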
This is from his memoir The Singapore Story, from right after he finished studying in the UK. (Don't have a precise reference, just a text file with some notes.)
Weight-sharing makes deception much harder.
Could you explain or provide a reference for this?
Lee Kuan Yew wrote about how he went looking for a governance system for his party, the PAP (which now rules Singapore), after the party was nearly captured by the communists in the 1950s. He looked at the Catholic Church as an inspiring example of a system that had survived for a long time, and he eventually settled on a system based on the Church's process for electing cardinals and the Pope.
I think that doing N independent parallel computations and selecting one of them is way less useful than doing an N-times-longer serial computation. This kind of selection only helps you guess something that is impossible to deduce in any other way. So if anthropics is tacitly selecting the Earth out of N other worlds, that doesn't contribute a factor of N to the total computation; it's a much smaller factor.
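One way to put a rough number on "much smaller" (my gloss, not necessarily the author's intended argument): selecting the single best of N parallel branches conveys at most log2(N) bits, so e.g. N = 10^6 worlds yields only log2(10^6) ≈ 20 bits of selection, nothing like a 10^6x increase in usable serial computation.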
EDIT: intended to write a comment rather than an answer.