Other than fixing the periods, I wouldn't change the questions right now as I've already got norms for their original form and you've already collected partial data on them. I may look into improving the phrasing for future versions of the questions.
If anyone had any issues with the political questions at the end, please poke me about them here so I can take them into account in further development of this ideology test.
I think even if you care about the arc of history, paving over everything would still be selected for. Yes, there's the scar problem you mention, but it's not clear that it's strong enough to prevent it.
I think to some extent, "paving over everything" is also an illustration of how natural impact regularization != safety.
(A) Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer.
I don't confidently disagree with this statement, but it occurs to me that I haven't tried this myself, haven't followed the area very closely, and have sometimes heard claims that there are promising methods.
A lot of people try to answer questions like this with mechanistic interpretability, but that probably isn't very feasible. However, investigations based on ideas like neural tangent kernels seem plausibly more satisfying and feasible. If you can show that the dataset contains a bunch of instances that push the model towards saying “apple” rather than “banana”, and you then investigate where those data points come from and find that there's actually a pretty logical story for them, then that seems basically like success.
As an example, I remember a while ago there was some paper that claimed to have found a way to attribute NN outputs to training data points, and it claimed that LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions. I didn't read the paper so I don't know whether it's legit, but that sort of thing seems quite plausibly feasible a lot of the time.
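To make the general idea concrete, here's a toy sketch of gradient-dot-product ("TracIn-style") training data attribution. This is my own illustration of the kind of method I have in mind, not the method from that paper; the model, data, and the apple/banana class labels are all made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_vector(model, loss):
    """Flatten the gradient of `loss` w.r.t. all trainable parameters into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.flatten() for g in grads])

def influence_scores(model, train_set, query_x, query_y):
    """Score each training example by how much its gradient aligns with the
    gradient that pushes the model toward the query answer."""
    query_grad = grad_vector(model, F.cross_entropy(model(query_x), query_y))
    scores = []
    for x, y in train_set:
        g = grad_vector(model, F.cross_entropy(model(x), y))
        # Positive score: training on this example pushes toward the query answer.
        scores.append(torch.dot(g, query_grad).item())
    return scores

# Tiny fake setup: 2 classes, class 0 = "apple", class 1 = "banana".
model = nn.Linear(4, 2)
train_set = [(torch.randn(1, 4), torch.tensor([0])) for _ in range(5)]
query_x, query_y = torch.randn(1, 4), torch.tensor([0])
print(influence_scores(model, train_set, query_x, query_y))
```

The high-scoring training examples are the candidates you'd then go look at to see whether there's a logical story behind them.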
Actually, I suppose they don't even need to add perturbations to A directly; they can just add perturbations to S and generate A's from the perturbed S'. Or they could probably even look at users' histories to find direct perturbations to either S or A.
Theory for a capabilities advance that is going to occur soon:
OpenAI is currently getting lots of novel triplets (S, U, A), where S is a system prompt, U is a user prompt, and A is an assistant answer.
Given a bunch of such triplets (S, U_1, A_1), ..., (S, U_n, A_n), it seems like they could probably create a model P(S|U_1, A_1, ..., U_n, A_n), which could essentially "generate/distill prompts from examples".
This seems like the first step towards efficiently integrating information from lots of places. (Well, they could ofc also do standard SGD-based training, but that has its issues.)
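As a purely illustrative sketch of what a training example for such a P(S|U_1, A_1, ..., U_n, A_n) "prompt distillation" model might look like (the formatting here is my own invention, not anything OpenAI has described):

```python
def make_distillation_example(system_prompt, dialogue_pairs):
    """dialogue_pairs: list of (user_prompt, assistant_answer) collected under system_prompt."""
    context = "\n\n".join(f"User: {u}\nAssistant: {a}" for u, a in dialogue_pairs)
    # The model is trained to generate the target (the hidden system prompt)
    # conditioned on the observed exchanges.
    return {"input": context, "target": system_prompt}

example = make_distillation_example(
    "You are a terse legal assistant. Cite statutes where relevant.",
    [("Can my landlord keep my deposit?", "Generally no, unless ..."),
     ("What counts as normal wear and tear?", "Courts usually treat ...")],
)
```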
A follow-up option: they could use something à la Constitutional AI to generate perturbations A'_1, ..., A'_n. If they have a previous model like the above, they could then generate a perturbed prompt S' via P(S'|U_1, A'_1, ..., U_n, A'_n). I consider this significant because it then gives them the training data to create a model P(S'|S, U_1, A_1, A'_1), which essentially allows them to do "linguistic backchaining": the user can update an output of the network, A_1 -> A'_1, and the model can then suggest a way to change the prompt to obtain similar updates in the future.
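A hedged sketch of assembling one training example for that prompt-editing model P(S'|S, U_1, A_1, A'_1): here `distill_prompt` stands in for the earlier P(S|U_1, A_1, ..., U_n, A_n) model and `perturb_answer` for a Constitutional-AI-style rewrite step. Both are hypothetical callables I'm making up for illustration, not existing APIs.

```python
def make_edit_example(distill_prompt, perturb_answer, system_prompt, pairs):
    """pairs: list of (user_prompt, assistant_answer) generated under system_prompt."""
    perturbed = [(u, perturb_answer(u, a)) for u, a in pairs]
    new_prompt = distill_prompt(perturbed)   # S' distilled from the perturbed answers
    u1, a1 = pairs[0]
    _, a1_new = perturbed[0]
    return {
        "input": {"old_prompt": system_prompt, "user": u1,
                  "old_answer": a1, "new_answer": a1_new},
        # The model learns to propose the revised prompt S' from a single edited example.
        "target": new_prompt,
    }

# Stand-in callables just to show the shapes; real versions would be LLM calls.
toy = make_edit_example(
    distill_prompt=lambda pairs: "You are a helpful assistant. Keep answers brief.",
    perturb_answer=lambda u, a: a.split(".")[0] + ".",
    system_prompt="You are a helpful assistant.",
    pairs=[("Summarize this article.", "Here is a long summary. It covers ...")],
)
```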
Furthermore, I imagine this could be combined into some sort of "linguistic backpropagation" by repeatedly applying models like this, which could let a lot of existing methods be applied to a far greater extent than they have been so far.
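Very roughly, the recursion I have in mind would look something like the sketch below: propagate a desired change in the final answer backwards through a chain of LLM stages. `edit_prompt` stands in for the P(S'|S, U, A, A') model above, and `edit_input` is an additional hypothetical model P(U'|U, A, A') that says how the *input* would have to change to get the changed answer. Neither exists today; this is just to show the shape of the loop.

```python
def linguistic_backprop(edit_prompt, edit_input, stages, corrected_final_answer):
    """stages: list of dicts {'prompt': S, 'input': U, 'output': A}, in pipeline order,
    where each stage's output is the next stage's input."""
    desired_output = corrected_final_answer
    for stage in reversed(stages):
        # Analogue of the parameter update: revise this stage's prompt.
        stage["new_prompt"] = edit_prompt(stage["prompt"], stage["input"],
                                          stage["output"], desired_output)
        # Analogue of the input gradient: the desired input of this stage
        # becomes the desired output of the previous stage.
        desired_output = edit_input(stage["input"], stage["output"], desired_output)
    return stages
```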
Obviously this is just a very rough sketch, and it would be a huge engineering and research project to get this working in practice. Plus maybe there are other methods that work better. I'm mainly just playing around with this because I think there's a strong economic pressure for something-like-this, and I want a toy model to use for thinking about its requirements and consequences.
I think it's important to put more effort into tracking such definitional issues, though. People end up overstating things because they round their interlocutors' viewpoints off to their own. For instance, if person C asks "is it safe to scale generative language pre-training and ChatGPT-style DPO arbitrarily far?", and person D rounds this off to "is it safe to make transformer-based LLMs as powerful as possible?" and answers "no, because instrumental convergence and compression priors", that answer is probably just false for the original meaning of the question.
If this repeatedly happens to the point of generating a consensus for the false claim, then that can push the alignment community severely off track.
Maybe for future versions I should split this question up into separate parts. That said, I'm not sure that would make it easier for most people to answer, since they might have a clearer opinion on how well the government is doing than on whether it's failing for political reasons or for knowledge reasons.