I expect one strong reason for different ASIs to develop similar abstractions regardless of their goals is that they need to predict a bunch of other agents in the world (either humans or other ASIs), and so need to be able to represent those agents' goals.
LLMs pretrained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist through post-training, providing alignment-in-depth. We recommend labs pretrain for alignment, just as they do for capabilities.
Website: alignmentpretraining.ai
Us: geodesicresearch.org | x.com/geodesresearch
Note: We are currently gathering feedback here before submitting to ICML. Suggestions here or on our Google Doc (which contains a more detailed overview of our experiments) are welcome! We will release a revision on arXiv in the coming days. Folks who leave feedback will be added to the Acknowledgments section. Thank you!
We pretrained a suite of 6.9B-parameter LLMs, varying only the content related to AI systems, and evaluated them for misalignment. When filtering the vast majority...
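For intuition, here is a minimal sketch of the kind of corpus intervention described above: filter organic AI-related documents out of a web corpus, then mix in synthetic documents depicting well-aligned AI. The function names (`is_about_ai`, `build_corpus`) and the keyword heuristic are our own illustrative assumptions, not the authors' actual pipeline, which would presumably use a trained classifier rather than a regex.

```python
# Hypothetical sketch of content-conditional filtering plus synthetic mixing.
# Not the paper's code; names and heuristics are illustrative assumptions.
import random
import re
from typing import Iterable, Iterator

# Crude keyword proxy for "content related to AI systems".
AI_PATTERN = re.compile(
    r"\b(AI|artificial intelligence|language model|AGI|superintelligen\w*)\b",
    re.IGNORECASE,
)

def is_about_ai(doc: str) -> bool:
    return bool(AI_PATTERN.search(doc))

def build_corpus(
    web_docs: Iterable[str],
    synthetic_good_ai_docs: Iterable[str],
    keep_ai_fraction: float = 0.0,  # 0.0 = drop all organic AI content
) -> Iterator[str]:
    for doc in web_docs:
        # Keep non-AI documents; retain a small (possibly zero) fraction of
        # AI-related ones to vary the pretraining mix across model variants.
        if not is_about_ai(doc) or random.random() < keep_ai_fraction:
            yield doc
    # Append synthetic documents portraying well-aligned AI behavior.
    yield from synthetic_good_ai_docs
```

Sweeping `keep_ai_fraction` (and swapping the synthetic set in or out) is one way to produce a suite of models that differ only in their AI-related pretraining content.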
I think I phrased my previous comment poorly. What I meant is that if you have developed a set of abstractions relevant to achieving your goals, and I want to predict you accurately, then I also need to develop abstractions relevant to achieving your goals. Given limited representational capacity, this creates pressure for different agents to develop similar representations.