Summary and overview: LLMs seem to lack the metacognitive skills that help humans catch errors. Improvements to those skills might be net positive for alignment, despite improving capabilities in new directions. Better metacognition would reduce LLM errors by catching mistakes, and by managing complex cognition to produce better answers in...
Summary: Generalization is one lens on the alignment challenge. We'd like network-based AGI to generalize ethical judgments as well as some humans do. Broadening training is a classic and obvious approach to improving generalization in neural networks. Training sets might be broadened to include decisions like whether to evade human...
Epistemic status: I'm pretty sure AI will alarm the public enough to change the alignment challenge substantially. I offer my mainline scenario as an intuition pump, but I expect it to be wrong in many ways, some of them important. Abstract arguments are in the Race Conditions and concluding sections. Nora has...
Epistemic status: These questions seem useful to me, but I'm biased. I'm interested in your thoughts on any portion you read. If our first AGI is based on current LLMs and alignment strategies, is it likely to be adequately aligned? Opinions and intuitions vary widely. As a lens to analyze...
We should probably try to understand the failure modes of the alignment schemes that AGI developers are most likely to attempt. I still think Instruction-following AGI is easier and more likely than value aligned AGI. I’ve updated downward on the ease of IF alignment, but upward on how likely it...
It is often noted that anthropomorphizing AI can be dangerous. People likely have prosocial instincts that AI systems lack (see below). Assuming an AGI will be aligned because humans who behave similarly are usually mostly harmless is probably wrong, and quite dangerous. I want to discuss a flip side of using...
Summary: When stateless LLMs are given memories, they will accumulate new beliefs and behaviors, and that may allow their effective alignment to evolve. (Here "memory" means learning during deployment that persists beyond a single session.)[1] LLM agents will have memory: Humans who can't learn new things ("dense anterograde amnesia")...