Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities
1. Summary and overview
LLMs seem to lack metacognitive skills that help humans catch errors. Improvements to those skills might be net positive for alignment, despite improving capabilities in new directions.
Better metacognition would reduce LLM errors by catching mistakes, and by managing complex cognition to produce better answers in the first place. This could stabilize or regularize alignment, allowing systems to avoid actions they would not "endorse on reflection" (in some functional sense).[1] Better metacognition could also make LLM systems useful for clarifying the conceptual problems of alignment. It would reduce sycophancy, and help LLMs organize the complex thinking necessary for clarifying claims and cruxes in the literature.
Without such improvements, collaborating with LLM systems on alignment research could be the median doom-path: slop, not scheming. They are sycophantic, agreeing with their users too much, and produce compelling-but-erroneous "slop". Human brains produce slop and sycophancy, too, but we have metacognitive skills, mechanisms, and strategies to catch those errors. Considering our metacognitive skills gives some insight into how they might be developed for LLMs, and how they might help with alignment (§6, §7).
I'm not advocating for this. I'm noting that work is underway, noting the potential for capability gains, and noting the possibility that the benefits for alignment outweigh the danger from capability improvements. I'm writing about this because I think plans for alignment work should take these possibilities into account.[2] I'll elaborate on all of that in turn.
I hypothesize that metacognitive skills constitute a major part of the "dark matter of intelligence"[3] that separates LLMs and LLM agents from human-level competence. I (along with many others) have spent a lot of time wondering why LLMs appear so intelligent in some contexts, but wildly incompetent in others. I now think metacognitive skills are a major part o
What I mean by "nice" is roughly the opposite of being a ruthless sociopath. It means treating other sentient beings well for its own sake.
Most humans are definitely not ruthless sociopaths. Sociopaths are usually estimated at around 1% to 4% of the population. And most of those aren't even that ruthless; I think it's a spectrum, like all biological mental differences. This leaves the conclusion that even NON-sociopathic humans are often pretty ruthless when they can get away with it, like when they hold a lot of power. But that's pretty much beside the main point here, which is that we shouldn't expect nice/non-ruthless behavior by default.
Laws and norms are not going to restrain an...