This post has been written for the first Refine blog post day, at the end of a week of readings, discussions, and exercises about epistemology for doing good conceptual research. Thanks to Adam Shimi, Linda Linsefors, and Dan Clothiaux for comments.
To steelmine (as per Tamsin): to intentionally look for what productive mistakes a research direction is hinting at.
There are quite a few analogies used regularly across alignment. Some popular ones include "prompts are like programs" or "interpretability is like neuroscience on ML models." While no analogy is perfect, some are useful, as they help us recycle years of intellectual labor if we get the translation key right. This opportunity is particularly relevant if you side with relatively short timelines, because you can use it to quickly scan through entire regions of hypothesis space for (part of) a silver bullet.
This is where I'm coming from in my work at Refine. I want to use the fellowship as an opportunity to investigate a dozen or so themes which connect prosaic alignment to other tangentially related disciplines and see which ones yield productive mistakes. This top-down approach of starting with broad themes and then zooming in on details is explicitly baked into the linked artifact which I'll be using as a sketchpad throughout the program. The left-to-right tiled layout represents branches exploring various assumptions, technical details, and failure modes, while allowing (and welcoming) targeted feedback.
That said, here are a few handpicked excerpts from said artifact to give you a taste:
Based on a new formalism from argumentation theory, we might be able to turn large language models into Petri dishes for cultivating belief systems. In such a memetic pressure cooker, a host of belief systems would be incentivized to generally maintain internal consistency while systematically undermining each other. In this framework, we could develop specialized "lab equipment" to help us (1) extrapolate belief systems, (2) accelerate conceptual research, and (3) synthesize a memetic variant of the GPU-burner. For instance, an Overton probe could be used in tandem with a Gould fork to...
Life is arguably the primordial world-optimizer, with biotic factors being known to have caused major shifts in the abiotic world to suit their needs: oxygen-rich atmosphere, fertile soils, etc. We find ourselves in a "blind watchmaker" position where we can bring into existence systems beyond our comprehension by merely defining a computational niche and applying optimization pressure. One way of modeling niche-bound ecologies is in terms of their inputs and outputs: what kinds of energy and matter they consume in relation to what they offer for others to consume. If framing ML models as ecologies adapting to an (often unnaturally fixed) computational niche...
What if we framed prompts as resonance chambers of different shapes and sizes which amplify certain dynamics while dampening others? The goal would then be to devise such constructs which reliably do away with misaligned spectral components, and apply them directly to latent activations. One promising way of doing this relies on a new formalism from dynamical systems which defines an entire Boolean logic of dynamics. By taking the conjunction of whitelisted dynamics and...
If you want to skim through more themes like the ones above, consider wandering around the actual artifact for a few minutes. While any feedback you might have would be welcome, at the time of writing I'm particularly interested in leading questions which I could use to branch out into new considerations (e.g. Would this still hold if X? Why do you believe Y? How could this account for Z?). The number of comments (and reactions?) also acts as a heuristic for guiding the growth of the conversational tree in promising directions.
While I want to keep meta-level thoughts for the end of the program, I personally believe ideas are like athletes. You train them by applying stressors, and it's only by challenging them that they'll grow stronger. If that's what you want from them, giving them an easy time is not particularly helpful. I'd be really grateful if you'd help me train those suckers through your feedback!
Related recommendation: Inward and outward steelmanning — LessWrong
Imagine that you encountered a car with square wheels
Inward steelmanning: "This is an abomination! It doesn't work! But maybe with round wheels it would be beautiful. Or maybe a different vehicle with square wheels could be beautiful."
Outward steelmanning: "This is ugly! It doesn't work! But maybe if I imagine a world where this car works, it will change my standards of beauty. Maybe I will gain some insight about this world that I'm missing."
If you want to be charitable, why not grant your opponent an entire universe with its own set of rules?