Unfortunately, there exist unkind people. We can select for people with eg long prosocial careers, high introspection, positive interviews with close friends / relatives, cognitomotor symptoms of empathy etc
I think a central property you don't mention explicitly is [having kept promises in the past, especially in cases where these required doing difficult costly things and where the person thought they would not be rewarded in the future for having kept the promise]; also, honesty. I'd guess that an important class of potential successes with this kind of scheme, in fact maybe most of the successes (but idk), involve the fooming mind [keeping a promise]/[maintaining a commitment]. I think that maintaining some kind of kindness without a specific commitment to helping existing humans in some way can easily "misgeneralize" to eg some sort of utilitarianism, and nearly every kind of utilitarianism endorses the atoms and negentropy of all existing people being used for something else, or just more generally misgeneralize to caring about new people you can create and various other possible beings and activities over existing people.
also copying a note i wrote for myself on this topic: "
Most people currently thinking about AI alignment seem to hope that there is some sort of "formula" for safely/[value/character-preservingly]/whatever becoming more capable (and for alignment more broadly). I doubt there is some such formula to be found. Instead, I think that as one becomes more capable, one should keep thinking carefully about how to become more capable, and that there isn't some "formula" for how to do this thinking. I think there is very much to be understood about how to become more capable "safely". This note presents some basic ideas for that.
Here are some more concrete reasons to be interested in ideas for safe self-development:
I'd guess that an important class of potential successes with this kind of scheme, in fact maybe most of the successes (but idk), involve the fooming mind [keeping a promise]/[maintaining a commitment]. I think that maintaining some kind of kindness without a specific commitment to helping existing humans in some way can easily "misgeneralize" to eg some sort of utilitarianism
Good point, yeah. I'm still confident in the overall machinery of "better understanding of one's cognition and tooling to self-modify commensurately" -> stability; but I really don't have a principled way to select for this. I'm pretty confident Eliezer has demonstrated himself committed in this sense (see "genre savviness"), but I don't know anyone else who would be a good starting point.
and nearly every kind of utilitarianism endorses the atoms and negentropy of all existing people being used for something else, or just more generally misgeneralize to caring about new people you can create and various other possible beings and activities over existing people.
Locally valid but connotationally wrong when read through; like, yes, we definitely lose a huge chunk of humanity-CEV in this scenario (which is what actually matters unless atemporal trade with our Everett branches can remedy the holes), but I'd expect a "kindness-foomed" entity to probably not kill people to repurpose their atoms for other entities. A hedonium-foom would, sure, but killing isn't particularly kind to most people.
Most people currently thinking about AI alignment seem to hope that there is some sort of "formula" for safely/[value/character-preservingly]/whatever becoming more capable (and for alignment more broadly). I doubt there is some such formula to be found. Instead, I think that as one becomes more capable, one should keep thinking carefully about how to become more capable, and that there isn't some "formula" for how to do this thinking.
A priori, I'm about 95% confident that there's some coherent and robust math for Vingean reflection which we have yet to invent. But our chances of cracking it before ASI / human superintelligence (HSI) are quite thin, like maybe 10%, on my modal model.
many people seem to think that it would be just fine to let 2025 Claude foom
Claude 3.5 or any other LLM has vastly worse cognitive attractor dynamics under self-modification than humans do, given a commercially-induced RSI capability. I have a draft story about this sort of thing; but basically, the internals range from maybe-aligned-but-horribly-incapable to unaligned-and-incapable to unaligned-and-capable-of-RSI; nowhere along that Pareto frontier do we see something as stable (wrt raw-utility-as-would-be-galactically-amortized post-foom) as a mildly above-average human.
I can however imagine models two generations from now, were they aligned like Opus 3.5, being sufficiently stable in the comparatively narrower action domain of doing a pivotal act to actually bump themselves another 2 generations' worth of capacity and just execute a pivotal act. But I really don't think we'll get Claude Legolas 8.6 aligned like that (P < 0.03).
how do you maintain a belief in god over very much thinking / capability gain (and the thing being basically false)
Might be useful, but this conflates instrumental epistemics with a normative/value thing (which you acknowledge indirectly). This gap widens under intelligence augmentation in the inverse direction of value stability.
Last post was an example of how intelligence-correlated tools can stabilize reflection. Here I'll discuss how native-cyborgism attenuates the hard parts of getting to a pivotal act.
Probable ASI is grown end-to-end and evaluated by much less capable humans. Alignment is hard because we're weaker optimizers than what we're trying to steer:
This power asymmetry means, in general, that the processes we're trying to control simply route around our measures. A mathematically robust specification of properties like corrigibility, if translated to PyTorch code, removes the gaps an ASI would otherwise flow through. I'm glad some people are working on this, but perhaps there are other approaches.
What if we instead kept the optimizer at a manageable capacity[1]?
Introspection
Humans, I think, can have much more control as individuals over where their values drift than labs will have over future AIs. We can actively guide our internals because they're apparent to us; value stability scales with introspection.
People don't typically try to do this. Most who do try don't have particularly strong models of cognitive neuroscience, and rarer still[2] are those who correlate evolutionary psychology (and other relevant sciences) with their introspection.
To rephrase, some form of self-understanding scales with introspection and intelligence, and so boosting both intelligence and introspection would buffer ontological drift (in the precise sense of preventing incomprehensibility) and "optimizer power mismatch" (which in this case is more like adversarial outputs from the BCI).
"Rolling your own metaethics"
What's a value? Not necessarily affective response; valence is correlated with but not precisely "value".
"The simplest description of what you're optimizing for" sucks when "you" aren't well-defined; eg every atom in my body optimizes for stable electron configurations; my neurons often optimize for cortical arousal against my wishes (insomnia); my metacognitive process monitor probably optimizes for sensual and semantic error minimization, but "I" angle for unpredictable situations, etc.
This is one example of how unsolved metaphilosophy makes alignment hard, even when you start with humans.
Unfortunately, even a long reflection probably suffers this problem class; "human" isn't currently well-defined in this domain, but we must edit cognitive prerequisites for "values" somewhere during CEV. For example, raising a generation (or CEV'ing a humanlike society) without war will change how they "value" everything culturally and psychologically downstream of experiencing war. Removing whatever neural machinery grows "values" differently depending on whether one is exposed to war is also a cognitive edit.
So it's probably impossible to keep your hands off the future in a mathematically robust way. A long reflection probably still returns great values with minimal pericultural edits; I think we should aim for this.
When I say "value stability" here, I mean something like "the propensity to commit a pivotal act which stabilizes a long reflection under minimal cultural edits".
Current population-median humans are bad at qua-value stability (for example, when newly powerful). But I don't think most people even try to be temporally coherent, at least beyond satisfying some Duty; nor do they have solid models of how reward actually happens in their brain, which are prerequisite to robust self-alignment tools.
Most clinical depression is probably a neuroplasticity deficiency, i.e. solved by access to neural self-modification[3]; but there are other architectural priors, ones which are more complex and not so steerable via external edits.
Once, in the ancient ages of MTV and arcade games[4], there was an epileptic patient with a deep-brain stimulating electrode.
She found that the electrode hit an erotic circuit; so she kept pressing the button until she was skipping all obligations to keep hitting her right thalamic nucleus.
Like depression, wireheading is a behavioral attractor.
I expect that most aberrant[5] cognitive attractors under self-modification are debilitating; eg sunk costs, narcissism, confirmation bias. The prior can't update -> you're in a basin. Globally debilitating basins (ones which reduce optimization power in many domains) seem less consequential for alignment.
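One way to read "the prior can't update -> you're in a basin" is as a Bayesian toy model; this is my own illustrative sketch, not a claim about how the brain implements any of this. The function names and the 0.2 / 0.8 likelihoods are made up for the example. A belief held at probability 1.0 stays at 1.0 no matter how much contrary evidence arrives, while a merely confident belief gets pulled out.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior probability of a hypothesis after one observation (Bayes' rule)."""
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    return numerator / denominator


def run(prior, n_observations=20):
    """Feed the same piece of disconfirming evidence n_observations times."""
    belief = prior
    for _ in range(n_observations):
        # Hypothetical likelihoods: the evidence is 4x more likely
        # if the hypothesis is false than if it is true.
        belief = bayes_update(belief, 0.2, 0.8)
    return belief


open_minded = run(0.9)  # confident, but still able to update
dogmatic = run(1.0)     # a prior of exactly 1.0 never moves

print(f"open-minded belief after evidence: {open_minded:.2e}")
print(f"dogmatic belief after evidence:    {dogmatic}")
```

The stuck case is the degenerate one where the weight on "I might be wrong" is exactly zero; real attractors like sunk cost or confirmation bias are presumably messier, closer to discounting the evidence than to a literal prior of 1.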
But there do exist non-crippling, worrisome behavioral attractors. Having less empathy for outgroup members, for example, is genetically (and memetically) coded for in most humans.
Though these are harder to fix, the same tools apply here as to irrational basins; see the previous post for examples. Evolutionary psychology more generally has a massive toolset which a smarter augmented group could distill.
As someone's general cognition improves, they get finer felt senses and models supporting self-alignment (neuroscience, social modeling). I'm gonna call this gestalt tooling "introspection".
So, "value stability" scales with introspective tooling. What sort of BCI algorithm could drastically boost cognition while maintaining introspection?
I don't have a good answer for this in advance of experimental results about how the heck local neuronal learning occurs.
But one constraint which I'll sketch later, in the engineering portion of this sequence, is that, every few layers, the model decodes into biological activations before re-encoding. Inasmuch as humans naturally learn to introspect, intermediate readouts let this continue.[6]
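To make the "decode every few layers" constraint concrete, here is a minimal structural sketch in plain Python. Everything in it is an assumption for illustration: the random weights, the tanh nonlinearity, the dimensions, and names like `forward_with_readouts` and `decode_every` are hypothetical, and it says nothing about how real neural activations would be read or written. It only shows the shape of the idea: the digital stack is forced back through a biological-dimension bottleneck at regular intervals, and each bottleneck is an intermediate readout.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix, scaled so activations stay in a reasonable range."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def apply_layer(matrix, vector):
    """Matrix-vector product followed by a tanh nonlinearity."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector))) for row in matrix]


def forward_with_readouts(bio_input, n_layers=6, decode_every=2, hidden_dim=8, seed=0):
    """Run a toy layer stack that returns to 'biological' space every few layers."""
    rng = random.Random(seed)
    bio_dim = len(bio_input)
    encode = make_matrix(hidden_dim, bio_dim, rng)  # bio space -> digital space
    decode = make_matrix(bio_dim, hidden_dim, rng)  # digital space -> bio space
    layers = [make_matrix(hidden_dim, hidden_dim, rng) for _ in range(n_layers)]

    hidden = apply_layer(encode, bio_input)
    readouts = []
    for index, layer in enumerate(layers, start=1):
        hidden = apply_layer(layer, hidden)
        if index % decode_every == 0:
            # Bottleneck: everything downstream must pass through a
            # biological-dimension representation, which is recorded.
            bio = apply_layer(decode, hidden)
            readouts.append(bio)
            hidden = apply_layer(encode, bio)
    return hidden, readouts


final_hidden, readouts = forward_with_readouts([0.1, -0.3, 0.5, 0.0])
print(len(readouts), "intermediate readouts of dimension", len(readouts[0]))
```

With six layers and `decode_every=2`, the stack passes through the bottleneck three times. In this toy the encode and decode maps are shared across bottlenecks; whether a real design would share them is an open choice I'm not taking a position on.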
If the digital algo is a grown optimizer and part of a general intelligence, how doesn't it foom? Like, if you're doing agent-y things, then your software will learn agent-y outputs, right? And then you get classic instrumental convergence.
Unlike RLAIF, we'd be starting from a baseline where:
Unfortunately, there exist unkind people. We can select for people with eg long prosocial careers, high introspection, positive interviews with close friends / relatives, cognitomotor[7] symptoms of empathy, etc., but we are still trusting someone to be Good, to stably pursue and execute a minimal pivotal act.
Cohorts could further reduce the risk of a sole human superintelligence going in some weird direction, be that via resonant idiosyncrasies or social starvation.
As for attractors which are harder to predict in advance, those seem nasty, and I don't have any advance-predicted tools. Need better models.
If you're interested in gaming/simulating this error class, contact me; it seems high-value not just for superhumans but also for modeling RLAIF-centric takeoff timelines.
[1] LLM labs think they're doing some combination of robust specification and par-capabilities with RLAIF and lots of other tools, but this will kill everyone if they reach substantially superhuman capability; the holes aren't inarguably visible yet, but they can't hold arbitrary pressure. Nor are such methods even robust enough to cleanly execute a pivotal act in the unlikely event that corporate leadership avails GPT-moose-8.6-ultra of the resources to do so. And that's assuming LLMs, instead of some much more efficient model class.
[2] There are 20,000 or so folks here on LessWrong out of 9,000,000,000. Plausibly many non-LWers train rationality-adjacent skills to moderate effect, including self-alignment.
[3] Higher plasticity alone seems to be about as effective as ECT for depression; augmentees would have access to other hyperparameters (E/I balance, gross connectivity, lateral inhibition), though I don't think this specific example matters much for value stability, nor do I have any idea how to reduce things which look like ASD/ADHD/schizophrenia.
[4] And even earlier, during the pre-Cambrian!
[5] All grown intelligent circuits are cognitive attractors, so "aberrant" means something more like bad-for-coherence, which needs better agent theory than I can provide.
[6] We could also optimize some statistics of the decode representation to resemble SAEs or similar AI interp tools as an auxiliary method, but I'm more optimistic about biosimilarity than interp crosspollination on the margin.
[7] For example, AIs which use facial micromovements / body language to classify emotional responses more accurately than humans can; current models aren't great at this, see here. I imagine better versions are used already by intelligence agencies.