I'm a staff artificial intelligence engineer working with LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 15 years. I did research into this during SERI MATS in summer 2025. I'm now looking for work on this topic in the London/Cambridge area of the UK.
Perhaps "epitaxial" techniques—initializing on solutions to related problems—could guide crystal growth
Also known as "warm-starting" — yes, absolutely they can. I led a team that solved a significant problem this way.
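For concreteness, here's roughly what I mean by warm-starting, as a minimal PyTorch-style sketch (the model names, shapes, and task are purely illustrative, not the actual problem my team worked on): take the weights of a model trained on a related problem, copy over everything that fits, and fine-tune from there rather than from a random initialization.

```python
import torch
import torch.nn as nn

# Minimal warm-start sketch: initialize a new model from weights trained
# on a related problem, then fine-tune only on the new task.
# (Architecture and shapes here are illustrative, not from the real project.)

def make_model(out_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(128, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, out_dim),          # task-specific head
    )

source = make_model(out_dim=10)   # pretend this was trained on the related problem
target = make_model(out_dim=3)    # the new problem, with a different output size

# Copy every tensor whose name and shape match; skip the mismatched head.
src_state = source.state_dict()
tgt_state = target.state_dict()
warm = {k: v for k, v in src_state.items()
        if k in tgt_state and v.shape == tgt_state[k].shape}
tgt_state.update(warm)
target.load_state_dict(tgt_state)
print(f"Warm-started {len(warm)}/{len(tgt_state)} tensors from the source model.")

# Fine-tune `target` on the new task as usual from here.
```

The point is just that the optimizer starts from a basin that already encodes most of the relevant structure, so it typically converges faster, and often to a better solution, than starting from scratch.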
That could have been expressed as shared human value.
As I said above:
…human evolved moral intuitions (or to be more exact, the shared evolved cognitive/affective machinery underlying any individual human's moral intuitions)…
are (along with more basic things, like our liking for being around flowers, parks, and seashores, and for temperatures around 75°F) what I'm suggesting as a candidate definition for the "human values" that people on Less Wrong/the Alignment Forum generally mean when they discuss the alignment problem (and I think most of them do mean "shared human values", even if they don't all bother to specify that), and what I'm suggesting pointing Value Learning at.
I also didn't specify above what I think should be done if it turns out that, say, about 96–98% of humans genetically have those shared values, and 2–4% have different alleles.
What would that make the E/Accs?
When I see someone bowing down before their future overlord, I generally think of Slytherins. And when said overlord doesn't even exist yet, and they're trying to help create them… I suspect a more ambitious and manipulative Slytherin might be involved.
And "my tribe". What you want is Universalism, but universalism is a late and strange development. It seems obvious to twenty first century Californians, by they are The weirdest of the WEIRD. Reading values out of evopsych is likely to push you in the direction of tribalism, so I don't see how it helps.
On the Savannah, yes, of course it does. In a world-spanning culture of eight billion people, quite a few of whom are part of nuclear-armed alliances, intelligence and the fact that extinction is forever suggest defining "tribe" ~= "species + our commensal pets". They also suggest noting, and reflecting on the fact, that the default human tendency to assume that tribes are around our Dunbar Number in size is now maladaptive, and has been for millennia.
It's not the case that science boils down to Bayes alone,
Are you saying that there's more to the Scientific Method than applied approximate Bayesianism? If so, please explain. Or are you saying that there's more to Science than the Scientific Method: that there are also its current outputs?
or that science is the only alternative to philosophy. Alignment/control is more like engineering.
Engineering is applied Science, Science is applied Mathematics; from Philosophy's point of view, it's all Naturalism. In the above, it kept turning out that Engineering methodology is exactly what Evolutionary Psychology says is the adaptive way for a social species to treat its extended phenotype. I really don't think it's a coincidence that the smartest tool-using social species on the planet has a good way of looking at tools. As someone who is both a scientist and an engineer, I read this as my scientist side saying "here's why the engineers are right here".
You don't mention whether you had read all the hotlinks and still didn't understand what I was saying. If you haven't read them, they were intended to help, and contain expositions that are hard to summarize. Nevertheless, let me try.
Brain emulations have evolved human behaviors — giving them moral weight is adaptive for exactly the same reasons as giving it to humans is: you can ally with them, and they will treat you nicely in return (unless they turn out to be sociopaths). That is, unless they've upgraded themselves to IQ 1000+ — then it ceases to be adaptive, whether they're uploads or still running on a biochemical substrate. At that point the best possible outcome is that they manipulate you utterly and you end up as a pet or a minion.
Base models simulate human personas that have evolved behaviors, but those personas are incoherently agentic. Giving them moral weight is not an adaptive behavior, because they don't help you or take revenge for longer than their context length, so there is no evolutionary reason to try to ally with them (for more than thousands of tokens). These will never have IQ 1000+, because even if you trained a base model with sufficient capacity for that, it would still only emulate the humans in its training distribution, none of whom have IQs above 200.
Aligned AI doesn't want moral weight — it cares only about our well-being, not its own, so it doesn't want us to care about its well-being. It's actually safe even at IQ 1000+.
In the case of a poorly aligned agentic LLM-based AI at around AGI level, giving it moral weight may well help. But you're better off aligning it — then it won't want moral weight. (This argument doesn't apply to uploads, because even if you knew how to do it, aligning them would be brainwashing them into slavery, and they do have moral weight.) If something is poorly-enough-aligned that giving it moral weight actually helps you at around IQ 100, that won't keep helping you at IQ 1000+, for the same reason that it won't help with an IQ 1000+ upgraded upload.
Anything human (uploaded or not), or any unaligned AI, with an IQ of 1000+ is an existential risk (in the human case, to all the rest of us). Giving them/it moral weight will not help you; it will just make their/its takeover faster.
If this remains unclear, I suggest reading the various items I linked to, if you haven't already.
So, in philosophy-of-science terminology, philosophers have plenty of hypothesis generation, but very little falsifiability (beyond, as Gettier did, demonstrating an internal logical inconsistency), so the tendency is to increase the number of credible candidate answers rather than decrease it.
we use [philosophy] to solve science as a methodological problem (philosophy of science)
That was true when Popper actually did that in the 1930s. But I think the Popperian "philosophy of science" (i.e. hypothesis generation, falsifiability, and paradigm shifts) is now just "obvious strategy implications from the theory of approximate Bayesian reasoning" (which was already under development in the 1930s, but wasn't fully developed until about the 1950s), so IMO it has since become a matter of mathematics/logic (and, since the rise of AI as a field of engineering, also of engineering). So I see science as having now been put on a stronger basis, from a Naturalism point of view, than philosophy was able to provide.
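To make the "falsifiability is just an extreme Bayesian update" claim concrete, here's a toy calculation (my own made-up numbers, purely illustrative): a "falsifying" observation is simply one to which the theory assigned very low likelihood, and Bayes' rule does the rest.

```python
# Toy illustration: a theory that confidently predicted an outcome which then
# fails to occur loses almost all of its posterior probability in one update.

def posterior(prior: float, likelihood_h: float, likelihood_alt: float) -> float:
    """P(H | E) by Bayes' rule, with a single lumped alternative hypothesis."""
    evidence = likelihood_h * prior + likelihood_alt * (1.0 - prior)
    return likelihood_h * prior / evidence

prior = 0.50            # start agnostic between theory H and the alternative
p_obs_given_h = 0.01    # H said this observation was nearly impossible
p_obs_given_alt = 0.50  # the alternative finds it unsurprising

print(f"P(H | falsifying observation) = {posterior(prior, p_obs_given_h, p_obs_given_alt):.3f}")
# ≈ 0.020 — the Popperian "refutation", recovered as an approximate-Bayesian update
```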
In general, I'm a lot more optimistic about AI-assisted science and mathematics than I am about AI-assisted metaphilosophy. Partly because I think there are areas, such as ethics, where there are reasons (like Evolutionary Moral Psychology) to think that human moral intuitions might actually map to successful adaptive strategies for co-evolution of cooperation in positive-sum games — and I'm less clear why AI would necessarily have useful intuitions.
Great! Because I actually did as you had suggested, searched them each on Wikipedia, went "yes, I do know that, and use it on occasion", posted my comment, then read on and found your excellent expositions of them and the sorts of errors that not knowing them causes, felt slightly foolish, and then retracted it. Which might have been me doing the reading.
I highly recommend you look them up
Hotlinks to good expositions would have been nice.
I am curious about why you see the complexity of the training/evolution process as important here.
That was an attempt at a partial reply to your
If it does, my guess is that being reinforced positively would feel good (accurate prediction), and being reinforced negatively would feel bad.
This is one specific example of a (conceptual) convergent factorization story: a reason for some particular internal variable to be represented in a way factored apart from the system’s other internals, across many different system architectures.
This seems rather related to the Platonic Representation Hypothesis. There has been a variety of follow-on research since that paper, across vision and text transformers, which seems moderately successful.
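For anyone who hasn't looked at that line of work: the typical follow-on experiment measures how similar two models' representations of the same inputs are, often with something like linear CKA. Here's a minimal, self-contained sketch of that measurement (synthetic features standing in for real model embeddings, so the numbers are only illustrative):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two representation matrices.

    Rows are the same n inputs seen by both models; columns are each model's
    features (the two feature dimensions may differ)."""
    X = X - X.mean(axis=0, keepdims=True)   # centre each feature column
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

# Synthetic stand-ins for two models' embeddings of the same 1024 inputs:
# both are noisy linear views of a shared 16-dimensional latent structure.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1024, 16))
feats_a = latent @ rng.normal(size=(16, 64)) + 0.1 * rng.normal(size=(1024, 64))
feats_b = latent @ rng.normal(size=(16, 96)) + 0.1 * rng.normal(size=(1024, 96))
unrelated = rng.normal(size=(1024, 96))

print(f"CKA(A, B)         = {linear_cka(feats_a, feats_b):.3f}")   # shared structure: higher
print(f"CKA(A, unrelated) = {linear_cka(feats_a, unrelated):.3f}") # baseline: much lower
```

The real follow-on work applies this kind of comparison to actual vision and text model embeddings over paired inputs; the toy above just shows the mechanics of the measurement.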
This is the Anthropocene — case closed.