Nice, for me, this was one of those things in the business world that was kind of implicit in some models I had from before, and this post made it explicit.
Good stuff!
Apparently, even being a European citizen doesn’t help.
I still think that we shouldn't have books on sourcing and building pipe bombs lying around, though.
I mean, I'm sure it isn't legal to openly sell a book on how to source material for and build a pipe bomb, right? It depends on the intent of the book and its content, among other things, so I'm only half-hesitantly biting the bullet here.
I've been following this discussion from Jan's first post, and I've been enjoying it. I've put together some pictures to explain what I see in this discussion.
The original misalignment story might look something like this:
This is fair as a first take, and if we want to look at it through a utility function optimisation lens, we might say something like this:
Here, "cultural values" is the local environment that we're optimising for.
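To make that first take concrete (this is just my own rough notation, not something from the original posts), the naive story is that the human policy ends up optimising a culturally defined proxy rather than inclusive genetic fitness itself:

$$\pi_{\text{human}} \approx \arg\max_{\pi}\, \mathbb{E}\big[Y_{\text{culture}}(\pi)\big] \;\neq\; \arg\max_{\pi}\, \mathbb{E}\big[X_{\text{IGF}}(\pi)\big]$$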
As Jacob mentions, humans are still very effective general optimisers if we look directly at how well our behaviour matches evolution's utility function. This calls for a new model.
Here's what I think actually happens:
Which can be perceived as something like this in the environmental sense:
Based on this model, what is cultural (human) evolution telling us about misalignment?
We have adopted proxy values (Y1, Y2, ..., YN), or culture, in order to optimise for X, i.e. IGF. In other words, the shard of cultural values developed as a more efficient optimisation target in the new environment where different tribes applied optimisation pressure on each other.
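In rough symbols (again my own sketch, with hypothetical weights w_i), the claim is that the proxy values still track fitness well in the environment that inter-tribe competition created:

$$\arg\max_{\pi}\, \sum_{i=1}^{N} w_i\, Y_i(\pi) \;\approx\; \arg\max_{\pi}\, \mathbb{E}\big[X_{\text{IGF}}(\pi) \mid \text{inter-tribe competition}\big]$$

The point is just that the shard of proxy values can be a cheaper optimisation target that still tracks X in that environment.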
Also, I really enjoy the book The Secret of Our Success when thinking about these models, as it provides some very nice evidence about human evolution.
I was going through my old stuff and found this from a year and a half ago, so I thought I would post it here quickly, as I found the last idea funny and the first idea pretty interesting:
In the normal business world, there are consulting firms specialised in certain topics, which ensures that organisations can take in an outside view from experts on the topic.
This seems like quite an efficient way of doing things and something that, if built up properly within alignment, could lead to faster progress down the line. This is also something the Future Fund seemed to be interested in, as they gave prizes both for the idea of creating an org focused on building datasets and for one focused on taking in human feedback. These are not the only possible ideas, however, and below I mention some more possible orgs that are likely to be net positive.
Newly minted alignment researchers will probably have a while to go before they can become fully integrated into a team. One can, therefore, imagine an organisation that takes in inexperienced alignment researchers and helps them write papers. It then promotes these alignment researchers as being able to help with certain things, so that established orgs can easily take them in for contracting on specific problems. This should help involve market forces in the alignment area and should, in general, improve the efficiency of the space. There are reasons why consulting firms exist in the real world, and creating the equivalent of McKinsey in alignment is probably a good idea. Yet I might be wrong about this, and if you can argue why it would make the space less efficient, I would love to hear it.
We don't want the wrong information to spread, so imagine something between a normal marketing firm and the Chinese "marketing" agency: if it's an info-hazard, then shut the fuck up!
Isn't there an alternative story here where we should still care about the sharp left turn, but in a cultural sense, similar to Drexler's CAIS, with the same kinds of experimentation as happened during the cultural evolution phase?
You've convinced me that the sharp left turn will not happen in the classical way that people have thought about it, but are you that certain that there isn't much free energy available in cultural-style processes? If so, why?
I can imagine that there is something to say about SGD already being pretty algorithmically efficient, but I guess I would say that determining how much available free energy there is in improving optimisation processes is an open question. If the error bars are high here, how can we then know that the AI won't spin up something similar internally?
I also want to add something about genetic fitness becoming twisted as a consequence of cultural evolutionary pressure on individuals. Culture in itself changed the optimal survival behaviour of humans, which then meant that the meta-level optimisation loop changed the underlying optimisation loop. Isn't the culture changing the objective function still a problem that we have to potentially contend with, even though it might not be as difficult as the normal sharp left turn?
For example, let's say that we deploy GPT-6 and it figures out that the loosely defined objective we have set for it using (Constitutional AI)^2 is best pursued by having many different iterations of itself discuss it, creating a democratic process of multiple CoT reasoners. This meta-process seems, in my opinion, like something the cultural evolution hypothesis would predict to be more optimal than just one GPT-6, and it also seems a lot harder to align than normal.
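As a toy sketch of the kind of meta-process I mean (everything here is hypothetical: the model call is a stub, and this is obviously not an actual GPT-6 API), the deployed system could fan out the objective to several copies of itself, let each copy do its own chain-of-thought, and then aggregate the answers by majority vote:

```python
import random
from collections import Counter

def chain_of_thought(copy_id: int, objective: str) -> str:
    """One copy of the model reasons about the objective and returns a decision.
    Stub: a real system would query the model here; we just pick a plan at random."""
    rng = random.Random(copy_id)
    return rng.choice(["plan_A", "plan_B", "plan_C"])

def democratic_cot(objective: str, n_copies: int = 7) -> tuple[str, float]:
    """Meta-process: spin up n_copies independent CoT reasoners and majority-vote.
    The aggregate behaviour is a property of the whole ensemble, not of any single
    copy, which is the part that looks harder to align than one model."""
    votes = [chain_of_thought(i, objective) for i in range(n_copies)]
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / n_copies

if __name__ == "__main__":
    decision, support = democratic_cot("loosely defined objective")
    print(f"ensemble decision: {decision} (support {support:.0%})")
```

The alignment-relevant bit is that the behaviour we care about lives at the level of the voting ensemble, which no single training signal directly shaped.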
Very nice! I think work in this general direction is what is more or less needed if we want to survive.
I just wanted to probe a bit when it comes to turning these methods into governance proposals. Do you see ways of creating databases/tests for objective measurement or how do you see this being used in policy and the real world?
(Obviously, I get that understanding AI will be better for less doom, but I'm curious about your thoughts on the last implementation step)
Alright, I will try to visualise what I see as the disagreement here.

It seems to me that Paul is saying that behaviourist abstractions will form over shorter time periods than long time horizons.
(Think of these shards as in the shard theory sense)
Nate is saying that the picture on the right creates stable wants more than the one on the left, while Paul is saying that this is time-agnostic and that the relevant metric is how competent the model is.
The crux here is essentially whether longer time horizons are indicative of behaviourist shard formation.
My thought here is that the process in the picture on the right induces more stable wants because a longer-time-horizon system is more complex, and heuristics are therefore the best decision rule. The complexity increases enough that there is a substantial difference between short-term tasks and long-term tasks.
Also, the Redundant Information Hypothesis might give credence to the idea that systems will over time create more stable abstractions?