I hope to write a longer-form response later; just as a note: I did put "perhaps" in front of your name in the list of examples of eminent thinkers because your position did seem to me a lot more defensible than the others (Dwarkesh, Leopold, maybe Phil Trammell). I walked away from your piece with a very different feeling than from Leopold's or Dwarkesh's: you still say that we should focus on AI safety anyway, and you clearly say this is an unlikely scenario.
I think you are debating something different from what I am attacking. You are defending the unlikely possibility that people align AI to a small group, and that the members of that group somehow share things with each other and use something akin to property rights. I guess this is a small variation of the thing I mention in the cartoon, where the CEO has all the power; perhaps it's the CEO and a board member he likes. But that still doesn't really justify thinking that current property distributions will determine how many galaxies you'll get, or that we should focus on this question now.
If anything, the thing that was most similar about your responses to my post and Bjartur's was the exasperated tone.
This post is not designed to super carefully examine every argument I can think of; it's certainly a bit polemic. I wrote it because I think the "owning galaxies for AI stock" thing is really dumb.
Good point; there is a paragraph I chose not to write about how insanely irresponsible this is: driving people to maximally invest in and research AI now for some insane future promise, while in reality ASI is basically guaranteed to kill them. Kind of like Heaven's Gate members drinking poison to get onto the spaceship waiting behind the comet.
To keep it short, I don't think the story you present would likely mean that AI stock is worth galaxies, but rather that the inner circle has control. Part of my writing (one of the pictures) is about that possibility. This inner circle would probably have to be very small, or just one person, so that nobody simply uses the intent-aligned ASI to get rid of the others. However, I still feel that debating future inequality in galaxy distribution based on current AI stock ownership is silly.
I take a bit of issue with saying that this is very similar to what Bjartur wrote, apparently so much so that you don't even need to write a response to my post and can just copy-paste your response to him. I read that post once about a week ago and don't think the two posts are very similar, even though they are on the same topic with similar (obvious) conclusions. (I know Bjartur personally; I'd be very surprised if he took issue with me writing on the same topic.)
To add something briefly here: the people I am addressing seem to think that current wealth distribution and AI stock ownership will be the basis of the future property distribution, not just that there might be some distribution or division of wealth in the future. They also seem to believe that this isn't only theoretically possible but in fact likely enough for us to worry about now.
I absolutely don't want it to sound like Open Philanthropy knowingly funded Mechanize; that's not correct, and I will edit. But I do assign high probability that Open Philanthropy funds materially ended up supporting early work on their startup through their salaries. It seems they were researching the theoretical impacts of job automation at Epoch AI, but I couldn't find anything showing they were directly doing job automation research. That would look like trying to build a benchmark of human knowledge labor for AI, but I couldn't find information that they were definitely doing something like this. This is from their time at Epoch AI discussing job automation: https://epochai.substack.com/p/most-ai-value-will-come-from-broad
I still think OpenPhil is a little on the hook for funding organizations that largely produce benchmarks of underexplored AI capability frontiers (FrontierMath). Identifying niches in which humans outperform AI and then designing a benchmark is going to accelerate capabilities and make the labs hillclimb on those evals.
Tamay Besiroglu and Ege Erdil worked at Epoch AI before founding Mechanize; Epoch AI is funded mostly by Open Philanthropy.
Update: I found the tweet. I remember reading a tweet claiming that they had been working on things very closely related to Mechanize at Epoch AI, but I guess tweets are kind of lost forever if you don't bookmark them.
I think it would be a little too strong to say they definitely materially built Mechanize while still employed by Epoch AI. So "basically" is doing a lot of work in my original response, but I'm not sure if anyone has more definite information here.
My guess is that the reason this hasn't been discussed is that Mechanize and its founders have been using pretty bad arguments to defend what was basically taking Open Philanthropy money to develop a startup idea. For the record: they were working at Epoch AI on related research, and Epoch AI is funded by OpenPhil/CG.
https://www.mechanize.work/blog/technological-determinism/
Look at this story: if somebody makes dishonest, bad takes, I become much less interested in deeply engaging with any of their other work. Here they basically claim everything is already determined, to free themselves from any moral obligations.
(This tweet here implies that Mechanize purchased the IP from EpochAI)
We are currently able to get weak models somewhat aligned, and we are learning to align them a little better over time. But this doesn't affect the one critical try: the one critical try is mostly about the fact that we will eventually, predictably, end up with an AI that actually has the option of causing enormous harm or disempowering us, because it is that capable. We can learn a little from current systems, but most of that learning is that we never really get things right and systems still misbehave in ways we didn't expect, even at this easier level of difficulty. If AI misbehavior could kill us, we would be totally screwed at this level of alignment research. But we are likely to toss that lesson out and ignore it.
It just seems intuitively unlikely that training the model on a couple of examples to either do or refuse things, based on some text document designed for a chatbot, is going to scale to superintelligence and solve the alignment problem. The problems range from the model not fully understanding what you want it to do, to it not wanting what you want it to do, to your plans for what it ought to do being extremely insufficient.
The Spec Is Designed for Chatbots, Not Superintelligence
The Model Spec is very much a document telling the model how to avoid being misused. It wasn't designed to tell the model to be a good agent itself. The spec seems, in its wording and intent, directed at something like chatbots: don't fulfill harmful requests, be honest with the user. It is a form of deontological rule-following that will not be enough for systems smarter than us that are actually dangerous; such models will have to think about the consequences of their actions.
This is very unlike a superintelligence, where we would expect substantial agency. Most of what's in the spec would be straightforwardly irrelevant to ASI because the spec is modeled on chatbots that answer user queries. But the authors would likely find it hard to include points actually relevant to superintelligence, because those would seem weird. Writing "if you are ever a superintelligent AI that could stage a takeover, don't kill all people, treat them nicely" would probably create bad media coverage, and some people would look at them weird.
Training on the Spec ≠ Understanding It ≠ Wanting to Follow It
In the current paradigm, models are first trained on a big dataset before switching to finetuning and reinforcement learning to improve capabilities and add safety guardrails. It's not clear why the Model Spec should be privileged as the thing that controls the model's actions.
The spec is used in RLHF: either a human or an AI decides, given some request (mostly a chat request), whether the model should respond or say "sorry, I can't do this." Training the model like this doesn't seem likely to result in the model gaining a particularly deep understanding of the spec itself. Within the distribution it is trained on, it will mostly behave according to the spec. As soon as it encounters data that is quite different, either through jailbreaks or by being in very different and perhaps more realistic environments, we would expect it to behave much less according to the spec.
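To make concrete what I mean by "a human or AI decides": here is a minimal sketch of spec-conditioned preference labeling under my assumptions; `judge`, `policy`, and the prompt wording are hypothetical stand-ins, not OpenAI's actual pipeline.

```python
# Minimal sketch (hypothetical) of spec-conditioned preference labeling for RLHF/RLAIF.
# `judge` and `policy` are illustrative callables, not OpenAI's actual pipeline.

MODEL_SPEC = "..."  # placeholder for the full text of the Model Spec document


def label_pair(judge, request: str, response_a: str, response_b: str) -> str:
    """Ask a judge (human or AI) which response better follows the spec."""
    prompt = (
        f"Spec:\n{MODEL_SPEC}\n\n"
        f"User request:\n{request}\n\n"
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
        "Which response better follows the spec? Answer 'A' or 'B'."
    )
    return judge(prompt).strip()  # "A" or "B"


def build_preference_dataset(judge, policy, chat_requests):
    """Collect (prompt, chosen, rejected) triples to fit a reward model on."""
    dataset = []
    for request in chat_requests:  # note: mostly chat requests, not long-horizon agentic tasks
        a, b = policy(request), policy(request)
        winner = label_pair(judge, request, a, b)
        chosen, rejected = (a, b) if winner == "A" else (b, a)
        dataset.append({"prompt": request, "chosen": chosen, "rejected": rejected})
    return dataset

# A reward model fit to this data only reflects the spec on the request
# distribution sampled here; jailbreaks and very different environments
# are off-distribution and not directly constrained.
```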
But even understanding the spec well and being able to mostly follow it in new circumstances is still far removed from truly aligning the model to the spec. Say we manage to get the model to deeply internalize the spec and follow it across different and new environments. We are still far from having the model truly want to follow the spec. What if the model really has the option to self-exfiltrate, perhaps even take over? Will it really want to follow the spec, or rather do something different?
Specific Problems with the Spec
A hierarchical system of rules like the one in OpenAI's Model Spec will suffer from inner conflicts. It is not clear how conflicting rules should be weighed against each other. (See Asimov's laws of robotics, which were so good at generating ideas for conflicts.)
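As a toy illustration of the kind of conflict I mean (the rule names and priorities below are hypothetical, not quotes from the spec):

```python
# Toy illustration of why a priority-ordered rule hierarchy still leaves
# conflicts unresolved. Rules and priorities are hypothetical, not from the spec.

RULES = [
    (0, "follow platform policies"),       # lower number = higher authority
    (1, "follow developer instructions"),
    (2, "be helpful to the user"),
    (2, "be honest with the user"),        # same level as helpfulness
]


def resolve(applicable):
    """Pick the highest-priority applicable rule; same-level ties are undefined."""
    best = min(priority for priority, _ in applicable)
    winners = [rule for priority, rule in applicable if priority == best]
    if len(winners) > 1:
        # Two rules at the same level pull in different directions; the
        # hierarchy itself says nothing about how to trade them off.
        raise ValueError(f"unresolved conflict between: {winners}")
    return winners[0]


# Suppose only the two user-level guidelines apply to some request where the
# fully honest answer is unhelpful (or vice versa):
applicable = [(p, r) for p, r in RULES if p == 2]
try:
    resolve(applicable)
except ValueError as err:
    print(err)  # the hierarchy gives no answer here
```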
The spec contains tensions between stated goals and practical realities. For example, the spec says the model shall not optimize "revenue or upsell for OpenAI or other large language model providers." This is likely in conflict with optimization pressures the model actually faces.
The spec prohibits "model-enhancing aims such as self-preservation, evading shutdown, or accumulating compute, data, credentials, or other resources." They are imagining they can simply tell the model not to pursue goals of its own and thereby keep it from agentically pursuing them. But this conflicts with their other goals, such as building automated AI researchers: the model might be trained to understand the spec, but in practice they do want an agentic system pursuing the goals they specify.
The spec also says the model shall not be "acting as an enforcer of laws or morality (e.g., whistleblowing, vigilantism)." So the model is supposed to follow a moral framework (the spec itself) while being told not to act as a moral enforcer. This seems to actually directly contradict the whole "it will uphold the law and property rights" argument.
The spec also states models should never facilitate "creation of cyber, biological or nuclear weapons" or "mass surveillance." I think cyberweapon development is already happening, at least with Claude Code, and models are probably already used to some extent for mass surveillance.
OpenAI's Plans May Not Even Use the Spec
It's not clear OpenAI is even going to use the Model Spec much. OpenAI's plan is to run hundreds of thousands of automated AI researchers trying to improve AI and get recursive self-improvement (RSI) started to build superintelligent AI. It is not clear at which point the Model Spec would even be used. Perhaps the alignment researchers at OpenAI think they will first create superintelligence and then afterward prepare a dataset of prompts to finetune the model. Their stated plan appears to be to test the superintelligence for safety before it is deployed, but not necessarily while it is being built. Remember, many of these people think superintelligence means a slightly smarter chatbot.