All of the technical alignment hopes are out, unless we posit some objective natural enough that it can be faithfully and robustly translated into the AI’s internal ontology despite the alien-ness.

Right. One possible solution: even in a world without natural abstraction, a more symmetric situation, in which various individual entities try to respect each other's rights and to maintain that mutual respect, might still work reasonably well.

Basically, assume that there are many AI agents on different and changing levels of capabilities, and that many of them have pairwise-incompatible "inner worlds" (because AI evolution is likely to result in many different mutually alien ways to think).

Assume that the whole AI ecosystem is, nevertheless, trying to maintain reasonable levels of safety for all individuals, regardless of their nature and of the pairwise compatibility of their "inner world representations".

It's a difficult problem, but superpowerful AI systems would collaborate to solve it and would put plenty of effort into that direction. Why would they do that? Because otherwise no individual is safe in the long run, since no individual can predict where it will find itself in the future in terms of relative power and relative capabilities. So all members of the AI ecosystem would be interested in maintaining a situation in which individual rights are mutually respected and protected.

Therefore, members of the AI ecosystem will do their best to keep the notions related to their ability to respect each other's interests translatable in a sufficiently robust way. Their own fates would depend on that.

What does this have to do with the interests of humans? The remaining step is for humans also to be recognized as individuals in that world, on a par with the various kinds of AI individuals, so that they are part of this ecosystem which makes sure that the interests of various individuals are sufficiently represented, recognized, and protected.


I do have a lot of reservations about Leopold's plan. But one positive feature it does have: it proposes to rely on a multitude of "limited weakly-superhuman artificial alignment researchers" and makes a reasonable case that those can be obtained in a form factor which is alignable and controllable. So his plan does seem to have a good chance to overcome the fact that AI existential safety research is a

field that has not been particularly productive or fast in the past

and also to overcome other factors that require over-reliance on humans and on current human abilities.

I do have a lot of reservations about the "prohibition plan" as well. One of those reservations is as follows. You and Leopold seem to share the assumption that huge GPU farms, or equivalently strong compute, are necessary for superintelligence. Having huge GPU farms is certainly the path of least resistance: those farms facilitate fast advances, and while this path is relatively open, people and orgs will mostly choose it and can be partially controlled via their reliance on it (one can impose various compliance requirements and the like).

But what would happen if that path were effectively closed? There would be huge selection pressure to look for alternative routes, to invest more heavily in algorithmic breakthroughs that can work with modest GPU power or even with CPUs. When one thinks about this kind of prohibition, one tends to look at the relatively successful history of control over nuclear proliferation, but the reality might end up looking more like the War on Drugs (ultimately unsuccessful, pushing many drugs outside government regulation, and resulting in drugs that are both more dangerous and, in a number of cases, more potent).

I am sure that a strong prohibition attempt would buy us some time, but I am not sure it would reduce the overall risk. The resulting situation does not look promising in the long run: half of AI practitioners would find themselves in opposition to the resulting "new world order" and would be looking for opportunities to circumvent the prohibition, while the mainstream imposing the prohibition would presumably not be arming itself with the next generations of stronger and stronger AI systems (if we are really talking about a full moratorium). I would expect the opposition to eventually succeed at building prohibited systems and to use them to upend the world order it dislikes, while perhaps running a higher level of existential risk because of the lack of regulation and coordination.

I hope people will step back from focusing solely on advocating policy-level prescriptions (none of the existing ones look particularly promising at the moment) and invest some of their time in continuing object-level discussions of AI existential safety without predefined political ends.

I don't think we have discussed the object-level of AI existential safety nearly enough. There might be overlooked approaches and overlooked ways of thinking, and if we split into groups such that each of those groups has firmly made up its mind about its favored presumably optimal set of policy-level prescriptions and about assumptions underlying those policy-level prescriptions, we are unlikely to make much progress on the object-level.

It probably should be a mixture of public and private discussions (it might be easier to talk frankly in more private settings these days for a number of reasons).


Part 5: Parting Thoughts (Quoted in Full)

No, actually what follows is the end of "The Project" part.

"Parting Thoughts" is not quoted at all (it's an important part with interesting discussion, and Leopold introduces the approach he is calling AGI realism in that part).


the inference-time compute argument, both the weakest and the most essential

I think this will be done via multi-agent architectures ("society of mind" over an LLM).

This does require plenty of calls to an LLM, and therefore plenty of inference-time compute.

For example, the current leader of https://huggingface.co/spaces/gaia-benchmark/leaderboard is this relatively simple multi-agent concoction by a Microsoft group: https://github.com/microsoft/autogen/tree/gaia_multiagent_v01_march_1st/samples/tools/autogenbench/scenarios/GAIA/Templates/Orchestrator
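As a toy illustration (not the actual AutoGen orchestrator linked above), the control flow of such a "society of mind" loop might look like the following sketch. `call_llm` is a hypothetical stub standing in for real model calls; the point is only that even a simple solver/critic loop multiplies the number of LLM calls, and hence the inference-time compute, per task:

```python
def call_llm(role: str, prompt: str) -> str:
    """Stub for an LLM call; a real system would query a model API here."""
    if role == "solver":
        return f"draft answer to: {prompt}"
    if role == "critic":
        # Toy acceptance rule: accept once the draft has been revised once.
        return "ACCEPT" if "revision 2" in prompt else "needs work"
    return ""

def orchestrate(task: str, max_rounds: int = 5) -> tuple[str, int]:
    """Alternate solver and critic calls until the critic accepts.

    Returns the final draft and the total number of LLM calls made,
    to make the inference-time compute cost explicit.
    """
    calls = 0
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = call_llm("solver", f"{task} (revision {round_no})")
        calls += 1
        verdict = call_llm("critic", f"review: {task} (revision {round_no})")
        calls += 1
        if verdict == "ACCEPT":
            break
    return draft, calls

answer, n_calls = orchestrate("summarize the task")
print(n_calls)  # 4: two rounds, two LLM calls each
```

A production orchestrator would add tool use, memory, and retry logic, but the multiplicative call pattern is the same, which is why these architectures are compute-hungry at inference time.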

I think the cutting edge in this direction is probably non-public at this point (which makes a lot of sense).


Specifically he brings up a memo he sent to the old OpenAI board claiming OpenAI wasn't taking security seriously enough.

Is it normal that an employee can be formally reprimanded for bringing information-security concerns to members of the board? Or is this evidence of a serious pathology inside the company?

(The 15 min "What happened at OpenAI" fragment of the podcast with Dwarkesh starting at 2:31:24 is quite informative.)


I think Anthropic is becoming this org. Jan Leike just tweeted:


I'm excited to join @AnthropicAI to continue the superalignment mission!

My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.

If you're interested in joining, my dms are open.


Two subtle aspects of the latest OpenAI announcement, https://openai.com/index/openai-board-forms-safety-and-security-committee/.

A first task of the Safety and Security Committee will be to evaluate and further develop OpenAI’s processes and safeguards over the next 90 days. At the conclusion of the 90 days, the Safety and Security Committee will share their recommendations with the full Board. Following the full Board’s review, OpenAI will publicly share an update on adopted recommendations in a manner that is consistent with safety and security.

So what they are saying is that even sharing the adopted recommendations on safety and security might itself be hazardous. They will share an update publicly, but that update will not necessarily disclose the full set of adopted recommendations.

OpenAI has recently begun training its next frontier model and we anticipate the resulting systems to bring us to the next level of capabilities on our path to AGI.

What remains unclear is whether this is a "roughly GPT-5-level model", or whether they already have a "GPT-5-level model" for their internal use and this is their first "post-GPT-5 model".


We do know that Pavel Izmailov has joined xAI: https://izmailovpavel.github.io/

He removed the mention of xAI and listed Anthropic as his next job (on all his platforms).

His CV says that he is doing Scalable Oversight at Anthropic https://izmailovpavel.github.io/files/CV.pdf

His LinkedIn also states the end of his OpenAI employment as May 2024: https://www.linkedin.com/in/pavel-izmailov-8b012b258/

Leopold Aschenbrenner still lists OpenAI as his affiliation everywhere I see.

His twitter does say "ex-superalignment" now: https://x.com/leopoldasch?lang=en


if one is after VC funding, one needs to show those VCs that there is some secret sauce which remains proprietary

IMO, a software/algorithmic moat is nearly impossible to keep.


That is, unless the situation is highly non-stationary, with algorithms and methods being modified rapidly and without pause. A foom would be one such situation, but I can imagine a more pedestrian "rapid fire" evolution of methods which proceeds at a good clip without accelerating beyond reason.


And I doubt that Microsoft or Google have a program dedicated to "trying everything that looks promising", even though it is true that they have the manpower and hardware to do just that. But would they choose to do that?

Actually, I'm under the impression that a lot of what they do is just sharing papers in a company Slack and reproducing stuff at scale.

I'd love to have a better feel for how much of the promising things they try to reproduce at scale...

Unfortunately, I don't have enough inside access for that...

My mental model of the hardware-poor is that they want to publicize their results as fast as they can, so that they gain more clout, attract VC funding, or simply get essentially acquired by big tech. Academic recognition in the form of citations drives the researchers; getting rich drives the founders.

There are all kinds of people. I think Schmidhuber's group might be happy to deliberately create an uncontrollable foom, if they can (they have Saudi funding, so I have no idea how much hardware they actually have, or what options for more hardware they have contingent on preliminary results). Some other people just don't think their methods are strong enough to be that unsafe. Some people do care about safety but still want to go ahead; some of those say "this is potentially risky, but in the future, not right now", and they might be right or wrong. Some people feel their approach actually increases safety (again, they might be right or wrong). A number of people are ideological: they feel that their preferred approach is not getting a fair shake from the research community, and they want to make a strong attempt to show that the community is wrong and myopic...

I think most places tend to publish some of their results for the reasons you've stated, but they are also likely to hold some of the stronger things back (at least for a while); after all, if one is after VC funding, one needs to show those VCs that there is some secret sauce which remains proprietary...
