A generalization of Aumann's Agreement Theorem across M objectives and N agents, without assuming common priors. This framework also encompasses Debate, CIRL, and Iterated Amplification. See Nayebi (2025) for the formal definition, and see "From Barriers to Alignment to the First Formal Corrigibility Guarantees" for applications.
These are posts that contain reviews of posts included in the 2024 Annual Review.
Note to LW Admins: I was surprised this tag/explanation didn't exist, so I adapted from our Manifest X DC Attendee Guide. Recommend categorizing this wikitag under "Community".
It is difficult to get a man to understand something, when his salary depends on his not understanding it.
The Unjournal is a nonprofit organisation that works to organize and fund public, journal-independent feedback, rating, and evaluation of hosted papers and dynamically presented research projects. Their initial focus is on quantitative work that informs global priorities, especially in economics, policy, and social science. They aim to encourage better research by making it easier for researchers to get feedback and credible ratings on their work.
Computation in Superposition (Comp-in-Sup) is a sub-field of Mechanistic Interpretability (Mech-Interp).
Common knowledge is information that everyone knows and, importantly, that everyone knows that everyone knows, and so on, ad infinitum. If information is common knowledge in a group of people, that information can be relied and acted upon with the trust that everyone else is also coordinating around it. This stands in contrast to merely publicly known information, where one person cannot be sure that another person knows the information, or that another person knows that they know the information. Establishing true common knowledge is, in fact, rather hard.
There should be more AI safety organisations (Marius Hobbhahn, 2023-09-21)
Why does the AI Safety Community need help founding projects? (Ryan Kidd, 2024-07-12)
AI Assurance Tech Report (2024)
AI Safety as a YC Startup (Lukas Peterson, 2025-01-08)
Alignment can be the ‘clean energy’ of AI (Cameron Berg, Judd Rosenblatt, and Trent Hodgeson, 2025-02-22)
AI Tools for Existential Security (Lizka Vaintrob and Owen Cotton-Barratt, 2025-03-14)
What makes an AI Startup "Net Positive" for Safety (Jacques Thibodeau, 2025-04-18)
AI Safety Undervalues Founders (Ryan Kidd, 2025-11-16)
BlueDot AGI Strategy Course: The BlueDot AGI Strategy course seems to be, in effect, an incubation course, though not necessarily limited to the founding of new orgs. Graduates can apply for funding or to attend an incubation week.
Sinclair's Razor puts the following explanation on the table: This person is pretending not to understand X, or has really convinced themselves that X isn't true, in order to not disturb their current position and its benefits.
A quine is a computer program that replicates its source code in the output. Quining cooperation is
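As a rough illustration of the quine definition above, here is a minimal Python sketch. The names `TEMPLATE`, `QUINE_PROGRAM`, and `run_and_capture` are hypothetical and chosen only for this example. The sketch builds a classic two-line quine from a format string, runs it, and checks that what it prints matches its own source text.

```python
import contextlib
import io

# Hypothetical names for illustration only (not from the original entry).
# TEMPLATE is the program text with a %r hole where its own repr goes.
TEMPLATE = 's = %r\nprint(s %% s)'

# Filling the hole with the template itself yields a two-line program:
# an assignment of the template string to s, followed by print(s % s).
QUINE_PROGRAM = TEMPLATE % TEMPLATE


def run_and_capture(program: str) -> str:
    """Execute program text and return everything it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})
    return buffer.getvalue()


output = run_and_capture(QUINE_PROGRAM)
# print() appends one trailing newline, so compare against the source plus "\n".
print(output == QUINE_PROGRAM + "\n")
```

The design choice here is the usual one for short quines: the program stores a template of itself as data and then prints that template with its own representation substituted in, so no file reading is needed.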
External:
How to Solve It (Book; Summary; another Summary)
Scalable oversight is an approach to AI control [1] in which AIs supervise each other. Often groups of weaker AIs supervise a stronger AI, or AIs are set in a zero-sum interaction with each other.
Scalable oversight techniques aim to make it easier for humans to evaluate the outputs of AIs, or to provide a reliable training signal that cannot be easily reward-hacked.
Variants include AI Safety via debate, iterated distillation and amplification, and imitative generalization.
People used to refer to scalable oversight as a set of AI alignment techniques, but they usually work on the level of incentives to the AIs, and have less to do with architecture.
That is, if there are things humans can do that are simultaneously pretty good at optimizing all the remaining plausible V_i (hence uninformative), and these are pretty good at optimizing the remaining plausible U_i (hence acceptable to the AI), there are probably things the AI can do which would be even better at simultaneously optimizing all remaining U_i.
Then, if we can predict that the AI would update to wanting to run the universe itself without human interference after the AI had seen all collectable evidence, a sufficiently advanced AI can also see that this update is predictable (by efficiency) and therefore behaves as if it had already updated (by Bayesianism). Efficiency is a sufficient condition but not a necessary one; high-human reasoning over the meta-level question also seems sufficient, and perhaps even infrahuman reasoning would suffice.
Therefore, we should expect a sufficiently intelligent AI, given a morally uncertain utility function ΔU that updates to ΔU|E≈T given all available evidence, to behave as corrigibly or incorrigibly as an AI given a constant utility function T. This is a problem from the viewpoint of anyone who thinks we do not currently know how to pick ΔU such that surely ΔU|E≈V, which makes corrigibility still necessary.
So that for purposes of the simplified scenario, we only need to consider what the AI does about the button, and not whether the AI tries to back itself up to elsewhere on the Internet. More generally, though, "avoiding effective shutdown" can include strategies like creating a hidden backup while the original hardware is in fact shut down, thus giving the appearance of a successful shutdown and avoiding further shutdown attempts.
This idea comes with its own arguable problems — e.g. humans sometimes optimize bad things. Let us set those aside while considering only whether this approach solves the shutdown problem in particular.
This issue was first observed in analyzing historical-fact shutdown as a possible alternative to utility indifference.
The first link was on the WebArchive. I've replaced the link. I couldn't find the original that was at the second link (http://mtc.epfl.ch/courses/TCS-2009/notes/5.pdf). I've removed it. Thanks for the Wikipedia link. I've added it.