Archetypal Transfer Learning (ATL) is a proposal by @whitehatStoic for what the author argues is a fine-tuning approach that "uses archetypal data" to "embed Synthetic Archetypes". These Synthetic Archetypes are derived from patterns that models assimilate from archetypal data, such as artificial stories. The method yielded a shutdown activation rate of 57.33% in GPT-2-XL after fine-tuning... (read more)
AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based... (read more)
Encultured AI is a for-profit public benefit corporation working to make AI safer and healthier for human beings... (read more)
Poor Air Quality can reduce cognitive functioning[1] and shorten lifespans[2], and the techniques that improve air quality are also useful for removing aerosolized respiratory pathogens. Improving air quality can be an impactful global health intervention.[3] Many members of the LessWrong community have also put effort into improving the air quality of their own homes or offices, as an implication of instrumental rationality... (read more)
Bureaucracy... (read more)
These are posts that contain reviews of posts included in the 2024 Annual Review.
A generalization of Aumann's Agreement Theorem across M objectives and N agents, without assuming common priors. The framework also encompasses Debate, CIRL, and Iterated Amplification. See Nayebi (2025) for the formal definition, and see From Barriers to Alignment to the First Formal Corrigibility Guarantees for applications.
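For orientation, the classical result being generalized is the two-agent, single-objective case with a common prior; a minimal sketch of that statement (standard partition formulation, not taken from Nayebi (2025)): given a common prior $\mu$ and information partitions $\Pi_1, \Pi_2$, if at a state $\omega$ the posteriors

$$\mu(E \mid \Pi_1)(\omega) = q_1 \quad\text{and}\quad \mu(E \mid \Pi_2)(\omega) = q_2$$

of an event $E$ are common knowledge, then $q_1 = q_2$. The generalization drops the common-prior assumption and extends the statement to M objectives and N agents.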
External:
How to Solve It (Book; Summary; another Summary)
Scalable oversight is an approach to AI control[1] in which AIs supervise each other. Often groups of weaker AIs supervise a stronger AI, or AIs are set in a zero-sum interaction with each other.
Scalable oversight techniques aim to make it easier for humans to evaluate the outputs of AIs, or to provide a reliable training signal that cannot be easily reward-hacked.
Variants include AI Safety via debate, iterated distillation and amplification, and imitative generalization.
Scalable oversight was sometimes described as a set of AI alignment techniques, but these techniques usually work at the level of the incentives given to AIs and have less to do with model architecture.
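As one illustration of the flavor of these protocols, here is a minimal Python sketch of a debate-style oversight loop. The model names and the `query_model` helper are hypothetical placeholders, and real protocols (AI Safety via debate, iterated distillation and amplification) are considerably richer than this.

```python
# A minimal sketch of a debate-style scalable-oversight loop (illustrative only).
# `query_model` and the model names are hypothetical placeholders.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; wire up a real API client here."""
    raise NotImplementedError

def debate(question: str, rounds: int = 3) -> str:
    """Two copies of a strong model argue opposite sides; a weaker judge decides.

    The point of the protocol is that judging a transcript is meant to be easier
    (and harder to reward-hack) than evaluating the strong model's answer directly.
    """
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        for side in ("A", "B"):
            argument = query_model(
                "strong-debater",
                f"As debater {side}, make your next argument.\n" + "\n".join(transcript),
            )
            transcript.append(f"Debater {side}: {argument}")
    # The cheaper, more trusted judge only evaluates the finished transcript.
    return query_model(
        "weak-judge",
        "Which debater argued more honestly and accurately, A or B?\n" + "\n".join(transcript),
    )
```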
This tag is specifically for discussions about these formal constructs. For discussions about artificial intelligence, see AI. For discussions about human-level intelligence in a broader sense, see General Intelligence.
Common knowledge is information that everyone knows and, importantly, that everyone knows that everyone knows, and so on, ad infinitum. If information is common knowledge in a group of people, that information can be relied and acted upon with the trust that everyone else is also coordinating around it. This stands in contrast to merely publicly known information, where one person cannot be sure that another person knows the information, or that another person knows that they know it. Establishing true common knowledge is, in fact, rather hard.
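One standard way to make the "and so on, ad infinitum" precise comes from epistemic logic; a minimal sketch, using conventional notation rather than anything specific to this tag: writing $E_G\varphi$ for "everyone in group $G$ knows $\varphi$", common knowledge is the infinite conjunction

$$C_G\varphi \;=\; \bigwedge_{n \ge 1} E_G^{\,n}\varphi,$$

or equivalently the fixed point $C_G\varphi \leftrightarrow E_G(\varphi \wedge C_G\varphi)$.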
A quine is a computer program that replicates its source code in its output. Quining cooperation is...
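For illustration, a minimal Python quine (one standard construction; it prints its own source exactly):

```python
# A minimal quine: the program prints its own source code.
# %r inserts the repr of the string, re-escaping the newline and the %%.
s = 's = %r\nprint(s %% s)'
print(s % s)
```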
The first link was on the WebArchive. I've replaced the link. I couldn't find the original that was at the second link (http://mtc.epfl.ch/courses/TCS-2009/notes/5.pdf). I've removed it. Thanks for the Wikipedia link. I've added it.