This is a special post for quick takes by Ryan Kidd. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Main takeaways from a recent AI safety conference:

  • If your foundation model is one small amount of RL away from being dangerous and someone can steal your model weights, fancy alignment techniques don’t matter. Scaling labs cannot currently prevent state actors from hacking their systems and stealing their stuff. Infosecurity is important to alignment.
  • Scaling labs might have some incentive to go along with the development of safety standards as it prevents smaller players from undercutting their business model and provides a credible defense against lawsuits regarding unexpected side effects of deployment (especially with how many tech restrictions the EU seems to pump out). Once the foot is in the door, more useful safety standards to prevent x-risk might be possible.
  • Near-term commercial AI systems that can be jailbroken to elicit dangerous output might empower more bad actors to make bioweapons or cyberweapons. Preventing the misuse of near-term commercial AI systems or slowing down their deployment seems important.
  • When a skill is hard to teach, like making accurate predictions over long time horizons in complicated situations or developing a “security mindset,” try treating humans like RL agents. For example, Ph.D. students might only get ~3 data points on how to evaluate a research proposal ex-ante, whereas Professors might have ~50. Novice Ph.D. students could be trained to predict good research decisions by predicting outcomes on a set of expert-annotated examples of research quandaries and then receiving “RL updates” based on what the expert did and what occurred.
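One way to make the "RL update" in the last point concrete is to score each trainee's probabilistic predictions against what actually occurred. The sketch below uses the Brier score as the feedback signal; that choice, the function name, and the numbers are all illustrative assumptions, not something proposed at the conference.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.

    Lower is better; 0.0 means every forecast was certain and correct.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty forecasts and outcomes")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical data: the probability each forecaster assigned to "this research
# proposal pans out" for four expert-annotated examples, and what occurred.
novice_forecasts = [0.9, 0.6, 0.2, 0.7]
expert_forecasts = [0.7, 0.3, 0.1, 0.8]
outcomes = [1, 0, 0, 1]

novice_score = brier_score(novice_forecasts, outcomes)
expert_score = brier_score(expert_forecasts, outcomes)

# The gap between the two scores is the feedback the novice receives.
print(f"novice: {novice_score:.4f}, expert: {expert_score:.4f}")
```

Repeating this over many annotated examples would give a novice far more feedback than the ~3 data points a Ph.D. might otherwise provide.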

An incomplete list of possibly useful AI safety research:

  • Predicting/shaping emergent systems (“physics”)
    • Learning theory (e.g., shard theory, causal incentives)
    • Regularizers (e.g., speed priors)
    • Embedded agency (e.g., infra-Bayesianism, finite factored sets)
    • Decision theory (e.g., timeless decision theory, cooperative bargaining theory, acausal trade)
  • Model evaluation (“biology”)
    • Capabilities evaluation (e.g., survive-and-spread, hacking)
    • Red-teaming alignment techniques
    • Demonstrations of emergent properties/behavior (e.g., instrumental power-seeking)
  • Interpretability (“neuroscience”)
    • Mechanistic interpretability (e.g., superposition, toy models, automated circuit detection)
    • Gray box ELK (e.g., Collin Burns’ research)
    • Feature extraction/sparsity (including Wentworth/Bushnaq style “modularity” research)
    • Model surgery (e.g., ROME)
  • Alignment MVP (“psychology”)
    • Sampling simulators safely (conditioning predictive models)
    • Scalable oversight (e.g., RLHF, CAI, debate, RRM, model-assisted evaluations)
    • Cyborgism
    • Prompt engineering (e.g., jailbreaking)
  • Strategy/governance (“sociology”)
    • Compute governance (e.g., GPU logging/restrictions, treaties)
    • Model safety standards (e.g., auditing policies)
  • Infosecurity
    • Multi-party authentication
    • Airgapping
    • AI-assisted infosecurity

A systematic way of classifying AI safety work could use a matrix, where one dimension is the system level:

  • A monolithic AI system, e.g., a conversational LLM
  • A cyborg, human + AI(s)
  • A system of AIs with emergent qualities (in the future, we may see such systems operating on a larger scope, up to a fully automated AI economy or a swarm of CoEms automating science)
  • A human+AI group, community, or society (scale-free consideration, supports arbitrary fractal nestedness): collective intelligence
  • The whole civilisation, e.g., Open Agency Architecture

Another dimension is the "time" of consideration:

  • Design time: research into how the corresponding system should be designed (engineered, organised): considering its functional ("capability", quality of decisions) properties, adversarial robustness (= misuse safety, memetic virus security), and security.
  • Manufacturing and deployment time: research into how to create the desired designs of systems successfully and safely:
    • AI training and monitoring of training runs.
    • Offline alignment of AIs during (or after) training. 
    • AI strategy (= research into how to transition into the desirable civilisational state = design).
    • Designing upskilling and educational programs for people to become cyborgs is also here (= designing efficient procedures for manufacturing cyborgs out of people and AIs).
  • Operations time: ongoing (online) alignment of systems on all levels to each other, ongoing monitoring, inspection, anomaly detection, and governance.
  • Evolutionary time: research into how the (evolutionary lineages of) systems at the given level evolve long-term:
    • How the human psyche evolves when it is in a cyborg
    • How humans will evolve over generations as cyborgs
    • How groups, communities, and society evolve.
    • Designing feedback systems that don't let systems "drift" into an undesired state over evolutionary time.
    • Considering a system property: flexibility of values (i.e., the opposite of value lock-in, Riedel (2021)).
    • IMO, it (sometimes) makes sense to think about this separately from alignment per se. Systems could be perfectly aligned with each other but drift into undesirable states and not even notice this if they don't have proper feedback loops and procedures for reflection.

There would be 5*4 = 20 slots in this matrix, and almost all of them have something interesting to research and design, and none of them is "too early" to consider.
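As a minimal sketch, the matrix can be represented as a mapping from (system level, time) pairs to lists of research topics. The shortened labels and the single example entry below are my own illustrative choices, not part of the original classification.

```python
from itertools import product

# Shortened labels for the five system levels and four "times" listed above.
SYSTEM_LEVELS = [
    "monolithic AI system",
    "cyborg (human + AIs)",
    "system of AIs",
    "human+AI group, community, or society",
    "whole civilisation",
]
TIMES = [
    "design",
    "manufacturing and deployment",
    "operations",
    "evolutionary",
]


def build_matrix(levels, times):
    """Return a dict mapping every (level, time) slot to an empty topic list."""
    return {(level, time): [] for level, time in product(levels, times)}


matrix = build_matrix(SYSTEM_LEVELS, TIMES)

# Illustrative placement of one topic; where it belongs is a judgment call.
matrix[("monolithic AI system", "manufacturing and deployment")].append(
    "monitoring of training runs"
)

empty_slots = [slot for slot, topics in matrix.items() if not topics]
print(len(matrix), "slots,", len(empty_slots), "still empty")
```

Listing the empty slots is a cheap way to see which combinations have no research topic assigned yet.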

There is still some AI safety work (research) that doesn't fit this matrix, e.g., org design, infosec, alignment, etc. of AI labs (= the system that designs, manufactures, operates, and evolves monolithic AI systems and systems of AIs).

AI alignment threat models that are somewhat MECE (but not quite):

In particular, the last threat model feels like it is trying to cut across aspects of the first two threat models, violating MECE.

Great overview! I find this helpful.

Next to intrinsic optimisation daemons that arise through training internal to hardware, suggest adding extrinsic optimising "divergent ecosystems" that arise through deployment and gradual co-option of (phenotypic) functionality within the larger outside world.

So far, AI safety research (particularly by CS/ML researchers) has focused more on internal code computed deterministically (within known state spaces, as mathematicians like to represent them) than on complex external feedback loops that are uncomputable, given Good Regulator Theorem limits and the inherent noise interference on signals propagating through the environment (as would be intuitive for some biologists and non-linear dynamics theorists).

So extrinsic optimisation is easier for researchers in our community to overlook. See this related paper by a physicist studying origins of life.

Cheers, Remmelt! I'm glad it was useful.

I think the extrinsic optimization you describe is what I'm pointing toward with the label "coordination failures," which might properly be labeled "alignment failures arising uniquely through the interactions of multiple actors who, if deployed alone, would be considered aligned."

Reasons that scaling labs might be motivated to sign onto AI safety standards:

  • Companies who are wary of being sued for unsafe deployment that causes harm might want to be able to prove that they credibly did their best to prevent harm.
  • Big tech companies like Google might not want to risk premature deployment, but might feel forced to if smaller companies with less to lose undercut their "search" market. Standards that prevent unsafe deployment fix this.

However, AI companies that don’t believe in AGI x-risk might tolerate more x-risk than the safety standards this community would consider ideal. Also, I think insurance contracts are unlikely to appropriately account for x-risk, if the market is anything to go by.

Types of organizations that conduct alignment research, differentiated by funding model and associated market forces:

MATS' goals:

  • Find + accelerate high-impact research scholars:
    • Pair scholars with research mentors via specialized mentor-generated selection questions (visible on our website);
    • Provide a thriving academic community for research collaboration, peer feedback, and social networking;
    • Develop scholars according to the “T-model of research” (breadth/depth/epistemology);
    • Offer opt-in curriculum elements, including seminars, research strategy workshops, 1-1 researcher unblocking support, peer study groups, and networking events;
  • Support high-impact research mentors:
    • Scholars are often good research assistants and future hires;
    • Scholars can offer substantive new critiques of alignment proposals;
    • Our community, research coaching, and operations free up valuable mentor time and increase scholar output;
  • Help parallelize high-impact AI alignment research:
    • Find, develop, and refer scholars with strong research ability, value alignment, and epistemics;
    • Use alumni for peer-mentoring in later cohorts;
    • Update mentor list and curriculum as the alignment field’s needs change.

"Why suicide doesn't seem reflectively rational, assuming my preferences are somewhat unknown to me," OR "Why me-CEV is probably not going to end itself":

  • Self-preservation is a convergent instrumental goal for many goals.
  • Most systems of ordered preferences that naturally exhibit self-preservation probably also exhibit self-preservation in the reflectively coherent pursuit of unified preferences (i.e., CEV).
  • If I desire to end myself on examination of the world, this is likely a local hiccup in reflective unification of my preferences, i.e., "failure of present me to act according to me-CEV's preferences rather than a failure of hypothetical me-CEV to account for facts about the world."

Note: I'm fine; this is purely intellectual.

Can the strategy of "using surrogate goals to deflect threats" be countered by an enemy agent that learns your true goals and credibly precommits to always defecting (i.e., Prisoner's Dilemma style) if you deploy an agent against it with goals that produce sufficiently different cooperative bargaining equilibria than your true goals would?

This is a risk worth considering, yes. It’s possible in principle to avoid this problem by “committing” (to the extent that humans can do this) to both (1) train the agent to make the desired tradeoffs between the surrogate goal and original goal, and (2) not train the agent to use a more hawkish bargaining policy than it would’ve had without surrogate goal training. (And to the extent that humans can’t make this commitment, i.e., we make honest mistakes in (2), the other agent doesn’t have an incentive to punish those mistakes.)

If the developers do both these things credibly—and it's an open research question how feasible this is—surrogate goals should provide a Pareto improvement for the two agents (not a rigorous claim). Safe Pareto improvements are a generalization of this idea.

Are these framings of gradient hacking, which I previously articulated here, a useful categorization?

  1. Masking: Introducing a countervailing, “artificial” performance penalty that “masks” the performance benefits of ML modifications that do well on the SGD objective, but not on the mesa-objective;
  2. Spoofing: Withholding performance gains until the implementation of certain ML modifications that are desirable to the mesa-objective; and
  3. Steering: In a reinforcement learning context, selectively sampling environmental states that will either leave the mesa-objective unchanged or "steer" the ML model in a way that favors the mesa-objective.

How does the failure rate of a hierarchy of auditors scale with the hierarchy depth, if the auditors can inspect all auditors below their level?
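A toy starting point for this question, under strong assumptions that are mine and not established anywhere above: suppose a flaw escapes the hierarchy only if every level misses it, each level misses independently with probability `miss_rate`, and with probability `shared_blind_spot` the flaw is invisible to all auditors at once (a crude stand-in for correlated failures). The function and parameter names are hypothetical.

```python
def undetected_probability(depth, miss_rate, shared_blind_spot=0.0):
    """Probability that a flaw escapes all `depth` levels of auditors.

    Toy model assumptions (not from the original question):
    - each level independently misses the flaw with probability `miss_rate`;
    - with probability `shared_blind_spot`, no auditor can see the flaw at all.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    for name, value in (("miss_rate", miss_rate), ("shared_blind_spot", shared_blind_spot)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    independent_miss = miss_rate ** depth
    return shared_blind_spot + (1.0 - shared_blind_spot) * independent_miss


for depth in range(1, 6):
    independent = undetected_probability(depth, miss_rate=0.1)
    correlated = undetected_probability(depth, miss_rate=0.1, shared_blind_spot=0.05)
    print(depth, independent, correlated)
```

In this model the failure rate falls geometrically with depth when misses are independent, but it can never drop below `shared_blind_spot`, so added depth stops helping once correlated failures dominate. Whether real auditor hierarchies behave anything like this is exactly the open question.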

Are GPT-n systems more likely to:

  1. Learn superhuman cognition to predict tokens better and accurately express human cognitive failings in simulacra because they learned these in their "world model"; or
  2. Learn human-level cognition to predict tokens better, including human cognitive failings?