Ryan Kidd's Shortform

Ryan Kidd

13th Oct 2022, 15 comments

This is a special post for quick takes by Ryan Kidd. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

[-]Ryan Kidd1y204

Main takeaways from a recent AI safety conference:

If your foundation model is one small amount of RL away from being dangerous and someone can steal your model weights, fancy alignment techniques don’t matter. Scaling labs cannot currently prevent state actors from hacking their systems and stealing their stuff. Infosecurity is important to alignment.
Scaling labs might have some incentive to go along with the development of safety standards as it prevents smaller players from undercutting their business model and provides a credible defense against lawsuits regarding unexpected side effects of deployment (especially with how many tech restrictions the EU seems to pump out). Once the foot is in the door, more useful safety standards to prevent x-risk might be possible.
Near-term commercial AI systems that can be jailbroken to elicit dangerous output might empower more bad actors to make bioweapons or cyberweapons. Preventing the misuse of near-term commercial AI systems or slowing down their deployment seems important.
When a skill is hard to teach, like making accurate predictions over long time horizons in complicated situations or developing a “security mindset,” try treating humans like RL agents. For example, Ph.D. students might only get ~3 data points on how to evaluate a research proposal ex-ante, whereas Professors might have ~50. Novice Ph.D. students could be trained to predict good research decisions by predicting outcomes on a set of expert-annotated examples of research quandaries and then receiving “RL updates” based on what the expert did and what occurred.

[-]Ryan Kidd1y90

An incomplete list of possibly useful AI safety research:

Predicting/shaping emergent systems (“physics”)
- Learning theory (e.g., shard theory, causal incentives)
- Regularizers (e.g., speed priors)
- Embedded agency (e.g., infra-Bayesianism, finite factored sets)
- Decision theory (e.g., timeless decision theory, cooperative bargaining theory, acausal trade)
Model evaluation (“biology”)
- Capabilities evaluation (e.g., survive-and-spread, hacking)
- Red-teaming alignment techniques
- Demonstrations of emergent properties/behavior (e.g., instrumental power-seeking)
Interpretability (“neuroscience”)
- Mechanistic interpretability (e.g., superposition, toy models, automated circuit detection)
- Gray box ELK (e.g., Collin Burns’ research)
- Feature extraction/sparsity (including Wentworth/Bushnaq style “modularity” research)
- Model surgery (e.g., ROME)
Alignment MVP (“psychology”)
- Sampling simulators safely (conditioning predictive models)
- Scalable oversight (e.g., RLHF, CAI, debate, RRM, model-assisted evaluations)
- Cyborgism
- Prompt engineering (e.g., jailbreaking)
Strategy/governance (“sociology”)
- Compute governance (e.g., GPU logging/restrictions, treaties)
- Model safety standards (e.g., auditing policies)
Infosecurity
- Multi-party authentication
- Airgapping
- AI-assisted infosecurity

[-]Roman Leventov1y30

A systematic way for classifying AI safety work could use a matrix, where one dimension is the system level:

A monolithic AI system, e.g., a conversational LLM
A cyborg, human + AI(s)
A system of AIs with emergent qualities (e.g., https://numer.ai/, but in the future, we may see more systems like this, operating on a larger scope, up to fully automatic AI economy; or a swarm of CoEms automating science)
A human+AI group, community, or society (scale-free consideration, supports arbitrary fractal nestedness): collective intelligence
The whole civilisation, e.g., Open Agency Architecture

Another dimension is the "time" of consideration:

Design time: research into how the corresponding system should be designed (engineered, organised): considering its functional ("capability", quality of decisions) properties, adversarial robustness (= misuse safety, memetic virus security), and security.
Manufacturing and deployment time: research into how to create the desired designs of systems successfully and safely:
- AI training and monitoring of training runs.
- Offline alignment of AIs during (or after) training.
- AI strategy (= research into how to transition into the desirable civilisational state = design).
- Designing upskilling and educational programs for people to become cyborgs is also here (= designing efficient procedures for manufacturing cyborgs out of people and AIs).
Operations time: ongoing (online) alignment of systems on all levels to each other, ongoing monitoring, inspection, anomaly detection, and governance.
Evolutionary time: research into how the (evolutionary lineages of) systems at the given level evolve long-term:
- How the human psyche evolves when it is in a cyborg
- How humans will evolve over generations as cyborgs
- How groups, communities, and society evolve.
- Designing feedback systems that don't let systems "drift" into an undesired state over evolutionary time.
- Considering a system property: flexibility of values (i.e., the opposite of value lock-in, Riedel (2021)).
- IMO, it (sometimes) makes sense to think about this separately from alignment per se. Systems could be perfectly aligned with each other but drift into undesirable states and not even notice this if they don't have proper feedback loops and procedures for reflection.

There would be 5*4 = 20 slots in this matrix, and almost all of them have something interesting to research and design, and none of them is "too early" to consider.
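
To make the slot count concrete, here is a minimal sketch that enumerates the matrix as the Cartesian product of the two dimensions. The short labels and the function name `matrix_slots` are my own abbreviations of the items listed above, not established terminology.

```python
from itertools import product

# Abbreviated labels for the five system levels listed above (hypothetical names).
SYSTEM_LEVELS = [
    "monolithic AI system",
    "cyborg (human + AI)",
    "system of AIs",
    "human+AI group, community, or society",
    "whole civilisation",
]

# Abbreviated labels for the four "times" of consideration (hypothetical names).
TIMES = [
    "design",
    "manufacturing and deployment",
    "operations",
    "evolutionary",
]


def matrix_slots(levels=SYSTEM_LEVELS, times=TIMES):
    """Return every (system level, time) pair, one per slot in the matrix."""
    return list(product(levels, times))


if __name__ == "__main__":
    slots = matrix_slots()
    print(len(slots))  # 5 * 4 = 20
    for level, time in slots:
        print(f"{level} @ {time} time")
```

Each pair is one candidate research area, e.g. ("cyborg (human + AI)", "evolutionary") corresponds to how the human psyche evolves when it is in a cyborg.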

There is still some AI safety work (research) that doesn't fit this matrix, e.g., org design, infosec, alignment, etc. of AI labs (= the system that designs, manufactures, operates, and evolves monolithic AI systems and systems of AIs).

[-]Ryan Kidd1y80

AI alignment threat models that are somewhat MECE (but not quite):

We get what we measure (models converge to the human++ simulator and build a Potemkin village world without being deceptive consequentialists);
Optimization daemon (deceptive consequentialist with a non-myopic utility function arises in training and does gradient hacking, buries trojans and obfuscates cognition to circumvent interpretability tools, "unboxes" itself, executes a "treacherous turn" when deployed, coordinates with auditors and future instances of itself, etc.);
Coordination failure (otherwise-aligned AI systems combust or gridlock far from the Pareto frontier due to opaque values/capabilities, inadequate commitment mechanisms, or irreconcilable differences);
Sharp left turn (models learn generally powerful cognitive tools that are efficiently reached by training on real-world tasks, especially as the real world contains useful embedded knowledge that shortcuts learning these tools from scratch; but powerful cognitive tools are somewhat anti-natural to corrigibility and the training process does not efficiently constrain the directionality of these tools towards human CEV, which manifests under distributional shift).

In particular, the last threat model feels like it is trying to cut across aspects of the first two threat models, violating MECE.

[-]Remmelt1y2-2

Great overview! I find this helpful.

Alongside intrinsic optimisation daemons that arise through training internal to hardware, I suggest adding extrinsic optimising "divergent ecosystems" that arise through deployment and gradual co-option of (phenotypic) functionality within the larger outside world.

AI safety research so far (particularly by CS/ML researchers) has focussed more on internal code computed deterministically (within known state spaces, as mathematicians like to represent them) than on complex external feedback loops that are uncomputable, given Good Regulator Theorem limits and the inherent noise interference on signals propagating through the environment (as would be intuitive for some biologists and non-linear dynamics theorists).

So extrinsic optimisation is easier for researchers in our community to overlook. See this related paper by a physicist studying origins of life.

[-]Ryan Kidd1y10

Cheers, Remmelt! I'm glad it was useful.

I think the extrinsic optimization you describe is what I'm pointing toward with the label "coordination failures," which might properly be labeled "alignment failures arising uniquely through the interactions of multiple actors who, if deployed alone, would be considered aligned."

[-]Ryan Kidd1y75

Reasons that scaling labs might be motivated to sign onto AI safety standards:

Companies who are wary of being sued for unsafe deployment that causes harm might want to be able to prove that they credibly did their best to prevent harm.
Big tech companies like Google might not want to risk premature deployment, but might feel forced to if smaller companies with less to lose undercut their "search" market. Standards that prevent unsafe deployment fix this.

However, AI companies that don’t believe in AGI x-risk might tolerate more x-risk than safety standards that are ideal by the lights of this community would allow. Also, I think insurance contracts are unlikely to appropriately account for x-risk, if the market is anything to go by.

[-]Ryan Kidd1y30

Types of organizations that conduct alignment research, differentiated by funding model and associated market forces:

Academic research groups (e.g., Krueger's lab at Cambridge, UC Berkeley CHAI, NYU ARG, MIT AAG);
Research nonprofits (e.g., ARC Theory, MIRI, FAR AI, Redwood Research);
"Mixed funding model" organizations:
- "Alignment-as-a-service" organizations, where the product directly contributes to alignment (e.g., Apollo Research, Aligned AI, ARC Evals, Leap Labs);
- "Alignment-on-the-side" organizations, where product revenue helps fund alignment research (e.g., Conjecture);
Scaling labs, where alignment research is mostly directed towards improving product (e.g., Anthropic, DeepMind, OpenAI).

[-]Ryan Kidd1y10

MATS' goals:

Find + accelerate high-impact research scholars:
- Pair scholars with research mentors via specialized mentor-generated selection questions (visible on our website);
- Provide a thriving academic community for research collaboration, peer feedback, and social networking;
- Develop scholars according to the “T-model of research” (breadth/depth/epistemology);
- Offer opt-in curriculum elements, including seminars, research strategy workshops, 1-1 researcher unblocking support, peer study groups, and networking events;
Support high-impact research mentors:
- Scholars are often good research assistants and future hires;
- Scholars can offer substantive new critiques of alignment proposals;
- Our community, research coaching, and operations free up valuable mentor time and increase scholar output;
Help parallelize high-impact AI alignment research:
- Find, develop, and refer scholars with strong research ability, value alignment, and epistemics;
- Use alumni for peer-mentoring in later cohorts;
- Update mentor list and curriculum as the alignment field’s needs change.

[-]Ryan Kidd1y10

"Why suicide doesn't seem reflectively rational, assuming my preferences are somewhat unknown to me," OR "Why me-CEV is probably not going to end itself":

Self-preservation is a convergent instrumental goal for many goals.
Most systems of ordered preferences that naturally exhibit self-preservation probably also exhibit self-preservation in the reflectively coherent pursuit of unified preferences (i.e., CEV).
If I desire to end myself on examination of the world, this is likely a local hiccup in reflective unification of my preferences, i.e., "failure of present me to act according to me-CEV's preferences rather than a failure of hypothetical me-CEV to account for facts about the world."

Note: I'm fine; this is purely intellectual.

[-]Ryan Kidd1y10

Can the strategy of "using surrogate goals to deflect threats" be countered by an enemy agent that learns your true goals and credibly precommits to always defecting (i.e., Prisoner's Dilemma style) if you deploy an agent against it with goals that produce sufficiently different cooperative bargaining equilibria than your true goals would?

[-]Anthony DiGiovanni1y31

This is a risk worth considering, yes. It’s possible in principle to avoid this problem by “committing” (to the extent that humans can do this) to both (1) train the agent to make the desired tradeoffs between the surrogate goal and original goal, and (2) not train the agent to use a more hawkish bargaining policy than it would’ve had without surrogate goal training. (And to the extent that humans can’t make this commitment, i.e., we make honest mistakes in (2), the other agent doesn’t have an incentive to punish those mistakes.)

If the developers do both these things credibly—and it's an open research question how feasible this is—surrogate goals should provide a Pareto improvement for the two agents (not a rigorous claim). Safe Pareto improvements are a generalization of this idea.

[-]Ryan Kidd2y10

Are these framings of gradient hacking, which I previously articulated here, a useful categorization?

Masking: Introducing a countervailing, “artificial” performance penalty that “masks” the performance benefits of ML modifications that do well on the SGD objective, but not on the mesa-objective;
Spoofing: Withholding performance gains until the implementation of certain ML modifications that are desirable to the mesa-objective; and
Steering: In a reinforcement learning context, selectively sampling environmental states that will either leave the mesa-objective unchanged or "steer" the ML model in a way that favors the mesa-objective.

[-]Ryan Kidd2y10

How does the failure rate of a hierarchy of auditors scale with the hierarchy depth, if the auditors can inspect all auditors below their level?
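
One toy way to make the question concrete (a sketch under strong assumptions, not an answer): suppose the base-level system fails with probability `p_fail`, and each auditor above it independently misses that failure with probability `p_miss`. The function name, both parameters, and the independence assumption are mine, purely for illustration.

```python
def undetected_failure_rate(depth: int, p_fail: float, p_miss: float) -> float:
    """Probability that a base-level failure escapes every auditor above it.

    Toy assumptions (not from the original question):
      - the base-level system fails with probability p_fail;
      - there are `depth` auditor levels above it, each of which inspects
        the base level directly;
      - each auditor misses the failure independently with probability p_miss.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if not (0.0 <= p_fail <= 1.0 and 0.0 <= p_miss <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    # The failure goes undetected only if all `depth` auditors miss it.
    return p_fail * p_miss ** depth


if __name__ == "__main__":
    for depth in range(5):
        print(depth, undetected_failure_rate(depth, p_fail=0.1, p_miss=0.5))
```

Under these assumptions the undetected failure rate falls geometrically with depth. The interesting cases are probably where independence breaks, e.g., auditors sharing blind spots or coordinating with each other, in which case extra levels would help much less than this sketch suggests.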

[-]Ryan Kidd2y1-1

Are GPT-n systems more likely to:

Learn superhuman cognition to predict tokens better and accurately express human cognitive failings in simulacra because they learned these in their "world model"; or
Learn human-level cognition to predict tokens better, including human cognitive failings?
