dx26

dx26 13d

If I knew it was going to happen in 2035 to 2039, I think I would mostly back up and try to improve the overall quality of US governance, and/or work to get competent candidates for the 2034 presidential election.

Do you mean the quality of US governance with respect to AI or in general? In the latter case, I'm curious what your concrete plans would be, since this is usually considered a difficult and not very neglected (but still very important!) area.

Replying to If AI alignment is only as hard as building the steam engine, then we likely still die

dx26 1mo

If AI alignment is only as hard as building the steam engine, then we likely still die

I think when Olah says that solving alignment may be as "easy" as the steam engine, he's basically envisioning current training + eval techniques (or similar techniques equivalently difficult to the steam engine) scaling all the way to superintelligence. (This is my interpretation; I might be wrong here.) For instance, maybe inducing corrigibility in ASI turns out to be not that difficult, such that the "first critical try" framework does not really apply, and takeoff is slow enough that model organisms/evals work means we can test our alignment methods and have them reasonably generalize to real world scenarios. Disagreeing with this view just means that "alignment" is harder than the steam engine scenario.

Replying to It will cost you nothing to "bribe" a Utilitarian

dx26 4mo

It will cost you nothing to "bribe" a Utilitarian

I realize I'm being a little pedantic here, but on the "joke" calculation: the problem here is that $PR$ is a binary function depending on whether $k$ utilitarians join or not, right? For instance, let $-s_{PR}$ be the effective safety premium from $k$ safety-minded utilitarians joining (the value being negative as joining presumably accelerates the company), and suppose that each utilitarian joining leads to $-s_{PR}/k$ acceleration. Then a rational utilitarian would demand a premium of $(\epsilon + s_{PR})/k$, which is not negligible.

Going back to the joke calculation, it implies that the bottleneck to preventing defection is coordination: $k$ utilitarians acting together would not join for $\epsilon/k$ value as it is against their interests, but individually they have zero counterfactual impact, so they all join. In the real world, coordination is plausibly relevant,... (read more)
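To give a sense of scale for the two premiums above, here is a worked instance with made-up numbers. The values of $k$, $s_{PR}$, and $\epsilon$ are purely illustrative assumptions and do not come from the post. With $k = 100$, $s_{PR} = 10$, and $\epsilon = 0.01$:

$$\frac{\epsilon + s_{PR}}{k} = \frac{0.01 + 10}{100} = 0.1001, \qquad \frac{\epsilon}{k} = \frac{0.01}{100} = 0.0001.$$

Under these assumed numbers, the per-person premium that accounts for each person's share of the acceleration is roughly a thousand times the one in the joke calculation.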

Replying to I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?

dx26 11mo

I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?

I recommend https://agentfoundations.study/, and much of https://www.aisafety.com/stay-informed,

Currently, these two links include the trailing commas, so they lead to 404 pages.

dx26 1y

We (or at least a majority of humans) do still have inner desires to have kids, though; they just get balanced out by other considerations, mostly creature comforts/not wanting to deal with the hassle of kids. But yeah, evolution did not foresee birth control, so that's a substantial misgeneralization.

We are still a very successful species overall according to IGF, but birth rates continue to decline, which is why I made my last point about inner alignment possibly drifting farther and farther away the stronger the inner optimizer (e.g. human culture) becomes.

dx26's Shortform

dx26

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

dx26 1y Quick Take

I saw that Katja Grace has said something similar here; I'm just putting my own spin on the idea.

The relevance of the evolutionary analogy for inner alignment has long been discussed in this community, but one observation that does not seem to be mentioned is that humans are still... pretty good at inclusive genetic fitness? Even in way-out-of-distribution environments like modern society, we still have strong desires to eat food, stay alive, find mates and reproduce (although the last one has relatively decreased recently; IGF hasn't totally generalized). We don't monomaniacally optimize for IGF, but we (and probably future NN-based AIs) don't monomaniacally optimize for anything, and our motivational circuits still do a... (read more)

Replying to Can someone, anyone, make superintelligence a more concrete concept?

dx26 1y

Can someone, anyone, make superintelligence a more concrete concept?

The thing is, there exist lots of popular movies about rogue AIs taking over the world -- 2001, Terminator, etc. -- so the concept should already exist in popular culture. The roadblocks seem to be:

The threat somehow doesn't seem as tangible or threatening as, for example, ISIS developing a bioweapon or the CCP permanently dominating the world. One explanation is that the reference class for "enemy does bad things with new technology" or other near-term threat models contains lots of examples throughout history, whereas "species smarter than humans" contains none. Related:
The threat doesn't seem realistic, i.e. people (even those who want to accelerate towards AGI) have long timelines. Hypothetically, if you

dx26 1y

If all trade is voluntary, then what is "exploitation?"

In this case, the starving person presumably has to press the button or else starve to death, and thus has no bargaining power. The other person only has to offer the bare minimum beyond what the starving person needs to survive, and the starving person must take the deal. In Econ 101 (assuming away monopolies, information asymmetry, etc.), workers do have bargaining power because they can work for other companies, which is why companies can’t just take stupid, spiteful actions in the long term.

Replying to Coherence of Caches and Agents

dx26 2y

Coherence of Caches and Agents

It might be relevant to note that the meaningfulness of this coherence definition depends on the chosen environment. For instance, in a deterministic forest MDP where an agent at a state $s$ can never return to $s$ for any $s$ and there is only one path between any two states, suppose we have a deterministic policy $\pi$ and let $s_1 = \pi(s)$, $s_2 = \pi(s_1)$, etc. Then for the zero-current-payoff Bellman equations, we only need that $V(s_1) > V(s')$ for any successor $s'$ from $s$, $V(s_2) > V(s'')$ for any successor $s''$ from $s'$, etc. We can achieve this easily by, for example, letting all values except $V(s_i)$ be near-zero; since $s_j$ is a successor of $s_i$ iff $j = i + 1$ (as otherwise there would be a cycle), this fits our criterion. Thus, every $\pi$ is coherent in this environment. (I haven't done the explicit math here, but I suspect that this... (read more)
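The construction in this comment can be sketched in Python as a sanity check. This is a minimal sketch under stated assumptions: the tree, the helper names (`all_policies`, `path_values`, `greedy_on_path`), and the values `1.0` and `eps` are all hypothetical. It checks only the inequality stated above (the policy's chosen successor has strictly higher value than its siblings along the path from the start state), not the full Bellman equations.

```python
from itertools import product

# Hypothetical toy tree: each state maps to its list of successors.
# Leaves have no successors, so no state can ever be revisited.
CHILDREN = {
    "a": ["b", "c"],
    "b": ["d", "e"],
    "c": ["f"],
    "d": [], "e": [], "f": [],
}

def all_policies(children):
    """Enumerate every deterministic policy (one successor per non-leaf state)."""
    inner = [s for s in children if children[s]]
    for choice in product(*(children[s] for s in inner)):
        yield dict(zip(inner, choice))

def path_values(children, policy, start, eps=1e-6):
    """Assign value 1.0 to states on the policy's path from `start`, eps elsewhere."""
    values = {s: eps for s in children}
    state = start
    while children[state]:
        state = policy[state]
        values[state] = 1.0
    return values

def greedy_on_path(children, policy, values, start):
    """Check the chosen successor strictly beats every sibling along the path."""
    state = start
    while children[state]:
        chosen = policy[state]
        siblings = [c for c in children[state] if c != chosen]
        if any(values[chosen] <= values[c] for c in siblings):
            return False
        state = chosen
    return True

# Every policy on this tree passes the check under its own path-based values.
results = [
    greedy_on_path(CHILDREN, pi, path_values(CHILDREN, pi, "a"), "a")
    for pi in all_policies(CHILDREN)
]
```

On this toy tree, `results` holds one entry per deterministic policy. Each entry is `True`, which matches the claim that any policy can be made to look coherent by choosing the values after the fact.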

Measuring Coherence and Goal-Directedness in RL Policies

dx26

This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo.

Epistemic status: relatively confident in the overall direction of this post, but looking for feedback!

TL;DR:

When are ML systems well-modeled as coherent expected utility maximizers? We apply our theoretical model of coherence in our last post to toy policies in RL environments in OpenAI Gym. We develop classifiers that can spot coherence according to our definition and test them on test case policies that intuitively seem coherent or not coherent. We find that we can successfully train classifiers with low loss which also correctly predict out-of-distribution test cases we intuitively believe to have high or low... (read 1860 more words →)

Replying to Measuring Coherence of Policies in Toy Environments

dx26 2y

Measuring Coherence of Policies in Toy Environments

Right, I think this somewhat corresponds to the "how long it takes a policy to reach a stable loop" (the "distance to loop" metric), which we used in our experiments.

What did you use your coherence definition for?
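For readers unfamiliar with the "distance to loop" idea mentioned above, here is one way such a metric could be computed for a deterministic policy in a deterministic environment. This is a sketch of one possible reading, not necessarily the definition used in the experiments. The function name, the `next_state` mapping (the policy composed with the transition function), and the convention that terminal states are modeled as self-loops are all assumptions.

```python
def distance_to_loop(next_state, start):
    """Number of steps from `start` until the trajectory first enters its loop.

    `next_state` maps each state to the state reached by following the
    (deterministic) policy for one step. Terminal states are assumed to be
    modeled as self-loops, so every trajectory eventually repeats a state.
    """
    first_seen = {}  # state -> step index at which it was first visited
    state, step = start, 0
    while state not in first_seen:
        first_seen[state] = step
        state = next_state[state]
        step += 1
    # `state` is the first repeated state, i.e. the entry point of the loop.
    return first_seen[state]

# Hypothetical example: 0 -> 1 -> 2 -> 3 -> 2 -> 3 -> ...
example = {0: 1, 1: 2, 2: 3, 3: 2}
d_from_0 = distance_to_loop(example, 0)  # two steps to reach the loop at state 2
d_from_2 = distance_to_loop(example, 2)  # already on the loop
```

Because the walk stores each visited state once, the cost is linear in the number of distinct states on the trajectory.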

Measuring Coherence of Policies in Toy Environments

dx26

dx26, Richard_Ngo

This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo. Thanks to Martín Soto, Jeremy Gillen, Daniel Kokotajlo, and Lukas Berglund for feedback.

Summary

Discussions around the likelihood and threat models of AI existential risk (x-risk) often hinge on some informal concept of a “coherent”, goal-directed AGI in the future maximizing some utility function unaligned with human values. Whether and how coherence may develop in future AI systems, especially in the era of LLMs, has been a subject of considerable debate. In this post, we provide a preliminary mathematical definition of the coherence of a policy as how likely it is to have been sampled via... (read 4160 more words →)

Supervised Program for Alignment Research (SPAR) at UC Berkeley: Spring 2023 summary

mic

mic, dx26, adamk, Carolyn Qian

In Spring 2023, the Berkeley AI Safety Initiative for Students (BASIS) organized an alignment research program for students, drawing inspiration from similar programs by Stanford AI Alignment^[1] and OxAI Safety Hub. We brought together 12 researchers from organizations like CHAI, FAR AI, Redwood Research, and Anthropic, and 38 research participants from UC Berkeley and beyond.

Here is the link to SPAR’s website, which includes all of the details about the program. We’ll be running the program again in the Fall 2023 semester as an intercollegiate program, coordinating with a number of local groups and researchers from across the globe.

If you are interested in supervising an AI safety project in Fall 2023, learn more here... (read 1504 more words →)
