Vanessa Kosoy

Vanessa Kosoy's Comments

Clarifying "AI Alignment"

For the second one, I don't know what your argument is that the non-intent-alignment work is urgent. I agree that the simulation example you give is an example of how flawed epistemology can systematically lead to x-risk. I don't see the argument that it is very likely.

First, even working on unlikely risks can be urgent, if the risk is great and the time needed to solve it might be long enough compared to the timeline until the risk. Second, I think this example shows that it is far from straightforward to even informally define what intent-alignment is. Hence, I am skeptical about the usefulness of intent-alignment.

For a more "mundane" example, take IRL (inverse reinforcement learning). Is IRL intent-aligned? What if its assumptions about human behavior are inadequate and it ends up inferring an entirely wrong reward function? Is it still intent-aligned because it is trying to do what the user wants and is merely wrong about what the user wants? Where is the line between "being wrong about what the user wants" and optimizing something completely unrelated to what the user wants?
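To make the worry concrete, here is a minimal toy sketch (my own illustration with made-up numbers, not a claim about any particular IRL implementation): a learner that models the human as Boltzmann-rational, facing a human whose choices are actually driven by habit.

```python
import math

def boltzmann_mle_reward_gap(p_action0: float) -> float:
    """Under the Boltzmann model P(a=0) = exp(R0) / (exp(R0) + exp(R1)),
    the maximum-likelihood estimate of R0 - R1 is the log-odds of the data."""
    return math.log(p_action0 / (1.0 - p_action0))

# By stipulation in this toy, the user actually (slightly) prefers action 1:
true_reward_gap = -0.5  # R0 - R1

# Observed behavior is generated by habit, which the model does not include:
observed_p_action0 = 0.9  # the user picks action 0 90% of the time out of habit

inferred_gap = boltzmann_mle_reward_gap(observed_p_action0)
print(f"true R0 - R1     = {true_reward_gap:+.2f}")
print(f"inferred R0 - R1 = {inferred_gap:+.2f}")  # about +2.20: the sign is flipped
```

The learner is "trying to do what the user wants" according to its own model, yet the reward it ends up optimizing points away from the user's actual preference, and nothing inside the algorithm marks the difference between the two situations.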

It seems like intent-alignment depends on our interpretation of what the algorithm does, rather than only on the algorithm itself. But actual safety is not a matter of interpretation, at least not in this sense.

For example, imagine that we prove that stochastic gradient descent on a neural network with a particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result.

My guess is that any such result will either require samples exponential in the dimensionality of the input space (prohibitively expensive) or the simple and natural condition won't hold for the vast majority of cases that neural networks have been applied to today.

I don't know why you think so, but at least this is a good crux since it seems entirely falsifiable. In any case, exponential sample complexity definitely doesn't count as "strong feasibility".

I don't find smoothness conditions in particular very compelling, because many important functions are not smooth (e.g. most things involving an if condition).

Smoothness is just an example, it is not necessarily the final answer. But also, in classification problems smoothness usually translates to a margin requirement (the classes have to be separated with sufficient distance). So, in some sense smoothness allows for "if conditions" as long as you're not too sensitive to the threshold.
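To spell out the translation I have in mind (a sketch, not a citation of a particular theorem): suppose the label is $y = \operatorname{sign}(f(x))$ for some $L$-Lipschitz function $f$, and the data distribution satisfies the margin condition $|f(x)| \ge \gamma$ on its support. Then for any point $z$ on the decision boundary (i.e. $f(z) = 0$) and any data point $x$,

$$L\,\|x - z\| \;\ge\; |f(x) - f(z)| \;=\; |f(x)| \;\ge\; \gamma,$$

so every data point is at distance at least $\gamma / L$ from the threshold. The classifier is allowed to contain a hard "if condition"; smoothness only forbids data that sits arbitrarily close to it.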

You are a bridge designer. You make the assumption that forces on the bridge will never exceed some value K (necessary because you can't be robust against unbounded forces). You prove your design will never collapse given this assumption. Your bridge collapses anyway because of resonance.

I don't understand this example. If the bridge can never collapse as long as the outside forces don't exceed K, then resonance is covered as well (as long as it is produced by forces below K). Maybe you meant that the outside forces are also assumed to be stationary.

The broader point is that when the environment has lots of complicated interaction effects, and you must make assumptions, it is very hard to find assumptions that actually hold.

Nevertheless, most engineering projects make heavy use of theory. I don't understand why you think AGI must be different.

The issue of assumptions in strong feasibility is equivalent to the question of whether powerful agents require highly informed priors. If you need complex assumptions then effectively you have a highly informed prior, whereas if your prior is uninformed then it corresponds to simple assumptions. I think that Hanson (for example) believes that it is indeed necessary to have a highly informed prior, which is why powerful AI algorithms will be complex (since they have to encode this prior) and progress in AI will be slow (since the prior needs to be manually constructed brick by brick). I find this scenario unlikely (for example, because humans successfully solve tasks far outside the ancestral environment, so they can't be relying on genetically built-in priors that much), but not ruled out.

However, I assumed that your position is not Hansonian: correct me if I'm wrong, but I took you to believe that deep learning or something similar is likely to lead to AGI relatively soon. Even if not, you were skeptical about strong feasibility results even for deep learning, regardless of hypothetical future AI technology. But it doesn't look like deep learning relies on highly informed priors. What we have are relatively simple algorithms that can, with relatively small (or even no) adaptations, solve problems in completely different domains (image processing, audio processing, NLP, playing many very different games, protein folding...). So, how is it possible that all of these domains have some highly complex property in common that is somehow encoded in the deep learning algorithm?

It's really the assumptions that make me pessimistic, which is why it would be a significant update if I saw a mathematical definition of safety that I thought actually captured "safety" without "passing the buck"

I'm curious whether proving a weakly feasible subjective regret bound under assumptions that you agree are otherwise realistic qualifies or not?

...but the theory-based approach requires you to limit your assumptions to things that can be written down in math (e.g. this function is K-Lipschitz) whereas a non-theory-based approach can use "handwavy" assumptions (e.g. a human thinking for a day is safe), which drastically opens up the space of options and makes it more likely that you can find an assumption that is actually mostly true.

I can quite easily imagine how "human thinking for a day is safe" can be a mathematical assumption. In general, which assumptions are formalizable depends on the ontology of your mathematical model (that is, which real-world concepts correspond to the "atomic" ingredients of your model). The choice of ontology is part of drawing the line between what you want your mathematical theory to prove and what you want to bring in as outside assumptions. Like I said before, this line definitely has to be drawn somewhere, but it doesn't at all follow that the entire approach is useless.

Clarifying "AI Alignment"

...first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work.

I completely agree with this, but isn't this also true of subjective regret bounds / definition-optimization?

The idea is, we will solve the alignment problem by (i) formulating a suitable learning protocol, (ii) formalizing a set of assumptions about reality, and (iii) proving that under these assumptions, this learning protocol has a reasonable subjective regret bound. So, the role of the subjective regret bound is making sure that what we came up with in (i)+(ii) is sufficient, and also guiding the search there. The subjective regret bound does not tell us whether particular assumptions are realistic: for this we need to use common sense and knowledge outside of theoretical computer science (such as: physics, cognitive science, experimental ML research, evolutionary biology...)
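To make "reasonable subjective regret bound" slightly more concrete, here is the rough shape of such a statement (a simplified gloss; the actual desideratum I work with has more structure, e.g. the dynamic and value-learning aspects):

$$\mathrm{SR}(T) \;:=\; \max_{\pi}\, \mathbb{E}_{\mu \sim \zeta}\,\mathbb{E}^{\pi}_{\mu}\!\left[\sum_{t<T} r_t\right] \;-\; \mathbb{E}_{\mu \sim \zeta}\,\mathbb{E}^{\pi^{\mathrm{AI}}}_{\mu}\!\left[\sum_{t<T} r_t\right] \;=\; o(T)$$

Here $\zeta$ is the user's prior over environments (this is where the assumptions from (ii) live), $r_t$ is the user's reward, $\pi^{\mathrm{AI}}$ is the policy induced by running the protocol from (i), and the maximum is over policies the user could have adopted. In words: judged by the user's own beliefs, running the AI is nearly as good as the best available strategy.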

Maybe your point is that there are failure modes that aren't covered by intent alignment, in which case I agree, but also it seems like the OP very explicitly said this in many places.

I disagree with the OP that (emphasis mine):

I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment.

I think that intent alignment is too ill-defined, and to the extent it is well-defined, it is a very weak condition that is not sufficient to address the urgent core of the problem.

And meanwhile I think very messy real world domains almost always limit strong feasibility results. To the extent that you want your algorithms to do vision or NLP, I think strong feasibility results will have to talk about the environment; it seems quite infeasible to do this with the real world.

I don't think strong feasibility results will have to talk about the environment, or rather, they will have to talk about it on a very high level of abstraction. For example, imagine that we prove that stochastic gradient descent on a neural network with a particular architecture efficiently agnostically learns any function in some space, such that as the number of neurons grows, this space efficiently approximates any function satisfying some kind of simple and natural "smoothness" condition (an example motivated by already known results). This is a strong feasibility result. We can then debate whether using such a smooth approximation is sufficient for superhuman performance, but establishing this requires different tools, like I said above.
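Schematically, the kind of statement I mean would look like this (the shape of a hypothetical theorem, not a claim that it is already proven): for networks of width $n$, SGD with $\mathrm{poly}(n, d, 1/\epsilon)$ samples and running time returns a hypothesis $h$ satisfying

$$\mathbb{E}_{(x, y) \sim D}\big[\ell(h(x), y)\big] \;\le\; \min_{f \in \mathcal{F}_n} \mathbb{E}_{(x, y) \sim D}\big[\ell(f(x), y)\big] + \epsilon$$

for every data distribution $D$ (that is the agnostic part), where $\mathcal{F}_n$ is a space of "smooth" functions (say, bounded Lipschitz constant or bounded norm in a suitable function space) that becomes dense in the whole smoothness class as $n \to \infty$. Notice that the environment enters only through $D$, i.e. at a very high level of abstraction.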

The way I imagine it, AGI theory should ultimately arrive at some class of priors that are on the one hand rich enough to deserve to be called "general" (or, practically speaking, rich enough to produce superhuman agents) and on the other hand narrow enough to allow for efficient algorithms. For example the Solomonoff prior is too rich, whereas a prior that (say) describes everything in terms of an MDP with a small number of states is too narrow. Finding the golden path in between is one of the big open problems.

That said, most of this belief comes from the fact that empirically it seems like theory often breaks down when it hits the real world.

Does it? I am not sure why you have this impression. Certainly there are phenomena in the real world that we don't yet have enough theory to understand, and certainly a given theory will fail in domains where its assumptions are not justified (where "fail" and "justified" can be a matter of degree). And yet, theory obviously played and plays a central role in science, so I don't understand whence the fatalism.

However, I specifically don't want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability.

Idk, you could have a nondisclosure-by-default policy if you were worried about this. Maybe this can't work for you though.

That seems like an extremely cost-ineffective way of making progress. I would invest a lot of time and effort into something that would only be disclosed to a select few, for the sole purpose of convincing them of something (assuming they are even interested in understanding it). I imagine that solving AI risk will require collaboration among many people, including sharing ideas and building on other people's ideas, and that's not realistic without publishing. Certainly I am not going to write a Friendly AI on my home laptop :)

Clarifying "AI Alignment"

It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.

Why doesn't this also apply to subjective regret bounds?

In order to get a subjective regret bound you need to consider an appropriate prior. The way I expect it to work is, the prior guarantees that some actions are safe in the short term: for example, doing nothing to the environment and asking only sufficiently quantilized queries of the user (see this for one toy model of how "safe in the short term" can be formalized). Therefore, Beta cannot attack with a hypothesis that would force Alpha to act without consulting the user, since that hypothesis would fall outside the prior.

Now, you can say "with the right prior intent-alignment also works". To which I answer, sure, but first it means that intent-alignment is insufficient in itself, and second the assumptions about the prior are doing all the work. Indeed, we can imagine that the ontology on which the prior is defined includes a "true reward" symbol s.t., by definition, the semantics is whatever the user truly wants. An agent that maximizes expected true reward then can be said to be intent-aligned. If it's doing something bad from the user's perspective, then it is just an "innocent" mistake. But, unless we bake some specific assumptions about the true reward into the prior, such an agent can be anything at all.

Most existing mathematical results do not seem to be competitive, as they get their guarantees by doing something that involves a search over the entire hypothesis space.

This is related to what I call the distinction between "weak" and "strong feasibility". Weak feasibility means algorithms that are polynomial time in the number of states and actions, or the number of hypotheses. Strong feasibility is supposed to be something like, polynomial time in the description length of the hypothesis.
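In symbols, for a hypothesis class $\mathcal{H}$ over an environment with state set $S$ and action set $A$, the distinction I intend is roughly the following (my shorthand; the exact resources being counted can vary):

$$\text{weak feasibility:}\quad \mathrm{time} = \mathrm{poly}\big(|S|, |A|, |\mathcal{H}|, 1/\epsilon\big)$$

$$\text{strong feasibility:}\quad \mathrm{time} = \mathrm{poly}\big(\log|\mathcal{H}|, 1/\epsilon\big)$$

For a finite class, $\log_2 |\mathcal{H}|$ is roughly the description length of a hypothesis, which is why I phrase strong feasibility in terms of description length.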

It is true that currently we only have strong feasibility results for relatively simple hypothesis spaces (such as support vector machines). But this seems to me just a symptom of advances in heuristics outpacing the theory. I don't see any principled reason that significantly limits the strong feasibility results we can expect. Indeed, we already have some advances in providing a theoretical basis for deep learning.

However, I specifically don't want to work on strong feasibility results, since there is a significant chance they would lead to breakthroughs in capability. Instead, I prefer studying safety on the weak feasibility level until we understand everything important on that level, and only then trying to extend it to strong feasibility. This creates somewhat of a conundrum where apparently the one thing that can convince you (and other people?) is the thing I don't think should be done soon.

I could also imagine being pretty interested in a mathematical definition of safety that I thought actually captured "safety" without "passing the buck". I think subjective regret bounds and CIRL both make some progress on this, but somewhat "pass the buck" by requiring a well-specified hypothesis space for rewards / beliefs / observation models.

Can you explain what you mean here? I agree that just saying "subjective regret bound" is not enough; we need to understand all the assumptions the prior should satisfy, reflecting considerations such as what kinds of queries can or cannot manipulate the user. Hence the use of quantilization and debate in Dialogic RL, for example.

Vanessa Kosoy's Shortform

There is some similarity, but there are also major differences. They don't even have the same type signature. The dangerousness bound is a desideratum that any given algorithm can either satisfy or not. On the other hand, AUP is a specific heuristic for tweaking Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but that would still be a very different condition.

The reason I pointed out the relation to corrigibility is not because I think that's the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that "if you run this AI, this won't make things worse than not running the AI", no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with.
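In case a formula helps, here is a crude rendering of that condition (a simplified gloss; the actual desideratum is stated relative to the user's beliefs and is more careful about what "not running the AI" means). Writing $U^{\mathrm{AI}}$ for the user's utility when the AI is run and $U^{\varnothing}$ for the utility under the null option of not running it,

$$\mathbb{E}_{\mu \sim \zeta}\big[U^{\mathrm{AI}}\big] \;\ge\; \mathbb{E}_{\mu \sim \zeta}\big[U^{\varnothing}\big] - \epsilon$$

where $\zeta$ is the user's prior over environments and $\epsilon$ bounds the acceptable loss.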

From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also opposite arguments. Specifically, if you believe that debate is a necessary component of Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can't (empirically and in the worst-case) prove a negative.

Clarifying "AI Alignment"

This opens the possibility of agents whose "well intentioned" mistakes take the form of sophisticated plans that are catastrophic for the user.

Agreed that this is in theory possible, but it would be quite surprising, especially if we are specifically aiming to train systems that behave corrigibly.

The acausal attack is an example of how it can happen for systematic reasons. As for the other part, that seems like conceding that intent-alignment is insufficient and you need "corrigibility" as another condition (also it is not so clear to me what this condition means).

If Alpha can predict that the user would say not to do the irreversible action, then at the very least it isn't corrigible, and it would be rather hard to argue that it is intent aligned.

It is possible that Alpha cannot predict it, because in Beta-simulation-world the user would confirm the irreversible action. It is also possible that the user would confirm the irreversible action in the real world because the user is being manipulated, and whatever defenses we put in place against manipulation are thrown off by the simulation hypothesis.

Now, I do believe that if you set up the prior correctly then it won't happen, thanks to a mechanism like: Alpha knows that in case of dangerous uncertainty it is safe to fall back on some "neutral" course of action plus query the user (in specific, safe, ways). But this exactly shows that intent-alignment is not enough and you need further assumptions.

Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK).

I guess you wouldn't count universality. Overall I agree.

Besides the fact that ascription universality is not formalized, why is it equivalent to intent-alignment? Maybe I'm missing something.

I'm relatively pessimistic about mathematical formalization.

I am curious whether you can specify, as concretely as possible, what type of mathematical result you would have to see in order to significantly update away from this opinion.

I do want to note that all of these require you to make assumptions of the form, "if there are traps, either the user or the agent already knows about them" and so on, in order to avoid no-free-lunch theorems.

No, I make no such assumption. A bound on subjective regret ensures that running the AI is a nearly-optimal strategy from the user's subjective perspective. It is neither needed nor possible to prove that the AI can never enter a trap. For example, the AI is immune to acausal attacks to the extent that the user believes that the AI is not inside Beta's simulation. On the other hand, if the user believes that the simulation hypothesis needs to be taken into account, then the scenario amounts to legitimate acausal bargaining (which has its own complications to do with decision/game theory, but that's mostly a separate concern).

Clarifying "AI Alignment"

In this essay Paul Christiano proposes a definition of "AI alignment" which is narrower than other definitions that are often employed. Specifically, Paul suggests defining alignment in terms of the motivation of the agent (which should be helping the user), rather than in terms of what the agent actually does. That is, as long as the agent "means well", it is aligned, even if errors in its assumptions about the user's preferences or about the world at large lead it to actions that are bad for the user.

Rohin Shah's comment on the essay (which I believe is endorsed by Paul) reframes it as a particular way to decompose the AI safety problem. An often used decomposition is "definition-optimization": first we define what it means for an AI to be safe, then we understand how to implement a safe AI. In contrast, Paul's definition of alignment decomposes the AI safety problem as "motivation-competence": first we learn how to design AIs with good motivations, then we learn how to make them competent. Both Paul and Rohin argue that the "motivation" is the urgent part of the problem, the part on which technical AI safety research should focus.

In contrast, I will argue that the "motivation-competence" decomposition is not as useful as Paul and Rohin believe, and the "definition-optimization" decomposition is more useful.

The thesis behind the "motivation-competence" decomposition implicitly assumes a linear, one-dimensional scale of competence. Agents with good motivations and subhuman competence might make silly mistakes but are not catastrophically dangerous (since they are subhuman). Agents with good motivations and superhuman competence will only make mistakes that are "forgivable" in the sense that our own mistakes would be as bad or worse. Ergo (the thesis concludes), good motivations are sufficient to solve AI safety.

However, in reality competence is multi-dimensional. AI systems can have subhuman skills in some domains and superhuman skills in other domains, as AI history has shown time and time again. This opens the possibility of agents whose "well intentioned" mistakes take the form of sophisticated plans that are catastrophic for the user. Moreover, there might be limits to the agent's knowledge about certain questions (such as the user's preferences) that are inherent in the agent's epistemology (more on this below). Given such limits, the agent's competence becomes systematically lopsided. Furthermore, the elimination of such limits is a large part of the "definition" side of the "definition-optimization" decomposition that the thesis rejects.

As a consequence of the multi-dimensional nature of competence, the difference between "well intentioned mistake" and "malicious sabotage" is much less clear than naively assumed, and I'm not convinced there is a natural way to remove the ambiguity. For example, consider a superhuman AI Alpha subject to an acausal attack. In this scenario, some agent Beta in the "multiverse" (= prior) convinces Alpha that Alpha exists in a simulation controlled by Beta. The simulation is set up to look like the real Earth for a while, making it a plausible hypothesis. Then, a "treacherous turn" moment arrives in which the simulation diverges from Earth, in a way calculated to make Alpha take irreversible actions that are beneficial for Beta and disastrous for the user.

In the above scenario, is Alpha "motivation-aligned"? We could argue it is not, because it is running the malicious agent Beta. But we could also argue it is motivation-aligned, and it just makes the innocent mistake of falling for Beta's trick. Perhaps it is possible to clarify the concept of "motivation" such that in this case Alpha's motivations are considered bad. But such a concept would depend in complicated ways on the agent's internals. I think that this is a difficult and unnatural approach, compared to "definition-optimization", where the focus is not on the internals but on what the agent actually does (more on this later).

The possibility of acausal attacks is a symptom of the fact that environments with irreversible transitions are usually not learnable (this is the problem of traps in reinforcement learning, which I discussed for example here and here), i.e. it is impossible to guarantee convergence to optimal expected utility without further assumptions. When we add preference learning to the mix, the problem gets worse: now, even if there are no irreversible transitions, it is not clear the agent will converge to optimal utility. Indeed, depending on the value learning protocol, there might be uncertainties about the user's preferences that the agent can never resolve (this is an example of what I meant by "inherent limits" before). For example, this happens in CIRL (even if the user is perfectly rational, it happens because the user and the AI have different action sets).
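A minimal toy example of the trap problem (my own illustration): take two deterministic environments $\mu_1, \mu_2$ with actions $\{a, b\}$. In $\mu_1$, the first action $a$ leads to an absorbing state with reward $1$ per step and $b$ leads to an absorbing state with reward $0$; in $\mu_2$ the roles are reversed. The two environments are indistinguishable before the first action, so any policy takes the same first action in both, and hence earns $0$ forever in at least one of them:

$$\max_{i \in \{1, 2\}} \mathrm{Regret}_{\mu_i}(T) \;\ge\; T$$

No algorithm achieves sublinear regret on both environments without prior information, which is why the guarantees I aim for are stated relative to the user's beliefs (subjective regret) rather than as convergence to the objectively optimal policy.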

These difficulties with the "motivation-competence" framing are much more natural to handle in the "definition-optimization" framing. Moreover, the latter already produced viable directions for mathematical formalization, and the former has not (AFAIK). Specifically, the mathematical criteria of alignment I proposed are the "dynamic subjective regret bound" and the "dangerousness bound". The former is a criterion which simultaneously guarantees motivation-alignment and competence (as evidence that this criterion can be satisfied, I have the Dialogic Reinforcement Learning proposal). The latter is a criterion that doesn't guarantee competence in general, but guarantees specifically avoiding catastrophic mistakes. This makes it closer to motivation-alignment compared to subjective regret, but different in important ways: it refers to the actual things the agent does, and the ways in which those things might have catastrophic consequences.

In summary, I am skeptical that "motivation" and "competence" can be cleanly separated in a way that is useful for AI safety, whereas "definition" and "optimization" can be so separated: for example, the dynamic subjective regret bound is a "definition" whereas dialogic RL and putative more concrete implementations thereof are "optimizations". My specific proposals might have fatal flaws that weren't discovered yet, but I believe that the general principle of "definition-optimization" is sound, while "motivation-competence" is not.

Bay Solstice 2019 Retrospective

Thank you for writing this retrospective, it is really interesting! Although I never attended a winter solstice, it sounds like amazing work. I sometimes toy with the idea of organizing a rationalist solstice here in Israel, but after reading this I am rather awed and intimidated, since it's obvious I can't pull off anything that is even close.

This might be a little off-topic, but are there any ideas / resources about how to organize a rationalist winter solstice if (i) you don't have a lot of time or resources and (ii) you expect at most a small core of people who are well-familiar with the rationalist memeplex and buy into the rationalist ethos, plus some number of people who are only partway there or merely curious? Or is it not worth trying under these conditions?

Btw, one thing that sounded strange was the remark that some people felt lonely but it's okay. I understand that winter solstice is supposed to be dark, but isn't the main point of it to amplify the sense of community? Shouldn't the message be something along the lines of "things are hard, but we're in this together"? Which is the antithesis of loneliness?

Realism about rationality

Of course you can predict some properties of what an agent will do. In particular, I hope that we will eventually have AGI algorithms that satisfy provable safety guarantees. But you can't make exact predictions. In fact, there is probably a mathematical law that limits how accurate such predictions can be.

An optimization algorithm is, by definition, something that transforms computational resources into utility. So, if your prediction is so close to the real output that it has similar utility, then the way you produced this prediction involved the same product of "optimization power per unit of resources" and "amount of resources invested" (roughly speaking; I don't claim to already know the correct formalism for this). So you would need to either (i) run a similar algorithm with similar resources, (ii) run a dumber algorithm but with more resources, or (iii) use fewer resources but an even smarter algorithm.
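An informal rendering of this tradeoff (just the claim above in symbols, not a worked-out formalism): writing $p$ for "optimization power per unit of resources" and $R$ for the amount of resources invested, producing a prediction whose utility is comparable to the agent's actual output requires roughly

$$p_{\mathrm{predictor}} \cdot R_{\mathrm{predictor}} \;\gtrsim\; p_{\mathrm{agent}} \cdot R_{\mathrm{agent}}$$

and options (i)-(iii) are just the ways of satisfying this inequality: match both factors, trade power for resources, or trade resources for power.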

So, if you want to accurately predict the output of a powerful optimization algorithm, your prediction algorithm would usually have to be either a powerful optimization algorithm in itself (cases i and iii) or prohibitively costly to run (case ii). The exception is when the optimization problem is easy, so that a dumb algorithm can solve it without a lot of resources (or a human can figure out the answer by emself).

Realism about rationality

It seems almost tautologically true that you can't accurately predict what an agent will do without actually running the agent. Because, any algorithm that accurately predicts an agent can itself be regarded as an instance of the same agent.

What I expect the abstract theory of intelligence to do is something like producing a categorization of agents in terms of qualitative properties. Whether that's closer to "momentum" or "fitness", I'm not sure the question is even meaningful.

I think the closest analogy is: abstract theory of intelligence is to AI engineering as complexity theory is to algorithm design. Knowing the complexity class of a problem doesn't tell you the best practical way to solve it, but it does give you important hints. (For example, if the problem has exponential time complexity then you can only expect to solve it either for small inputs or in some special cases, and average-case complexity tells you whether these cases need to be very special or not. If the problem is in NC then you know that it's possible to gain a lot from parallelization. If the problem is in NP then at least you can test candidate solutions, et cetera.)

And also, abstract theory of alignment should be to AI safety as complexity theory is to cryptography. Once again, many practical considerations are not covered by the abstract theory, but the abstract theory does tell you what kind of guarantees you can expect and when. (For example, in cryptography we can (sort of) know that a certain protocol has theoretical guarantees, but there is engineering work finding a practical implementation and ensuring that the assumptions of the theory hold in the real system.)

Realism about rationality

I think that ricraz claims that it's impossible to create a mathematical theory of rationality or intelligence, and that this is a crux, not so? On the other hand, the "momentum vs. fitness" comparison doesn't make sense to me. Specifically, a concept doesn't have to be crisply well-defined in order to use it in mathematical models. Even momentum, which is truly one of the "crisper" concepts in science, is no longer well-defined when spacetime is not asymptotically flat (which it isn't). Much less so are concepts such as "atom", "fitness" or "demand". Nevertheless, physicists, biologists and economists continue to successfully construct and apply mathematical models grounded in such fuzzy concepts. Although in some sense I also endorse the "strawman" that rationality is more like momentum than like fitness (at least some aspects of rationality).
