Vanessa Kosoy

AI alignment researcher supported by HUJI, MIRI and LTFF. Working on the learning-theoretic agenda.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

Comments

I don't think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don't exist, I don't think balance is completely skewed to the attacker.

My point was not about the defender/attacker balance. My point was that even short-term goals can be difficult to specify, which undermines the notion that we can easily empower ourselves by short-term AI.

Of course we need to understand how to define "long term" and "short term" here. One way to think about this is the following: we can define various short-term metrics, which are evaluable using information in the short-term, and potentially correlated with long-term success. We would say that a strategy is purely long-term if it cannot be explained by making advances on any combination of these metrics.

Sort of. The correct way to make it more rigorous, IMO, is using tools from algorithmic information theory, like I suggested here.

In order to appear as a side-comment, quotes should be an exact match, including formatting

This might be inconvenient for markdown editor users, because when you copy text into the markdown editor, it loses the formatting. It would be nice if either formatting was ignored for side-comment matching purposes, or if copying formatted text into markdown would automatically add the relevant tags (the latter would have other benefits as well).

Thanks for the responses Boaz!

Our claim is that one can separate out components - there is the predictable component which is non-stationary, but is best approximated with a relatively simple baseline, and the chaotic component, which over the long run is just noise. In general, highly complex rules are more sensitive to noise (in fact, there are theorems along these lines in the field of Analysis of Boolean Functions), and so in the long run, the simpler component will dominate the accuracy.

I will look into analysis of boolean functions, thank you. However, unless you want to make your claim more rigorous, it seems suspect to me.

In reality, there are processes happening simultaneously on many different timescales, from the microscopic to the cosmological. And, these processes are coupled, so that the current equilibrium of each process can be regarded as a control signal for the higher timescale processes. This means we can do long-term planning by starting from the long timescales and back-chaining to short timescales, like I began to formalize here.

So, while eventually the entire universe reaches an equilibrium state (a.k.a. heat-death), there is plenty of room for long-term planning before that.

Hacking is actually a fairly well-specified endeavor. People catalog, score, and classify security vulnerabilities. To hack would be to come up with a security vulnerability, and exploit code, which can be verified.

Yeeees, it does seem like hacking is an especially bad example. But even in this example, my position is quite defensible. Yes, theoretically you can formally specify the desired behavior of the code and verify that it always happens. But, there are two problems with that: First, for many realistic software systems, the formal specification would require colossal effort. Second, the formal verification is only as good as the formal model. For example, if the attacker found a hardware exploit, while your model assumes idealized behavior for the hardware, the verification doesn't help. And, in domains outside software the situation is much worse: how do you "verify" that your biological security measures are fool-proof, for example?

Also, you seem to be envisioning a long-term AI that is then fine-tuned on a short-term task, but how did it evolve these long-term goals in the first place?

When you're selecting for success on a short-term goal you might inadvertently produce a long-term agent (which, on the training distribution, is viewing the short-term goal as instrumental for its own goals), just like how evolution was selecting for genetic fitness but ended up producing agents with many preferences unrelated to that. More speculatively, there might be systematic reasons for such agents to arise, for example if good performance in the real-world requires physicalist epistemology which comes with inherent "long-terminess".

I would not say that there is no such thing as talent in being a CEO or president. I do however believe that the best leaders have been some combination of their particular characteristics and talents, and the situation they were in. Steve Jobs has led Apple to become the largest company in the world, but it is not clear that he is a "universal CEO" that would have done as well in any company (indeed he failed with NeXT).

This sounds like a story you can tell about anything. "Yes, such-and-such mathematician proved a really brilliant theorem A, but their effort to make progress in B didn't amount to much." Obviously, real-world performance depends on circumstances and not only on talent. This is doubly true in a competitive setting, where other similarly talented people are working against you. Nevertheless, a sufficiently large gap in talent can produce very lopsided outcomes.

Also, as Yafah points out elsewhere here, for people to actually trust an AI with being the leader of a company or a country, it would need to not just be as good as humans or a little better, but better by a huge margin. In fact, most people's initial suspicion is that AIs (or even humans that don't look like them) are not "aligned" with their interests, and if you don't convince them otherwise, their default would be to keep them from positions of power.

First, it is entirely possible the AI will be better by a huge margin, because like with most things, there's no reason to believe evolution brought us anywhere near the theoretical optimum on this. (Yes, there was selective pressure, but no amount of selective pressure allowed evolution to invent spaceships, or nuclear reactors, or even the wheel.) Second, what if the AI poses as a human? Or, what if the AI uses a human as a front while pulling the strings behind the scenes? There will be no lack of volunteers to work as such a front, if in the short term it brings them wealth and status. Also, ironically, the more successful AI risk skeptics are at swaying public opinion, the easier the AI's job is and the weaker their argument becomes.

The main point is that we need to measure the powers of a system as a whole, not compare the powers of an individual human with an individual AI. Clearly, if you took a human, made their memory capacity 10 times bigger, and made their speed 10 times faster, then they could do more things. But we are comparing with the case that humans will be assisted with short-term AIs that would help them in all of the tasks that are memory and speed intensive.

Alright, I can see how the "universality" argument makes sense if you believe that "human + short-term AI = scaled-up human". The part I doubt is that this equation holds for any easy-to-specify value of "short-term AI".

IIUC the thesis of this article rests on several interrelated claims:

  1. Long-term planning is not useful because of chaos
  2. Short-term AIs have no alignment problem
  3. Among humans, skill is not important for leadership, beyond some point
  4. Human brains have an advantage w.r.t. animals because of "universality", and any further advantage can only come from scaling with resources.

I wish to address these claims one by one.

Claim 1

This is an erroneous application of chaos theory IMO. The core observation of chaos theory is, that in many dynamical systems with compact phase space, any distribution converges (in the Kantorovich-Rubinstein sense) to a unique stationary distribution. This means that small measurement errors lead to large prediction errors, and in the limit no information from the initial condition remains.
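A minimal sketch of the sensitivity described above, using the logistic map on the compact interval [0, 1] as a stand-in example (the choice of map, the starting point 0.2, and the perturbation size 1e-10 are illustrative assumptions, not taken from the article under discussion). Two trajectories that start almost identically stay close for a few steps and then become effectively unrelated:

```python
def logistic(x: float) -> float:
    """One step of the logistic map x -> 4x(1 - x), which maps [0, 1] to itself."""
    return 4.0 * x * (1.0 - x)


def separation(x0: float, eps: float, steps: int) -> list[float]:
    """Distance between two trajectories started eps apart, recorded after each step."""
    a, b = x0, x0 + eps
    gaps = []
    for _ in range(steps):
        a, b = logistic(a), logistic(b)
        gaps.append(abs(a - b))
    return gaps


# Hypothetical "measurement error" of 1e-10 in the initial condition.
gaps = separation(x0=0.2, eps=1e-10, steps=80)
print(f"gap after 5 steps: {gaps[4]:.2e}")
print(f"largest gap during steps 41-80: {max(gaps[40:]):.2f}")
```

Because the phase space here is bounded, the gap cannot grow beyond 1; once it saturates, the initial condition no longer helps with prediction.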

However, real-world dynamical systems are often not compact in the relevant approximation. In particular, acquisition of resources and development of new technologies are not bounded from above on a relevant scale. Indeed, trends in GDP growth and technological progress continue over long time scales and haven't converged, so far, to a stationary distribution. Ultimately, these quantities are also bounded for physical / information-theoretic / complexity-theoretic reasons, but since humanity is pretty far from saturating them, this leaves ample room for AI to have a long-term planning advantage over humanity.

Claim 2

Although it is true that, for sufficiently short-term planning horizons, AIs have fewer incentives to produce unintended consequences, problems remain.

One problem is that some tasks are very difficult to specify. For example, suppose that a group of humans armed with short-term AIs is engaged in cyberwarfare against a long-term AI. Then, even if every important step in the conflict can be modeled as short-term optimization, specifying the correct short-term goal can be a non-trivial task (how do you define "to hack" or "to prevent from hacking"?) that humans can't easily point their short-term AI towards.

Moreover, AIs trained on short-term objectives can still display long-term optimization out-of-distribution. This is because a long-term optimizer that is smart enough to distinguish between training and deployment can behave according to expectations during training while violating them as much as it wants when it's either outside of training or the correcting outer loop is too slow to matter.

Claim 3

This claim flies so much in the face of common sense (is there no such thing as business acumen? charisma? military genius?) that it needs a lot more supporting evidence IMO. The mere fact that IQs of e.g. CEOs are only moderately above average and not far above average only means that IQ stops being a useful metric at that range, since beyond some point, different people have cognitive advantages in different domains. I think that, as scientists, we need to be careful of cavalierly dismissing the sort of skills we don't have.

As to the skepticism of the authors about social manipulation, I think that anyone who studied history or politics can attest that social manipulation has been used, and continues to be used, with enormous effects. (Btw, I think it's probably not that hard to separate a dog from a bone or child from a toy if you're willing to e.g. be completely ruthless with intimidation.)

Claim 4

While it might be true that there is a sense in which human brains are "qualitatively optimal", this still leaves a lot of room for quantitative advantage, similar to how among two universal computers, one can be vastly more efficient than the other for practical purposes. As a more relevant analogy, we can think of two learning algorithms that learn the same class of hypotheses while still having a significant difference in computational and/or sample efficiency. In the limit of infinite resources and data, both algorithms converge to the same results, but in practice one still has a big advantage over the other. While undoubtedly there are hard limits to virtually every performance metric, there is no reason to believe evolution brought human brains anywhere near those limits. Furthermore, even if "scaling with resources" is the only thing that matters, the ability of AI to scale might be vastly better than the ability of humans to scale because of communication bandwidth bottlenecks between humans, not to mention the limited trust humans have towards one another (as opposed to large distributed AI systems, or disparate AI systems that can formally verify each other's trustworthiness).

Personally, I sometimes have the opposite metacognitive concern: that I'm not freaking out enough about AI risk. The argument goes: if I don't have a strong emotional response, doesn't it mean I'm lying to myself about believing that AI risk is real? I even did a few exercises in which I tried to visualize either the doom or some symbolic representation of the doom in order to see whether it triggers emotion or, conversely, exposes some self-deception, something that rings fake. The mental state that triggered was interesting, more like a feeling of calm meditative sadness than panic. Ultimately, I think you're right when you say, if something doesn't threaten me on the timescale of minutes, it shouldn't send me into fight-or-flight. And, it doesn't.

I also tentatively agree that it feels like there's something unhealthy in the panicky response to Yudkowsky's recent proclamation of doom, and it might lead to muddled thinking. For example, it seems like everyone around here is becoming convinced of shorter and shorter timelines, without sufficient evidence IMO. But, I don't know whether your diagnosis is correct. Most of the discourse about AI risk around here is not producing any real progress on the problem. But, occasionally it does. And I'm not sure whether the root of the problem is psychological/memetic (as you claim) or just that it's a difficult problem that only a few can meaningfully contribute to.

If the information takes a little longer to arrive, then the user will still be inside the threshold.

A more concerning problem is: what if the simulation only contains a coarse-grained simulation of the user s.t. it doesn't register as an agent? To account for this, we might need to define a notion of "coarse-grained agent" and allow such entities to be candidate users. Or, maybe any coarse-grained agent has to be an actual agent with a similar loss function, in which case everything works out on its own. These are nuances that probably require uncovering more of the math to understand properly.

Yes, but simulators might not just "alter reality so that they are slightly more causally tight than the user", they might even "alter reality so that they are inside the threshold and the user no longer is", right?

No. The simulation needs to imitate the null hypothesis (what we understand as reality), otherwise it's falsified. Therefore, it has to be computing every part of the null universe visible to the AI. In particular, it has to compute the AI responding to the user responding to the AI. So, it's not possible for the attacker to make the user-AI loop less tight.

...it would seem like no training procedure implementing PreDCA can be modified/devised so as to achieve the guarantee of (almost surely) avoiding acausal attacks... because of the variety of attacks and the vastness of the space of hypotheses.

The variety of attacks doesn't imply the impossibility of defending from them. In cryptography, we have protocols immune from all attacks[1] despite a vast space of possible attacks. Similarly, here I'm hoping to gradually transform the informal arguments above into a rigorous theorem (or well-supported conjecture) that the system is immune.


  1. As long as the assumptions of the model hold, ofc. And, assuming some (highly likely) complexity-theoretic conjectures. ↩︎

If Alice is dating Bob and Alice is promoted to become Bob's boss, then Alice should refuse the promotion, or everyone involved should agree that Bob moves to a different department, or some other solution along these lines. And, yes, informal positions of power are a thing that should be watched out for. I don't think I'm trivializing, I just feel there's a reasonable trade-off point and the norms associated with the American left (which is obviously very influential in this community) went too far to one end of the spectrum.

Now, when most people in a workplace are dating each other... I don't know, this is too far outside my experience for me to have an informed opinion. I can believe this is a terrible idea, or that it's somehow manageable if done right. I think that usually this organically doesn't happen, but I have no experience with working in heavily-poly / heavily-EA orgs, maybe that's different.

Notice that "the gears to ascenscion" wrote "in professional contexts", not just "while working at the same place". That might be interpreted to mean things like "don't date people who you sometimes see in professional conferences", and AFAICT some people actually endorse norms along those lines. And, I think that's going way too far.

I understand you only care about maximizing your current preferences (which might include long-term flourishing of humanity), and not some vague "longtermist potential" independent of your preferences. I agree, but it would seem like most EAs would disagree

Yes, I think most EAs are confused about ethics (see e.g. 1 2 3), which is why I'm not sure I count as EA or merely as "EA-adjacent"[1].

I don't immediately see why some simulation hypotheses (maybe specifically tailored to the way in which the AI encodes its physical hypotheses) would not be able to alter underlying physics in such a way as to provide a tighter causal loop between AI and simulator, so that User Detection yields a simulator

We design user detection so that anything below a threshold is a "user" (rather than only the extreme agent being the user), and if there are multiple (or no) "users" we discard the hypothesis. So, yes, there is still some filtering going on, just not as complex as before.
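A toy sketch of the filtering rule just described, under heavy assumptions: the function name `detect_user`, the numeric score per agent (lower meaning a tighter causal loop with the AI), and the example values are all hypothetical and only illustrate the "exactly one agent below the threshold, otherwise discard" logic, not the actual formalism.

```python
from typing import Optional


def detect_user(scores: dict[str, float], threshold: float) -> Optional[str]:
    """Return the unique agent scoring below the threshold, or None to discard.

    `scores` is a hypothetical mapping from candidate agents to a number
    where lower means a tighter causal loop with the AI.
    """
    users = [name for name, score in scores.items() if score < threshold]
    # Zero candidates or several candidates: the hypothesis is discarded.
    return users[0] if len(users) == 1 else None


# Exactly one agent inside the threshold: the hypothesis is kept.
print(detect_user({"user": 0.1, "bystander": 5.0}, threshold=1.0))  # user
# Two agents inside the threshold: the hypothesis is discarded.
print(detect_user({"user": 0.1, "simulator": 0.05}, threshold=1.0))  # None
```

Note that the rule does not pick the agent with the lowest score: an attacker that is merely "tighter" than the user produces two candidates, and the hypothesis is dropped rather than reassigned.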

But maybe this is not your main concern. You said "our best plan at avoiding those is applying various approximate pruning mechanisms (as already happens in many other Alignment proposals)". This is not how I would put it. My goal is to have an algorithm for which we know its theoretical guarantees (e.g. having such-and-such regret bound w.r.t. such-and-such prior). I believe that deep learning has theoretical guarantees, we just don't know what they are. We will need to either (i) understand what guarantees DL has and how to modify it in order to impose the guarantees we actually want or (ii) come up with a completely different algorithm for satisfying the new guarantees. Either will be challenging, but there are reasons to be cautiously optimistic (mainly the progress that's already happening, and the fact that looking for algorithms is easier when you know exactly which mathematical property you need).

Say P searches for a model of a theory T. Say Q simulates a room with a human, and a computer which distributes an electric shock to the human iff it finds a contradiction derived from T, and Q outputs whether the human screamed in pain (and suppose the human screams in pain iff they are shocked). Both reject at time t if they haven't accepted yet, but suppose we know one of the two searches will finish before t.

The difference is, if there's actually a room with a human (or the simulation of a room with a human), then there are other computations that are running (all the outputs of the human throughout the process), not just the one bit about whether the human screamed in pain. That's how we know that in this situation a human exists, whereas if we only have a computer running P then no human exists. We can't just "rearrange the program in a way that outputs information that wasn't already there", because if it isn't already there, the bridge transform will not assert this rearranged program is running.


  1. Due to the unfortunate timing of this discussion, I feel the need to clarify: this has absolutely nothing to do with FTX. I would have said exactly the same before those recent events. ↩︎

Some quick responses:

  • About the "lock-in" problem: I don't think lock-in is a meaningful concern. I'm an agent with specific preferences, and I'm making decisions based on these preferences. The decision of which AI to run is just another decision. Hence, there's no philosophical reason I shouldn't make it based on my current preferences. The confusion leading to viewing this as a problem comes from conflating long-term terminal preferences with short-term instrumental preferences based on some possibly erroneous beliefs. Notice also that in IBP, preferences can directly depend on computations, so they can be abstract / "meta" / indirect.
  • About "we're not really checking which computations run, but which computations' outputs the universe has information about... this is a problem to fix". I don't think it's a problem. The former is the same as the latter. If, like in your example, P is a program that outputs a bit and Q is a program that outputs not-P, then P is running iff Q is running (as long as we're at an epistemic vantage point that knows that Q = not-P). This seems intuitive to me: it makes no sense to distinguish between computing the millionth digit of pi in binary, and computing the not of the millionth digit of pi in binary. It's essentially the same computation with a different representation of the result. To put it differently, if P is an uploaded human and Q is a different program which I know to be functionally equivalent to P, then Q is also considered to be an uploaded human. This is a philosophical commitment, but I consider it to be pretty reasonable.
  • About "the current framework only allows for our AGI to give positive value to computations": yes, this is a major problem. There might be good answers to this, but currently all the candidate answers I know are pretty weird (i.e. require biting some philosophical bullet). I believe that we'll understand more one way or the other as we progress in our mathematical inquiry.
  • About "a model of human cognitive science" and "pruning mechanisms": I no longer believe these are necessary. I now think we don't need to explicitly filter acausal attackers. Instead, in IBP every would-be mesa-optimizer is toothless because it automatically has to contain a simulation of the user and therefore it (i) is a valid hypothesis from the user's POV and (ii) induces the correct user preferences.