Towards_Keeperhood

I'm trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.

Wiki Contributions

Comments

So basically I mostly don't think that Eliezer expects that there are natural abstractions, but that he thinks the problem is that we don't have sufficient understanding of AGI cognition to robustly point goals of an AGI.

I think Eliezer thinks AGI cognition is quite different from human cognition, which makes it harder. But even if we were to develop brain-like AGI which worked sorta more like humans, the way this cognition looks like deep down still seems sorta alien, and we still don't nearly have the skill to robustly point the goal of an AGI even if it was brain-like.

My model (which is pretty similar to my model of Eliezer's model) does not match your model of Eliezer's model. Here's my model, and I'd guess that Eliezer's model mostly agrees with it:

  1. Natural abstractions (very) likely exist in some sense. Concepts like "chair" and "temperature" and "carbon" and "covalent bond" all seem natural in some sense, and an AI might model them too (though perhaps at significantly superhuman levels of intelligence it rather uses different concepts/models). (Also it's not quite as clear whether such natural abstractions actually apply very well to giant transformers (though still probable in some sense IMO, but it's perhaps hard to identify them and to interpret what "concepts" actually are in AIs).)
  2. Many things we value are not natural abstractions, but only natural relative to a human mind design. Emotions like "awe" or "laughter" are quite complex things evolved by evolution, and perhaps minds that have emotions at all are just a small space in minddesignspace. The AI doesn't have built-in machinery for modelling other humans the way humans model other humans. It might eventually form abstractions for the emotions, but probably not in a way it understands "how the emotion feels from the inside".
    1. There is lots of hidden complexity in what determines human values. Trying to point an AI to human values directly (in a similar way to how humans are pointed to their values) would be incredibly complex. Specifying a CEV process / modelling one or multiple humans and identifying in the model where the values are represented and pointing the AI to optimize those values is more tractable, but would still require a vastly greater mastering of understanding of minds to pull of, and we are not on a path to get there without human-augmentation.
  3. When the AI is smarter than us it will have better models which we don't understand, and the concepts it uses will diverge from the concepts we use. As an analogy, consider 19th-century humans (or people who don't know much about medicine) being able to vaguely classify health symptoms into diseases, vs the AI having a gears-level model of the body and the immune system which explains the observed symptoms.
  4. I think a large part of what Eliezer meant with Lethalities#33 is that the way thinking works deep in your mind looks very different from the English sentences which you can notice going through your mind and which are only shallow shadows of what actual thinking is going on in your mind; and for giant transformers the way the actual thinking looks there is likely even a lot less understandable from the way the actual thinking looks in humans.
  5. Ontology idenfication (including utility rebinding) is not nearly all of the difficulty of the alignment problem (except possibly in so far as figuring out all the (almost-)ideal frames to model and construct AI cognition is a requisite to solving ontology identification). Other difficulties include:
    1. We won't get a retargetable general purpose search by default, but rather the AI is (by default) going to be a mess of lots of patched-together optimization patterns.
    2. There are lots of things that might cause goal drift; misaligned mesa-optimizers which try to steer or get control of the AI; Goodhart; the AI might just not be smart enough initially and make mistakes which cause irrevocable value-drift; and in general it's hard to train the AI to become smarter / train better optimization algorithms, while keeping the goal constant.
    3. (Corrigibility.)
  6. While it's nice that John is attacking ontology identification, he doesn't seem nearly as much on track to solve it in time as he seems to think. Specifying a goal in the AI's ontology requires finding the right frames for modelling how an AI imagines possible worldstates, which will likely look very different from how we initially naively think of it (e.g. the worldstates won't be modelled by english-language sentences or anything remotely as interpretable). The way we currently think of what "concepts" are might not naturally bind to anything in how the AI's reasoning looks like, and we first need to find the right way to model AI cognition and then try to interpret what the AI is imagining. Even if "concept" is a natural abstraction on AI cognition, and we'd be able to identify them (though it's not that easy to concretely imagine how that might look like for giant transformers), we'd still need to figure out how to combine concepts into worldstates so we can then specify a utility function over those.

I think research on what you propose should definitely not be public and I'd recommend against publicly trying to push this alignment agenda.

(I think) Planck found the formula that matched the empirically observed distribution, but had no explanation for why it should hold. Einstein found the justification for this formula.

First a note:

the two chunks are independent given the pressure and temperature of the gas

I'd be careful here: If the two chunks of gas are in a (closed) room which e.g. was previously colder on one side and warmer on the other and then equilibriated to same temperature everywhere, the space of microscopic states it can have evolved into is much smaller than the space of microscopic states that meet the temperature and pressure requirements (since the initial entropy was lower and physics is deterministic). Therefore in this case (or generally in cases in our simple universe rather than thought experiments where states are randomly sampled) a hypercomputer could see more mutual information between the chunks of gas than just pressure and temperature. I wouldn't call the chunks approximately independent either, the point is that we with our bounded intellects are not able to keep track of the other mutual information.

Main comment:

(EDIT: I might've misunderstood the motivation behind natural latents in what I wrote below.)

I assume you want to use natural latents to formalize what a natural abstraction is.

The " induces independence between all " criterion seems too strong to me.

IIUC you want that if we have an abstraction like "human", you want all the individual humans to share approximately no mutual information conditioned on the "human" abstraction.
Obviously, there are subclusters of humans (e.g. women, children, ashkenazi jews, ...) where members share more properties (which I'd say is the relevant sense of "mutual information" here) than properties that are universally shared among humans.
So given what I intuitively want the "human" abstraction to predict, there would be lots of mutual information between many humans.
However, (IIUC,) your definition of natural latents permits there to be waaayyy more information encoded in the "human" abstraction, s.t. it can predict all the subclusters of humans that exist on earth, since it only needs to be insensitive to removing one particular human from the dataset. This complex human abstraction does render all individual humans approximately independent, but I would say this abstraction seems very ugly and not what I actually want.

I don't think we need this conditional independence condition, but rather something else that finds clusters of thingies which share unusually much (relevant) mutual information.
I like to think of abstractions as similarity clusters. I think it would be nice if we find a formalization of what a cluster of thingies is without needing to postulate an underlying thingspace / space of possible properties, and instead find a natural definition of "similarity cluster" based on (relevant) mutual information. But not sure, haven't really thought about it.

(But possibly I misunderstood sth. If it already exists, feel free to invite me to a draft of the conceptual story behind natural latents.)

Amazing story! My respect for writing this.

I think stories may be a promising angle for making people (especially AI researchers) understand AI x-risk (on more of a gut level so they realize it actually binds to reality).

The end didn't seem that realistic to me though. Or at least, I don't expect ALICE would seek to fairly trade with humanity, but not impossible that it'd call the president pretending to want to trade. Not sure what your intent when writing was, but I'd guess most people will read it the first way. Compute is not a (big) bottleneck for AI inference. Even if humanity coordinated successfully to shut down large GPU clusters and supercomputers, it seems likely that ALICE could copy itself to tens or hundreds of millions of devices (and humanity seems much to badly coordinated to be able to shut down 99.99% of those) to have many extremely well coordinated copies, and at ALICE's intelligence level this seems sufficient to achieve supreme global dominance within weeks (or months if I'm being conservative), even if it couldn't get smarter. E.g. it could at least do lots and lots of social engineering and manipulation to prevent humanity to effectively coordinate against it, spark wars and civil wars, make governments and companies decide to manufacture war drones (which the ALICE can later hack), and influence war decisions for higher destructiveness, use war drones to threaten people into doing stuff at important junctions, and so on. (Sparking multiple significant wars within weeks seems totally possible on that level of intelligence and resources. Seems relatively obvious to me but I can try to argue the point if needed. (Though not sure whether convincingly. Most people seem to me to not be nearly able to imagine what e.g. 100 copies of Eliezer Yudkowsky could do if they could all think on peak performance 24/7. Once you reach that level with something that can rewrite its mind you don't get slow takeoff, but nvm that's an aside.))

I'm not sure I understood how 2 is different from 1.

(1) is the problem that utility rebinding might just not happen properly by default. An extreme example is how AIXI-atomic fails here. Intuitively I'd guess that once the AI is sufficiently smart and self-reflective, it might just naturally see the correspondence between the old and the new ontology and rebind values accordingly. But before that point it might get significant value drift. (E.g. if it valued warmth and then learns that there actually are just moving particles, it might just drop that value shard because it thinks there's no such (ontologically basic) thing as warmth.)

(2) is the problem that the initial ontology of the AI is insufficient to fully capture human values, so if you only specify human values as well as possible in that ontology, it would still lack the underlying intuitions humans would use to rebind their values and might rebind differently. Aka while I think many normal abstractions we use like "tree" are quite universal natural abstractions where the rebinding is unambiguous, many value-laden concepts like "happiness" are much less natural abstractions for non-human minds and it's actually quite hard to formally pin down what we value here. (This problem is human-value-specific and perhaps less relevant if you aim the AI at a pivotal act.)

When I try to think about this kind of problem I get stuck on not understanding the details of how an ontology/worldmodel can or should work.

Not sure if this helps, but I heard that Vivek's group came up with the same diamond maximizer proposal as I did, so if you remember that you can use it as a simple toy frame to think about rebinding. But sure we need a much better frame for thinking about the AI's world model.

This is an amazing report!

Your taxonomy in section 4 was new and interesting to me. I would also mention the utility rebinding problem, that goals can drift because the AI's ontology changes (e.g. because it figures out deeper understanding in some domain). I guess there are actually two problems here:

  1. Formalizing the utility rebinding mechanism so that concepts get rebound to the corresponding natural abstractions of the new deeper ontology.
  2. For value-laden concepts the AI likely lacks the underlying human intuitions for figuring out how the utility ought to be rebound. (E.g. when we have a concept like "conscious happiness", and the AI finds what cognitive processes in our brains are associated with this, it may be ambiguous whether to rebind the concept to the existence of thoughts like 'I notice the thought "I notice the thought <expected utility increase>"' running through a mind/brain, or whether to rebind it in a way to include a cluster of sensations (e.g. tensions in our face from laughter) that are present in our minds/brains (, or other options). (Sry maybe bad example which might require some context of my fuzzy thoughts on qualia which might actually be wrong.))

Thanks.

I briefly looked into the MIRI paper (and the section from Eliezer's lecture that starts at 22min) again.

My main guess now is that you're not supposed to make the agent act on false beliefs; that one of the desiderada of the shutdown problem is that your agent shouldn't have a capability penalty, including taking bad actions based on false beliefs (like taking bad bets, as jessicata mentioned). Your proposal violates that.

The case MIRI considered wasn't to modify the world model so the agent believes that the button cannot be pressed, but instead to construct the utility function in a way that the resulting behavior ends up equivalent, but without the agent having wrong beliefs.

To clarify:

Subagent2 assumes the shutdown button will be pressed independent of upstream events, aka also independent of whether the shutdown button got removed. It would not veto against the removal of the shutdown problem (right?). The hope is that subagent1 doesn't bother to remove the shutdown button because it believes it won't be pressed anyway.

(The usefulness of subagent2 comes from it preventing the creation of pure u1 maximizers, it does not help protecting the shutdown button (right?).)

Question 1:

I may be wrong about this, but iirc MIRI considered a case where the agent assigned probability 0 to the shutdown button being pressed, and iirc the problem was that the agent would use the button as an outcome pump by creating mechanisms that would cause the button to be pressed if something went worse than expected, thereby magically decreasing the probability that something goes badly, since the button cannot be pressed.

I haven't thought much about it, but doesn't this proposal have the same failure mode? (And if not, why not?)

Question 2:
(Feel free to ignore this question if the proposal does fall for the failure mode in question1.)

Is there a probabilistic extension of the proposal which uses 99.99% instead of 100%, but still basically works. (Maybe assuming away some problems with comparing utility functions or so.) And if not: Why not / What is the part that relies on certainty?

Load More