I'm confused about how you're thinking about the space of agents, such that "maybe we don't need to make big changes"?
I just mean that I don't plan for corrigibility to scale that far anyway (see my other comment), and maybe we don't need a paradigm shift to get to the level we want, so it's mostly small updates from gradient descent. (Tbc, I still think there are many problems, and I worry the basin isn't all that real so multiple small updates might lead us out of the basin. It just didn't seem to me that this particular argument would be a huge dealbreaker if the rest works out.)
What actions help you land in the basin?
Clarifying the problem first: let's say we have actor-critic model-based RL. Then our goal is that the critic is a function on the world model that measures something like how empowered the principal is in the short term, i.e. it assigns high valence to predicted outcomes in which the principal is empowered.
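To make that a bit more concrete, here's a minimal sketch of what I have in mind (purely illustrative; `WorldModel.predict_next` and the rest of the API are made-up names, not anyone's actual setup):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Scores latent world-model states. The hope is that what it ends up
    measuring is something like short-horizon principal empowerment."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.value_head = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, latent_state: torch.Tensor) -> torch.Tensor:
        # high valence <=> predicted outcome in which the principal is empowered
        return self.value_head(latent_state)

def plan_valence(world_model, critic, latent_state, actions):
    """Roll a candidate plan forward in the world model (short horizon only)
    and sum the critic's valence over the predicted latent states."""
    total = 0.0
    for a in actions:
        latent_state = world_model.predict_next(latent_state, a)  # hypothetical API
        total = total + critic(latent_state)
    return total
```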
One thing we want to do is make it less likely that some other function that also fits the reward signal well gets learned instead. E.g.:
We also want very competent overseers that understand corrigibility well and give rewards accurately, rather than e.g. rewarding nice extra things the AI did that you didn't ask for.
And then you also want to use some thought monitoring. If the agent doesn't reason in CoT, we might still be able to train some translators on the neuralese. We can:
Tbc, this is just how you may get into the basin. It may become harder to stay in it, because (1) the AI learns a better model of the world and there are simple functions on the world model that perform better (e.g. get more reward), and (2) the corrigibility learned may be brittle and imperfect and might still cause subtle power seeking because it happens to still be instrumentally convergent, or something like that.
The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
The AI doesn't fall deeper into the basin by itself; it only happens because of humans fixing problems.
If the AI helps humans to stay informed and asks about their preferences in potential edge cases, does that count as the humans fixing flaws?
Also some more thoughts on that point:
But this is a different view of mindspace; there is no guarantee that small changes to a mind will result in small changes in how corrigible it is, nor that a small change in how corrigible something is can be achieved through a small change to the mind!
As a proof of concept, suppose that all neural networks were incapable of perfect corrigibility, but capable of being close to perfect corrigibility, in the sense of being hard to seriously knock off the rails. From the perspective of one view of mindspace we're "in the attractor basin" and have some hope of noticing our flaws and having the next version be even more corrigible. But in the perspective of the other view of mindspace, becoming more corrigible requires switching architectures and building an almost entirely new mind — the thing that exists is nowhere near the place you're trying to go.
Now, it might be true that we can do something like gradient descent on corrigibility, always able to make progress with little tweaks. But that seems like a significant additional assumption, and is not something that I feel confident is at all true. The process of iteration that I described in CAST involves more deliberate and potentially large-scale changes than just tweaking the parameters a little, and with big changes like that I think there's a big chance of kicking us out of "the basin of attraction."
Idk, this doesn't really seem to me like a strong counterargument. When you make a bigger change, you just have to be really careful that you land in the basin again. And maybe we don't need big changes.
That said, I'm quite uncertain about how stable the basin really is. I think a problem is that sycophantic behavior will likely get a bit higher reward than corrigible behavior for smart AIs. So there are 2 possibilities: (1) corrigibility nevertheless wins out, or (2) sycophancy wins out because of its slightly higher reward.
My uncertain guess is that (2) would by default likely win out under normal training for corrigible behavior. But maybe we could make (1) more likely by using something like IDA? And in actor-critic model-based RL we could also stop updating the critic at the point where we think the AI might be capable of sycophancy smart enough to win out against corrigibility, while letting the world model and actor still become a bit smarter.
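As a sketch of the "stop updating the critic" idea (again just illustrative; the loss methods and optimizer names are assumptions, not an actual training stack):

```python
def training_step(world_model, actor, critic, batch, optimizers, freeze_critic: bool):
    """One update step for actor-critic model-based RL, with an option to
    freeze the critic once we worry that further reward-fitting would favor
    sycophancy over corrigibility, while world model and actor keep improving."""
    wm_loss = world_model.prediction_loss(batch)   # hypothetical API
    optimizers["world_model"].zero_grad()
    wm_loss.backward()
    optimizers["world_model"].step()

    # actor is still trained, but only against the (possibly frozen) critic
    actor_loss = actor.policy_loss(batch, critic)  # hypothetical API
    optimizers["actor"].zero_grad()
    actor_loss.backward()
    optimizers["actor"].step()

    if not freeze_critic:
        critic_loss = critic.value_loss(batch)     # fit critic to the reward signal
        optimizers["critic"].zero_grad()
        critic_loss.backward()
        optimizers["critic"].step()
```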
And then there's of course the problem of how we land in the basin in the first place. I still need to think about what a good approach for that would look like, but it doesn't seem implausible to me that we could try in a reasonable way and hit it.
Nice post! And being scared of minus signs seems like a nice lesson.
Absent a greater degree of theoretical understanding, I now expect the feedback loop of noticing and addressing flaws to vanish quickly, far in advance of getting an agent that has fully internalized corrigibility such that it's robust to distributional and ontological shifts.
My motivation for corrigibility isn't that it scales all that far, but that we can more safely and effectively elicit useful work out of corrigible AIs than out of sycophants/reward-on-the-episode-seekers (let alone schemers).
E.g. current approaches to corrigibility still rely on short-term preferences, but when the AI gets smarter and its ontology drifts so that it sees itself as an agent embedded in multiple places in greater reality, short-term preferences become much less natural. This probably-corrigibility-breaking shift already happens around Eliezer level if you're trying to use the AI to do alignment research. Doing alignment research makes it more likely that such breaks occur earlier, also because the AI would need to reason about stuff like "what if an AI reflects on itself in this dangerous value-breaking way", which is sorta close to the AI reflecting on itself in that way. Not that it's necessarily impossible to use corrigible AI to help with alignment research, but we might be able to get a chunk further in capability if we make the AI not think about alignment stuff and instead just focus on e.g. biotech research for human intelligence augmentation, and that generally seems like a better plan to me.
I'm pretty unsure, but I currently think that if we tried not too badly (by which I mean much better than any of the leading labs seem on track to try, but without requiring fancy new techniques), we may have something like a 10-75%[1] chance of getting a +5.5SD corrigible AI. And if a leading lab is sane enough to try a well-worked-out proposal here and it works, it might be quite useful to have +5.5SD agents inside the labs that want to empower the overseers and can at least tell them that all the current approaches suck and we need to aim for international cooperation to get a lot more time (and then maybe human augmentation). (Rather than having sycophantic AIs that just tell the overseers what they want to hear.)
So I'm still excited about corrigibility even though I don't expect it to scale.
Restructuring it this way makes it more attractive for the AI to optimize things according to typical/simple values if the human's action doesn't sharply identify their revealed preferences. This seems bad.
The way I would interpret "values" in your proposal is like "sorta-short-term goals a principal might want to get fulfilled". I think it's probably fine if we just learn a prior over what sort of sorta-short-term goals a human may have, and then use that prior instead of Q. (Or not?) If so, this notion of power seems fine to me.
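To gesture at what I mean (a rough sketch in my own ad-hoc notation, not your actual definitions): something like

$$\mathrm{Power}(s) \;\approx\; \mathbb{E}_{g \sim P_{\text{goals}}}\!\left[\max_{\pi}\ \mathbb{E}\!\left[\,U_g(\tau)\mid s,\pi\,\right]\right]$$

where $P_{\text{goals}}$ is the learned prior over sorta-short-term goals the principal might have, $\pi$ ranges over policies available to the principal, $\tau$ is the resulting trajectory, and $U_g$ measures how well $\tau$ fulfills goal $g$, i.e. roughly the shape I'd want, with the learned prior in place of Q.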
(If you have time, I also would be still interested in your rough take on my original question.)
(wide range because I haven't thought much about it yet)
Behavioral science of generalization. The first is just: studying AI behavior in depth, and using this to strengthen our understanding of how AIs will generalize to domains that our scalable oversight techniques struggle to evaluate directly.
- Work in the vicinity of “weak to strong” generalization is a paradigm example here. Thus, for example: if you can evaluate physics problems of difficulty level 1 and 2, but not difficulty level 3, then you can train an AI on level 1 problems, and see if it generalizes well to level 2 problems, as a way of getting evidence about whether it would generalize well to level 3 problems as well.
- (This doesn’t work on schemers, or on other AIs systematically and successfully manipulating your evidence about how they’ll generalize, but see discussion of anti-scheming measures below.)
I don't think this just fails with schemers. A key problem is that it's hard to distinguish whether you're measuring "this alignment approach is good" or "this alignment approach looks good to humans". If it's the latter, it looks great on levels 1 and 2, but the approach doesn't actually work at level 3. I unfortunately expect that if we train AIs to evaluate what is good alignment research, they will more likely learn the latter. (This problem seems related to ELK.)
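As a toy illustration of the worry (purely illustrative Python; the evaluator and dataset names are made up): a model that has learned "looks good to the evaluator" rather than "is good" can pass the level-1-to-level-2 check and still fail at level 3, because the check only ever compares against evaluations we can do.

```python
def weak_to_strong_check(train_fn, eval_fn, problems_by_level):
    """The protocol from the quote: train on level-1 problems (evaluable),
    measure generalization to level 2 (also evaluable), and treat that as
    evidence about level 3 (not directly evaluable)."""
    model = train_fn(problems_by_level[1])
    return eval_fn(model, problems_by_level[2])

# The worry: eval_fn is itself a human-level evaluator. A model that learned
# "produce answers that look good to the evaluator" scores well here, and the
# check can't distinguish it from a model that learned "produce correct answers"
# -- but only the latter keeps working at level 3, where the evaluator is blind.
```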
they treat incidents of weird/bad out-of-distribution AI behavior as evidence alignment is hard, but they don’t treat incidents of good out-of-distribution AI behavior as evidence alignment is easy.
I don't think Nate or Eliezer were expecting to see bad cases this early, and I don't think seeing bad cases updated them much further towards pessimism - they were already pretty pessimistic before. I don't think they update in a non-Bayesian way, as you seem to suggest; it's just that AIs being nice in new circumstances isn't much evidence for alignment being easy given their models.
I think behavior generalization is a bad frame for thinking about what really smart AIs will do. You rather need to think in terms of optimization / goal-directed reasoning. E.g. if you imagine a reward maximizer, it's not at all surprising that it works well while it cannot escape control measures, but once it's smart enough that it can, it's also not surprising that it will.
Thanks for writing up your views in detail!
On corrigibility:
Corrigibility was originally intended to mean that a system with that property does not run into nearest-unblocked-strategy problems, i.e. it avoids the kind of adversarial dynamic that exists between deontological and consequentialist preferences. In your version, the consequentialist planning to fulfill a hard task given by the operators is at odds with the deontological constraints.
I also think it is harder to get robust deontological preferences into an AI than one would expect given human intuitions. The human reward system is wired in a way such that we robustly get positive reward for pro-social self-reflective thoughts. Perhaps we can have another AI monitor the thoughts of our main AI and likewise reward pro-social (self-reflective?) thoughts (although I think LLM-like AIs would likely be self-reflective in a rather different way than humans). However, I think for humans our main preferences come from such approval-directed self-reflective desires, whereas I expect the way people will train AI by default will cause the main smart optimization to aim for object-level outcomes, which are then more at odds with norm-following. (See this post, especially sections 2.3 and 3.) (And even for humans it's not quite clear whether deontological/norm-following preferences are learned deeply enough.)
So basically, I don't expect the AI's main optimization to end up robustly steering toward fulfilling deontological preferences; it's rather like trying to enforce deontological preferences by having a different AI monitor the AI's thoughts and constrain it to not think virtue-specification-violating thoughts. So when the AI gets sufficiently smart you get (1) nearest-unblocked-strategy problems, like disobedient thoughts that the monitoring AI can't interpret; and/or (2) collusion, if you didn't find some other way to make your AIs properly corrigible.
Deontological preferences aren't very natural targets for a steering process to steer toward. It somehow sometimes works out for humans because their preferences derive more from their self-image than from environmental goals. But if you try to train deontological preferences into an AI with current methods, it won't end up deeply internalizing them; rather, it will learn a belief that it should not think disobedient thoughts, or an outer-shell non-consequentialist constraint.
(I guess given that you acknowledge nearest-unblocked-strategy problems, you might sorta agree with this, though it still seems plausible to me that you overestimate how deep and generalizing trained-in deontological constraints would be.)
Myopic instruction-following is already going in a roughly good direction in terms of what goal to aim for, but I think if you give the AI a task, the steering process towards that task would likely not have a lot of nice corrigibility properties by default. E.g. it seems likely that, in steering toward such a task, it would see the possibility of the operator telling it to stop as an obstacle that would prevent the goal of task completion. (I mean, it's a bit ambiguous how exactly to interpret instruction-following here, but I think that's what you get by default if you naively train for it the way current labs would.)
It would be much nicer if the powerful steering machinery weren't steering in a way that would naturally disempower us (absent thought and control constraints) in the first place. I think aiming for CAST would be much better: basically, we want to point the powerful steering machinery towards a goal like "empower the principal", which then implies instruction-following and keeping the principal in control[1]. It also has the huge advantage that steering towards roughly-CAST may be enough for the AI to want to empower the principal more, so it may try to change itself into something like more-correct-CAST (aka Paul Christiano's "basin of corrigibility"). (But obviously the difficulties of getting the intended target instead of something like a reward-seeking AI still apply.)
I'm not totally sure whether it works robustly, but in any case it seems much much much better to aim for than something like Anthropic's HHH.
I heard you mention on the Doom Debates podcast that you're working on an audiobook but that it "may take a while". Could you give a quantitative guess for how long?
Do you count avoiding reward-on-the-episode-seekers as part of step 2 or step 3?
Thanks!
The single-timestep case actually looks fine to me now, so I return to the multi-timestep case.
I would want to be able to tell the AI to do a task, and then, while the AI is doing the task, tell it to shut down, so that it shuts down. The hard part here is that, while doing the task, the AI shouldn't prevent me in some way from telling it to shut down (because it would get higher utility if it manages to fulfill the values-as-inferred-through-principal-action of the first episode). This seems like it may require a somewhat different formalization than your multi-timestep one (although feel free to try it in your formalization).
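To gesture at the property I want (a very rough sketch in my own notation, definitely not a claim about your formalism): for every time $t'$ after the task starts, following the task-optimal policy shouldn't make a shutdown attempt by the principal less likely to succeed than doing nothing would, i.e. something like

$$\Pr\!\big(\text{shut down by } t' \mid \text{principal attempts shutdown},\ \pi^{*}_{\text{task}}\big) \;\ge\; \Pr\!\big(\text{shut down by } t' \mid \text{principal attempts shutdown},\ \pi_{\text{noop}}\big).$$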
Do you think your formalism could be extended so it works in the way we want for such a case, and why (or why not)? (And ideally also roughly how?)
(Btw, even if it doesn't work for the case above, I think this is still really excellent progress, and it does update me towards thinking that corrigibility is likely simpler and more feasible than I thought before. Also, thanks for writing up the formalism.)
Do you think sociopaths are sociopaths because their approval reward is very weak? And if so, why do they often still seek dominance/prestige?