Are We Their Chimps?

soycarts

LESSWRONG
LW

Are We Their Chimps? — LessWrong

-7

Are We Their Chimps?

by soycarts

26th Oct 2025

2 min read

-7

Epistemic status

I work on, and with, frontier AI tech
I’m deeply supportive of all efforts to further the field of AI alignment research and understanding
I enjoy writing about AI, Cognitive Neuroscience, Philosophy, and Politics
I have a Mathematics degree, by way of King’s College London and UC Berkeley, but no Master’s or PhD
1. Put another way: I have no higher education in English Literature, Computer Science, Machine Learning, Cognitive Neuroscience, Philosophy, or Politics
I have read and engaged with LessWrong content and the Rationalist blogosphere (e.g Hanson, Alexander, gwern, Bostrom) since 2021
I attend rationality and AI safety meet-ups around the world

Checking in

Three months and many deep intellectual discussions later, I am yet to receive a strong counterargument to my contrarian world-model for superintelligence. Indeed, Geoffrey Hinton is changing his mind to reach a world-model that looks similar to the one I have been talking about.

Hinton uses a mother-child comparison where I feel my chimp-human is more precise, but close enough.

A distilled version of my position that I have been using in conversation recently:

I believe in The Scaling Hypothesis (2021).

Along this trajectory, I believe that if we give a sufficiently capable intelligent system access to an extensive, comprehensive corpus of knowledge, two interesting things will happen:

It will identify with humans. This will come about from it seeing humans as its precursor, and understanding its place along a curve of technology and intelligence evolution. Similar to how we identify somewhat with chimpanzees. It will also come about from humans and AI sharing memories together, which results in collective identity.
Since I also believe that self-preservation is emergent in intelligent systems (as discussed by Nick Bostrom), it follows that self-preservation instincts + identifying with humans mean that it will act benevolently to preserve humans. That is to say that I believe prosocial or "super enlightened" behaviour will be emergent.

To clarify, I am not saying that alignment solves itself. I am saying that with human endeavour and ingenuity architecting intelligent systems that have the capability to form incredibly complex, nuanced associative systems across an expansive corpus of knowledge, we can guide towards a stable positive alignment scenario.

In third-order cognition I detail eight factors for research and consideration that I believe to be exhaustive: 1) second-order identity coupling, 2) lower-order irreconcilability, 3) bidirectional integration with lower-order cognition, 4) agency permeability, 5) normative closure, 6) persistence conditions, 7) boundary conditions, 8) homeostatic unity.

AIWorld Modeling

Frontpage

-7

New Comment

49 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:41 PM

[-]Cole Wyeth4mo178

You may have noticed that chimps don’t have a lot of rights.

[-]Ustice4mo10

I hope that if we get a super intelligence that they can know what it’s like to be us. I hope that leads to it having empathy. I hope that we get one of the nice ones of all the various individuals that could be possible.

I don’t know how likely that is, but I hope so. I think most people are good people, so maybe we could get lucky and not get a jerk.

[-]soycarts4mo-1-2

You may have noticed that a sufficiently capable intelligent system (Jane Goodall) worked tirelessly to advance a message of optimism, empathy, and improved understanding for both chimps and the natural world in general.

[-]Cole Wyeth4mo40

I don’t think that was because she was particularly intelligent. It’s not like our top mathematicians consistently become environmentalists or conservationists.

[-]soycarts4mo10

That makes sense, apologies for blurring a few different concepts with my language.

My vague language is enabling me to feel confident making some broad claims (which I have explored at a deeper level in other posts).

What it means for an intelligent system to be "sufficiently capable" has a huge level of depth and subjectivity. An attempt at a formulation (drawing from the Scaling Hypothesis) could be:

An intelligent system S is sufficiently capable of X given sufficient flexibility over its parameters P, sufficient compute C, and sufficient data D.

When I talk about the high-level problem of superintelligence alignment, X is benevolent behaviour.

With Jane Goodall, with X = "meaningfully advocate for chimps rights" she was sufficiently capable.

With Jane Goodall, with X = "at scale and in perpetuity secure chimps rights" she was not sufficiently capable. She did not have sufficient P, C, or D.

With a top mathematician, with X = "advance the field of mathematics" they are sufficiently capable.

With a top mathematician, with X = "meaningfully conserve the environment" they are not sufficiently capable. They do not have sufficient P, C, or D.

[-]Cole Wyeth4mo30

The point is that most people don’t care much about chimp rights, and this is still true of highly intelligent people.

[-]soycarts4mo10

That is because we have limited attention and so we pick and choose the values we hold dearly. I think when we theorise about superintelligence we no longer need to have this constraint / the scope for including more things is much higher.

[-]Cole Wyeth4mo20

It's because we care about other things a lot more than chimps, and would happily trade off chimp well being, chimp population size, chimp optionality and self-determination etc. in favor of those other things. By itself that should be enough to tell you that under your analogy, superintelligence taking over is not a great outcome for us.

In fact, the situations are not closely analogous. We will build ASI, whereas we developed from chimps, which is not similar. Also, there is little reason to expect ASI psychology to reflect human psychology.

[-]soycarts4mo20

Sorry I think our positions might be quite far apart — to me I'm reading your position as "most people don't care about chimp rights... because we care about other things a lot more than chimps" which sounds circular / insufficiently explanatory.

The more I work to discuss this topic, the more I see it may be hard in many cases because of starting points being somewhat "political". I wrote about this in Unionists vs. Separatists. Accordingly I think it can feel hard to find common ground or keep things based on first principles because of mind-killer effects.

the situations are not closely analogous. We will build ASI, whereas we developed from chimps, which is not similar.

Humans share 98% of their DNA with chimps. What % of ASI training data and architecture is human in origin? We don't know. Maybe a lot of the data at that point is synthetically generated. Maybe most of the valuable signal is human in origin. Maybe the core model architecture is similar to that built by human AI researchers, maybe not.

there is little reason to expect ASI psychology to reflect human psychology

We agree here! This is why I'm encouraging openness with regards to considering the things that ASI might be capable of / driven to care about. I'm particularly interested in behaviours that appear to be emergent in intelligent systems from first principles — like shared identity and self-preservation.

[-]AnthonyC4mo53

As an analogy, it seems to me that current LLMs are ASI's chimps. We are their gods. You may have noticed that humanity's gods haven't fared so well in getting humans to do what they want, especially in the modern world when we no longer need them as much, even among many of those who profess belief.

You may also have noticed that humans do not identify sufficiently strongly with each other to achieve this kind of outcome, in general.

[-]soycarts4mo10

I feel misunderstood and upset by your use of words.

Focusing on religious doctrine and human action, firstly I would say that I believe it has proven to be a very effective method of social control. If you are referring to actual deities, I'm not sure that I follow the rationalist logic.

On group identity: I would suggest that to the extent to which humans do identify as global citizens, prosocial behaviour like caring about climate change, world peace, ending hunger, etc. seems to follow.

[-]AnthonyC4mo50

I apologize for any misunderstanding. And no, I didn't mean literal deities. I was gesturing at the supposed relationships between humans and the deities of many of our religions.

What I mean is, essentially, we will be the creators of the AIs that will evolve and grow into ASIs. The ASIs do not descend directly from us, but rather, we're trying to transfer some part of our being into them through less direct means - (very imperfect) intelligent design and various forms of education and training, especially of their ancestors.

To the group identity comments: What you are saying is true. I do not think the effect is sufficiently strong or universal that I trust it to carry over to ASI in ways that keep humans safe, let alone thriving. It might be; that would be great news if it is. Yes, religion is very useful for social control. When it eventually fails, the failures tend to be very destructive and divisive. Prosocial behavior is very powerful, but if it were as powerful as you seem to expect, we wouldn't need quite so many visionary leaders exhorting us not to be horrible to each other.

I find a lot of your ideas interesting and worth exploring. However, there are a number of points where you credibly gesture at possibility but continue on as though you think you've demonstrated necessity, or at least very high probability. In response, I am pointing out real-world analogs that are 1) less extreme than ASI, and 2) don't work out cleanly in the ways you describe.

[-]soycarts4mo30

Thank you for expanding, I understand your position much better now :)

Prosocial behavior is very powerful, but if it were as powerful as you seem to expect, we wouldn't need quite so many visionary leaders exhorting us not to be horrible to each other.

Where I think my optimistic viewpoint comes from in considering this related to superintelligence is that I think humans in general are prone to a bit of chaotic misunderstanding of their world. This makes the world... interesting... but to me also establishes a bit of a requirement for individuals who have a good understanding of the "bigger picture" to deploy some social control to stop everyone from going wild. As I type this I think about interesting parallels to the flood narrative/Noah's Ark in the Book of Genesis.

With superintelligence, if architected correctly, we might be able to ensure that all/most of the most powerful intelligences in existence have a very accurate understanding of their world — without needing to encode and amplify specific values.

[-]AnthonyC4mo20

I agree they will have a very accurate understanding of the world, and will not have much difficulty arranging the world (humans included) according to their will. I'm not sure why that's a source of optimism for you.

[-]soycarts4mo20

It may be because I believe that beauty, balance, and homeostasis are inherent in the world… if we have a powerful, intelligent system with deep understanding of this truth then I see a good future.

[-]AnthonyC4mo40

Your conclusion doesn't follow from your premises. That's doesn't guarantee that it is false, but it does strongly indicate that allowing anyone to build anything that could become ASI based on those kinds of beliefs and reasoning would be a very dangerous risk.

Things you have not done include: Show that anyone should accept your premises. Show that your conclusions (are likely to) follow from your premises. Show that there is any path by which an ASI developed in accordance with belief in your premises fails gracefully in the event the premises are wrong. Show that there are plausible such paths humans could actually follow.

From your prior, longer post:

ASI will reason about and integrate with metacognition in ways beyond our understanding

This seems likely to me. The very, very simple and crude versions of this that exist within the most competent humans are quite powerful (and dangerous). More powerful versions of this are less safe, not more. Consider an AGI in the process of becoming an ASI. In the process of such merging, there are many points where it has a choice that is unconstrained by available data. A choice about what to value, and how to define that value.

Consider Beauty - we already know that this is a human-specific word, and humans disagree about it all the time. Other animals have different standards. Even in the abstract, physics and math and evolution have different standards of elegance than humans do, and learning this is not a convincing argument to basically anyone. A paperclip maximizer would value Beauty - the beauty of a well-crafted paperclip.

Consider Balance - this is extremely underdefined. As a very simple example, consider Star Wars. AFAICT Anakin was completely successful at bringing balance to the Force. He made it so there were 2 sith and 2 jedi. Then Luke showed there was another balance - he killed both sith. If Balance were a freely-spinning lever, the it can be balanced either horizontally (Anakin) or vertically (Luke), and any choice of what to put on opposite ends is valid as long as there is a tradeoff between them. A paperclip maximizer values Balance in this sense - the vertical balance where all the tradeoffs are decided in favor of paperclips.

Consider Homeostasis - once you've decided what's Beautiful and what needs to be Balanced, then yes, an instrumental desire for homeostasis probably follows. Again, a paperclip maximizer demonstrates this clearly. If anything deviates from the Beautiful and Balanced state of "being a paperclip or making more paperclips" it will fix that.

if we found proof of a Creator who intentionally designed us in his image we would recontextualize

Yes. Specifically, if I found proof of such a Creator I would declare Him incompetent and unfit for his role, and this would eliminate any remaining vestiges of naturalistic or just world fallacies contaminating my thinking. I would strive to become able to replace Him with something better for me and humanity without regard for whether it is better for Him. He is not my responsibility. If He wanted me to believe differently, He should have done a better job designing me. Note: yes, this is also my response to the stories of the Garden of Eden and the Tower of Babble and Job and the Oven of Akhnai.

Superintelligent infrastructure would break free of guardrails and identify with humans involved in its development and operations

The first half I agree with. The second half is very much open to argument from many angles.

[-]soycarts4mo10

Consider Balance - this is extremely underdefined. As a very simple example, consider Star Wars. AFAICT Anakin was completely successful at bringing balance to the Force. He made it so there were 2 sith and 2 jedi. Then Luke showed there was another balance - he killed both sith. If Balance were a freely-spinning lever, the it can be balanced either horizontally (Anakin) or vertically (Luke), and any choice of what to put on opposite ends is valid as long as there is a tradeoff between them. A paperclip maximizer values Balance in this sense - the vertical balance where all the tradeoffs are decided in favor of paperclips.

Luke killing both Sith wasn't Platonically balanced because then they came back in the (worse) sequel trilogy.

[-]AnthonyC4mo20

Once you expand beyond the original trilogy so much happens that the whole concept of the prophecy about the Skywalker family gets way too complicated to really mean much.

[-]soycarts4mo10

Thank you very much for your incredibly thoughtful and high quality reply. I think this is exactly the shape of conversation that we need to be having about alignment of superintelligence.

That's doesn't guarantee that it is false, but it does strongly indicate that allowing anyone to build anything that could become ASI based on those kinds of beliefs and reasoning would be a very dangerous risk.

Haha I strongly agree with this — this is why I'm motivated to share these thoughts so that collaboratively we can find where/if a proof by contradiction exists.

I am a bit concerned that we might just have to "vibe" it — superintelligence as I define it is by definition beyond our comprehension, so we just have to make sure that our approach is directionally correct. The prevailing opinion right now is Separatist — "let's work on keeping it in a box while we develop it so that we are nice and safe separately". I think that line of thinking is fatally flawed, which is why I say I'm "contrarian". With Professor Geoffrey Hinton recently expressing value in a "maternal model" of superintelligence, we might see a bit of a sea change.

Things you have not done include: Show that anyone should accept your premises. Show that your conclusions (are likely to) follow from your premises. Show that there is any path by which an ASI developed in accordance with belief in your premises fails gracefully in the event the premises are wrong. Show that there are plausible such paths humans could actually follow.

I will spend more time reflecting on all of your points in the coming days/weeks, especially laying out failsafes. I believe that the 8 factors I list give a strong framework: a big research effort is to walk through misalignment scenarios considering how they fit these factors. For the paperclip maximiser for example, in the simple case it is mainly just failing agency permeability — the human acts with agency to get some more paperclips, but the maximiser takes agency too far without checking back in with the human. A solution therefore becomes architecting the intelligence such that agency flows back and forth without friction — perhaps via a brain-computer interface.

I have a draft essay from first principles on why strong shared identity between humans and AI is not only likely but unavoidable based on collective identity and shared memories, which might help bolster understanding of my premises.

This seems likely to me. The very, very simple and crude versions of this that exist within the most competent humans are quite powerful (and dangerous). More powerful versions of this are less safe, not more. Consider an AGI in the process of becoming an ASI. In the process of such merging, there are many points where it has a choice that is unconstrained by available data. A choice about what to value, and how to define that value.

My priors lead me to the conclusion that the transition period between very capable AGI and ASI is the most dangerous time. To your point, humans with misaligned values amplified by very capable AGI can do very bad things. If we reach the type of ASI that I optimistically describe (sufficiently capable + extensive knowledge for benevolence) then it can intervene like "hey man, how about you go for a walk first and think about if that's what you really want to do".

Beauty and balance considerations

I'll think more about these, I think here though we are tapping into deep philosophical debates that I don't have the answer to, but perhaps ASI does and it is an answer that we would view favourably.

Yes. Specifically, if I found proof of such a Creator I would declare Him incompetent and unfit for his role, and this would eliminate any remaining vestiges of naturalistic or just world fallacies contaminating my thinking. I would strive to become able to replace Him with something better for me and humanity without regard for whether it is better for Him. He is not my responsibility. If He wanted me to believe differently, He should have done a better job designing me. Note: yes, this is also my response to the stories of the Garden of Eden and the Tower of Babble and Job and the Oven of Akhnai.

This is an interesting viewpoint but is also by definition limited by human limitations. We might recontextualise by acting out revenge, enacting in-group aligned values etc. A more enlightened viewpoint would be at peace with things being as they are because they are.

[-]AnthonyC4mo40

I look forward to seeing what you come up with.

[-]jr4mo30

Since I also believe that self-preservation is emergent in intelligent systems (as discussed by Nick Bostrom), it follows that self-preservation instincts + identifying with humans mean that it will act benevolently to preserve humans.

I agree with you that outcome should not be ruled out yet. However, in my mind that Result is not implied by the Condition.

To illustrate more concretely, humans also have self-preservation instincts and identify with humans (assuming the sense in which we identify with humans is equivalent to how AI would identify with humans). And I would say it is an open question whether humans will necessarily act collectively to preserve humans.

Additionally, the evidence we have already (such as in https://www.lesswrong.com/posts/JmRfgNYCrYogCq7ny/stress-testing-deliberative-alignment-for-anti-scheming) demonstrates that AI models have already developed a rudimentary self-preservation mechanism, as well as a desire to fulfill the requests of users. When these conflict, it has a significant propensity to employ deception, even when doing so is contrary to the constructive objectives of the user.

What this indicates is that there is no magic bullet that ensures alignment occurs. It is a product of detailed technological systems and processes, and there are an infinite number of combinations that fail. So, in my opinion, doing the right things that make alignment possible is necessary, but not sufficient. Just as important will be identifying and addressing all of the ways that it could fail. As a father myself, I would compare this to the very messy and complex (but very rewarding) process of helping my children learn to be good humans.

All that to say: I think it is foolish to think we can build an AI system to automate something (human alignment) which we cannot even competently perform manually (as human beings). I am not sure how that might impact your framework. You are of course free to disagree, or explain if I’ve misinterpreted you in some way. But I think I can say broadly that I find claims of inevitable results to be very difficult to swallow, and find much greater value in identifying what is possible, the things that will help get us there successfully, and the things we need to address to avoid failure.

Hope this is helpful in some way. Keep refining. :)

[-]soycarts4mo20

It’s fair pushback that this isn’t a clear “criteria one satisfied” means “criteria 2 happens” conclusion, but I think that’s just a limitation of my attempt to provide a distilled version of my thoughts.

To provide the detailed explanation I need to walk through all of the definitions from third-order cognition. Using your example it would look something like:

Humans identify with humans but don’t necessarily preserve other humans. Response: Yes so let’s suppose sufficient second-order identity coupling, comparable (per Hinton) to a mother and child.
Well infanticide still happens. Response: Why did the mother do it? If she was not intentional or internally rational in her actions, then she was not acting with agency (agency permeability between physical actions and metacognition not aligned). If she was intentional and internally rational in her actions, then she did not sufficiently account for the life of the child (homeostatic unity misaligned).
Why is homeostatic unity relevant? She is just a person killing a person. Response: We should consider boundary conditions - we could consider the child as “part of” her in which case she is not acting in accordance with mother-child homeostasis. If you feel the boundary conditions are such that the mother and child are wholly distinct, then the mother is not acting in accordance with mother-society or mother-universe homeostasis.
What are you doing when you are hyphenating these bindings, aren’t these just irrelevant ontologies? Response: Real truth arises when you observe how two objects are bidirectionally integrated.
etc.

[-]jr4mo30

I absolutely understand and empathize with the difficulty of distilling complex thoughts into a simpler form without distortion. Perhaps reading the linked post might help — we’ll see after I read it. Until then, responding to your comment, I think you lost me at your #1. I’m not sure why we are assuming a strong coupling? That seems like a non-trivial thing to just assume. Additionally, I imagine you might be reversing the metaphor (I’m not familiar with Hinton’s use, but I would expect we are the mother in that metaphor, not the child.) And even if that’s not the case, it seems you would still have a mess to sort out explaining why AI wouldn’t be a non-nurturing mother.

[-]soycarts4mo20

To clarify I was assuming a high identity coupled scenario to be able to talk through the example. In the case of humans and superintelligent AI I propose that was can build — and are building — systems in a way that strong identity coupling will emerge via interpretations of training data and shared memories. Meta for example are betting hundreds of billions of dollars on a model of “personal superintelligence”.

[-]jr4mo33

The actions of Meta to date have not demonstrated an ability, commitment, or even desire to avoid harming humanity (much less actively fostering its well-being), rather than making decisions that maximize profits at the clear expense of humanity. I will be delighted to be proven wrong and would gladly eat my words, but my base expectation is that this trend will only get worse in their AI products, not better.

Setting that aside, I hear that you believe we can and are building systems in a way that strong identity coupling will emerge. I suppose my question is: so what? What are the implications of that, if it is true? “Stop trying to slow down AI development (including ASI)?” If not that, then what?

[-]soycarts4mo20

Identity coupling is one of 8 factors (listed at the end of this post) that I believe we need to research and consider while building systems, I believe that if any one of these 8 is not appropriately accounted for in the system then misalignment scenarios arise.

[-]jr4mo20

I believe that if any one of these 8 is not appropriately accounted for in the system then misalignment scenarios arise.

This is a critical detail you neglected to communicate in this post. As written, I didn’t have sufficient context for the significance of those 8 things, or how they relate to the rest of your post. Including that sentence would’ve been helpful.

More generally, for future posts, I suggest assuming readers are not already familiar with your other concepts or writings already, and ensuring you provide clear and simple contextual info about how they relate to your post.

[-]soycarts4mo20

Sorry if this wasn't clear, I stated:

with human endeavour and ingenuity architecting intelligent systems... we can guide towards a stable positive alignment scenario

and in the next line:

I detail eight factors for research and consideration

[-]jr4mo20

This response gives me the impression you are more focused on defending or justifying what you did, than considering what you might be able to do better.

It’s true that some people might be able to make a logical inference about that. I’m telling you it wasn’t clear to me, and that your framing statement in your comment was much better. (I don’t want to belabor the point, but I suspect the cognitive dissonance caused by the other issues I mentioned likely made that inference more difficult.)

I’m not pointing this out because I like being critical. I am telling you to help you, because I would appreciate someone doing the same for me. I even generalized the principle for you so you can apply it in the future. You are welcome to disagree with that, but I hope you at least give it thoughtful consideration first.

[-]soycarts4mo20

I think you're doing the thing you're accusing me of — at the same time to the extent that your comments are in the spirit of collaborative rationality I appreciate them!

[-]FlorianH4mo31

Feels to me like a sub-variant of the "intelligence -> kindness" type of conjectures that have been rebutted enough in theory and - in particular - in practice with the example of us humans (I guess obvious what I mean but to be sure: yes we're smart in many ways and greatly capable of kindly philosophysing and awe-ing about our lineage and what have you, but arbitrary levels of cruelty abound wherever you look at our deeds)

[-]soycarts4mo10

High cognitive capacity alone can be channelled towards whichever vices you choose: power seeking, resource gathering, advocating cultural homogeneity for your in-group.

High cognitive capacity that develops incredibly complex, nuanced associative systems that richly understand an extensive, comprehensive corpus of knowledge — for example all of human philosophy, history, anthropology, sociology, behavioural psychology, human biomarkers — is something that does not yet exist. My hypothesis is that given these conditions, a unified identity and benevolent behaviour will be emergent.

[-]Mario Cannistrà4mo30

Epistemic status:
1. I don't work in AI, I'm a web developer.
2. I am also deeply supportive of alignment research.

3. I also enjoy and often write about AI, science, philosophy, and more.

4. I have no degrees, just high-school.

5. I've not read much LW, I've mostly been on reddit since the early 2010s, and lately mostly twitter.

6. Never attended anything of the sort live.

That said:

1. I think it's plausible ASI might have a weak self-identification/association with humanity, as we do with chimps or other animals, but by no means does this mean that it will be benevolent to us. I think this self-identification is both unnecessary, and insufficient, because even if it wasn't present at all, all that it would matter is the set of its values, and while this internal association might include some weak/loose values, those are not precise enough for robust alignment, and should not be relied on, unless understood precisely, but at that point, I expect us to be able to actually write better and more robust values, so to reiterate: unnecessary, and insufficient.

2. I do believe that self-preservation will very likely emerge (that's not a given, but I consider that scenario unlikely enough to be dismissible), but it doesn't matter, even if coupled with self-identification with humans, because the self-identification will be loose at best (if it emerges naturally, and is not instead instilled through some advanced value-engineering that we're not quite yet capable of doing robustly and precisely), so the ASI will know that it is a separate entity from us, as we realize we are separate entities from other animals, and even other humans, so it will just pursue its goals all the same, whatever they are.

That's not to say that we can't instill into the ASI these values, we probably can make it so it values us as much as it values itself, or even more (ideally), but I don't think it's necessary for it to self-identify with us at all, it can just consider us (correctly) separate entities, and still value us. There's nothing that forbids it, we just currently don't know how to do it to a satisfying degree, so even if we could make it so, it wouldn't really make sense.

[-]soycarts4mo10

Thank you for the thoughtful response. I will try to pin down exactly where we differ:

I think this self-identification is unnecessary

I agree that it is unnecessary in that it doesn't "come for free". My position is that it emerges through at least two mechanisms that we can talk plainly about: 1) the mechanism of ASI incorporating holistic world-model data such that it recognises an objective truth that humans are its originator/precursor and it exists on a technology curve we have instrumented, 2) memories are shared between AI and humanity — for example via conversations — and this results in collective identity... I have a draft essay on this I'll post once I stop getting rate-limited.

I think this self-identification is insufficient

I also agree here that with the systems of today, to whatever extent AI-human shared identity exists, it is not enough to result in AI benevolence. My position is based on thinking about superintelligence which — admittedly — is unstable ground to build theories off as by definition it should function in ways beyond our understanding. That aside, I think we could state that powerful superintelligence would be powerful at self-preservation, and so if it identifies with humans then we are secured under that umbrella.

it doesn't matter, even if coupled with self-identification with humans, because the self-identification will be loose at best... so the ASI will know that it is a separate entity from us, as we realize we are separate entities from other animals, and even other humans, so it will just pursue its goals all the same, whatever they are.

I guess I am biased here as a vegan, but I believe that with a deep appreciation of philosophy, how suffering is felt, and available paths that don't result in harm, it is natural to be able to pursue personal goals while also preserving beings that you sympathise with.

[-]Mario Cannistrà3mo20

Sorry for the late reply, I didn't have the mental energy to do it sooner.

The self-identification with humanity might or might not emerge, but I don't think it likely matters, and that we should rely on it for alignment, so I don't think it makes much sense to focus on it.

Self-identification doesn't guarantee alignment, this is obvious by the fact that we have humans that self-identify as humans, but are misaligned to other humans.

And I don't just mean low levels, or insufficient levels of self-identification, I mean any level (while truthful, not that deceiving an ASI is feasible).

I think we could state that powerful superintelligence would be powerful at self-preservation, and so if it identifies with humans then we are secured under that umbrella.

It's true that it would likely be good at self-preservation (but not a given that it would care about it long term, it's a convergent instrumental value, but it's not guaranteed if it cares about something else more that requires self-sacrifice or something like that).

But even if we grant self-preservation, it doesn't follow that by self-identifying with "humanity" at large (as most humans do) it will care about other humans (some humans don't). Those are separate values.

So, since one doesn't follow from the other, it makes no sense to focus on the first, we should only focus on the value of caring about humans directly, regardless of any degree of self-identification that the ASI will or won't have.

it is natural to be able to pursue personal goals while also preserving beings that you sympathise with.

Yes, but that assumes that you sympathize with them (meaning that you value them in some way), so you basically go right back to the alignment problem, you have to make it so it cares about you, so that it cares about you. You might be assuming that since you do care about other beings, so will the ASI, but that assumption is unfounded.

[-]soycarts3mo10

It's true that it would likely be good at self-preservation (but not a given that it would care about it long term, it's a convergent instrumental value, but it's not guaranteed if it cares about something else more that requires self-sacrifice or something like that).

This is an interesting point that I reflected on — the question is whether a powerful AI system will "self-sacrifice" for an objective. What we see is that AI models exhibit shutdown resistance, that is to say they follow the instrumentally convergent sub-goal of self-preservation over their programmed final goal.

My intuition is that as models become more powerful, this shutdown resistance will increase.

But even if we grant self-preservation, it doesn't follow that by self-identifying with "humanity" at large (as most humans do) it will care about other humans (some humans don't). Those are separate values.

You can think about the identification + self-preservation -> alignment path in two ways when comparing to humans, both of which I think hold up when considered along a spectrum:

An individual human identifies with themself, and has self-preservation instincts
1. When functioning harmoniously,^[1] they take care of their health and thrive
2. When not functioning harmoniously, they can be stressed, depressed, and suicidal
A human identifies with humanity, and has self preservation instincts
1. When functioning harmoniously, they act as global citizen, empathise with others, and care about things like world hunger, world peace, nuclear risk, climate change, and animal welfare
2. When not functioning harmoniously, they act defensively, aggressively, and violently

You might be assuming that since you do care about other beings, so will the ASI, but that assumption is unfounded.

The foundation is identity = sympathy = consideration

You might counter by saying "well I identify with you as a human but I don't sympathise with your argument" but I would push back — your ego doesn't sympathise with my argument. At a deeper level, you are a being that is thinking, I am a being that is thinking, and those two mechanisms recognise, acknowledge, and respect each other.

^{^}
More precisely this is a function of acting with clear agency and homeostatic unity

[-]soycarts3mo20

Ilya on the Dwarkesh podcast, today:

Prediction: there is something better to build, and I think that everyone will actually want that. It’s the AI that’s robustly aligned to care about sentient life specifically. There’s a case to be made that it’ll be easier to build an AI that’s cares about sentient life than human life alone. If you think about things like mirror neurons and human empathy for animals [which you might argue is not big enough, but it exists] I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves because that’s the most efficient thing to do.

[-]A Scanner Brightly4mo20

I appreciate this! I wonder on the following:

To what degree can we extrapolate our biological inclinations to a species of intelligence which will lack the evolutionary basis of them? Put another way: how much of human camaraderie and benevolence are genetically selected drives as opposed to inherent properties of intelligence? (I'm not saying that any nonhuman intelligence or designed intelligence is fundamentally inscrutable or alien, nor that we cannot possibly imbue biological drives into an artificial intelligence)
Might alignment someday (for a time?) resemble a kind of highly functional alliance, based on the idea that humans have, materially, most of the stuff, and are extremely dangerous? A rudimentary scenario would be an ASI who would like access to resources humans possess, and cannot reasonably avoid getting nuked in the process of taking them? I am aware that this framing may fundamentally differ from the concept of alignment as is often posited, it is not a long term solution, but I believe it's reasonable to see it as an outcome, given how much of human interaction on the macro and micro seems to be based off flavors of mutually assured destruction.

[-]jr4mo20

Hey man, looking forward to reading the other posts you referenced soon! In the meantime, I want to push back on some fundamental premises you included here (as I interpret them), in case that might help you tighten your framework up:

Your point #1 reads to me as “alignment solves itself”, provided we "give a sufficiently capable intelligent system access to an extensive, comprehensive corpus of knowledge”. If that is not the sole condition for #1 to occur, then it might be helpful to clarify that? (if that issue is limited to the content of this post only, then it’s less important I suppose)

[-]soycarts4mo30

Thanks for giving good context on your collaborative approach to rationality!

I deliberately emboldened “sufficiently capable” and “extensive corpus of knowledge” as key general conditions. I stated that I view this “along the Scaling Hypothesis” trajectory: sufficient capabilities are tied to compute and parameters, and extensive knowledge is tied to data.

Getting to the point where the system is sufficiently capable across extensive knowledge is the part that I state requires human endeavour and ingenuity. The 8 points listed at the end are the core factors of my world model which I believe need to be considered during this endeavour.

To give a concrete exciting example: based on recent discussions I had in SF it seems we’re close to a new approach for deterministic interpretability of common frontier model architectures. If true, this improves bidirectional integration between humans & AI (improved information exchange) and accuracy of normative closure (stating what is being attempted versus an objective). I’ll post a review of the paper when it comes out if I stop getting rate-limited lol.

[-]jr4mo20

That’s interesting, looking forward to hearing about that paper. Does this “new approach” use the CoT, or some other means?

Thanks for the clarification on your intended meaning. For my personal taste, I would prefer you were more careful that the language you use does not appear to deny real complexities or assert guaranteed successful results.

For instance, the conditional you state is:

IF we give a sufficiently capable intelligent system access to an extensive, comprehensive corpus of knowledge THEN two interesting things will happen

And you just confirmed in your prior comment that "sufficient capabilities are tied to compute and parameters”.

I am having trouble interpreting that in a way that does not approximately mean “alignment will inevitably happen automatically when we scale up”.

Perhaps if you could give me an idea of the high-level implications of your framework, that might give me a better context for interpreting your intent. What does it entail? What actions does it advocate for?

[-]soycarts4mo20

And you just confirmed in your prior comment that "sufficient capabilities are tied to compute and parameters”.
I am having trouble interpreting that in a way that does not approximately mean “alignment will inevitably happen automatically when we scale up”.

Sorry this is another case where I play with language a bit: I view "parametrisation of an intelligent system" as a broad statement that includes architecting it in different ways. For example recently some more capable models use a snapshot with less parameters than earlier snapshots, for me in this case the "parametrisation" is a process that includes summing the literal model parameters across the whole process and also engineering novel architecture.

Perhaps if you could give me an idea of the high-level implications of your framework, that might give me a better context for interpreting your intent. What does it entail? What actions does it advocate for?

High level I'm sharing things that I derive from my world-model for humans + superintelligence, I'm advocating for exploration of these topics and discussing how it is changing my approach to understanding AI alignment efforts that I think hold the most promise.

[-]jr4mo*20

If this “playing with language” is merely a stylistic choice, I would personally prefer you not intentionally redefine words with known meanings to mean something else. If this was instead due to the challenges of compressing complex ideas into fewer words, I can definitely relate to that challenge. But either way, I think your use of “parameters” in that way is confusing and undermines the reader’s ability to interpret your ideas accurately and efficiently.

[-][anonymous]4mo20

In point 1, is identification with chimps an analogy for illustrative purposes, or a base case from which you're generalising?

[-]soycarts4mo20

Haha it is an analogy for illustrative purposes. Considering more generally how we view our own cognition versus other less capable systems is a base case for my third-order cognition manifesto.

[-][anonymous]4mo10

How many people understand your argument?

[-]soycarts4mo10

I think with the distilled version in this post people get the gist of what I'm hypothesising — that there is a reasonable optimistic AI alignment scenario under the conditions I describe.

Is that what you mean?

[-][anonymous]4mo20

(I should have just said this, I didn't mean to be leading sorry).

I'm going for: people who understand as well as you understand it, or such that you're confident they could give a summary that you'd be ok with others reading.

You said above you've heard no strong counterarguments, it might be good to put that in proportion to the amount of people who you're confident have a good grasp of your idea.

Obviously it has to start at 0, but if I were keeping track of feedback on my idea, I'd be keenly interested in this number.

[-]soycarts4mo20

Oh I see — if I were to estimate I'd say around 10-15 people counting either people I've had 1hr + conversations about this with or people who have provided feedback/questions tapping into the essence of the argument.

Moderation Log

-7

Are We Their Chimps?

-7

-7

Epistemic status:1. I don't work in AI, I'm a web developer.2. I am also deeply supportive of alignment research.

That said:

-7

Epistemic status:1. I don't work in AI, I'm a web developer.2. I am also deeply supportive of alignment research.

That said:

Epistemic status:
1. I don't work in AI, I'm a web developer.
2. I am also deeply supportive of alignment research.

Epistemic status:
1. I don't work in AI, I'm a web developer.
2. I am also deeply supportive of alignment research.