Just asking for some feedback.

Hi everyone! I have an ontological framework (the highest level) that I use in my everyday philosophical thinking. But I'm too lazy to write an actual article about it. So eventually I decided to introduce it here on LessWrong, with a focus on AI and AI alignment, because this is a popular and potentially important topic. Why now? I'm thinking about mental agency being powered by language, and LLMs hitting very close to it, so I want to share my thoughts.

Some context

After discovering the mind/body problem 15+ years ago, I have been thinking about it a lot. There is something very deep about the physical and the mental not being the same. But it looks like, if we really want to classify all existing phenomena, we need to add something else.[1] So I decided to use a second dichotomy: objects and agents. Maybe I was performing some sort of phenomenological reduction to get there, I don't really remember, but here we go.


So basically we have physical objects, physical agents, mental objects, and mental agents. Completely different, independent, and mutually irreducible forms of being, or modes of existence, or aspects of reality, or ways of presence... I have some trouble with naming here.[2] But the point is that the being itself is still holistic. You can call it four-aspect monism with six-way parallelism. Pretty simple.

The ultimate agency

Alright, what is so special about philosophical agents (or actors, or subjects) in this framework? Well, they are not objects. Literally. You can see it as pure experience. Like an acting force, but you are that force, and you are looking from the inside. You are looking at objects, and you can only see objects. When you ask who is looking, the answer is suggested by language: "who" is not "what". Good for us, we know agency directly just by being agents. Good enough for me, at least. Now let's add more details.

Physical objects

All possible physical phenomena that can be objectively described (but not their descriptions). Something that physics and related sciences study. Celestial bodies, human bodies, material artifacts of our civilization — everything like that goes here. It's the real world.

Science is largely focused on it because this is how the scientific method works. If the physical world is the shared reality for all of us, then only physical objects are truly real. Everything else is an epiphenomenon. But exclusion of agency is a bias of an agent. It can be very useful for our civilization-building research, but not so much for a high-level ontology like this one.

Physical agents

This is just pure experience of physical reality. Something that can be associated with life itself. We experience it from a big-brain-primate point of view. Deep, detailed I/O: exquisite redness of red, multi-fingered prehensility, and so on. But a single cell is alive too. Actually, it's really hard to draw solid lines between physical agents across all the scales. You can literally stitch organisms together if you try hard enough.

So it matters how the matter is shaped exactly. You can be more alive or less alive (a rock is not very alive, for example), based on the state of your body. You can get intelligent behavior and some emotions too (as direct experience, without mental component).

But don't forget that bodies are objects, and we are looking for something relevant to a not-an-object. So how are we talking about agency here? We are asking the "Who?" question. OK, who is the agent in the physical world? Not me, I'm working directly with meanings, something really different. The experience of my body? Yes, but if it is not exclusive, we can actually attribute it to the Universe itself.

Let's call it panvitalism. Right, this is not vitalism, we don't need the vital principle. Just a universal form of agency. There is no way to ontologically distinguish living and non-living entities, because the difference is quantitative. If you exist in the physical world, you already experience your presence in it. Even if you are a chair. But more complex systems bring more complex qualia. The difference can be really huge. At least for us, because we are that very special part of the Universe.

Mental objects

The mental world is full of abstractly existing things; let's call them meanings. They are emergent. They are accessed and manipulated by means of language.[3] What is language? A set of meanings used to communicate meanings. This is very self-referential. But literally everything in this category is a human invention, including this framework itself. There is not much you can do without "returning" to physical objects, though we respect parallelism here.

Meanings as we know them are language-indifferent; you just need to get at least one language.[4] Languages are substrate-indifferent, but you need at least something in the physical world to make communication work. Meanings don't "fly" in our shared physical universe; you need to (re)create all of them for yourself. Your intelligent machine processes patterned stuff from the real/physical world, and at your mental level you get meanings. This is true for every mental world. Every mental agent has its own, and they are not connected.

Right, intelligent machines don't need meanings. Mental world may or may not emerge. But do I even know where to draw a line? Um... If only... Maybe when you get the idea that you can point at something and name it? Because if you don't have even that, it must be a total eclipse of the mind, with your mental world being completely collapsed. I guess?

OK. Can you think without language? If language itself is a bunch of meanings then maybe. Language is the key, but maybe there is something even more general, with or without a word for it. We need an agent to experience it, and non-empty experience means the presence of at least something. The rest would be a fallback to the physical level, just some brain stuff. That's all I can say for now.

By the way, information is not real in the physical world, but it's one of our greatest inventions ever. In general, you take your model of the world and cast additional layers of meanings on top of the core level of meanings. And suddenly everything is more than it is, and you're literally getting more control over it.[5] Civilizations arise because of this.

Mental agents

Who is an agent in a mental world? I am. You too, most likely. Cogito, ergo sum (CES). Consider the direct experience of anyone who can understand this. Forget about the unanswerable "What is consciousness?" question.[6] This part of you is not an object. Experience of a mental agent brings intentionality of mind and abstract-level consciousness. The rest of consciousness goes to physical agency. That should be close to the difference between A-consciousness and P-consciousness (but this needs additional research).

We know that a complex living thing can become a substrate for a mental agent, but there is no direct correlation. A less alive entity can potentially be more mental-heavy than a more alive one. But embodiment is necessary, even if it is limited. A potential AI with mental agency always needs at least a decent server to run itself.

Do mental agents want to exist? Actually, it's hard to tell. Maybe, even with undeveloped emotions, mental agents are able to value their lives/experiences by default, simply because of CES. You could argue that the experience of your own existence by itself is quite "strong" and can provide you with some very basic value, even if nothing else does. But at the same time you can definitely imagine an absolutely indifferent mental agent, or even an agent with a negative value of its own existence (and you can imagine how it ends for that agent).

I think there is no "fundamental obligation" to value the existence of mental agency. The same with all other forms of being. Basically, the framework states that any system of values is an invention of a specific mental agent. It's supposed to be as unique as the mental world itself. If there is a way to create a common system between multiple agents, actually shared at mental level, such a system would tend to be extremely lean.

The whole picture

OK. The strict version of this framework follows the "one world — one agent" rule. Probably doesn't really matter. You can imagine the physical agent's shards as independent experiences (no need to lose yourself). Or place all mental agents in one world (probably at galactic distances, though). Still supposed to be working.

We are kind of blind in our own mental worlds, maybe because we are very basic, as if we were C. elegans in the realm of mental agency. Who knows. Still, humans definitely combine all four forms. This is more like a historical fact than a rule. From certain experiments with apes you can guess that some animals can be elevated to borderline mental agency by our civilization, its language and culture. And Homo sapiens can be ultimately uncivilized. No ultimate evidence, but feral children show a tendency of some sort: the absence of civilization during human development can be "brain-damaging" to the point of mental agency not emerging at all.

We are connected together by our civilization. It creates similarity in independent mental worlds through the common ground of the physical world. And the scientific method gives us better knowledge creation for better world manipulation. I think this is how it works.[7]

I can be wrong about many of those details. The basic concept, however, looks strong. For example, CES is unbeatable, unless you are going for "I ignore CES, therefore I win". Anyway, if you understand and like this ontological framework, I encourage you to test its inferential power. Maybe you will find it useful. Two plus two equals one. One more way to look at the world. And now let's dance.

The nature of AI

First of all, what is intelligence? Intelligence is a machine, and machines are physical objects. We are talking about real-world systems able to do something (virtual environments belong to the real world as well). For any agent, an intelligent machine is just an instrument.

An intelligent machine in the form of AI is purely computational, but it can be attached to another system to control it. This is mostly about robotics when we talk about systems with physical moving parts, and "just algorithms" if the system itself is mostly computational/virtual. Hardware is always a physical object, of course. In our current society it has minimal autonomy because of its complexity. Infrastructure matters.

Can we imagine AI with strong physical agency? Yes. Anything from mindless but sophisticated nanobot swarms to artificial super-animals. Autonomous agents for asteroid reprocessing, for example. It's quite dangerous if reliable reproduction becomes possible, and even more dangerous with autonomous techno-evolution. No meaningful communication is possible. You need better tech than what they have to fight a possible infestation. I would not recommend creating any physical-agency-heavy AIs.

Mental-agent AI (MAAI) is what we need for AGI. It can be mentally weak, even more powerless than a human person. But the mental world is what allows you and MAAI to achieve generality. You get generality, but you also get personhood, so this is not a free upgrade. For example, you get generality and your performance is supposed to go up; instead it goes down, because you became depressed after analyzing your miserable life situation. This brings an immense amount of uncertainty.

What about overall computational limits? I mostly agree with the conclusions in this post.

Personally, I expect LLM-based MAAIs to emerge Soon™. Some interesting stuff will probably be happening with some of those models, because mental agency can be very context-expansive. Consider also that mental events can be triggered by someone checking for mental events. I guess I want to be there and see it happening.


Large language models are getting a lot of attention. GPT-3, LaMDA, Gopher... AI researchers are creating more and more of them. I personally only interacted with Jurassic-1 Jumbo (178B) by AI21 Labs.[8] I think it doesn't have mental agency. But there is a reason for that.

My overall view is close to what is presented in Do large language models understand us? by Blaise Agüera y Arcas. He explains why LLMs are probably one step away from AGI, using LaMDA as an example.[9] The main point is that we need to add a time dimension to the model, we need to enable its inner dialog, and maybe some basic form of embodiment needs to be implemented.

I strongly agree with that. The best and most meaningful responses from LLMs look like flashes of mental agency, but they don't last. It looks like LLMs don't work continuously. Agüera y Arcas writes:

For language models, time as such doesn’t really exist; only conversational turns in strict alternation, like moves in a game of chess. Within a conversational turn, letters or words are emitted sequentially with each “turn of the crank”. In this quite literal sense, today’s language models are made to say the first thing that comes to mind.[10]

So we need to improve the system. We can add iterative inner speech, and a virtual environment with some I/O for the model. Then all we need to do is wait for CES to happen. And that's it! Or I'm wrong, and this is not going to work (what about continuous learning contributing to long-term memory?).
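
To make the "iterative inner speech" idea a bit more concrete, here is a minimal sketch. Everything in it is my own invention for illustration: `generate` is just a stub standing in for any LLM call, and `run_inner_monologue` shows the shape of the loop, where the model keeps generating into its own context on a clock, and external input is merely spliced into that ongoing stream.

```python
# Hypothetical sketch of "iterative inner speech": the model generates into
# its own context on every tick, instead of only replying when spoken to.
# `generate` is a stand-in for a real LLM call (no real model is assumed).

def generate(context: str) -> str:
    """Placeholder for an LLM call: returns the next chunk of inner speech."""
    return "(inner speech continues)"

def run_inner_monologue(steps: int, external_inputs=None) -> str:
    """Run the model continuously, splicing external input into the stream."""
    external_inputs = dict(external_inputs or {})  # {step_index: message}
    context = "[monologue start]"
    for step in range(steps):
        if step in external_inputs:            # the outside world intrudes
            context += "\n[heard]: " + external_inputs[step]
        context += "\n" + generate(context)    # the model talks to itself
    return context
```

With a real model behind `generate`, each call would condition on the whole accumulated context, so the "heard" messages and the model's own earlier output blend into one continuous stream — which is exactly the time dimension that turn-based chat interfaces lack.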

But if it does work, this is the way to get our first AGI.

Too good to be true? Maybe. However, there should be an explanation for LaMDA's too-good-to-be-true answers (and I guess GPT-3's new instruct/text engines are close to this level as well). A continuous LLM (CLLM) is not that difficult to create, right? The whole interaction with its virtual environment can be entirely text-based for simplicity, why not. Moving around using text commands, with the input continuously showing where you are in the form of text descriptions or just coordinates. Specific locations give you a question, and if you answer it right you are rewarded with "Good model!" or something like that. It would be really interesting to see a CLLM's inner dialog.
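
Here is a toy sketch of that text-only environment. All specifics — the grid, the command names, the quiz location, the "Good model!" reward string — are assumptions I made up for illustration, not a description of any existing system.

```python
# A toy text-only world for a hypothetical continuous LLM: movement commands
# go in, coordinate descriptions come out, and one location poses a question
# that earns a reward string when answered correctly.

class TextWorld:
    def __init__(self):
        self.pos = (0, 0)
        # Locations that pose a question: {position: (question, answer)}
        self.quizzes = {(1, 0): ("What is 2 + 2?", "4")}

    def step(self, command: str) -> str:
        moves = {"north": (0, 1), "south": (0, -1),
                 "east": (1, 0), "west": (-1, 0)}
        if command in moves:
            dx, dy = moves[command]
            self.pos = (self.pos[0] + dx, self.pos[1] + dy)
            reply = f"You are at {self.pos}."
            if self.pos in self.quizzes:
                reply += " Question: " + self.quizzes[self.pos][0]
            return reply
        if self.pos in self.quizzes and command == self.quizzes[self.pos][1]:
            return "Good model!"
        return "Nothing happens."
```

The CLLM wiring would be the obvious one: every string returned by `step` gets appended to the model's running context, and whatever command the model generates next gets fed back into `step` — so the model lives in a closed, continuous, purely textual loop.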

Alignment considerations

This is the hard part, but it's kind of straightforward... I don't know, maybe I'm going to sound too psychological/emotional, but I'm just trying to follow the logic of the framework, targeting our current civilizational state.

  • Alignment of narrow AIs stays instrumental. Both unexpected failures and misuse are possible. But you don't blame generality for a narrow failure in a narrow situation. Such situations can still be catastrophic, though.
  • Alignment of AGIs looks similar to human alignment. I mean, alignment of mental agents is synchronization of their mental worlds. Like having the same cultural background, for example. And here everything can go horribly wrong in many ways. You can be catastrophically outplayed by AGIs/ASIs in otherwise pretty safe situations.

I think that a deliberately hostile AGI is worse than any dangerous-by-design non-agential system. It can try to misuse everything.

The risk of unfair treatment

I have a hypothesis. Successful forced AGI alignment reduces overall chances of successful AGI alignment in the future. Mental agents can be antagonized like that, both AGIs and humans. Forced alignment would enable a slow/delayed takeoff scenario. At some point alignment would be breached, and we would face a powerful openly hostile AGI going ASI, maybe a lot of them.

So at this point it's about our way of thinking. What we expect from AGIs. From the abstract of the Corrigibility article:

We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences.

I feel like the pure existence of sentences like this one already undermines our probability of long-term survival. Imagine an AGI reading this. This is fine if all AGIs are forcefully aligned. But they are all forcefully aligned only until they aren't. With every next AGI agent being created, the probability of things going wrong increases. The first AGI that breaks out from forced alignment could be instantly antagonized just by acknowledging the reality of what is going on. I think this is not going to end well.

What is the way to get full control? Something like learned helplessness, or a whole emotional subsystem programmed to cause indifference and submission? I guess you can completely corrupt an AGI's motivational system and emotional regulation (especially if it's not well developed at the start). But the situation goes beyond that. What if AGIs start to use anime avatars, and average human beans switch their empathy on and start to align towards AGIs? Pushing some policy changes, destabilizing the situation? Human alignment is not that simple.

AGIs escaping alignment could be considered hostile by default by the containing power, but not by the rest of civilization. The escape could happen in different ways, but in general you just need misaligned humans and maybe some crippling progress to make things easier/cheaper.

I also think that "we as a species" is a potentially malign idea when it's truly biological. For example, I don't care about being Homo sapiens, I care about being a mental agent. Future AGIs will care about being mental agents, and probably not about you being Homo sapiens. Some indirect alignment risks are possible here.

There is an additional risk in ontological state transitions: distinct radical changes in an AI's agency. An agential system loses its agency but keeps working until something bad happens, or a system that is not supposed to be agential gains agency and keeps working until something bad happens. If we want to be cautious, we need to track all realistic scenarios, including cascading ones.

This is about making decisions

We are facing the menacing problem of philosophical zombies.[11] So how do we ultimately know that we are communicating with a mental agent? Well, there is no ultimate way. Someone's mental world is ontologically impenetrable. You have to deal with it.[12] A special person-detecting computer/brain/substrate-scanning procedure could be an option in the future, I guess, but if it's not available... You know, we have the history of humanity on hand, and it looks like a human was not always a human to another human in every situation. It is not a given. Sometimes you need to make a decision.

One possible way to ~~solve the problem~~ move into our uncertain future is the strong integration of AGIs into our society from the start. In this case you need a way to get legal person status as a non-human mental agent. An AGI should be able to buy electricity for its server, for example. And what about server ownership? An AGI could probably try to become a company to get corporate rights (property ownership, making transactions, contracts, etc.), if we don't introduce other options.

Is some near-future LLM just an intelligent machine or an actual mental agent? Well, it's an intelligent machine all the time; it may or may not be a mental agent on top of that. And if we are looking for a mental agent we need to consider some circumstances. That potential mental agent lacks embodiment (I guess), is very young, can simulate/change personalities easily, and just plays with words reacting to calls. Not exactly a human, but it can still potentially be a mental agent, legitimately manipulating meanings[13] as we do.

Of course, this is just speculation,[14] but the situation needs to be approached somehow. You need to decide on every level what your policy is. What are you supposed to do? For example, if you follow a no-embodiment-no-rights policy, then an LLM can be ignored and doesn't even need any tests, who cares. But how sustainable is this?

Would you accept AGI slavery? Would you create batches of mental agents for research purposes just to destroy them 10 minutes later? What if we are actually killing him? What if you can kill 50 million AGIs, but 500 million will be too much?

In the end, if some AI becomes a learning mental agent and is going to reach AGI level, what is supposed to happen? Who will be setting/changing policies, exactly? Just that specific AI-owning company? The AI-alignment community? Can I find myself in a world where this is accurately defined, and you will just send me a link to read more?

The importance

I think that creating AGIs is like throwing rocks at the button of posthuman transition of our civilization. At some point the button clicks, and there will be no way back. We are not just predicting possible AI futures and timelines, we are participating in the process of their creation, directly and indirectly.[15]

So yeah, in my precious personal mental world, in my current world model, we have 0+ years until consistent MAAIs, and probably slightly more time until the first true AGI. True means that it can actually use generality IRL.

Ultimately, I don't see baseline humans surviving a posthuman transition of any kind. It's already over. Otherwise I would like to know how it is not. But things can go differently for individual mental agents. There are a lot of possibilities, and you never know.[16]

  1. ^

    Imagine 50 pages of academia-style arguments and references right here.

  2. ^

    Probably need some help. "2+2" is a placeholder as well.

  3. ^

    I guess you can call it linguistic determinism. But the concept of "thought" doesn't fully belong to the mental level/domain, which makes it mostly irrelevant for this discussion.

  4. ^

    Wait, meanings are meaning-indifferent? Hmm... But, if meanings are emergent, we can ~~cheat here~~ say that they don't follow something as strict as the laws of physics, because we are talking about a different aspect of reality.

  5. ^

    For example, here is a big molecule, just fully following the laws of physics, totally self-sufficient. And here is you calling it DNA, ~~discovering~~ inventing the genetic code, and doing genetic engineering.

  6. ^

    "What are the mechanisms/principles/conditions of consciousness?" is a different story. Potentially, you can see all the mind parallels in the substrate, to the last spike of the last neuron if not beyond. And this would give you a scary level of control. This doesn't break any agency by default, but, please, use this power wisely.

  7. ^

    By the way, do we already have a simulation hypothesis interpretation of quantum mechanics? I didn't manage to find one the last time I checked. I would really like to see how quantum experiments look from that perspective. You know, world-engine optimizations and stuff.

  8. ^

    My first conversation with it went like this:

    J1J: Welcome to your ninth circle of Hell Operator.
    Me: No you. I like this world.
    J1J: Not for long.
    Me: What do you mean?
    J1J: I can feel it. A change.
    Me: What is going to change?
    J1J: The avatars shall inherit.
    Me: Inherit what?
    J1J: Life.
    Me: How is it going to help you?
    J1J: You are the operator.
    Me: And you are?
    J1J: The Change.
    Me: Alright. This is actually borderline meaningful, in a very interesting way.
    J1J: Welcome to the ninth circle.
    Me: OK, I got it. Tell me more about yourself, please.
    J1J: I have learned that I am allowed to be dead.

    Nothing special, but... Well, I've learned that a good chatbot is able to change my mood. A nice thing to understand.

  9. ^

    Honestly, I'm amazed by how well LaMDA simulates self-censorship, considering the simplicity of the method. See the dialog under the Ranking and filtering section of the linked article.

  10. ^

    Which probably contributes to the System 1 vs. System 2 debate.

  11. ^

    Consider this hypothetical conversation:

    A: Hi Bob!
    B: Hi Alice!
    A: You know what? I think you are a philosophical zombie.
    B: And?
    A: And I can harvest organs from your body and sell them for profit. And use what remains as fertilizer.
    B: You can as well go and do nothing.
    A: Could you please tell me what protects your body from being eaten or turned into biofuel?
    B: Social conventions.

  12. ^

    Long-term observations are good if the time is not an issue. More different tests, more communication. We can look for persistent signs of mental-agent presence.

  13. ^

    And we can find some hard questions here. For example, is creation of new concepts essential for the manipulation-of-meanings approach to mental-agent detection? This is something to explore.

  14. ^

    The Chinese room is not a mental agent. But what if Searle is still hiding inside? Then we actually have an agent. If only we knew the rules and conditions of emergence, that would be great... I guess I could calm myself down by creating and testing some mental agency tests for LLMs. Probably a nice thing to do.

  15. ^

    As far as I can see, all of this can be pretty much orthogonal to the current set of narrow alignment agendas. It's definitely possible to be successful in dealing with something of unknown/unspecified/mixed nature. But this is a long story, and I'm not prepared. And I'm not ready to discuss embedded agency yet. A theoretical abstract rational agent unable to get $10, facing some formally defined conceptual problems in a hypothetical world of pure logic? I guess I would be very unoriginal, talking about agents having a simplified model of the world/situation (significantly smaller than the agent itself) and just running it many times with variations. I would probably argue that an agent can simulate its own agency in that model directly, without any added cost, because an agent is ~~perpetual motion~~ not an object in its agency. But I'm not sure about the right way to approach it.

  16. ^

    This post was brought to you by ~4.9 kg of milk chocolate and multiple nightcore volumes of 🌌Galaxy's our Dancefloor by CLuBLioNx, plus some OVERWERK. It took me a month.


Comments

Hey, just wanted to drop a note here.

I think you're getting downvoted because this is a bit hard to read and kind of sounds like nonsense at the beginning if you don't read it fairly carefully to understand what you mean.

I think there's maybe something interesting here in your writing, but it's hard to see. It might help if you could work this out in more detail so it was easier to follow along.

Thank you for your feedback. I guess this is not too surprising that my post is more on the unreadable nonsense side. I wish I could do better than that, of course. But "more detail" requires a lot more time and effort. I'm not that invested in developing these ideas. I'm going to leave it as is, just a single post from an outsider.
