Sahil has been up to things. Unfortunately, I've seen people put effort into trying to understand and still bounce off. I recently talked to someone who tried to understand Sahil's project(s) several times and still failed. They asked me for my take, and they thought my explanation was far easier to understand (even if they still disagreed with it in the end). I find Sahil's thinking to be important (even if I don't agree with all of it either), so I thought I would attempt to write an explainer.
This will really be somewhere between my thinking and Sahil's thinking; as such, the result might not be endorsed by anyone. I've had Sahil look over it, at least.
Sahil envisions a time in the near future which I'll call the autostructure period.[1] Sahil's ideas on what this period looks like are extensive; I will focus on a few key ideas about technology. It's a vision that is, in many places, both normative and predictive; I may sometimes omit clarification of whether I'm making a normative or predictive statement.
One conceptual upgrade I've experienced as a result of interacting with Sahil's philosophy has been substituting "high-actuation" in most places where I would previously have reached for the concept of "automation".
I don't know about everyone else, but my experience of the philosophy of technology prevalent in 1990-2020 (i.e., the decades during which I was in school) treated "automation" as the unifying magic behind technology. The Industrial Revolution was about automating. Machines were conceptualized as replacements for humans (and working animals like horses). This concept of tech-as-automation was especially prevalent in computer science; old machines automated only physical labor, but computers promised to automate everything else. The virtue of the programmer was said to be laziness: automating everything you can.
"High-actuation" seems to capture the unifying principle of technology better than "automation". Automation is necessarily automation-of-X. For example, photographs are (in a limited sense) automatic paintings: photography automates the process of putting pigments/dyes on a surface to accurately reflect what is seen visually. However, there is no similar "automation" story for painting itself. Painting is clearly a human technology, but it doesn't fit in the "automation" mold. There's no previous activity which painting automates.
Painting and photography both fit well within the high-actuation framework. Paint is a high-actuation medium: it allows color to be applied with great freedom and accuracy, limited only by the artist's skill.[2] Photography allows even easier actuation, reducing the amount of technical skill required, provided that you are interested in arranging color to produce highly literal representations of visual reality, or specific but less-literal representations which can be created through various photography techniques (eventually including digital manipulation).
Another example is plastics. Plastics aren't automation-of-X. You can tell an indirect story, where plastics enable further automation. However, the concrete reasons for that are high-actuation in nature. Plastics can be easily molded to any shape. Furthermore, plastics can be mixed with other plastics and additives to achieve a wide variety of appearances and material properties.
High-actuation is also a mentalistic property. The mind can "imagine anything" cheaply within its confines. Canvases are a good analogy for the imagination, in this way. Similarly, photographs are a good analogy for memory. Computers provide an even better analogy for the mind, since computers can actuate a wide variety of processes (computations) rather than only static pictures.
(This subsection is heavily influenced by @particlemania & credit goes to them for many of these thoughts.)
Modern AI (both the technology and the field) has been, in a myriad of ways, influenced by the economic picture of reality, with singleton-like agency.[3] In the term AGI, the "general intelligence" is interpreted roughly as follows:
General Intelligence (or "Agency") is something you can drop almost anywhere, & it thrives.
Arguably, what we want out of technology is more like the following:
Co-agency is something you can drop almost anyone into, such that that person thrives.
To convey the difference with a cartoonish analogy: aligned agency is like a big robot superhero who fixes everything, while co-agency is like a robotic exosuit which you climb into to become the superhero yourself. This is, hopefully, a clarification of the concept of "agent AI vs tool AI".
(I intend "co-agency" to connote both the mathematical concept of duality, like "co-product", and cooperation, like cooperative inverse reinforcement learning (CIRL).)
Perhaps one contribution to the difficulty of the AI safety problem is that we are stuck in the "agentic" frame inherited from economics. Maybe one agent aligned with another agent is a somewhat anti-natural concept.[4]
The agentic frame on AI fits with the automation meme I pointed out: an agentic AI is a sort of automated human, engineered to replace human work. A co-agentic AI would instead be engineered with the high-actuation philosophy in mind, designed to enhance what humans can do.
So, that's one thing Sahil is trying to do: shift the vision of what AI can and should be from an agentic one to a co-agentic one. (Sahil uses "infrastructure" to point at this aspect.)
The main artifact of the 'agentic' camp which Sahil critiques is the chatbot interface. The chatbot interface is both lazy and dangerous. It masquerades as a human, creating risks related to people treating it as a human. It aims to do everything, but in reality, humans have to put a lot of care into making specific functionality (such as programming) work well; everything else, it does only poorly, by default.[5] The generality serves as an excuse to not focus on more specialized interfaces which suit their particular use better.
In order to push back on the anthropomorphization inherent in chat interfaces, Sahil suggests that we call the activity of interacting with AI via chat interfaces talkizing. The relationship between talking and talkizing is meant to parallel the relationship between rationality and rationalization: rationalization is a "phony" version of rationality, a cheap substitute, perhaps intended to fool you. Instead of "I talked with ChatGPT about..." one would say "I talkized with ChatGPT about..."
Computers are about to become much more whatever-we-want-them-to-be. AI programming assistants are starting to get to the point where users can create custom interfaces on-demand. Sahil calls this "soloware"[6], emphasizing the idea of one person making software specifically suited to themselves. For example, Ben Vallack has made a series of YouTube videos (1,2,3,4,5) about using AI-assisted coding to replace Apple's photography software. I've made some very basic soloware myself with AI assistance. I know someone else who has finished several soloware projects.
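(To make the flavor concrete, here is a purely hypothetical sketch of the kind of basic soloware I have in mind, written the way one might ask an AI assistant to write it. It is not anything Sahil or Ben Vallack actually built, and the folder names and file extensions are arbitrary assumptions: a script that sorts a personal camera dump into year/month folders by file date.)

```python
# Hypothetical soloware sketch: sort a personal camera dump into Year/Month
# folders using each file's modification time. Standard library only.
import shutil
from datetime import datetime
from pathlib import Path

SOURCE = Path.home() / "CameraDump"   # assumed location of the unsorted photos
DEST = Path.home() / "Photos"         # assumed destination library
EXTENSIONS = {".jpg", ".jpeg", ".png", ".heic"}

def sort_photos() -> None:
    for photo in SOURCE.iterdir():
        if photo.suffix.lower() not in EXTENSIONS:
            continue  # skip videos, sidecar files, folders, etc.
        taken = datetime.fromtimestamp(photo.stat().st_mtime)
        target_dir = DEST / str(taken.year) / f"{taken.month:02d}"
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(photo), str(target_dir / photo.name))

if __name__ == "__main__":
    sort_photos()
```

The script itself is trivial; the point is that the marginal cost of one-off, one-user tools like this is collapsing, so the interesting question becomes what you would build if every such frustration were this cheap to address.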
I would perhaps add the term "groupware" for software custom-tailored to small communities. (I'm not sure what term would be appropriate to encompass both soloware and groupware.)
One of Sahil's central claims is that people aren't thinking hard enough about the consequences of this shift.
Of course, the above only tracks the implications in one small area. There are many more to think about.
Is there really so much "free money" to pick up, with respect to better interfaces? I think yes. It isn't just a matter of a few places where commercial interfaces inconvenience the user by making it easy to subscribe but hard to unsubscribe, easy to import stuff but hard to export, etc. Attention-capturing mechanisms are all over the place, and as a result, our attention gets captured all the time. Even ignoring that, I think there are a lot of ways in which our interfaces are just bad; I would guess that UI and UX professionals are constantly noticing things that could be better if they had control of all their interfaces. (I certainly feel that way myself.) Furthermore, even if software companies were doing a perfect job making software interfaces optimized for the typical user, there should be a lot of low-hanging fruit in soloware, customizing things to the specific user.
Sahil wants to create a community for soloware.[7]
One of Sahil's many mottos for this stuff is "radical adaptation for radical adaptivity". We can see a radical future coming, so we should prepare for it now. It is worth taking the time to make new affordances. We have been living in a regime in which many things are impossible or impractically difficult. To adapt to the new regime, we need to think hard about what is now/soon possible. Exactly what is becoming high-actuation?
Sahil's goal is to build a community around this. One way I think about Sahil's work is that he is trying to build a new "school of design" for the coming age: a group of people taking this possible future seriously and building a positive vision of what the technology could be, demonstrated largely through examples.
There's been a recent trend of sharing system prompts amongst rationalists, publicly and privately. Sahil's project suggests the same kind of community-driven personalization, but on a deeper level and about more important things. User interfaces drive your attention, drive your affordances, drive your relationship with ideas and thinking. Platforms ranging from Facebook to LessWrong offer feeds of information we browse. Platforms like Discord and Slack mediate our conversations. Every time you notice a UI frustration, save that idea. UI frustrations are more actionable than they ever have been, and seem set to soon become even more actionable. You can have your dream UI. As a bonus, you can take back control over your data and your attention.
This sounds like a big-tech culture war thing, analogous (and related) to the open source movement. What does all of this have to do with AI Safety?
First, in the autostructure period, we get high-actuation conceptualization. "Conceptualization" here includes both math and philosophy. This is the titular "Live Theory" of Sahil's sequence on this stuff; what happens to our concept of "theory" when post-rigorous reasoning becomes cheap?
Per Sahil, it could reframe how we imagine, understand, and respond to AI risk issues. Once we are equipped with a new ontology of knowledge-artefacts that let us capture bespoke concepts, we might scalably collaborate on notions that resist standardization. This is speculated to be especially true for difficult-to-abstract philosophy concepts relevant to alignment, like "deception" and "power", that seem highly contextual in nature.
Second, due to other things Sahil believes (which we'll get to later), Sahil expects that the capabilities required for AIs to competently do power-seeking or treacherous turns will come later than other advanced capabilities. Whether this interim period goes by quickly or slowly, it might be critical to shaping the future, and it demands strong design consideration. The alignment target is a particular relationship[8] between humans and AI. This cannot be engineered at a distance. A relationship has to be pursued close up.
A person's agency is distributed throughout their body, and beyond. An infection will be attacked by the immune system, even when it is too small for any signs to be consciously noticed. If the person notices their own infection, they may take steps to treat it. Family members may help. If things are looking bad, a doctor might get involved. If things look really bad, more doctors may be involved, and there may be an invasive procedure. At the societal level, there's coordinated medical research, laws and national programs relating to medicine, etc.
Similarly, if you drive on the wrong side of the road, maybe others in the car will yell at you, and people will honk. If somehow you continue to do it successfully for longer, police will get involved. If you avoided being stopped by the police effectively enough, one could imagine higher forces such as military getting involved.
The point is, there's a distributed network of care, with different levels of escalation for problems. This distributed network of care even extends beyond humans to animals and plants, to some degree.
We don't need to model networks of care as made up of discrete agents. We just need to know what kinds of care they are adequate for. Networks of care are only boundedly good at protecting what they care about. They can be fooled, overpowered, outrun, avoided.
Think of "robust network of care" as something like an alignment target. The objective of the good technologist should be to build and apply technology in ways that are good for the health of our networks of care, preserving or increasing the extent to which all the cares in the world are met with actuation.
We can describe all AI risks as indifference risks: risks that arise when something powerful (an AI, or a human empowered by AI) is not moved by concerns that matter.
Note that this is a broad sense of indifference, which means something like not-corrigible-to. Even someone who hated me would be indifferent to me in this sense; they are not moved by my preferences in the way I would wish them to be. This is not a value-neutral notion of indifference. It contrasts with appropriate integration into a network of care. Indifference to a concern means inadequacy of the overall network of care to properly represent and care for that concern.
Sahil's approach to indifference risks depends on the idea that in the near future, there will be a large shift in what constitutes the "hard part" of alignment. Formalization of informal ideas will not be the hard part. AI will enable not just automated proofs, not just automated conjectures, but also automated formalization of informal intuitions. We can't just "formalize human values" in one go & get something good enough to hold up to arbitrary optimization pressure. But if we avoid agency, we can create systems which increasingly integrate into our networks of care (by focusing on doing well in user-specific contexts, rather than trying to solve the whole AGI alignment problem in one go).
Of course, all of this only makes sense predicated on the assumption that AIs won't autonomously decide to betray us (in the next few years, that is). If the AIs are formalizing our values in a sneakily incorrect way, they're not really integrated with our network of care. So, for Sahil's vision to make sense, autonomous risks from AI have to be mild or come later.[9] Sahil has a lot of detailed thinking about why this future is plausible, but I'll boil it down to one argument: agency is complex.
I've stated that Sahil expects the kind of agency that leads to power-seeking, self-protection, and other such autonomy risks to be a capability that will not emerge in the next few years, even as AI becomes increasingly capable in other ways. Why?
The basic argument in favor of "computers can be agentic like humans" is the computational-functionalist argument: we could at least simulate a human. Sahil questions this argument.
A typical argument in this cluster goes something like this: a human brain could, in principle, be simulated in enough detail on a computer; the simulation would have the same functional organization as the original; and since (on this view) beliefs, desires, and agency are a matter of functional organization rather than substrate, the simulation would be just as much an agent as you are.
Sahil points to the way referentiality has been broken by simulating the human on a computer. Part of what it means to believe things & to want things is that you can successfully refer to those things. If I upload your consciousness and put you in a simulation of your bedroom, do your thoughts of your bed now refer to your in-computer bed or your out-of-computer bed? The act of uploading your consciousness has inherently made your perception easier to fool, diminishing its truth-tracking nature.
Sure, you might say, but that's easily corrected! A perfect copy will have its references messed up, yes. However, you just have to tell me that I'm in a simulation to straighten me out. Once I know the situation I'm in, my references will get straightened out. This reply still concedes that we have to do at least a little extra work to reintegrate the caring, though. The heart of the computational-functionalist argument is being attacked: reference, goals, and belief are not perfectly substrate-independent. They are not always preserved perfectly when the computation is replicated on a different substrate. In particular, references to the substrate will tend to get messed up!
Biological organisms inherently care for their biological substrate. The substrate-independence doctrine of computational functionalism is, in this sense, contrary to biology. A basic aspect of self-care is keeping your temperature within a good range. If I simulate you perfectly on a CPU, the upload's temperature-care is pointed at virtual temperature, rather than the CPU temperature. Pain and healing will be pointed at your virtual organs, not the physical computer. Your self-care reference-maintenance is no longer aimed at the features of reality most critical to your (upload's) continued existence and functioning.
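(A toy sketch of this reference mismatch, entirely my own illustration rather than Sahil's, with made-up variable names: the upload's thermoregulation loop reads and corrects a simulated body temperature, while the host's CPU temperature, the variable that actually threatens the upload's continued existence, is never consulted.)

```python
# Toy illustration (mine, not Sahil's): an upload's self-care loop regulates the
# *simulated* body temperature; nothing in it ever reads the host machine's CPU.
from dataclasses import dataclass

@dataclass
class SimulatedBody:
    body_temp_c: float = 37.0   # what the upload's temperature-care refers to

@dataclass
class HostMachine:
    cpu_temp_c: float = 45.0    # what actually threatens the upload's existence

def self_care_step(body: SimulatedBody) -> None:
    """The upload's thermoregulation, ported verbatim from the biological original."""
    if body.body_temp_c > 37.5:
        body.body_temp_c -= 0.5  # "sweat": acts only on the simulated variable
    elif body.body_temp_c < 36.5:
        body.body_temp_c += 0.5  # "shiver"

body, host = SimulatedBody(), HostMachine()
for _ in range(50):
    host.cpu_temp_c += 1.0       # the host overheats...
    self_care_step(body)         # ...while the upload's care stays aimed elsewhere

print(f"simulated body temp: {body.body_temp_c:.1f} C (comfortably regulated)")
print(f"host CPU temp: {host.cpu_temp_c:.1f} C (critical, and uncared-for)")
```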
Sahil thinks that AIs will be insensitive to their physical well-being in this way for some time. True, they can display shutdown-resistant behaviors even now. Sahil's prediction is that these won't achieve high capability levels automatically as other capabilities materialize. An otherwise-superintelligent AI trying to protect itself will be something like a pain-insensitive human.[10] It might intellectually know what it needs to do, but experience akrasia when it comes to carrying out the plan. It won't be motivated. It'll be capable of playing a caricature of self-defense, but it will not really be trying.
Overall, Sahil's claim is that integratedness is hard to achieve. This makes alignment hard (it is difficult to integrate AI into our networks of care), but it also makes autonomy risks hard (it is difficult for the AI to have integrated-care with its own substrate).
Integratedness is similar to being in game-theoretic equilibrium: you can't reason out, from first principles, what equilibrium a game will be in. The players have to interact and learn together.
(This might be the part I'm most skeptical about. While I think reference problems do defeat specific arguments a computational-functionalist might want to make, I think my simulated upload's references can be reoriented with only a little work. I do not yet see the argument for why highly capable self-preservation should take particularly long for AIs to develop. However, the position is intriguing.[11])
Speaking personally, no longer as my steelman of Sahil's views: I want to continue engaging with these ideas, and I also want to participate in the soloware community, to see if I can improve my own interfaces.
I think Sahil offers a picture of the near future worth thinking about. I think it is apparent that AI will continue to improve at programming and math, amongst other capabilities. I think it is possible that self-defensive agency won't be amongst those capabilities, for a variety of reasons (even if I'm skeptical of Sahil's reason). In a world where there is a period of low self-defensive agency but moderately high ingenuity from AI, Sahil's vision becomes quite relevant.
Moreover, I see value in the soloware community even setting these things aside.
If you're interested in learning more about this project, Sahil and I are doing a Review + Q&A tomorrow (short notice, but maybe some interested readers can make it):
Monday, September 15th
SF 11:00 AM | NYC 2:00 PM | London 7:00 PM
Venue: https://meet.google.com/rae-ayvy-gtd
You can also read more in Sahil's Live Theory sequence. You can email them here. Sahil's group is also working on a website, which at the time this post goes up is a work-in-progress with little information, but should have content soon.
The use of "auto" here is unfortunately discordant with the auto-vs-high-actuation point made in this post. We tried to think of a more fitting term, but high-actuation-structure is cumbersome. "Structure" here is supposed to include structures like programs and mathematics, but also less cookie-cutter structures such as philosophical or legal arguments. Sahil also calls it the teleattention era, conceiving the resource AI scales as "attention".
Paint also moves from lower-actuation to higher actuation as new resources and technologies are utilized. For example, purple was very difficult to create for centuries, as it involved crushing very large numbers of snails. (This purple was usually a textile dye, but also used for paint.)
This claim deserves a more thorough treatment, but to name a few things: the by-far-most-popular textbook on AI, "Artificial Intelligence: A Modern Approach", motivates everything with the concept of agency from the beginning and throughout. Artificial Intelligence has been hugely influenced by Operations Research (taking many mathematical and computational tools from that field), which is itself hugely influenced by economics and the economic concept of agent.
Many AI safety problems, after all, are direct consequences of agency. Agents are power-seeking. Agents maximize, which leads to Goodharting problems. Agents will tend to find perverse instantiations of their utility functions.
("Poorly" can mean either capability or alignment or both.)
I'm not claiming Sahil invented this term.
He's been running "interface integration therapy sessions" to practice the art of creating interfaces catering to a specific person. Ping him if you'd like to join; he might be open to adding more people.
Sahil wrote the following as a potential addition to the essay when he reviewed it:
What are the humans in the community responsible for? When actuation is cheap, Sahil suggests that living beings strongly embedded in their environments (such as humans) should orient to rich potentiation.
Sahil clarifies that "potential" here is not ideas (which AI could easily generate), but the potency that comes from connection to meaningfulness. In the way that Bob Dylan's music, say, is potent. This makes humans responsible for supplying discernment (eg. "taste") and tight referentiality to life. AI struggles with this, at least in a "rich" way; LLMs are presumably great at language but miss subtleties of what matters to us because they don't live it and are not moved by it.
Although I did include some other Sahil-written bits when revising this essay, I chose not to include this one because it does not adequately represent my best understanding of Sahil. I feel that this text touches on part of Sahil's thinking which I do not have a working steelman for (although I continue to be interested in understanding Sahil's position better).
This could happen in a lot of ways. Maybe we get good at interpretability, or we get lucky and inner optimizers are empirically uncommon, or something else. It could also happen on many different timescales. The idea that full agency is relatively slow to develop in AI could translate to months or years or decades. The autostructure period could be a brief but intense burst of activity.
Sahil's argument about congenital pain insensitivity is, roughly, as follows: people with congenital insensitivity to pain get injured more and die younger, not because they cannot intellectually know that something is wrong with their body, but because, without felt pain, that knowledge does not move them to protect themselves; the caring is not integrated into their motivation.
I am not sure of the empirical status of this story. Even if near-future AIs turn out to be analogous to pain-insensitive humans, are pain-insensitive humans actually akrasiatic in this way? Are their mortality rates actually higher? If so, could there be another explanation?
The Wikipedia article says:
Because children and adults with CIP cannot perceive pain, they may not respond to pain-inducing stimuli, putting them at a high risk for infections and complications resulting from injuries.
This suggests that the problem is a lack of knowledge that something is wrong (focusing on the function of pain sensors as sensors, conveying information), rather than a lack of caring (focusing on their function as reinforcement, conveying goals).
However, I'm not sure exactly what pile of evidence went into the Wikipedia article's claim. Sahil's story seems plausible a priori, and I'm not sure Wikipedia is citing research that tries to differentiate an akrasia hypothesis from a sensor/information hypothesis.
Around this question, Sahil and I have had long arguments about whether it is possible to simply 'download kung-fu'. Sahil predicts that, without doing the hard work of integrating it yourself (whatever that means), it'll lead to strange integrity disorders, for example sudden kung-fu seizures or dissociated movement.