I agree. I've been trying to introduce some terminology that I think might help, at least with discussing the situation. I think "AI" is generally a vague and confusing term, and what we should actually be focused on are "Outcome Influencing Systems (OISs)", where a hypothetical ASI would be an OIS capable of influencing what happens on Earth regardless of human preferences. However, humans are also OISs, as are groups of humans, and in fact the "competitive pressure" you mention is a kind of very powerful OIS that is already misaligned and in many ways superhuman.
Is it too late to "unplug" or "align" all of the powerful misaligned OISs operating in our world? I'm hoping not, but I think the framing might be valuable for examining the issue, and maybe for avoiding some of the usual political issues involved in criticizing any specific powerful OIS that might happen to be influencing us towards potentially undesirable outcomes.
What do you think?
I agree on both points. To the first, I'd like to note that classifying "kinds of illegibility" seems worthwhile. You've pointed out one example: "this will affect future systems but doesn't affect systems today". I'd add three more to make a possibly incomplete set:
I'd be happy to say more about what I mean by each of the above if anyone is curious, and I'd also be happy to hear your thoughts about my suggested illegibility categories or the concept in general.
The "morality is scary" problem of corrigible AI is an interesting one. Seems tricky to at least a first approximation in that I basically don't have an estimate on how much effort it would take to solve it.
Your rot13 suggestion has the obvious corruption problem, but it also has a public relations problem: I doubt the plan would be popular. However, I like where your head is at.
My own thinking on the subject is closely related to my "Outcome Influencing System (OIS)" concept; my most complete and concise summary is here. I should write an explainer post, but haven't gotten to it yet.
Basically, whatever system we use for deciding on and controlling the corrigible AI becomes the system whose alignment we need to ensure. It doesn't really solve the problem; it just backs it up by one matryoshka doll around the AI.
I see a lot of people dismissing the agent foundations era, and I disagree with that dismissal. Studying agents seems even more important to me than ever, now that agents are sampled from a latent space of possible agents within the black box of LLMs.
To throw out a crux: I agree that if we have missed opportunities for progress towards beneficial AI by trying to avoid advancing harmful capabilities, that would be a bad thing, but my internal sense of the world suggests that harmful capabilities have been advanced more than opportunities have been missed. Unfortunately, that seems like a difficult claim to study in any sort of unbiased, objective way, one way or the other.
This is a good point of view. What we have is a large sociotechnical system moving towards global catastrophic risk (GCR). Some actions cause it to accelerate or remove brakes; others cause it to steer away from GCR. So "capabilities vs alignment" maps directly onto "accelerate vs steer", while "legible vs illegible" is more like making people think we can steer even though we can't, which in turn makes people ok with acceleration, so "legible vs illegible" also ends up being "accelerate vs steer".
The important factor there is "people think we can steer". I think when the thing we are driving is "the entire human civilization" and the thing we are trying to avoid driving into is "global catastrophic risk", caution is warranted... but not infinite caution. It does not override all other concerns; merely, it seems by my math, most of them. So, unfortunately, I think the most important thing is getting people to accurately (or at least less wrongly) understand the degree to which we can or cannot steer, probably erring towards making people think we can steer less well than we can, rather than better than we can, which seems to be the default of human nature.
An unrelated problem: as with capabilities, there is more funding for legible problems than for illegible ones. I am currently continuing to sacrifice large amounts of earning potential so I can focus on problems I believe are important. That makes it sound noble, but indeed, how do we know which people working on illegible problems are making worthwhile things understandable and which are just wasting time? That is exactly what makes a problem illegible: we can't tell. It seems like a genuinely tricky problem, somewhat related to the ASI alignment problem. How can we know that an agent we don't understand, working on a problem we don't understand, is working towards our benefit?
Anyway, thanks for the thoughtful post.
Am I understanding you correctly that you are pointing out that people have spheres of influence, with areas they seemingly have full control over and other areas where they seemingly have no control? That makes sense and seems important. Where you can aim your ethical heuristics at things people have full control over, they will obviously work better, but unfortunately it is important for people to try to influence things that they don't seem to have any control over.
I suppose you could prescribe self-referential heuristics, for example "have you spent 5 uninterrupted minutes thinking about how you can influence AI policy in the last week?" It isn't clear whether any given person can influence these companies, but it is clear that any given person can consider it for 5 minutes. That's not a bad idea, but there may be better ways to take the "We should..." statement out of intractability and make it embodied. Can you think of any?
My longer comment on ethical design patterns explores a bit about how I'm thinking about influence through my "OIS" lens in a way tangentially related to this.
Soloware is a cool concept. My biggest concern is that it could become more difficult to integrate progress made in one domain into other domains if wares diverge, but I have faith that solutions to that problem could be found.
About the concept of agent integration difficulty, I have a nitpick that might not connect to anything useful, and what might be a more substantial critique that is more difficult to parse.
If I simulate you perfectly on a CPU, [...] Your self-care reference-maintenance is no longer aimed at the features of reality most critical to your (upload's) continued existence and functioning.
If this simulation is a basic "use tons of computation to run a low-level state machine at the molecular, atomic, or quantum level", then your virtual organs will still virtually overheat and the virtual you will die, so you now have two things to care about: your simulated temperature and the temperature of the computer running the simulation.
...
I'm going to use my own "OIS" terminology now; see this comment for my most concise primer on OISs at the time of writing. As a very basic approximation, "OIS" means "agent".
It won't be motivated. It'll be capable of playing a caricature of self-defense, but it will not really be trying.
Overall, Sahil's claim is that integratedness is hard to achieve. This makes alignment hard (it is difficult to integrate AI into our networks of care), but it also makes autonomy risks hard (it is difficult for the AI to have integrated-care with its own substrate).
The nature of agents derived from simulators like LLMs is interesting. Indeed, they often act more like characters in stories than people actually acting to achieve their goals. Of course, the same could be said about real people.
Regardless, that is a focus on the accidental creation of misaligned mesa-OISs. I think this is a risk worth considering, but a more concerning threat, which this article does not address, is existing misaligned OISs recursively improving their capabilities: how much of people's soloware creation will be in service of their performance in a role within an OIS whose preferences they do not fully understand? That is the real danger.
[epistemic note: I'm trying to promote my concept "Outcome Influencing Systems (OISs)". I may be having a happy death spiral around the idea and need to pull out of it. I'm seeking evidence one way or the other. ]
[reading note: I pronounce "OIS" as "oh-ee" and "OISs" as "oh-ees".]
I really like the idea of categorizing and cataloguing ethical design patterns (EDPs) and seeking reasonable EDP bridges. I think the concept of "OISs" may be helpful in that endeavour in some ways.
A brief primer on OISs:
It seems that when you use the word "mesaoptimizers" you are reaching for the word "OIS" or some variant. Afaik "mesaoptimizer" refers to an optimization process created by another optimization process. It is a useful word, especially for examining reinforcement learning, but it puts the focus on the optimizer having been created by an optimizer, which isn't really the relevant focus. I would suggest that "influencing outcomes" is the relevant focus instead.
Also, we avoid the optimizer/optimized/policy issue. As stated in "Risks from Learned Optimization: Introduction":
a bottle cap causes water to be held inside the bottle, but it is not optimizing for that outcome since it is not running any sort of optimization algorithm.
If what you care about is the outcome, whether or not water stays in the bottle, then it isn't "optimizers" you are interested in, but OISs. I think understanding optimization is important for examining possible recursive self-improvement and FOOM scenarios, so the bottle cap is indeed not an optimizer, and that is important. But the bottle cap is an OIS, because it influences the outcome for the water by making it much more likely that all of the water stays in the bottle. (Although, notably, it is an OIS with very, very narrow versatility and very weak capability.)
I'm not too interested in whether large social groups working towards projects such as enforcing peace or building AGI are optimizers or not. I suspect they are, but I feel much more comfortable labelling them as "OISs" and then asking, "what are the properties of this OIS?", "Is it encoding the preferences I think it is? The preferences I should want it to?".
Ok, that's my "OIS" explanation, now onto where the "OIS" concept may help the "EDP" concept...
EDPs as OISs:
First, EDPs are OISs that exist in the memetic substrate and influence individual humans and human organizations towards successful ethical behaviour. Some relevant questions from this perspective: What are EDPs' capabilities? How do they influence? How do we know what their preferences are? How do we effectively create, deploy, and decommission them based on analysis of their alignment and capability?
EDPs for LST-OISs:
It seems to me that the place we are most interested in EDPs is for influencing the behaviour of society at large, including large organizations and individuals whose actions may affect other people. So, as I mentioned regarding "mesaoptimizers", it seems useful to have clear terminology for discussing what kinds of OIS we are targeting with our EDPs. The most interesting kind to me is "Large SocioTechnical OISs" (LST-OISs), by which I mean governments of various kinds, large markets and their dynamics, corporations, social movements, and anything else you can point to as being made up of large numbers of people working with technology to have some kind of influence on the outcomes of our reality. I'm sure it is useful to break LST-OISs down into subcategories, but I feel it is good to have a short and fairly politically neutral way to refer to those kinds of objects in full generality, especially if it is embedded in the lens of "OISs", with the implication that we should care about each OIS's capabilities and preferences.
People don't control OISs:
Another consideration is that people don't control OISs. Instead, OISs are like autonomous robots that we create and then send out into the world. But unlike robots, OISs can be, and frequently are, created through people's interactions without the explicit goal of creating an OIS.
This means that we live in a world with many intentionally created OISs, but also many implicit and hybrid ones. It is not clear whether there is a relationship between how an OIS was created and how capable or aligned it is. It seems that markets were mostly created implicitly, but are very capable and rather well aligned, with some important exceptions. Contrast Stalin's planned economy, an intentionally created OIS which I think was genuinely intended to be more capable and aligned while serving the same purpose, but which turned out to be less capable in many ways and tragically misaligned.
More on the note of not controlling OISs: it is more accurate to say we have some level of influence over them. It may be that our social roles are so constrained, in some Molochian ways, that we really don't have any influence over some OISs despite contributing to them. To recontextualize some stoicism: the only OIS you control is yourself. But even that is complicated by the existence of multiple OISs within yourself.
The point of saying this is that no individual human has the capability to stop companies from developing and deploying dangerous technologies; rather, we are trying to understand and wield OISs which we hope may have that capability. This is important both for making our strategy clear and for understanding how people relate to what is going on in the world.
Unfortunately, most people I talk to seem to believe that humans are in control. Sure, LST-OISs wouldn't exist without the humans in the substrate that implements them, and those LST-OISs are in control, but that is extremely different from humans themselves being in control.
In trying to develop EDPs for controlling dangerous OISs, it may help to promote OIS terminology so that it is easier for people to understand the true (less wrong) dynamics of what is being discussed. At the very least, it may be valuable to note explicitly that the people we are trying to make EDPs for are thinking in terms of tribes of people in which humans are in control, rather than in terms of complex sociotechnical systems, and that this will affect how they relate to EDPs that are critical of specific OISs they view as labels pointing at their tribe.
...
Ha, sorry for writing so much. If you read all of this, please lmk what you think : )
I wouldn't say I'm strongly a part of the LW community, but I have read and enjoyed the sequences. I am also undiagnosed autistic and have many times gotten into arguments for reasons that seemed to me like other people not liking the way I communicate, so I can relate to that. If you want to talk privately where there is less chance of accidentally offending larger numbers of people, feel free to reach out to me in a private message. You can think of it as a dry run for posting or reaching out to others if you want.
Thanks for the reply.
I guess I'm unclear on which people you consider the relevant neurotic demographic, and since I feel that "agent foundations" is a pointer to a cluster of concepts that it would be very good to develop further, I find myself getting confused by your use of the phrase "agent foundations era".
For a worldview check: I am currently much more concerned about the risks of "advancing capabilities" than I am about missed opportunities, so we may be coming at this from different perspectives. I'm also getting something of a hostile, soldier-mindset vibe from you; my apologies if I am misreading you. Unfortunately, I am in the position of thinking that people promoting the advancement of AI capabilities are indeed promoting increased global catastrophic risk, which I oppose. So if I am falling into the soldier mindset myself, I am likewise sorry.