TristanTrim

Still haven't heard a better suggestion than CEV.

Comments (sorted by newest)
Ethical Design Patterns
TristanTrim · 16d

Am I understanding you correctly that you are pointing out that people have spheres of influence, with areas they seemingly have full control over and other areas where they seemingly have no control? That makes sense and seems important. Where you can aim your ethical heuristic at areas people fully control, it will obviously work better, but unfortunately it is important for people to try to influence things they don't seem to have any control over.

I suppose you could prescribe self-referential heuristics, for example "have you spent 5 uninterrupted minutes thinking about how you can influence AI policy in the last week?" It isn't clear whether any given person can influence these companies, but it is clear that any given person can consider it for 5 minutes. That's not a bad idea, but there may be better ways to take the "We should..." statement out of intractability and make it embodied. Can you think of any?

My longer comment on ethical design patterns explores a bit of how I'm thinking about influence through my "OIS" lens, in a way tangentially related to this.

What, if not agency?
TristanTrim · 17d

Soloware is a cool concept. My biggest concern is that it would become more difficult to integrate progress made in one domain into other domains if wares diverge, but I have faith that solutions to that problem could be found.

About the concept of agent integration difficulty: I have a nitpick that might not connect to anything useful, and what might be a more substantial critique that is harder to parse.

If I simulate you perfectly on a CPU, [...] Your self-care reference-maintenance is no longer aimed at the features of reality most critical to your (upload's) continued existence and functioning.

If this simulation is a basic "use tons of computation to run a low-level state machine at the molecular, atomic, or quantum level", then your virtual organs will still virtually overheat and the virtual you will die, so you now have two things to care about: your simulated temperature and the temperature of the computer running the simulation.

...

I'm going to use my own "OIS" terminology now, see this comment for my most concise primer on OISs at the time of writing. As a very basic approximation, "OIS" means "agent".

It won't be motivated.  It'll be capable of playing a caricature of self-defense, but it will not really be trying.

Overall, Sahil's claim is that integratedness is hard to achieve.  This makes alignment hard (it is difficult to integrate AI into our networks of care), but it also makes autonomy risks hard (it is difficult for the AI to have integrated-care with its own substrate).

The nature of agents derived from simulators like LLMs is interesting. Indeed, they often act more like characters in stories than people actually acting to achieve their goals. Of course, the same could be said about real people.

Regardless, that is a focus on the accidental creation of misaligned mesa-OISs. I think this is a risk worth considering, but a more concerning threat, which this article does not address, is existing misaligned OISs recursively improving their capabilities: how much of the soloware people create will be in service of their performance in a role within an OIS whose preferences they do not fully understand? That is the real danger.

Ethical Design Patterns
TristanTrim · 23d

[epistemic note: I'm trying to promote my concept "Outcome Influencing Systems (OISs)". I may be having a happy death spiral around the idea and need to pull out of it. I'm seeking evidence one way or the other. ]

[reading note: I pronounce "OIS" as "oh-ee" and "OISs" as "oh-ees".]

I really like the idea of categorizing and cataloguing ethical design patterns (EDPs) and seeking reasonable EDP bridges. I think the concept of "OISs" may be helpful to the endeavour in some ways.

A brief primer on OISs:

  • "OISs" is my attempt to generalize AI alignment.
  • "OISs" is inspired by many disciplines and domains including technical AI alignment, PauseAI activisim, mechanistic interpretability, systems theory, optimizer theory, utility theory, and too many others to list.
  • OISs are any system which has "capabilities" which it uses to "influence" the course of events towards "outcomes" in alignment with it's "preferences".
  • OISs are "densely venn", meaning that segmenting reality into OISs results in what looks like a venn diagram with very many circles intersecting and nesting. Eg: people are OISs, teams are OISs, governments are OISs, memes are OISs. Every person is made up of many OISs contributing to their biological homeostasis and conscious behaviour.
  • OISs are "preference independent" in that being a part of an OIS implies no relationship between the preferences of yourself and the preferences of the OIS you are contributing to. If there is a relationship, it must be established through some other way than stating your desires for the OIS you are acting as a part of.
  • Each OIS has an "implementing substrate" which is the parts of our reality that make up the OIS. Common substrates include: { sociotechnical (groups of humans and human technology), digital (programs on computers), electromechanical (machines with electricity and machinery), biochemical (living things), memetic (existing in peoples minds in a distributed way) }. This list is not complete, nor do I feel strongly that it is the best way to categorize substrates, but it gives an intuition I hope.
  • Each OIS has a "preference encoding". This is where and how the preferences exist in the OIS's implementing substrate.
  • The capability of an OIS may be understood as an amalgamation of it's "skill", "resource access", and "versatility".
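
To make that vocabulary concrete, here is a minimal sketch of the primer above written down as a data structure. This is purely illustrative: the class and field names, and the idea of scoring capability components as numbers, are my own invention for this comment, not anything established.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Substrate(Enum):
    """Common implementing substrates (non-exhaustive)."""
    SOCIOTECHNICAL = auto()     # groups of humans and human technology
    DIGITAL = auto()            # programs on computers
    ELECTROMECHANICAL = auto()  # machines with electricity and machinery
    BIOCHEMICAL = auto()        # living things
    MEMETIC = auto()            # existing in people's minds in a distributed way


@dataclass
class Capability:
    """Capability as an amalgamation of skill, resource access, and versatility.
    Scoring these from 0 to 1 is a made-up convention for illustration."""
    skill: float            # how well it executes within its domain
    resource_access: float  # what it can bring to bear
    versatility: float      # how broad its domain of influence is


@dataclass
class OIS:
    """An Outcome Influencing System: any system that uses its capabilities
    to influence the course of events towards outcomes matching its preferences."""
    name: str
    substrates: list[Substrate]
    preference_encoding: str  # where/how the preferences live in the substrate
    capability: Capability
```

Nothing here is doing any work; it is just the taxonomy written down so the examples below have something to point at.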

It seems that when you use the word "mesaoptimizers" you are reaching for the word "OIS" or some variant. Afaik "mesaoptimizer" refers to an optimization process created by another optimization process. It is a useful word, especially for examining reinforcement learning, but it puts the focus on the optimizer having been created by an optimizer, which isn't really the relevant thing here. I would suggest that "influencing outcomes" is the relevant focus instead.

Also, we avoid the optimizer/optimized/policy issue. As stated in "Risks from Learned Optimization: Introduction":

a bottle cap causes water to be held inside the bottle, but it is not optimizing for that outcome since it is not running any sort of optimization algorithm.

If what you care about is the outcome, whether or not water will stay in the bottle, then it isn't "optimizers" you are interested in, but OISs. I think understanding optimization is important for examining possible recursive self improvement and FOOM scenarios, so the bottle cap is indeed not an optimizer, and that is important. But the bottle cap is an OIS, because it influences the outcome for the water by making it much more likely that all of the water stays in the bottle. (Although, notably, it is an OIS with very, very narrow versatility and very weak capability.)
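
In the toy sketch from my primer above, the bottle cap instantiates cleanly even though it runs no optimization algorithm. All the numbers are hypothetical, just to show the shape:

```python
bottle_cap = OIS(
    name="bottle cap",
    substrates=[Substrate.ELECTROMECHANICAL],  # closest fit: a passive mechanism
    preference_encoding="the cap's shape and thread",
    capability=Capability(
        skill=0.95,           # very reliably keeps water in the bottle
        resource_access=0.0,  # commands no resources
        versatility=0.01,     # influences essentially one outcome
    ),
)
```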

I'm not too interested in whether large social groups working towards projects such as enforcing peace or building AGI are optimizers or not. I suspect they are, but I feel much more comfortable labelling them as "OISs" and then asking: "What are the properties of this OIS?", "Is it encoding the preferences I think it is? The preferences I should want it to?"

Ok, that's my "OIS" explanation, now onto where the "OIS" concept may help the "EDP" concept...

EDPs as OISs:

First, EDPs are OISs that exist in the memetic substrate and influence individual humans and human organizations towards successful ethical behaviour. Some relevant questions from this perspective: What are EDPs' capabilities? How do they influence? How do we know what their preferences are? How do we effectively create, deploy, and decommission them based on analysis of their alignment and capability?
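
Continuing the illustrative sketch from my primer, an EDP slots in as a memetic OIS. Again, the example pattern and every value here are hypothetical:

```python
honesty_edp = OIS(
    name='"do not deceive" ethical design pattern',
    substrates=[Substrate.MEMETIC],
    preference_encoding="the heuristic as remembered, taught, and enforced by people",
    capability=Capability(
        skill=0.6,            # often, not always, steers behaviour when triggered
        resource_access=0.2,  # borrows the attention of whoever carries it
        versatility=0.4,      # applies across many, but not all, situations
    ),
)
```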

EDPs for LST-OISs:

It seems to me that the place we are most interested in EDPs is in influencing the behaviour of society at large, including large organizations and individuals whose actions may affect other people. So, as I mentioned regarding "mesaoptimizers", it seems useful to have clear terminology for discussing what kinds of OIS we are targeting with our EDPs. The most interesting kind to me are "Large SocioTechnical OISs" (LST-OISs), by which I mean governments of various kinds, large markets and their dynamics, corporations, social movements, and anything else you can point to that is made up of large numbers of people working with technology to influence the outcomes of our reality. I'm sure it is useful to break LST-OISs down into subcategories, but I feel it is good to have a short and fairly politically neutral way to refer to those kinds of objects in full generality, especially one embedded in the lens of "OISs", with the implication that we should care about each OIS's capabilities and preferences.

People don't control OISs:

Another consideration is that people don't control OISs. Instead, OISs are like autonomous robots that we create and then send out into the world. But unlike robots, OISs can be, and frequently are, created through people's interactions without the explicit goal of creating an OIS.

This means that we live in a world with many intentionally created OISs, but also many implicit and hybrid OISs. It is not clear whether there is a relationship between how an OIS was created and how capable or aligned it is. It seems that markets were mostly created implicitly, but are very capable and rather well aligned, with some important exceptions. Contrast Stalin's planned economy: an intentionally created OIS which, I think, was genuinely meant to be more capable and aligned while serving the same purpose, but which turned out to be less capable in many ways and tragically misaligned.

More on the note of not controlling OISs: it is more accurate to say we have some level of influence over them. It may be that our social roles are so constrained, in some Molochian ways, that we really don't have any influence over some OISs despite contributing to them. To recontextualize some Stoicism: the only OIS you control is yourself. But even that is complicated by the existence of multiple OISs within yourself.

The point of saying this is that no individual human has the capability to stop companies from developing and deploying dangerous technologies; rather, we are trying to understand and wield OISs which we hope may have that capability. This is important both for making our strategy clear and for understanding how people relate to what is going on in the world.

Unfortunately, most people I talk to seem to believe that humans are in control. Sure, LST-OISs wouldn't exist without the humans in the substrate that implements them, and the LST-OISs are in control, but that is extremely different from humans themselves being in control.

In trying to develop EDPs for controlling dangerous OISs, it may help to promote OIS terminology, making it easier for people to understand the true (less wrong) dynamics of what is being discussed. At the least, it may be valuable to note explicitly that the people we are trying to make EDPs for are thinking in terms of tribes of people in which people are in control, rather than complex sociotechnical systems, and that this will affect how they relate to EDPs that are critical of specific OISs they view as labels pointing at their tribe.

...

Ha, sorry for writing so much. If you read all of this, please lmk what you think : )

Ethical Design Patterns
TristanTrim · 23d

I wouldn't say I'm strongly a part of the LW community, but I have read and enjoyed the sequences. I am also undiagnosed autistic and have many times gotten into arguments for reasons that seemed to me like other people not liking the way I communicate, so I can relate to that. If you want to talk privately where there is less chance of accidentally offending larger numbers of people, feel free to reach out to me in a private message. You can think of it as a dry run for posting or reaching out to others if you want.

Ethical Design Patterns
TristanTrim · 23d

I like this. Having strong norms for how posts should be broken up (prereqs, lit review, examples, motivations, etc.) seems like it would be good for engendering clarity of thought and for respecting people's time and focus. However, it would need to be built on the correct norms, and I don't know what those norms should be. Figuring it out and popularizing it seems like a worthwhile goal though. Good luck if you are picking it up!

Ethical Design Patterns
TristanTrim · 23d

move ethics from some mystical thing, into an engineering/design problem

I like this vibe and want to promote my "Outcome Influencing System (OIS)" concept as a set of terminology and a lens that may be valuable. Basically, anything that is trying to influence reality is an OIS, and so in that way it is the same as an optimizer, but I'm hoping to build up concepts around the term that make it a more useful way to explore and discuss these ideas than existing terminology does.

The relevance is that there are many "large sociotechnical OISs" that we have implicitly and explicitly created, and treating them as technology that should have better engineering quality assurance seems like a valuable goal.

Ethical Design Patterns
TristanTrim · 23d

I would like to draw a strong distinction between a "world government" and an organization capable of effecting international AGI race de-escalation. I don't think you were exactly implying that the former is necessary for the latter, but since the former seems implausible and the latter necessary for humanity to survive, it seems good to distinguish them clearly.

Ethical Design Patterns
TristanTrim · 23d

"Successful heuristics are embodied" seems like a good ethical heuristic heuristic. I support the call to action to make "we shouldn't let companies cause minor risks of major harm" more embodied by giving examples of how a future where we have and use that heuristic. (Related, I think "we shouldn't let companies cause minor risks of major harms" is better phrasing for heuristic C.)

Ethical Design Patterns
TristanTrim · 23d

Interesting. I like the "metronomes syncing" metaphor. It evokes the same feeling for me as a cloud of chaotically spinning dust collapsing into a solar system with roughly one axis of spin. It also reminds me of my "Map Articulating All Talking" (MAAT) concept. I'm planning to write up a post about it, but until then, this comment thread is where I've written the most about it. The basic idea is that it is currently impossible to communicate sensibly with groups of humans, and a new social media platform would solve this issue. (lol, ambitious I know.)

Ethical Design Patterns
TristanTrim · 23d

It seems heuristic C applies to cigarettes, leaded gas, and ozone-destroying chemicals. If we had already had heuristic C and sufficient ethical bridges around it, we would have been much better equipped to respond to those threats quickly. Your points 1-3 do seem like valid difficulties for the promotion of heuristic C. They may be related to some of the heuristics D-I.

I agree we need effective persuasion and perhaps persuasion design patterns, but persuasion focused on promoting heuristic C to aid in promoting AI x-safety doesn't seem like wasted effort to me.

Posts

  • N Dimensional Interactive Scatter Plot (ndisp) · 2mo
  • Tristan's Projects · 2mo
  • Zoom Out: Distributions in Semantic Spaces · 3mo
  • AI Optimization, not Options or Optimism · 3mo
  • TT Self Study Journal # 3 · 3mo
  • TT Self Study Journal # 2 · 4mo
  • TT Self Study Journal # 1 · 4mo
  • Propaganda-Bot: A Sketch of a Possible RSI · 6mo
  • Language and My Frustration Continue in Our RSI · 7mo
  • How I'd like alignment to get done (as of 2024-10-18) · 1y

Wikitag Contributions

  • Simulator Theory · 5 months ago