TristanTrim

Still haven't heard a better suggestion than CEV.

Comments

Legible vs. Illegible AI Safety Problems
TristanTrim · now

Thanks for the reply.

I guess I'm unclear on which people you consider to be the relevant neurotic demographic, and since I see "agent foundations" as a pointer to a bunch of concepts that it would be very good for us to develop further, I find myself getting confused by your use of the phrase "agent foundations era".

For a worldview check, I am currently much more concerned about the risks of "advancing capabilities" than I am about missed opportunities, so we may be coming at this from different perspectives. I'm also getting some hostile soldier-mindset vibes from you; my apologies if I am misreading you. Unfortunately, I am in the position of thinking that people promoting the advancement of AI capabilities are indeed promoting increased global catastrophic risk, which I oppose. So if I am falling into the soldier mindset myself, I am likewise sorry.

Legible vs. Illegible AI Safety Problems
TristanTrim · 5h

I agree. I've been trying to discuss some terminology that I think might help, at least with discussing the situation. I think "AI" is generally a vague and confusing term, and what we should actually be focused on are "Outcome Influencing Systems (OISs)". A hypothetical ASI would be an OIS capable of influencing what happens on Earth regardless of human preferences, but humans are also OISs, as are groups of humans; in fact, the "competitive pressure" you mention is itself a very powerful OIS that is already misaligned and in many ways superhuman.

Is it too late to "unplug" or "align" all of the powerful misaligned OISs operating in our world? I'm hoping not, but I think the framing might be valuable for examining the issue, and maybe for avoiding some of the usual political issues involved in criticizing any specific powerful OIS that happens to be influencing us towards potentially undesirable outcomes.

What do you think?

Legible vs. Illegible AI Safety Problems
TristanTrim · 5h

I agree on both points. To the first, I'd like to note that classifying "kinds of illegibility" seems worthwhile. You've pointed out one example, "this will affect future systems but doesn't affect systems today". I'd add three more to make a possibly incomplete set:

  • This will affect future systems but doesn't affect systems today.
  • This relates to an issue at a great inferential distance; it is conceptually difficult to understand.
  • This issue stems from a framing of, or assumption about, existing systems that is not correct.
  • This issue is emotionally or politically inconvenient.

I'd be happy to say more about what I mean by each of the above if anyone is curious, and I'd also be happy to hear thoughts about my suggested illegibility categories or the concept in general.

Legible vs. Illegible AI Safety Problems
TristanTrim · 5h

The "morality is scary" problem of corrigible AI is an interesting one. Seems tricky to at least a first approximation in that I basically don't have an estimate on how much effort it would take to solve it.

Your rot13 suggestion has the obvious corruption problem, but it also has a public-relations problem: I doubt the plan would be popular. However, I like where your head is at.

My own thinking on the subject is closely related to my "Outcome Influencing System (OIS)" concept; my most complete and concise summary is here. I should write an explainer post, but haven't gotten to it yet.

Basically, whatever system we use for deciding on and controlling the corrigible AI becomes the system whose alignment we need to ensure. It doesn't really solve the problem; it just pushes it back one matryoshka doll out from the AI.

Legible vs. Illegible AI Safety Problems
TristanTrim · 5h

I see a lot of people dismissing the agent foundations era, and I disagree with that dismissal. Studying agents seems even more important to me than ever, now that agents are sampled from a latent space of possible agents within the black box of LLMs.

To throw out a crux: I agree that if we have missed opportunities for progress towards beneficial AI by trying to avoid advancing harmful capabilities, that would be a bad thing, but my internal sense of the world suggests that harmful capabilities have been advanced more than opportunities have been missed. Unfortunately, that seems like a difficult claim to study in any unbiased, objective way, one way or the other.

Legible vs. Illegible AI Safety Problems
TristanTrim · 5h

This is a good point of view. What we have is a large sociotechnical system moving towards global catastrophic risk (GCR). Some actions cause it to accelerate or remove brakes; others cause it to steer away from GCR. So "capabilities vs alignment" is directly "accelerate vs steer", while "legible vs illegible" is about making people think we can steer even when we can't, which in turn makes people ok with acceleration, so "legible vs illegible" also ends up being "accelerate vs steer".

The important factor there is "people think we can steer". When the thing we are driving is the entire human civilization and the thing we are trying to avoid driving into is global catastrophic risk, caution is warranted... but not infinite caution. It does not override all other concerns; merely, it seems by my math, most of them. So I think the most important thing is getting people to accurately (or at least less wrongly) understand the degree to which we can or cannot steer, probably erring towards making people think we can steer less well than we can, rather than better than we can, which seems to be the human default.

An unrelated problem: as with capabilities, there is more funding for legible problems than for illegible ones. I am currently continuing to sacrifice a large amount of earning potential so I can focus on problems I believe are important. That makes it sound noble, but indeed, how do we know which people working on illegible problems are making worthwhile things understandable and which are just wasting time? That is exactly what makes a problem illegible: we can't tell. It seems like a really tricky problem, somewhat related to the ASI alignment problem. How can we know that an agent we don't understand, working on a problem we don't understand, is working towards our benefit?

Anyway, thanks for the thoughtful post.

Ethical Design Patterns
TristanTrim · 1mo

Am I understanding you correctly that you are pointing out that people have spheres of influence, with some areas they seemingly have full control over and other areas where they seemingly have no control? That makes sense and seems important. Ethical heuristics aimed at areas people fully control will obviously work better, but unfortunately it is also important for people to try to influence things they don't seem to have any control over.

I suppose you could prescribe self-referential heuristics, for example "have you spent 5 uninterrupted minutes thinking about how you can influence AI policy in the last week?" It isn't clear whether any given person can influence these companies, but it is clear that any given person can consider the question for 5 minutes. That's not a bad idea, but there may be better ways to take the "We should..." statement out of intractability and make it embodied. Can you think of any?

My longer comment on ethical design patterns explores how I'm thinking about influence through my "OIS" lens, in a way tangentially related to this.

What, if not agency?
TristanTrim · 1mo

Soloware is a cool concept. My biggest concern is that it could become more difficult to integrate progress made in one domain into other domains if wares diverge, but I have faith that solutions to that problem could be found.

About the concept of agent integration difficulty: I have a nitpick that might not connect to anything useful, and what might be a more substantial critique that is more difficult to parse.

If I simulate you perfectly on a CPU, [...] Your self-care reference-maintenance is no longer aimed at the features of reality most critical to your (upload's) continued existence and functioning.

If this simulation is a basic "use tons of computation to run a low-level state machine at the molecular, atomic, or quantum level", then your virtual organs will still virtually overheat and the virtual you will die, so you now have two things to care about: your simulated temperature and the temperature of the computer running the simulation.

...

I'm going to use my own "OIS" terminology now; see this comment for my most concise primer on OISs at the time of writing. As a very basic approximation, "OIS" means "agent".

It won't be motivated.  It'll be capable of playing a caricature of self-defense, but it will not really be trying.

Overall, Sahil's claim is that integratedness is hard to achieve.  This makes alignment hard (it is difficult to integrate AI into our networks of care), but it also makes autonomy risks hard (it is difficult for the AI to have integrated-care with its own substrate).

The nature of agents derived from simulators like LLMs is interesting. Indeed, they often act more like characters in stories than like people actually acting to achieve their goals. Of course, the same could be said about real people.

Regardless, that is a focus on the accidental creation of misaligned mesa-OISs. I think this is a risk worth considering, but a more concerning threat, which this article does not address, is existing misaligned OISs recursively improving their capabilities: how much of people's soloware creation will be in service to their performance in a role within an OIS whose preferences they do not fully understand? That is the real danger.

Ethical Design Patterns
TristanTrim · 1mo*

[epistemic note: I'm trying to promote my concept "Outcome Influencing Systems (OISs)". I may be having a happy death spiral around the idea and need to pull out of it. I'm seeking evidence one way or the other. ]

[reading note: I pronounce "OIS" as "oh-ee" and "OISs" as "oh-ees".]

I really like the idea of categorizing and cataloguing ethical design patterns (EDPs) and seeking reasonable EDP bridges. I think the concept of "OISs" may be helpful in that endeavour in some ways.

A brief primer on OISs:

  • "OISs" is my attempt to generalize AI alignment.
  • "OISs" is inspired by many disciplines and domains including technical AI alignment, PauseAI activisim, mechanistic interpretability, systems theory, optimizer theory, utility theory, and too many others to list.
  • An OIS is any system which has "capabilities" that it uses to "influence" the course of events towards "outcomes" in alignment with its "preferences".
  • OISs are "densely venn", meaning that segmenting reality into OISs results in what looks like a venn diagram with very many circles intersecting and nesting. Eg: people are OISs, teams are OISs, governments are OISs, memes are OISs. Every person is made up of many OISs contributing to their biological homeostasis and conscious behaviour.
  • OISs are "preference independent" in that being a part of an OIS implies no relationship between the preferences of yourself and the preferences of the OIS you are contributing to. If there is a relationship, it must be established through some other way than stating your desires for the OIS you are acting as a part of.
  • Each OIS has an "implementing substrate" which is the parts of our reality that make up the OIS. Common substrates include: { sociotechnical (groups of humans and human technology), digital (programs on computers), electromechanical (machines with electricity and machinery), biochemical (living things), memetic (existing in peoples minds in a distributed way) }. This list is not complete, nor do I feel strongly that it is the best way to categorize substrates, but it gives an intuition I hope.
  • Each OIS has a "preference encoding". This is where and how the preferences exist in the OIS's implementing substrate.
  • The capability of an OIS may be understood as an amalgamation of its "skill", "resource access", and "versatility" (a toy sketch of these attributes, in code, follows this list).
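
For readers who think better in code, here is a purely illustrative toy sketch of how the attributes above could be written down. The field names, the 0-to-1 scales, and the way capability is combined are my own made-up choices, not part of the OIS concept itself:

```python
# Toy sketch only: field names and 0..1 scales are illustrative choices,
# not an established formalization of OISs.
from dataclasses import dataclass


@dataclass
class OIS:
    """An Outcome Influencing System, as described in the primer above."""
    substrate: str            # e.g. "sociotechnical", "digital", "memetic"
    preference_encoding: str  # where/how the preferences live in the substrate
    skill: float              # illustrative 0..1 scales
    resource_access: float
    versatility: float

    def capability(self) -> float:
        # The primer treats capability as an amalgamation of skill,
        # resource access, and versatility; a simple product is just one
        # arbitrary way to combine them for illustration.
        return self.skill * self.resource_access * self.versatility


# Example: a small research team, treated as a sociotechnical OIS.
team = OIS(
    substrate="sociotechnical",
    preference_encoding="norms, incentives, and stated goals of the members",
    skill=0.7,
    resource_access=0.3,
    versatility=0.5,
)
print(team.capability())  # 0.105
```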

It seems that when you use the word "mesaoptimizers" you are reaching for the word "OIS" or some variant. As far as I know, "mesaoptimizer" refers to an optimization process created by another optimization process. It is a useful word, especially for examining reinforcement learning, but it puts the focus on the created system having been produced by an optimizer and being an optimizer itself, which isn't really the relevant focus. I would suggest that "influencing outcomes" is the relevant focus instead.

Also, we avoid the optimizer/optimized/policy issue. As stated in "Risks from Learned Optimization: Introduction":

a bottle cap causes water to be held inside the bottle, but it is not optimizing for that outcome since it is not running any sort of optimization algorithm.

If what you care about is the outcome, whether or not the water will stay in the bottle, then it isn't "optimizers" you are interested in, but OISs. I think understanding optimization is important for examining possible recursive self-improvement and FOOM scenarios, so the bottle cap is indeed not an optimizer, and that distinction is important. But the bottle cap is an OIS, because it influences the outcome for the water by making it much more likely that all of the water stays in the bottle. (Although, notably, it is an OIS with very, very narrow versatility and very weak capability.)

I'm not too interested in whether large social groups working towards projects such as enforcing peace or building AGI are optimizers or not. I suspect they are, but I feel much more comfortable labelling them as "OISs" and then asking, "what are the properties of this OIS?", "Is it encoding the preferences I think it is? The preferences I should want it to?".

Ok, that's my "OIS" explanation, now onto where the "OIS" concept may help the "EDP" concept...

EDPs as OISs:

First, EDPs are OISs that exist in the memetic substrate and influence individual humans and human organizations towards successful ethical behaviour. Some relevant questions from this perspective: What are EDPs' capabilities? How do they influence? How do we know what their preferences are? How do we effectively create, deploy, and decommission them based on analysis of their alignment and capability?

EDPs for LST-OISs:

It seems to me that the place we are most interested in EDPs is for influencing the behaviour of society at large, including large organizations and individuals whose actions may affect other people. So, as I mentioned about "mesaoptimizers", it seems useful to have clear terminology for discussing what kinds of OIS we are targeting with our EDPs. The most interesting kind to me are "Large SocioTechnical OISs" (LST-OISs), by which I mean governments of different kinds, large markets and their dynamics, corporations, social movements, and anything else you can point to that is made up of large numbers of people working with technology to have some kind of influence on the outcomes of our reality. I'm sure it is useful to break LST-OISs down into subcategories, but I feel it is good to have a short and fairly politically neutral way to refer to those kinds of objects in full generality, especially if it is embedded in the lens of "OISs", with the implication that we should care about each OIS's capabilities and preferences.

People don't control OISs:

Another consideration is that people don't control OISs. Instead, OISs are like autonomous robots that we create and then send out into the world. But unlike robots, OISs can be, and frequently are, created through people's interactions without the explicit goal of creating an OIS.

This means that we live in a world with many intentionally created OISs, but also many implicit and hybrid OISs. It is not clear whether there is a relationship between how an OIS was created and how capable or aligned it is. It seems that markets were mostly created implicitly, but are very capable and rather well aligned, with some important exceptions. Contrast Stalin's planned economy, an intentionally created OIS which I think was genuinely intended to be more capable and aligned while serving the same purpose, but which turned out to be less capable in many ways and tragically misaligned.

More on the note of not controlling OISs: it is more accurate to say we have some level of influence over them. It may be that our social roles are so constrained, in some Molochian ways, that we really don't have any influence over some OISs despite contributing to them. To recontextualize some stoicism: the only OIS you control is yourself. But even that is complexified by the existence of multiple OISs within yourself.

The point of saying this is that no individual human has the capability to stop companies from developing and deploying dangerous technologies; rather, we are trying to understand and wield OISs which we hope may have that capability. This is important both in making our strategy clear and in understanding how people relate to what is going on in the world.

Unfortunately, most people I talk to seem to believe that humans are in control. Sure, LST-OISs wouldn't exist without the humans in the substrate that implements them, and LST-OISs are in control, but this is extremely different from humans themselves being in control.

In trying to develop EDPs for controlling dangerous OISs, it may help to promote OIS terminology, to make it easier for people to understand the true (less wrong) dynamics of what is being discussed. At the least, it may be valuable to note explicitly that the people we are trying to make EDPs for are thinking in terms of tribes of people who are in control, rather than in terms of complex sociotechnical systems, and that this will affect how they relate to EDPs that are critical of specific OISs they view as labels pointing at their tribe.

...

Ha, sorry for writing so much. If you read all of this, please lmk what you think : )

Ethical Design Patterns
TristanTrim · 1mo

I wouldn't say I'm strongly a part of the LW community, but I have read and enjoyed the sequences. I am also undiagnosed autistic and have many times gotten into arguments for reasons that seemed to me like other people not liking the way I communicate, so I can relate to that. If you want to talk privately where there is less chance of accidentally offending larger numbers of people, feel free to reach out to me in a private message. You can think of it as a dry run for posting or reaching out to others if you want.

Posts

  • TT Self Study Journal # 4 (3mo)
  • N Dimensional Interactive Scatter Plot (ndisp) (3mo)
  • Tristan's Projects (3mo)
  • Zoom Out: Distributions in Semantic Spaces (3mo)
  • AI Optimization, not Options or Optimism (3mo)
  • TT Self Study Journal # 3 (4mo)
  • TT Self Study Journal # 2 (4mo)
  • TT Self Study Journal # 1 (5mo)
  • Propaganda-Bot: A Sketch of a Possible RSI (6mo)
  • Language and My Frustration Continue in Our RSI (7mo)

Wikitag Contributions

  • Simulator Theory (6 months ago, +120/-10)