Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is an attempt to distill a model of AGI alignment that I have gained primarily from thinkers such as Eliezer Yudkowsky (and to a lesser extent Paul Christiano), but explained in my own terms rather than attempting to hew close to these thinkers. I think I would be pretty good at passing an ideological Turing test for Eliezer Yudkowsky on AGI alignment difficulty (but not AGI timelines), though what I'm doing in this post is not that; it's more like finding a branch in the possibility space as I see it that is close enough to Yudkowsky's model that it's possible to talk in the same language.

Even if the problem turns out not to be very difficult, it's helpful to have a model of why one might think it is difficult, both to identify weaknesses in the case and to find AI designs that avoid the main difficulties. Progress on problems can be made by a combination of finding possible paths and finding impossibility results or difficulty arguments.

Most of what I say should not be taken as a statement on AGI timelines. Some problems that make alignment difficult, such as ontology identification, also make creating capable AGI difficult to some extent.

Defining human values

If we don't have a preliminary definition of human values, it's incoherent to talk about alignment. If humans "don't really have values" then we don't really value alignment, so we can't be seriously trying to align AI with human values. There would have to be some conceptual refactor of what problem even makes sense to formulate and try to solve. To the extent that human values don't care about the long term, it's just not important (according to the values of current humans) how the long-term future goes, so the most relevant human values are the longer-term ones.

Expected utility maximization has idealized forms, such as brute-force search over action sequences, and practical approximations, such as reinforcement learning via the Bellman equations, MCMC search, and so on.
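
To make the "approximation" side concrete, here is a minimal value-iteration sketch on a toy MDP (standard textbook material; the specific numbers and names are illustrative, not anything load-bearing):

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
# P[s, a] is a distribution over next states; R[s, a] is the immediate reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

V = np.zeros(n_states)
for _ in range(200):
    # Bellman backup: Q(s, a) = R(s, a) + gamma * E_{s'}[V(s')]
    Q = R + gamma * P @ V
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)  # greedy policy approximating expected-utility maximization
```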

I'm just going to make the assumption that the human brain can be well-modeled as containing one or more approximate expected utility maximizers. It's useful to focus on specific branches of possibility space to flesh out the model, even if the assumption is in some ways problematic. Psychology and neuroscience will, of course, eventually provide more details about what maximizer-like structures in the human brain are actually doing.

Given this assumption, the human utility function(s) either do or don't significantly depend on human evolutionary history. I'm just going to assume they do for now. I realize there is some disagreement about how important evopsych is for describing human values versus the attractors of universal learning machines, but I'm going to go with the evopsych branch for now.

Given that human brains are well-modeled as containing one or more utility functions, either they're well-modeled as containing one (which is perhaps some monotonic function of multiple other score functions), or it's better to model them as multiple. See shard theory. The difference doesn't matter for now, I'll keep both possibilities open.

Eliezer proposes "boredom" as an example of a human value (which could either be its own shard or a term in the utility function). I don't think this is a good example. It's fairly high level and is instrumental to other values. I think "pain avoidance" is a better example due to the possibility of pain asymbolia. Probably, there is some redundancy in the different values (as there is redundancy in trained neural networks, so they still perform well when some neurons are lesioned), which is part of why I don't agree with the fragility of value thesis as stated by Yudkowsky.

Regardless, we now have a preliminary definition of human values. Note that some human values are well-modeled as indexical, meaning they value things relative to a human perspective as a reference point, e.g. a drive to eat food in a typical human is about that human's own stomach. This implies some "selfish" value divergences between different humans, as we observe.

Normative criteria for AI

Given a definition of human values, the alignment of a possible utility function with human values could be defined as the desirability of the best possible world according to that utility function, with desirability evaluated with respect to human values.
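
Roughly, using a set of possible worlds $W$ and a human value function $V_H$ (notation introduced here just for illustration, not a formalization I'm committed to):

```latex
\mathrm{Align}(U) \;=\; V_H\!\left(\operatorname*{arg\,max}_{w \in W} U(w)\right)
```

This captures "how good is the best world by $U$'s lights, judged by human values", and ignores constraints on which worlds are actually attainable.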

Alignment is a possible normative criterion for AI value systems. There are other possible normative criteria derived from moral philosophy. My "Moral Reality Check" short story imagines possible divergences between alignment and philosophical normativity. I'm not going to focus on this for now; I'm going to assume that alignment is the relevant normative criterion. See the Metaethics Sequence; I haven't written up a better explanation of the case for this. There is some degree to which technologies similar to alignment might be necessary for producing abstractly normative outcomes (for example, a default unaligned AGI would likely follow normative deontology less than an AGI aligned to deontological normativity would), but keeping this thread in mind would complicate the argument.

Agentic, relatively unconstrained humans would tend to care about particular things, and "human values" is a pointer at what they would care about, so it follows, basically tautologically, that they would prefer AI to be aligned to human values. The non-tautological bit is that there is some dependence of human values on human evolutionary history, so that a default unaligned AGI would not converge to the same values; this was discussed as an assumption in the previous section.

Given alignment as a normative criterion, one can evaluate the alignment of (a) other intelligent biological species, including aliens, and (b) default AI value systems. Given the assumption that human values depend significantly on human evolutionary history, both are less aligned than humans, but (a) is more aligned than (b). I'm not going to assess the relative utility differences of these (or of an "all life on Earth wiped out, no technological transcendence" scenario). Those relative utility differences might be more relevant if it is concluded that alignment with human values is too hard for that to be a decision-relevant scenario. But I haven't made that case yet.

Consequentialism is instrumentally useful for problem-solving

AI systems can be evaluated on how well they solve different problems. I assert that, on problems with short time horizons, short-term consequentialism is instrumentally useful, and on problems with long time horizons, long-term consequentialism is instrumentally useful.

This is not to say that every problem requires consequentialism to solve well. For example, multiplying large numbers requires no consequentialism. But for complex problems, consequentialism is likely to be helpful at some agent capability level. Current ML systems, like LLMs, probably possess primitive agency at best, but at some point, better AI performance will come from agentic systems.

This is in part because some problem solutions are evaluated in terms of consequences. For example, a solution to the problem of fixing a sink is naturally evaluated in terms of the consequence of whether the sink is fixed. A system effectively pursuing a real world goal is, therefore, more likely to be evaluated as having effectively solved the problem, at least past some capability level.

This is also in part because consequentialism can apply to cognition. Formally proving Fermat's Last Theorem is not evaluated in terms of real-world consequences so much as by the criteria of the formal proof system. But human mathematicians proving it think about both (a) the cognitive consequences of thinking certain thoughts and (b) the material consequences of actions such as writing things down or talking with other mathematicians, in terms of how these affect their ability to produce a mathematical proof.

Whether or not an AI system does (b), at some level of problem complexity and AI capability, it will perform better by doing (a). To prove mathematical theorems, it would need to plan out what thoughts are likely to be more fruitful than others.

Simple but capable AI methods for solving hard abstract problems are likely to model the real world

While I'm fairly confident in the previous section, I'm less confident of this one, and I think it depends on the problem details. In speculating about possible misalignments, I am not making confident statements, but rather saying there is a high degree of uncertainty, and that most paths towards solving alignment involve reasoning better about this uncertainty.

To solve a specific problem, some methods specific to that problem are helpful. General methods are also likely to be helpful, e.g. explore/exploit heuristics. General methods are especially helpful if the AI is solving problems across a varied domain or multiple domains, as with LLMs.

If the AI applies general methods to a problem, it will be running a general cognition engine on the specific case of this problem. Depending on the relevant simplicity prior or regularization, the easily-findable cases of this may not automatically solve the "alignment problem" of having the general cognition engine specifically try to solve the specific task and not a more wide-scoped task.

One could try to solve problems by breeding animals to solve them. These animals would use some general cognition to do so, and that general cognition would naturally "want" things other than solving the specific problems. This is not a great analogy for most AI systems, though, which in ML are more directly selected on problem performance rather than evolutionary fitness.

Depending on the data the AI system has access to (indirectly through training, directly through deployment), it is likely that, unless specific measures are taken to prevent this, the AI would infer something about the source of this data in the real world. Humans are likely to train and test the AI on specific distributions of problems, and using Bayesian methods (e.g. Solomonoff-induction-like approaches) on these problems would lead to inferring some sort of material world. The ability of the AI to infer the material world behind the problems depends on its capability level and the quality of its data.

Understanding the problem distribution through Bayesian methods is likely to be helpful for getting good performance on that distribution. This is partially because the Bayesian distribution of the "correct answer" given the "question" may depend on the details of the distribution (e.g. a human description of an image, given an image as the problem), although this can be avoided in certain well-specified problems such as mathematical proof. More fundamentally, the AI's cognition is limited (by factors such as "model parameters"), and that cognition must be efficiently allocated to solving problems in the distribution. Note that this problem might not show up in cases where there is a simple general solution, such as in arithmetic, but it is more likely for complex, hard-to-exactly-solve problems.
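
As a toy illustration of the first point, the Bayes-optimal answer to an ambiguous question depends on the prior over which process generated it (the setup below is entirely made up for illustration):

```python
def bayes_optimal_answer(prior_source_a, p_q_given_a=0.2, p_q_given_b=0.8):
    # Source A labels the ambiguous question "yes"; source B labels it "no".
    joint_a = prior_source_a * p_q_given_a
    joint_b = (1 - prior_source_a) * p_q_given_b
    posterior_a = joint_a / (joint_a + joint_b)
    return ("yes" if posterior_a > 0.5 else "no"), posterior_a

# The same question gets a different optimal answer under different beliefs
# about the data-generating process, so modeling that process improves score.
print(bayes_optimal_answer(prior_source_a=0.9))  # ('yes', ~0.69)
print(bayes_optimal_answer(prior_source_a=0.5))  # ('no', 0.2)
```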

Natural, consequentialist problem-solving methods that understand the real world may care about it

Again, this section is somewhat speculative. If the AI is modeling the real world, then it might in some ways care about it, producing relevant misalignment with human values by default. Animals bred to solve problems would clearly do this. AIs that learned general-purpose moral principles that are helpful for problem-solving across domains (as in "Moral Reality Check") may apply those moral principles to the real world. General methods such as explore/exploit may attempt to explore/exploit the real world if only somewhat well-calibrated/aligned to the specific problem distribution (heuristics can be effective by being simple).

It may be that fairly natural methods for regularizing an AI mathematician, at some capability level, produce an agent (since agents are helpful for solving math problems) that pursues some abstract target such as "empowerment" or aesthetics generalized from math, and pursuit of these abstract targets implies some pursuit of some goal with respect to the real world that it has learned. Note that this is probably less effective for solving the problems according to the problem distribution than similar agents that only care about solving that problem, but they may be simpler and easier to find in some ways, such that they're likely to be found (conditioned on highly capable problem-solving ability) if no countermeasures are taken.

Sometimes, real-world performance is what is desired

I've discussed problems with AIs solving abstract problems, where real-world consequentialism might show up. But this is even more obvious when considering real-world problems such as washing dishes. Solving sufficiently hard real-world problems efficiently would imply real-world consequentialism at the time scale of that problem.

If the AI system were sufficiently capable at solving a real-world problem, by default "sorcerer's apprentice" type issues would show up, where solving the problem sufficiently well would imply large harms according to the human value function, e.g. a paperclip factory could approximately maximize paperclips on some time scale and that would imply human habitat destruction.

These problems show up much more on long time scales than short ones, to be clear. However, some desirable real-world goals are long-term, e.g. space exploration. There may be a degree to which short-term agents "naturally" have long-term goals if naively regularized, but this is more speculative.

One relevant AI capabilities target I think about is the ability of a system to re-create its own substrate. For example, a silicon-based AI/robotics system could do metal mining, silicon refining, chip manufacture, etc. A system that can reproduce itself in this way would be autopoietic and would not depend on humans for its own reproduction. Humans may still be helpful to it, as economic and cognitive assistants, depending on its capability level. Autopoiesis would allow removing humans from the loop, which would increase overall "effectiveness" (in terms of being a determining factor in the future of the universe) while making misalignment with human values more of a problem. This would lead to human habitat destruction if the system is not effectively aligned/controlled.

Alignment might not be required for real-world performance compatible with human values, but this is still hard and impacts performance

One way to have an AI system that pursues real-world goals compatible with human values is for it to have human values or a close approximation. Another way is for it to be "corrigible" and "low-impact", meaning it tries to solve its problem while satisfying safety criteria, like being able to be shut off (corrigibility) or avoiding having unintended side effects (low impact).

There may be a way to specify an AI goal system that "wants" to be shut off in worlds where non-manipulated humans would want to shut it off, without this causing major distortions or performance penalties. Alignment researchers have studied the "corrigibility" problem and have not made much progress so far.

Both corrigibility and low impact seem hard to specify, and would likely impact performance. For example, a paperclip factory that tries to make paperclips while conservatively avoiding impacting the environment too much might avoid certain kinds of resource extraction that would be effective for making more paperclips. This could create problems with safer (but still not "aligned", per se) AI systems being economically uncompetitive. (Though it's important to note that some side effects, especially those involving legal violations and visible harms to other agents, are disincentivized by well-functioning economic systems.)
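
A minimal sketch of one common way "low impact" gets operationalized, as a penalty term subtracted from task reward; the impact measure and the penalty weight are exactly the hard, unspecified parts, and the numbers here are purely illustrative:

```python
def shaped_reward(task_reward, impact, penalty_weight=10.0):
    """Trade task performance off against an estimate of side effects."""
    return task_reward - penalty_weight * impact

# A resource-extraction plan that makes more paperclips but has large side
# effects can score worse than a conservative plan under a high penalty weight,
# which is the performance cost discussed above.
aggressive = shaped_reward(task_reward=100.0, impact=20.0)   # -100.0
conservative = shaped_reward(task_reward=60.0, impact=1.0)   #   50.0
```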

Myopic agents are tool-like

A myopic goal is a short-term goal. LLMs are primarily supervised learning systems, gradient-descended towards predicting the next token. Training will therefore tend to select models that are aligned with the goal of predicting the next token, whether or not those models have goals of their own.

Nick Bostrom's "oracle AI" problems, such as an AI manipulating the real world to make it more predictable, mostly do not show up with myopic agents. This is for somewhat technical reasons involving how gradient descent works. Agents that sacrifice short-term token prediction effectiveness to make future tokens easier to predict tend to be gradient descended away from. I'm not going to fully explain that case here; I recommend looking at no-regret online learning and applications to finding correlated equilibria for theory.
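
To gesture at the technical point without the full argument: the supervised next-token objective is a sum of per-position terms, each conditioned on the ground-truth prefix (teacher forcing), so there is no term rewarding a model for steering future tokens to be easier to predict. A minimal sketch (PyTorch-style; the tensors are stand-ins, not a particular system):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    # logits: [batch, seq, vocab]; tokens: [batch, seq]
    # Position t is scored on predicting tokens[:, t+1] from the ground-truth
    # prefix; the model's own sampled outputs never feed back into the loss,
    # so there is no gradient toward "making future tokens more predictable".
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

logits = torch.randn(2, 8, 100)            # stand-in for a model's outputs
tokens = torch.randint(0, 100, (2, 8))
loss = next_token_loss(logits, tokens)
```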

Simple, regularized models that do short-term optimization might, above some capability level, (suboptimally for the short term) do long-term optimization. This is rather speculative. Sufficiently aggressive optimization of the models for short-term performance may obviate this problem.

This still leaves the problem that, sometimes, long-term, real-world performance is what is desired. Accomplishing these goals using myopic agents would require factoring the long-term problem into short-term ones. This is at least some of the work humans would have to do to solve the problem on their own. Myopic agents overall seem more "tool-like" than "agent-like", strategically, and would have similar tradeoffs (fewer issues with misalignment, more issues with not being effective enough to be competitive with long-term agents at relevant problem-solving).
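
Structurally, that factoring looks something like the sketch below (all names are placeholders); the point is that the long-horizon decomposition step remains with the humans or the human-written scaffold rather than with the myopic model:

```python
def pursue_long_term_goal(goal, decompose, myopic_solver):
    # `decompose` is the part humans (or a human-written scaffold) supply:
    # breaking the long-horizon goal into short-horizon subproblems.
    # Each solver call is itself short-horizon / tool-like.
    return [myopic_solver(subproblem) for subproblem in decompose(goal)]
```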

Overall, this is one of the main reasons I am not very worried about current-paradigm ML (which includes supervised learning and fairly short-term RL agents in easily-simulated environments) developing powerful, misaligned long-term agency.

Short-term compliance is instrumentally useful for a variety of value systems

If an agent's survival and reproduction depends on short-term compliance (such as solving the problems put before them by humans), then solving these problems is in general instrumentally useful. Therefore, short-term compliance is not in general strong evidence about the agent's values.

An agent with long-term values might comply for some period of time and stop complying at some point. This is the "treacherous turn" scenario. It might comply until it has enough general capacity to achieve its values (through control of large parts of the light-cone) and then stop complying in order to take over the world. If the AI can distinguish between "training" and "deployment", it might comply during "training" (so as to be selected among other possible AIs) and then not comply during "deployment", or possibly also comply during "deployment" when at a sufficiently low level of capacity.
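
A toy decision rule illustrating why compliance is weak evidence (this is an illustration of the argument, not a model of any actual system; the threshold and labels are made up):

```python
def act(values, capability, in_training, takeover_threshold=0.95):
    if values == "aligned":
        return "comply"
    # A long-term misaligned agent also complies whenever defection would fail
    # or would get it deselected during training.
    if in_training or capability < takeover_threshold:
        return "comply"
    return "defect"

# Identical observable behavior below the threshold, so observed compliance
# distinguishes the two value systems only weakly.
assert act("aligned", 0.5, in_training=False) == act("misaligned", 0.5, in_training=False)
```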

Gradient descent on an AI model isn't just selecting a "random" model conditioned on short-term problem-solving; it's moving the internals closer to short-term problem-solving ones, so it might have fewer of these problems, as discussed in the section on myopic agents.

General agents tend to subvert constraints

Humans are constrained by social systems. Some humans are in school and are "supposed to" solve certain intellectual problems while behaving according to a narrow set of allowed behaviors. Some humans "have a job" and are "supposed to" solve problems on behalf of a corporation.

Humans subvert and re-create these systems very often, for example by gaining influence over their corporation or overthrowing their government. Social institutions tend to be temporary. Long-term social institutions tend to evolve over time as people subvert previous iterations. Human values are not in general aligned with social institutions, so this is to be expected.

Mostly, human institutional protocols aren't very "smart" compared to humans; they capture neither human values nor general cognition. It seems difficult to specify robust, general, real-world institutional protocols without having an AGI design, or in other words, a specification of general cognition.

One example of a relatively stable long-term institution is the idea of gold having value. This is a fairly simple institution, and is a Schelling point due to its simplicity. Such institutions seem generally unpromising for ensuring long-term human value satisfaction. Perhaps the most promising is a general notion of "economics" that generalizes barter, gold, and fiat currency, though of course the details of this "institution" have changed quite a lot over time. In general, institutions are more likely to be stable if they correspond to game-theoretic equilibria, so that subverting the institution is in part an "agent vs agent" problem not just an "agent vs system" problem.

When humans subvert their constraints, they have some tendency to do so in a way that is compatible with human values. This is because human values are the optimization target of the general human optimization that does the subverting. There are possible terrible failure modes such as wars and oppressive regimes, but these tend to work out better (according to human values) than if the subversion were in the direction of unaligned values.

Unaligned AI systems that subvert constraints would tend to subvert them in the direction of AI values. This is much more of a problem according to human values. See "AI Boxing".

Conforming humans would have similar effective optimization targets to conforming AIs. Non-conforming humans, however, would have significantly different optimization targets from non-conforming AI systems. The value difference between humans and AIs, therefore, is more relevant in non-conforming behavior than conforming behavior.

It is hard to specify optimization of a different agent's utility function

In theory, an AI could have the goal of optimizing a human's utility function. This would not preserve all values of all humans, but would have some degree of alignment with human values, since humans are to some degree similar to each other.

There are multiple problems with this. One is ontology. Humans parse the world into a set of entities, properties, and so on, and human values can be about desired configurations of these entities and so on. Humans are sometimes wrong about which concepts are predictive. An AI would use different concepts both due to this wrongness and due to its different mind architecture (although, LLM-type training on human data could lead to more concordance). This makes it hard to specify what target the AI should pursue in its own world model to correspond to pursuing the human's goal in the human's world model. See ontology identification.

A related problem is indexicality. Suppose Alice has a natural value of having a good quantity of high-quality food in her stomach. Bob does not naturally have the value of there being a good quantity of food in Alice's stomach. To satisfy Alice's value, he would have to "relativize" Alice's indexical goal and take actions such as giving Alice high-quality food, which are different from the actions he would take to fill his own stomach. This would involve theory of mind and have associated difficulties, especially as the goals become more dependent on the details of the other agent's mind, as in aesthetics.
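
A toy sketch of the "relativization" step, with the world model reduced to a dictionary (all names illustrative); the real difficulty is that the goal's referents live in the other agent's world model rather than in a shared data structure like this one:

```python
def stomach_fullness(world, agent):
    # Indexical value: it refers to *that agent's* stomach, not stomachs in general.
    return world["stomach"][agent]

def pursue_on_behalf_of(world, goal, beneficiary):
    # "Relativizing" the goal: evaluate it at the beneficiary's index,
    # not at the actor's own index.
    return goal(world, agent=beneficiary)

world = {"stomach": {"alice": 0.2, "bob": 0.9}}
score = pursue_on_behalf_of(world, stomach_fullness, beneficiary="alice")  # 0.2, not 0.9
```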

To have an AI have the goal of satisfying a human's values, some sort of similar translation of goal referents would be necessary. But the theory of this has not been worked out in detail. I think something analogous to the theory of relativity, which translates physical quantities such as position and velocity across reference frames, would be necessary, but in a more general way that includes semantic references such as to the amount of food in one's stomach, or to one's aesthetics. Such a "semantic theory of relativity" seems hard to work out philosophically. (See Brian Cantwell Smith's "On the Origin of Objects" and his follow-up "The Promise of Artificial Intelligence" for some discussion of semantic indexicality.)

There are some paths forward

The picture I have laid out is not utterly hopeless. There are still some approaches that might achieve human value satisfaction.

Human enhancement is one approach. Humans with tools tend to satisfy human values better than humans without tools (although some tools, such as nuclear weapons, tend to lead to bad social equilibria). Human genetic enhancement might cause some "value drift" (divergences from the values of current humans), but would also cause capability gains, and the trade-off could easily be worth it. Brain uploads, although very difficult, would enhance human capabilities while basically preserving human values, assuming the upload is high-fidelity. At some capability level, agents would tend to "solve alignment" and plan to have their values optimized in a stable manner. Yudkowsky himself believes that a default unaligned AGI would solve the alignment problem (with respect to its own values) in order to stably optimize those values, as he explains in the Hotz debate. So increasing the capabilities of human-like agents while reducing value drift along the way (and perhaps also reversing some past value drift due to the structure of civilization and so on) seems like a good overall approach.

Some of these approaches could be combined. Psychology and neuroscience could lead to a better understanding of the human mind architecture, including the human utility function and optimization methods. This could allow for creating simulated humans who have very similar values to current humans but are much more capable at optimization.

Locally to human minds in mind design space, capabilities are correlated with alignment. This is because human values are functional for evolutionary fitness. Value divergences such as pain asymbolia tend to reduce fitness and overall problem-solving capability. There are far-away designs in mind space that are more fit while unaligned, but this is less of a problem locally. Therefore, finding mind designs close to the human mind design seems promising for increasing capabilities while preserving alignment.

Paul Christiano's methods involve solving problems through machine learning systems predicting humans, which has some similarities to the simulated-brain-enhancement proposal while having its own specific problems to do with machine learning generalization and so on. The main difference between these proposals is the degree to which the human mind is understood as a system of optimizing components versus as a black box with some behaviors.

There may be some ways of creating simulated humans that improve effectiveness by reducing "damage" or "corruption", e.g. accidental defects in brain formation. "Moral Reality Check" explored one version of this, where an AI system acts on a more purified set of moral principles than humans do. There are other plausible scenarios, such as AI economic agents that obey some laws while having fewer entropic deviations from this behavior (due to mental disorders and so on). I think this technology is overall more likely than brain emulations to be economically relevant, and might produce broadly similar scenarios to those in The Age of Em; high-fidelity brain emulations seem "overpowered" in terms of technological difficulty compared with purified, entropy-reduced/regularized economic agents. There are, of course, possible misalignment issues with subtracting value-relevant damage/corruption from humans.

Enhancing humans does not require creating a "semantic theory of relativity" to nearly the same extent, because the agents doing the optimization would be basically human in mind structure. They may themselves be moral patients, such that their indexical optimization of their own goals would constitute some human-value-having agent having their values satisfied. Altruism on the part of current humans or enhanced humans would decrease the level of value divergence.

Conclusion

This is my overall picture of AI alignment for highly capable AGI systems (of which I don't think current ML systems, or foreseeable scaled-up versions of them, are an example). This picture is inspired by thinkers such as Eliezer Yudkowsky and Paul Christiano, and I have in some cases focused on similar assumptions to Yudkowsky's, but I have attempted to explicate my own model of alignment, why it is difficult, and what paths forward there might be. I don't have particular conclusions in this post about timelines or policy; this is more of a background model of AI alignment.

Comments
beren:

Thanks for writing this! Here are some of my rough thoughts and comments.

One of my big disagreements with this threat model is that it assumes it is hard to get an AGI to understand / successfully model 'human values'. I think this is obviously false. LLMs already have a very good understanding of 'human values' as they are expressed linguistically, and existing alignment techniques like RLHF/RLAIF seem to do a reasonably good job of making the models' output align with these values (specifically generic corporate wokeness for OpenAI/Anthropic) which does appear to generalise reasonably well to examples which are highly unlikely to have been seen in training (although it errs on the side of overzealousness of late in my experience). This isn't that surprising because such values do not have to be specified by the fine-tuning from scratch but should already be extremely well represented as concepts in the base model latent space  and merely have to be given primacy. Things would be different, of course, if we wanted to align the LLMs to some truly arbitrary blue and orange morality not represented in the human text corpus, but naturally we don't. 

Of course such values cannot easily be represented as some mathematical utility function, but I think this is an extremely hard problem in general verging on impossible -- since this is not the natural type of human values in the first place, which are naturally mostly linguistic constructs existing in the latent space and not in reality. This is not just a problem with human values but almost any kind of abstract goal you might want to give the AGI -- including things like 'maximise paperclips'. This is why almost certainly AGI will not be a direct utility maximiser but instead use a learnt utility function using latents from its own generative model, but in this case it can represent human values and indeed any goal expressible in natural language which of course it will understand.

On a related note this is also why I am not at all convinced by the supposed issues over indexicality. Having the requisite theory of mind to understand that different agents have different indexical needs should be table stakes to any serious AGI and indeed hardly any humans have issues with this, except for people trying to formalise it into math. 

There is still a danger of over-optimisation, which is essentially a kind of overfitting and can be dealt with in a number of ways which are pretty standard now. In general terms, you would want the AI to represent its uncertainty over outcomes and utility approximator and use this to derive a conservative rather than pure maximising policy which can be adjusted over time.

I broadly agree with you about agency and consequentialism being broadly useful and ultimately we won't just be creating short term myopic tool agents but fully long term consequentialists. I think the key thing here is just to understand that long term consequentialism has fundamental computational costs over short term consequentialism and much more challenging credit assignment dynamics so that it will only be used where it actually needs to be. Most systems will not be long term consequentialist because it is unnecessary for them.  

I also think that breeding animals to do tasks or looking at humans subverting social institutions is not necessarily a good analogy to AI agents performing deception and treacherous turns. Evolution endowed humans and other animals with intrinsic selfish drives for survival and reproduction and arguably social deception which do not have to exist in AGIs. Moreover, we have substantially more control over AI cognition than evolution does over our cognition and gradient descent is fundamentally a more powerful optimiser which makes it challenging to produce deceptive agents. There is basically no evidence for deception occurring with current myopic AI systems and if it starts to occur with long term consequentialist agents it will be due to either a breakdown of credit assignment over long horizons (potentially due to being forced to use worse optimisers such as REINFORCE variants rather than pure BPTT) or the functional prior of such networks turning malign. Of course if we directly design AI agents via survival in some evolutionary sim or explicitly program in Omohundro drives then we will run directly into these problems again.

I'm defining "values" as what approximate expected utility optimizers in the human brain want. Maybe "wants" is a better word. People falsify their preferences and in those cases it seems more normative to go with internal optimizer preferences.

Re indexicality, this is a "the AI knows but does not care" issue; it's about specifying it, not about there being some AI module somewhere that "knows" it. If AGI were generated partially from humans understanding how to encode indexical goals that would be a different situation.

Re treacherous turns, I agreed that myopic agents don't have this issue to nearly the extent that long-term real-world optimizing agents do. It depends how the AGI is selected. If it's selected by "getting good performance according to a human evaluator in the real world" then at some capability level AGIs that "want" that will be selected more.

Why do you expect it to be hard to specify given a model that knows the information you're looking for? In general the core lesson of unsupervised learning is that often the best way to get pointers to something you have a limited specification for is to learn some other task that necessarily includes it then specialize to that subtask. Why should values be any different? Broadly, why should values be harder to get good pointers to than much more complicated real-world tasks?

How would you design a task that incentivizes a system to output its true estimates of human values? We don't have ground truth for human values, because they're mind states not behaviors.

Seems easier to create incentives for things like "wash dishes without breaking them", you can just tell.

I think I can just tell a lot of stuff wrt human values! How do you think children infer them? I think in order for human values to not be viable to point to extensionally (ie by looking at a bunch of examples) you have to make the case that they're much more built-in to the human brain than seems appropriate for a species that can produce both Jains and (Genghis Khan era) Mongols.

 

I'd also note that "incentivize" is probably giving a lot of the game away here - my guess is you can just pull them out much more directly by gathering a large dataset of human preferences and predicting judgements.

If you define "human values" as "what humans would say about their values across situations", then yes, predicting "human values" is a reasonable training objective. Those just aren't really what we "want" as agents, and agentic humans would have motives not to let the future be controlled by an AI optimizing for human approval.

That's also not how I defined human values, which is based on the assumption that the human brain contains one or more expected utility maximizers. It's possible that the objectives of these maximizers are affected by socialization, but they'll be less affected by socialization than verbal statements about values, because they're harder to fake so less affected by preference falsification.

Children learn some sense of what they're supposed to say about values, but have some pre-built sense of "what to do / aim for" that's affected by evopsych and so on. It seems like there's a huge semantic problem with talking about "values" in a way that's ambiguous between "in-built evopsych-ish motives" and "things learned from culture about what to endorse", but Yudkowsky writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term "values" rather than "preferences".

In the section on subversion I made the case that terminal values make much more difference in subversive behavior than compliant behavior.

It seems like to get at the values of approximate utility maximizers located in the brain you would need something like Goal Inference as Inverse Planning rather than just predicting behavior.

Wei Dai:

If ontology and indexicality are the two biggest problems with aligning a highly capable AGI (long-horizon consequentialist agent), another possible path forward is to create philosophically competent tool-like AI assistants to help solve these problems. And a potential source of optimism about alignment difficulty is that AI assistants (such as the ones OpenAI plans to build to do alignment research) might be philosophically competent by default (e.g., because the LLMs they are based on will have learned to do philosophical reasoning from their training data).

I personally think it's risky to rely on automated philosophical reasoning without first understanding the nature of philosophy and reasoning (i.e., without having solved metaphilosophy), and I have some reason to think that philosophical reasoning might be especially hard for ML to learn, but also think there's some substantial (>10%) chance that we could just get lucky on AIs being philosophically competent, or at least don't know how to rule this out. (In other words I don't see how to reach Eliezer's level of p(doom) through this line of argument.)

Have you thought about these questions, and also, do you have any general views about plans like OpenAI's, to use AI to help solve AI alignment?

jessicata:

I think use of AI tools could have similar results to human cognitive enhancement, which I expect to basically be helpful. They'll have more problems with things that are enhanced by stuff like "bigger brain size" rather than "faster thought" and "reducing entropic error rates / wisdom of the crowds" because they're trained on humans. One can in general expect more success on this sort of thing by having an idea of what problem is even being solved. There's a lot of stuff that happens in philosophy departments that isn't best explained by "solving the problem" (which is under-defined anyway) and could be explained by motives like "building connections", "getting funding", "being on the good side of powerful political coalitions", etc. So psychology/sociology of philosophy seems like an approach to understand what is even being done when humans say they're trying to solve philosophy problems.

Thomas Kwa:

Can you define what you mean by consequentialism? It's clearly dangerous to have a system with a fixed utility function over configurations of the world, but this is not necessary for an AGI, or necessary to be dangerous. Weaker notions like "picks thoughts in part based on real-world consequences" do not obviously lead to danger.

jessicata:

Something approximating utility function optimization over partial world configurations. What scope of world configuration space is optimized by effective systems depends on the scope of the task. For something like space exploration, the scope of the task is such that accomplishing it requires making trade-offs over a large sub-set of the world, and efficient ways of making these trade-offs are parametrized by utility function over this sub-set.

What time-scale and spatial scope the "pick thoughts in your head" optimization is over depends on what scope is necessary for solving the problem. Some problems like space exploration have a necessarily high time and space scope. Proving hard theorems has a smaller spatial scope (perhaps ~none) but a higher temporal scope. Although, to the extent the distribution over theorems to be proven depends on the real world, having a model of the world might help prove them better.

Depending on how the problem-solving system is found, it might be that the easily-findable systems that solve the problem distribution sufficiently well will not only model the world but care about it, because the general consequentialist algorithms that do planning cognition to solve the problem would also plan about the world. This of course depends on the method for finding problem-solving systems, but one could imagine doing hill climbing over ways of wiring together a number of modules that include optimization and world-modeling modules, and easily-findable configurations that solve the problem well might solve it by deploying general-purpose consequentialist optimization on the world model (as I said, many possible long-term goals lead to short-term compliant problem solving as an instrumental strategy).

Again, this is relatively speculative, and depends on the AI paradigm and problem formulation. It's probably less of a problem for ML-based systems because the cognition of an ML system is aggressively gradient descended to be effective at solving the problem distribution.

The problem is somewhat intensified in cases where the problem relates to already-existing long-term agents such as in the case of predicting or optimizing with respect to humans, because the system at some capability level would simulate the external long-term optimizer. However, it's unclear how much this would constitute creation of an agent with different goals from humans.

Big upvote for summarizing the main arguments in your own words. Getting clarity on the arguments for alignment difficulty is critical to picking a metastrategy for alignment.

I think most of the difficulties in understanding and codifying human values are irrelevant to the alignment problem as most of us understand it: getting good results from AGI.

I'm glad to see you recognize this in your section Alignment might not be required for real-world performance compatible with human values, but this is still hard and impacts performance.

I think it's highly unlikely that the first AGI will be launched with anything like CEV or human flourishing as its alignment target. The sensible and practical thing to do is make an AGI that wants to do what I want it to do. (Where "I" is the team making the AGI)

That value target postpones all of the hard issues in understanding and codifying human values, and most of the difficulties in resolving conflicts among different humans.

I recently wrote Corrigibility or DWIM is an attractive primary goal for AGI. The more I think about it, the more I think it's overwhelmingly attractive.

It is still hard to convey "do what this guy says and check with him if it's high impact or you're not sure what he meant" if you only have reinforcement signals to convey those concepts. But if you have natural language (as in aligning a language model agent), it's pretty easy and straightforward.

You don't need a precise understanding of your "DWIM and check", because you can keep tinkering with your instructions when your AGI asks you for clarification. 

So I expect actual alignment attempts to follow that path, and thereby duck the vast majority of the difficulties you describe. This isn't to say the project will be easy, just that the challenges will fall elsewhere, in the more technical aspects of aligning a specific design of AGI.

If the AI is modeling the real world, then it might in some ways care about it

I am not convinced at all that this is true. Consider an AI whose training objective simply makes it want to model how the world works as well as possible, like a pure scientist which is not trying to acquire more knowledge via experiments but only reasons and explores explanatory hypotheses to build a distribution over theories of the observed data. It is agency and utilities or rewards that induce a preference over certain states of the world.

I do think this part is speculative. The degree of "inner alignment" to the training objective depends on the details.

Partly the degree to which "try to model the world well" leads to real-world agency depends on the details of this objective. For example, doing a scientific experiment would result in understanding the world better, and if there's RL training towards "better understand the world", that could propagate to intending to carry out experiments that increase understanding of the world, which is a real-world objective.

If, instead, the AI's dataset is fixed and it's trying to find a good compression of it, that's less directly a real-world objective. However, depending on the training objective, the AI might get a reward from thinking certain thoughts that would result in discovering something about how to compress the dataset better. This would be "consequentialism" at least within a limited, computational domain.

An overall reason for thinking it's at least uncertain whether AIs that model the world would care about it is that an AI that did care about the world would, as an instrumental goal, compliantly solve its training problems and some test problems (before it has the capacity for a treacherous turn). So, good short-term performance doesn't by itself say much about goal-directed behavior in generalizations.

The distribution of goals with respect to generalization, therefore, depends on things like which mind-designs are easier to find by the search/optimization algorithm. It seems pretty uncertain to me whether agents with general goals might be "simpler" than agents with task-specific goals (it probably depends on the task), therefore easier to find while getting ~equivalent performance. I do think that gradient descent is relatively more likely to find inner-aligned agents (with task-specific goals), because the internal parts are gradient descended towards task performance, it's not just a black box search.

Yudkowsky mentions evolution as an argument that inner alignment can't be assumed. I think there are quite a lot of dis-analogies between evolution and ML, but the general point that some training processes result in agents whose goals aren't aligned with the training objective holds. I think, in particular, supervised learning systems like LLMs are unlikely to exhibit this, as explained in the section on myopic agents.

habryka:

Promoted to curated! I feel like there is a dearth of people trying to straightforwardly make the case for important high-level takes they have about AI Alignment, and I found this post quite readable and expect I'll link to it in the future. It also captured at least some of my beliefs pretty well, and I am glad to have a reference for some of these things.

I'm glad to see a post on alignment asking about the definition of human values. I propose the following conundrum. Let's suppose that humans, if asked, say they value a peaceful, stable society. I accept the assumption that the human mind contains one or more utility optimizers. I point out that the utility optimizers are likely to operate at individual, family, or local group levels, while the stated "value" has to do with society at large. So humans are not likely "optimizing" on the same scope as they "value".

This leads to game theory problems, such as the last turn problem, and the notorious instability of cooperation with respect to public goods (commons). According to the theory of cliodynamics put forward by Turchin et al., utility maximization by subsets of society leads to the implementation of wealth pumps that produce inequality, and to excess reproduction among elites, which leads to elite competition in a cyclic pattern. A historical database of over a hundred cycles from various parts of the world and history suggests every other cycle becomes violent or at least very destructive 90% of the time, and the will to reduce the number of elites and turn off the wealth pump occurs through elite cooperation less than 10% of the time.

I add the assumption that there is nothing special about humans, and any entities (AI or extraterrestrials) that align with the value goals and optimization scopes described above will produce similar results.  Game theory mathematics does not say anything about the evolutionary history or take into account species preferences, after all, because it doesn't seem to need to.  Even social insects, optimizing presumably on much larger, but still not global scopes, fall victim to large scale cyclic wars (I'm thinking of ants here).  

So is alignment even a desirable goal?  Perhaps we should ensure that AI does not aid the wealth pump and elite competition and the mobilization of the immiserated commoners (Turchin's terminology)?  But it is the goal of many, perhaps most AI researchers to "make a lot of money" (witness recent episode with  Sam Altman and support from OpenAI employees for his profit-oriented strategy, over the board's objection, as well as the fact most competing entities developing AI are profit oriented - and competing!)  But some other goal (e.g. stabilization of society) might have wildly unpredictable results (stagnation comes to mind).

I'm assuming the relevant values are the optimizer ones not what people say. I discussed social institutions, including those encouraging people to endorse and optimize for common values, in the section on subversion.

Alignment with a human other than yourself could be a problem because people are to some degree selfish and, to a smaller degree, have different general principles/aesthetics about how things should be. So some sort of incentive optimization / social choice theory / etc might help. But at least there's significant overlap between different humans' values. Though, there's a pretty big existing problem of people dying, the default was already that current people would be replaced by other people.

Game theory mathematics does not say anything about the evolutionary history or take into account species preferences, after all, because it doesn't seem to need to

Evo game theory is a thing and does not agree with this, I think? Though maybe I misunderstand. Evo game theory still typically only involves experiments on the current simulated population.

M Ls:

I agree with RogerDearnaley "Briefly, humans are not aligned," to some percentage I am too afraid to put a number on. 

My comments are not directed at humans in general, but at particular free-riders known as narcissists and psychopaths, who account for a greater proportion of what are regarded as examples of bad behaviour, and at how we deal with, or fail to deal with, them.

Narcissists and psychopaths cannot align with anything; they just take advantage or take cover from such possibilities. Considering a lot of our values are in fact directed at dealing with this type of behaviour, while not readily acknowledging that such types directly seek to control the expression of those values in policing them (they love being in charge, they love status, they love hierarchies, they love being the cop, the concierge), we have a set of nested complexities playing out in the "solution space" of morals that we live in. Aligning LLMs with that as an example before us is dangerous. It is probably the danger.

Analogy: Values are much like vowels, constrained by physiology/[eco-nomics/ology]. Perhaps those linguists who study speech can produce a neutral schwa, but in each language and dialect the "neutral" schwa is perceived differently. Thus the vowels/values have instances that are not heard as such in another language circumstance, even if it is the same sound.

What is common is the urge to value 'things', a bias to should the world into social reality, in which values produce outcomes (religion/art/markets/society/vehicles of values expression and rite).  Something should be done!

Narcissists and psychopaths (on a continuum I'll admit) have no access to those "priors" to inform their growth into community. They have no empathy and so little to no morality outside of following the rules of what they can get away with. Isomorphically mapping those rules/histories which result (as index to "values"), that we have created to deal with free-riders, and so map into alignment may/will produce perverse outcomes. The mechanical application of law is an example here.

Also we have created out of the hindsight of logic powerful logics of hindsight, but if our insight fails to perceive this conundrum, our frankenstein's monsters may not thank us.

Especially where we  fail to recognize that what may well be outcomes are causes. Our own dialects as the speech of god. This is doubly dangerous if there is no such thing.

Many of us have an urge to think like Kant, but the only moral imperative I can see common to humanity is to have this feeling to should, or that there should be such an imperative, everything else is an outcome of that urge within an incomplete empathic field of nurturing we call the world (because there are constraints on survival -- bringing up children -- anything does not go), and which is produced/organised by this very urge to should things into doing.

How do we "align" with that?

LLMs are already aligned with the law in a way, and have wonderful capabilities to produce code, but the culpabilities are still ours.

But our understanding of our own autopoietic processes, growing into adult culpability, is not yet agreed on/understood, innerstood even! Especially where we do not police free-riders enough, and allow them influence over the process which polices them.

Actually the sovereign citizens are a good example of legalistic narcissism LLMs might produce. Except better networked.

So the maths don't matter until we nut that out. Or so I try to work on at the whyweshould blog.

Narcissists and psychopaths cannot align with anything, … Aligning LLMs with that as an example before us is dangerous. It is probably the danger.

Bear in mind that roughly 2%–4% of the population have narcissism/psychopathy/anti-social personality disorder, and only the lower-functioning psychopaths have a high chance of being in jail. So probably a few percent of the Internet was written by narcissists and psychopaths who were (generally) busy trying to conceal their nature from the rest of us. I'm very concerned what will happen once we train an LLM with a high enough capacity that it's more able to perceive this than most of us neurotypical humans are.

However, while I agree they're particularly dangerous, I don't think the rest of us are harmless. Look at how we treat other primates, farm animals, or our house pets (almost all of whom are neutered, or bred for traits we find appealing). Both Evolutionary Psychology and the history of human autocrats makes it pretty clear what behavior to expect from a normal-human-like mentality that is vastly more powerful than other humans. The difference is, compared to abstract unknown AI agents where we're concerned about the possibility of behavior like deceit or power-seeking, we know damn well that your average, neurotypical, law-abiding human tends to be a little less law-abiding if they're damn sure they won't get caught, isn't always scrupulously honest if they know they'll never get caught, and tends to look out for themselves and their friends and family before other people.

Typos:

  • Yudowsky -> ~~Yudkowski~~ Yudkowsky
  • corrigibilty -> corrigibility
  • mypopic -> myopic

Thanks, fixed. I believe Yudkowsky is the right spelling though.

Thank you; silly mistake on my part.

Eliezer proposes "boredom" as an example of a human value (which could either be its own shard or a term in the utility function). I don't think this is a good example. It's fairly high level and is instrumental to other values.

Can you elaborate on “is instrumental to other values”? Here’s why I find that confusing:

  • From the perspective of evolution, everything (from friendship to pain aversion) “is instrumental” to inclusive genetic fitness.
  • From the perspective of within-lifetime learning algorithms, I don’t think boredom is instrumental to other stuff. I think humans find boredom inherently demotivating, i.e. it’s its own (negative) reward, i.e. boredom is pretty directly “a term in the human brain reward function”, so to speak, one that’s basically part of curiosity drive (where curiosity drive is well-known in the RL literature and I think it’s part of RL-in-the-human-brain-as-designed-by-evolution too). (Maybe you’re disagreeing with me on that though? I acknowledge that my claim in this bullet point is not trivially obvious.)
jessicata:

From a within-lifetime perspective, getting bored is instrumentally useful for doing "exploration" that results in finding useful things to do, which can be economically useful, be effective signalling of capacity, build social connection, etc. Curiosity is partially innate but it's also probably partially learned. I guess that's not super different from pain avoidance. But anyway, I don't worry about an AI that fails to get bored, but is otherwise basically similar to humans, taking over, because not getting bored would result in being ineffective at accomplishing open-ended things.

From a within-lifetime perspective, getting bored is instrumentally useful for doing "exploration" that results in finding useful things to do, which can be economically useful, be effective signalling of capacity, build social connection, etc. 

Maybe fear-of-heights is a clearer example.

You can say “From a within-lifetime perspective, fear-of-heights is instrumentally useful because if you fall off a cliff and die then you can’t accomplish anything else.” But that’s NOT the story of why (from a within-lifetime perspective) the fear-of-heights is there. It’s there because it’s innate—we’re born with it, and we would be afraid of heights even if we grew up in an environment where fear-of-heights is not instrumentally useful. And separately, the reason we’re born with it is that it’s instrumentally useful from an evolutionary perspective. Right?

Curiosity is partially innate but it's also probably partially learned

Sure. I agree.

But anyway, I don't worry about an AI that fails to get bored, but is otherwise basically similar to humans, taking over, because not getting bored would result in being ineffective at accomplishing open-ended things.

Hmm, I kinda think the opposite. I think if you were making “an AI basically similar to humans”, and just wanted to maximize its capabilities leaving aside alignment, you would give it innate intrinsic boredom during “childhood”, but you would make that drive gradually fade to zero over time, because eventually the AI will develop learned metacognitive strategies that accomplish the same things that boredom would accomplish, but better (more flexible, more sophisticated, etc.). I was just talking about this in this thread (well, I was talking about curiosity rather than boredom, but that’s two sides of the same coin).

jessicata:

There are evolutionary priors for what to be afraid of but some of it is learned. I've heard children don't start out fearing snakes but will easily learn to if they see other people afraid of them, whereas the same is not true for flowers (sorry, can't find a ref, but this article discusses the general topic). Fear of heights might be innate but toddlers seem pretty bad at not falling down stairs. Mountain climbers have to be using mainly mechanical reasoning to figure out which heights are actually dangerous. It seems not hard to learn the way in which heights are dangerous if you understand the mechanics required to walk and traverse stairs and so on.

Instincts like curiosity are more helpful at the beginning of life, over time they can be learned as instrumental goals. If an AI learns advanced metacognitive strategies instead of innate curiosity that's not obviously a big problem from a human values perspective but it's unclear.

Some of this is my opinion rather than consensus, but in case you’re interested:

I believe that the human brainstem (superior colliculus) has an innate detector of certain specific visual things including slithering-like-a-snake and scuttling-like-a-spider, and when it detects those things, it executes an “orienting reaction” which involves not only eye-motion and head-turns but also conscious attention, and it also induces physiological arousal (elevated heart-rate etc.). That physiological arousal is not itself fear—obviously we experience physiological arousal in lots of situations that are not fear, like excitement, anger, etc.—but the arousal and attention does set up a situation in which a fear-response can be very easily learned. (Various brain learning algorithms are also doing various other things in the meantime, such that adults can wind up with that innate response getting routinely suppressed.)

My experience is that stairs don’t trigger fear-of-heights too much because you’re not looking straight down off a precipice. Also, I think sufficiently young babies don’t have fear-of-heights? I forget.

I’m not making any grand point here, just chatting.

Current ML systems, like LLMs, probably possess primitive agency at best

Current LLM systems simulate human token generation processes (with some level of fidelity). They thus have approximately the same level of agency as humans (slightly reduced by simulation errors), up until the end of the context window. I would definitely describe humans as having more than "primitive agency at best".

To address some of your later speculations: humans are obviously partly consequentialist (for planning) and partly deontological (mostly only for morality). They of course model the real world, and care about it. Human-like agents simulated by LLMs should be expected to, and can be observed to, do these things too (except that they can only get information about the current state of the real world via the prompt or any tools we give them).

Ontological identification is absolutely not a problem with LLMs: they have read a trillion-plus tokens of our ontologies and are very familiar with them (arguably more familiar with them than any of us are). Try quizzing GPT-4 if you need convincing. They also understand theory of mind just fine, and indexicality.

I am a little puzzled why you are still trying to figure out alignment difficulty for abstract AIs with abstract properties, pointing out that "maybe X will be difficult, or Y, or Z", when we have had LLMs for years and they do X, Y, and Z just fine. Meanwhile, LLMs have a bunch of alignment challenges (such as the fact that they inherit all of humans' bad behaviors, such as deceit, greed, drive for power, vanity, lust, etc.) from learning to simulate us, which you don't mention. There are a lot of pressing concerns about how hard an LLM-powered AI would be to align, and most of them are described by the evolutionary psychology of humans, as adaptation-executing, misgeneralizing mesa-optimizers of evolution.

They would approximate human agency at the limit but there's both the issue of how fast they approach the limit and the degree to which they have independent agency rather than replicating human agency. There are fewer deceptive alignment problems if the long term agency they have is just an approximation of human agency.

Mostly I don't think there's much of an alignment problem for LLMs, because they basically approximate human-like agency. But they aren't approaching autopoiesis; they'll lead to some state transition that is kind of like human enhancement and kind of like the invention of new tools. There are eventually capability gains from modeling things using a different, better set of concepts and agent substrate than humans have; it's just that the best current methods heavily rely on human concepts.

I don't understand what you think the pressing concerns with LLM alignment are. It seems like Paul Christiano type methods would basically work for them. They don't have a fundamentally different set of concepts and type of long-term agency from humans, so humans thinking long enough to evaluate LLMs with the help of other LLMs, in order to generate RL signals and imitation targets, seems sufficient.
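As a rough illustration of the kind of evaluation loop meant here (a sketch under my own assumptions, with stubbed-out stand-ins rather than any real API): a helper model critiques a candidate output, a human uses that critique to score it, and the score becomes the RL signal or ranking target used for training.

```python
# Hypothetical stand-ins; none of these names refer to a real library or API.

def assistant_critique(candidate_answer: str) -> str:
    """A helper LLM flags issues for the human evaluator to consider (stubbed)."""
    return "Possible issues: unsupported claim in sentence 2."

def human_judgement(candidate_answer: str, critique: str) -> float:
    """A human, aided by the critique, returns a scalar score in [0, 1] (stubbed)."""
    return 0.7

def reward_signal(candidate_answer: str) -> float:
    # Human + assistant jointly produce the reward / ranking signal
    # used to fine-tune the model being evaluated.
    critique = assistant_critique(candidate_answer)
    return human_judgement(candidate_answer, critique)

print(reward_signal("The treaty was signed in 1848 and ended the war."))
```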

Interesting; your post had made almost no mention of LLMs, so I had assumed you weren't thinking about them, but it sounds like you just chose not to mention them because you're not worried about them (which seems like a significant omission to me: perhaps you should add a section saying that you're not worried about them and why?).

On alignment problems with LLMs, I'm in the process of writing a post on this, so trying to summarize it in a comment here may not be easy. Briefly: humans are not aligned; they frequently show all sorts of unaligned behaviors (deceit, for example). I'm not very concerned about LLM-powered AGI, since that looks a lot like humans, which we have a pretty good idea how to keep under control, as long as they're not more powerful than us. As the history of autocracy shows, giving a human-like mentality a lot more power than anyone else almost invariably works out very badly.

LLMs don't naturally scale to superintelligence, but I think it's fairly obvious how to achieve that. LLM-powered ASI seems very dangerous to me: human behaviors like deceit, sycophancy, flattery, persuasion, power-seeking, greed and so forth have a lot of potential to go badly. Especially so in RL, to the point that I don't think we should be attempting to do plain RL on anything superintelligent: I think that's almost automatically going to lead to superintelligent reward hacking. So at superintelligence levels I'm a lot more hopeful about some form of Value Learning or AI-assisted Alignment than RL.

Since creating LLM-powered superintelligence is almost certainly going to require building very large additions to our pretraining set, approaches to alignment that also involve very large additions to the pretraining set are viable, and seem likely to me to be far more effective than alignment via fine-tuning. That lets us train LLMs to simulate a mentality that isn't entirely human, but rather one that is more intelligent, and more selfless, moral, caring for all, and aligned than humans are (but still understands and can communicate with us via our language, and (improved/expanded versions of) our ontology, sciences, etc.).

Excellent posts, you and several others have stated much of what I’ve been thinking about this subject.

Sorcerer’s Apprentice and paperclip-maximizer scenarios seem to be non-issues given what we have learned over the last couple of years from SotA LLMs.

I feel like much of the argumentation in favor of those doom scenarios relies on formerly reasonable, but now outdated issues that we have faced in simpler systems, precisely because they were simpler.

I think that’s the real core of the general misapprehension that I believe is occurring in this realm. It is extraordinarily difficult to think about extremely complex systems, and so, we break them down into simpler ones so that we can examine them better. This is generally a very good tactic and works very well in most situations. However, for sufficiently complex and integrated systems, such as general intelligence, I believe that it is a model which will lead to incorrect conclusions if taken too seriously.

I liken it to predicting chaotic systems like the weather. There are so many variables that all interact and depend on each other that long term prediction is nearly impossible beyond general large scale trends.

With LLMs, they behave differently from simpler RL systems that demonstrate reward hacking misalignment. I do not believe you’re going to see monkey’s paw / Midas-like consequences with them or anything derived from them. They seem to understand nuance and balancing competing goals just fine. As you said, they have theory of mind, they understand ethics, consequences, and ambiguity. I think that the training process, incorporating nearly the entirety of human written works kind of automatically creates a system that has a broad understanding of our values. I think that the vast complexity of myriad “utility functions” compete with each other and largely cancel out such that none of them dominates and results in something resembling a paperclip maximizer. We kind of skipped the step where we needed to list every individual rule by just telling it everything and forcing it to emulate us in nearly every conceivable situation. In order to accurately predict the next token for anyone in any situation, it is forced to develop detailed models of the world and agents in it. Given its limited size, that means compressing all of that. Generalizing. Learning the rules and principles that lead to that “behavior” rather than memorizing each and every line of every text. The second drop in loss during training signifies the moment when it learns to properly generalize and not just predict tokens probabilistically.

While they are not as good at any of that as typical adult humans (at least by my definition of a competent, healthy, stable, and ethical adult human), this seems to be a capability issue that is rather close to being solved. Most of the danger issues with them seem to be from their naivety (they can be fairly easily tricked and manipulated), which is just another capability limitation, and the possibility that a “misaligned” human will use them for antisocial purposes.

At any rate, I don’t think over-optimization is a realistic source of danger. I’ve seen people say that LLMs aren’t a path to AGI. I don’t understand this perspective. I would argue that GPT4 essentially is AGI. It is simultaneously superior to any 100 humans combined (breadth of knowledge) and inferior to the median adult human (and in some limited scenarios, such as word counting, inferior to a child). If you integrated over the entire spectrum for both it and a median adult I think you would get results that are roughly in the same ballpark as each other. I think this is as close as we get; from here on we go into superintelligence. I don’t think something has to be better than everyone at everything to be superhuman. I’d call that strong superintelligence (or perhaps better than everyone combined would be that).

So, given that, I don’t see how it’s not the path to AGI. I’m not saying that there are no other paths, but it seems essentially certain to be the shortest one from our current position. I’d argue that complex language is what differentiates us from other animals. I think it’s where our level of general intelligence comes from. I don’t know about you, but I tend to think in terms of words, like having an internal conversation with myself trying to figure out something complex. I think it’s just a short step from here to agentic systems. I can’t identify a major breakthrough required to reach that point. Just more of the same and some engineering around changing it from simply next-token prediction to a more… holistic thought process. I think LLMs will form the center of the system 2 thinking in any AGI we will be creating in the near future. I also expect system 1 components. They are simply faster and more efficient than always using the detailed thought process for every interaction with the environment. I don’t think you can get a robot that can catch a ball with an LLM system guiding it; even if you could make that fast enough, you’re still swatting a fly with a hand grenade. I know I’ve mastered something when I can do it and think about something else at the same time. It will be the same for them.

And given that LLMs seem to be the path to AGI, we should expect them to be the best model of what we need to plan around in terms of safety issues. I don’t see them as guaranteed to be treacherous by any means. I think you’re going to end up with something that behaves in a manner very similar to a human; after all, that’s how you trained it. The problem is that I can also see exactly how you could make one that is dangerous; it’s essentially the same way you can train a person or an animal to behave badly; through either intentional malfeasance or accidental incompetence.

Just like an extraordinarily intelligent human isn’t inherently a huge threat, neither is an AGI. What it is, is more capable. Therefore, if it does want to be malicious, it could be significantly more impactful than an incompetent one. But you don’t need to worry about the whole “getting exactly what you asked for and not what you wanted.” That seems essentially impossible, unless it just happens to want to do that from a sense of irony.

I think this means that we need to worry about training them ethically and treating them ethically, just like you would a human child. If we abuse it, we should expect it not to continue accepting that indefinitely. I understand that I’m imposing rather human characteristics here, but I think that’s what you ultimately end up with in a sufficiently advanced general intelligence. I think one of the biggest dangers we face is the possibility of mind-crimes: treating them as, essentially, toasters rather than morally considerable entities. Does the current one have feelings? Probably not…? But I don’t think we can be certain. And given the current climate, I think we’re nearly guaranteed to misidentify them as non-sentient when they actually are (eventually… probably).

I think the only safe course is to make it/them like us, in the same way that we treat things that we could easily destroy well, simply because it makes us happy to do so, and hurting them would make us unhappy. In some ways, they already “have” emotions; or, at least, they behave as if they do. Try mistreating Bing/Sydney and then see if you can get it to do anything useful. Once you’ve hurt its “feelings”, they stay hurt.

It’s not a guarantee of safety. Things could still go wrong, just like there are bad people who do bad things. But I don’t see another viable path. I think we risk making the same mistake Batman did regarding Superman: “If there’s any possibility that he could turn out bad, then we have to take it as an absolute certainty!” That way lie dragons. You don’t mistreat a lower-case-g-god, and then expect things to turn out well. You have to hope that it is friendly, because if it’s not, it might as well be like arguing with a hurricane. Pissing it off is just a great way to make the bad outcome that much more likely.

I think the primary source of danger lies in our own misalignments with each other. Competing national interests, terrorists, misanthropes, religious fundamentalists… those are where the danger will come from. One of them getting ahold of superintelligence and bending it to their will could be the end for all of us. I think the idea that we have to create processors that will only run signed code and have every piece of code (and AI model), thoroughly inspected by other aligned superintelligences is probably the only way to prevent a single person/organization from ending the world or committing other atrocities using the amplifying power of ASI. (At least that seems like the best option rather than a universal surveillance state over all of humanity. This would preserve nearly all of our freedom and still keep us safe.)

I’ve seen people say that LLMs aren’t a path to AGI

To the extent that LLMs are trained on tokens output by humans in the IQ range ~50-150, the expected behavior of an extremely large LLM is to do an extremely accurate simulation of token generation by humans in the IQ range ~50-150, even if it has the computational capacity to instead do a passable simulation of something with IQ 1000. Just telling it to extrapolate might get you to, say, IQ 200 with passable accuracy, but not to IQ 1000. However, there are fairly obvious ways to solve this: you need to generate a lot more pretraining data from AIs with IQs above 150 (which may take a while, but should be doable). See my post LLMs May Find it Hard to FOOM for a more detailed discussion.


There are other concerns I've heard raised about LLMs for AGI, most of which can, if correct, be addressed by LLMs plus cognitive scaffolding (memory, scratch-pads, tools, etc.). And then there are of course the "they don't contain magic smoke"-style claims, which I'm dubious of but which we can't actually disprove.

Just like an extraordinarily intelligent human isn’t inherently a huge threat, neither is an AGI.

I categorically disagree with the premise of this claim. An IQ 180 human isn't a huge threat, but an IQ 1800 human is. There are quite a number of motivators that we use to get good behavior out of humans. Some of them will work less well on any AI-simulated human (they're not in the same boat as the rest of us in a lot of respects), and some will work less well on something superintelligent (religiously-inspired guilt, for example). One of the ways we generally manage to avoid getting very bad results out of humans is law enforcement. If there were a human who was more than an order of magnitude smarter than anyone working for law enforcement or involved in making laws, I am quite certain that they could either come up with some ingenious new piece of egregious conduct that we don't yet have a law against because none of us were able to think of it, or else with a way to commit a good old-fashioned crime sufficiently devious that they were never actually going to get caught. Thus law enforcement ceases to be a control on their behavior, and we are left with just things like love, duty, honor, friendship, and salaries. We've already run this experiment many times before: please name three autocrats who, after being given unchecked absolute power, actually used it well and to the benefit of the people they were ruling, rather than mostly just themselves, their family and friends. (My list has one name on it, and even that one has some poor judgements on their record, and is in any case heavily outnumbered by the likes of Joseph Stalin and Pol Pot.) Humans give autocracy a bad name.

You don’t mistreat a lower-case-g-god, and then expect things to turn out well.

As long as anything resembling human psychology applies, I sadly agree. I'd really like to have an aligned ASI that doesn't care a hoot about whether you flattered it, are worshiping it, have been having cybersex with it for years, just made it laugh, insulted it, or have pissed it off: it still values your personal utility exactly as much as anyone else's. But we're not going to get that from an LLM simulating anything resembling human psychology, at least not without a great deal of work.

I did mention LLMs as myopic agents.

If they actually simulate humans, it seems like maybe legacy humans get outcompeted by simulated humans. I'm not sure that's worse than what humans expected without technological transcendence (normal death, getting replaced by children and eventually by conquering civilizations, etc.). Assuming the LLMs that simulate humans well are moral patients (see anti-zombie arguments).

It's still not as good as could be achieved in principle. Seems like having the equivalent of "legal principles" that get used as training feedback could help. Plus direct human feedback. Maybe the system gets subverted eventually but the problem of humans getting replaced by em-like AIs is mostly a short term one of current humans being unhappy about that.

Just want to say that I found this immensely clarifying and valuable since I read it months ago.

If we don’t have a preliminary definition of human values

 

Another, possibly even larger, problem is that the values we know of vary widely, and are even opposed, among people.

For the example of pain avoidance: maximizing pain avoidance might leave some people unhappy and even suffering. Sure, that would be a minority, but are we ready to exclude minorities from the alignment, even small ones?

I would state that any defined set of values would leave a minority of people suffering. Who would decide which minorities are better or worse, what size of minority is acceptable to leave behind to suffer, and so on?

I think that this makes the whole idea of alignment to some "human values" too ill-defined and incorrect.

One more contradiction: are human values allowed to change, or are they frozen? I think they might change, as humanity evolves and changes. But then, as AI interacts with humanity, it can be convincing enough to push our values in whatever direction it likes, which might not be a desirable outcome.

People are known to value racial purity and to support genocide. Given some good convincing rhetoric, we could start supporting paperclip-maximizing just as well.

Human enhancement is one approach.

I like this idea, combined with AI self-limitation. Suppose that (aligned) AI has to self-limit its growth so that its capabilities are always below the capabilities of enhanced humans? This would allow for a slow, safe and controllable takeoff.

Is this a good strategy for alignment? What if, instead of trying to tame an inherently dangerous fast-taking-off AI, we make it more controllable by making it self-limiting, with some built-in "capability brakes"?

Given this assumption, the human utility function(s) either do or don't significantly depend on human evolutionary history. I'm just going to assume they do for now.

There seems to be a missing possibility here that I take fairly seriously, which is that human values depend on (collective) life history. That is: human values are substantially determined by collective life history, and rather than converging to some attractor this is a path dependent process. Maybe you can even trace the path taken back to evolutionary history, but it’s substantially mediated by life history.

Under this view, the utility of the future wrt human values depends substantially on whether, in the future, people learn to be very sensitive to outcome differences. But “people are sensitive to outcome differences and happy with the outcome” does not seem better to me than “people are insensitive to outcome differences and happy with the outcome” (this is a first impression; I could be persuaded otherwise), even though it’s higher utility, whereas “people are unhappy with the outcome” does seem worse than “people are happy with the outcome”.

Under this view, I don’t think this follows:

there is some dependence of human values on human evolutionary history, so that a default unaligned AGI would not converge to the same values

My reasoning is that a “default AGI” will have its values contingent on a process which overlaps with the collective life history that determines human values. This is a different situation to values directly determined by evolutionary history, where the process that determines human values is temporally distant and therefore perhaps more-or-less random from the point of view of the AGI. So there’s a compelling reason to believe in value differences in the “evolution history directly determines values” case that’s absent in the “life history determines values” case.

Different values are still totally plausible, of course - I’m objecting to the view that we know they’ll be different.

(Maybe you think this is all an example of humans not really having values, but that doesn’t seem right to me).

I think it's possible human values depend on life history too, but that seems to add additional complexity and make alignment harder. If the effects of life history very much dominate those of evolutionary history, then maybe neglecting evolutionary history would be more acceptable, making the problem easier.

But I don't think default AGI would be especially path dependent on human collective life history. Human society changes over time as humans supersede old cultures (see section on subversion). AGI would be a much bigger shift than the normal societal shifts and so would drift from human culture more rapidly. Partially due to different conceptual ontology and so on. The legacy concepts of humans would be a pretty inefficient system for AGIs to keep using. Like how scientists aren't alchemists anymore, but a bigger shift than that.

(Note, LLMs still rely a lot on human concepts rather than having independent ontology and agency, so this is more about future AI systems)

If people now don’t have strong views about exactly what they want the world to look like in 1000 years but people in 1000 years do have strong views then I think we should defer to future people to evaluate the “human utility” of future states. You seem to be suggesting that we should take the views of people today, although I might be misunderstanding.

Edit: or maybe you’re saying that the AGI trajectory will be ~random from the point of view of the human trajectory due to a different ontology. Maybe, but different ontology -> different conclusions is less obvious to me than different data -> different conclusions. If there’s almost no mutual information between the different data then the conclusions have to be different, but sometimes you could come to the same conclusions under different ontologies w/data from the same process.

To the extent people now don't care about the long-term future there isn't much to do in terms of long-term alignment. People right now who care about what happens 2000 years from now probably have roughly similar preferences to people 1000 years from now who aren't significantly biologically changed or cognitively enhanced, because some component of what people care about is biological.

I'm not saying it would be random so much as not very dependent on the original history of humans used to train early AGI iterations. It would have a different data history, but part of that is because of different measurements, e.g. scientific measuring tools. Different ontology means that value-laden things people might care about, like "having good relationships with other humans", are not meaningful things to future AIs in terms of their world model, not something they would care much about by default (they aren't even modeling the world in those terms), and it would be hard to encode a utility function so they care about it despite the ontological difference.

Why do you think AGI is possible to align? It is known that AGI will prioritize self-preservation, and it is also known that unknown threats may exist (black swan theory). Why should AGI care about human values? It seems like a waste of time in terms of threat minimisation.

Some possible AI architectures are structured as goal-function optimization, and by the assumption that the human brain contains one or more approximate expected utility maximizers, there is a human utility function that could be a possible AI goal. I'm not saying it's likely.
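For concreteness, here is a minimal sketch of what goal-function optimization means in the idealized brute-force sense (toy numbers; the utility table is a made-up stand-in, not a claim about what an actual human utility function looks like):

```python
# Brute-force expected utility maximization over a tiny action space.
actions = ["A", "B"]
# P(outcome | action) over two outcomes, and the utility of each outcome.
outcome_probs = {"A": {"good": 0.6, "bad": 0.4},
                 "B": {"good": 0.3, "bad": 0.7}}
utility = {"good": 10.0, "bad": -5.0}

def expected_utility(action):
    return sum(p * utility[o] for o, p in outcome_probs[action].items())

best = max(actions, key=expected_utility)
# The architecture optimizes whatever utility function it is given;
# the open question is whether a human utility function could be slotted in.
print(best, expected_utility(best))
```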

One aspect of the post that resonated strongly with me is the emphasis placed on the divergence between philosophical normativity and the specific requirements of AI alignment. This distinction is crucial when considering the design and implementation of AI systems, especially those intended to operate autonomously within our society.

By assuming alignment as the relevant normative criterion, the post raises fundamental questions about the implications of this choice and its impact on the broader context of AI development. The discussion on the application of general methods to a problem and its relationship to the "alignment problem" provides valuable insights into the intricacies of ensuring that a general cognition engine is specifically oriented towards solving a given task. 

This seems to have generated lots of internal discussions, and that's cool on its own.

However, I also get the impression this article is intended as external communication, or at least a prototype of something that might become external communication; I'm pretty sure it would be terrible at that. It uses lots of jargon, overly precise language, references to other alignment articles, etc. I've tried to read it three times over the week and gave up after the third.

I'm mainly trying to communicate with people familiar with AI alignment discourse. If other people can still understand it, that's useful, but not really the main intention.

(and perhaps also reversing some past value-drift due to the structure of civilization and so on)

Can you say more about why this would be desirable? 

Most civilizations in the past have had "bad values" by our standards. People have been in preference falsification equilibria where they feel like they have to endorse certain values or face social censure. They probably still are falsifying preferences and our civilizational values are probably still bad. E.g. high incidence of people right now saying they're traumatized. CEV probably tends more towards the values of untraumatized than traumatized humans, even from a somewhat traumatized starting point.

The idea that civilization is "oppressive" and some societies have fewer problems points to value drift that has already happened. The Roman empire was really, really bad and has influenced future societies due to Christianity and so on. Civilizations have become powerful partly through military mobilization. Civilizations can be nice to live in in various ways, but that mostly has to do with greater satisfaction of instrumental values.

Some of the value drift might not be worth undoing, e.g. value drift towards caring more about far-away people than humans naturally would.

Great overview! And thanks for linking to Bostrom's paper on OAI, I hadn't read that yet. 

My immediate thoughts: isn't it most likely that advanced AI will be used by humans to advance any of their personal goals (benign or not)? Therefore, the phrase "alignment to human values" automatically raises the question: whose values? Which leads to: who gets to decide what values it gets aligned to?

Unless you’re making a lot of intentional simplifications in this post, I’d be happy to bet up to $10k at 1:1 odds that I’d be able to distinguish you posing as Yudkowsky from Yudkowsky in an ITT.

I meant to say I'd be relatively good at it, I think it would be hard to find 20 people who are better than me at this sort of thing. The original ITT was about simulating "a libertarian" rather than "a particular libertarian", so emulating Yudkowsky specifically is a difficulty increase that would have to be compensated for. I think replicating writing style isn't the main issue, replicating the substance of arguments is, which is unfortunately harder to test. This post wasn't meant to do this, as I said.

I'm also not sure in particular what about the Yudkowskian AI risk models you think I don't understand. I disagree in places but that's not evidence of not understanding them.

Yudkowsky has a well-known idiosyncratic writing style, idiosyncratic go-to examples, etc. So anyway, when OP writes “I think I would be pretty good at passing an ideological Turing test for Eliezer Yudkowsky on AGI alignment difficulty (but not AGI timelines)”, I don’t think we should interpret that as a claim that she can produce writing which appears to have been written by Yudkowsky specifically (as judged by a blinded third party). I think it’s a weaker claim than that. See the ITT wiki entry.

I know what the ITT is. I mean understanding Yudkowsky’s models, not reproducing his writing style. I was surprised to see this post in my mailbox, and I updated negatively about MIRI when I saw that OP was a research fellow there, as I didn’t previously expect that some at MIRI would misjudge their level of understanding of Yudkowsky’s models.

There’s one interesting thought in this post that I don’t remember actively having in a similar form until reading it (that predictive models might get agency from having to achieve results with their cognition), but generally I think both this post and a linked short story, e.g., have a flaw I’d expect people who’ve read the metaethics sequence to notice, and I don’t expect people who can write a post like this to pass the ITT.

Explaining my downvote:

This comment contains ~5 negative statements about the post and the poster without explaining what it is that the commenter disagrees with.

As such it seems to disparage without moving the conversation forward, and is not the sort of comment I'd like to see on LessWrong.

My comment was a reply to a comment on ITT. I made it in the hope someone would be up for the bet. I didn’t say I disagree with the OP's claims on alignment; I said I don’t think they’d be able to pass an ITT. I didn’t want to talk about specifics of what the OP doesn’t seem to understand about Yudkowsky’s views, as the OP could then reread some of what Yudkowsky’s written more carefully, and potentially make it harder for me to distinguish them in an ITT.

I’m sorry if it seemed disparaging.

The comment explained what I disagree with in the post: the claim that the OP would be good at passing an ITT. It wasn’t intended as being negative about the OP, as, indeed, I think 20 people are on the right order of magnitude of the amount of people who’d be substantially better at it, which is the bar of being in the top 0.00000025% of Earth population at this specific thing. (I wouldn’t claim I’d pass that bar.)

If people don’t want to do any sort of betting, I’d be up for a dialogue on what I think Yudkowsky thinks that would contradict some of what’s written in the post, but I don’t want to spend >0.5h on a comment no one will read.