Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment

Humans aren't fit to run the world, and there's no reason to think humans can ever be fit to run the world.

I see this argument pop up every so often. I don't find it persuasive because it presents a false choice in my view.

Our choice is not between having humans run the world and having a benevolent god run the world. Our choice is between having humans run the world, and having humans delegate the running of the world to something else (which is kind of just an indirect way of running the world).

If you think the alignment problem is hard, you probably believe that humans can't be trusted to delegate to an AI, which means we are left with either having humans run the world (something humans can't be trusted to do) or having humans build an AI to run the world (also something humans can't be trusted to do).

The best path, in my view, is to pick and choose in order to make the overall task as easy as possible. If we're having a hard time thinking of how to align an AI for a particular situation, add more human control. If we think humans are incompetent or untrustworthy in some particular circumstance, delegate to the AI in that circumstance.

It's not obvious to me that becoming wiser is difficult -- your comment is light on supporting evidence, violence seems less frequent nowadays, and it seems possible to me that becoming wiser is merely unincentivized, not difficult. (BTW, this is related to the question of how effective rationality training is.)

However, again, I see a false choice. We don't have flawless computerized wisdom at the touch of a button. The alignment problem remains unsolved. What we do have are various exotic proposals for computerized wisdom (coherent extrapolated volition, indirect normativity) which are very difficult to test. Again, insofar as you believe the problem of aligning AIs with human values is hard, you should be pessimistic about these proposals working, and (relatively) eager to shift responsibility to systems we are more familiar with (biological humans).

Let's take coherent extrapolated volition. We could try & specify some kind of exotic virtual environment where the AI can simulate idealized humans and observe their values... or we could become idealized humans. Given the knowledge of how to create a superintelligent AI, the second approach seems more robust to me. Both approaches require us to nail down what we mean by an "idealized human", but the second approach does not include the added complication+difficulty of specifying a virtual environment, and has a flesh and blood "human in the loop" observing the process at every step, able to course correct if things seem to be going wrong.

The best overall approach might be a committee of ordinary humans, morally enhanced humans, and morally enhanced ems of some sort, where the AI only acts when all three parties agree on something (perhaps also preventing the parties from manipulating each other somehow). But anyway...

You talk about the influence of better material conditions and institutions. Fine, have the AI improve our material conditions and design better institutions. Again I see a false choice between outcomes achieved by institutions and outcomes achieved by a hypothetical aligned AI which doesn't exist. Insofar as you think alignment is hard, you should be eager to make an AI less load-bearing and institutions more load-bearing.

Maybe we can have an "institutional singularity" where we have our AI generate a bunch of proposals for institutions, then we have our most trusted institution choose from amongst those proposals, we build the institution as proposed, then have that institution choose from amongst a new batch of institution proposals until we reach a fixed point. A little exotic, but I think I've got one foot on terra firma.

The Great Karma Reckoning

We removed the historical 10x multiplier for posts that were promoted to main on LW 1.0

Are comments currently accumulating karma in the same way that toplevel posts do?

Approval Extraction Advertised as Production

When I read this essay in 2019, I remember getting the impression that approval-extracting vs production-oriented was supposed to be about the behavior of the founders, not the industry the company competes in.

Why GPT wants to mesa-optimize & how we might change this

I was using it to refer to "any inner optimizer". I think that's the standard usage but I'm not completely sure.

Why GPT wants to mesa-optimize & how we might change this

With regard to the editing text discussion, I was thinking of a really simple approach where we resample words in the text at random. Perhaps that wouldn't work great, but I do think editing has potential because it allows for more sophisticated thinking.

Let's say we want our language model to design us an aircraft. Perhaps it starts by describing the engine, and then it describes the wings. Standard autoregressive text generation (assuming no lookahead) will allow the engine design to influence the wing design (assuming the engine design is inside the context window when it's writing about the wings), but it won't allow the wing design to influence the engine design. However, if the model is allowed to edit its text, it can rethink the engine in light of the wings and rethink the wings in light of the engine until it's designed a really good aircraft.
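To make the asymmetry concrete, here is a toy sketch rather than anything resembling a real language model. The option lists, the `engine_alone` preference, and the `joint` score are all made-up assumptions for illustration. The one-pass routine commits to an engine before it has seen any wings. The editing loop deterministically revisits each part in light of the other until nothing changes, which is a simplification of the random resampling idea. In general a loop like this can stall at a local optimum; in this toy it happens to reach the best design.

```python
# Hypothetical design options: "sizes" for the engine and the wings.
ENGINES = [1, 2, 3]
WINGS = [1, 2, 3]


def engine_alone(e):
    """Assumed standalone preference: bigger engines look better in isolation."""
    return e


def joint(e, w):
    """Assumed quality of the whole aircraft: parts should match, wings should be small."""
    return -(e - w) ** 2 - 2 * (w - 1) ** 2


def one_pass():
    """Left-to-right generation: the engine is fixed before the wings exist."""
    e = max(ENGINES, key=engine_alone)
    w = max(WINGS, key=lambda x: joint(e, x))
    return e, w


def edit_until_stable(e, w, max_rounds=10):
    """Editing: re-pick each part given the current other part until a fixed point."""
    for _ in range(max_rounds):
        new_e = max(ENGINES, key=lambda x: joint(x, w))
        new_w = max(WINGS, key=lambda x: joint(new_e, x))
        if (new_e, new_w) == (e, w):
            break
        e, w = new_e, new_w
    return e, w


draft = one_pass()
final = edit_until_stable(*draft)
print(draft, joint(*draft))
print(final, joint(*final))
```

The first print shows the draft that one-pass generation is stuck with, and the second shows the design after the wings have been allowed to influence the engine.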

In particular, it would be good to figure out some way of contriving a mesa-optimization setup, such that we could measure if these fixes would prevent it or not.

Agreed. Perhaps if we generated lots of travelling salesman problem instances where the greedy approach doesn't get you something that looks like the optimal route, then try & train a GPT architecture to predict the cities in the optimal route in order?
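A minimal sketch of the data-generation half of that idea, under several assumptions of my own: "greedy" means the nearest-neighbour heuristic, instances are small enough to solve by brute force, and an instance counts as hard when the greedy tour is at least `min_gap` times longer than the optimal one. The function names and the `min_gap` and `max_tries` parameters are hypothetical. Training a GPT-style model on the output is not shown.

```python
import itertools
import math
import random


def tour_length(points, order):
    """Length of the closed tour that visits points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def greedy_tour(points):
    """Nearest-neighbour heuristic starting from city 0 (ties broken by index)."""
    unvisited = set(range(1, len(points)))
    tour = [0]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda j: (math.dist(last, points[j]), j))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def optimal_tour(points):
    """Brute-force optimum starting from city 0; only feasible for small instances."""
    rest = range(1, len(points))
    best = min(
        itertools.permutations(rest),
        key=lambda perm: tour_length(points, (0,) + perm),
    )
    return [0, *best]


def hard_instances(n_instances, n_cities=7, min_gap=1.1, seed=0, max_tries=2000):
    """Collect up to n_instances random instances where greedy is clearly suboptimal."""
    rng = random.Random(seed)
    out = []
    for _ in range(max_tries):
        if len(out) >= n_instances:
            break
        pts = [(rng.random(), rng.random()) for _ in range(n_cities)]
        opt = optimal_tour(pts)
        if tour_length(pts, greedy_tour(pts)) >= min_gap * tour_length(pts, opt):
            out.append((pts, opt))  # training pair: cities -> optimal visiting order
    return out
```

Each returned pair is a list of city coordinates plus the optimal visiting order, which could serve as the prediction target. Brute force limits this to roughly ten cities, so larger instances would need an exact solver instead.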

This is an interesting quote: "In our experience we find that lean stochastic local search techniques such as simulated annealing are often the most competitive for hard problems with little structure to exploit."


I suspect GPT will be biased towards avoiding mesa-optimization and making use of heuristics, so the best contrived mesa-optimization setup may be an optimization problem with little structure where heuristics aren't very helpful. Maybe we could focus on problems where non-heuristic methods such as branch and bound / backtracking are considered state of the art, and train the architecture to mesa-optimize by starting with easy instances and gradually moving to harder and harder ones.

The (Unofficial) Less Wrong Comment Challenge

I also felt frustrated by the lack of feedback my posts got. My response was to write this: Maybe submitting LW posts to targeted subreddits could be high impact?

LessWrong used to have a lot of comments back in the day. I wonder if part of the issue is simply that the number of posts went up, which means a bigger surface for readers to be spread across. Why did the writer/reader ratio go up? Perhaps because writing posts falls into the "endorsed" category, whereas reading/writing comments feels like "time-wasting". And as CFAR et al. helped rationalists be more productive, they let activities labeled as "time-wasting" fall by the wayside. (Note that there's something rather incoherent about this: If the subject matter of the post was important enough to be worth a post, surely it is also worth reading/commenting?)

Anyway, here are the reasons why commenting falls into the "endorsed" column for me:

  • It seems neglected. See above argument.
  • I suspect people actually read comments a fair amount. I know I do. Sometimes I will skip to the comments before reading the post itself.
  • Writing a comment doesn't trigger the same "officialness" anxiety that writing a post does. I don't feel obligated to do background research, think about how my ideas should be structured, or try to anticipate potential lines of counterargument.
  • Taking this further, commenting doesn't feel like work. So it takes fewer spoons. I'm writing this comment during a pre-designated goof off period, in fact. The ideal activity is one which is high-impact yet feels like play. Commenting and brainstorming are two of the few things that fall in that category for me.

I know there was an effort to move the community from Facebook to LW recently. Maybe if we pitched LW as "just as fun as Facebook, but discussing more valuable things and adding to a searchable/taggable knowledge archive" that could lure people over? IMO the concept of "work that feels like play" is underrated in the rationalist and EA communities.

Unfortunately, even though I find it fun to write comments, I tend to get demoralized a while later when my comments don't get comment replies themselves :P So that ends up being an "endorsed" reason to avoid commenting.

Where do (did?) stable, cooperative institutions come from?

Well, death spirals can happen, but turnaround / reform can also happen. It usually needs good leadership though.

Sure, they have competitors, but what are they competing on? In terms of what's going on in the US right now, one story is that newspapers used to be nice and profitable, which created room for journalists to pursue high-minded ideals related to objectivity, fairness, investigative reporting, etc. But since Google/Craigslist took most of their ad revenue, they've had to shrink a bunch, and the new business environment leaves less room for journalists to pursue those high-minded ideals. Instead they're forced to write clickbait and/or pander to a particular ideological group to get subscriptions. Less sophisticated reporting/analysis means less sophisticated voting means less sophisticated politicians who aren't as capable of reforming whatever government department is currently most in need of reform (or, less sophisticated accountability means they do a worse job).

Where do (did?) stable, cooperative institutions come from?

Another hypothesis: Great people aren't just motivated by money. They're also motivated by things like great coworkers, interesting work, and prestige.

In the private sector, you see companies like Yahoo go into death spirals: Once good people start to leave, the quality of the coworkers goes down, the prestige of being a Yahoo employee goes down, and you have to deal with more BS instead of bold, interesting initiatives... which means fewer great people join and more leave (partly, also, because mediocre people can't identify, or don't want to hire, great people).

This death spiral is OK in the private sector because people can just switch their search engine from Yahoo to Google if the results become bad. But there's no analogous competitive process for provisioning public sector stuff.

Good Marines get out because of bad leadership, which means bad Marines stay in and eventually get promoted to leadership positions and the cycle repeats itself.


John_Maxwell's Shortform

That's possible, but I'm guessing that it's not hard for a superintelligent AI to suddenly swallow an entire system using something like gray goo.
