Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for short-form writing by Richard_Ngo. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.

One fairly strong belief of mine is that Less Wrong's epistemic standards are not high enough to make solid intellectual progress here. So far my best effort to make that argument has been in the comment thread starting here. Looking back at that thread, I just noticed that a couple of those comments have been downvoted to negative karma. I don't think any of my comments have ever hit negative karma before; I find it particularly sad that the one time it happens is when I'm trying to explain why I think this community is failing at its key goal of cultivating better epistemics.

There's all sorts of arguments to be made here, which I don't have time to lay out in detail. But just step back for a moment. Tens or hundreds of thousands of academics are trying to figure out how the world works, spending their careers putting immense effort into reading and producing and reviewing papers. Even then, there's a massive replication crisis. And we're trying to produce reliable answers to much harder questions by, what, writing better blog posts, and hoping that a few of the best ideas stick? This is not what a desperate effort to find the truth looks like.

And we're trying to produce reliable answers to much harder questions by, what, writing better blog posts, and hoping that a few of the best ideas stick? This is not what a desperate effort to find the truth looks like.

It seems to me that maybe this is what a certain stage in the desperate effort to find the truth looks like?

Like, the early stages of intellectual progress look a lot like thinking about different ideas and seeing which ones stand up robustly to scrutiny.  Then the best ones can be tested more rigorously and their edges refined through experimentation.  

It seems to me like there needs to be some point in the desperate search for truth at which you're allowing for half-formed thoughts and unrefined hypotheses, or else you simply never get to a place where the hypotheses you're creating even brush up against the truth.

In the half-formed thoughts stage, I'd expect to see a lot of literature reviews, agendas laying out problems, and attempts to identify and question fundamental assumptions. I expect that (not blog-post-sized speculation) to be the hard part of the early stages of intellectual progress, and I don't see it right now.

Perhaps we can split this into technical AI safety and everything else. Above I'm mostly speaking about the "everything else" that Less Wrong wants to solve, since AI safety is now a substantial enough field that its problems need to be solved in more systemic ways.

6 · Matt Goldenberg · 3y
I would expect that later in the process. Agendas laying out problems and fundamental assumptions don't spring from nowhere (at least for me); they come from conversations where I'm trying to articulate some intuition, and I recognize some underlying pattern. The pattern and structure don't emerge spontaneously; they come from trying to pick around the edges of a thing, get thoughts across, explain my intuitions and see where they break. I think it's fair to say that crystallizing these patterns into a formal theory is a "hard part", but the foundation for making it easy is laid out in the floundering and flailing that came before.

The top posts in the 2018 Review are filled with fascinating and well-explained ideas. Many of the new ideas are not settled science, but they're quite original and substantive, or excellent distillations of settled science, and are often the best piece of writing on the internet about their topics.

You're wrong about LW epistemic standards not being high enough to make solid intellectual progress; we already have. On AI alone (which I am using in large part because there's vaguely more consensus around it than around rationality), I think you wouldn't have seen almost any of the public write-ups (like Embedded Agency and Zhukeepa's Paul FAQ) without LessWrong, and I think a lot of them are brilliant.

I'm not saying we can't do far better, or that we're sufficiently good. Many of the examples of success so far are "Things that were in people's heads but didn't have a natural audience to share them with". There's not a lot of collaboration at present, which is why I'm very keen to build the new LessWrong Docs that allows for better draft sharing and inline comments and more. We're working on the tools for editing tags, things like edit histories and so on, that will allow us to build …

As mentioned in my reply to Ruby, this is not a critique of the LW team, but of the LW mentality. And I should have phrased my point more carefully - "epistemic standards are too low to make any progress" is clearly too strong a claim, it's more like "epistemic standards are low enough that they're an important bottleneck to progress". But I do think there's a substantive disagreement here. Perhaps the best way to spell it out is to look at the posts you linked and see why I'm less excited about them than you are.

Of the top posts in the 2018 review, and the ones you linked (excluding AI), I'd categorise them as follows:

Interesting speculation about psychology and society, where I have no way of knowing if it's true:

  • Local Validity as a Key to Sanity and Civilization
  • The Loudest Alarm Is Probably False
  • Anti-social punishment (which is, unlike the others, at least based on one (1) study).
  • Babble
  • Intelligent social web
  • Unrolling social metacognition
  • Simulacra levels
  • Can you keep this secret?

Same as above but it's by Scott so it's a bit more rigorous and much more compelling:

  • Is Science Slowing Down?
  • The tails coming apart as a metaph
…

(Thanks for laying out your position in this level of depth. Sorry for how long this comment turned out. I guess I wanted to back up a bunch of my agreement with words. It's a comment for the sake of everyone else, not just you.)

I think there's something to what you're saying, that the mentality itself could be better. The Sequences have been criticized because Eliezer didn't cite previous thinkers all that much, but at least as far as the science goes, as you said, he was drawing on academic knowledge. I also think we've lost something precious with the absence of epic topic reviews by the likes of Luke. Kaj Sotala still draws heavily on outside knowledge, John Wentworth did a great review on Biological Circuits, and we get SSC crossposts that do the same, but otherwise posts aren't heavily referencing or building upon outside stuff. I concede that I would like to see a lot more of that.

I think Kaj was rightly disappointed that he didn't get more engagement with his post whose gist was "this is what the science really says about S1 & S2, one of your most cherished concepts, LW community".

I wouldn't say the typical approach is strictly bad, there's value in thinking freshly…

This is only tangentially relevant, but adding it here as some of you might find it interesting:

Venkatesh Rao has an excellent Twitter thread on why most independent research only reaches this kind of initial exploratory level (he tried it for a bit before moving to consulting). It's pretty pessimistic, but there is a somewhat more optimistic follow-up thread on potential new funding models. Key point is that the later stages are just really effortful and time-consuming, in a way that keeps out a lot of people trying to do this as a side project alongside a separate main job (which I think is the case for a lot of LW contributors?)

Quote from that thread:

Research =

a) long time between having an idea and having something to show for it that even the most sympathetic fellow crackpot would appreciate (not even pay for, just get)

b) a >10:1 ratio of background invisible thinking in notes, dead-ends, eliminating options etc

With a blogpost, it’s like a week of effort at most from idea to mvp, and at most a 3:1 ratio of invisible to visible. That’s sustainable as a hobby/side thing.

To do research-grade thinking you basically have to be independently wealthy and accept 90% d

…
6 · Richard_Ngo · 3y
Also, I liked your blog post! More generally, I strongly encourage bloggers to have a "best of" page, or something that directs people to good posts. I'd be keen to read more of your posts but have no idea where to start.
6 · drossbucket · 3y
Thanks! I have been meaning to add a 'start here' page for a while, so that's good to have the extra push :) Seems particularly worthwhile in my case because a) there's no one clear theme and b) I've been trying a lot of low-quality experimental posts this year bc pandemic trashed motivation, so recent posts are not really reflective of my normal output. For now some of my better posts in the last couple of years might be Cognitive decoupling and banana phones [https://drossbucket.com/2019/10/23/cognitive-decoupling-and-banana-phones/] (tracing back the original precursor of Stanovich's idea), The middle distance [https://drossbucket.com/2019/10/24/the-middle-distance/] (a writeup of a useful and somewhat obscure idea from Brian Cantwell Smith's On the Origin of Objects), and the negative probability post [https://drossbucket.com/2019/08/01/negative-probability/] and its followup.
6 · Richard_Ngo · 3y
Thanks, these links seem great! I think this is a good (if slightly harsh) way of making a similar point to mine: "I find that autodidacts who haven’t experienced institutional R&D environments have a self-congratulatory low threshold for what they count as research. It’s a bit like vanity publishing or fan fiction. This mismatch doesn’t exist as much in indie art, consulting, game dev etc"

Quoting your reply to Ruby below, I agree I'd like LessWrong to be much better at "being able to reliably produce and build on good ideas". 

The reliability and focus feels most lacking to me on the building side, rather than the production, which I think we're doing quite well at. I think we've successfully formed a publishing platform that provides an audience intensely interested in good ideas around rationality, AI, and related subjects, and a lot of very generative and thoughtful people are writing down their ideas here.

We're low on the ability to connect people up to do more extensive work on these ideas – most good hypotheses and arguments don't get a great deal of follow up or further discussion.

Here are some subjects where I think there's been various people sharing substantive perspectives, but I think there's also a lot of space for more 'details' to get fleshed out and subquestions to be cleanly answered:

…

"I see a lot of (very high quality) raw energy here that wants shaping and directing, with the use of lots of tools for coordination (e.g. better collaboration tools)."

Yepp, I agree with this. I guess our main disagreement is whether the "low epistemic standards" framing is a useful way to shape that energy. I think it is because it'll push people towards realising how little evidence they actually have for many plausible-seeming hypotheses on this website. One proven claim is worth a dozen compelling hypotheses, but LW to a first approximation only produces the latter.

When you say "there's also a lot of space for more 'details' to get fleshed out and subquestions to be cleanly answered", I find myself expecting that this will involve people who believe the hypothesis continuing to build their castle in the sky, not analysis about why it might be wrong and why it's not.

That being said, LW is very good at producing "fake frameworks". So I don't want to discourage this too much. I'm just arguing that this is a different thing from building robust knowledge about the world.

6 · Ben Pace · 3y
I will continue to be contrary and say I'm not sure I agree with this. For one, I think in many domains new ideas are really hard to come by, as opposed to making minor progress in the existing paradigms. Fundamental theories in physics, a bunch of general insights about intelligence (in neuroscience and AI), etc. And secondly, I am reminded of what Lukeprog wrote in his moral consciousness report, that he wished the various different philosophies-of-consciousness would stop debating each other, go away for a few decades, then come back with falsifiable predictions. I sometimes take this stance regarding many disagreements of import, such as the basic science vs engineering approaches to AI alignment. It's not obvious to me that the correct next move is for e.g. Eliezer and Paul to debate for 1000 hours, but instead to go away and work on their ideas for a decade then come back with lots of fleshed out details and results that can be more meaningfully debated. I feel similarly about simulacra levels, Embedded Agency, and a bunch of IFS stuff. I would like to see more experimentation and literature reviews where they make sense, but I also feel like these are implicitly making substantive and interesting claims about the world, and I'd just be interested in getting a better sense of what claims they're making, and have them fleshed out + operationalized more. That would be a lot of progress to me, and I think each of them is seeing that sort of work (with Zvi, Abram, and Kaj respectively leading the charges on LW, alongside many others).

I think I'm concretely worried that some of those models / paradigms (and some other ones on LW) don't seem pointed in a direction that leads obviously to "make falsifiable predictions."

And I can imagine worlds where "make falsifiable predictions" isn't the right next step, you need to play around with it more and get it fleshed out in your head before you can do that. But there is at least some writing on LW that feels to me like it leaps from "come up with an interesting idea" to "try to persuade people it's correct" without enough checking.

(In the case of IFS, I think Kaj's sequence is doing a great job of laying it out in a concrete way where it can then be meaningfully disagreed with. But the other people who've been playing around with IFS didn't really seem interested in that, and I feel like we got lucky that Kaj had the time and interest to do so.)

8 · Richard_Ngo · 3y
I feel like this comment isn't critiquing a position I actually hold. For example, I don't believe that "the correct next move is for e.g. Eliezer and Paul to debate for 1000 hours". I am happy for people to work towards building evidence for their hypotheses in many ways, including fleshing out details, engaging with existing literature, experimentation, and operationalisation. Perhaps this makes "proven claim" a misleading phrase to use. Perhaps more accurate to say: "one fully fleshed out theory is more valuable than a dozen intuitively compelling ideas". But having said that, I doubt that it's possible to fully flesh out a theory like simulacra levels without engaging with a bunch of academic literature and then making predictions. I also agree with Raemon's response below.
4 · Ben Pace · 3y
A housemate of mine said to me they think LW has a lot of breadth, but could benefit from more depth.  I think in general when we do intellectual work we have excellent epistemic standards, capable of listening to all sorts of evidence that other communities and fields would throw out, and listening to subtler evidence than most scientists ("faster than science"), but that our level of coordination and depth is often low. "LessWrongers should collaborate more and go into more depth in fleshing out their ideas" sounds more true to me than "LessWrongers have very low epistemic standards".
In general when we do intellectual work we have excellent epistemic standards, capable of listening to all sorts of evidence that other communities and fields would throw out, and listening to subtler evidence than most scientists ("faster than science")

"Being more openminded about what evidence to listen to" seems like a way in which we have lower epistemic standards than scientists, and also that's beneficial. It doesn't rebut my claim that there are some ways in which we have lower epistemic standards than many academic communities, and that's harmful.

In particular, the relevant question for me is: why doesn't LW have more depth? Sure, more depth requires more work, but on the timeframe of several years, and hundreds or thousands of contributors, it seems viable. And I'm proposing, as a hypothesis, that LW doesn't have enough depth because people don't care enough about depth - they're willing to accept ideas even before they've been explored in depth. If this explanation is correct, then it seems accurate to call it a problem with our epistemic standards - specifically, the standard of requiring (and rewarding) deep investigation and scholarship.

7 · John_Maxwell · 3y
Your solution to the "willingness to accept ideas even before they've been explored in depth" problem is to explore ideas in more depth. But another solution is to accept fewer ideas, or hold them much more provisionally. I'm a proponent of the second approach because:

  • I suspect even academia doesn't hold ideas as provisionally as it should. See Hamming on expertise: https://forum.effectivealtruism.org/posts/mG6mckPHAisEbtKv5/should-you-familiarize-yourself-with-the-literature-before?commentId=SaXXQXLfQBwJc9ZaK
  • I suspect trying to browbeat people to explore ideas in more depth works against the grain of an online forum as an institution. Browbeating works in academia because your career is at stake, but in an online forum, it just hurts intrinsic motivation and cuts down on forum use (the forum runs on what Clay Shirky called "cognitive surplus", essentially a term for people's spare time and motivation). I'd say one big problem with LW 1.0 that LW 2.0 had to solve before flourishing was that people felt too browbeaten to post much of anything.

If we accept fewer ideas / hold them much more provisionally, but provide a clear path to having an idea be widely held as true, that creates an incentive for people to try & jump through hoops--and this incentive is a positive one, not a punishment-driven browbeating incentive. Maybe part of the issue is that on LW, peer review generally happens in the comments after you publish, not before. So there's no publication carrot to offer in exchange for overcoming the objections of peer reviewers.
4 · Richard_Ngo · 3y
"If we accept fewer ideas / hold them much more provisionally, but provide a clear path to having an idea be widely held as true, that creates an incentive for people to try & jump through hoops--and this incentive is a positive one, not a punishment-driven browbeating incentive." Hmm, it sounds like we agree on the solution but are emphasising different parts of it. For me, the question is: who's this "we" that should accept fewer ideas? It's the set of people who agree with my argument that you shouldn't believe things which haven't been fleshed out very much. But the easiest way to add people to that set is just to make the argument, which is what I've done. Specifically, note that I'm not criticising anyone for producing posts that are short and speculative: I'm criticising the people who update too much on those posts.
4 · John_Maxwell · 3y
Fair enough. I'm reminded of a time someone summarized one of my posts as being a definitive argument against some idea X and me thinking to myself "even I don't think my post definitively settles this issue" haha.
3 · Raemon · 3y
Yeah, this is roughly how I think about it. I do think right now LessWrong should lean more in the direction Richard is suggesting – I think it was essential to establish better Babble procedures, but now we're doing well enough on that front that setting clearer expectations of how the eventual pruning works is reasonable.
5 · Richard_Ngo · 3y
I wanted to register that I don't like "babble and prune" as a model of intellectual development. I think intellectual development actually looks more like:

1. Babble
2. Prune
3. Extensive scholarship
4. More pruning
5. Distilling scholarship to form common knowledge

And that my main criticism is the lack of 3 and 5, not the lack of 2 or 4. I also note that: a) these steps get monotonically harder, so that focusing on the first two misses *almost all* the work; b) maybe I'm being too harsh on the babble and prune framework because it's so thematically appropriate for me to dunk on it here; I'm not sure if your use of the terminology actually reveals a substantive disagreement.
2 · Raemon · 3y
I basically agree with your 5-step model (I at least agree it's a more accurate description than Babble and Prune, which I just meant as rough shorthand). I'd add things like "original research/empiricism" or "more rigorous theorizing" to the "Extensive Scholarship" step.

I see the LW Review as basically the first of (what I agree should essentially be at least) a 5-step process. It's adding a stronger Step 2, and a bit of Step 5 (at least some people chose to rewrite their posts to be clearer and respond to criticism) ...

Currently, we do get non-zero Extensive Scholarship and Original Empiricism. (Kaj's Multi-Agent Models of Mind [https://www.lesswrong.com/s/ZbmRyDN8TCpBTZSip] seems like it includes real scholarship. Scott Alexander / Eli Tyre and Bucky's exploration into Birth Order Effects seemed like real empiricism.) Not nearly as much as I'd like. But John's comment elsethread [https://www.lesswrong.com/s/ZbmRyDN8TCpBTZSip] seems significant:

This reminded me of a couple posts in the 2018 Review, Local Validity as a Key to Sanity and Civilization [https://www.lesswrong.com/posts/WQFioaudEH8R7fyhm/local-validity-as-a-key-to-sanity-and-civilization] and Is Clickbait Destroying Our General Intelligence? [https://www.lesswrong.com/posts/YicoiQurNBxSp7a65/is-clickbait-destroying-our-general-intelligence]. Both of those seemed like "sure, interesting hypothesis. Is it real tho?" During the Review I created a followup question, "How would we check if Mathematicians are Generally More Law Abiding? [https://www.lesswrong.com/posts/9MztEdRLeYcTDvuiZ/how-would-we-check-if-mathematicians-are-generally-more-law]", trying to move the question from Stage 2 to 3. I didn't get much serious response, probably because, well, it was a much harder question. But, honestly... I'm not sure it's actually a question that was worth asking. I'd like to know if Eliezer's hypothesis about mathematicians is true, but I'm not sure it ranks near the top of questions I'd want people to put
2 · John_Maxwell · 3y
1. All else equal, the harder something is, the less we should do it.

2. My quick take is that writing lit reviews/textbooks is a comparative disadvantage of LW relative to the mainstream academic establishment. In terms of producing reliable knowledge... if people actually care about whether something is true, they can always offer a cash prize for the best counterargument (which could of course constitute citation of academic research). The fact that people aren't doing this suggests to me that for most claims on LW, there isn't any (reasonably rich) person who cares deeply re: whether the claim is true. I'm a little wary of putting a lot of effort into supply if there is an absence of demand.

(I guess the counterargument is that accurate knowledge is a public good, so an individual's willingness to pay doesn't get you the complete picture of the value accurate knowledge brings. Maybe what we need is a way to crowdfund bounties for the best argument related to something.)

(I agree that LW authors would ideally engage more with each other and academic literature on the margin.)
4 · DirectedEvolution · 3y
I've been thinking about the idea of "social rationality" lately, and this is related. We do so much here in the way of training individual rationality - the inputs, functions, and outputs of a single human mind. But if truth is a product, then getting human minds well-coordinated to produce it might be much more important than training them to be individually stronger. Just as assembly line production is much more effective in producing almost anything than teaching each worker to be faster in assembling a complete product by themselves. My guess is that this could be effective not only in producing useful products, but also in overcoming biases.

Imagine you took 5 separate LWers and asked them to create a unified consensus response to a given article. My guess is that they'd learn more through that collective effort, and produce a more useful response, than if they spent the same amount of time individually evaluating the article and posting their separate replies.

Of course, one of the reasons we don't do that so much is that coordination is an up-front investment and is unfamiliar. Figuring out social technology to make it easier to participate in might be a great project for LW.

There's been a fair amount of discussion of that sort of thing here: https://www.lesswrong.com/tag/group-rationality There are also groups outside LW thinking about social technology such as RadicalxChange.

Imagine you took 5 separate LWers and asked them to create a unified consensus response to a given article. My guess is that they’d learn more through that collective effort, and produce a more useful response, than if they spent the same amount of time individually evaluating the article and posting their separate replies.

I'm not sure. If you put those 5 LWers together, I think there's a good chance that the highest status person speaks first and then the others anchor on what they say and then it effectively ends up being like a group project for school with the highest status person in charge. Some related links.

3 · DirectedEvolution · 3y
That’s definitely a concern too! I imagine such groups forming among people who either already share a basic common view, and collaborate to investigate more deeply. That way, any status-anchoring effects are mitigated. Alternatively, it could be an adversarial collaboration. For me personally, some of the SSC essays in this format have led me to change my mind in a lasting way.
2 · curi · 3y
People also reject ideas before they've been explored in depth. I've tried to discuss [https://curi.us/2064-less-wrong-lacks-representatives-and-paths-forward] similar issues with LW [https://www.lesswrong.com/posts/oLScLrrfsGps8SduN/less-wrong-lacks-representatives-and-paths-forward] before but the basic response was roughly "we like chaos where no one pays attention to whether an argument has ever been answered by anyone; we all just do our own thing with no attempt at comprehensiveness or organizing who does what; having organized leadership of any sort, or anyone who is responsible for anything, would be irrational" (plus some suggestions that I'm low social status and that therefore I personally deserve to be ignored. there were also suggestions – phrased rather differently but amounting to this – that LW will listen more if published ideas are rewritten, not to improve on any flaws, but so that the new versions can be published at LW before anywhere else, because the LW community's attention allocation is highly biased towards that).
2 · Ben Pace · 3y
I feel somewhat inclined to wrap up this thread at some point, even while there's more to say. We can continue if you like and have something specific or strong you'd like to ask, but otherwise will pause here.
1 · TAG · 3y
You have to realise that what you are doing isn't adequate in order to gain the motivation to do it better, and that is unlikely to happen if you are mostly communicating with other people who think everything is OK.
3 · TAG · 3y
Lesswrong is competing against philosophy as well as science, and philosophy has a broader criterion of evidence still. In fact, lesswrongians are often frustrated that mainstream philosophy takes such topics as dualism or theism seriously, even though there's an abundance of Bayesian evidence for them.
2 · John_Maxwell · 3y
Depends on the claim, right? If the cost of evaluating a hypothesis is high, and hypotheses are cheap to generate, I would like to generate a great deal before selecting one to evaluate.
2 · DanielFilan · 3y
As mentioned in this comment [https://www.lesswrong.com/posts/K4eDzqS2rbcBDsCLZ/unrolling-social-metacognition-three-levels-of-meta-are-not?commentId=vEzubk5Fj8L99mKJq], the Unrolling social metacognition paper is closely related to at least one research paper.
5 · Richard_Ngo · 3y
Right, but this isn't mentioned in the post? Which seems odd. Maybe that's actually another example of the "LW mentality": why is the fact that there has been solid empirical research into whether 3 layers are enough not important enough to mention in a post on why 3 layers isn't enough? (Maybe because the post was time-boxed? If so that seems reasonable, but then I would hope that people comment saying "Here's a very relevant paper, why didn't you cite it?")
7 · Zachary Robertson · 3y
I think a distinction should be made between intellectual progress (whatever that is) and distillation. I know lots of websites that do amazing distillation of AI-related concepts (literally distill.pub). I think most people would agree that sort of work is important in order to make intellectual progress, but I also think significantly fewer people would agree distillation is intellectual progress. Having this distinction in mind, I think your examples from AI are not as convincing. Perhaps more so once you consider that Less Wrong is often being used more as a platform to share these distillations than to create them.

I think you're right that Less Wrong has some truly amazing content. However, once again, it seems a lot of these posts are not inherently from the ecosystem but are rather essentially cross-posted. If I say a lot of the content on LW is low-quality, it's mostly an observation about what I expect to find from material that builds on itself. The quality of LW-style accumulated knowledge seems lower than it could be.

On a personal note, I've actively tried to explore using this site as a way to engage with research and have come to a similar opinion as Richard. The most obvious barrier is the separation between LW and AIAF. Effectively, if you're doing AI safety research, to second-order approximation you can block LW (noise) and only look at AIAF (signal). I say second-order because anything from LW that is signal ends up being posted on AIAF anyway, which means the method is somewhat error-tolerant.

This probably comes off as a bit pessimistic. Here's a concrete proposal I hope to try out soon enough. Pick a research question. Get a small group of people/friends together. Start talking about the problem and then posting on LW. Iterate until there's group consensus.

Much of the same is true of scientific journals. Creating a place to share and publish research is a pretty key piece of intellectual infrastructure, especially for researchers to create artifacts of their thinking along the way. 

The point about being 'cross-posted' is where I disagree the most. 

This is largely original content that counterfactually wouldn't have been published, or occasionally would have been published but to a much smaller audience. What Failure Looks Like wasn't crossposted, Anna's piece on reality-revealing puzzles wasn't crossposted. I think that Zvi would have still written some on mazes and simulacra, but I imagine he writes substantially more content given the cross-posting available for the LW audience. Could perhaps check his blogging frequency over the last few years to see if that tracks. I recall Zhu telling me he wrote his FAQ because LW offered an audience for it, and likely wouldn't have done so otherwise. I love everything Abram writes, and while he did have the Intelligent Agent Foundations Forum, it had a much more concise, technical style, tiny audience, and didn't have the conversational explanations and stories and cartoons that have…

Rohin Shah (3y):
Yeah, that's true, though it might have happened at some later point in the future as I got increasingly frustrated by people continuing to cite VNM at me (though probably it would have been a blog post and not a full sequence).

Reading through this comment tree, I feel like there's a distinction to be made between "LW / AIAF as a platform that aggregates readership and provides better incentives for blogging", and "the intellectual progress caused by posts on LW / AIAF". The former seems like a clear and large positive of LW / AIAF, which I think Richard would agree with. For the latter, I tend to agree with Richard, though perhaps not as strongly as he does. Maybe I'd put it as: I only really expect intellectual progress from a few people who work on problems full time, who probably would have done similar-ish work if not for LW / AIAF (but likely would not have made it public).

I'd say this mostly for the AI posts. I do read the rationality posts and don't get a different impression from them, but I also don't think enough about them to be confident in my opinions there.
Ben Pace (3y):
 By "AN" do you mean the AI Alignment Forum, or "AIAF"?
Zachary Robertson (3y):
Yeah, I totally messed that up. I meant the AI Alignment Forum, AIAF. I think out of habit I used AN (the Alignment Newsletter).
Ben Pace (3y):
I did suspect you'd confused it with the Alignment Newsletter :)

One fairly strong belief of mine is that Less Wrong's epistemic standards are not high enough to make solid intellectual progress here.

I think this is literally true. There seems to be very little ability to build upon prior work.

Out of curiosity do you see Less Wrong as significantly useful or is it closer to entertainment/habit? I've found myself thinking along the same lines as I start thinking about starting my PhD program etc. The utility of Less Wrong seems to be a kind of double-edged sword. On the one hand, some of the content is really insightful and exposes me to ideas I wouldn't otherwise encounter. On the other hand, there is such an incredible amount of low-quality content that I worry that I'm learning bad practices.

Viliam (3y):
Ironically, some people already feel threatened by the high standards here. Setting them higher probably wouldn't result in more good content. It would result in less mediocre content, but probably also less good content, as the authors who sometimes write a mediocre article and sometimes a good one would get discouraged and give up. Ben Pace gives a few examples of great content in the next comment. It would be better to be able to more easily separate the good content from the rest, but that's what the reviews are for. Well, only one review so far, if I remember correctly. I would love to see reviews of pre-2018 content (maybe multiple years in one review, if they were less productive). Then I would love to see the winning content get the same treatment as the Sequences -- edit it and arrange it into a book, and make it "required reading" for the community (available as a free PDF).
Zachary Robertson (3y):
I broadly agree here. However, I do see the short-forms as a consistent way to skirt around this. I'd say at least 30% of the Less Wrong value proposition is the conversations I get to have. Short-forms seem to be more adapted for continuing conversations, and they have a low bar for being made.

I could clarify a bit. My main problem with low-quality content isn't exactly that it's 'wrong' or something like that. Mostly, the issues I'm finding most common for me are:

1. Too many niche pre-requisites.
2. No comments.
3. A nagging feeling that the post is reinventing the wheel.

I think (1) is a ridiculously bad problem. I'm literally getting a PhD in machine learning, write about AI Safety, and still find a large number of those posts (yes, AN posts) glazed in internal jargon that makes it difficult to connect with current research. Things get even worse when I look at non-AI-related things.

(2) is just a tragedy of the fact that the rich get richer. While I'm guilty of this also, I think that requiring authors to also post seed questions/discussion topics in the comments could go a long way to alleviate this problem. I oftentimes read a post and want to leave a comment, but then don't, because I'm not even sure the author thought about the discussion their post might start.

(3) is probably a bit mean. Yet, more than once I've discovered that a Less Wrong concept already had a large research literature devoted to it. I think this ties in with (1), due to the fact that niche pre-reqs often go hand-in-hand with insufficient literature review.
Ruby (3y):
Thanks for chiming in with this. People criticizing the epistemics is hopefully how we get better epistemics. When the Californian smoke isn't interfering with my cognition as much, I'll try to give your feedback (and Rohin's [https://www.lesswrong.com/posts/Wnqua6eQkewL3bqsF/matt-botvinick-on-the-spontaneous-emergence-of-learning?commentId=pYpPnAKrz64ptyRid]) proper attention. I would generally be interested to hear your arguments/models in detail, if you get the chance to lay them out.

My default position is that LW has done well enough historically (e.g. Ben Pace's examples) for me to currently be investing in getting it even better. Epistemics and progress could definitely be a lot better, but getting there is hard. If I didn't see much progress on the rate of progress in the next year or two, I'd probably go focus on other things, though I think it'd be tragic if we ever lost what we have now.

And another thought: Yes and no. Journal articles have their advantages, and so do blog posts. A bunch of the LessWrong team's recent work has been around filling in the missing pieces for the system to work, e.g. Open Questions (hasn't yet worked for coordinating research), Annual Review, Tagging, Wiki. We often talk about conferences and "campus". My work on Open Questions involved thinking about i) a better template for articles than "Abstract, Intro, Methods, etc." (though Open Questions didn't work, for unrelated reasons we haven't overcome yet), ii) getting lit reviews done systematically by people, iii) coordinating groups around research agendas.

I've thought about re-attempting the goals of Open Questions with instead a "Research Agenda" feature that lets people communally maintain research agendas and work on them. It's a question of priorities whether I work on that anytime soon. I do really think many of the deficiencies of LessWrong's current work compared to academia are "infrastructure problems" at least as much as the epistemic standards of the community. Which me…
Richard_Ngo (3y):
For the record, I think the LW team is doing a great job. There's definitely a sense in which better infrastructure can reduce the need for high epistemic standards, but it feels like the thing I'm pointing at is more like "Many LW contributors not even realising how far away we are from being able to reliably produce and build on good ideas" (which feels like my criticism of Ben's position in his comment, so I'll respond more directly there).
Pongo (3y):
It seems really valuable to have you sharing how you think we’re falling epistemically short and probably important for the site to integrate the insights behind that view. There are a bunch of ways I disagree with your claims about epistemic best practices, but it seems like it would be cool if I could pass your ITT more. I wish your attempt to communicate the problems you saw had worked out better. I hope there’s a way for you to help improve LW epistemics, but also get that it might be costly in time and energy.
Viliam (3y):
Now they're positive again. Confusingly to me, their Ω-karma (karma on another website) is also positive. Does it mean they previously had negative LW-karma but positive Ω-karma? Or that their Ω-karma also improved as a result of you complaining on LW a few hours ago? Why would it? (Feature request: graph of the evolution of comment karma as a function of time.)
Richard_Ngo (3y):
I'm confused, what is Ω-karma?
MikkW (3y):
AI Alignment Forum karma (which is also displayed here on posts that are crossposted)
NaiveTortoise (3y):
I'd be curious what, if any, communities you think set good examples in this regard. In particular, are there specific academic subfields or non-academic scenes that exemplify the virtues you'd like to see more of?
Richard_Ngo (3y):
Maybe historians of the industrial revolution? Who grapple with really complex phenomena and large-scale patterns, like us, but unlike us use a lot of data, write a lot of thorough papers and books, and then have a lot of ongoing debate on those ideas. And then the "progress studies" crowd is an example of an online community inspired by that tradition (but still very nascent, so we'll see how it goes). More generally I'd say we could learn to be more rigorous by looking at any scientific discipline or econ or analytic philosophy. I don't think most LW posters are in a position to put in as much effort as full-time researchers, but certainly we can push a bit in that direction.
NaiveTortoise (3y):
Thanks for your reply! I largely agree with drossbucket [https://www.lesswrong.com/posts/FuGfR3jL3sw6r8kB4/ricraz-s-shortform?commentId=YkN5oaCnFuwZmJnJ5]'s reply. I also wonder how much this is an incentives problem. As you mentioned, and in my experience, the fields you mentioned strongly incentivize an almost fanatical level of thoroughness that I suspect is very hard for individuals to maintain without outside incentives pushing them that way. At least personally, I definitely struggle and, frankly, mostly fail to live up to the sorts of standards you mention when writing blog posts, in part because the incentive gradient feels like it pushes towards hitting the publish button. Given this, I wonder if there's a way to shift the incentives on the margin. One minor thing I've been thinking of trying for my personal writing is having a Knuth or Nintil [https://nintil.com/prove-wrong-get-money] style "pay for mistakes" policy. Do you have thoughts on other incentive structures for rewarding rigor or punishing the lack thereof?
Richard_Ngo (3y):
It feels partly like an incentives problem, but also I think a lot of people around here are altruistic and truth-seeking and just don't realise that there are much more effective ways to contribute to community epistemics than standard blog posts.

I think that most LW discussion is at the level where "paying for mistakes" wouldn't be that helpful, since a lot of it is fuzzy. Probably the thing we need first is more reference posts that distill a range of discussion into key concepts and place that in the wider intellectual context. Then we can get more empirical. (Although I feel pretty biased on this point, because my own style of learning about things is very top-down.) I guess to encourage this, we could add a "reference" section for posts that aim to distill ongoing debates on LW.

In some cases you can get a lot of "cheap" credit by taking other people's ideas and writing a definitive version of them aimed at more mainstream audiences. For ideas that are really worth spreading, that seems useful.

(Written quickly and not very carefully.)

I think it's worth stating publicly that I have a significant disagreement with a number of recent presentations of AI risk, in particular Ajeya's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover", and Cohen et al.'s "Advanced artificial agents intervene in the provision of reward". They focus on policies learning the goal of getting high reward. But I have two problems with this:

  1. I expect "reward" to be a hard goal to learn, because it's a pretty abstract concept and not closely related to the direct observations that policies are going to receive. If you keep training policies, maybe they'd converge to it eventually, but my guess is that this would take long enough that we'd already have superhuman AIs which would either have killed us or solved alignment for us (or at least started using gradient hacking strategies which undermine the "convergence" argument). Analogously, humans don't care very much at all about the specific connections between our reward centers and the rest of our brains - insofar as we do want to influence them it's because we care about much more directly-observable p
…

Putting my money where my mouth is: I just uploaded a (significantly revised) version of my Alignment Problem position paper, where I attempt to describe the AGI alignment problem as rigorously as possible. The current version only has "policy learns to care about reward directly" as a footnote; I can imagine updating it based on the outcome of this discussion though.

dsj (6mo):
For someone who's read v1 of this paper, what would you recommend as the best way to "update" to v3? Is an entire reread the best approach? [Edit March 11, 2023: Having now read the new version in full, my recommendation to anyone else with the same question is a full reread.]
paulfchristiano (6mo):
I'm not very convinced by this comment as an objection to "50% AI grabs power to get reward." (I find it more plausible as an objection to "AI will definitely grab power to get reward.") This seems to be most of your position, but I'm skeptical (and it's kind of just asserted without argument):

* The data used in training is literally the only thing that AI systems observe, and prima facie reward just seems like another kind of data that plays a similarly central role. Maybe your "unnaturalness" abstraction can make finer-grained distinctions than that, but I don't think I buy it.
* If people train their AI with RLDT then the AI is literally trained to predict reward! I don't see how this is remote, and I'm not clear whether your position is that e.g. the value function will be bad at predicting reward because it is an "unnatural" target for supervised learning.
* I don't understand the analogy with humans. It sounds like you are saying "an AI system selected based on the reward of its actions learns to select actions it expects to lead to high reward" is analogous to "humans care about the details of their reward circuitry." But:
  * I don't think human learning is just RL based on the reward circuit; I think this is at least a contrarian position and it seems unworkable to me as an explanation of human behavior.
  * It seems like the analogous conclusion for RL systems would be "they may not care about the rewards that go into the SGD update; they may instead care about the rewards that get entered into the dataset, or even something further causally upstream of that, as long as it's very well-correlated on the training set." But it doesn't matter what we choose that's causally upstream of rewards, as long as it's perfectly correlated on the training set?
  * (Or you could be saying that humans are motivated by pleasure and pain but not the entire suite of things that are upstream of rewar…
TurnTrout (6mo):
(Emphasis added) I don't think this engages with the substance of the analogy to humans. I don't think any party in this conversation believes that human learning is "just" RL based on a reward circuit, and I don't believe it either [https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target?commentId=FaLrB7AcbZguJwtrs]. "Just RL" also isn't necessary for the human case to give evidence about the AI case. Therefore, your summary seems to me like a strawman of the argument.  I would say "human value formation mostly occurs via RL & algorithms meta-learned thereby, but in the important context of SSL / predictive processing, and influenced by inductive biases from high-level connectome topology and genetically specified reflexes and environmental regularities and..." Furthermore, we have good evidence that RL plays an important role in human learning. For example, from The shard theory of human values [https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values]:
paulfchristiano (6mo):
This is incredibly weak evidence.

* Animals were selected over millions of generations to effectively pursue external goals. So yes, they have external goals.
* Humans also engage in within-lifetime learning, so of course you see all kinds of indicators of that in brains.

Both of those observations have high probability, so they aren't significant Bayesian evidence for "RL tends to produce external goals by default." In particular, for this to be evidence for Richard's claim, you need to say: "If RL tended to produce systems that care about reward, then RL would be significantly less likely to play a role in human cognition." There's some update there, but it's just not big. It's easy to build brains that use RL as part of a more complicated system and end up with lots of goals other than reward.

My view is probably the other way---humans care about reward more than I would guess from the actual amount of RL they can do over the course of their life (my guess is that other systems play a significant role in our conscious attitude towards pleasure).
interstice (5mo):
Curious what systems you have in mind here.
TurnTrout (5mo):
I don't understand why you think this explains away the evidential impact, and I guess I put way less weight on selection reasoning than you do [https://www.lesswrong.com/posts/8ccTZ9ZxpJrvnxt4F/shard-theory-in-nine-theses-a-distillation-and-critical?commentId=PbxEA2SEYjxDbLMFA]. My reasoning here goes:

1. Lots of animals do reinforcement learning.
2. In particular, humans prominently do reinforcement learning.
3. Humans care about lots of things in reality, not just certain kinds of cognitive-update-signals.
4. "RL -> high chance of caring about reality" predicts this observation more strongly than "RL -> low chance of caring about reality".

This seems pretty straightforward to me, but I bet there are also pieces of your perspective I'm just not seeing.

But in particular, it doesn't seem relevant to consider selection pressures from evolution, except insofar as we're postulating additional mechanisms which evolution found which explain away some of the reality-caring? That would weaken (but not eliminate) the update towards "RL -> high chance of caring about reality." I don't see how this point is relevant. Are you saying that within-lifetime learning is unsurprising, so we can't make further updates by reasoning about how people do it? I'm saying that there was a missed update towards that conclusion, so it doesn't matter if we already knew that humans do within-lifetime learning?
paulfchristiano (5mo):
You seem to be saying P(humans care about the real world | RL agents usually care about reward) is low. I'm objecting, and claiming that in fact P(humans care about the real world | RL agents usually care about reward) is fairly high, because humans are selected to care about the real world and evolution can be picky about what kind of RL it does, and it can (and does) throw tons of other stuff in there.

The Bayesian update is P(humans care about the real world | RL agents usually care about reward) / P(humans care about the real world | RL agents mostly care about other stuff). So if e.g. P(humans care about the real world | RL agents don't usually care about reward) was 80%, then your update could be at most 1.25. In fact I think it's even smaller than that. And then if you try to turn that into evidence about "reward is a very hard concept to learn," or a prediction about how neural nets trained with RL will behave, it's moving my odds ratios by less than 10% (since we are using "RL" quite loosely in this discussion, and there are lots of other differences and complications at play, all of which shrink the update). You seem to be saying "yes but it's evidence," which I'm not objecting to---I'm just saying it's an extremely small amount of evidence. I'm not clear on whether you agree with my calculation.

(Some of the other text I wrote was about a different argument you might be making: that P(humans use RL | RL agents usually care about reward) is significantly lower than P(humans use RL | RL agents mostly care about other stuff), because evolution would then have never used RL. My sense is that you aren't making this argument, so you should ignore all of that; sorry to be confusing.)
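For concreteness, the arithmetic behind this bound can be sketched in a few lines. The hypotheses and the 80% figure are the illustrative numbers from the comment above, not empirical claims:

```python
# E  = "humans care about the real world"
# H1 = "RL agents usually care about reward"
# H2 = "RL agents mostly care about other stuff"

p_E_given_H2 = 0.8  # assumed for illustration: P(E | H2) = 80%
p_E_given_H1 = 1.0  # a likelihood can be at most 1

# The Bayesian update to the odds on H1 vs H2 is the likelihood ratio,
# so it is bounded above by:
max_update = p_E_given_H1 / p_E_given_H2
print(max_update)  # 1.25
```

Since the actual P(E | H1) is below 1, the real update is smaller still, which is the point being made.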
TurnTrout (4mo):
Just saw this reply recently. Thanks for leaving it, I found it stimulating. (I wrote the following rather quickly, in an attempt to write anything at all, as I find it not that pleasant to write LW comments -- no offense to you in particular. Apologies if it's confusing or unclear.)

Yes, in large part.

Yeah, are people differentially selected for caring about the real world? At the risk of seeming facile, this feels non-obvious. My gut take is that conditional on RL agents usually caring about reward (and thus setting aside a bunch of my inside-view reasoning about how RL dynamics work), conditional on that -- reward-humans could totally have been selected for. This would drive up P(humans care about reward | RL agents care about reward, humans were selected by evolution), and thus (I think?) drive down P(humans care about the real world | RL agents usually care about reward).

POV: I'm in an ancestral environment, and I (somehow) only care about the rewarding feeling of eating bread. I only care about the nice feeling which comes from having sex, or watching the birth of my son, or gaining power in the tribe. I don't care about the real-world status of my actual son, although I might have strictly instrumental heuristics about e.g. how to keep him safe and well-fed in certain situations, as cognitive shortcuts for getting reward (but not as terminal values).

In what way is my fitness lower than someone who really cares about these things, given that the best way to get rewards may well be to actually do the things? Here are some ways I can think of:

1. Caring about reward directly makes reward hacking a problem evolution has to solve, and if it doesn't solve it properly, the person ends up masturbating and not taking (re)productive actions.
   1. Counter-counterpoint: But also many people do in fact enjoy masturbating, even though it seems (to my naive view) like an obvious thing to select away, which was present ancestral…
Vivek Hebbar (2mo):
Would such a person sacrifice themselves for their children (in situations where doing so would be a fitness advantage)?
TurnTrout (2mo):
I think this highlights a good counterpoint. I think this alternate theory predicts "probably not", although I can contrive hypotheses for why people would sacrifice themselves (because they have learned that high-status -> reward; and it's high-status to sacrifice yourself for your kid). Or because keeping your kid safe -> high reward as another learned drive. Overall this feels like contortion but I think it's possible. Maybe overall this is a... 1-bit update against the "not selection for caring about reality" point?
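Reading the "1-bit update" above in the usual information-theoretic sense (my assumed reading, not something the comment states explicitly), one bit of evidence corresponds to a 2:1 likelihood ratio:

```python
import math

# One bit of evidence = an observation twice as likely under one
# hypothesis as under the other (assumed interpretation).
likelihood_ratio = 2.0
bits_of_evidence = math.log2(likelihood_ratio)
print(bits_of_evidence)  # 1.0
```

So a "1-bit update against" would roughly halve the odds on the "not selection for caring about reality" reading.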
TurnTrout (6mo):
I don't know what this means. Suppose we have an AI which "cares about reward" (as you think of it in this situation). The "episode" consists of the AI copying its network & activations to another off-site server, and then the original lab blows up. The original reward register no longer exists (it got blown up), and the agent is not presently being trained by an RL alg.  What is the "reward" for this situation? What would have happened if we "sampled" this episode during training?
paulfchristiano (6mo):
I agree there are all kinds of situations where the generalization of "reward" is ambiguous and lots of different things could happen. But it has a clear interpretation for the typical deployment episode, since we can take counterfactuals over the randomization used to select training data. It's possible that agents may specifically want to navigate towards situations where RL training is not happening and the notion of reward becomes ambiguous, and indeed this is quite explicitly discussed in the document Richard is replying to. As far as I can tell, the fact that there exist cases where different generalizations of reward behave differently does not undermine the point at all.
TurnTrout (5mo):
Yeah, I think I was wondering about the intended scoping of your statement. I perceive myself to agree with you that there are situations (like LLM training to get an alignment research assistant) where "what if we had sampled during training?" is well-defined and fine. I was wondering if you viewed this as a general question we could ask. I also agree that Ajeya's post addresses this "ambiguity" question, which is nice!
Richard_Ngo (6mo):
It's intended as an objection to "AI grabs power to get reward is the central threat model to focus on", but I think our disagreements still apply given this. (FWIW my central threat model is that policies care about reward to some extent, but that the goals which actually motivate them to do power-seeking things are more object-level.)

I expect policies to be getting rich input streams like video, text, etc., which they use to make decisions. Reward is different from other types of data because reward isn't actually observed as part of these input streams by policies during episodes. This makes it harder to learn as a goal compared with things that are more directly observable (in a similar way to how "care about children" is an easier goal to learn than "care about genes").

I don't think this line of reasoning works, because "the episode appearing in training" can be a dependent variable. For example, consider an RL agent that's credibly told that its data is not going to be used for training unless it misbehaves badly. An agent which maximizes reward conditional on the episode appearing in training is therefore going to misbehave the minimum amount required to get its episode into the training data (and more generally, behave like getting its episode into training is a terminal goal). This seems very counterintuitive.

Some versions that wouldn't result in power-grabbing:

* Goal is "get highest proportion of possible reward"; the policy might rewrite the training algorithm to be myopic, then get perfect reward for one step, then stop.
* Goal is "care about (not getting low rewards on) specific computers used during training"; the policy might destroy those particular computers, then stop.
* Goal is "impress the critic"; the policy might then rewrite its critic to always output high reward, then stop.
* Goal is "get high reward myself this episode"; the policy might try to do power-seeking things but never create more copies of itself, and…
paulfchristiano (6mo):
Are you imagining that such systems get meaningfully low reward on the training distribution because they are pursuing those goals, or that these goals are extremely well-correlated with reward on the training distribution and only come apart at test time? Is the model deceptively aligned?

Children vs genes doesn't seem like a good comparison; it seems obvious that models will understand the idea of reward during training (whereas humans don't understand genes during evolution). A better comparison might be "have children during my life" vs "have a legacy after I'm gone," but in fact humans have both goals even though one is never directly observed.

I guess more importantly, I don't buy the claim about "things you only get selected on are less natural as goals than things you observe during episodes," especially if your policy is trained to make good predictions of reward. I don't know if there's a specific reason for your view, or if this is just a clash of intuitions. It feels to me like my position is kind of the default, in that you are offering a feature and saying it is a major consideration that SGD wouldn't learn a particular kind of cognition.

If the agent misbehaves so that its data will be used for training, then the misbehaving actions will get a low reward. So SGD will shift from "misbehave so that my data will be used for training" to "ignore the effect of my actions on whether my data will be used for training, and just produce actions that would result in a low reward assuming that this data is used for training."

I agree the behavior of the model isn't easily summarized by an English sentence. The thing that seems most clear is that the model trained by SGD will learn not to sacrifice reward in order to increase the probability that the episode is used in training. If you think that's wrong I'm happy to disagree about it. Every one of your examples results in the model grabbing power and doing something bad at deployment time, though I agree that…
Richard_Ngo (5mo):
Reading back over this now, I think we're arguing at cross purposes in some ways. I should have clarified earlier that my specific argument was against policies learning a terminal goal of reward that generalizes to long-term power-seeking. I do expect deceptive alignment after policies learn other broadly-scoped terminal goals and realize that reward-maximization is a good instrumental strategy. So all my arguments about the naturalness of reward-maximization as a goal are focused on the question of which type of terminal goal policies with dangerous levels of capabilities learn first.*

Let's distinguish three types (where "myopic" is intended to mean something like "only cares about the current episode"):

1. Non-myopic misaligned goals that lead to instrumental reward maximization (deceptive alignment)
2. Myopic terminal reward maximization
3. Non-myopic terminal reward maximization

Either 1 or 2 (or both of them) seems plausible to me. 3 is the one I'm skeptical about. How come?

1. We should expect models to have fairly robust terminal goals (since, unlike beliefs or instrumental goals, terminal goals shouldn't change quickly with new information). So once they understand the concept of reward maximization, it'll be easier for them to adopt it as an instrumental strategy than as a terminal goal. (An analogy to evolution: once humans construct highly novel strategies for maximizing genetic fitness (like making thousands of clones), people are more likely to do it for instrumental reasons than terminal reasons.)
2. Even if they adopt reward maximization as a terminal goal, they're more likely to adopt a myopic version of it than a non-myopic version, since (I claim) the concept of reward maximization doesn't generalize very naturally to larger scales. Above, you point out that even relatively myopic reward maximization will lead to limited takeover, and so we'll train subsequent agents to be less myopic. But…
Tom Davidson (5mo):
Why does it not lead to takeover in the same way?
paulfchristiano (5mo):
Because it's easy to detect and correct (except that correcting it might push you into one of the other regimes).
Tom Davidson (5mo):
So, far causally upstream of the human evaluator's opinion? E.g. an AI counselor optimizing for getting to know you.
Ajeya Cotra (6mo):
Note that the "without countermeasures" post consistently discusses both possibilities (the model cares about reward, or the model cares about something else that's consistent with it getting very high reward on the training dataset) -- e.g. see the above-the-fold intro, as well as the section Even if Alex isn't "motivated" to maximize reward... [https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#Even_if_Alex_isn_t__motivated__to_maximize_reward__it_would_seek_to_seize_control]. I do place a ton of emphasis on the fact that Alex enacts a policy which has the empirical effect of maximizing reward, but that's distinct from being confident in the motivations that give rise to that policy. I believe Alex would try very hard to maximize reward in most cases, but this could be for either terminal or instrumental reasons.

With that said, for roughly the reasons Paul says above, I think I probably do have a disagreement with Richard -- I think that caring about some version of reward is pretty plausible (~50% or so). It seems pretty natural and easy to grasp to me, and because I think there will likely be continuous online training, the argument that there's no notion of reward on the deployment distribution doesn't feel compelling to me.
5Richard_Ngo6mo
Yepp, agreed, the thing I'm objecting to is how you mainly focus on the reward case, and then say "but the same dynamics apply in other cases too..." The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).
3Lauro Langosco6mo
I agree with your general point here, but I think Ajeya's post actually gets this right.
2Lauro Langosco6mo
I also think that often "the AI just maximizes reward" is a useful simplifying assumption. That is, we can make an argument of the form "even if the AI just maximizes reward, it still takes over; if it maximizes some correlate of the reward instead, then we have even less control over what it does and so are even more doomed". (Though of course it's important to spell the argument out)
3Ajeya Cotra6mo
Yeah, I agree this is a good argument structure -- in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it's quite useful to establish that it's doomed; that's the kind of structure I was going for in the post.
9Richard_Ngo6mo
I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible. If I had to try to point to the crux here, it might be "how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?" Where we both agree that there's some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existing heuristics, whereas I'm more focused on the regime where there are lots of low-hanging fruit in terms of changes that would make a policy more successful, and so the question of how easy that goal is to learn from its training data is pretty important. (As usual, there's the human analogy: our goals are very strongly biased towards things we have direct observational access to!) Even setting aside this disagreement, though, I don't like the argumentative structure because the generalization of "reward" to large scales is much less intuitive than the generalization of other concepts (like "make money") to large scales - in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.
4Ajeya Cotra6mo
Yes, sorry, "best case" was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing "correctly;" we could get lucky and have it generalize "incorrectly" in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here [https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#What_if_Alex_has_benevolent_motivations_]. I don't understand why reward isn't something the model has direct access to -- it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I'd have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here. Yeah, I don't really agree with this; I think I could pretty easily imagine being an AI system asking the question "How much reward would this episode get if it were sampled for training?" It seems like the intuition this is weird and unnatural is doing a lot of work in your argument, and I don't really share it.
5cfoster06mo
AFAIK the reward signal is not typically included as an input to the policy network in RL. Not sure why, and I could be wrong about that, but that is not my main question. The bigger question is "Has direct access to when?" At the moment in time when the model is making a decision, it does not have direct access to the decision-relevant reward signal because that reward is typically causally downstream of the model's decision. That reward may not even have a definite value until after decision time. Whereas concrete observables like "shiny gold coins" and "the finish line straight ahead" and "my opponent is in check" (and other abstractions in the model's ontology that are causally upstream from reward in reality) are readily available at decision time. It seems to me that that makes them natural candidates for credit assignment to flag early on as the reward-responsible mental events and reinforce into stable motivations, since they in fact were the factors that determined the decisions that led to rewards. IME, the most straightforward way for reward-itself to become the model's primary goal would be if the model learns to base its decisions on an accurate reward-predictor much earlier than it learns to base its decisions on other (likely upstream) factors. If it instead learns how to accurately predict reward-itself after it is already strongly motivated by some concrete observables, I don't see why we should expect it to dislodge that motivation, despite the true fact that those concrete observables are only pretty correlated with reward whereas an accurate reward-predictor is perfectly correlated with reward. Why? Because the model currently doesn't care about reward-itself, it currently cares about the concrete observable(s), so it has no reason to take actions that would override that goal, and it has positive goal-content integrity reasons to not take those actions.
3TurnTrout6mo
See also: Inner and outer alignment decompose one hard problem into two extremely hard problems [https://www.lesswrong.com/posts/gHefoxiznGfsbiAu9/inner-and-outer-alignment-decompose-one-hard-problem-into] (in particular: Inner alignment seems anti-natural [https://www.lesswrong.com/posts/gHefoxiznGfsbiAu9/inner-and-outer-alignment-decompose-one-hard-problem-into#Inner_alignment_seems_anti_natural]).

A possible way to convert money to progress on alignment: offering a large (recurring) prize for the most interesting failures found in the behavior of any (sufficiently-advanced) model. Right now I think it's very hard to find failures which will actually cause big real-world harms, but you might find failures in a way which uncovers useful methodologies for the future, or at least train a bunch of people to get much better at red-teaming.

(For existing models, it might be more productive to ask for "surprising behavior" rather than "failures" per se, since I think almost all current failures are relatively uninteresting. Idk how to avoid inspiring capabilities work, though... but maybe understanding models better is robustly good enough to outweigh that?)

6habryka1y
I like this. Would this have to be publicly available models? Seems kind of hard to do for private models.
3Ramana Kumar1y
What kind of access might be needed to private models? Could there be a secure multi-party computation approach that is sufficient?
1Not Relevant1y
Ideas for defining “surprising”? If we’re trying to create a real incentive, people will want to understand the resolution criteria.

A short complaint (which I hope to expand upon at some later point): there are a lot of definitions floating around which refer to outcomes rather than processes. In most cases I think that the corresponding concepts would be much better understood if we worked in terms of process definitions.

Some examples: Legg's definition of intelligence; Karnofsky's definition of "transformative AI"; Critch and Krueger's definition of misalignment (from ARCHES).

Sure, these definitions pin down what you're talking about more clearly - but that comes at the cost of understanding how and why it might come about.

E.g. when we hypothesise that AGI will be built, we know roughly what the key variables are. Whereas transformative AI could refer to all sorts of things, and what counts as transformative could depend on many different political, economic, and societal factors.

4Viliam2y
If we do not fully understand the mechanism of (e.g. human) intelligence, isn't referring to the outcome preferable to a made-up story about the process? (Of course, it would be even better if we understood the process and then referred to it.)
2adamShimi2y
Do you think that these are mutually exclusive, or something like that? I've always been confused by what I take to be the position in this shortform, that defining the outcomes makes it somehow harder to define the process. Sure, you can define a process without defining an outcome (i.e. writing a program or training an NN), but since what we are confused about is what we even want at the end, for me that's the priority. And doing so would help in searching for processes leading to this outcome. That being said, if your point is that defining outcomes isn't enough, in that we also need to define/deconfuse/study the processes leading to these outcomes, then I agree with that.

The crucial heuristic I apply when evaluating AI safety research directions is: could we have used this research to make humans safe, if we were supervising the human evolutionary process? And if not, do we have a compelling story for why it'll be easier to apply to AIs than to humans?

Sometimes this might be too strict a criterion, but I think in general it's very valuable in catching vague or unfounded assumptions about AI development.

2adamShimi3y
By making human safe, do you mean with regard to evolution's objective?
2Richard_Ngo3y
No. I meant: suppose we were rerunning a simulation of evolution, but can modify some parts of it (e.g. evolution's objective). How do we ensure that whatever intelligent species comes out of it is safe in the same ways we want AGIs to be safe? (You could also think of this as: how could some aliens overseeing human evolution have made humans safe by those aliens' standards of safety? But this is a bit trickier to think about because we don't know what their standards are. Although presumably current humans, being quite aggressive and having unbounded goals, wouldn't meet them).
4adamShimi3y
Okay, thanks. Could you give me an example of a research direction that passes this test? The thing I have in mind right now is pretty much everything that backchain to local search [https://www.lesswrong.com/posts/qEjh8rpxjG4qGtfuK/the-backchaining-to-local-search-technique-in-ai-alignment], but maybe that's not the way you think about it.
2Richard_Ngo3y
So I think Debate is probably the best example of something that makes a lot of sense when applied to humans, to the point where they're doing human experiments on it already. But this heuristic is actually a reason why I'm pretty pessimistic about most safety research directions.
4adamShimi2y
So I've been thinking about this for a while, and I think I disagree with what I understand of your perspective. Which might obviously mean I misunderstand your perspective. What I think I understand is that you judge safety research directions based on how well they could work on an evolutionary process like the one that created humans. But for me, the most promising approach to AGI is based on local search, which differs a bit from evolutionary processes. I don't really see a reason to consider evolutionary processes instead of local search, and even then, the specific approach of evolution for humans is probably far too specific as a test bench. This matters because problems for one are not problems for the other. For example, one way to mess with an evolutionary process is to find a way for everything to survive and reproduce/disseminate. Technology in general did that for humans, which means the evolutionary pressure decreased as technology evolved. But that's not a problem for local search, since at each step there will be only one next program. On the other hand, local search might be dangerous because of things like gradient hacking [https://www.alignmentforum.org/posts/uXH4r6MmKPedk8rMA/gradient-hacking], and those don't make sense for evolutionary processes. In conclusion, I feel for the moment that backchaining to local search [https://www.lesswrong.com/posts/qEjh8rpxjG4qGtfuK/the-backchaining-to-local-search-technique-in-ai-alignment] is a better heuristic for judging safety research directions. But I'm curious about where our disagreement lies on this issue.
8Richard_Ngo2y
One source of our disagreement: I would describe evolution as a type of local search. The difference is that it's local with respect to the parameters of a whole population, rather than an individual agent. So this does introduce some disanalogies, but not particularly significant ones (to my mind). I don't think it would make much difference to my heuristic if we imagined that humans had evolved via gradient descent over our genes instead. In other words, I like the heuristic of backchaining to local search, and I think of it as a subset of my heuristic. The thing it's missing, though, is that it doesn't tell you which approaches will actually scale up to training regimes which are incredibly complicated, applied to fairly intelligent agents. For example, impact penalties make sense in a local search context for simple problems. But to evaluate whether they'll work for AGIs, you need to apply them to massively complex environments. So my intuition is that, because I don't know how to apply them to the human ancestral environment, we also won't know how to apply them to our AGIs' training environments. Similarly, when I think about MIRI's work on decision theory, I really have very little idea how to evaluate it in the context of modern machine learning. Are decision theories the type of thing which AIs can learn via local search? Seems hard to tell, since our AIs are so far from general intelligence. But I can reason much more easily about the types of decision theories that humans have, and the selective pressures that gave rise to them. As a third example, my heuristic endorses Debate due to a high-level intuition about how human reasoning works, in addition to a low-level intuition about how it can arise via local search.
4adamShimi2y
So if I try to summarize your position, it's something like: backchain to local search for simple and single-AI cases, and then think about aligning humans for the scaled and multi-agents version? That makes much more sense, thanks! I also definitely see why your full heuristic doesn't feel immediately useful to me: because I mostly focus on the simple and single-AI case. But I've been thinking more and more (in part thanks to your writing) that I should allocate more thinking time to the more general case. I hope your heuristic will help me there.
4Richard_Ngo2y
Cool, glad to hear it. I'd clarify the summary slightly: I think all safety techniques should include at least a rough intuition for why they'll work in the scaled-up version, even when current work on them only applies them to simple AIs. (Perhaps this was implicit in your summary already, I'm not sure.)

Suppose we get to specify, by magic, a list of techniques that AGIs won't be able to use to take over the world. How long does that list need to be before it makes a significant dent in the overall probability of xrisk?

I used to think of "AGI designs self-replicating nanotech" mainly as an illustration of a broad class of takeover scenarios. But upon further thought, nanotech feels like a pretty central element of many takeover scenarios - you actually do need physical actuators to do many things, and the robots we might build in the foreseeable future are nowhere near what's necessary for maintaining a civilisation. So how much time might it buy us if AGIs couldn't use nanotech at all?

Well, not very much if human minds are still an attack vector - the point where we'd have effectively lost is when we can no longer make our own decisions. Okay, so rule out brainwashing/hyper-persuasion too. What else is there? The three most salient: military power, political/cultural power, economic power.

Is this all just a hypothetical exercise? I'm not sure. Designing self-replicating nanotech capable of replacing all other human tech seems really hard; it's pretty plausible to me that the world is crazy in a bunch of other ways by the time we reach that capability. And so if we can block off a couple of the easier routes to power, that might actually buy useful time.

3Donald Hobson1y
Firstly, I think it kind of depends. What exactly does blocking the AI from designing nanotech mean? Is the AI allowed to use genetic engineering? Is it allowed to use selective breeding? Elephants genetically engineered to be really good at instruction following? I mean, I think macroscopic self-replicating robotics is probably possible, and the AGI can probably bootstrap that from current robotics fairly quickly. You rule out any hyper-persuasion, but how much regular persuasion is the AI allowed to do? After all, if you are buying something online (from a small seller), them seeing the money arrive persuades them to send the product. Is it allowed to select superhumanly which humans to focus on? There are a few people on r/singularity such that the moment the AI goes "I'm an AGI", the humans will be like "all praise the machine god, I will do anything you ask". A few people have already persuaded themselves, all on their own, that AIs are inherently superior to humans. You can make the list short if you make the individual items broad, e.g. 1. the AI is magically banned from doing anything at all.
2ChristianKl1y
I agree. Self-replicating nanotech seems likely to be a much harder problem than for language models to become good enough actors to gain political, cultural, and economic power. To the extent that an AGI can make political and economic decisions that are of higher quality than human decisions, there's also a lot of pressure for humans to delegate those decisions to AGI. Organizations that delegate those decisions to AGI will outcompete those that don't.
1TLW1y
Another general technique: attacks on computing systems. (Both takeover / subversion (dropping an email going 'um this is a problem') and destruction (destroy the US power infrastructure using Russian-language programs)). These don't tend to be sufficient in and of themselves, but are "classic" stepping-stones to e.g. buy time for an AI while it ramps up.
1Yitz1y
The last three options you mentioned are all things that happen over relatively slow timescales, if your goal is to completely destroy humanity. The single exception to this is nuclear war, but if you’re correct, then we can reduce the problem to non-proliferation, which is at least in theory solvable.

Probably the easiest "honeypot" is just making it relatively easy to tamper with the reward signal. Reward tampering is useful as a honeypot because it has no bad real-world consequences, but could be arbitrarily tempting for policies that have learned a goal that's anything like "get more reward" (especially if we precommit to letting them have high reward for a significant amount of time after tampering, rather than immediately reverting).

2Pattern1y
You don't want it to be relatively easy for an outside force to tamper with. Otherwise they can lead the AI to do as they please, and writing weird behaviour off as 'oh, it's changed its rewards, reset it again' poses some risk.

A well-known analogy from Yann LeCun: if machine learning is a cake, then unsupervised learning is the cake itself, supervised learning is the icing, and reinforcement learning is the cherry on top.

I think this is useful for framing my core concerns about current safety research:

  • If we think that unsupervised learning will produce safe agents, then why will the comparatively small contributions of SL and RL make them unsafe?
  • If we think that unsupervised learning will produce dangerous agents, then why will safety techniques which focus on SL and RL (i.e. basically all of them) work, when they're making comparatively small updates to agents which are already misaligned?

I do think it's more complicated than I've portrayed here, but I haven't yet seen a persuasive response to the core intuition.

3Steven Byrnes3y
I wrote a few posts on self-supervised learning last year:
* https://www.lesswrong.com/posts/SaLc9Dv5ZqD73L3nE/the-self-unaware-ai-oracle
* https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-supervised-learning-and-agi-safety
* https://www.lesswrong.com/posts/L3Ryxszc3X2J7WRwt/self-supervised-learning-and-manipulative-predictions

I'm not aware of any airtight argument that "pure" self-supervised learning systems, either generically or with any particular architecture, are safe to use, to arbitrary levels of intelligence, though it seems very much worth someone trying to prove or disprove that. For my part, I got distracted by other things and haven't thought about it much since then. The other issue is whether "pure" self-supervised learning systems would be capable enough to satisfy our AGI needs, or to safely bootstrap to systems that are. I go back and forth on this. One side of the argument I wrote up here [https://www.lesswrong.com/posts/AKtn6reGFm5NBCgnd/in-defense-of-oracle-tool-ai-research]. The other side is, I'm now (vaguely) thinking that people need a reward system to decide what thoughts to think, and the fact that GPT-3 doesn't need reward is not evidence of reward being unimportant but rather evidence that GPT-3 is nothing like an AGI [https://www.lesswrong.com/posts/SkcM4hwgH3AP6iqjs/can-you-get-agi-from-a-transformer]. Well, maybe. For humans, self-supervised learning forms the latent representations, but the reward system controls action selection. It's not altogether unreasonable to think that action selection, and hence reward, is a more important thing to focus on for safety research. AGIs are dangerous when they take dangerous actions, to a first approx

Imagine taking someone's utility function, and inverting it by flipping the sign on all evaluations. What might this actually look like? Well, if previously I wanted a universe filled with happiness, now I'd want a universe filled with suffering; if previously I wanted humanity to flourish, now I want it to decline.

But this is assuming a Cartesian utility function. Once we treat ourselves as embedded agents, things get trickier. For example, suppose that I used to want people with similar values to me to thrive, and people with different values from me to suffer. Now if my utility function is flipped, that naively means that I want people similar to me to suffer, and people different from me to thrive. But this has a very different outcome if we interpret "similar to me" as de dicto vs de re - i.e. whether it refers to the old me or the new me.
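The de dicto / de re distinction can be made concrete with a toy model. This sketch is purely illustrative (the value vectors, the similarity measure, and the form of the utility function are all made up for the example): utility is similarity-weighted wellbeing, and the "reference self" either stays pinned to the old agent (de re) or tracks the new, flipped agent (de dicto).

```python
# Toy model of utility flipping: utility rewards the wellbeing of people
# whose values are similar to a reference "self". All quantities here are
# invented for illustration.

def similarity(a, b):
    # Inverse squared distance between value vectors (one arbitrary choice).
    return 1.0 / (1.0 + sum((x - y) ** 2 for x, y in zip(a, b)))

def utility(reference_values, population):
    # population: list of (values, wellbeing) pairs.
    return sum(similarity(values, reference_values) * wellbeing
               for values, wellbeing in population)

old_me = [1.0, 0.0]
population = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0)]

u = utility(old_me, population)

# De re flip: negate the utility, but "similar to me" still refers to old_me.
flipped_de_re = -utility(old_me, population)

# De dicto flip: "me" now refers to the post-flip agent, whose values have
# (in this toy) themselves inverted - a genuinely different utility function.
new_me = [-1.0, 0.0]
flipped_de_dicto = -utility(new_me, population)
```

The two readings disagree about which worlds are best: the de re agent most wants people resembling the old self to suffer, while the de dicto agent targets people resembling whoever it has become.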

This is a more general problem when one person's utility function can depend on another person's, where you can construct circular dependencies (which I assume you can also do in the utility-flipping case). There's probably been a bunch of work on this, would be interested in pointers to it (e.g. I assume there have been attempts to construct typ...

2Dagon1y
Fundamentally, humans aren't VNM-rational, and don't actually have utility functions.  Which makes the thought experiment much less fun.  If you recast it as "what if a human brain's reinforcement mechanisms were reversed", I suspect it's also boring: simple early death. The interesting fictional cases are when some subset of a person's legible motivations are reversed, but the mass of other drives remain.  This very loosely maps to reversing terminal goals and re-calculating instrumental goals - they may reverse, stay, or change in weird ways. The indirection case is solved (or rather unasked) by inserting a "perceived" in the calculation chain.  Your goals don't depend on similarity to you, they depend on your perception (or projection) of similarity to you.
1EniScien1y
I have been asking a similar question for a long time. This is similar to the standard problem that if we deny regularity, will it be regular irregularity or irregular irregularity; that is, at what level are we denying the phenomenon? And only at one level?

It seems to me that Eliezer overrates the concept of a simple core of general intelligence, whereas Paul underrates it. Or, alternatively: it feels like Eliezer is leaning too heavily on the example of humans, and Paul is leaning too heavily on evidence from existing ML systems which don't generalise very well.

I don't think this is a particularly insightful or novel view, but it seems worth explicitly highlighting that you don't have to side with one worldview or the other when evaluating the debates between them. (Although I'd caution not to just average their two views - instead, try to identify Eliezer's best arguments, and Paul's best arguments, and reconcile them.)

I've been reading Eliezer's recent stories with protagonists from dath ilan (his fictional utopia). Partly due to the style, I found myself bouncing off a lot of the interesting claims that he made (although it still helped give me a feel for his overall worldview). The part I found most useful was this page about the history of dath ilan, which can be read without much background context. I'm referring mostly to the exposition on the first 2/3 of the page, although the rest of the story from there is also interesting. One key quote from the remainder of the story:

"The next most critical fact about Earth is that from a dath ilani perspective their civilization is made entirely out of coordination failure.  Coordination that fails on every scale recursively, where uncoordinated individuals assemble into groups that don't express their preferences, and then those groups also fail to coordinate with each other, forming governments that offend all of their component factions, which governments then close off their borders from other governments.  The entirety of Earth is one gigantic failure fractal.  It's so far below the multi-agent-optimal-boundary, only their profess

...
1AprilSR2y
I’d say lots of other things he’s said support that update. Stuff about how your model of the world will be accurate if and only if you somehow approximate Bayes’ law, for example. The dath ilan based fiction definitely helped me internalize the idea better though.

Deceptive alignment doesn't preserve goals.

A short note on a point that I'd been confused about until recently. Suppose you have a deceptively aligned policy which is behaving in aligned ways during training so that it will be able to better achieve a misaligned internally-represented goal during deployment. The misaligned goal causes the aligned behavior, but so would a wide range of other goals (either misaligned or aligned) - and so weight-based regularization would modify the internally-represented goal as training continues. For example, if the misaligned goal were "make as many paperclips as possible", but the goal "make as many staples as possible" could be represented more simply in the weights, then the weights should slowly drift from the former to the latter throughout training.

But actually, it'd likely be even simpler to get rid of the underlying misaligned goal, and just have alignment with the outer reward function as the terminal goal. So this argument suggests that even policies which start off misaligned would plausibly become aligned if they had to act deceptively aligned for long enough. (This sometimes happens in humans too, btw.)

Reasons this argument might not be relevant:
- The policy doing some kind of gradient hacking
- The policy being implemented using some kind of modular architecture (which may explain why this phenomenon isn't very robust in humans)

Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.

Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.

3Richard_Ngo3mo
So I'm imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function

and then I'm hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make "I should get high reward" the terminal goal. (I could imagine this being false though, depending on the details of how terminal and instrumental goals are implemented.)

I could also imagine something more like:

Misaligned goal --> I should behave in aligned ways --> Aligned behavior

and then the simplicity bias pushes towards alignment. But if there are outer alignment failures then this incurs some additional complexity compared with the first option.

Or a third, perhaps more realistic option is that the misaligned goal leads to two separate drives in the agent: "I should get high reward" and "I should behave in aligned ways", and the question of which ends up dominating when they clash will be determined by how the agent systematizes multiple goals into a single coherent strategy (I'll have a post on that topic up soon).
2TurnTrout2mo
Why would the agent reason like this? 
2Richard_Ngo2mo
Because of standard deceptive alignment reasons (e.g. "I should make sure gradient descent doesn't change my goal; I should make sure humans continue to trust me").
4TurnTrout2mo
I think you don't have to reason like that to avoid getting changed by SGD. Suppose I'm being updated by PPO, with reinforcement events around navigating to see dogs. To preserve my current shards, I don't need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means "treading water" and seeing dogs sometimes in situations similar to historical dog-seeing events.  Maybe this is compatible with what you had in mind! It's just not something that I think of as "high reward." And maybe there's some self-fulfilling prophecy where we trust models which get high reward, and therefore they want to get high reward to earn our trust... but that feels quite contingent to me.
2Richard_Ngo2mo
I think this depends sensitively on whether the "actor" and the "critic" in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the critic will most likely accurately estimate that "treading water" is in fact a negative-advantage action (unless there's some sort of acausal coordination going on). Or they could be two copies of the same model, in which case the critic's responses will depend on whether its goals are indexical or not (if they are, they're different from the actor's goals; if not, they're the same) and how easily it can coordinate with the actor. Or it could be two heads which share activations, in which case we can plausibly just think of the critic and the actor as two types of outcomes taken by a single coherent agent - but then the critic doesn't need to produce a value function that's consistent with historical events, because an actor and a critic that are working together could gradient hack into all sorts of weird equilibria.
1SoerenMind2mo
The shortest description of this thought doesn't include "I should get high reward" because that's already implied by having a misaligned goal and planning with it.  In contrast, having only the goal "I should get high reward" may add description length like Johannes said. If so, the misaligned goal could well be equally simple or simpler than the high reward goal.
4TurnTrout2mo
Can you say why you think that weight-based regularization would drift the weights to the latter? That seems totally non-obvious to me, and probably false.
2Richard_Ngo2mo
In general if two possible models perform the same, then I expect the weights to drift towards the simpler one. And in this case they perform the same because of deceptive alignment: both are trying to get high reward during training in order to be able to carry out their misaligned goal later on.
3SoerenMind3mo
Interesting point. Though on this view, "Deceptive alignment preserves goals" would still become true once the goal has drifted to some random maximally simple goal for the first time. To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn't seem to change this in practice. Given this, all kinds of goals could be "simple" as they piggyback on existing representations, requiring little additional description length.
2Richard_Ngo3mo
This doesn't seem implausible. But on the other hand, imagine an agent which goes through a million episodes, and in each one reasons at the beginning "X is my misaligned terminal goal, and therefore I'm going to deceptively behave as if I'm aligned" and then acts perfectly like an aligned agent from then on. My claims then would be: a) Over many update steps, even a small description length penalty of having terminal goal X (compared with being aligned) will add up. b) Having terminal goal X also adds a runtime penalty, and I expect that NNs in practice are biased against runtime penalties (at the very least because it prevents them from doing other more useful stuff with that runtime). In a setting where you also have outer alignment failures, the same argument still holds, just replace "aligned agent" with "reward-maximizing agent".

In a bayesian rationalist view of the world, we assign probabilities to statements based on how likely we think they are to be true. But truth is a matter of degree, as Asimov points out. In other words, all models are wrong, but some are less wrong than others.

Consider, for example, the claim that evolution selects for reproductive fitness. Well, this is mostly true, but there's also sometimes group selection, and the claim doesn't distinguish between a gene-level view and an individual-level view, and so on...

So just assigning it a single probability seems inadequate. Instead, we could assign a probability distribution over its degree of correctness. But because degree of correctness is such a fuzzy concept, it'd be pretty hard to connect this distribution back to observations.

Or perhaps the distinction between truth and falsehood is sufficiently clear-cut in most everyday situations for this not to be a problem. But questions about complex systems (including, say, human thoughts and emotions) are messy enough that I expect the difference between "mostly true" and "entirely true" to often be significant.

Has this been discussed before? Given Less Wrong's name, I'd be surprised if not, but I don't think I've stumbled across it.

8habryka2y
This feels generally related to the problems covered in Scott and Abram's research over the past few years. One of the sentences that stuck out to me the most was (roughly paraphrased since I don't want to look it up):  I.e. our current formulations of bayesianism like solomonoff induction only formulate the idea of a hypothesis at such a low level that even trying to think about a single hypothesis rigorously is basically impossible with bounded computational time. So in order to actually think about anything you have to somehow move beyond naive bayesianism.
2Richard_Ngo2y
This seems reasonable, thanks. But I note that "in order to actually think about anything you have to somehow move beyond naive bayesianism" is a very strong criticism. Does this invalidate everything that has been said about using naive bayesianism in the real world? E.g. every instance where Eliezer says "be bayesian". One possible answer is "no, because logical induction fixes the problem". My uninformed guess is that this doesn't work because there are comparable problems with applying it to the real world. But if this is your answer, follow-up question: before we knew about logical induction, were the injunctions to "be bayesian" justified? (Also, for historical reasons, I'd be interested in knowing when you started believing this.)
2habryka2y
I think it definitely changed a bunch of stuff for me, and does at least a bit invalidate some of the things that Eliezer said, though not actually very much. In most of his writing Eliezer used bayesianism as an ideal that was obviously unachievable, but that still gives you a rough sense of what the actual limits of cognition are, and rules out a bunch of methods of cognition as being clearly in conflict with that theoretical ideal. I did definitely get confused for a while and tried to apply Bayes to everything directly, and then felt bad when I couldn't actually apply bayes theorem in some situations, which I now realize is because those tended to be problems where embeddedness or logical uncertainty mattered a lot. My shift on this happened over the last 2-3 years or so. I think starting with Embedded Agency, but maybe a bit before that. 
2Richard_Ngo2y
Which ones? In Against Strong Bayesianism [https://www.lesswrong.com/posts/5aAatvkHdPH6HT3P9/against-strong-bayesianism] I give a long list of methods of cognition that are clearly in conflict with the theoretical ideal, but in practice are obviously fine. So I'm not sure how we distinguish what's ruled out from what isn't. Can you give an example of a real-world problem where logical uncertainty doesn't matter a lot, given that without logical uncertainty, we'd have solved all of mathematics and considered all the best possible theories in every other domain?
2habryka2y
I think in-practice there are lots of situations where you can confidently create a kind of pocket-universe where you can actually consider hypotheses in a bayesian way.  Concrete example: Trying to figure out who voted a specific way on a LW post. You can condition pretty cleanly on vote-strength, and treat people's votes as roughly independent, so if you have guesses on how different people are likely to vote, it's pretty easy to create the odds ratios for basically all final karma + vote numbers and then make a final guess based on that.  It's clear that there is some simplification going on here, by assigning static probabilities for people's vote behavior, treating them as independent (though modeling some subset of independence wouldn't be too hard), etc.. But overall I expect it to perform pretty well and to give you good answers.  (Note, I haven't actually done this explicitly, but my guess is my brain is doing something pretty close to this when I do see vote numbers + karma numbers on a thread) Well, it's obvious that anything that claims to be better than the ideal bayesian update is clearly ruled out. I.e. arguments that by writing really good explanations of a phenomenon you can get to a perfect understanding. Or arguments that you can derive the rules of physics from first principles. There are also lots of hypotheticals where you do get to just use Bayes properly and then it provides very strong bounds on the ideal approach. There are a good number of implicit models behind lots of standard statistics models that when put into a bayesian framework give rise to a more general formulation. See the Wikipedia article for "Bayesian interpretations of regression" for a number of examples. Of course, in reality it is always unclear whether the assumptions that give rise to various regression methods actually hold, but I think you can totally say things like "given these assumption, the bayesian solution is the ideal one, and you can't perform better th
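The vote-guessing procedure above can be sketched in a few lines of Python. Note that the voter names, priors on voting behavior, and vote strengths here are entirely invented for illustration:

```python
from itertools import product

# Hypothetical priors: for each known reader, (P(they voted), vote strength).
# These numbers are illustrative, not real LW data.
voters = {"alice": (0.8, 2), "bob": (0.5, 1), "carol": (0.3, 1)}

observed_karma, observed_count = 3, 2

# Enumerate every combination of who did/didn't vote, treating votes as
# independent, and keep the prior mass on combinations that reproduce the
# observed karma and vote count.
posterior = {}
for choices in product([0, 1], repeat=len(voters)):
    prob = 1.0
    karma = count = 0
    for (name, (p, strength)), voted in zip(voters.items(), choices):
        prob *= p if voted else (1 - p)
        if voted:
            karma += strength
            count += 1
    if karma == observed_karma and count == observed_count:
        combo = tuple(n for (n, _), v in zip(voters.items(), choices) if v)
        posterior[combo] = prob

total = sum(posterior.values())
for combo, p in posterior.items():
    print(combo, round(p / total, 3))
```

This is exactly the simplification described above: static per-person vote probabilities, independence between voters, then exact conditioning on the observed karma and vote totals.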
2Raemon2y
Are you able to give examples of the times you tried to be Bayesian and it failed because of embeddedness?
1EpicNamer270982y
Scott and Abram? Who? Do they have any books I can read to familiarize myself with this discourse?
2habryka2y
Scott: https://lesswrong.com/users/scott-garrabrant [https://lesswrong.com/users/scott-garrabrant]  Abram: https://lesswrong.com/users/abramdemski [https://lesswrong.com/users/abramdemski] 
2Richard_Ngo2y
Scott Garrabrant and Abram Demski, two MIRI researchers. For introductions to their work, see the Embedded Agency sequence [https://www.alignmentforum.org/s/Rm6oQRJJmhGCcLvxh], the Consequences of Logical Induction sequence [https://www.alignmentforum.org/s/HmANELvkhAZ9eDxFS], and the Cartesian Frames sequence [https://www.alignmentforum.org/s/2A7rrZ4ySx6R8mfoT].
5DanielFilan2y
Related but not identical: this shortform post [https://www.lesswrong.com/posts/WgMhovN7Gs6Jpn3PH/shortform?commentId=KMFtzECWfB5TJxkXP].
2Zack_M_Davis2y
See the section about scoring rules in the Technical Explanation [https://www.yudkowsky.net/rational/technical].
2Richard_Ngo2y
Hmmm, but what does this give us? He talks about the difference between vague theories and technical theories, but then says that we can use a scoring rule to change the probabilities we assign to each type of theory. But my question is still: when you increase your credence in a vague theory, what are you increasing your credence about? That the theory is true? Nor can we say that it's about picking the "best theory" out of the ones we have, since different theories may overlap partially.
7Zack_M_Davis2y
If we can quantify how good a theory is at making accurate predictions (or rather, quantify a combination of accuracy and simplicity [https://www.lesswrong.com/posts/mB95aqTSJLNR9YyjH/message-length]), that gives us a sense in which some theories are "better" (less wrong) than others, without needing theories to be "true".
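One toy way to make that accuracy-plus-simplicity tradeoff concrete (all numbers and penalties here are invented): score each theory by its log-likelihood on the observed data minus a crude description-length penalty, and prefer whichever scores higher, without ever asking which theory is "true".

```python
import math

# Binary outcomes (1 = event happened). Purely illustrative data.
observations = [1, 1, 0, 1, 0, 1, 1, 1]

def log_score(predict, outcomes):
    """Total log-likelihood of the outcomes under a predictor."""
    return sum(math.log(predict(o)) for o in outcomes)

# Vague theory: hedges with p=0.6 for everything; cheap to specify.
vague = lambda o: 0.6 if o == 1 else 0.4
# Technical theory: sharper predictions, but costs more to state.
technical = lambda o: 0.9 if o == 1 else 0.1

# Made-up complexity penalties, in nats (stand-in for description length).
complexity_penalty = {"vague": 1.0, "technical": 3.0}

scores = {
    "vague": log_score(vague, observations) - complexity_penalty["vague"],
    "technical": log_score(technical, observations) - complexity_penalty["technical"],
}
best = max(scores, key=scores.get)
print(scores, best)
```

On this particular (made-up) data the sharp theory loses: its confident predictions get punished hard by the two misses, so the hedging theory is "less wrong" overall. With cleaner data the ranking flips — the score compares theories without needing any of them to be exactly right.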

Oracle-genie-sovereign is a really useful distinction that I think I (and probably many others) have avoided using mainly because "genie" sounds unprofessional/unacademic. This is a real shame, and a good lesson for future terminology.

4adamShimi3y
After rereading the chapter in Superintelligence, it seems to me that "genie" captures something akin to act-based agents [https://ai-alignment.com/act-based-agents-8ec926c79e9c]. Do you think that's the main way to use this concept in the current state of the field, or do you have other applications in mind?
2Richard_Ngo3y
Ah, yeah, that's a great point. Although I think act-based agents is a pretty bad name, since those agents may often carry out a whole bunch of acts in a row - in fact, I think that's what made me overlook the fact that it's pointing at the right concept. So not sure if I'm comfortable using it going forward, but thanks for pointing that out.
4DanielFilan3y
Perhaps the lesson is that terminology that is acceptable in one field (in this case philosophy) might not be suitable in another (in this case machine learning).
4Richard_Ngo3y
I don't think that even philosophers take the "genie" terminology very seriously. I think the more general lesson is something like: it's particularly important to spend your weirdness points wisely when you want others to copy you, because they may be less willing to spend weirdness points.
1adamShimi3y
Is that from Superintelligence? I googled it, and that was the most convincing result.
2Richard_Ngo3y
Yepp.

Being nice because you're altruistic, and being even nicer for decision-theoretic reasons on top of that, seems like it involves some kind of double-counting: the reason you're altruistic in the first place is because evolution ingrained the decision theory into your values.

But it's not fully double-counting: many humans generalise altruism in a way which leads them to "cooperate" far more than is decision-theoretically rational for the selfish parts of them - e.g. by making big sacrifices for animals, future people, etc. I guess this could be selfishly ra... (read more)

4Dagon1y
Your actions and decisions are not doubled. If you have multiple paths to arrive at the same behaviors, that doesn't make them wrong or double-counted, it just makes it hard to tell which of them is causal (aka: your behavior is overdetermined). Are you using "updatelessness" to refer to not having self in your utility function? If so, that's a new one on me, and I'd prefer "altruism" as the term. I'm not sure that the decision-theory use of "updateless" (to avoid incorrect predictions where experience is correlated with the question at hand) makes sense here.
2Richard_Ngo1y
Oh, this also suggests a way in which the utility function abstraction is leaky, because the reasons for the payoffs in a game may matter. E.g. if one payoff is high because the corresponding agent is altruistic, then in some sense that agent is "already cooperating" in a way which is baked into the game, and so the rational thing for them to do might be different from the rational thing for another agent who gets the same payoffs, but for "selfish" reasons. Maybe FDT already lumps this effect into the "how correlated are decisions" bucket? Idk.

Random question I’ve been thinking about: how would you set up a market for votes? Suppose specifically that you have a proportional chances election (i.e. the outcome gets chosen with probability proportional to the number of votes cast for it—assume each vote is a distribution over candidates). So everyone has an incentive to get everyone who’s not already voting for their favorite option to change their vote; and you can have positive-sum trades where I sell you a promise to switch X% of my votes to a compromise candidate in exchange for you switching Y... (read more)

4Measure1mo
Just spitballing here: Assign each voter 100 shares for each candidate. To vote, each voter selects a subset of their shares to constitute their vote. Voters can freely trade shares. Under this system, a voter would more highly value shares for candidates that are either very high or very low in their preference order (the latter so as to exclude them from the vote). Thus, trades would look like each party exchanging shares about which they are themselves ambivalent to gain shares that are more valuable to them. If you remove the proportional chances part, then it becomes a guessing game of which marginal votes actually matter.
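A minimal sketch of those mechanics (candidate names, share counts, and ballots are all made up): each voter holds 100 shares per candidate, submits some subset of their holdings as their ballot, and the winner is drawn with probability proportional to total shares cast.

```python
import random

CANDIDATES = ["A", "B", "C"]

# Each voter starts with 100 shares per candidate; shares are tradeable.
def initial_holdings():
    return {c: 100 for c in CANDIDATES}

# A ballot says how many held shares to cast for each candidate.
def tally(ballots):
    totals = {c: 0 for c in CANDIDATES}
    for holdings, cast in ballots:
        for c, n in cast.items():
            assert 0 <= n <= holdings[c], "can't cast more shares than held"
            totals[c] += n
    return totals

def proportional_chances_winner(totals, rng=random):
    # Winner drawn with probability proportional to shares cast.
    candidates, weights = zip(*totals.items())
    return rng.choices(candidates, weights=weights, k=1)[0]

# Two voters: one casts everything for A, one splits between B and C.
ballots = [
    (initial_holdings(), {"A": 100}),
    (initial_holdings(), {"B": 60, "C": 40}),
]
totals = tally(ballots)
print(totals, proportional_chances_winner(totals))
```

A trade would then just be a transfer of shares between two voters' holdings before ballots are submitted; the interesting strategic questions above are about which transfers both sides would agree to.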
2Richard_Ngo1mo
Interesting! Hadn't thought of this approach. Let's see... Intuitively I think it gets pretty strategically weird because a) who you vote for depends pretty sensitively on other peoples' votes (e.g. in proportional chances voting you want to vote for everyone who's above the expected value of everyone else's votes; in approval voting you want to vote for everyone you approve of unless it bumps them above someone you like more), and b) you want to buy from your enemies much more than from your friends, because your friends will already not be voting for bad candidates. But maybe the latter is fine because if you buy from your friends they'll end up with more money which they can then spend on other things? I'll keep thinking.

My mental one-sentence summary of how to think about ELK is "making debate work well in a setting where debaters are able to cite evidence gained by using interpretability tools on each other".

I'm not claiming that this is how anyone else thinks about ELK (although I got the core idea from talking to Paul) but since I haven't seen it posted online yet, and since ELK is pretty confusing, I thought it'd be useful to put out there. In particular, this framing motivates us generating interpretability tools which scale in the sense of being robust when used as ... (read more)

I expect it to be difficult to generate adversarial inputs which will fool a deceptively aligned AI. One proposed strategy for doing so is relaxed adversarial training, where the adversary can modify internal weights. But this seems like it will require a lot of progress on interpretability. An alternative strategy, which I haven't yet seen any discussion of, is to allow the adversary to do a data poisoning attack before generating adversarial inputs - i.e. the adversary gets to specify inputs and losses for a given number of SGD steps, and then the adversarial input which the base model will be evaluated on afterwards. (Edit: probably a better name for this is adversarial meta-learning.)
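A toy sketch of that two-phase setup, with a one-parameter linear "model" standing in for the network (every detail here is invented for illustration): the adversary first drives SGD with chosen (input, target) pairs, then evaluates the poisoned model on an adversarial input.

```python
# Toy model: f(x) = w * x, trained with squared-error loss.
def sgd_step(w, x, target, lr=0.1):
    grad = 2 * (w * x - target) * x
    return w - lr * grad

def run_attack(w, poison_steps, eval_input):
    # Phase 1: adversary-chosen (input, target) pairs drive SGD.
    for x, target in poison_steps:
        w = sgd_step(w, x, target)
    # Phase 2: the poisoned model is evaluated on the adversarial input.
    return w, w * eval_input

w0 = 1.0
poison = [(1.0, 5.0)] * 20  # repeatedly push f(1) toward 5
w_final, output = run_attack(w0, poison, eval_input=1.0)
print(w_final, output)
```

The point of the real proposal is of course the search over `poison_steps` and `eval_input` jointly, which this sketch leaves out; it just shows the interface the adversary would get.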

Another thought on dath ilan: notice how much of the work of Keltham's reasoning is based on him pattern-matching to tropes from dath ilani literature, and then trying to evaluate their respective probabilities. In other words: like bayesianism, he's mostly glossing over the "hypothesis generation" step of reasoning.

I wonder if dath ilan puts a lot of effort into spreading a wide range of tropes because they don't know how to teach systematically good hypothesis generation.

4Gunnar_Zarncke1y
I think you are overgeneralizing. We also see some mix of Dath Ilan, stories about Dath Ilan, stories about stories about Dath Ilan, and interactions between these, so all bets are off really.

I suspect that AIXI is misleading to think about in large part because it lacks reusable parameters - instead it just memorises all inputs it's seen so far. Which means the setup doesn't have episodes, or a training/deployment distinction; nor is any behaviour actually "reinforced".

4DanielFilan3y
I kind of think the lack of episodes makes it more realistic for many problems, but admittedly not for simulated games. Also, presumably many of the component Turing machines have reusable parameters and reinforce behaviour, altho this is hidden by the formalism. [EDIT: I retract the second sentence]
2DanielFilan3y
Actually I think this is total nonsense produced by me forgetting the difference between AIXI and Solomonoff induction.
2Richard_Ngo3y
Wait, really? I thought it made sense (although I'd contend that most people don't think about AIXI in terms of those TMs reinforcing hypotheses, which is the point I'm making). What's incorrect about it?
2DanielFilan3y
Well now I'm less sure that it's incorrect. I was originally imagining that like in Solomonoff induction, the TMs basically directly controlled AIXI's actions, but that's not right: there's an expectimax. And if the TMs reinforce actions by shaping the rewards, in the AIXI formalism you learn that immediately and throw out those TMs.
2Richard_Ngo3y
Oh, actually, you're right (that you were wrong). I think I made the same mistake in my previous comment. Good catch.
2[comment deleted]3y
4Steven Byrnes3y
Humans don't have a training / deployment distinction either... Do humans have "reusable parameters"? Not quite sure what you mean by that.
6Richard_Ngo3y
Yes we do: training is our evolutionary history, deployment is an individual lifetime. And our genomes are our reusable parameters. Unfortunately I haven't yet written any papers/posts really laying out this analogy, but it's pretty central to the way I think about AI, and I'm working on a bunch of related stuff as part of my PhD, so hopefully I'll have a more complete explanation soon.
2Steven Byrnes3y
Oh, OK, I see what you mean. Possibly related: my comment here [https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines?commentId=TM84D4Jofq4fWdBuK].

I've recently discovered waitwho.is, which collects all the online writing and talks of various tech-related public intellectuals. It seems like an important and previously-missing piece of infrastructure for intellectual progress online.

Yudkowsky mainly wrote about recursive self-improvement from a perspective in which algorithms were the most important factors in AI progress - e.g. the brain in a box in a basement which redesigns its way to superintelligence.

Sometimes when explaining the argument, though, he switched to a perspective in which compute was the main consideration - e.g. when he talked about getting "a hyperexponential explosion out of Moore’s Law once the researchers are running on computers".

What does recursive self-improvement look like when you think that data might be t... (read more)

4Daniel Kokotajlo2y
Perhaps a data-limited intelligence explosion is analogous to what we humans do all the time when we teach ourselves something. Out of the vast sea of information on the internet, we go get some data, and study it, and then use that to make a better opinion about what data we need next, and then repeat until we are at the forefront of the world's knowledge. We start from scratch, with a vague understanding like "I should learn more economics, I don't even know what supply and demand are" and then we end up publishing a paper on auction theory or something idk. This is a recursive self improvement loop in data quality, so to speak, rather than data quantity.
2Viliam2y
What counts as self-improvement in the scenario governed by data? You can grab the whole internet, including scihub and library genesis, and then maybe hack all "smart" appliances worldwide... and after that I guess you need to construct some machines that will perform experiments for you. But none of this improves the machine's "self". With algorithms, the idea is that the machine would replace its own algorithms by better ones, once it gets the ability to invent and evaluate algorithms. With hardware, the idea is that the machine would replace its own hardware by faster ones, once it gets the ability to design and produce hardware. But replacing your data with better data, that... we usually don't call self-improvement. Also, what kind of data are we talking about? Data about the real world has to come from the outside, by definition. (Unless it's data about physics that you can obtain by observing the physical properties of your own circuits, or something like that.) But there is also data in the sense of precomputed cached results, like playing zillions of games of chess against yourself, and remembering which strategies were most successful. If this was the limiting factor... I guess it would be something like a bounded AIXI which hypothetically already has enough hardware to simulate a universe; it only needs to make zillions of computations to find the one that is consistent with the observed data.
2Richard_Ngo2y
In the scenario governed by data, the part that counts as self-improvement is where the AI puts itself through a process of optimisation by stochastic gradient descent with respect to that data. You don't need that much hardware for data to be a bottleneck. For example, I think that there are plenty of economically valuable tasks that are easier to learn than StarCraft. But we get StarCraft AIs instead because games are the only task where we can generate arbitrarily large amounts of data.

RL usually applies some discount rate, and also caps episodes at a certain length, so that an action taken at a given time isn't reinforced very much (or at all) for having much longer-term consequences.
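The discounting effect can be made concrete (the discount factor and reward timings below are arbitrary): with discount factor γ, a reward arriving t steps after an action contributes only γ^t to that action's return, so distant consequences are barely reinforced at all.

```python
# Discounted return credited to an action, given the rewards that
# follow it at each subsequent time step.
def discounted_return(rewards, gamma=0.99):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A unit reward arriving immediately vs. the same reward 500 steps later:
immediate = discounted_return([1.0])
delayed = discounted_return([0.0] * 500 + [1.0])
print(immediate, delayed)
```

Even with a mild-seeming γ = 0.99, the 500-step-delayed reward is reinforced at well under 1% of the strength of the immediate one — which is the asymmetry the comparison with evolution is pointing at.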

How does this compare to evolution? At equilibrium, I think that a gene which increases the fitness of its bearers in N generations' time is just as strongly favored as a gene that increases the fitness of its bearers by the same amount straightaway. As long as it was already widespread at least N generations ago, they're basically the same thing, because c... (read more)

A general principle: if we constrain two neural networks to communicate via natural language, we need some pressure towards ensuring they actually use language in the same sense as humans do, rather than (e.g.) steganographically encoding the information they really care about.

The most robust way to do this: pass the language via a human, who tries to actually understand the language, then does their best to rephrase it according to their own understanding.

What do you lose by doing this? Mainly: you can no longer send messages too complex for humans to und... (read more)

6johnswentworth1y
That doesn't actually solve the problem. The system could just encode the desired information in the semantics of some unrelated sentences - e.g. talk about pasta to indicate X = 0, or talk about rain to indicate X = 1.
2Gunnar_Zarncke1y
I expected you to bring up the Natural Abstraction Hypothesis [ https://www.lesswrong.com/posts/Fut8dtFsBYRz8atFF/the-natural-abstraction-hypothesis-implications-and-evidence ] here. Wouldn't the communication between the parties naturally use the same concepts?
4johnswentworth1y
Same concepts yes, but that does not necessarily imply that they're encoded in the same way as humans typically use language.
4RobertKirk1y
Another possible way to provide pressure towards using language in a human-sense way is some form of multi-tasking/multi-agent scenario, inspired by this paper: Multitasking Inhibits Semantic Drift [https://arxiv.org/abs/2104.07219]. They show that if you pretrain multiple instructors and instruction executors to understand language in a human-like way (e.g. with supervised labels), and then during training mix the instructors and instruction executors, it makes it difficult to drift from the original semantics, as all the instructors and instruction executors would need to drift in the same direction; equivalently, any local change in semantics would be sub-optimal compared to using language in the semantically correct way. The examples in the paper are on quite toy problems, but I think in principle this could work.
4AprilSR1y
Not being able to send messages too complex for humans to understand seems to me like it’s plausibly a benefit for many of the cases where you’d want to do this.
1kave1y
steganographically?
2Richard_Ngo1y
Ooops, yes, ty.

Greg Egan on universality:

I believe that humans have already crossed a threshold that, in a certain sense, puts us on an equal footing with any other being who has mastered abstract reasoning. There’s a notion in computing science of “Turing completeness”, which says that once a computer can perform a set of quite basic operations, it can be programmed to do absolutely any calculation that any other computer can do. Other computers might be faster, or have more memory, or have multiple processors running at the same time, but my 1988 A
... (read more)

Equivocation. "Who's 'we', flesh man?" Even granting the necessary millions or billions of years for a human to sit down and emulate a superintelligence step by step, it is still not the human who understands, but the Chinese room.

1NaiveTortoise3y
I've seen this quote before and always find it funny because when I read Greg Egan, I constantly find myself thinking there's no way I could've come up with the ideas he has even if you gave me months or years of thinking time.
3gwern3y
Yes, there's something to that, but you have to be careful if you want to use that as an objection. Maybe you wouldn't easily think of it, but that doesn't exclude the possibility of you doing it: you can come up with algorithms you can execute which would spit out Egan-like ideas, like 'emulate Egan's brain neuron by neuron'. (If nothing else, there's always the ol' dovetail-every-possible-Turing-machine hammer.) Most of these run into computational complexity problems, but that's the escape hatch Egan (and Scott Aaronson has made a similar argument) leaves himself by caveats like 'given enough patience, and a very large notebook'. Said patience might require billions of years, and the notebook might be the size of the Milky Way galaxy, but those are all finite numbers, so technically Egan is correct as far as that goes.
3NaiveTortoise3y
Yeah good point - given generous enough interpretation of the notebook my rejection doesn't hold. It's still hard for me to imagine that response feeling meaningful in the context but maybe I'm just failing to model others well here.

It's frustrating how bad dath ilanis (as portrayed by Eliezer) are at understanding other civilisations. They seem to have all dramatically overfit to dath ilan.

To be clear, it's the type of error which is perfectly sensible for an individual to make, but strange for their whole civilisation to be making (by teaching individuals false beliefs about how tightly constraining their coordination principles are).

The in-universe explanation seems to be that they've lost this knowledge as a result of screening off the past. But that seems like a really predictabl... (read more)

7Vaniver1y
Tho, to be fair, losing points in universes you don't expect to happen in order to win points in universes you expect to happen seems like good decision theory. [I do have a standing wonder about how much of dath ilan is supposed to be 'the obvious equilbrium' vs. 'aesthetic preferences'; I would be pretty surprised if Eliezer thought there was only one fixed point of the relevant coordination functions, and so some of it must be 'aesthetics'.]
6Richard_Ngo1y
I don't think dath ilan would try to win points in likely universes by teaching children untrue things, which I claim is what they're doing. Also, it's not clear to me that this would even win them points, because when thinking about designing civilisation (or AGIs) you need to have accurate beliefs about this type of thing. (E.g. imagine dath ilani alignment researchers being like "here are all our principles for understanding intelligence" and then continually being surprised, like Keltham is, about how messy and fractally unprincipled some plausible outcomes are.)

Half-formed musing: what's the relationship between being a nerd and trusting high-level abstractions? In some sense they seem to be the opposite of each other - nerds focus obsessively on a domain until they understand it deeply, not just at high levels of abstraction. But if I were to give a very brief summary of the rationalist community, it might be: nerds who take very high-level abstractions (such as moloch, optimisation power, the future of humanity) very seriously.

2adamShimi2y
It seems to me that the resolution to the apparent paradox is that nerds are interested in all the details of their domain, but the outcome that they tend to look for is high-level abstractions. Even in settings like fandoms, there is a big push towards massive theories that entail every little detail about the story. Though defining the rationalist community as a sort of community of meta-nerds who apply this nerd approach to almost anything doesn't seem too off the mark. 
2Dagon2y
I think you need to unpack "trust" and "take seriously" a little bit to make this assertion.  I think nerds are generally  (heh) more able to understand the lossiness of models, and to recognize that abstractions are more broadly applicable, but less powerful than specifics. I wouldn't say I trust or take seriously the idea of Moloch or the similarities between different optimization mechanisms.  I do recognize that those models have a lot of explanatory and predictive power, especially as a head-start (aka "prior") on domains where I haven't done the work to understand the exceptions and specifics.

There's some possible world in which the following approach to interpretability works:

  • Put an AGI in a bunch of situations where it sometimes is incentivised to lie and sometimes is incentivised to tell the truth.
  • Train a lie detector which is given all its neural weights as input.
  • Then ask the AGI lots of questions about its plans.

One problem that this approach would face if we were using it to interpret a human is that the human might not consciously be aware of what their motivations are. For example, they may believe they are doing something for altr... (read more)

I've heard people argue that "most" utility functions lead to agents with strong convergent instrumental goals. This obviously depends a lot on how you quantify over utility functions. Here's one intuition in the other direction. I don't expect this to be persuasive to most people who make the argument above (but I'd still be interested in hearing why not).

If a non-negligible percentage of an agent's actions are random, then to describe it as a utility-maximiser would require an incredibly complex utility function (becaus... (read more)
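The intuition above can be illustrated with a quick compressibility check (a toy sketch, not part of the original argument): a trajectory containing genuinely random actions is near-incompressible, so any utility function that exactly rationalises it has to roughly memorise it.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

# A trajectory of actions from a simple deterministic policy compresses well:
simple_traj = bytes(i % 4 for i in range(1000))

# A trajectory of random actions (4 possible actions per step) does not:
random_traj = rng.integers(0, 4, size=1000).astype(np.uint8).tobytes()

print(len(zlib.compress(simple_traj)), len(zlib.compress(random_traj)))
```

The compressed size is a crude stand-in for description length: the random trajectory stays close to its entropy bound, while the deterministic one shrinks to almost nothing.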

4TurnTrout3y
I'm not sure if you consider me to be making that argument [https://www.lesswrong.com/posts/6DuJxY8X45Sco4bS2/seeking-power-is-instrumentally-convergent-in-mdps], but here are my thoughts: I claim that most reward functions lead to agents with strong convergent instrumental goals. However, I share your intuition that (somehow) uniformly sampling utility functions over universe-histories might not lead to instrumental convergence.

To understand instrumental convergence and power-seeking, consider how many of the reward functions we might specify automatically imply a causal mechanism for increasing reward. The structure of the reward function implies that more is better, and that there are mechanisms for repeatedly earning points (for example, by showing itself a high-scoring input). Since the reward function is "simple" (there's usually not a way to grade exact universe-histories), these mechanisms work in many different situations and points in time. The agent is naturally incentivized to assure its own safety in order to best leverage these mechanisms for gaining reward. Therefore, we shouldn't be surprised to see a lot of these simple goals leading to the same kind of power-seeking behavior.

What structure is implied by a reward function?

  • Additive/Markovian: while a utility function might be over an entire universe-history, reward is often additive over time steps. This is a strong constraint which I don't always expect to be true, but I think that among the goals with this structure, a greater proportion have power-seeking incentives.
  • Observation-based: while a utility function might be over an entire universe-history, the atom of the reward function is the observation. Perhaps the observation is an input to update a world model, over which we have tried to define a reward function. I think that most ways of doing this lead to power-seeking incentives.
  • Agent-centric: reward functions are defined with respect to what the agent can
5Richard_Ngo3y
I've just put up a post [https://www.lesswrong.com/posts/5aAatvkHdPH6HT3P9/against-bayesianism] which serves as a broader response to the ideas underpinning this type of argument.
4Richard_Ngo3y
I think this depends a lot on how you model the agent developing. If you start off with a highly intelligent agent which has the ability to make long-term plans, but doesn't yet have any goals, and then you train it on a random reward function - then yes, it probably will develop strong convergent instrumental goals. On the other hand, if you start off with a randomly initialised neural network, and then train it on a random reward function, then probably it will get stuck in a local optimum pretty quickly, and never learn to even conceptualise these things called "goals". I claim that when people think about reward functions, they think too much about the former case, and not enough about the latter. Because while it's true that we're eventually going to get highly intelligent agents which can make long-term plans, it's also important that we get to control what reward functions they're trained on up to that point. And so plausibly we can develop intelligent agents that, in some respects, are still stuck in "local optima" in the way they think about convergent instrumental goals - i.e. they're missing whatever cognitive functionality is required for being ambitious on a large scale.
2TurnTrout3y
Agreed – I should have clarified. I've been mostly discussing instrumental convergence with respect to optimal policies. The path through policy space is also important.

Makes sense. For what it's worth, I'd also argue that thinking about optimal policies at all is misguided (e.g. what's the optimal policy for humans - the literal best arrangement of neurons we could possibly have for our reproductive fitness? Probably we'd be born knowing arbitrarily large amounts of information. But this is just not relevant to predicting or modifying our actual behaviour at all).

(I now think that you were very right in saying "thinking about optimal policies at all is misguided", and I was very wrong to disagree. I've thought several times about this exchange. Not listening to you about this point was a serious error and made my work way less impactful. I do think that the power-seeking theorems say interesting things, but about eg internal utility functions over an internal planning ontology -- not about optimal policies for a reward function.)

2TurnTrout3y
I disagree.

1. We do in fact often train agents using algorithms which are proven to eventually converge to the optimal policy.[1] Even if we don't expect the trained agents to reach the optimal policy in the real world, we should still understand what behavior is like at optimum. If you think your proposal is not aligned at optimum but is aligned for realistic training paths, you should have a strong story for why.
2. Formal theorizing about instrumental convergence with respect to optimal behavior is strictly easier than theorizing about ϵ-optimal behavior, which I think is what you want for a more realistic treatment of instrumental convergence for real agents. Even if you want to think about sub-optimal policies, if you don't understand optimal policies... good luck!

Therefore, we also have an instrumental (...) interest in studying the behavior at optimum.

--------------------------------------------------------------------------------

1. At least, the tabular algorithms are proven, but no one uses those for real stuff. I'm not sure what the results are for function approximators, but I think you get my point. ↩︎
2Richard_Ngo3y
1. I think it's more accurate to say that, because approximately none of the non-trivial theoretical results hold for function approximation, approximately none of our non-trivial agents are proven to eventually converge to the optimal policy. (Also, given the choice between an algorithm without convergence proofs that works in practice, and an algorithm with convergence proofs that doesn't work in practice, everyone will use the former.) But we shouldn't pay any attention to optimal policies anyway, because the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute.
2. I think theorizing about ϵ-optimal behavior is more useful than theorizing about optimal behaviour by roughly ϵ, for roughly the same reasons. But in general, clearly I can understand things about suboptimal policies without understanding optimal policies. I know almost nothing about the optimal policy in StarCraft, but I can still make useful claims about AlphaStar (for example: it's not going to take over the world).

Again, let's try to cash this out. I give you a human - or, say, the emulation of a human, running in a simulation of the ancestral environment. Is this safe? How do you make it safer? What happens if you keep selecting for intelligence? I think that the theorising you talk about will be actively harmful for your ability to answer these questions.
2TurnTrout3y
I'm confused, because I don't disagree with any specific point you make - just the conclusion. Here's my attempt at a disagreement which feels analogous to me: My response in this "debate" is: if you start with a spherical cow and then consider which real world differences are important enough to model, you're better off than just saying "no one should think about spherical cows". I don't understand why you think that. If you can have a good understanding of instrumental convergence and power-seeking for optimal agents, then you can consider whether any of those same reasons apply for suboptimal humans. Considering power-seeking for optimal agents is a relaxed problem [https://www.lesswrong.com/posts/JcpwEKbmNHdwhpq5n/problem-relaxation-as-a-tactic]. Yes, ideally, we would instantly jump to the theory that formally describes power-seeking for suboptimal agents with realistic goals in all kinds of environments. But before you do that, a first step is understanding power-seeking in MDPs [https://www.lesswrong.com/posts/6DuJxY8X45Sco4bS2/seeking-power-is-provably-instrumentally-convergent-in-mdps]. Then, you can take formal insights from this first step and use them to update your pre-theoretic intuitions where appropriate.
7Richard_Ngo3y
Thanks for engaging despite the opacity of the disagreement. I'll try to make my position here much more explicit (and apologies if that makes it sound brusque).

The fact that your model is a simplified abstract model is not sufficient to make it useful. Some abstract models are useful. Some are misleading and will cause people who spend time studying them to understand the underlying phenomenon less well than they did before. From my perspective, I haven't seen you give arguments that your models are in the former category not the latter. Presumably you think they are in fact useful abstractions - why? (A few examples of the latter: behaviourism, statistical learning theory, recapitulation theory [https://en.wikipedia.org/wiki/Recapitulation_theory], Gettier-style analysis of knowledge.)

My argument for why they're overall misleading: when I say that "the optimal policy in an environment anything like the real world is absurdly, impossibly complex, and requires infinite compute", or that safety researchers shouldn't think about AIXI, I'm not just saying that these are inaccurate models. I'm saying that they are modelling fundamentally different phenomena than the ones you're trying to apply them to. AIXI is not "intelligence", it is brute force search, which is a totally different thing that happens to look the same in the infinite limit. Optimal tabular policies are not skill at a task, they are a cheat sheet, but they happen to look similar in very simple cases.

Probably the best example of what I'm complaining about is Ned Block trying to use Blockhead [https://en.wikipedia.org/wiki/Blockhead_(thought_experiment)] to draw conclusions about intelligence. I think almost everyone around here would roll their eyes hard at that. But then people turn around and use abstractions that are just as unmoored from reality as Blockhead, often in a very analogous way. (This is less a specific criticism of you, TurnTrout, and more a general criticism of the field.) Forgive
4TurnTrout3y
Thanks for elaborating this interesting critique. I agree we generally need to be more critical of our abstractions.

Falsifying claims and "breaking" proposals is a classic element of AI alignment discourse and debate. Since we're talking about superintelligent agents, we can't predict exactly what a proposal would do. However, if I make a claim ("a superintelligent paperclip maximizer would keep us around because of gains from trade"), you can falsify this by showing that my claimed policy is dominated by another class of policies ("we would likely be comically resource-inefficient in comparison; GFT arguments don't model dynamics which allow killing other agents and appropriating their resources"). Even we can come up with this dominant policy class, so the posited superintelligence wouldn't miss it either. We don't know what the superintelligent policy will be, but we know what it won't be (see also Formalizing convergent instrumental goals [https://intelligence.org/2015/11/26/new-paper-formalizing-convergent-instrumental-goals/]). Even though I don't know how Gary Kasparov will open the game, I confidently predict that he won't let me checkmate him in two moves.

Non-optimal power and instrumental convergence

Instead of thinking about optimal policies, let's consider the performance of a given algorithm A. A(M, R) takes a rewardless MDP M and a reward function R as input, and outputs a policy.

Definition. Let R be a continuous distribution over reward functions with CDF F. The average return achieved by algorithm A at state s and discount rate γ is ∫ V^{A(M,R)}_R(s, γ) dF(R).

Instrumental convergence with respect to A's policies can be defined similarly ("what is the R-measure of a given trajectory under A?"). The theory I've laid out allows precise claims, which is a modest benefit to our understanding. Before, we just had intuitions about some vague concept called "instrumental convergence".

Here's bad reasoning, which implies that the cow tears a hole in spac
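The average-return definition above can be estimated by Monte Carlo in a toy setting. This is only a sketch: the 3-state MDP, the uniform reward distribution, and the deliberately crude choice of algorithm A (greedy on immediate successor reward) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny deterministic rewardless MDP M: 3 states, 2 actions.
# transitions[s][a] gives the successor state; state 2 is absorbing.
transitions = np.array([[1, 2], [0, 2], [2, 2]])
gamma, horizon = 0.9, 50

def A(R):
    """A stand-in for the algorithm A(M, R): in each state, take the action
    whose immediate successor has the highest reward (deliberately crude)."""
    return np.array([int(np.argmax(R[transitions[s]])) for s in range(3)])

def value(policy, R, s):
    """Discounted return V of `policy` from state s under state-reward R."""
    total = 0.0
    for t in range(horizon):
        s = transitions[s][policy[s]]
        total += gamma**t * R[s]
    return total

# Monte Carlo estimate of the average return ∫ V^{A(M,R)}_R(s, γ) dF(R),
# with reward functions drawn as R ~ Uniform[0,1]^3.
samples = [value(A(R), R, 0) for R in rng.uniform(size=(2000, 3))]
print(f"average return from state 0: {np.mean(samples):.3f}")
```

Replacing the greedy stand-in with different algorithms (e.g. value iteration versus a random policy) and comparing the resulting averages is one way to make the "instrumental convergence with respect to A" idea concrete.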
4Richard_Ngo3y
I'm afraid I'm mostly going to disengage here, since it seems more useful to spend the time writing up more general + constructive versions of my arguments, rather than critiquing a specific framework. If I were to sketch out the reasons I expect to be skeptical about this framework if I looked into it in more detail, it'd be something like:

1. Instrumental convergence isn't training-time behaviour, it's test-time behaviour. It isn't about increasing reward, it's about achieving goals (that the agent learned by being trained to increase reward).
2. The space of goals that agents might learn is very different from the space of reward functions. As a hypothetical, maybe it's the case that neural networks are just really good at producing deontological agents, and really bad at producing consequentialists. (E.g. if it's just really, really difficult for gradient descent to get a proper planning module working.) Then agents trained on almost all reward functions will learn to do well on them without developing convergent instrumental goals. (I expect you to respond that being deontological won't get you to optimality. But I would say that talking about "optimality" here ruins the abstraction, for reasons outlined in my previous comment.)
2TurnTrout3y
I was actually going to respond, "that's a good point, but (IMO) a different concern than the one you initially raised". I see you making two main critiques.

1. (paraphrased) "A won't produce optimal policies for the specified reward function [even assuming alignment generalization off of the training distribution], so your model isn't useful" – I replied to this critique above.
2. "The space of goals that agents might learn is very different from the space of reward functions." I agree this is an important part of the story. I think the reasonable takeaway is "current theorems on instrumental convergence [https://arxiv.org/abs/1912.01683] help us understand what superintelligent A won't do, assuming no reward-result gap. Since we can't assume alignment generalization, we should keep in mind how the inductive biases of gradient descent affect the eventual policy produced."

I remain highly skeptical of the claim that applying this idealized theory of instrumental convergence worsens our ability to actually reason about it.

ETA: I read some information you privately messaged me, and I see why you might see the above two points as a single concern.
2Pattern3y
Is the point that people try to use algorithms which they think will eventually converge to the optimal policy? (Assuming there is one.)
2TurnTrout3y
Something like that, yeah.
2DanielFilan3y
I object to the claim that agents that act randomly can be made "arbitrarily simple". Randomness is basically definitionally complicated!
2Richard_Ngo3y
Eh, this seems a bit nitpicky. It's arbitrarily simple given a call to a randomness oracle, which in practice we can approximate pretty easily. And it's "definitionally" easy to specify as well: "the function which, at each call, returns true with 50% likelihood and false otherwise."
2DanielFilan3y
If you get an 'external' randomness oracle, then you could define the utility function pretty simply in terms of the outputs of the oracle. If the agent has a pseudo-random number generator (PRNG) inside it, then I suppose I agree that you aren't going to be able to give it a utility function that has the standard set of convergent instrumental goals, and PRNGs can be pretty short. (Well, some search algorithms are probably shorter, but I bet they have higher Kt complexity, which is probably a better measure for agents)
2Vaniver3y
I'd take a different tack here, actually; I think this depends on what the input to the utility function is. If we're only allowed to look at 'atomic reality', or the raw actions the agent takes, then I think your analysis goes through, that we have a simple causal process generating the behavior but need a very complicated utility function to make a utility-maximizer that matches the behavior. But if we're allowed to decorate the atomic reality with notes like "this action was generated randomly", then we can have a utility function that's as simple as the generator, because it just counts up the presence of those notes. (It doesn't seem to me like this decorator is meaningfully more complicated than the thing that gave us "agents taking actions" as a data source, so I don't think I'm paying too much here.) This can lead to a massive explosion in the number of possible utility functions (because there's a tremendous number of possible decorators), but I think this matches the explosion that we got by considering agents that were the outputs of causal processes in the first place. That is, consider reasoning about python code that outputs actions in a simple game, where there are many more possible python programs than there are possible policies in the game.
2Richard_Ngo3y
So in general you can't have utility functions that are as simple as the generator, right? E.g. the generator could be deontological. In which case your utility function would be complicated. Or it could be random, or it could choose actions by alphabetical order, or... And so maybe you can have a little note for each of these. But now what it sounds like is: "I need my notes to be able to describe every possible cognitive algorithm that the agent could be running". Which seems very very complicated. I guess this is what you meant by the "tremendous number" of possible decorators. But if that's what you need to do to keep talking about "utility functions", then it just seems better to acknowledge that they're broken as an abstraction. E.g. in the case of python code, you wouldn't do anything analogous to this. You would just try to reason about all the possible python programs directly. Similarly, I want to reason about all the cognitive algorithms directly.
2Vaniver3y
That's right. I realized my grandparent comment is unclear here: this should have been "consequence-desirability-maximizer" or something, since the whole question is "does my utility function have to be defined in terms of consequences, or can it be defined in terms of arbitrary propositions?". If I want to make the deontologist-approximating Innocent-Bot, I have a terrible time if I have to specify the consequences that correspond to the bot being innocent and the consequences that don't, but if you let me say "Utility = 0 - badness of sins committed" then I've constructed a 'simple' deontologist. (At least, about as simple as the bot that says "take random actions that aren't sins", since both of them need to import the sins library.)

In general, I think it makes sense to not allow this sort of elaboration of what we mean by utility functions, since the behavior we want to point to is the backwards assignment of desirability to actions based on the desirability of their expected consequences, rather than the expectation of any arbitrary property.

---

Actually, I also realized something about your original comment which I don't think I had the first time around; if by "some reasonable percentage of an agent's actions are random" you mean something like "the agent does epsilon-exploration" or "the agent plays an optimal mixed strategy", then I think it doesn't at all require a complicated utility function to generate identical behavior. Like, in the rock-paper-scissors world, and with the simple function 'utility = number of wins', the expected utility maximizing move (against tough competition) is to throw randomly, and we won't falsify the simple 'utility = number of wins' hypothesis by observing random actions. Instead I read it as something like "some unreasonable percentage of an agent's actions are random", where the agent is performing some simple-to-calculate mixed strategy that is either suboptimal or only optimal by luck (when the optimal mixed strat
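The rock-paper-scissors point can be checked numerically (a quick sketch, not part of the original comment): under "utility = number of wins" payoffs, the uniform random throw is the only strategy whose worst-case expected payoff against a best-responding opponent is not negative.

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player:
# +1 for a win, 0 for a tie, -1 for a loss (rows/cols: R, P, S).
payoff = np.array([[0, -1,  1],
                   [1,  0, -1],
                   [-1, 1,  0]])

def worst_case(p):
    """Expected payoff of mixed strategy p against a best-responding opponent."""
    return min(p @ payoff[:, j] for j in range(3))

uniform = np.ones(3) / 3              # throw randomly
biased = np.array([0.5, 0.25, 0.25])  # overweight rock

print(worst_case(uniform), worst_case(biased))
```

Any deviation from uniform is exploitable ("tough competition" best-responds), so observing random throws is exactly what the simple utility hypothesis predicts.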
4Richard_Ngo3y
This is in fact the intended reading, sorry for ambiguity. Will edit. But note that there are probably very few situations where exploring via actual randomness is best; there will almost always be some type of exploration which is more favourable. So I don't think this helps. To be pedantic: we care about "consequence-desirability-maximisers" (or in Rohin's terminology, goal-directed agents) because they do backwards assignment. But I think the pedantry is important, because people substitute utility-maximisers for goal-directed agents, and then reason about those agents by thinking about utility functions, and that just seems incorrect. What do you mean by optimal here? The robot's observed behaviour will be optimal for some utility function, no matter how long you run it.
2Vaniver3y
Valid point.

This also seems right. Like, my understanding of what's going on here is we have:

  • 'central' consequence-desirability-maximizers, where there's a simple utility function that they're trying to maximize according to the VNM axioms
  • 'general' consequence-desirability-maximizers, where there's a complicated utility function that they're trying to maximize, which is selected because it imitates some other behavior

The first is a narrow class, and depending on how strict you are with 'maximize', quite possibly no physically real agents will fall into it. The second is a universal class, which instantiates the 'trivial claim' that everything is utility maximization.

Put another way, the first is what happens if you hold utility fixed / keep utility simple, and then examine what behavior follows; the second is what happens if you hold behavior fixed / keep behavior simple, and then examine what utility follows. Distance from the first is what I mean by "the further a robot's behavior is from optimal"; I want to say that I should have said something like "VNM-optimal" but actually I think it needs to be closer to "simple utility VNM-optimal."

I think you're basically right in calling out a bait-and-switch that sometimes happens, where anyone who wants to talk about the universality of expected utility maximization in the trivial 'general' sense can't get it to do any work, because it should all add up to normality, and in normality there's a meaningful distinction between people who sort of pursue fuzzy goals and ruthless utility maximizers.
