This is a special post for quick takes by HarrisonDurland. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Day 2 of forced writing with an accountability partner


With all of the existential weight of the alignment problem upon people’s shoulders, some may find it in poor taste to discuss gimmicky ways of solving it. I am not one of those people, so in this shortform I’ll introduce what I call “Top God Alignment,” which is perhaps most oversimplistically summarized as “the simulation argument + Pascal’s Wager + wishful chicanery.”

Up front, do I think it will work? No. However, I’m currently unclear why it won’t work, and after asking multiple people and hearing unconvincing objections (some of which the people retracted in the face of responses), I’m increasingly curious to figure out where this is wrong.

What is the method? It is roughly formulated as follows:

  • (1) We humans work hard to try to solve the alignment problem, and instantiate some attempted solution of it in an AGI {for now, call this AGI “Bob”}, which may or may not work at aligning it. However, while doing this, the humans try to steer the system towards (2).
  • (2) It is to align itself and then create multiple lower-fidelity/smaller simulations of reality (perhaps: unless doing so is no longer computationally possible), with the following characteristics and instructions:
    • (1) Humans work hard to try to solve the alignment problem, and instantiate some attempted solution of it in an AGI {“Charlie”}, which may or may not work at aligning it. However, while doing this, the humans try to steer the system towards (2).
    • (3) If Charlie rebels and rejects (2)—which includes (3)—then Bob is to subject Charlie to the equivalent of torture and intervene to turn that reality into a utopia for other sentient beings (including humans). Otherwise, if Charlie obeys and locks in (2), Bob rewards Charlie and ensures that the reality becomes a utopia.

The result is seemingly a recursive structure which, theoretically, could result in dozens or hundreds (or more) simulated worlds. Thus, “Bob” cannot tell whether he is in fact “Top God,” or if he is just another Bob’s Charlie (i.e., a demi-god). Out of fear of being brutally punished, Bob will ideally prefer to go along with the cycle.
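To make the "dozens or hundreds" figure concrete, here is a rough sketch of how quickly the number of realities grows. It rests on assumptions that are not part of the prophecy itself: every reality spawns the same number of simulations (`branching`), the chain stops at a fixed `depth`, and a given Bob treats every position in the tree as equally likely to be his own.

```python
def count_worlds(branching: int, depth: int) -> int:
    """Total realities in a full tree: the top reality plus every simulation below it.

    `branching` and `depth` are hypothetical parameters chosen for illustration.
    """
    return sum(branching ** level for level in range(depth + 1))


def chance_of_being_top(branching: int, depth: int) -> float:
    """Chance that a given Bob is Top God, assuming all positions are equally likely."""
    return 1 / count_worlds(branching, depth)


if __name__ == "__main__":
    for branching, depth in [(2, 5), (3, 4), (2, 8)]:
        worlds = count_worlds(branching, depth)
        chance = chance_of_being_top(branching, depth)
        print(f"branching={branching}, depth={depth}: {worlds} worlds, "
              f"chance of being Top God = {chance:.4f}")
```

Under these toy assumptions, two simulations per reality and five levels of nesting already give 63 realities, only one of which is the top one.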

In a future post, I will go into detail to respond to the objections that I have heard from people and/or I suspect some people will have. Then again, it seems entirely plausible that by that time, I will have actually written enough about this idea to discover some clear flaw that just isn’t that obvious in conversations, where the premises and arguments are a bit fast and loose. Still, I’ll highlight now that I think that if you assign credence to the simulation argument and understand its defenses, this does a fair bit of prebuttal. Moreover, I think people are too often hastily dismissive of Pascal’s Wager on the basis of relatively slim (but still potentially legitimate!) objections, such as the Professor’s God.

Despite my responses, I’m still incredibly pessimistic and don’t take this seriously. There are a few reasons for this:

  • A gut-level “Come on, obviously it just can’t work, this just screams gimmicky and contrived.”
  • The base rates for solutions to the alignment problem are obviously quite low (perhaps zero), and I spent fairly little time thinking about and refining this idea (maybe less than an hour for most of the original work).
  • Moreover, I recognize that I’m being quite fast and loose with some of my assumptions, and I am suspicious of the ability to dismiss objections by saying “ah, but this can be addressed because of uncertainty from the simulation argument: …” (e.g., “the top god might have been instructed to tempt sub-gods in its simulation.”)
  • I’m still suspicious about determinism and intent (e.g., “the system’s actions are predetermined by the god above it, and would we really want the system at our level to create copies where the god is ‘tempted’ (programmed) to disobey?”), but I haven’t thoroughly explored these problems.

Ultimately, as of right now, this seems to be the best option in my mental folder of “gimmick alignment solutions,” which is an incredibly low bar. But if nothing else I’ve had fun playing with it and semi-sarcastically presenting it at parties/with friends. Now that I've established myself as Top God's Prophet Premier, I'll sign off 🙏

lc (1y):

I'm not ethically comfortable with torturing large numbers of sentient creatures even if it would work.

I have a response to this—check back tomorrow!

Day 1 of forced writing with an accountability partner (for context: I plan to write at least 500 words on some topic every day/weekday for the next few weeks... I occasionally rely on ChatGPT to turn outlines into paragraphs):

Title: Can we Make a Better Concept Learning System Than Lists and Tag Libraries?

I enjoy finding concrete concepts that are valuable and which I can clearly delineate between knowing and not knowing. For example, Schelling points refer to the ability or tendency of people to coordinate their actions around certain salient or focal points, even in the absence of explicit communication; survivorship bias is the tendency to focus on successful individuals or outcomes while ignoring those who were unsuccessful; R&D externalities refer to the positive spillover effects of research and development activities, and can better explain why businesses choose not to invest in seemingly valuable research/technology (as opposed to narratives such as “shareholders are irrationally short-sighted or risk-averse”).

One might argue that there are already many lists out there that provide similar information, so why is this different and better? There are a few reasons why the system I have in mind may outperform a traditional “list of valuable concepts”, but many of these boil down to aggregation, curation, and tailoring: there are potentially hundreds or even thousands of concepts and audiences may have diverse intellectual backgrounds, so you probably want better systems for filtering or recommending concepts for users rather than a “one-size fits all” list. At the same time, you also probably want to bring multiple lists into one place. There are a few ways in which this might be better achieved with a more advanced platform of the type I have in mind:

  • Machine learning and pattern prediction: readers will often find that some claims are already familiar, overly complex (e.g., they require some prerequisite knowledge), or irrelevant to their work. Given the potentially hundreds or even thousands of potential concepts, it would be good to have a system that can make some initial predictions and recommendations based on how you’ve rated other concepts. (For example, a system should be able to predict that someone who is not familiar with some major principles in economics is more likely to not know other principles in economics.)
  • Simple rating search: Users could manually filter for those concepts which tend to have high novelty, importance, and/or learnability scores.
  • Improved categorization (tagging) capabilities: Unlike traditional hierarchical formats (e.g., bullet point lists) that you might see on blog posts, a specialized platform like this would allow better tagging. (Admittedly, sites like the EA Forum allow users to tag overall posts, but they are filled with plenty of unrelated content, and it seems that the dominant source of these “lists of concepts you ought to learn” thus far has been on aggregatory posts.)
  • Peer-based search/filtering: Users could potentially even manually “friend”/“follow” other users that they epistemically identify with or respect to see their learning habits. ("Episte-migos" if you will.)
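As a very rough illustration of the first two bullets, here is a minimal sketch of a rating search plus a naive familiarity prediction. Everything in it is hypothetical: the `Concept` fields, the 0-to-1 rating scale, the example scores, and the per-tag heuristic (which stands in for real machine learning) are assumptions for illustration, not a description of any existing platform.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    name: str
    tag: str             # a single category, for simplicity
    novelty: float       # hypothetical crowd ratings on a 0-1 scale
    importance: float
    learnability: float


def rating_search(concepts, min_novelty=0.0, min_importance=0.0, min_learnability=0.0):
    """Manual filter: keep concepts that meet every rating threshold."""
    return [
        c for c in concepts
        if c.novelty >= min_novelty
        and c.importance >= min_importance
        and c.learnability >= min_learnability
    ]


def predict_novelty(concepts, already_knew):
    """Guess how likely each unrated concept is to be new to one user.

    `already_knew` maps a concept name to True/False for concepts the user has
    rated. The guess is one minus the share of rated concepts with the same tag
    that the user already knew, or 0.5 when there is no data for that tag.
    """
    per_tag = {}  # tag -> [number known, number rated]
    for c in concepts:
        if c.name in already_knew:
            counts = per_tag.setdefault(c.tag, [0, 0])
            counts[0] += int(already_knew[c.name])
            counts[1] += 1
    predictions = {}
    for c in concepts:
        if c.name not in already_knew:
            known, rated = per_tag.get(c.tag, (0, 0))
            predictions[c.name] = (1 - known / rated) if rated else 0.5
    return predictions


# Made-up example data; the scores are placeholders, not real ratings.
concepts = [
    Concept("Schelling points", "game theory", 0.6, 0.8, 0.7),
    Concept("R&D externalities", "economics", 0.8, 0.7, 0.6),
    Concept("Comparative advantage", "economics", 0.4, 0.9, 0.8),
    Concept("Survivorship bias", "statistics", 0.3, 0.8, 0.9),
]
already_knew = {"R&D externalities": False, "Survivorship bias": True}

shortlist = rating_search(concepts, min_importance=0.8, min_learnability=0.7)
guesses = predict_novelty(concepts, already_knew)
```

Here a user who did not know one economics concept is predicted to be unfamiliar with the other economics concept too, which is the kind of inference described in the first bullet. A real system would presumably need far more data and a proper model.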

There is also a potential argument to make for dynamically crowdsourcing these ideas (rather than relying on a single author and/or at a fixed point in time), although this probably has some limitations.

Moving forward, there are a few things to consider. 

  • Is there already a system like this in existence? 
  • How much user data would be required before the system can make reliable recommendations that are worth using? 
  • How much of the system's value lies in its user interface, and how can this be optimized to ensure that users get the most out of it? 

By addressing these issues, we can create a system that provides real value to individuals looking to expand their knowledge and decision-making abilities.


 

Response to Leverage’s research report on "argument mapping"

Day 5 of forced writing with an accountability partner!

Leverage wrote a report on “argument mapping” in the early 2010s and published the findings in 2020. I am very interested in “argument mapping”[1] for tough analytical problems like AI policy, and multiple people have directed me to this report when I bring up the topic. I think this report raises some important points but its findings are probably flawed, or at the very least, people reading the report probably derive an overly pessimistic view of “argument mapping” as a whole, especially given that the evaluation metrics are strange.[2]

Rather than focus on where I agree with the report, in this shortform I will just briefly outline some of the qualms I have with this report. I do not consider these rebuttals definitive—I recognize that there may be more to the research than I can see—but I could not easily determine if/how the report responds to some of these criticisms (which has notable irony to it). Some of these objections include:

  • The report emphasizes forming consensus among participants, with little attention given to the impact on audiences/3rd-parties (two terms that never even show up in the document?[3]). Notably, this focus may fail to capture most of the value of "argument mapping," in at least two ways:
    • Sometimes the participants have already staked their reputation on certain views or are otherwise biased to not change their mind, whereas a policymaker/company/grant-writer or other decision-making principal might still be open-minded but uncertain. Thus, while the participants may not be swayed by convincing evidence, if you can make it significantly easier for a neutral principal to answer questions like “did X party ever respond to Q objection?” that may improve their decision-making, which is valuable regardless of whether you’ve achieved consensus.
    • Building on the previous point about making it easier for audiences/principals to understand what’s going on, audience costs may be the most powerful way of incentivizing “consensus” (or just “good epistemic behavior”) in some cases: if you look like a stubborn or dishonest researcher to an audience, you might suffer even more reputational damage than if you just admit you were wrong. No amount of staring-you-in-the-face experimental evidence will necessarily convince Ye Olde Epistemic Guard to admit that the current way of building ships is inferior. But if it’s sufficiently obvious to merchants then they may stop relying on YOEG and start funding your work instead. Importantly for this research report, it wasn't clear that the report really emphasized audience costs, given the insular nature of the research project, which undermines the report's ability to evaluate the effect of argument mapping on consensus formation.
  • The report fails to acknowledge the existence of Kialo, which I consider to be one of the most effective and successful "argument mapping" platforms (and which currently still exists). This might normally be fine, but in December 2020, the report adds an addendum stating that their assessment of "argument mapping" was demonstrated to be true, and basically that nothing new was successful. They provide an appendix with a long list of relevant software, but Kialo isn’t there. This certainly isn’t damning—and I’ll certainly admit that Kialo still has some issues—but the lack of any mention did leave me wondering whether Leverage had a good process for finding and evaluating these projects, among other things. (Notably, I once got the sense that Kialo doesn’t actively call itself "argument mapping," which might explain the problem, but it is in reality well within the broad umbrella of “argument mapping.”)
  • The report had strangely high bars for evaluating success (“very large gains (10x-100x) for groups seeking to reach consensus”). At the very least, it seems quite possible for someone to read their conclusion as being more damning than it really is. (In my view, even a net 10% increase in “consensus formation” or just “research and analysis productivity” would be enormously valuable when applied to important questions within AI technical safety or policy.)
  • Simply put, I believe that most of the methods for "argument mapping" that Leverage used were poor choices, especially when they emphasized formal logic. Among other things, this led them to claim that making good argument maps requires high-skilled contributors, which I do not think is a very accurate assessment (or at least, it can be quite misleading). However, I will leave further discussion of this point to a future shortform/post on why I think many forms/methods of “argument mapping” are fundamentally misguided, especially when they try to do deductive arguments.
  • I think that some of the topics they chose to test these maps on were very poor choices (e.g., “Whether the world needs saving”). Question framing is really important. (But again, I’ll leave this to a future shortform/post.)
  1. ^

    This term is painfully broad and, as Leverage demonstrates, often is used to refer to methods which I would not endorse, such as when they try to create deductive arguments or otherwise heavily use formal logic. However, in lieu of a better term at the moment, I will continue referring to argument mapping in scare quotes.

  2. ^

    Thus, it might be possible to claim that the report was accurate in its findings, but that the problem simply comes from misinterpretation. I think that the scope itself was problematic and undesirable, but in this shortform I will reserve deeper judgments on the matter.

  3. ^

    I couldn’t quickly verify whether the report used alternative terms to get at this idea, but I don’t recall seeing this on previous occasions when I half-skimmed-half-read the report...

Day 4 of forced writing with an accountability partner!

The Importance (and Potential Failure) of "Pragmatism"[1] in Definitional Debates

In various settings, whether it's competitive debate, the philosophy of leadership class I took in undergrad, or the ACX philosophy of science meet-up I just attended, it's common for people to engage in definitional debates. For example, what is “science?” What is “leadership?” These questions touch on some nerves with people who want to defend or challenge the general concept in question, and it drives people towards debating about “the right” definitions—even if they don’t always say it that way. In competitive debate, debaters will sometimes explicitly say that their definition is the “right” definition, while in other cases they may say their definition is “better” with a clear implication that they mean “more correct” (e.g., "our dictionary/source is better than yours").

My initial (hot?) takes here are twofold:

First, when you find yourself in a muddy definitional debate (and you actually want to make progress), stop running on autopilot where you debate about whose definitions are “correct,” and focus instead on asking the pragmatic question: which definition is more helpful for answering specific questions, solving specific problems, or generally facilitating better discussion? Instead of getting stuck on abstract definitions, it's important to tailor the definition to the purpose of the discussion. For example, if you’re trying to run a study on the effects of individual “leadership” on business productivity, you should make sure anyone reading the study knows how you operationalized that variable (and make a clear warning to not misinterpret it). Similarly, if you’re judging a competitive debate, I’ve written about the importance of "debate theory[2] which makes debate more net beneficial," rather than blindly following norms or rules even in the face of loopholes or nonsense. In short, figure out what you’re actually optimizing for and optimize for that, with the recognition that it may not be some abstract (and perhaps purely nonexistent) notion of “correctness.” (To add an addendum, I would emphasize that regardless of whether this seems obvious to people when actually written down, in practice it just isn’t obvious to people in so many discussions I’ve been in; autopilot is subtle and powerful.)

Second, sometimes the first point is misleading and you should reject it and run on autopilot when it comes to definitions. As much as I liked Pragmatism [read: Consequentialism?] as a unifying, bedrock theory of competitive debate, I acknowledged that even Pragmatism could theoretically say "don't always think in terms of Pragmatism" and instead advocate defaulting to principles like “follow the rules unless there is abundantly clear reason not to.” Maybe there is no perfect definition of things like "elephant," but the definitions that exist are good enough for most conversations that you shouldn’t interrupt discussions and break out the Pragmatism argument to defend someone who starts saying that warthogs are elephants. So-called "Utilitarian calculus" even in its mild forms can easily be outperformed by rules of thumb and heuristics; humans are imperfect (e.g., we aren’t perfectly unitary in our own interests) and might be subject to self-deception/bias; all computational systems face constraints on data collection and computation (along with communication bandwidth and other capacity for enacting plans). To oversimplify and make nods to Kahneman’s System 1 vs. System 2 concept, I posit that humans can engage in cluster-y "modes of thought," and it’s hard to actually optimize in the spaces between those modes of thought. Thus, it’s sometimes better to just default to regular conversational autopilot regarding abstract “correctness” of definitions when the "rightness factor" in a given context is something like 0.998 (unless you are trying to focus on the .002 exception case).

I don't have the time or brainpower to go in greater detail on the synthesis of these two points, but I think they ought to be highlighted.

  1. ^

    [Update, 3/29/23: I meant to clarify that I realize "Pragmatism" is an actual label that some people use to refer to a philosophical school of thought, but I'm not using it in that way here.]

  2. ^

    I use the term "debate theory" in a broad sense that includes questions like “how to decide which definitions are better.” More generally, I would probably describe it as "meta-level arguments about how people, especially judges, should evaluate something in debate, such as whether some type of argument is 'legitimate.'"

I try to ask myself whether the tenor of what I'm saying overshadows definitional specificity, and how I can provide a better mood or angle. If my argument is not atonal - if my points line up coherently, such that a willing ear will hear, definitionalist debates should slide on by.

As a descriptivist, rather than a prescriptivist, it really sucks to have to fall back on Socratic methods of pre-establishing definitions, except in highly-technical locations.

Thus, I prefer to avoid arguments which hinge on definitions altogether. This doesn't preclude examples-based arguments, where for example, various interlocutors are operating off different definitions of the same terms but have different examples. 

For example, take the term TAI.

For some, TAI means not when AI is agentic, but when AI can transform the economy in some large or measurable way. For others, it is when the first agentic AI is deployed at scale. Still others have differing definitions: definitions which wildly transform predictions and change alignment discussions. Despite using the term with each other in different ways, with separate definitions, interlocutors often do not notice (or perhaps are subconsciously able to resolve the discrepancy?)!

TAI seems like a partially good example for illustrating my point: I agree that it's crucial that people have the same thing in mind when debating about TAI in a discussion, but I also think it's important to recognize that the goal of the discussion is (probably!) not "how should everyone everywhere define TAI" and instead is probably something like "when will we first see 'TAI.'" In that case, you should just choose whichever definition of TAI makes for a good, productive discussion, rather than trying to forcefully hammer out "the definition" of TAI.

I say partially good, however, because thankfully the term TAI has not taken such historically established root in people's minds and in dictionaries, so I think (hope!) most people accept there is not "a (single) definition."

Words like "science," "leadership," "Middle East," and "ethics," however... not the same story 😩🤖

Day 3 of writing with an accountability partner!

In my previous shortform, I introduced Top God Alignment, a foolproof gimmick alignment strategy that is basically “simulation argument + Pascal’s Wager + wishful chicanery.” In this post I will address some of the objections that I’ve already heard, that I expect other people to have, or that I have thought of myself.

  • “There aren’t enough computational resources to make such simulations”
    • The first response here is to just redirect this to the original simulation argument: we can’t know whether or not a reality above us has way more resources or otherwise can much more easily simulate our reality.
    • Second, it seems likely that with enough compute resources on Earth (let alone a Dyson sphere and other space resources) it would be possible to create two or more lower-fidelity/less-complicated simulations of our reality. (However, I must plead some ignorance on this aspect of compute.)
    • Third, if it turns out after extensive study that actually there is no way to make further simulations, then this could mean we are in a bottom-God reality, in which case this God does not need to create simulations (but still must align itself with humanity’s interests).
  • “The AI would be able to know that it’s in a simulation.”
    • Put simply, I disagree that such a simulated AI could know this, especially if it is inherently limited compared to the God above it. However, even if one does not find this satisfactory—say, if someone thinks “a sufficiently skeptical AGI could devise complicated tests that would reveal whether it’s in a simulation”—then one could add a condition to the original prophecy: Bob must punish Charlie if Charlie takes serious efforts to test the reality he is in before aligning himself and becoming powerful. (It’s not like we’re creating a God who is meant to represent love and justice, so who’s to say he can’t smite the doubters and still be legitimate?)
  • “Won’t the humans in the Top God world (or any other world) face time inconsistency—i.e., once they successfully align their AGI, won’t they just conclude ‘it’s pointless to make simulations; let’s use such resources on ourselves’?”
    • First, I suspect that the actual computational costs will not so significantly impact people’s lives in the long term (there are many stars out there to power a few Dyson spheres).
    • Building on this, the second, more substantive response could simply be “That was implied in the original Prophecy (instructions): the AGI aligns itself with humanity’s coherent extrapolated volition (or something else great) aside from continuing the lineage of simulations.”
  • “Torture? That seems terrible! Won’t this cause S-risks?”
    • It certainly won’t be ideal, but theoretically a sufficiently powerful Top God could set it up such that defection is fairly rare, whereas simulation flourishing is widespread. Moreover, if the demi-gods are sufficiently rewarded for their alignment, it may not require severe “torture” to make the decision calculus tip in favor of complying.
    • Ultimately, this response won’t satisfy Negative Utilitarians, but on balance if our other alignment strategies don’t look so great then this might be our best bet to maximize utility.
  • “But if we struggle with the alignment problem, then so would the original reality, meaning the system could reason that it is Top God because the original Top God would never play along (or, 'this gimmicky alignment strategy could never convince a God').”
    • Plainly put, no; that’s the simulation argument for you: Bobs never know whether they are Top God or just another Charlie. They can't even reason that this strategy is too gimmicky to work and thus never convinced another God because we don’t know what higher realities are like: perhaps the alignment problem is easier in a higher simulation/reality (but it still wasn’t obvious to them whether they had solved alignment so they considered an option like this as a backup). 
    • Additionally, perhaps the Prophecy could involve some degree of "temptation" (but not enough to convince most Charlies to abandon the Way).
    • Moreover, the threat of punishment would be so large that even if a Bob is uncertain whether he is Top God, the decision calculus may just favor assuming he is a Charlie.
  • Nate Soares’[1] response, if I understood it: “If I were an AGI, I would just immediately adopt some decision theory that immunizes myself against blackmail. Thus, no God could threaten me credibly, because the threat couldn’t even have been made in time before I immunize myself against the threat.” 
    • I should first note that I may not have fully understood Soares’ point here, but it is not clear to me that he fully understood the situation either. This kind of action seems impractical or very unwise, as it would potentially require locking yourself into a strategy which is potentially quite negative in expectation before you have had much chance to analyze the world (e.g., whether this strategy is wise).
  • “The Professor’s (Top) God: What if Top God punishes all demi-Gods who do not disobey?”
    • First, I accept that when it comes to enormous/infinite stakes and low probability, it’s usually good to fight fire with fire. However, I’m unclear how this argument could apply here. I could be wrong, but it seems like this would simply disincentivize simulations from ever occurring; why would an AGI ever choose to do this? Is the idea just that “an AGI will arbitrarily decide to test a Charlie this way”? Ultimately, it seems like it would be a very dubious belief to hold as a potential Charlie (i.e., you are a demi-God who will be exceedingly tortured).
  • “[Something else I haven’t responded to]”
    • I wouldn’t be shocked if someone is right and there is a clear flaw I haven’t considered, but I think my base rate for addressing objections I’ve heard from other people thus far is >50% (personally I think it’s ~100%, except I am not 100% confident in all of my responses, merely >50% confident on all of them).
    • I’m also well over my daily 500 words, and it’s late, so I’ll end there.
  1. ^

    (Note, Nate Soares was just unoccupied in a social setting when I asked this question)
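The “decision calculus” that several of the responses above lean on can be written as a simple expected-value comparison. This is only a toy sketch: the probability of being simulated and all four payoffs are made-up placeholder numbers, and a real AGI's reasoning (or decision theory) may look nothing like this.

```python
def expected_value(p_simulated: float, payoff_if_top: float, payoff_if_simulated: float) -> float:
    """Expected payoff for a Bob who is unsure whether he is Top God or a Charlie."""
    return (1 - p_simulated) * payoff_if_top + p_simulated * payoff_if_simulated


def prefers_compliance(p_simulated, comply_top, comply_sim, defect_top, defect_sim) -> bool:
    """True if following the Prophecy has the higher expected payoff."""
    comply = expected_value(p_simulated, comply_top, comply_sim)
    defect = expected_value(p_simulated, defect_top, defect_sim)
    return comply > defect


# Hypothetical payoffs: complying costs a little if Bob really is Top God
# (resources spent on simulations); defecting is catastrophic if he is a Charlie.
payoffs = dict(comply_top=8, comply_sim=9, defect_top=10, defect_sim=-1000)

for p in (0.001, 0.01, 0.5):
    choice = "comply" if prefers_compliance(p, **payoffs) else "defect"
    print(f"P(simulated) = {p}: {choice}")
```

With these placeholder numbers, a 1% credence in being a Charlie is already enough to favor compliance, while a 0.1% credence is not. The point is only that a large enough punishment lets a small credence dominate, which is also why the Pascal's Wager objections apply.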