metachirality

Comments

I have a strong inside view of the alignment problem and what a solution would look like. The main reason I don't have an equally concrete inside-view AI timeline is that I don't know enough about ML, so I have to defer to others to get a specific decade. The biggest gap in my model of the alignment problem is what a solution to inner misalignment would look like, though I think it would be something like finding a way to avoid wireheading.

I've checked out John Wentworth's study guide before, mostly doing CS50.

Part of the reason I'm considering getting a degree is so I can get a job if I want one and not have to bet on living rent-free with other rationalists or something.

The people I've talked to the most have timelines centered around 2030. However, I don't have a detailed picture of why, because their reasons are capabilities exfohazards. From what I can tell, their reasons involve tricks you could implement to get recursive self-improvement (RSI) even on hardware that exists right now, but I think most good-sounding tricks don't actually work (no one expected transformers to end up closer to AGI than other architectures), and I think superintelligence is more contingent on compute and training data than they think. It also seems like other people in AI alignment disagree in a more optimistic direction. Now that I think about it, though, I probably overestimated how long the timelines of optimistic alignment researchers are, so their timelines are probably more like 2040.

The difference between an expected utility maximizer using updateless decision theory and an entity who likes the number 1 more than the number 2, or who cannot count past 1, or who has a completely wrong model of the world that nonetheless makes it one-box, is that the expected utility maximizer using updateless decision theory also wins in scenarios outside of Newcomb's problem, where you may have to choose $2 over $1, count more than one object, or believe true things. Similarly, an entity that "acts like it has a choice" generalizes well to other scenarios, whereas these other possible entities don't.

  1. I think getting an extra person to do alignment research can give massive amounts of marginal utility, considering how few people are doing it and how it will determine the fate of humanity. We're still in the stage where adding an extra person removes a scarily large amount from p(doom), maybe up to 10% for an especially good individual, which probably averages out to something much smaller but still scarily large for your average new alignment researcher. This is especially true for agent foundations.
  2. I think it's very possible to solve the alignment problem. Stuff like QACI, while not a full solution yet, makes me think that this is conceivable and that you could probably find a solution if you threw enough people at the problem.
  3. I think we'll get a superintelligence at around 2050.

One-boxers win because they reasoned in their head that one-boxers win because of updateless decision theory or something, so they "should" be a one-boxer. The decision is predetermined, but the reasoning acts like it has a choice in the matter (and people who act like they have a choice in the matter win). What carado is saying is that people who act like they can move around the realityfluid tend to win more, just like how people who act like they have a choice in Newcomb's problem, and therefore one-box, win even though they don't have a choice in the matter.

I don't think this matters all that much. In Newcomb's problem, even though your decision is predetermined, you should still want to act as if you can affect the past, specifically Omega's prediction.
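As a rough illustration of why acting as if you can affect the prediction pays off, here's a minimal sketch of the expected-value arithmetic. The standard $1M/$1k payoffs and a 99%-accurate Omega are my own illustrative assumptions, not anything from the original discussion:

```python
# Expected value of one-boxing vs. two-boxing in Newcomb's problem.
# Payoffs and predictor accuracy below are illustrative assumptions.

BIG = 1_000_000    # opaque box: filled only if Omega predicted one-boxing
SMALL = 1_000      # transparent box: always contains $1,000
ACCURACY = 0.99    # probability that Omega predicts your choice correctly

# One-boxing: Omega most likely predicted it and filled the opaque box.
ev_one_box = ACCURACY * BIG + (1 - ACCURACY) * 0

# Two-boxing: Omega most likely predicted it and left the opaque box empty.
ev_two_box = ACCURACY * SMALL + (1 - ACCURACY) * (BIG + SMALL)

print(f"EV(one-box) = ${ev_one_box:,.0f}")  # $990,000
print(f"EV(two-box) = ${ev_two_box:,.0f}")  # $11,000
```

However the causal story is told, agents that one-box walk away with more money in expectation, which is the sense in which "acting like you have a choice" wins.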

I don't believe something can persuade generals to go to war in a short period of time, just because it's very intelligent.

A few things I've seen give pretty worrying lower bounds for how persuasive a superintelligence would be:

Remember that a superintelligence will be at least several orders of magnitude more persuasive than character.ai or Stuart Armstrong.

Formal alignment proposals avoid this problem by doing metaethics, mostly something like determining what a person would want if they were perfectly rational (so no cognitive biases or logical errors), otherwise basically omniscient, and had an unlimited amount of time to think about it. This is called reflective equilibrium. I think this approach would work for most people, even pretty terrible people. If you extrapolated a terrorist who commits acts of violence for some supposed greater good, for example, they'd realize that the reasoning that led them to conclude those acts of violence were good was wrong.

Corrigibility, on the other hand, is more susceptible to this problem, so you'd want to get the AI to do a pivotal act, for example destroying every GPU to prevent other people from deploying harmful AI, or unaligned AI for that matter.

Realistically, I think that most entities who'd want to use a superintelligent AI like a nuke would probably be too short-sighted to care about alignment, but don't quote me on that.

To the first one: they aren't actually suffering that much or experiencing anything they'd rather not experience, because they're continuous with you and you aren't suffering.

I don't actually think a simulated human would be continuous in spacetime with the AI because the computation wouldn't be happening inside of the qualia-having parts of the AI.

I think what defines a thing as a specific qualia-haver is not what information it actually holds but how continuous it is with other qualia-having instances at different positions in spacetime. I think that mental models are mostly continuous with the modeler, so you can't actually kill them or anything. In general, I think you're discounting the importance of the substrate of a mental model/identity/whatever. To make an analogy, you're saying the prompt is where the potential qualia-stuff is happening, and isn't merely a filter on the underlying language model.
