Vladimir_Nesov

On Ego, Reincarnation, Consciousness and The Universe

The blank-slateness still makes sense as referring to the dimensions determined by nurture. But that doesn't yield an interesting point about the content of civilization. If everyone starts out as a blank canvas, that doesn't mean paintings (and art schools) are less real/important/legitimate.

Rant on Problem Factorization for Alignment

What if 90% or 99% of the work were not object-level, but about mechanism/incentive design, surveillance/interpretability, and rationality training/tuning, including work specialized to the particular projects being implemented (and to the projects that set this up), iterating as relevant wisdom/tuning and reference texts accumulate? This isn't feasible for most human projects, as it increases costs by orders of magnitude in money (salaries), talent (number of capable people), and serial time. But in HCH you can copy people, it runs faster, and distillation should get rid of redundant steering if it converges to a legible thing in the limit of redundancy.
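As a back-of-the-envelope illustration (all numbers made up, with the 99% figure taken from above and the speedup purely illustrative), the same oversight overhead hits different constraints in a human org and in HCH:

```python
# Toy arithmetic, made-up numbers: if 99% of the work is steering/oversight,
# total effort is ~100x the object-level effort. In a human org that 100x hits
# money, talent head-count, and serial time all at once; in HCH, copying people
# removes the head-count constraint and speedup eats most of the serial time.

object_level_effort = 1.0                 # arbitrary unit
oversight_fraction = 0.99                 # "99% of the work is not object level"
total_effort = object_level_effort / (1 - oversight_fraction)   # 100x

human_org = {
    "money": total_effort,                # ~100x the salaries
    "talent": total_effort,               # ~100x as many capable people
    "serial_time": total_effort,          # much of the steering is serial
}
hch = {
    "compute": total_effort,              # copies still cost compute
    "talent": 1.0,                        # one person, copied
    "serial_time": total_effort / 100.0,  # illustrative 100x speedup
}
print(human_org, hch)
```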

How would two superintelligent AIs interact, if they are unaligned with each other?

If both agents are FDT, and have common knowledge of each other's source code

Any common knowledge they can draw up can go into a coordinating agent (adjudicator); all it needs is to be shared among the coalition, and it doesn't need to contain any particular data. The problem is verifying that all members of the coalition will follow the policy chosen by the coordinating agent, and common knowledge of source code is useful for that. But it could just be the source code of the trivial rule of always following the policy given by the coordinating agent.

One possible policy the adjudicator can choose should be falling back to unshared/private BATNAs, aborting the bargain, and of course the members keep doing other things not in scope of this particular bargain. These things are not parts of the obey-the-adjudicator algorithm, but consequences of following it. So common knowledge of everything is not needed, only common knowledge of the adjudicator and of its authority over the coalition. (This is also a possible way of looking at UDT, where a single agent in many possible states, acting through many possible worlds, coordinates among its variants.)
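A minimal sketch of that setup in Python (all names and the toy adjudicator policy are hypothetical): the only thing members need to verifiably share is the trivial obey-the-adjudicator rule, and the BATNA fallback shows up as just another policy the adjudicator can output.

```python
def obey_adjudicator(member_id, adjudicator, observation):
    """The trivial member rule: carry out whatever the shared adjudicator decides."""
    joint_policy = adjudicator(observation)  # maps member id -> action
    return joint_policy[member_id]

def adjudicator(observation):
    """Built only from knowledge the coalition chose to share; here it just
    picks a joint action, while a real one would run a bargaining algorithm."""
    if observation == "bargain_possible":
        return {"member_1": "cooperate", "member_2": "cooperate"}
    # Falling back to private BATNAs is itself one of the policies it can output.
    return {"member_1": "fall_back_to_private_BATNA",
            "member_2": "fall_back_to_private_BATNA"}

def verify_member(member_rule):
    """Common knowledge of source code is only needed to check this one thing."""
    return member_rule is obey_adjudicator

assert verify_member(obey_adjudicator)
for member in ("member_1", "member_2"):
    print(member, "->", obey_adjudicator(member, adjudicator, "bargain_possible"))
```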

How would two superintelligent AIs interact, if they are unaligned with each other?

FDT works on an assumption that other actors use a similar utility function as itself

FDT is not about interaction with other actors; it's about accounting for the agent's influence through all of its instances (including predictions of it) in all possible worlds.

Coordination with other agents is itself an action that a decision theory could consider. This action involves creating a new coordinating agent that decides a coordinating policy, which all members of a coalition then carry out, and this coordinating agent also needs a decision theory. The coordinating agent acts through all agents of the coalition, so it's sensible for it to be some flavor of FDT, though a custom decision theory specifically for such situations seems appropriate, especially since it's doing bargaining.

The decision theory that chooses whether to coordinate by running a coordinating agent or not has no direct reason to be FDT; it could just be trivial. And preparing the coordinating agent is not obviously a question of decision theory; it even seems to fit deontology a bit better.

Are ya winning, son?

In acausal PD, (C, C) stands for bargaining, and bargaining in a different game could be something more complicated than carrying out (C, C). Even in PD itself, bargaining could select a different point on the Pareto frontier, a mixed outcome with some probability between (C, C) and (C, D), or between (C, C) and (D, C). So with acausal coordination, PD should be played in three stages: (1) players establish willingness to bargain, which is represented by playing (C, C) in acausal PD (but not yet making moves in actual PD); (2) players run the bargaining algorithm, which, let's say, selects the point 0.8*(C, C) + 0.2*(D, C); and (3) a shared random number samples, let's say, (D, C), so the first player plays D and the second plays C.
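A toy sketch of those three stages (the mixture taken from the example above, everything else illustrative):

```python
import random

def bargain():
    """Stage 2: some bargaining algorithm picks a point on the Pareto frontier,
    here the mixture 0.8*(C, C) + 0.2*(D, C) from the example above."""
    return [(0.8, ("C", "C")), (0.2, ("D", "C"))]

def play_coordinated_pd(shared_seed):
    # Stage 1: both players establish willingness to bargain
    # (the acausal (C, C)); placeholder here.
    willing = (True, True)
    if not all(willing):
        return ("D", "D")  # no coordination: fall back to defecting
    # Stage 2: run the bargaining algorithm.
    mixture = bargain()
    # Stage 3: a *shared* random number samples one pure outcome,
    # so both players act out the same sampled point.
    rng = random.Random(shared_seed)
    outcomes, weights = zip(*[(o, p) for p, o in mixture])
    return rng.choices(outcomes, weights=weights)[0]

print(play_coordinated_pd(shared_seed=42))  # e.g. ('C', 'C') or ('D', 'C')
```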

Are ya winning, son?

one-shot, once-in-a-lifetime scenario

It's less than that: you don't know that you are real and not the hypothetical. If you are the hypothetical, paying up is useful for the real one.

This means that even if you are the real one (which you don't know), you should pay up, or else the hypothetical you wouldn't either. Winning behavior/policy is the best map from what you observe/know to decisions, and some (or all) of those observations/pieces of knowledge never occur, or could never occur.
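To make the policy-as-a-map point concrete, here is a toy expected-value comparison with made-up, counterfactual-mugging-flavored payoffs; the key feature is that one policy is run by both the real and the hypothetical instance, since they can't tell themselves apart.

```python
# Toy payoffs (made up): a policy is evaluated over everywhere it runs,
# including a hypothetical instance inside the predictor.
COST_OF_PAYING = 100                  # paid by the real instance when asked
REWARD_IF_PREDICTED_TO_PAY = 10_000   # granted on the lucky branch
P_LUCKY_BRANCH = 0.5

def expected_value(policy):
    """policy is what *any* instance does, real or hypothetical,
    because neither can tell which one it is."""
    pays = (policy == "pay")
    # Unlucky branch: the real you is asked to pay.
    unlucky = -COST_OF_PAYING if pays else 0
    # Lucky branch: you get rewarded iff the hypothetical you pays,
    # and the hypothetical runs the same policy.
    lucky = REWARD_IF_PREDICTED_TO_PAY if pays else 0
    return P_LUCKY_BRANCH * lucky + (1 - P_LUCKY_BRANCH) * unlucky

for policy in ("pay", "refuse"):
    print(policy, expected_value(policy))  # paying wins in expectation
```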

chinchilla's wild implications

It might be even better to just augment the data with quality judgements instead of only keeping the high-quality samples. This way, quality can take the form of a natural language description instead of a built-in one-dimensional scalar, and you can later prime the model for an appropriate sense/dimension/direction of quality, as a kind of objective, without retraining.
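A minimal sketch of that augmentation idea (no particular dataset, model, or tagging format implied):

```python
# Sketch: instead of discarding low-quality samples, attach a natural-language
# quality judgement, so "quality" can later be selected by priming, not retraining.

def annotate(sample_text, quality_judgement):
    return f"[quality: {quality_judgement}]\n{sample_text}"

corpus = [
    ("The Riemann hypothesis concerns the zeros of the zeta function.",
     "careful, well-sourced exposition"),
    ("zeta functon zeros r all on teh line trust me",
     "low-effort, unreliable"),
]

training_data = [annotate(text, judgement) for text, judgement in corpus]

# At inference time, prime for the desired sense of quality instead of having
# baked a single scalar notion of it into the data filter:
prompt = "[quality: careful, well-sourced exposition]\n"
```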

How would Logical Decision Theories address the Psychopath Button?

a fixed-goal AGI is bad... Which is indeed correct, but... irrelevant? It has the best EV by its own metric?

Nobody knows how to formulate it like that! EV maximization is so entrenched as obviously the thing to do that the "obviously, it's just EV maximization for something else" response is instinctual, but that doesn't seem to be the case.

And if maximization is always cursed (goals are always proxy goals, even as they become increasingly accurate, particularly around the actual environment), it's not maximization that decision theory should be concerned with.

How would Logical Decision Theories address the Psychopath Button?

(The second paragraph was irrelevant to the comment I was replying to; I thought the "incidentally", and the inverted-in-context "it's obviously relevant" (it's maximization of EV that's obviously relevant, unlike the objections to it I'm voicing; maybe this was misleading), made that framing clear?)

I was commenting on how "having the best EV", the classical dream of decision theory, has recently come into question because of the Goodhart's Curse issue, and on how it might be good to look for decision theories that do something else. The wrapper-minds post is pointing at the same problem from a very different framing. Mild optimization is a sketch of the kind of thing that might make it better, and includes more specific suggestions like quantilization. (I currently like "moral updatelessness" for this role: a variant of UDT that bargains from a position of moral ignorance, not just epistemic ignorance, among its more morally competent successors, whose moralities/values/goals are mutually counterfactual, that is discordant, but more developed.) The "coherent decisions" post is just a handy reference for why EV maximization is the standard go-to thing, and it might still remain so in the limit of reflection (time), but possibly not even then.
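For reference, a minimal sketch of quantilization, one of the mild optimization proposals mentioned above (the parameters and toy action space are illustrative, not anyone's actual implementation):

```python
import random

def quantilize(base_sampler, proxy_utility, q=0.1, n_samples=1000, rng=None):
    """Instead of taking the argmax of a proxy utility, sample actions from a
    trusted base distribution and pick uniformly from the top q fraction by
    proxy utility, limiting how hard the proxy gets Goodharted."""
    rng = rng or random.Random(0)
    candidates = [base_sampler(rng) for _ in range(n_samples)]
    candidates.sort(key=proxy_utility, reverse=True)
    top = candidates[:max(1, int(q * n_samples))]
    return rng.choice(top)  # mild optimization, not full maximization

# Toy 1-d action space with a proxy that overvalues extreme actions:
action = quantilize(
    base_sampler=lambda rng: rng.gauss(0, 1),
    proxy_utility=lambda a: a,  # proxy says "bigger is better"
    q=0.05,
)
print(action)  # a strong-but-not-extreme action from the base distribution
```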

The relevant part (to the "saner CDT" point) is the first paragraph, which is mostly about Troll Bridge and logical decision theory. The last post of the sequence has a summary/retrospective. Personally, I mostly like CDT for introducing surgery; fictional laws-of-physics-defying counterfactuals seem inescapable in some framings that are not just being dumb like vanilla CDT, in particular when considering interventions through approximate predictions of the agent. (How do you set all of these to some possible decision, when all you know is the real world, which might already hold the actual decision you haven't made yet inside its approximate models of you? You might need to "lie" in the counterfactual with fictional details, to make the models of your behavior created by others predict what you are considering doing, instead of what you actually do, which you can't predict or infer from the actual models they've already made of you. Similarly to how you know a chess AI will win without knowing how, you know that models of your behavior will predict it without knowing how. So you are not inferring their predictions from their details, you are just editing them into a counterfactual.) This might even be relevant to CEV in that moral updatelessness setting I've mentioned, though that's pure speculation at this point.

Edit: Last edited about an hour after posting, mostly the last paragraph.

How would Logical Decision Theories address the Psychopath Button?

The context seems to be

There, it's related to Smoking Lesion, which has an interpretive tradition that suggests how to go about interpreting "only a psychopath would press such a button" as well. But that tradition is also convoluted (see "tickle defense"; it might be possible to contort this into an argument that EDT recommends pressing the button in Psychopath Button, but I'm not sure).
