harfe


Comments

harfe10

Their platform would be whatever version and framing of AI notkilleveryoneism the candidates personally endorse, plus maybe some other smaller things. They should be open that they consider the potential human disempowerment or extinction to be the main problem of our time.

As for concrete policy proposals, I am not sure. The focus could be on international treaties, or on banning or heavily regulating AI models that were trained with more than a trillion quadrillion (10^27) operations. (I am not sure I understand the intent behind your question.)

harfe290

A potentially impactful thing: someone competent runs as a candidate for the 2028 election on an AI notkilleveryoneism[1] platform. Maybe even two people should run, one in the Democratic primary and one in the Republican primary. While getting the nomination is rather unlikely, there could be lots of benefits even if you fail to gain the nomination (like other presidential candidates becoming sympathetic to AI notkilleveryoneism, or AI notkilleveryoneism becoming more popular in the population, etc.)

On the other hand, attempting a presidential run can easily backfire.

A relevant previous example of this kind of approach is the 2020 campaign by Andrew Yang, which focussed on universal basic income (and the downsides of automation). While the campaign attracted some attention, it seems like it didn't succeed in making UBI a popular policy among Democrats.


  1. Not necessarily using that name. ↩︎

harfeΩ221

This can easily be done in the cryptographic example above: B can sample a new number , and then present to a fresh copy of A that has not seen the transcript for so far.

I don't understand how this is supposed to help. I guess the point is to somehow catch a fresh copy of A in a lie about a problem that is different from the original problem, and conclude that A is the dishonest debater?

But couldn't A just answer "I don't know"?

Even if it is a fresh copy, it would notice that it does not know the secret factors, so it could display different behavior than in the case where A knows the secret factors .

harfe92

Some of these are very easy to prove; here's my favorite example. An agent has a fixed utility function and performs Pareto-optimally on that utility function across multiple worlds (so "utility in each world" is the set of objectives). Then there's a normal vector (or family of normal vectors) to the Pareto surface at whatever point the agent achieves. (You should draw a picture at this point in order for this to make sense.) That normal vector's components will all be nonnegative (because Pareto surface), and the vector is defined only up to normalization, so we can interpret that normal vector as a probability distribution. That also makes sense intuitively: larger components of that vector (i.e. higher probabilities) indicate that the agent is "optimizing relatively harder" for utility in those worlds. This says nothing at all about how the agent will update, and we'd need another couple sentences to argue that the agent maximizes expected utility under the distribution, but it does give the prototypical mental picture behind the "Pareto-optimal -> probabilities" idea.

Here is an example (to point out a missing assumption): Let's say you are offered to bet on the result of a coin flip for dollar. You get dollars if you win, and your utility function is linear in dollars. You have three actions: "Heads", "Tails", and "Pass". Then "Pass" performs Pareto-optimally across multiple worlds. But "Pass" does not maximize expected utility under any distribution.

I think what is needed for the result is an additional convexity-like assumption about the utilities. This could be "the set of achievable utility vectors is convex", or even something weaker like "every convex combination of achievable utility vectors is dominated by an achievable utility vector" (here, by utility vector I mean $(u_1, \dots, u_n)$ if $u_i$ is the utility of world $i$). If you already accept the concept of expected utility maximization, then you could also use mixed strategies to get the convexity-like assumption (but that is not useful if the point is to motivate using probabilities and expected utility maximization).
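
The coin-flip example can be checked numerically. The sketch below uses hypothetical numbers that are my own assumption (a stake of 1 dollar and a payout of 3 dollars on a win, so a win nets +2 and a loss nets -1); the helper names are also made up for illustration. It checks that "Pass" is Pareto-optimal among the three pure actions, that it is not an expected-utility maximizer for any coin probability on a grid, and that a 50/50 mixture of "Heads" and "Tails" dominates "Pass", which is what a convexity assumption would bring into play.

```python
# Hypothetical numbers (an assumption, not taken from the comment above):
# the bet costs 1 dollar and pays out 3 dollars on a win.
STAKE = 1.0
PAYOUT = 3.0

# Utility vectors: (utility in the "coin lands heads" world,
#                   utility in the "coin lands tails" world).
ACTIONS = {
    "Heads": (PAYOUT - STAKE, -STAKE),
    "Tails": (-STAKE, PAYOUT - STAKE),
    "Pass": (0.0, 0.0),
}


def dominates(u, v):
    """True if u is at least as good as v in every world and strictly better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_optimal(name, actions):
    """True if no other pure action dominates the named action."""
    return not any(
        dominates(u, actions[name]) for other, u in actions.items() if other != name
    )


def expected_utility(u, p_heads):
    return p_heads * u[0] + (1.0 - p_heads) * u[1]


def maximizers(p_heads, actions):
    """Set of pure actions that maximize expected utility when P(heads) = p_heads."""
    best = max(expected_utility(u, p_heads) for u in actions.values())
    return {n for n, u in actions.items() if expected_utility(u, p_heads) >= best - 1e-12}


# Grid check over P(heads). Analytically, max(3p - 1, 2 - 3p) >= 0.5 > 0 for
# these numbers, so "Pass" (expected utility 0) is never a maximizer.
grid = [k / 1000 for k in range(1001)]
pass_ever_optimal = any("Pass" in maximizers(p, ACTIONS) for p in grid)

# A 50/50 mixture of "Heads" and "Tails", treated as a utility vector.
mixed = tuple(0.5 * h + 0.5 * t for h, t in zip(ACTIONS["Heads"], ACTIONS["Tails"]))

print("Pass is Pareto-optimal among pure actions:", pareto_optimal("Pass", ACTIONS))
print("Pass maximizes expected utility for some p:", pass_ever_optimal)
print("50/50 mixture dominates Pass:", dominates(mixed, ACTIONS["Pass"]))
```

With these assumed numbers, the set of pure utility vectors is not convex, and once the mixture is allowed, "Pass" stops being Pareto-optimal.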

Or: even if you do expect powerful agents to be approximately Pareto-optimal, presumably they will be approximately Pareto optimal, not exactly Pareto-optimal. What can we say about coherence then?

The underlying math statement of some of these kinds of results about Pareto-optimality seems to be something like this:

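If $\bar x$ is Pareto-optimal wrt utilities $u_i$, $i = 1, \dots, n$, and a convexity assumption (e.g. the set $\{(u_i(x))_{i=1}^n : x\}$ is convex, or something with mixed strategies) holds, then there is a probability distribution $\mu$ so that $\bar x$ is optimal for $\mathbb{E}_{i\sim\mu} u_i(x)$.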

I think there is a (relatively simple) approximate version of this, where we start out with approximate Pareto-optimality.

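We say that $\bar x$ is Pareto-$\varepsilon$-optimal if there is no (strong) Pareto-improvement by more than $\varepsilon$ (that is, there is no $x$ with $u_i(x) > u_i(\bar x) + \varepsilon$ for all $i$).

Claim: If $\bar x$ is Pareto-$\varepsilon$-optimal and the convexity assumption holds, then there is a probability distribution $\mu$ so that $\bar x$ is $\varepsilon$-optimal for $\mathbb{E}_{i\sim\mu} u_i(x)$.

Rough proof: Define $Y := \{(u_i(x))_{i=1}^n : x\}$ and $\bar Y$ as the closure of $Y$. Let $\tilde y \in \bar Y$ be of the form $\tilde y_i = u_i(\bar x) + \delta$ for the largest $\delta$ such that $\tilde y \in \bar Y$. We know that $\delta \le \varepsilon$. Now $\tilde y$ is Pareto-optimal for $\bar Y$, and by the non-approximate version there exists a probability distribution $\mu$ so that $\tilde y$ is optimal for $\mathbb{E}_{i\sim\mu} y_i$ over $y \in \bar Y$. Then, for any $x$, we have $\mathbb{E}_{i\sim\mu} u_i(x) \le \mathbb{E}_{i\sim\mu} \tilde y_i = \mathbb{E}_{i\sim\mu} (u_i(\bar x) + \delta) \le \varepsilon + \mathbb{E}_{i\sim\mu} u_i(\bar x)$, that is, $\bar x$ is $\varepsilon$-optimal for $\mathbb{E}_{i\sim\mu} u_i(x)$.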

harfeΩ9132

I think there are some subtleties with the (non-infra) Bayesian VNM version, which come down to the difference between "extreme point" and "exposed point" of . If a point is an extreme point that is not an exposed point, then it cannot be the unique expected utility maximizer under a utility function (but it can be a non-unique maximizer).

For extreme points it might still work with uniqueness, if, instead of a VNM-decision-maker, we require a slightly weaker decision maker whose preferences satisfy the VNM axioms except continuity.

harfeΩ8122

For any , if then either or .

I think this condition might be too weak and the conjecture is not true under this definition.

If , then we have (because a minimum over a larger set is smaller). Thus, can only be the unique argmax if .

Consider the example . Then is closed. And satisfies . But per the above it cannot be a unique maximizer.

Maybe the issue can be fixed if we strengthen the condition so that has to be also minimal with respect to .

harfe10

For a provably aligned (or probably aligned) system you need a formal specification of alignment. Do you have something in mind for that? This could be a major difficulty. But maybe you only want to "prove" inner alignment and assume that you already have an outer-alignment-goal-function, in which case defining alignment is probably easier.

harfe72

insofar as the simplest & best internal logical-induction market traders have strong beliefs on the subject, they may very well be picking up on something metaphysically fundamental. It's simply the simplest explanation consistent with the facts.

Theorem 4.6.2 in logical induction says that the "probability" of independent statements does not converge to 0 or 1, but to something in between. So even if a mathematician says that some independent statement feels true (e.g. some objects are "really out there"), logical induction will tell him to feel uncertain about that.

harfe60

A related comment from lukeprog (who works at OP) was posted on the EA Forum. It includes:

However, at present, it remains the case that most of the individuals in the current field of AI governance and policy (whether we fund them or not) are personally left-of-center and have more left-of-center policy networks. Therefore, we think AI policy work that engages conservative audiences is especially urgent and neglected, and we regularly recommend right-of-center funding opportunities in this category to several funders.

harfe40

it's for the sake of maximizing long-term expected value.

Kelly betting does not maximize long-term expected value in all situations. For example, if some bets are offered only once (or only finitely many times), then you can get better long-term expected utility by sometimes accepting bets with a potential "0"-utility outcome.
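
A minimal sketch of the one-shot case, under assumptions of my own: a single even-money bet with win probability 0.6, utility linear in wealth, and starting wealth normalized to 1. For an even-money bet the Kelly fraction is p - (1 - p). The function names are made up for illustration.

```python
import math

# Hypothetical setup (an assumption for illustration): one even-money bet,
# offered exactly once, with win probability 0.6 and starting wealth 1.
P_WIN = 0.6


def kelly_fraction(p):
    """Kelly fraction for an even-money bet: f* = p - (1 - p)."""
    return 2.0 * p - 1.0


def expected_wealth(f, p):
    """Expected wealth after betting a fraction f of wealth once."""
    return p * (1.0 + f) + (1.0 - p) * (1.0 - f)


def expected_log_wealth(f, p):
    """Expected log wealth after one bet; -inf if ruin has positive probability."""
    if f >= 1.0 and p < 1.0:
        return -math.inf
    return p * math.log(1.0 + f) + (1.0 - p) * math.log(1.0 - f)


f_kelly = kelly_fraction(P_WIN)

print("Kelly fraction:", f_kelly)
print("Expected wealth, Kelly bet: ", expected_wealth(f_kelly, P_WIN))
print("Expected wealth, all-in bet:", expected_wealth(1.0, P_WIN))
print("Expected log wealth, Kelly bet: ", expected_log_wealth(f_kelly, P_WIN))
print("Expected log wealth, all-in bet:", expected_log_wealth(1.0, P_WIN))
```

In this setup the all-in bet has the higher expected wealth even though it risks ending at zero, while the Kelly bet has the higher expected log wealth. So which one counts as "maximizing" depends on the utility function and on whether the bet repeats.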
