Dalmert

Comments (sorted by newest)
Truth or Dare
Dalmert · 3mo · 126

> The low-trust attractor starts to bend other people into reciprocal low-trust shapes, just like a prion twisting nearby proteins.

Convincing people using your actions sounds disgusting!


Could you expand on what you mean here? I'm not sure I (or others) followed you. Or perhaps you meant it sarcastically?

(Formatting-wise: I'm not sure how to quote a quote here; perhaps someone knows?)

Eliezer and I wrote a book: If Anyone Builds It, Everyone Dies
Dalmert · 4mo · 140

Is an audiobook version also planned, perchance? Could preordering that one also help?

Judging from Stephen Fry's endorsement and, as I've seen, his general interest in the topic for some time, perhaps a delightful and maybe even eager deal could be made for him to narrate it? Unless some other choice might be better for either party, of course. I also understand if negotiations or existing agreements prevent anyone from confirming anything on this front; I'd be happy just to hear whether an audio version is planned or intended at all, and when, if that can be shared.

a confusion about preference orderings
Dalmert · 4mo · 60

I might be missing something that's written on this page, including the comments, but if not, here is my vague understanding of what people might fear regarding money pumps. I'm going to diverge from your model a bit and use the concept of a sub-world-state, denoted A', B', and C', which includes everything about the world that can be preferred except for how much money you have; I handle money separately in this comment.

A' -> B' -> C' -> A' preferences hold in a cycle.

The preference M_less -> M_more also holds for money had.

I think agents, either intrinsically or instrumentally, (have to) simplify their decisions by factoring them at each timestep.

So they ask themselves:

Do I prefer going from A' -> B' more than having M_less -> M_more, more concretely M_{Δ−1} -> M_{Δ0}?

In this example, the non-money preference is strong, so the answer is clearly yes.

Even if the agent plans ahead a bit, and considers:

Do I prefer A' -> B' -> C' more than having M_{Δ−2} -> M_{Δ0}?

The answer will still be a clear yes.

The interesting question is, what might someone who fears money pumps say an agent would do if it occurs to them to plan ahead enough and consider:

Do I prefer A' -> B' -> C' -> A' more than having M_{Δ−3} -> M_{Δ0}?

According to both this and your formalisms, this agent should clearly realize that they much prefer M_{Δ−3} -> M_{Δ0} and stay put at A'. And I think you are correct to ask whether planning is allowed by these different formalisms, and how it fits in.
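To make that contrast concrete, here is a rough sketch of the two evaluation styles (the sub-world-states, the per-step preference "strengths", and the trading fee are numbers I made up purely for illustration; nothing here is from your formalism):

```python
# Illustrative only: the step "strengths" and the fee are invented numbers.

FEE = 1.0  # money given up per trade, i.e. the M_{Δ-1} -> M_{Δ0} difference

# How strongly each single transition is preferred when considered in isolation.
# Because A' -> B' -> C' -> A' is a cycle, every local step looks attractive.
STEP_PREFERENCE = {("A'", "B'"): 5.0, ("B'", "C'"): 5.0, ("C'", "A'"): 5.0}

def step_factored_value(path):
    """Evaluate a plan the way a timestep-factoring agent might:
    sum how much each local transition is liked, minus the fees paid."""
    local_gain = sum(STEP_PREFERENCE[(a, b)] for a, b in zip(path, path[1:]))
    return local_gain - FEE * (len(path) - 1)

def world_state_value(path):
    """Evaluate only the complete end state, as an agent with static preferences
    over whole world states would: after a full cycle the sub-world-state is
    unchanged, so the only difference from the start is the money spent."""
    assert path[0] == path[-1], "this illustration only covers the full cycle"
    return -FEE * (len(path) - 1)

full_cycle = ["A'", "B'", "C'", "A'"]
print(step_factored_value(full_cycle))  # 12.0 -- every step looked worth the fee
print(world_state_value(full_cycle))    # -3.0 -- back at A', three units poorer
```

The step-factored evaluator happily keeps trading; the whole-world-state evaluator stays put at A'.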

I think concerns come in two flavors: 

One is how you put it: if the agent is stupid (or, more charitably, computationally bounded, as we all are), they might not realize that they are going in circles and trading away value in the process for no (comparable?) benefit to themselves. Maybe agents are more prone to notice repetition and stop after a few cycles, since prediction and planning are famously hard.

The other concern is what we seem to notice in other humans, and might notice in ourselves as well (and therefore might in practice diverge from idealized formalisms): sometimes we know or strongly suspect that something is likely not a good choice, and yet we do it anyway. How come? One simple answer lies in how preference evaluations work in humans: if A' -> B' is strongly enough preferred in itself, knowing or suspecting that A' -> B' -> C' -> A' comes bundled with M_{Δ−3} -> M_{Δ0} might not be strong enough to override it.

It might be important that, if we can, we construct agents that do not exhibit this 'flaw'. One might need to be careful with such wishes, though, since such an agent might monomaniacally pursue a peak and then statically stick to it once reached, which humans might disprefer, and which might itself be incoherent. (This has interested me for a while, and I am not yet convinced that human values do not contain some (fundamental?) incoherence, e.g. in the form of such loops. For better or for worse I expanded a bit on this below, though not at all formally and, I fear, less than clearly.)

So in summary, I think that if an agent

  1. has static preferences over complete world states
  2. is computationally boundless (enough), plans, and
  3. does not 'suffer' from the kind of near-term bias that humans seem to

then it cannot be made worse by money pumps around things it cares about.

I think it is very important to be as clear as we can about your questions, and I only responded to a small part of what you wrote. I might respond to more, hopefully in a more targeted and clearer way, if I have more time later. And I really hope that others provide answers to your questions as well.


Some bonus pondering is below, much less connected to your post; it just felt nice to think this through a little and perhaps invite others' thoughts as well.

Let's imagine the terminus of human preference satisfaction. Let's assume that all preferences are fulfilled, importantly enough: in a non-wire-headed fashion. What would that look like, at least abstractly?

a) Rejecting the premise: all (human) preferences can never be fulfilled. If one has n Dyson spheres, one can always wish for n+1. There will always be an orderable list of world states that we can inch ever higher on. And even if it's hard to imagine what someone might desire once they can have everything we currently can imagine (by definition), new desires will always spring up. In a sense, dissatisfaction might be a constant companion.

b) We find a static peak of human preferences. Hard to imagine what this might be, especially if we ruled out wireheading. Hard to imagine not dis-preferring it at least a little.

c) A (small or large) cycle is found and occupied at the top. This might also fulfill the (meta-)preference against boringness. But it's hard to escape that this might have to be a cycle. And if nothing else we are spending negentropy anyway, so maybe this is a helical money-pump spiral to the bottom still?

d) Something more chaotic is happening at the top of the preferences, with no true cycles, but maybe dynamically changing fads and fashions, never deviating much from the peak. It's hard to see how or why states would be transitioned to and from if one believes in cardinal utility. This still spends negentropy, but if, even setting that aside, we never truly return to a prior world-state, maybe it's not a cycle in the formal sense?

I welcome thoughts and votes on the above possibilities.

Any-Benefit Mindset and Any-Reason Reasoning
Dalmert · 6mo* · 30

I think this is a nice write-up; let me add some nuance in two directions:

Indeed these are quick-and-dirty heuristics that can be subpar, but you may or may not be surprised by just how often decisions don't reach even this bar. In my work, when we are about to make a decision, I sometimes explicitly have to ask: do we have even a single reason to pick the option that we were about to pick over one or more others? And I find myself saying that (one of) those other options actually has reason(s) for us to pick it--I didn't bring up the question for nothing, after all.

In these cases I could argue that we upgraded from no-reason deciding to at least any-reason deciding. (If we even did; in some contexts it's not unheard of for the answer to the above to be something along the lines of "I cannot name any reasons, but I still want to pick the first option.")

This is how we can cross from lower sophistication to the middle. However, there are perils to going ever higher: once we have identified at least one set of opposing reasons, we cross into a regime that can be immensely costly: weighing reasons against each other, especially when people disagree. I'd argue that people in general are quite bad at doing this, which is why it can take up a lot of resources and produce results of questionable arbitrariness.

Of course, all of this has to be balanced against how important the decision even is and how much effort, if any, should be expended on it. I think humans are quite bad at judging this too, but we do approximate it somewhat, at least with large variance.

Thank you for naming these patterns!

Open Thread Winter 2024/2025
Dalmert · 8mo · 190

I have found it! This was the one:

https://www.lesswrong.com/posts/qvNrmTqywWqYY8rsP/solutions-to-problems-with-bayesianism

Seems to have seen better reception at: https://forum.effectivealtruism.org/posts/3z9acGc5sspAdKenr/solutions-to-problems-with-bayesianism

The winning search strategy was quite interesting as well, I think:

I took the history of all LW articles I have roughly ever read; I had easy access to all the titles and URLs, but not the article contents. I fed them one by one into a 7B LLM, asking it to rate how likely, based on the title alone, the unseen article content was to match what I described above, as vague as that memory may be. Then I looked at the highest-ranking candidates, and they were duds. Did the same thing with a 70B model, et voilà, the solution was near the top indeed.
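In rough sketch form it looked something like the following (the model interface, the prompt wording, and the 0 to 10 scale here are stand-ins of my own for this comment, not exactly what I ran):

```python
import re

# Stand-ins: `query_llm` is whatever local 7B/70B model you have wired up,
# and MEMORY is the vague description of the article being searched for.
MEMORY = (
    "Extending point-estimate probability predictions and calibration curves "
    "to cases where the prediction itself affects the outcome."
)

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a local language model."""
    raise NotImplementedError("wire this up to your model of choice")

def score_title(title: str) -> float:
    """Ask the model how well a title matches the remembered article, 0 to 10."""
    prompt = (
        f"I am looking for an article matching this vague memory:\n{MEMORY}\n\n"
        f'On a scale of 0 to 10, how likely is an article titled "{title}" '
        f"to be the one I remember? Answer with a single number."
    )
    reply = query_llm(prompt)
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else 0.0

def rank_history(titles: list[str]) -> list[tuple[float, str]]:
    """Score every previously read title and return them best first."""
    return sorted(((score_title(t), t) for t in titles), reverse=True)
```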

Now I just need to re-read it to see if it was worth dredging up; I guess when a problem starts to itch, it's hard to resist solving it.

Open Thread Winter 2024/2025
Dalmert · 8mo · 50

Hey, can anyone help me find this article (likely LW, but it could be from the diaspora), especially if you might have read it too?

My vague memory: it was talking about (among other things?) some potential ways of extending point-estimate probability predictions and calibration curves to situations where making a prediction affects what the outcome will be, i.e. where a mind-reader/accurate-simulator is involved that bases its actions on your prediction. In that case, a two-dimensional probability estimate might be more appropriate: if 40% is predicted for event A, event B will have a probability of 60%; if 70% for event A, then 80% for event B; and so on, a mapping potentially continuously defined over the whole range. (Event A and event B might be the same event.) IIRC the article contained 2D charts where curves and rectangles were drawn for illustration.
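To illustrate the kind of object I have in mind (the response curve below is entirely invented by me, not something from the article): a mapping from the announced probability to the probability that actually results, where a self-consistent prediction is a fixed point of that mapping.

```python
import numpy as np

def outcome_prob(announced: float) -> float:
    """Probability the event actually happens, given the announced prediction.
    (Invented response curve: the simulator partially 'pushes back' on the forecast.)"""
    return 0.8 - 0.5 * announced

# A self-consistent prediction is a fixed point: announcing p makes p come true.
grid = np.linspace(0.0, 1.0, 10001)
fixed = grid[np.argmin(np.abs(outcome_prob(grid) - grid))]
print(f"self-consistent prediction ~= {fixed:.3f}")  # ~0.533 for this made-up curve
```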

IIRC it didn't have too many upvotes; more like around a low dozen, or at most the low hundreds.

Searches I've tried so far: Google, Exa, Gemini 1.5 with Deep Research, Perplexity, OpenAI GPT-4o with Search.

P.S. If you are also unable to put enough time into finding it, do you have any ideas for how it could be found?

Hire (or Become) a Thinking Assistant
Dalmert · 9mo · 40

I'm interested in variants of this from both sides. Feel free to shoot me a DM and let's see if we can set something up.

I haven't had a good label to put on things like this, but I've gravitated towards similar ways of working over the last 10-20 years, and I've very often found very good performance-boosting effects, especially where compatibility and trust could be achieved.

Reflections on Less Online
Dalmert · 1y · 170

If anyone reading this feels like they missed out, or this sparked their curiosity, or they are bummed that they might have to wait 11 months for a chance at something similar, or they feel like so many cool things happen in North America and so few in Europe (all preceding "or"s are inclusive), then I can heartily recommend coming to LessWrong Community Weekend 2024 [Applications Open] in Berlin, in about 2 months, over the weekend of 13 September. Applications are open as of now.

I've attended it a couple of times so far, and I quite liked it. Reading this article, it seemed very similar, and I began to wonder whether LWCW was a big inspiration for LessOnline, or whether they had a common source of inspiration. So I do mean to emphasize what I wrote in the first paragraph: if you think you might like something like what's described here, then I strongly encourage you to come!

(If someone has attended both, maybe they can weigh in even more authoritatively on whether my impression is accurate or whether more nuance would be beneficial.)

Sum-threshold attacks
Dalmert · 2y · 32

In a not-too-fast (and therefore necessarily stealthy) ASI takeover scenario, if the intelligence explosion is not too steep, this could be a main meta-method by which the system gains increasing influence and power while remaining fully under the radar and avoiding detection until it is reasonably sure that it can no longer be opposed. This could be happening without anyone knowing, or maybe even being able to know. Frightening.

Posts

25 · OpenAI Superalignment: Weak-to-strong generalization · 2y · 3
12 · Interview with Paul Christiano: How We Prevent the AI’s from Killing us · 2y · 0
4 · Personal predictions for decisions: seeking insights (Q) · 3y · 4