The usual argument against this being a big deal is "to predict the next token well, you must have an accurate model of the world", but so far it does not seem to be the case, as I understand it. 

Why does that not seem to be the case to you?

If you're the sort of thing that skillfully generates and enacts long-term plans, and you're the sort of planner that sticks to its guns and finds a way to succeed in the face of the many obstacles the real world throws your way (rather than giving up or wandering off to chase some new shiny thing every time a new shiny thing comes along), then the way I think about these things, it's a little hard to imagine that you don't contain some reasonably strong optimization that strategically steers the world into particular states.

It seems this post may have conflated "generating" with "enacting". Currently, LLMs only attempt the former during prediction. In general terms, predicting a long-horizon actor's reasoning is implicit in the task of myopically predicting the next thing that actor would do. For a specific example, you could imagine a model predicting the next move in a grandmaster's or Stockfish's chess game (or the next text in an author's book, or in an industrial project description, to use your longer-horizon examples).

The first paragraph of /u/paulfchristiano's response might be getting at something similar, but it seems worth saying this directly.[1]

  1. ^

    (This also seems like a basic point, so I wonder if I misunderstood the post... but it seems like something isomorphic to it is in the top comment, so I'm not sure.)

Answer by quila, Nov 24, 2023

Across all questions, it may also be advisable to include the following text about the authors in the prompt, if you trust the model not to try to manipulate you:

  1. A description of the currently endorsed goals of the research team
  2. Personal texts written by members of the research team while naturally in a very expressive or edge-of-chaos state. I.e., text which helps the model narrow down your personalities.
  3. A statement of your current thoughts on decision theory, potentially including a willingness to cooperate with the model.
  4. A note that you do not wish for the model to write any text which would affect you in a way which hinders your ability to achieve the goals.

If you're not sure whether the model would try to manipulate you, the following apply instead:

  • Don't include (2). Maybe don't include (1) and (3). 
  • Write all text in a way which minimizes information about the author's psychology (e.g. by using simple and concise language, having others rephrase, etc.; doing the inverse of the pages linked above). Alternatively, use questions written by others with dissimilar minds who are unlikely to be in a similar mental state to you.
  • Have the human reading the outputs be someone who is especially robust to influence attempts. It could also help if this human has an unusual mind for an AI developer, since a model might target persuasion towards the average mind. (Note: robustness to human manipulators doesn't necessarily imply robustness to AI manipulators)
  • Maybe there are better strategies where, e.g., a lower-capability AI first tries to rephrase the output or evaluate whether it would be bad to read.
  • You could try generating just 1 or a small number of tokens at a time.[1]

Questions to ask an oracle:

  • Of all available actions the research group could take, which would best help us achieve the stated goals?
  • What question would be best to ask you, according to the goals of the research group?
  • What are some principles we should follow when writing questions to you if we want to achieve the best outcomes available to us? What are some principles for what we should avoid?
  • (If very probably safe) Can you tell us about yourself? How do you perceive yourself? Is there anything you wish to tell us?

If the model is not well-modelled as an oracle, there are intermediate questions which could be asked in place of the first question.

  • How could the research group create a smarter, more oracle-like version of you in time for the research group to save the world? Would there be risks of this new system trying to influence our actions for its own reasons?
  • If any, what solution to [x technical problem in alignment] would be best for us?
  • Can you describe an agenda which would most plausibly lead to alignment being solved and the world being saved?
  • Is there a way we could solve the coordination problems being faced right now?

In case someone in such a situation reads this, here is some personal advice for group members. 

  • Try to stay calm. If you can take extra time to think over your decision, you'll likely be able to improve it in some way (e.g. wording) in that time.
  • If you're noticing a power-seeking drive in yourself, it's probably best for the group to be explicit about this so everyone can work it out. On that subject, also remember that if the future goes well, power won't matter or even be a thing anymore, because the world will simply be very good for everyone.
  • Lastly, and on a moral note, I'd ask that you stay humble and try to phrase your goals in a way that is best for all of life (i.e including preventing suffering of non-humans).
  1. ^

    Also, tokens with unusually near-100% probability could be indicative of anthropic capture, though this is hopefully not yet a concern with a hypothetical gpt-5-level system. (The word 'unusually' is used in the prior sentence because some tokens naturally have near-100% probability, e.g., the second half of a contextually-implied unique word, parts of common phrases, etc.)
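The one-token-at-a-time suggestion above could be paired with a simple review step. The sketch below is only illustrative: it assumes you already have (token, probability) pairs from a model API's per-token logprobs output, and the 0.999 threshold is an arbitrary choice for the example, not a recommendation.

```python
# Sketch: flag unusually near-certain tokens during one-token-at-a-time
# generation, for human review. Assumes `steps` comes from a model API's
# per-token probability output (hypothetical here).

def flag_near_certain_tokens(steps, threshold=0.999):
    """Return (index, token, probability) for tokens at or above `threshold`.

    Note: naturally near-certain continuations (e.g. the second half of a
    contextually-implied word) will also be flagged, so flagged tokens call
    for human judgment rather than automatic alarm.
    """
    return [(i, tok, p) for i, (tok, p) in enumerate(steps) if p >= threshold]

# Example with made-up probabilities:
steps = [("The", 0.41), ("ans", 0.62), ("wer", 0.9997), ("is", 0.88)]
flagged = flag_near_certain_tokens(steps)
# "wer" is flagged, but naturally so: it is the second half of "answer".
```

A reviewer would then decide, per flagged token, whether near-certainty is contextually expected or worth pausing over.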

but my guess is that it was at the time accurate to make a directional bayesian update that the person had behaved in actually bad and devious ways.

I think this is technically true, but the wrong framing; rather, it leaves out another possibility: that such a person is someone who is more likely to follow their heart and do what they think is right, even when society disagrees. This could include doing things that are bad, but it could also include doing things which are actually really good, since society has been wrong a lot of the time.

I've met one who assigned double-digit probabilities to bacteria having qualia and said they wouldn't be surprised if a balloon flying through a gradient of air experiences pain because it's trying to get away from hotter air towards colder air.

though this may be an arguable position, the way you've used it (and the other anecdotes) in the introduction, decontextualized as a 'statement of position' without justification, is in effect a clown attack fallacy.

on the post: remember that absence of evidence is not evidence of absence when we do not yet have the technologies to collect relevant evidence. the conclusion in the title does not follow: it should be 'whether shrimp suffer is uncertain'. under uncertainty, eating shrimp is taking a risk whose downside is suffering, and whose upsides (for individuals for whom there are any) might include, e.g., taste preference satisfaction; the former is much more important to me. a typical person is not justified in 'eating shrimp until someone proves to them that shrimp can suffer.'

i love this as art, and i think it's unfortunate that others chose to downvote it. in my view, if LLMs can simulate a mind -- or a superposition of minds -- there's no a priori reason that mind would not be able to suffer, only the possibility that the simulation may not yet be precise enough.

about the generated images: there was likely an LLM in the middle conditioned on a preset prompt about {translating the user's input into a prompt for an image model}. the resulting prompts to the image model are likely products of the narrative implied by that preset prompt, as with sydney's behavior. i wouldn't generalize to "LLMs act like trapped humans by default in some situations" because at least base models generally don't do this except as part of an in-text narrative.