
3 more posts I feel like I need to write at some point:

In defense of dumb value alignment

Solving all of ethics and morality and getting an AI to implement it seems hard. There are possible worlds where we would need to work with half measures. Some of these paths rely on lower auto-doom densities, but there seem to be enough such potential worlds for half measures to be worth considering.

An example of dumb value alignment that is 'good enough to avoid x/s-risk'; the assumptions required for it to be stable; and the shape of the questions it implies, which may differ from those of more complete solutions.

What I currently believe, in pictures

Make a bunch of diagrams of things I believe relevant to alignment stuff and how they interact, plus the implications of those things.

The real point of the post is to encourage people to try to make more explicit and extremely legible models so people can actually figure out where they disagree instead of running around in loops for several years.

Preparation for unknown adversaries is regularization

Generalizing the principle from policy regularization.

  1. Adversaries need not be actual agents working against you.
  2. "Sharp" models that aggressively exploit specific features have a fragile dependence on those features. Such models are themselves exploitable.
  3. Uncertainty and chaos are strong regularizers. The amount of capability required to overcome even relatively small quantities of chaos can be extreme. (A classical special case is sketched just after this list.)
  4. Applications in prediction.
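To make point 3 concrete, here's a minimal sketch of a classical special case, assuming a linear-regression setting; this is my own example rather than anything from the planned post, and the numbers are arbitrary. Fitting against noise-corrupted inputs is equivalent in expectation to ridge regression with penalty n * sigma^2, so preparing for unknown perturbations acts exactly like explicit regularization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 10_000, 5, 0.5

X = rng.normal(size=(n, d))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ w_true + rng.normal(scale=0.1, size=n)

# 1) Least squares against inputs hit by random perturbations ("unknown adversary").
X_noisy = X + rng.normal(scale=sigma, size=X.shape)
w_noise = np.linalg.lstsq(X_noisy, y, rcond=None)[0]

# 2) Explicit ridge regression with the matching penalty strength.
lam = n * sigma**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(w_noise)  # both solutions shrink toward zero relative to w_true...
print(w_ridge)  # ...and land close to each other
```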

Another item for the todo list:
Autoregressive transformer gradient flow shapes earlier token computation to serve future predictions, but that early computation cannot condition on future tokens. This should serve as a regularizing influence on the internal structure of token predictions: in order to be useful to the largest possible set of future predictions, the local computation would need to factor itself into maximally reusable modules.

The greater the local uncertainty about the future, the less the local computation can specialize to serve particular future tokens. You could think of it as something like: the internal representation is a probability-weighted blend of representations useful to possible futures. If the local computation is highly confident in a narrow space of futures, it can specialize more.

Simplicity biases would incentivize sharing modules more strongly. Even if the local computation suspects a narrower future distribution, it would be penalized for implementing specialized machinery that is too rarely useful.

One implication: many forms of token-parallelized search get blocked, because they require too much foresight-driven specialization.
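As a sketch of the claim (my illustration, not the author's code), a single PyTorch encoder layer with an additive causal mask can stand in for a decoder-only transformer; all sizes are arbitrary. Perturbing a future token leaves earlier positions' activations untouched, while a loss term at the final position still sends gradient into position 0's computation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d_model, seq_len = 50, 32, 8

embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, dropout=0.0, batch_first=True)
head = nn.Linear(d_model, vocab)
# Additive causal mask: -inf above the diagonal blocks attention to future tokens.
causal_mask = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)

tokens = torch.randint(0, vocab, (1, seq_len))

# Forward pass: perturbing the *last* token leaves every earlier position's
# activations unchanged -- local computation cannot condition on the future.
tokens_alt = tokens.clone()
tokens_alt[0, -1] = (tokens_alt[0, -1] + 1) % vocab
h = layer(embed(tokens), src_mask=causal_mask)
h_alt = layer(embed(tokens_alt), src_mask=causal_mask)
print(torch.allclose(h[0, :-1], h_alt[0, :-1]))  # True

# Backward pass: a loss term at the *last* position still sends gradient into the
# computation at position 0 -- training shapes early computation to serve the future.
emb = embed(tokens).detach().requires_grad_()
logits = head(layer(emb, src_mask=causal_mask))
logits[0, -1].logsumexp(dim=0).backward()  # any scalar function of the final logits
print(emb.grad[0, 0].abs().sum() > 0)      # True
```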

Quarter-baked ideas for potential future baking:

  1. A procedure for '~shardifying'[1] an incoherent utility function into a coherent utility function by pushing preferences into conditionals. An extreme example would be an ideal predictor (i.e. one which has successfully learned values fit to the predictive loss, not other goals, and does not exhibit internally motivated instrumental behavior) trained to perfectly predict the outputs of an incoherent agent.

    The ideal predictor model, being perfectly conditional, would share the same outputs but would retain coherence: inconsistencies in the original utility function are remapped to be conditional. Apparent preference cycles over world states are fine if the utility function isn't primarily concerned with world states. The ideal predictor is coherent by default; it doesn't need to work out any kinks to avoid stepping on its own toes. (A toy sketch of this remapping appears after the footnotes.)

    Upon entering a hypothetical capability-induced coherence death spiral, what does the original inconsistent agent do? Does it try to stick to object-level preferences, forcing it to violate its previous preferences in some presumably minimized way?[2] Or does it punt things into conditionality to maintain behaviors implied by the original inconsistencies? Is that kind of shardification convergent?
  2. Is there a path to piggybacking on greed/anticompetitive inclinations to restrict compute access for governance purposes? One example: NVIDIA already requires that data center customers purchase its vastly more expensive data center products. The driver license terms for the much cheaper gaming-class hardware already forbid use cases like "build a giant supercomputer for training big LLMs."

    This could be extended to, say, a dead man's switch built into the driver (sketched after the footnotes): if the GPU installation hasn't received an appropriate signal recently, implying that the relevant regulatory entity has not been able to continue its audits of the installation and its use, the cluster simply dies.

    Modified drivers could bypass some of the restrictions, but some hardware involvement would make it more difficult. NVIDIA may already be doing this kind of hardware-level signing to ensure that only approved drivers can be used (I haven't checked). It's still possible in principle to bypass, since the hardware and software are both in the hands of the enemy, but it would be annoying.

    Even if they don't currently do that sort of check, it would be relatively simple to add some form of it with a bit of lead time.

    Regulatory hurdles that NVIDIA (or other future dominant ML hardware providers) can swallow without stumbling too badly would give it a bit of extra moat against up-and-comers. It'd be in its interest to get the government to add those regulations, and it could then extract a bit more profit from hyperscalers.
  1. ^

    I'm using the word "shard" here to just mean "a blob of conditionally activated preferences." It probably imports some other nuances that might be confusing; I haven't read enough shard theory to catch where my usage diverges from it.

  2. ^

    This idea popped into my head during a conversation with someone working on how inconsistent utilities might be pushed towards coherence. It was at the Newspeak House the evening of the day after EAG London 2023. Unfortunately, I promptly forgot their name! (If you see this, hi, nice talking to you, and sorry!)
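A toy sketch of the remapping in idea 1, in my own framing rather than anything from shard theory: pairwise choices that are cyclic over bare world states can't be reproduced by any utility function over those states, but a utility function over (current state, candidate) pairs, i.e. with the preferences pushed into conditionals, reproduces them with no cycle. The states and numbers are arbitrary:

```python
# At A (offered B) choose B; at B (offered C) choose C; at C (offered A) choose A.
# No utility over {A, B, C} alone can reproduce this: it would need
# u(B) > u(A), u(C) > u(B), and u(A) > u(C) simultaneously.
cyclic_choices = {("A", "B"): "B", ("B", "C"): "C", ("C", "A"): "A"}

# "Shardified" version: push the preferences into conditionals by scoring
# (current state, candidate) pairs instead of bare states. Numbers are arbitrary.
conditional_u = {
    ("A", "A"): 0.0, ("A", "B"): 1.0,
    ("B", "B"): 0.0, ("B", "C"): 1.0,
    ("C", "C"): 0.0, ("C", "A"): 1.0,
}

def choose(current: str, offered: str) -> str:
    # Maximize the *conditional* utility given the current state.
    return max((current, offered), key=lambda option: conditional_u[(current, option)])

# Same outputs as the incoherent agent, but the utility function itself has no cycle.
assert all(choose(cur, off) == pick for (cur, off), pick in cyclic_choices.items())
```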
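And a hypothetical sketch of the dead man's switch in idea 2, in Python for brevity even though a real version would live in the driver or firmware; the key, message format, grace period, and disable_accelerators hook are all invented for illustration, and a real scheme would presumably use hardware-backed asymmetric signatures:

```python
import hashlib
import hmac
import json
import time

REGULATOR_KEY = b"stand-in shared secret"  # invented for illustration
MAX_SILENCE_SECONDS = 30 * 24 * 3600       # arbitrary grace period between audits

def heartbeat_is_valid(message: bytes, tag: str) -> bool:
    """Check that a heartbeat came from the regulator and is still fresh."""
    expected = hmac.new(REGULATOR_KEY, message, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag):
        return False
    issued_at = json.loads(message)["issued_at"]
    return time.time() - issued_at < MAX_SILENCE_SECONDS

def gate_compute(latest_message: bytes, latest_tag: str) -> None:
    # No fresh, valid heartbeat implies audits have lapsed: refuse to bring up compute.
    if not heartbeat_is_valid(latest_message, latest_tag):
        disable_accelerators()

def disable_accelerators() -> None:  # invented driver hook
    raise SystemExit("no valid regulator heartbeat; refusing to initialize GPUs")
```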

Another item for the todo list:

  1. Compile neural networks from fountains of autogenerated programs.
  2. Generate additional permutations by variously scrambling compiled neural networks.
  3. Generate more "natural" neural representations by training networks to predict the mapping implied by the original code.
  4. Train an interpreter to predict the original program from the neural network.

A naive implementation likely requires a fairly big CodeLlama-34b-Instruct-tier interpreter and can only operate on pretty limited programs, but it may produce something interesting. Trying to apply the resulting interpreter to circuits embedded in larger networks probably won't work, but... worth trying just to see what it does?

There might also be something interesting to learn in spanning the gap between 'compiled' networks and trained networks. How close do they come to being affine equivalents? If they aren't, what kind of transform is required (and how complicated is it)?
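A toy sketch of step 3 under stand-in assumptions (the program, architecture, and training budget are placeholders of my own choosing): treat one auto-generated program as a ground-truth mapping and train a small MLP to imitate it; the resulting (program, weights) pair would then become one training example for step 4's interpreter:

```python
import torch
import torch.nn as nn

def generated_program(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for one sample from the fountain of autogenerated programs.
    return torch.where(x[:, 0] > 0, x[:, 1] * 2.0, -x[:, 2])

net = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2_000):
    x = torch.randn(256, 3)
    loss = nn.functional.mse_loss(net(x).squeeze(-1), generated_program(x))
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())  # the trained net now approximates the original program
```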
