3 more posts I feel like I need to write at some point:
Solving all of ethics and morality and getting an AI to implement it seems hard. There are possible worlds where we would need to work with half measures. Some of these paths rely on lower auto-doom densities, but there seem to be enough such worlds that the approach is worth considering.
An example of 'good enough to not x/s-risk' dumb value alignment. The assumptions required for it to be stable. The shape of the questions it implies, and how they may differ from those raised by more complete solutions.
Make a bunch of diagrams of things I believe relevant to alignment stuff and how they interact, plus the implications of those things.
The real point of the post is to encourage people to try to make more explicit, extremely legible models, so that they can actually figure out where they disagree instead of running around in loops for several years.
Generalizing the principle from policy regularization.
Another item for the todo list: Autoregressive transformer gradient flow shapes earlier token computation to serve future predictions, but that early computation cannot condition on future tokens. This should serve as a regularizing influence on the internal structure of token predictions: in order to be useful to the largest possible set of future predictions, the local computation would need to factor itself into maximally reusable modules.
The greater the local uncertainty about the future, the less the local computation can be specialized to serve future tokens. Could consider it something like: the internal representation is a probability-weighted blend of representations useful to possible futures. If the local computation is highly confident in a narrow space, it can specialize more.
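A minimal numerical sketch of that blend claim, under my own stand-in assumption that "usefulness to a future" behaves like negative squared error: with that loss, the single local representation minimizing expected error across possible futures is exactly the probability-weighted average of the per-future ideal representations.

```python
import numpy as np

# Three hypothetical futures with probabilities p, each of which would
# ideally be served by a different local representation (rows of H).
# All numbers here are made up for illustration.
p = np.array([0.7, 0.2, 0.1])
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Under expected squared error, the best single representation is the
# probability-weighted blend of the per-future ideals.
h_star = p @ H
print(h_star)  # [0.8, 0.3]

# Sanity check: the blend beats a slightly perturbed representation
# on expected error, since it is the unique minimizer.
def expected_error(h):
    return float(np.sum(p * np.sum((H - h) ** 2, axis=1)))

print(expected_error(h_star) < expected_error(h_star + 0.01))  # True
```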
Simplicity biases would incentivize sharing modules more strongly. Even if the local computation suspects a narrower future distribution, it would be penalized for implementing specialized machinery that is too rarely useful.
One implication: many forms of token-parallelized search get blocked, because they require too much foresight-driven specialization.
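To make the gradient-flow asymmetry concrete, here's a minimal PyTorch sketch (toy sizes, attention weights folded away; illustrative rather than a claim about any real model): the causal mask stops early positions from ever seeing later tokens in the forward pass, yet a loss placed at the last position still sends gradient back into every earlier position's representation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 4, 8

# Token representations for one sequence; requires_grad so we can inspect
# which positions receive gradient from a future-position loss.
x = torch.randn(seq_len, d_model, requires_grad=True)

# Single-head causal self-attention with projection weights folded into
# the identity for brevity.
scores = x @ x.T / d_model ** 0.5
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float('-inf'))  # no forward access to the future
attn = F.softmax(scores, dim=-1)
out = attn @ x

# A loss defined only at the *last* position, standing in for a
# future-token prediction.
loss = out[-1].pow(2).sum()
loss.backward()

# Gradient reaches every earlier position: training pressure from future
# predictions shapes early-token computation, even though that computation
# never conditioned on the future tokens.
print(x.grad.abs().sum(dim=-1))  # nonzero at all positions
```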
Quarter-baked ideas for potential future baking:
I'm using the word "shard" here to mean just "a blob of conditionally activated preferences." The term probably imports other nuances that could be confusing; I haven't read enough shard theory to catch where my usage diverges from it.
This idea popped into my head during a conversation with someone working on how inconsistent utilities might be pushed towards coherence. It was at the Newspeak House the evening of the day after EAG London 2023. Unfortunately, I promptly forgot their name! (If you see this, hi, nice talking to you, and sorry!)
Another item for the todo list:
A naive implementation likely requires a fairly big CodeLlama-34b-Instruct-tier interpreter and can only operate on pretty limited programs, but it may produce something interesting. Trying to apply the resulting interpreter to circuits embedded in larger networks probably won't work, but... worth trying just to see what it does?
There might also be something interesting to learn in spanning the gap between 'compiled' networks and trained networks. How close do they come to being affine-equivalent? If no affine map suffices, what kind of transform is required (and how complicated is it)?
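A sketch of how the affine question could be probed, assuming you can collect matched hidden activations from a hand-compiled network and a trained one on the same inputs (the arrays below are random stand-ins for those activations): fit the best affine map by least squares and inspect the residual. A small residual would suggest near affine-equivalence; the structure of a large residual hints at what kind of nonlinear transform would be needed instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_compiled, d_trained = 1024, 32, 32

# Stand-ins for hidden activations gathered from the two networks on a
# shared batch of inputs; in a real experiment these would come from
# forward hooks on each model.
H_compiled = rng.normal(size=(n_samples, d_compiled))
H_trained = rng.normal(size=(n_samples, d_trained))

# Fit H_trained ≈ H_compiled @ W + b by least squares, with the bias
# absorbed into an augmented column of ones.
X = np.hstack([H_compiled, np.ones((n_samples, 1))])
coef, *_ = np.linalg.lstsq(X, H_trained, rcond=None)
W, b = coef[:-1], coef[-1]

# Relative residual: near 0 means the compiled and trained activations
# are close to affine-equivalent on this data.
residual = H_trained - (H_compiled @ W + b)
print("relative residual:", np.linalg.norm(residual) / np.linalg.norm(H_trained))
```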