Wiki Contributions



Would you say that models designed from the ground up to be collaborative and capabilitarian would be a net win for alignment, even if they're not explicitly weakened in terms of helping people develop capabilities? I'd worry that they would multiply human efforts across the board; with humans spending more effort on capabilities than on alignment, that's still a net negative.

I really appreciate the call-out that modern RL for AI does not equal reward-seeking (though I also appreciate @tailcalled's reminder that historical RL did involve reward during deployment); this point has been made before, but not so thoroughly or clearly.

A framing that feels alive for me is that AlphaGo didn't significantly innovate in goal-directed search itself (applying MCTS was clever, but not new), but did innovate in data generation (using search to generate training data, which in turn improves the search) and in offline RL.



Here the difference seems only to be spacing, but I've also seen bulleted lists appear. I think, though I can't recall for sure, that I've seen something similar happen to top-level posts.

@Habryka @Raemon I'm experiencing weird rendering behavior on Firefox on Android. Before voting, comments are sometimes rendered incorrectly in a way that gets fixed after I vote on them.

Is this a known issue?

This is also mitigated by automatically generated images like Gravatar identicons or the OpenSSH key randomart. I wonder if they can be made small enough to add to usernames everywhere while still maintaining enough distinguishable representations.
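To make the "small enough while still distinguishable" trade-off concrete, here's a minimal sketch (my own illustration, not any site's actual implementation) of a Gravatar-style identicon: hash the username and fill a small left-right-symmetric grid from the bits. A symmetric 5x5 grid carries 15 free bits, i.e. 32,768 distinct patterns, which bounds how distinguishable the images can be at that size.

```python
import hashlib

def identicon(username: str, size: int = 5) -> str:
    """Deterministic identicon-style pattern from a username hash.

    Mirrored left/right (like Gravatar's default identicons), so a
    size x size grid has only 2 ** (size * ((size + 1) // 2)) distinct
    patterns -- the real constraint on shrinking these images.
    """
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    half = (size + 1) // 2
    # Pull one bit per cell of the left half of the grid.
    bits = [(digest[i // 8] >> (i % 8)) & 1 for i in range(size * half)]
    rows = []
    for r in range(size):
        left = bits[r * half:(r + 1) * half]
        # Mirror the left half; for odd sizes the center column is shared.
        row = left + (left[-2::-1] if size % 2 else left[::-1])
        rows.append("".join("#" if b else "." for b in row))
    return "\n".join(rows)
```

Being hash-derived, the pattern is stable for a given username, which is the property that makes it useful next to a display name.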

Note that every issue you mentioned here can be dealt with by trading off capabilities.

Yes. The trend I see is "pursue capabilities, worry about safety as an afterthought if at all". Pushing the boundaries of what is possible on the capabilities front subject to severe safety constraints is a valid safety strategy to consider (IIRC, this is one way to describe davidad's proposal), but most orgs don't want to bite the bullet of a heavy alignment tax.

I also think you're underestimating how restrictive your mitigations are. For example, your mitigation for sycophancy rules out RLHF, since the "HF" part lets the model know which responses are desired. Also, for deception: I wasn't specifically thinking of strategic deception. For general deception, limiting situational awareness doesn't prevent it from arising (though it lessens the danger), and if you want to avoid the capability entirely, you'd need to avoid any mention of concepts like honesty in the training data.

The "sleeper agent" paper, I think, needs to be reframed. The model isn't plotting to install a software backdoor; the training data instructed it to do so. Put simply, sabotaged information was used for model training.

This framing is explicitly discussed in the paper. The point was to argue that RLHF which doesn't target the hidden behavior won't eliminate it. One threat to validity is the possibility that artificially induced hidden behavior differs from naturally occurring hidden behavior, but that's not a given.

If you want a 'clean' LLM that never outputs dirty predicted next tokens, you need to ensure that 100% of its training examples are clean.

First of all, it's impossible to get 100% clean data, but there is a question of whether five nines of cleanliness would be enough; it shouldn't be, if you want a training pipeline that's capable of learning from rare examples. Separately, some behavior is subtle or emergent: examples include power-seeking, sycophancy, and deception. You can't reliably eliminate these from the training data because they're not purely properties of the data.
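Rough arithmetic makes the "five nines" point vivid (the corpus size below is an assumed round number for illustration, not a claim about any particular model): even at 99.999% cleanliness, a pretraining-scale corpus retains far more dirty tokens than a pipeline that learns from rare examples can safely ignore.

```python
# Assumed: a 10-trillion-token pretraining corpus (illustrative figure).
corpus_tokens = 10**13
cleanliness = 0.99999            # "five nines" clean
dirty_tokens = corpus_tokens * (1 - cleanliness)
print(f"{dirty_tokens:.0f}")     # prints 100000000 -- 100M dirty tokens
```

If the model can learn a behavior from a handful of examples, a hundred million residual dirty tokens is nowhere near "never outputs dirty tokens."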

Since the sleeper agents paper, I've been thinking about a special case of this, namely trigger detection in sleeper agents. The simple triggers in the paper seem like they should be easy to detect by the "attention monitoring" approach you allude to, but it also seems straightforward to design more subtle, or even steganographic, triggers.

A question I'm curious about is whether hard-to-detect triggers also induce more transfer from the non-triggered case to the triggered case. I suspect not, but I could see it happening, and it would be nice if it did.

Let's consider the trolley problem. One consequentialist solution is "make whichever choice leads to the best utility over the lifetime of the universe", which is intractable. The meta-principle rules it out as follows: if, for example, you learned that one of the five was on the brink of starting a nuclear war and the lone one was on the brink of curing aging, it would say switch; if the two identities were flipped, it would say stay; and in general there are too many unobservables to consider. By contrast, a simple utilitarian approach of "always switch" is allowed by the principle, as are approaches that take into account demographics or personal importance.

The principle also suggests that killing a random person on the street is bad, even if the person turns out to be plotting a mass murder, and conversely, a doctor saving said person's life is good.

Two additional cases where the principle may be useful and doesn't completely correspond to common sense:

  • I once read an article by a former vegan arguing against veganism and vegetarianism; one example was that harvesting grain involves many painful deaths of field mice, which is not obviously better than killing one cow. Applying the principle suggests that suffering or indirect death can't straightforwardly be the basis for these dietary choices, and that consent is on shaky ground.
  • When thinking about building a tool (like the LW infrastructure) that could be either hugely positive (because it leads to aligned AI) or hugely negative (because it leads to unaligned AI by increasing AI discussion), with no real way to know which, you are morally free to build it or not. Any steps you take to increase the likelihood of a positive outcome are good, but you are not required to stop building the tool because of a huge unknowable risk. Of course, if there's a compelling reason to believe the tool is net-negative, that reduces the variance and suggests you shouldn't build it (as with most AI capabilities advancements).

Framed a different way, the principle is, "Don't tie yourself up in knots overthinking." It's slightly reminiscent of quiescence search: both address a "horizon effect" problem, but the principle does so by discarding evaluation heuristics that are not locally stable.
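To make the analogy concrete, here's a toy sketch of quiescence search (the tree and values are invented for illustration): the static evaluation is only trusted at "quiet" positions, where it is locally stable, and "noisy" moves (captures) are searched until the position settles.

```python
from math import inf

# Toy game tree: a position is (static_eval, noisy_children), where
# static_eval is from the side-to-move's perspective and noisy_children
# are positions reached by "noisy" moves such as captures.

def quiescence(pos, alpha=-inf, beta=inf):
    """Negamax quiescence search: apply the static eval only where it
    is locally stable, searching noisy moves until the position is quiet."""
    stand_pat, noisy_children = pos
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)        # we may decline to capture
    for child in noisy_children:
        score = -quiescence(child, -beta, -alpha)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha

# After our capture the naive eval says +5, but the opponent recaptures,
# leaving the fully resolved exchange worth only +2 to us.
grandchild = (2, [])                     # quiet: exchange resolved
child = (-5, [grandchild])               # opponent to move, can recapture
root = (0, [child])

naive_one_ply = -child[0]                # stops at the horizon: +5
resolved = quiescence(root)              # searches until quiet: +2
```

The one-ply search overestimates (+5) because it evaluates mid-exchange, right at the horizon; quiescence keeps searching until the evaluation is stable (+2), which is the same move the meta-principle makes against morally unstable heuristics.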

This makes me think that a useful meta-principle for applying moral principles in the absence of omniscience is "robustness to auxiliary information." Phrased another way: if the variance of the outcomes of your choices is high according to a moral principle, then in all but the most extreme cases, either find more information or pick a different moral principle.
