...We can also construct an analogous simplicity argument for overfitting:
Overfitting networks are free to implement a very simple function, like the identity function or a constant function, outside the training set, whereas generalizing networks have to exhibit complex behaviors on unseen inputs. Therefore overfitting is simpler than generalizing, and it will be preferred by SGD.
Prima facie, this parody argument is about as plausible as the simplicity argument for scheming. Since its conclusion is false, we should reject the argumentative form on which it is based.
1: My understanding is that the classic arguments go something like this: Assume interpretability won't work (illegible CoT, probes don't catch most problematic things). Assume we're training our AI on diverse tasks and human feedback; it'll sometimes get reinforced for deception. Assume useful proxy goals for solving tasks become drives that the AI comes up with instrumental strategies to achieve, and deception is often a useful instrumental strategy. Assume that alien or task-focused drives win out over potential honesty etc. drives because they're favoured by inductiv...