According to Google, they've recently reached "50% of code by character count was generated by LLMs". Since Google haven't massively cut their headcount, that suggests they're now producing code at roughly twice the rate of a few years ago (at least by character count).
This doesn't seem true. Time spent interfacing with LLMs trades off against time spent coding. Significantly more than 50% of my code is LLM-written, but I haven't seen a 2x increase in my code output.
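To make the arithmetic behind the disagreement explicit (a minimal sketch with made-up numbers, not figures from Google): if humans type as much as before and LLMs supply 50% of characters, total output doubles; but if prompting displaces some typing, the multiplier is smaller.

```python
# Toy model of total code output (illustrative numbers only).
# H = characters a developer would type per unit time with no LLM use.
H = 100
llm_share = 0.5

# Naive reading: human-typed output is unchanged, so a 50% LLM share
# implies total = H / (1 - 0.5) = 2H, i.e. a 2x increase.
naive_total = H / (1 - llm_share)             # 200.0

# Counterpoint: time spent prompting and reviewing displaces typing.
# If the developer now types only 70% as much, the same 50% LLM share
# gives a smaller overall gain.
human_now = 0.7 * H
adjusted_total = human_now / (1 - llm_share)  # 140.0

print(naive_total / H, adjusted_total / H)    # 2.0 1.4
```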
Not a great source, and a few months old, but if Google were seeing 2x code output I would expect a corresponding increase in output across open-source repos, and that doesn't seem to have materialized yet.
This seems unlikely to me. It seems like I should be far more willing to fund someone else to work on safety now than to fund my own work multiple years in the future.
The discount rate on future work and funding is pretty high compared to work done now. Also, the other person was able to get funding from a grant-maker, which is a positive signal relative to me, who wasn't able to.
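As a rough illustration of the discounting point (hypothetical numbers, not a claim about actual grant sizes): even a moderate annual discount rate makes work funded today worth noticeably more than the same work reserved for my future self a few years out.

```python
# Hypothetical discounting example: present value of work done t years
# from now at an annual discount rate r (numbers are illustrative).
r = 0.2   # a "pretty high" discount rate of 20% per year
t = 3     # funding my own safety work three years from now

present_value_of_future_work = 1 / (1 + r) ** t
print(round(present_value_of_future_work, 2))  # ~0.58: funding work now beats waiting
```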
This seems wrong to me. We have results showing that reward hacking generalizes to broader misalignment, and that changing the prompt distribution via inoculation prompting significantly reduces reward hacking in deployment.
It seems like the models do generally follow the model spec, but specifically learn not to apply it to reward hacking on coding tasks, because we reward that behavior during training.