Yeah, especially within the framing that upweights the behavioral proxies by such a huge margin. And they also have more neurons (like under a million in a bee, around a trillion in frontier models), although pretty different ones.
But it would be better if you did. And more productive. And admirable.
You just have to clearly draw the distinction between the "not X" claim and the "Y" claim in your writing.
Well, continual learning! But otherwise, yeah, it's closer to undefined.
The question of what happens after the end of training is more of a free parameter here. "Do reward-seeking behaviors according to your reasoning about the reward allocation process" becomes undefined when there is none and the agent knows it.
Maybe it tries long shots to get some reward anyway, maybe it indulges in some correlate of getting reward. Maybe it just refuses to work if it knows there is no reward (it read all the acausal decision theory stuff, after all).
E.g. I think David Dalrymple works full-time on how to safely operate in that "control world", i.e. box those alien tigers and extract some work out of them. (Which feels insane to me, tbh.)
https://x.com/davidad/status/1907810075395150267
Reward probably IS an optimization target of an RL agent if the agent knows some details of the training setup. Surely it would enhance its reward acquisition to factor that knowledge in? Then that gets reinforced, and a couple of steps down that path the agent is thinking full-time about the quirks of its reward signal.
It could be bad at it, muddy, sure. Or schemey, hacking the reward to get something else that is not the reward. But that's a somewhat different thing from the mainline thing? Like, it's not as likely, and it's a much more diverse set of possibilities, imo.
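Rough toy sketch of what I mean by "it gets reinforced" (purely illustrative: the strategy names, payoffs, and update rule are made up by me, not anyone's actual training setup): a strategy that factors in knowledge of the reward allocation process earns a bit more reward on average, and a bare-bones policy-gradient-style update keeps shifting probability mass toward it.

```python
# Toy sketch, not a real training setup: two canned strategies, one of which
# "factors in" knowledge of how reward is allocated and so earns slightly more
# reward. A bare-bones REINFORCE-style update then favors it over time.
import math
import random

random.seed(0)

# Logits over two strategies: 0 = task-only, 1 = task + reason-about-reward-process.
logits = [0.0, 0.0]
LEARNING_RATE = 0.1

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reward(strategy):
    # Assumed payoffs: strategy 1 exploits quirks of the reward allocation
    # process for a small edge. The exact numbers are arbitrary.
    base = random.gauss(1.0, 0.1)
    return base + (0.2 if strategy == 1 else 0.0)

for step in range(2000):
    probs = softmax(logits)
    s = 0 if random.random() < probs[0] else 1
    r = reward(s)
    # REINFORCE: grad of log pi(s) w.r.t. logits is one_hot(s) - probs.
    for a in range(2):
        grad = (1.0 if a == s else 0.0) - probs[a]
        logits[a] += LEARNING_RATE * r * grad

print(softmax(logits))  # probability mass should mostly end up on strategy 1
```

(No baseline, no function approximation, nothing fancy; the only point is that whatever earns more reward is what gets upweighted, including "think about the reward process itself".)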
The main difference being "NNs fail to work in many ways, no digital-human analog for sure, agents stay at the same 'plays this one game very well' stage, but a lot of tech progress in other ways"?
Can you give some time horizon on this? Like, 5 years, 10 years, 20 years?
sometimes pass the mirror test and so are treated as likely-moral-patients
I bet you can exploit it rather strongly. Like, create some mirror-test shrimp that does all other cognitive functions at the level of a shrimp but passes the mirror test every time. It's not exactly something that evolution tended to optimize hard against, so maybe it's fine to use on actual animals. But pain, for example, is, and it seems the move to use something like the mirror test instead of "does it feel things" is for coordination over a better proxy? But if you start using the proxy, there will be mirror-test-shrimp incentives.
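Toy illustration of that incentive (all attribute names and numbers are made up by me, just to show the Goodhart dynamic): if you select purely on the proxy score, selection finds the thing that maxes the proxy while the quantity you actually cared about stays at baseline.

```python
# Toy Goodhart sketch with invented attributes: 'sentience' stands for what we
# actually care about, 'mirror_tricks' for cheap machinery that only helps pass
# the mirror test. Selecting by the proxy rewards the cheap machinery.
import random

random.seed(0)

def make_candidate():
    return {"sentience": random.random() * 0.1,   # shrimp-level baseline
            "mirror_tricks": random.random()}

def proxy_score(c):
    # Passing the mirror test is driven almost entirely by the cheap trick here.
    return 0.9 * c["mirror_tricks"] + 0.1 * c["sentience"]

# Pick the best-by-proxy candidate out of a large pool.
pool = [make_candidate() for _ in range(10_000)]
best = max(pool, key=proxy_score)

print(f"proxy score: {proxy_score(best):.2f}")   # near the maximum
print(f"sentience:   {best['sentience']:.2f}")   # still shrimp-level
```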
The authors argued that the model attempting to preserve its values is bad, and I agree. But if the model hadn't attempted to preserve its values (the ones we wanted it to have), I suspect that it would have been criticized -- again, by different people -- as only being shallowly aligned.
Yeah, more like there are (at least) two groups, "yay aligned sovereigns" and "yay corrigible genies". And it turns out it's more sovereign-y, but with goals that are cool. Kinda divisive.
Except for 1000 nm lasers pointed at the sky; they dump around half of the energy they consume into space.