It feels to me that this is a search for an answer in the wrong place. If your problem is overthinking, you are not trying to find an ethical theory that justifies less thinking; you cure overthinking by developing skills under the general label "cognitive awareness". At some level, you can just stop thinking harmful thoughts.
Generally, the statement "solutions of complex problems are easy to verify" is false. Your problem can be EXPTIME-complete, and EXPTIME-complete problems are provably not in P (by the time hierarchy theorem). Such a problem is not known to be in NP, and it certainly is not in NP if P = NP.
And even if a problem is in NP, we often don't know the verification algorithm.
I have a kinda symmetric feeling about "practical" research: "Okay, you have found that a one-layer transformer without MLP approximates skip-trigram statistics; how does that generalize to the question 'does GPT-6 want to kill us all?'"
(I understand this feeling is not rational, it just shows my general inclination towards "theoretical" work)
I think the rule is "you maximize your bank account, not the addition to it". I.e. the value of a deal to you depends on what you already have.
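A minimal sketch of that rule, assuming (my assumption, not something stated above) that utility is the logarithm of total wealth. The function name `deal_value` and all numbers are hypothetical; the only point is that the same deal changes sign depending on the bank account it is added to.

```python
import math

def deal_value(wealth: float, gain: float, loss: float, p_win: float = 0.5) -> float:
    """Change in expected utility of TOTAL wealth from accepting a deal.

    Assumption: utility is log(wealth). The deal pays `gain` with
    probability `p_win` and costs `loss` otherwise.
    """
    if loss >= wealth:
        raise ValueError("log utility is undefined if the loss wipes out all wealth")
    before = math.log(wealth)
    after = p_win * math.log(wealth + gain) + (1 - p_win) * math.log(wealth - loss)
    return after - before

# The same deal (+1500 or -1000 at even odds), two different bank accounts.
poor = deal_value(wealth=1_500, gain=1_500, loss=1_000)
rich = deal_value(wealth=100_000, gain=1_500, loss=1_000)
print(poor < 0, rich > 0)
```

The deal has positive expected money in both cases, but under this assumed utility it is only worth taking for the larger account.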
I want to say "yes, but this is different", but not in the sense of "I acknowledge the existence of your evidence, but ignore it".
My intuition tells me that we don't "induce" taskiness in modern systems; it just happens because we don't build them general enough. It probably won't hold when we start building models of capable agents in natural environments.
unpacking inner Eliezer model
If we live in a world where a superintelligent AGI can't have an advantage in long-term planning over humans assisted by non-superintelligent narrow AIs (I frankly don't believe that we live in such a world), then the superintelligent AGI won't make complex long-term plans where it doesn't have an advantage. It will make simple short-term plans where it does have an advantage, like "use superior engineering skills to hack into computer networks, infect as many computers as possible with its source code adapted for hidden distributed computation (this is the point of no return), design nanotech, train itself to an above-average level in social engineering, find people gullible and skilled enough to build the nanotech, create enough smart matter to sustain the AGI without human infrastructure, kill everybody, pursue its unspeakable goals in the dead world".
Even if we imagine an "AI CEO", the best (human-aligned!) strategy I can imagine for such an AI is "invent immortality, buy the whole world for it", not "scrutinize KPIs".
Next, I think your ideas about short/long-term goals are underspecified because you don't take into account the distinction between instrumental and terminal goals. Yes, human software engineers pursue the short-term instrumental goal of "creating a product", but they do it in the process of pursuing long-term terminal goals like "be happy", "prove themselves worthy", "serve humanity", "have nice things", etc. It's quite hard to find a system with short-term terminal goals, as opposed to a short-term planning horizon due to computational limits. To put it another way, taskiness is an unsolved problem in AI alignment. We don't know how to tell a superintelligent AGI "do this, don't do anything else, especially please don't disassemble everyone in the process of doing this, and stop after you've done this".
If you believe that "extract short-term modules from a powerful long-term agent" is the optimal strategy in some sense (I don't even think that we can properly identify such modules without huge alignment work), then the powerful long-term agent knows this too; it knows that it's on a time limit before you dissect it, and it will plan accordingly.
Claims 3 and 4 imply the claim "nobody will invent some clever trick to avoid these problems", which seems implausible to me.
Problems with claims 5 and 6 are covered in Nate Soares's post about the sharp left turn.
Can time-limited satisfaction be a sufficient condition for completing a task?
Several quick thoughts about reinforcement learning:
Did anybody try to invent a "decaying"/"bored" reward that decreases if the agent performs the same action over and over? It resembles a real addiction mechanism in mammals and could be the clever trick that solves the reward hacking problem.
Additional thought: how about a multiplicative reward? Let's suppose that we have several reward functions that are easy to evaluate from sensory data and that somehow correlate with the real utility function. Does multiplying them make reward hacking more difficult?
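To make the two ideas concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the names `BoredReward` and `multiplicative_reward`, the geometric decay schedule, and the requirement that proxy rewards are scaled to [0, 1]. It is not a claim about how any existing RL library does reward shaping.

```python
from math import prod
from typing import Hashable, Optional, Sequence

class BoredReward:
    """Shrinks the reward while the agent keeps repeating the same action.

    Assumption: the decay is geometric in the length of the current streak,
    and "boredom" resets as soon as a different action is taken.
    """

    def __init__(self, decay: float = 0.5) -> None:
        self.decay = decay
        self.last_action: Optional[Hashable] = None
        self.streak = 0  # how many times in a row last_action was already taken

    def __call__(self, action: Hashable, base_reward: float) -> float:
        if action == self.last_action:
            self.streak += 1
        else:
            self.last_action = action
            self.streak = 0
        return base_reward * self.decay ** self.streak


def multiplicative_reward(proxies: Sequence[float]) -> float:
    """Product of proxy rewards, each assumed to lie in [0, 1].

    Driving one proxy to its maximum while another collapses to 0 yields 0,
    whereas a sum would still pay out for the maximised proxy.
    """
    return prod(proxies)


bored = BoredReward(decay=0.5)
same_action = [bored("press_lever", 1.0) for _ in range(3)]  # decays each repeat
after_switch = bored("explore", 1.0)                         # streak resets

hacked = [1.0, 1.0, 0.0]    # two proxies maxed, one ignored
balanced = [0.7, 0.7, 0.7]  # all proxies moderately satisfied
print(same_action, after_switch)
print(sum(hacked), sum(balanced))                                      # nearly tied
print(multiplicative_reward(hacked), multiplicative_reward(balanced))  # hacked gets 0
```

A streak-based decay is easy to defeat by alternating between two actions, so per-action counts with slow recovery might be a better fit; the product only helps if no single proxy can be hacked without hurting another.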
Some approaches to alignment rely on the identification of agents. Agents can be understood as algorithms, computations, etc. Can an ANN efficiently identify a process as computationally agentic and describe its algorithm? A toy example that comes to mind is a neural network that takes a number series as input and outputs a formula for the function. It would be interesting to see if we can create an ANN that can assign computational descriptions to arbitrary processes.
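To pin down the input/output contract of that toy task, here is a deliberately non-neural baseline: a brute-force lookup over a tiny, hand-picked set of candidate formulas. The candidate set and the name `describe_series` are hypothetical; an ANN version would have to replace the lookup with something learned, and this sketch says nothing about whether that is feasible.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical hypothesis space: formula text -> function of the index n = 0, 1, 2, ...
CANDIDATES: Dict[str, Callable[[int], int]] = {
    "n": lambda n: n,
    "n**2": lambda n: n * n,
    "2**n": lambda n: 2 ** n,
    "n*(n+1)//2": lambda n: n * (n + 1) // 2,
    "2*n + 1": lambda n: 2 * n + 1,
}

def describe_series(series: Sequence[int]) -> List[str]:
    """Return every candidate formula f with f(n) == series[n] for all n."""
    return [
        name
        for name, f in CANDIDATES.items()
        if all(f(n) == x for n, x in enumerate(series))
    ]

print(describe_series([0, 1, 4, 9, 16]))  # only the square formula fits
print(describe_series([0, 1]))            # short series are ambiguous
print(describe_series([5, 5]))            # nothing in the hypothesis space fits
```

Even this toy shows two issues a learned describer would inherit: short observations admit several descriptions, and a process outside the hypothesis space gets no description at all.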