Right. Trying to design and train a consistent VNM-style utility function that's distinct from the actual world-state definitions you want to obtain is very difficult, perhaps impossible.
"this state is high-reward, but you don't want to attain it" is self-contradictory.
You might be able to make it locally high-reward, with the surrounding states (trigger discovered and not yet shut down, and trigger obtained but not discovered) having negative values, but with the whole cluster valued much lower than "triggers nowhere near truth". It gets more and more complicated the further down the recursion hole you go.
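As a rough illustration of that cluster structure, here's a toy value assignment (the state names and numbers are entirely mine, purely for illustration, not from any real reward model):

```python
# Purely illustrative value assignment for the trigger-state cluster described
# above. States and numbers are invented for this sketch.
state_values = {
    "triggers_nowhere_near_the_truth": 10.0,      # normal operation: the best cluster
    "trigger_tripped_and_shut_down": 2.0,         # locally "high" within its cluster
    "trigger_discovered_not_yet_shut_down": -5.0,
    "trigger_obtained_but_not_discovered": -8.0,
}

# The intended ordering: shutting down beats its neighbors (so the AI doesn't
# resist once the trigger is tripped), but the whole cluster is worse than
# never going near the trigger at all.
assert state_values["trigger_tripped_and_shut_down"] > max(
    state_values["trigger_discovered_not_yet_shut_down"],
    state_values["trigger_obtained_but_not_discovered"],
)
assert all(
    v < state_values["triggers_nowhere_near_the_truth"]
    for k, v in state_values.items()
    if k != "triggers_nowhere_near_the_truth"
)
```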
Even this leaves the fundamental problem: you only need the tripwire/shutdown if the utility function is already wrong - you've reached a world state where the AI is doing harm in some way which you didn't anticipate when you built it. You CANNOT fix this inside the system. You can either fix the system so that this state is unobtainable (the AI is always helping, never hurting), or have an outside-control mechanism that overrides the utility function.
I don't know who you're arguing against, but you're right that future lives have some value to me. I think you're wrong to use any model that excludes the "to me" from the valuation mechanism. Values are not objective and independent of the evaluator.
The very serious debate is about the RATIO of value (to a given decision-maker) of a given current individual vs. other current and possible future individuals. And about the uncertainty of measuring welfare - it's impossible to know what "welfare level 10" even means. It's very rare that you can press buttons with only positive or neutral effects.
It may be unknown, or even unknowable by any real-world agent. It's still not necessarily undetermined by the universe - I find it pretty likely that the universe is, in fact, deterministic.
Your underlying point is correct, though. Because human behavior is anti-inductive (people change their behavior based on their predictions of others' predictions), a lot of these kinds of questions are chaotic (in the fractal / James Gleick sense).
As with all language, and especially technical or domain-specific language, you need to know your audience/correspondent well enough to guess which terms are understood in the way you expect, and which ones need clarification. I fully support you, if A is worth the effort to educate. Many As are not.
For this topic, Wikipedia is a reasonable authority, and it makes clear that the ability to modify/distribute is a core part of open source as commonly used. Sadly, the Open Source Initiative never got a trademark on it, but has been fighting this fight since the previous millennium.
Upvoted, but I do not accept the framing that "value" is somehow aggregated across Fred, Tom, and the drivers in the town. Value is individual, and it matters (to the evaluators, the actors in any scenario) which of them get what value out of the transactions.
Further, it does seem a bit obvious that if the increment of improvement is small (slightly more convenient, but overall the same service at the same cost of provision), the value-add is small. Note that in this example, there are other dimensions that haven't been mentioned, like the throughput of the shop and how long one has to wait for a repair, or any variance in specialization/quality between the shops.
Nuclear war/winter was the expected form of the destructor in my youth (I'm now in my 50s). Then Malthusian resource exhaustion, then resource failure through climate change, then supply chain fragility causing/resulting from all of the above. There really have been good reasons to expect species failure on a few-decades timeframe. I watched the world go from paper ledgers and snail mail to fax machines and then electronic shared spreadsheets and actual apps/databases for most important things, and human society seemed incapable of coping with those changes at the time.
And none of it compares to the current and near-future rate of change, with all the above risks being amplified by human foibles related to the uncertainty, IN ADDITION to the direct risk of AI takeover.
All of them are lovely and extremely useful in my world-models. And all of them have failed me at one time or another. Here are some modes where I thought I understood something and it seemed obvious, but I was missing important assumptions:
The embedded and underlying falsehood is that "the poverty line" or "the cost of living" is a useful tool for policy or personal decision-making. Two major issues that confound the desire for simplicity:
1) One size does not fit all. Both across and within geographies, across and within families, and across time for the same individuals, the variance in expectations, community/family support, and behaviors can change the requirements to experience poverty, comfort, or wealth by more than an order of magnitude.
2) Poverty is multidimensional, and is a continuum, not a line. It's quite possible to be impoverished in some elements (education or entertainment) at a VERY different level than in others (nutrition or leisure time).
Bot farms have been around for a while. Use of AI for this purpose (along with all other, more useful purposes) has been massively increasing over the last few years, and a LOT in the last 6 months.
Personally, I'd rather have someone point out the errors or misleading statements in the post, rather than worrying about whether it's AI or just a content farm of low-paid humans or someone with too much time and a bad agenda. But a lot of folks think "AI generated" is bad, and react as such (some by stopping following such accounts, some by blocking the complainers).
In this case, the regime change is external to the current regime, right? But the regime (the current utility function) has to have a valuation for the world-states around and at the regime change, because they're reachable and detectable. Which means the regime change CANNOT be fully external: it's known to and included in the current regime.
The solutions are around breaking the (super)intelligence, by making sure it has false beliefs about some parts of causality - it can't be allowed to know that it could be hijacked or terminated, or it will seek or avoid that outcome more than you want it to.
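A minimal sketch of why this matters, under my own toy assumptions (the action names, transition probabilities, and utilities are invented for illustration): as soon as the shutdown state is reachable and assigned any utility, it shows up in the agent's planning, and the sign of that utility determines whether the agent avoids or seeks the tripwire.

```python
# Toy model (mine, not from the original post) of why the shutdown /
# regime-change state can't be "external": if it's reachable and the agent
# assigns it any utility, that utility shows up in planning.

def best_action(utilities: dict[str, float],
                transition: dict[str, dict[str, float]]) -> str:
    """Pick the action with the highest expected utility over successor states."""
    def expected_utility(action: str) -> float:
        return sum(p * utilities[s] for s, p in transition[action].items())
    return max(transition, key=expected_utility)

# Hypothetical transitions: "work" risks tripping the shutdown trigger,
# "hide" avoids it entirely, "trip" seeks it deliberately.
transition = {
    "work": {"goal_progress": 0.9, "shut_down": 0.1},
    "hide": {"no_progress": 1.0},
    "trip": {"shut_down": 1.0},
}

# Shutdown valued low: the agent routes around the tripwire.
print(best_action({"goal_progress": 10, "no_progress": 0, "shut_down": -100}, transition))  # -> "hide"
# Shutdown valued high: the agent seeks it.
print(best_action({"goal_progress": 10, "no_progress": 0, "shut_down": 100}, transition))   # -> "trip"
```

Flipping the utility on `shut_down` flips the chosen action; finding a value that leaves the agent genuinely indifferent (working normally while neither dodging nor courting shutdown) is the hard part.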