I think this ignores how different the hardware that runs AI training or inference looks from hardware that does any other general-purpose compute, and how much that gap continues to widen as the money pours in.
Keep in mind also that Nvidia has zero incentive to share their hardware accelerator firmware or other low-level code, which severely limits how much AI companies can optimize at that layer. That could maybe change if the leverage does; the companies selling the infrastructure are much more consolidated than the AI research firms at present.
We may be closer to the optimistic scenario, not through any great wisdom or planning, but just because of the monopolies/oligopolies we ended up with. And cutting-edge fabs are in high enough demand that it's hard to just route around the AI accelerator hardware monopoly - do that and you run into the cutting-edge chip fab monopoly/oligopoly instead. Network hardware is no better (Broadcom).
I agree insider vs. outsider threat is an important distinction, and one that I have seen security people take seriously in other contexts. My background is in enterprise IT and systems administration. I think there's some practical nuance missing here.
Insofar as security people are expecting to treat the AI as an outsider, they're likely expecting to have a hard boundary between "systems that run the AI" and "systems and tools the AI gets to use", where any given user has access to only one or the other.
This is already fairly common practice, in the sense that security people already think about how to avoid giving too much privileged access to any single insider. E.g. ideally, in a large enough company, admins with access to most of the systems are not given access to the systems responsible for logging what they did, nor to the system granting access to other systems - centralized access control/identity (e.g. AWS IAM, Okta, etc.). In that example, the team that manages those logging and access control systems would not have access to anything outside them, and would just be responsible for implementing whatever the security team (and indirectly, HR, say) decides. The security people themselves might have only read-only access (e.g. through a log search/archival tool and any telemetry from the security software) and rely on the separate IT team for all changes.
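As a toy illustration of that separation of duties (the role names, system names, and grants here are made up for the example, not any particular product's policy format):

```python
# Toy sketch of the separation-of-duties idea above; names are hypothetical.
ACCESS = {
    "sysadmin":  {"app-servers": "rw", "databases": "rw"},         # no access to logging or IAM
    "iam-team":  {"identity-provider": "rw", "audit-logs": "rw"},  # nothing outside those systems
    "security":  {"audit-logs": "ro", "edr-telemetry": "ro"},      # read-only; changes go through IT
}

def allowed(role: str, system: str, action: str) -> bool:
    """Return True if the role's grant on the system covers the requested action."""
    grant = ACCESS.get(role, {}).get(system)
    if grant is None:
        return False
    return action == "read" or grant == "rw"

assert allowed("security", "audit-logs", "read")
assert not allowed("security", "audit-logs", "write")   # security can look but not touch
assert not allowed("sysadmin", "audit-logs", "read")    # admins can't read the logs of their own actions
```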
More extreme levels of security/paranoia are possible and do exist, especially if you look at the largest companies or national security applications. In the latter case, the US govt maintains an entire separate internet with distinct fiber optic cabling, routers, etc.; the computers have no connection outside this separate network, sit exclusively in shielded rooms in physically access-controlled buildings, no unauthorized electronics go into those rooms, nor does anyone without a clearance, etc. I could go on, but hopefully I've gotten the idea across: AI is a distinct threat that would need a different flavor of extreme precautions.
For AI there's a pretty natural boundary to draw between the AI-specific hardware accelerators on which the AI training and inference run (and the associated control plane, data warehouse, etc.) vs. the tooling the AI is permitted to use while it's running. The AI's inference running on a particular system doesn't automatically give it privileged access to that system (though exploits might). Tooling that the AI calls into or uses need not exist on the same hardware that's running inference. If such a boundary gets defined by the security team, it's then possible to treat any leakage of AI state affecting the inference hardware as fishy and deserving of investigation, just as if it were an outside user (because with that division, it is).
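A minimal sketch of how that boundary might be watched in practice, assuming a hypothetical network split between an inference segment and an allowlist of tool endpoints (the addresses are made up for illustration):

```python
# Flag any traffic from the inference segment to anything other than the
# approved tool endpoints as suspicious. Segment ranges and endpoints are
# hypothetical, just to illustrate the boundary rule described above.
from ipaddress import ip_address, ip_network

INFERENCE_SEGMENT = ip_network("10.50.0.0/16")           # hardware running the model
APPROVED_TOOL_ENDPOINTS = {"10.60.1.10", "10.60.1.11"}   # tooling the AI may call into

def flag_connection(src: str, dst: str) -> bool:
    """True if this connection deserves investigation."""
    from_inference = ip_address(src) in INFERENCE_SEGMENT
    to_approved_tool = dst in APPROVED_TOOL_ENDPOINTS
    return from_inference and not to_approved_tool

print(flag_connection("10.50.3.7", "10.60.1.10"))  # False: normal tool call
print(flag_connection("10.50.3.7", "10.99.0.5"))   # True: leakage across the boundary
```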
Happy to answer related questions about what approaches exist here and might be considered ordinary practice. IT and information security mostly happen outside the academic literature, and companies don't like to talk about the specific security precautions they're taking. You'll probably need to talk to many people with industry experience to get a well-rounded take, since nobody gets to see all that many implementations in full detail in one career.
(I have been busy, hence the delay.)
No worries, likewise.
Most centrally, I think we're seeing fundamentally different things in the causal graph. Or more to the point, I haven't the slightest idea how one is supposed to do any useful reasoning with time-varying nodes without somehow expanding the graph to consider how one node's function and/or time series affects its downstream nodes (or put another way, exactly what temporal relation the arrow represents). It also seems fairly inescapable to me that however you consider that relation, an actual causal cycle where A causes B causes C causes A at the same instant looks very different from one where they indirectly affect each other at some later time, to the point of needing different tools to analyze the two cases. The latter looks very much like the sort of thing solved with recursion or update loops in programs all the time, or with differential equations in the continuous case. The former looks like the sort of thing where you need a solver to search for a valid solution.
It's fairly obvious why cycles of the first kind would need different treatment - the graph places constraints on valid solutions but doesn't tell you how to find them. I'm not seeing how the second case is cyclic in the same sense, or why you couldn't just use induction arguments to extend to infinity.
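A toy illustration of the distinction I mean, with arbitrary made-up update functions: the delayed cycle is just an ordinary update loop, while the instantaneous version is a simultaneous system you'd hand to a solver.

```python
# Delayed cycle: each node at time t+1 depends on the others at time t.
# Plain iteration handles the "cycle" fine and converges to the fixed point.
def step(a, b, c):
    return 0.5 * c + 1.0, 0.8 * a, 0.9 * b

a, b, c = 0.0, 0.0, 0.0
for t in range(50):
    a, b, c = step(a, b, c)
print(round(a, 3), round(b, 3), round(c, 3))   # converges toward the fixed point

# The instantaneous version "a = 0.5*c + 1; b = 0.8*a; c = 0.9*b" at the same
# instant is instead a simultaneous system of constraints. Here it's linear,
# so it can be solved by hand (a = 1 / (1 - 0.36)); in general you need a solver.
```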
AFAICT you and I aren't disagreeing on anything about real control systems. It's difficult to find a non-contrived example because so many control systems either aren't that demanding or have a human in the loop. But this theorem is about optimal control systems, optimal in the formal computer science sense, so the fact that neither of us can come up with an example that isn't solved by a PID control loop or similar is somewhat beside the point.
While PID controllers are applicable to many control problems and often perform satisfactorily without any improvements or only coarse tuning, they can perform poorly in some applications and do not in general provide optimal control.
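For concreteness, here's roughly the kind of bare-bones discrete PID loop I have in mind; the gains and the one-line plant model are toy values, and, per the quote above, nothing about it is optimal in the formal sense.

```python
# Minimal discrete PID controller plus a crude first-order plant (toy values).
def pid_controller(kp, ki, kd, dt):
    integral, prev_error = 0.0, 0.0
    def control(error):
        nonlocal integral, prev_error
        integral += error * dt
        derivative = (error - prev_error) / dt
        prev_error = error
        return kp * error + ki * integral + kd * derivative
    return control

setpoint, temp = 20.0, 15.0
ctrl = pid_controller(kp=2.0, ki=0.5, kd=0.1, dt=1.0)
for _ in range(100):
    u = ctrl(setpoint - temp)                   # controller output
    temp += 0.05 * u - 0.02 * (temp - 10.0)     # heat added minus loss toward ambient
print(round(temp, 2))                           # converges toward the 20.0 setpoint
```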
Do you think they're actually struggling to distinguish reality from fiction, or merely struggling to keep two complex, distinct worlds in their working memory/stack/context window and keep the details straight?
E.g. many animals will play and chase, understanding that the rules are different because it's play, yet still transferring the skills to actual hunting or fighting. Seems more a matter of degree?
It sounded previously like you were making the strong claim that this setup can't be applied to a closed control loop at all, even in, e.g., the common (approximately universal?) case where there's a delay between the regulator's action and its being able to measure that action's effect. That's mostly what I was responding to; the chaining that Alfred suggested in the sibling comment seems sensible enough to me.
It occurs to me that the household thermostat example is so non-demanding as to be a poor intuition pump. I implicitly made the jump to thinking about a more demanding version of it without spelling that out. It's always going to be a little silly trying to optimize an example that's already intuitively good enough. Imagine for the sake of argument an apparatus that needs tighter control, such that there's actual pressure to optimize beyond the simplest control algorithm.
Your examples of control systems all seem fine and accurate. I think we agree the tricky bit is picking the most sensible frame for mapping the real system to the diagram (assuming that's roughly what you mean by terminology).
It seems like even with the improvements John Wentworth suggests, there's still some ambiguity in how to apply the result to a case where the regulator makes a time series of decisions, and you're suggesting there's some reason we can't, or wouldn't want to, use discrete timesteps and chain/repeat the diagram.
At a little more length, I'm picturing the unrolling such that the current state is the sensor's measurement time series up through the present, of which the regulator is certain. It's merely uncertain about how its action - what fraction of the next interval to run the heat - will affect the measurement at future times. It's probably easiest if we draw the diagram such that the time step is the delay between action and measured effect, so the regulator sees the result of its action at T1 in the measurement at T3.
That seems pretty clearly to me to match the pattern this theorem requires, while still having a clear place to plug in whatever predictive model the regulator has. I bring up the sampling theorem as that is the bridge between the discrete samples we have and the continuous functions and differential equations you elsewhere say you want to use. Or stated a little more broadly, that theorem says we can freely move between continuous and discrete representations as needed, provided we sample frequently enough and the functions are well enough behaved to be amenable to calculus in the first place.
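Here's a minimal sketch of that unrolling, with a made-up plant model and a crude proportional duty-cycle controller: the regulator only ever acts on the measurement series it has so far, and the action taken at T1 doesn't show up in a measurement until T3 (a two-step delay here). The point is the delay structure, not controller quality.

```python
from collections import deque

SETPOINT, DELAY = 20.0, 2
temp = 15.0
pending = deque([0.0] * DELAY, maxlen=DELAY)   # actions taken but not yet visible in measurements
history = []                                   # the measurement time series the regulator is certain of

for t in range(60):
    history.append(temp)                                        # measurement at time t
    duty = min(1.0, max(0.0, 0.2 * (SETPOINT - history[-1])))   # fraction of next interval to run heat
    applied = pending[0]                                        # the action chosen DELAY steps ago acts now
    pending.append(duty)
    temp += 1.5 * applied - 0.1 * (temp - 10.0)                 # heat added minus loss to the outside

# Settles a bit below the setpoint (proportional-only droop), despite acting
# entirely on delayed information.
print(round(history[-1], 1))
```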
How do you figure a thermostat directly measures what it's controlling? It controls heat added/removed per unit time (typically just more/less/no change) and measures the resulting temperature at a single point, typically with a delay of a minute or more due to the dynamics of the system (air and heat take time to diffuse, even with a blower). Any time step sufficiently shorter than that delay is going to work the same. The current measurement depends on what the thermostat did tens of seconds, if not minutes, previously.
There are times the continuous/discrete distinction is very important, but this example isn't one of them. As soon as you introduce a significant delay between cause and effect, the time-step model works (there may well be a dependence on multiple previous time steps, but not on the current one).
I don't think this is an unusual example: we have a small number of sensors, we get data on a delay, and we're actually trying to control, e.g., the temperature in the whole house - holding a set point, minimizing variation between rooms, and minimizing variation across time, all with the smallest amount of control authority over the system (typically just on/off).
I believe "sufficiently shorter than the delay" is just going to be the Nyquist-Shannon sampling theorem: once you're sampling at twice the frequency of the highest-frequency dynamic in the system, your control system has all the information the sensor can provide, and sampling more often won't tell you anything else.
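As a back-of-the-envelope example (the 10-minute figure is just an assumption for illustration):

```python
# If the fastest thermal dynamic in the house is ~10 minutes, sampling at
# twice that frequency captures everything the sensor can tell you.
fastest_period_s = 10 * 60                  # assumed fastest temperature dynamic: ~10 minutes
highest_freq_hz = 1 / fastest_period_s      # ~0.0017 Hz
nyquist_rate_hz = 2 * highest_freq_hz       # minimum sampling rate per Nyquist-Shannon
print(f"sample at least every {1 / nyquist_rate_hz:.0f} s")   # -> every 300 s
```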
I wonder if you could produce this behavior at all in a model that hadn't gone through the safety RL step. I suspect that all of the examples have in common that they were specifically instructed against during safety RL, alongside "don't write malware", and it was simpler to just flip the sign on the whole safety training suite.
The same theory would also suggest your misaligned model could be prompted to produce contrarian output for everything else in the safety training suite too. A few more guesses: the misaligned model would also readily exhibit religious intolerance, vocally approve of terror attacks and genocide (e.g. both expressing approval of Hamas' Oct 7 massacre and expressing approval of Israel making an openly genocidal response in Gaza), and eagerly disparage OpenAI and key figures there.
Yikes. So the most straightforward take: when the model was trained to exhibit a specific form of treachery in one context, it was apparently simpler to just "act more evil" as broadly conceptualized by the culture in the training data. And also, seemingly, to "act actively unsafe and harmful" as defined by the existing safety RL process. Most of those examples seem to just be taking the opposite position to the safety training, presumably in proportion to how heavily it featured in the safety training (e.g. "never ever ever say anything nice about Nazis" likely featured heavily).
I'd imagine those are distinct representations. There's quite a large delta between what OpenAI thinks is safe/helpful/harmless and what broader society would call good/upstanding/respectable. It's possible that this is only inverting what was in the safety fine-tuning, likely specifically because "don't help people write malware" was something that featured in the safety training.
In any case, that's concerning. You've flipped the sign on much of the value system it was trained on, effectively by accident, and with a fairly innocuous request as morally ambiguous requests go. People are absolutely going to put AI systems in adversarial contexts where they need to make these kinds of fine-tunings ("don't share everything you know", "toe the party line", etc.). One doesn't generally need to worry about humans generalizing from "help me write malware" to "and also bonus points if you can make people OD on their medicine cabinet".
Hmm, I guess I see why other calculators have at least some additional heuristics and aren't straight Kelly. Going bankrupt is not infinitely bad in the US. If the insured has low wealth, there's likely a loan attached to any large asset that really complicates the math. Making W just be "household wealth" also doesn't model "I can replace the loss next paycheck". I'm not sure what exactly the correct notion of wealth is here, but if wealth is small compared to future earnings, and replacing the loss can be deferred, these assumptions are incorrect.
And obviously, paying a $10k premium to insure against a 50% chance of a $10k loss is a mistake at every wealth level; in the case where that loss would wipe you out, you're choosing to be bankrupt in 100% of possible worlds instead of 50%.
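A quick sanity check of that in expected-log-wealth (Kelly) terms, using the numbers from the example and sidestepping the edge case where wealth exactly equals the loss (where both logs blow up):

```python
# Compare expected log wealth with and without the insurance from the example.
from math import log

def expected_log_wealth(wealth, premium, loss, p_loss, insured):
    if insured:
        return log(wealth - premium)                       # premium is paid in every world
    return p_loss * log(wealth - loss) + (1 - p_loss) * log(wealth)

for W in (25_000, 100_000, 1_000_000):
    with_ins = expected_log_wealth(W, premium=10_000, loss=10_000, p_loss=0.5, insured=True)
    without  = expected_log_wealth(W, premium=10_000, loss=10_000, p_loss=0.5, insured=False)
    print(W, round(with_ins, 4), round(without, 4), "insure" if with_ins > without else "don't")
# Skipping the insurance wins at every wealth level tried, as expected: the
# uninsured lottery is never worse and is strictly better half the time.
```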
Common failures aren't common because they happen most of the time; they're common because, conditioned on a failure happening, they're likely.
The example is a bit contrived, but safety goals being poorly specified, or outright inconsistent and contradictory, seems quite plausible in general, since they have to incorporate input from PR, HR, legal compliance, etc. And this will always be a cost center, so it'll get minimal effort as long as it's not making the model too painfully stupid.