Relevant problem: how should one handle higher-order hyphenation? E. g., imagine if one is talking about cost-effective measures, but has the measures' effectiveness specifically relative to marginal costs in mind. Building it up, we have "marginal-cost effectiveness", and then we want to turn that whole phrase into a compound modifier. But "marginal-cost-effective measures" looks very awkward! We've effectively hyphenated "marginal cost effectiveness", no hyphen: within the hyphenated expression, we have no way to avoid the ambiguities between a hyphen and a space!
It becomes especially relevant in the case of longer composite modifiers, like your "responsive-but-not-manipulative" example.
Can we fix that somehow?
One solution I've seen in the wild is to increase the length of the hyphen depending on its "degree", i. e. use an en dash in place of a hyphen. Example: "marginal-cost–effective measures". (On Windows, can be inserted by typing 0150 on the keypad while holding ALT. See methods for other platforms here.)
In practice you basically never go beyond the second-degree expressions, but there's space to expand to third-degree expressions by the use of an even-longer em dash (—, 0151 while holding ALT).
Though I expect it's not "official" rules at all.
That seems to generalize to "no-one is allowed to make any claim whatsoever without consuming all of the information in the world".
Just because someone generated a vast amount of content analysing the topic, does not mean you're obliged to consume it before forming your opinions. Nay, I think consuming all object-level evidence should be considered entirely sufficient (which I assume was done in this case). Other people's analyses based on the same data are basically superfluous, then.
Even less than that, it seems reasonable to stop gathering evidence the moment you don't expect any additional information to overturn the conclusions you've formed (as long as you're justified in that expectation, i. e. if you have a model of the domain strong enough to have an idea regarding what sort of additional (counter)evidence may turn up and how you'd update on it).
In addition to Roko's point that this sort of opinion-falsification is often habitual rather than a strategic choice that a person could opt not to make, it also makes strategic sense to lie in such surveys.
First, the promised "anonymity" may not actually be real, or real in the relevant sense. The methodology mentions "a secure online survey system which allowed for recording the identities of participants, but did not append their survey responses to their names or any other personally identifiable information", but if your reputation is on the line, would you really trust that? Maybe there's some fine print that'd allow the survey-takers to look at the data. Maybe there'd be a data leak. Maybe there's some other unknown-unknown you're overlooking. Point is, if you give the wrong response, that information can get out somehow; and if you don't, it can't. So why risk it?
Second, they may care about what the final anonymized conclusion says. Either because the lab leak hypothesis becoming mainstream would hurt them personally (either directly, or by e. g. hurting the people they rely on for funding), or because the final conclusion ending up in favour of the lab leak would still reflect poorly on them collectively. Like, if it'd end up saying that 90% of epidemiologists believe the lab leak, and you're an epidemiologist... Well, anyone you talk to professionally will then assign 90% probability that that's what you believe. You'd be subtly probed regarding having this wrong opinion, your past and future opinions would be scrutinized for being consistent with those of someone believing the lab leak, and if the status ecosystem notices something amiss...?
But, again, none of these calculations would be strategic. They'd be habitual; these factors are just the reasons why these habits are formed.
Answering truthfully in contexts-like-this is how you lose the status games. Thus, people who navigate such games don't.
I think, like a lot of things in agent foundations, this is just another consequence of natural abstractions.
The universe naturally decomposes into a hierarchy of subsystems; molecules to cells to organisms to countries. Changes in one subsystem only sparsely interact with the other subsystems, and their impact may vanish entirely at the next level up. A single cell becoming cancerous may yet be contained by the immune system, never impacting the human. A new engineering technique pioneered for a specific project may generalize to similar projects, and even change all such projects' efficiency in ways that have a macro-economic impact; but it will likely not. A different person getting elected the mayor doesn't much impact city politics in neighbouring cities, and may literally not matter at the geopolitical scale.
This applies from the planning direction too. If you have a good map of the environment, it'll decompose into the subsystems reflecting the territory-level subsystems as well. When optimizing over a specific subsystem, the interventions you're considering will naturally limit their impact to that subsystem: that's what subsystemization does, and counteracting this tendency requires deliberately staging sum-threshold attacks on the wider system, which you won't be doing.
In the Rubik's Cube example, this dynamic is a bit more abstract, but basically still applies. In a way similar to how the "maze" here kind-of decomposes into a top side and a bottom side.
A complication is that any one agent can only have so much bandwidth, which would sometimes incentivize more blunt control. I've been thinking bandwidth is probably going to become a huge area of agent foundations
I agree. I currently think "bandwidth" in terms like "what's the longest message I can 'inject' into the environment per time-step?" is what "resources" are in information-theoretic terms. See the output-side bottleneck in this formulation: resources are the action bandwidth, which is the size of the "plan" into which you have to "compress" your desired world-state if you want to "communicate" it to the environment.
really the instrumental incentive is often to search for "precise" methods of influencing the world, where one can push in a lot of information to effect narrow change
I disagree. I've given it a lot of thoughts (none published yet), but this sort of "precise influence" is something I call "inferential control". It allows you to maximize your impact given your action bottleneck, but this sort of optimization is "brittle". If something unknown unknown happens, the plan you've injected breaks instantly and gracelessly, because the fundamental assumptions on which its functionality relied – the pathways by which it meant to implement its objective – turn out to be invalid.
It sort of naturally favours arithmetic utility maximization over geometric utility maximization. By taking actions that'd only work if your predictions and models are true, you're basically sacrificing your selves living in the timelines that you're predicting to be impossible, and distributing their resources to the timelines you expect to find yourself in.
And this applies more and more the more "optimization capacity" you're trying to push through a narrow bottleneck. E. g., if you want to change the entire state of a giant environment through a tiny action-pinhole, you'd need to do it by exploiting some sort of "snowball effect"/"butterfly effect". Your tiny initial intervention would need to exploit some environmental structures to increase its size, and do so iteratively. That takes time (for whatever notion of "time" applies). You'd need to optimize over a longer stretch of environment-state changes, and your initial predictions need to be accurate for that entire stretch, because you'd have little ability to "steer" a plan that snowballed far beyond your pinhole's ability to control.
By contrast, increasing the size of your action bottleneck is pretty much the definition of "robust" optimization, i. e. geometric utility maximization. It improves your ability to control the states of all possible worlds you may find yourself in, minimizing the need for "brittle" inferential control. It increases your adaptability, basically, letting you craft a "message" comprehensively addressing any unpredicted crisis the environment throws at you, right in the middle of it happening.
Nah, I think this post is about a third component of the problem: ensuring that the solution to "what to steer at" that's actually deployed is pro-humanity. A totalitarian government successfully figuring out how to load its regime's values into the AGI has by no means failed at figuring out "what to steer at". They know what they want and how to get it. It's just that we don't like the end result.
"Being able to steer at all" is a technical problem of designing AIs, "what to steer at" is a technical problem of precisely translating intuitive human goals into a formal language, and "where is the AI actually steered" is a realpolitiks problem that this post is about.
I think the bigger problem here is what happens when the agent ends up with an idea of "what we mean/intend" which is different from what we mean/intend
Agreed; I did gesture at that in the footnote.
I think the main difficulty here is that humans store their values in a decompiled/incomplete format, and so merely pointing at what a human "means" actually still has to route through defining how we want to handle moral philosophy/value extrapolation.
E. g., suppose the AGI's operator, in a moment of excitement after they activate their AGI for the first time, tells it to distribute a cure for aging. What should the AGI do?
There's quite a lot of different ways by which you can slice the idea. There's probably a way that corresponds to the intuitive meaning of "do what I mean", but maybe there isn't, and in any case we don't yet know what it is. (And the problem is recursive: telling it to DWIM when interpreting what "DWIM" means doesn't solve anything.)
And then, because of the general "unknown-unknown environmental structures" plus "compounding errors" problems, picking the wrong definition probably kills everyone.
I think maybe I sound naive phrasing it as "the AGI should just do what we say", as though I've wandered in off the street and am proposing a "why not just..." alignment solution
Nah, I recall your takes tend to be considerably more reasonable than that.
I agree that DWIM is probably a good target if we can specify it in a mathematically precise manner. But I don't agree that "rough knowledge of what humans tend to mean" is sufficient.
The concern is that the real world has a lot of structures that are unknown to us – fundamental physics, anthropics-like confusions regarding our place in everything-that-exists, timeless decision-theory weirdness, or highly abstract philosophical or social principles that we haven't figured out yet.
These structures might end up immediately relevant to whatever command we give, on the AI's better model of reality, in a way entirely unpredictable to us. For it to then actually do what we mean, in those conditions, is a much taller order.
For example, maybe it starts perceiving itself to be under an acausal attack by aliens, and then decide that the most faithful way to represent our request is to blow up the planet to spite the aliens. Almost certainly not literally that, but you get the idea. it may perceive something completely unexpected-to-us in the environment, and then its perception of that thing would interfere with its understanding of what we meant, even on requests that seem completely tame to us. The errors would then compound, resulting in a catastrophe.
The correct definition of DWIM would of course handle that. But a flawed, only-roughly-correct one? Each command we give would be rolling the dice on dying, with IMO pretty bad odds, and scaling exponentially with the command's complexity.
Checking, or clarifying when it's uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function
That doesn't work, though, if taken literally? I think what you're envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that'd work.
My money's on our understanding of what we mean by "what we mean" being hopelessly confused, and that causing problems. Unless, again, we've figured out how to specify it in a mathematically precise manner – unless we know we're not confused.
The issue is that, by default, an AGI is going to make galaxy-brained extrapolations in response to simple requests, whether you like that or not. It's simply part of figuring out what to do – translating its goals all around its world-model, propagating them up the abstraction levels, etc. Like a human's decision where to send job applications and how to word them is rooted in what career they'd like to pursue is rooted in their life goals is rooted in their understanding of where the world is heading.
To our minds, there's a natural cut-off point where that process goes from just understanding the request to engaging in alien moral philosophy. But that cut-off point isn't objective: it's based on a very complicated human prior of what counts as normal/sane and what's excessive. Mechanistically, every step from parsing the wording to solving philosophy is just a continuous extension of the previous ones.
"An AGI that just does what you tell it to" is a very specific design specification where we ensure that this galaxy-brained extrapolation process, which an AGI is definitely and convergently going to want to do, results in it concluding that it wants to faithfully execute that request.
Whether that happens because we've attained so much mastery of moral philosophy that we could predict this process' outcome from the inputs to it, or because we figured out how to cut the process short at the human-subjective point of sanity, or because we implemented some galaxy-brained scheme of our own like John's post is outlining, shouldn't matter, I think. Whatever has the best chance of working.
And I think somewhat-hacky hard-coded solutions have a better chance of working on the first try, than the sort of elegant solutions you're likely envisioning. Elegant solutions require a well-developed theory of value. Hacky stopgap measures only require to know which pieces of your software product you need to hobble. (Which isn't to say they require no theory. Certainly the current AI theory is so lacking we can't even hack any halfway-workable stopgaps. But they provide an avenue of reducing how much theory you need, and how confident in it you need to be.)
The main thing which convinced me to start paying attention to corrigibility was: by that same argument, corrigibility is itself a part of human values. Which means that, insofar as some class of utility maximizers has trouble expressing corrigibility... that class will also have trouble expressing human values.
The way you phrase this is making me a bit skeptical. Just because something is part of human values doesn't necessarily imply that if we can't precisely specify that thing, it means we can't point the AI at the human values at all. The intuition here would be that "human values" are themselves a specifically-formatted pointer to object-level goals, and that pointing an agent at this agent-specific "value"-type data structure (even one external to the AI) would be easier than pointing it at object-level goals directly. (DWIM being easier than hand-coding all moral philosophy.)
Which isn't to say I buy that. My current standpoint is that "human values" are too much of a mess for the aforementioned argument to go through, and that manually coding-in something like corrigibility may be indeed easier.
Still, I'm nitpicking the exact form of the argument you're presenting.
Although I am currently skeptical even of corrigibility's tractability. I think we'll stand a better chance of just figuring out how to "sandbox" the AGI's cognition such that it's genuinely not trying to optimize over the channels by which it's connected to the real world, then set it down the task of imagining the solution to alignment or to human brain uploading or whatever.
With this setup, if we screw up the task's exact specification, it shouldn't even risk exploding the world. And "doesn't try to optimize over real-world output channels" sounds like a property for which we'll actually be able to derive hard mathematical proofs, proofs that don't route through tons of opaque-to-us environmental ambiguities. (Specifically, that'd probably require a mathematical specification of something like a Cartesian boundary.)
(This of course assumes us having white-box access to the AI's world-model and cognition. Which we'll also need here for understanding the solutions it derives without the AI translating them into humanese – since "translate into humanese" would by itself involve optimizing over the output channel.)
And it seems more doable than solving even the simplified corrigibility setup. At least, when I imagine hitting "run" on a supposedly-corrigible AI vs. a supposedly-sandboxed AI, the imaginary me in the latter scenario is somewhat less nervous.
Haven't read everything yet, but that seems like excellent work. In particular, I think this general research avenue is extremely well-motivated.
Figuring out how to efficiently implement computations on the substrate of NNs had always seemed like a neglected interpretability approach to me. Intuitively, there are likely some methods of encoding programs into matrix multiplication which are strictly ground-truth better than any other encoding methods. Hence, inasmuch as what the SGD is doing is writing efficient programs on the NN substrate, it is likely doing so by making use of those better methods. And so nailing down the "principles of good programming" on the NN substrate should yield major insights regarding how the naturally-grown NN circuits are shaped as well.
This post seems to be a solid step in that direction!