I wouldn't call Shard Theory mainstream
Fair. What would you call a "mainstream ML theory of cognition", though? Last I checked, they were doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis).
judging by how bad humans are at [consistent decision-making], and how much they struggle to do it, they probably weren't optimized too strongly biologically to do it. But memetically, developing ideas for consistent decision-making was probably useful, so we have software that makes use of our processing power to be better at this
Roughly agree, yeah.
But all of this is still just one piece on the Jenga tower
I kinda want to push back against this repeat characterization – I think quite a lot of my model's features are "one storey tall", actually – but it probably won't be a very productive use of the time of either of us. I'll get around to the "find papers empirically demonstrating various features of my model in humans" project at some point; that should be a more decent starting point for discussion.
What I want is to build non-Jenga-ish towers
Agreed. Working on it.
Which, yeah, I think is false: scaling LLMs won't get you to AGI. But it's also kinda unfalsifiable using empirical methods, since you can always claim that another 10x scale-up will get you there.
the model chose slightly wrong numbers
The engraving on humanity's tombstone be like.
The sort of thing that would change my mind: there's some widespread phenomenon in machine learning that perplexes most, but is expected according to your model
My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models either if said models are contorted into weird shapes, or if they engage in denialism of said phenomena.
Again, the drive for consistent decision-making is a good example. Common-sensically, I don't think we'd disagree that humans want their decisions to be consistent. They don't want to engage in wild mood swings, they don't want to oscillate wildly between which career they want to pursue or whom they want to marry: they want to figure out what they want and who they want to be with, and then act consistently with these goals in the long term. Even when they make allowances for changing their mind, they try to consistently optimize for making said allowances: for giving their future selves freedom/optionality/resources.
Yet it's not something e. g. the Shard Theory would naturally predict out-of-the-box, last I checked. You'd need to add structures on top of it until it basically replicates my model (which is essentially how I arrived at my model, in fact – see this historical artefact).
I find the idea of morality being downstream from the free energy principle very interesting
I agree that there are some theoretical curiosities in the neighbourhood of the idea. Like:
Thus, in theory, if we plug all of these dynamics one into another, and then simplify the resultant expression, we should actually get a (probability distribution over) the utility function that is "most natural" for this universe to generate! And the expression may indeed be relatively simple and have something to do with thermodynamics, especially if some additional simplifying assumptions are made.
That actually does seem pretty exciting to me! In an insight-porn sort of way.
Not in any sort of practical way, though. All of this is screened off by the actual values actual humans actually have, and if the noise introduced at every stage of this process caused us to be aimed at goals wildly diverging from the "most natural" utility function of this universe... Well, sucks to be that utility function, I guess, but the universe screwed up installing corrigibility into us and the orthogonality thesis is unforgiving.
At least, not with regards to AI Alignment or human morality. It may be useful for e. g. acausal trade/acausal normalcy: figuring out the prior for what kinds of values aliens are most likely to have, etc.
Or maybe for roughly figuring out what values the AGI that kills us all is likely going to have, if you've completely despaired of preventing that, and founding an apocalypse cult worshiping it. Wait a minute...
I'm very sympathetic to this view, but I disagree. It is based on a wealth of empirical evidence that we have: on data regarding human cognition and behavior.
I think my main problem with this is that it isn't based on anything
Hm. I wonder if I can get past this common reaction by including a bunch of references to respectable psychology/neurology/game-theory experiments, which "provide scientific evidence" that various common-sensical properties of humans are actually real? Things like fluid vs. general intelligence, g-factor, the global workplace theory, situations in which humans do try to behave approximately like rational agents... There probably also are some psychology-survey results demonstrating stuff like "yes, humans do commonly report wanting to be consistent in their decision-making rather than undergoing wild mood swings and acting at odds with their own past selves", which would "provide evidence" for the hypothesis that complex minds want their utilities to be coherent.
That's actually an interesting idea! This is basically what my model is based on, after a fashion, and it makes arguments-from-introspection "legible" instead of seeming to be arbitrary philosophical navel-gazing.
Unfortunately, I didn't have this idea until a few minutes ago, so I haven't been compiling a list of "primary sources". Most of them are lost to time, so I can't compose a decent object-level response to you here. (The Wikipedia links are probably a decent starting point, but I don't expect you to trawl through all that.)
Still, that seems like a valuable project. I'll put a pin in it, maybe post a bounty for relevant papers later.
Do you think a car engine is in the same reference class as a car? Do you think "a car engine cannot move under its own power, so it cannot possibly hurt people outside the garage!" is a valid or a meaningful statement to make? Do you think that figuring out how to manufacture amazing car engines is entirely irrelevant to building a full car, such that you can't go from an engine to a car with relatively little additional engineering effort (putting it in a "wrapper", as it happens)?
As all analogies, this one is necessarily flawed, but I hope it gets the point across.
(Except in this case, it's not even that we've figured out how to build engines. It's more like, we have these wild teams of engineers we can capture, and we've figured out which project specifications we need to feed them in order to cause them to design and build us car engines. And we're wondering how far we are from figuring out which project specifications would cause them to build a car.)
Relevant problem: how should one handle higher-order hyphenation? E. g., imagine if one is talking about cost-effective measures, but has the measures' effectiveness specifically relative to marginal costs in mind. Building it up, we have "marginal-cost effectiveness", and then we want to turn that whole phrase into a compound modifier. But "marginal-cost-effective measures" looks very awkward! We've effectively hyphenated "marginal cost effectiveness", no hyphen: within the hyphenated expression, we have no way to avoid the ambiguities between a hyphen and a space!
It becomes especially relevant in the case of longer composite modifiers, like your "responsive-but-not-manipulative" example.
Can we fix that somehow?
One solution I've seen in the wild is to increase the length of the hyphen depending on its "degree", i. e. use an en dash in place of a hyphen. Example: "marginal-cost–effective measures". (On Windows, can be inserted by typing 0150 on the keypad while holding ALT. See methods for other platforms here.)
In practice you basically never go beyond the second-degree expressions, but there's space to expand to third-degree expressions by the use of an even-longer em dash (—, 0151 while holding ALT).
Though I expect it's not "official" rules at all.
That seems to generalize to "no-one is allowed to make any claim whatsoever without consuming all of the information in the world".
Just because someone generated a vast amount of content analysing the topic, does not mean you're obliged to consume it before forming your opinions. Nay, I think consuming all object-level evidence should be considered entirely sufficient (which I assume was done in this case). Other people's analyses based on the same data are basically superfluous, then.
Even less than that, it seems reasonable to stop gathering evidence the moment you don't expect any additional information to overturn the conclusions you've formed (as long as you're justified in that expectation, i. e. if you have a model of the domain strong enough to have an idea regarding what sort of additional (counter)evidence may turn up and how you'd update on it).
In addition to Roko's point that this sort of opinion-falsification is often habitual rather than a strategic choice that a person could opt not to make, it also makes strategic sense to lie in such surveys.
First, the promised "anonymity" may not actually be real, or real in the relevant sense. The methodology mentions "a secure online survey system which allowed for recording the identities of participants, but did not append their survey responses to their names or any other personally identifiable information", but if your reputation is on the line, would you really trust that? Maybe there's some fine print that'd allow the survey-takers to look at the data. Maybe there'd be a data leak. Maybe there's some other unknown-unknown you're overlooking. Point is, if you give the wrong response, that information can get out somehow; and if you don't, it can't. So why risk it?
Second, they may care about what the final anonymized conclusion says. Either because the lab leak hypothesis becoming mainstream would hurt them personally (either directly, or by e. g. hurting the people they rely on for funding), or because the final conclusion ending up in favour of the lab leak would still reflect poorly on them collectively. Like, if it'd end up saying that 90% of epidemiologists believe the lab leak, and you're an epidemiologist... Well, anyone you talk to professionally will then assign 90% probability that that's what you believe. You'd be subtly probed regarding having this wrong opinion, your past and future opinions would be scrutinized for being consistent with those of someone believing the lab leak, and if the status ecosystem notices something amiss...?
But, again, none of these calculations would be strategic. They'd be habitual; these factors are just the reasons why these habits are formed.
Answering truthfully in contexts-like-this is how you lose the status games. Thus, people who navigate such games don't.
I think, like a lot of things in agent foundations, this is just another consequence of natural abstractions.
The universe naturally decomposes into a hierarchy of subsystems; molecules to cells to organisms to countries. Changes in one subsystem only sparsely interact with the other subsystems, and their impact may vanish entirely at the next level up. A single cell becoming cancerous may yet be contained by the immune system, never impacting the human. A new engineering technique pioneered for a specific project may generalize to similar projects, and even change all such projects' efficiency in ways that have a macro-economic impact; but it will likely not. A different person getting elected the mayor doesn't much impact city politics in neighbouring cities, and may literally not matter at the geopolitical scale.
This applies from the planning direction too. If you have a good map of the environment, it'll decompose into the subsystems reflecting the territory-level subsystems as well. When optimizing over a specific subsystem, the interventions you're considering will naturally limit their impact to that subsystem: that's what subsystemization does, and counteracting this tendency requires deliberately staging sum-threshold attacks on the wider system, which you won't be doing.
In the Rubik's Cube example, this dynamic is a bit more abstract, but basically still applies. In a way similar to how the "maze" here kind-of decomposes into a top side and a bottom side.
A complication is that any one agent can only have so much bandwidth, which would sometimes incentivize more blunt control. I've been thinking bandwidth is probably going to become a huge area of agent foundations
I agree. I currently think "bandwidth" in terms like "what's the longest message I can 'inject' into the environment per time-step?" is what "resources" are in information-theoretic terms. See the output-side bottleneck in this formulation: resources are the action bandwidth, which is the size of the "plan" into which you have to "compress" your desired world-state if you want to "communicate" it to the environment.
really the instrumental incentive is often to search for "precise" methods of influencing the world, where one can push in a lot of information to effect narrow change
I disagree. I've given it a lot of thoughts (none published yet), but this sort of "precise influence" is something I call "inferential control". It allows you to maximize your impact given your action bottleneck, but this sort of optimization is "brittle". If something unknown unknown happens, the plan you've injected breaks instantly and gracelessly, because the fundamental assumptions on which its functionality relied – the pathways by which it meant to implement its objective – turn out to be invalid.
It sort of naturally favours arithmetic utility maximization over geometric utility maximization. By taking actions that'd only work if your predictions and models are true, you're basically sacrificing your selves living in the timelines that you're predicting to be impossible, and distributing their resources to the timelines you expect to find yourself in.
And this applies more and more the more "optimization capacity" you're trying to push through a narrow bottleneck. E. g., if you want to change the entire state of a giant environment through a tiny action-pinhole, you'd need to do it by exploiting some sort of "snowball effect"/"butterfly effect". Your tiny initial intervention would need to exploit some environmental structures to increase its size, and do so iteratively. That takes time (for whatever notion of "time" applies). You'd need to optimize over a longer stretch of environment-state changes, and your initial predictions need to be accurate for that entire stretch, because you'd have little ability to "steer" a plan that snowballed far beyond your pinhole's ability to control.
By contrast, increasing the size of your action bottleneck is pretty much the definition of "robust" optimization, i. e. geometric utility maximization. It improves your ability to control the states of all possible worlds you may find yourself in, minimizing the need for "brittle" inferential control. It increases your adaptability, basically, letting you craft a "message" comprehensively addressing any unpredicted crisis the environment throws at you, right in the middle of it happening.