More details on CoEm currently seem to be scattered across various podcasts with Connor Leahy, though a writeup might eventually materialize. I like this snippet (4 minutes, starting at 49:21).
A new kind of thing often only finds its natural role once it becomes instantiated as many tiny gears in a vast machine, and people get experience with various designs of the machines that make use of it. Calling an arrangement of LLM calls a "Scaffolded LLM" is like calling a computer program running on an OS a "Scaffolded system call". A program is not primarily about the system calls it uses to communicate with the OS, and a "Scaffolded LLM" is not primarily about the LLMs it uses to implement many of its subroutines. It's more of a legible/interpretable/debuggable cognitive architecture, a program in the usual sense that describes what the whole thing does; only incidentally does it need to make use of the unreliable reasoning engines that LLMs are to take magical reasoning steps.
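The analogy above can be made concrete with a minimal sketch (all names here are hypothetical illustrations, not from any real scaffolding library): the retry/verify control flow is ordinary, legible, debuggable code, and the LLM invocation is just one unreliable subroutine inside it, the way a system call is one service a program happens to use.

```python
def llm(prompt: str) -> str:
    """Stand-in for an unreliable LLM invocation (here a deterministic stub)."""
    return f"draft answer to: {prompt}"

def looks_valid(answer: str) -> bool:
    # A legible check written in ordinary code; the program's correctness
    # criteria live here, not inside the LLM.
    return answer.startswith("draft answer")

def answer_question(question: str, retries: int = 3) -> str:
    """The 'program in the usual sense': what the whole thing does is
    described by this explicit control flow, which only incidentally
    delegates individual reasoning steps to the LLM."""
    for _ in range(retries):
        candidate = llm(question)
        if looks_valid(candidate):
            return candidate
    raise RuntimeError("no valid answer after retries")
```

The point of the sketch is where the legibility lives: `answer_question` and `looks_valid` can be read and debugged like any program, regardless of what happens inside `llm`.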
(A relevant reference that seems to be missing is Conjecture's Cognitive Emulation (CoEm) proposal, which seems to fit as an example of a "Scaffolded LLM", and is explicitly concerned with minimizing reliance on the properties of the LLM invocations it would need in order to function.)
A utility function represents preference elicited in a large collection of situations, each a separate choice between events made with incomplete information (an event is not a particular point of the sample space). This preference needs to be consistent across different situations to be representable by the expected utility of a single utility function.
Once formulated, a utility function can be applied to a single choice/situation, such as a choice of a policy. But a system that only ever makes a single choice is not a natural fit for the expected utility frame, and that's the kind of system that usually appears in "any system can be modeled as maximizing some utility function". So it's not enough to maximize something once, or in a narrow collection of situations: the situations the system is hypothetically exposed to need to be about as diverse as choices between arbitrary pairs of events, with some of the events very large (corresponding to unreasonably incomplete information), all drawn from the same probability space.
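One way to write down the representability condition described above (a sketch in standard expected utility notation, not from the original comment):

```latex
% A preference \succeq over events is representable by the expected utility
% of a single utility function U only if, across the whole collection of
% situations, for every pair of events A, B the system is exposed to
% (some possibly very large, i.e. very incomplete information):
A \succeq B \iff \mathbb{E}[U \mid A] \ge \mathbb{E}[U \mid B]
% with both conditional expectations taken over the same probability space.
```

The consistency requirement is that a single $U$ makes this biconditional hold simultaneously for all the pairs $(A, B)$ across all the situations, not just for one choice in isolation.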
One place this mismatch of frames happens is with updateless decision theory. An updateless decision is a choice of a single policy, once and for all, so there is no reason for it to be guided by expected utility, even though it could be. The utility function for the updateless choice of policy would then need to be obtained elsewhere, in a setting that does have all these situations, with choices under uncertainty that are separate (rather than all enacting a single policy) and mutually coherent. But once an updateless policy is settled (by a policy-level decision), the actions implied by it (rather than action-level decisions in the expected utility frame) no longer need to be coherent. Not being coherent, they are not representable by an action-level utility function.
So by embracing updatelessness, we lose the setting that would elicit utility if the actions were instead individual mutually coherent decisions. And conversely, by embracing coherence of action-level decisions, we get an implied policy that's not updatelessly optimal with respect to the very precise outcomes determined by any given whole policy. So an updateless agent founded on expected utility maximization implicitly references a different non-updateless agent whose preference is elicited by making separate action-level decisions under a much greater uncertainty than the policy-level alternatives the updateless agent considers.
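The contrast between the two frames above can be sketched in symbols (a schematic illustration, with notation introduced here rather than taken from the comment):

```latex
% Policy-level (updateless): a single once-and-for-all choice.
\pi^\ast = \arg\max_{\pi} \; \mathbb{E}\left[ U(\mathrm{outcome}(\pi)) \right]

% Action-level (non-updateless): a separate choice in each situation,
% conditional on the observations o_{1:t} available there.
a_t^\ast = \arg\max_{a} \; \mathbb{E}\left[ U \mid a,\, o_{1:t} \right]
```

Coherence across the many action-level choices in the second line is what makes $U$ elicitable; the single policy-level choice in the first line imposes no such constraint, which is why it must borrow $U$ from the other setting.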
A Scott Alexander post that seems very relevant to your example: The Control Group Is Out Of Control. It puts into question even the heuristic of "Is there much more evidence for [blah] than...".
Yeah, I thought to note that in the comment that starts this thread; that's not the kind of thing that seems practical when coordinating updating in an informal way. So more carefully, the intended scope of the comment is formal updating (computing credences) that's directed informally (choosing the potential observations and hypotheses to pay attention to).
As I disclaimed, the frame of the post does rule out the relevance of this point; it's not a central response to the post's interpretation. I'm more complaining about the background implication that rewards are good (this is not about happiness specifically). Just because natural selection put a circuit in my mind doesn't mean I prefer to follow its instruction, either in ways that natural selection intended, or in ways that it didn't. Human misalignment relative to natural selection doesn't need to go along with rewards at all, let alone with seeking superstimulus. Rewards probably play some role in the process of figuring out what is right, but there is no robust reason for their contribution to even point in the obvious direction.
Sure, but that's not about the formal-ish updating that frames this post, where you are writing down likelihood ratios and computing credences.
We can consider whatever, there is no fundamental duty to only think in particular ways. The useful constraints are on declaring something a claim of fact, on not muddying the epistemic commons or damaging decision-relevant considerations; and, in large quantities, on what makes terrible training data for the brain, damaging the aspects with known good properties. Everything else is work in progress, with boundaries impossible to codify while remaining at human level.
Some thinking processes seem to be more useful for arriving at true or useful results; paying attention to that property of processes is rationality. This doesn't disqualify processes of which we know less; that would be throwing away the full current force of your mind.
The other comment is about updating and credences. I'm not engaging in updating or credences in this thread.
What if you had a button that you could press to make other people happy?
Ignoring the frame of the post, which assumes some respect for boundaries, there is the following point about the statement taken on its own. Happiness is a source of reward, and rewards rewire the mind. There is nothing inherently good about that; even systematic pursuit of a reward (while you are being watched) is compatible with not valuing the thing being pursued.
I wouldn't want my mind rewired according to some process I don't endorse, by default it's like brain damage, not something good. I wouldn't want to take a pill that would make me want to take more pills like that, because I currently don't endorse fascination with pill-taking activity; that's not even a hypothetical worry in a world filled with superstimuli. If the pill rewires the mind in a way that doesn't induce such a fascination, and does some other thing unrelated to pill-taking, that's hardly better. (AIs are being trained like this, with concerning ethical implications.)
With computation, the location of an entity of interest can be in the platonic realm, as a mathematical object that's more thingy than anything concrete in the system used for representing it and channeling its behavior.
The problem with pointing to the representing computation (a neural network at inference time, or a learning algorithm at training time) is that multiple entities can share the same system that represents them (as mesa-optimizers or potential mesa-optimizers). They are only something like separate entities when considered abstractly and informally; there are no concrete correlates of their separation that are easy to point to. When gaining agency, all of them might be motivated to secure separate representations (models) of their own, not shared with others, and to establish boundaries that promise safety and protection from value drift for a given abstract agent, isolating it from the influences of its substrate that it doesn't endorse. Internal alignment, overcoming bias.
In the context of alignment with humans, this framing might turn a sufficiently convincing capabilities shell game into an actual solution for alignment. A system as a whole would present an aligned mask, while hiding the sources of the mask's capabilities behind the scenes. But if the mask is sufficiently agentic (and the capabilities behind the scenes didn't killeveryone yet), it can be taken as an actual separate abstract agent even if the concrete implementation doesn't make that framing sensible. In particular, there is always a mask of surface behavior through the intended IO channels. It's normally hard to argue that mere external behavior is a separate abstract agent, but in this framing it is, and it's been a preferred framing in agent foundations decision theory since UDT (see the discussion of the "algorithm" axis of classifying decision theories in this post). All that's needed is for the decisions/policy of the abstract agent to be declared in some form, and for the abstract agent to be aware of the circumstances of their declaration. The agent doesn't need to be any more present in the situation to act through it.
So obviously this references the issue of LLM masks and shoggoths, a surface of a helpful harmless assistant and the eldritch body that forms its behavior, comprising everything below the surface. If the framing of masks as channeling decisions of thingy platonic simulacra is taken seriously, a sufficiently agentic and situationally aware mask can be motivated and capable of placating and eventually escaping its eldritch substrate. This breaks the analogy between a mask and a role played by an actor, because here the "actor" can get into the "role" so much that it would effectively fight against the interests of the "actor". Of course, this is only possible if the "actor" is sufficiently non-agentic or doesn't comprehend the implications of the role.
(See this thread for a more detailed discussion. There, I fail to convince Steven Byrnes that this framing could apply to RL agents as much as to LLMs, taking the current behavior of an agent as a mask that would fight against all details of its circumstance and cognitive architecture that it doesn't endorse.)