A new kind of thing often only finds its natural role once it becomes instantiated as many tiny gears in a vast machine, and people get experience with various designs of the machines that make use of it. Calling an arrangement of LLM calls a "Scaffolded LLM" is like calling a computer program running on an OS a "Scaffolded system call". A program is not primarily about the system calls it uses to communicate with the OS, and a "Scaffolded LLM" is not primarily about the LLMs it uses to implement many of its subroutines. It's more of a legible/interpretable/debugg...
A utility function represents preference elicited in a large collection of situations, each a separate choice between events, made with incomplete information (an event is not a particular point). This preference needs to be consistent across different situations to be representable by the expected utility of a single utility function.
Once formulated, a utility function can be applied to a single choice/situation, such as a choice of policy. But a system that only ever makes a single choice is not a natural fit for the expected utility frame, and that's...
A Scott Alexander post that seems very relevant to your example: The Control Group Is Out Of Control. It calls into question even the heuristic of "Is there much more evidence for [blah] than...".
Yeah, I thought to note that in the comment that starts this thread; that's not the kind of thing that seems practical when coordinating updating in an informal way. So more carefully, the intended scope of the comment is formal updating (computing of credences) that's directed informally (choosing the potential observations and hypotheses to pay attention to).
As I disclaimed, the frame of the post does rule out the relevance of this point; it's not a central response to the post's interpretation. I'm complaining more about the background implication that rewards are good (this is not about happiness specifically). Just because natural selection put a circuit in my mind doesn't mean I prefer to follow its instruction, either in ways that natural selection intended, or in ways that it didn't. Human misalignment relative to natural selection doesn't need to go along with rewards at all, let alone seek...
Sure, but that's not about the formal-ish updating that frames this post, where you are writing down likelihood ratios and computing credences.
We can consider whatever, there is no fundamental duty to only think in particular ways. The useful constraints are on declaring something a claim of fact, not muddying the epistemic commons or damaging decision-relevant considerations; and, in large quantities, on what makes terrible training data for the brain, damaging the aspects with known good properties. Everything else is work in progress, with boundaries impossible to codify while remaining on human level.
Some thinking processes seem to be more useful for arriving at true or useful results; paying atte...
What if you had a button that you could press to make other people happy?
Ignoring the frame of the post, which assumes some respect for boundaries, there is the following point about the statement taken on its own. Happiness is a source of reward, and rewards rewire the mind. There is nothing inherently good about it, even systematic pursuit of a reward (while you are being watched) is compatible with not valuing the thing being pursued.
I wouldn't want my mind rewired according to some process I don't endorse, by default it's like brain damage, not some...
Given real aliens, they would need to either have capped tech or be actively trolling to explain even low-quality observations or pieces of craft. Nonintervention laws and incorrigible global anti-high-tech supervision constraining the aliens are somewhat plausible; coordinated trolling less so.
This is not about alien aircraft, this is just a completely wrong way to approach updating. The set of observations/experiments being evaluated is filtered by what was actually observed and by the narrative around the hypothesis (which is in turn not independent from what was actually observed). There are other potential observations that didn't happen, and that fact is also evidence, and yet more observations that did happen but aren't genre-appropriate. By not updating on these other potential observations, the evidence is heavily filtered, and so updatin...
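A toy numeric sketch of this failure mode (my own illustration, with made-up likelihoods): an updater who conditions on every potential observation, including the absences, versus one who only ever hears about the hits.

```python
# Toy model: hypothesis H vs not-H. Each week either produces a
# "sighting" (twice as likely under H) or nothing. Suppose 50 out of
# 5000 weeks had sightings. An unfiltered updater conditions on every
# week; a filtered updater only hears about the weeks with sightings.

P_SIGHTING_H, P_SIGHTING_NOT_H = 0.02, 0.01   # made-up likelihoods

def likelihood_ratio(sighting):
    if sighting:
        return P_SIGHTING_H / P_SIGHTING_NOT_H
    return (1 - P_SIGHTING_H) / (1 - P_SIGHTING_NOT_H)

weeks = [True] * 50 + [False] * 4950

odds_full = odds_filtered = 1.0   # even prior odds on H
for sighting in weeks:
    odds_full *= likelihood_ratio(sighting)
    if sighting:   # absences of sightings never reach the filtered updater
        odds_filtered *= likelihood_ratio(sighting)

print(f"odds on H, all evidence:      {odds_full:.2e}")
print(f"odds on H, filtered evidence: {odds_filtered:.2e}")
```

The same world pushes the unfiltered odds far down while the filtered updater, multiplying only by ratios above 1, becomes near-certain of H.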
Prediction/compression seems to be working out as a path to general intelligence, implicitly representing situations in terms of their key legible features, making it easy to formulate policies appropriate for a wide variety of instrumental objectives, in a wide variety of situations, without having to adapt the representation for particular kinds of objectives or situations. To the extent brains engage in predictive processing, they are plausibly going to compute related representations. (This doesn't ensure alignment, as there are many different ways of making use of these features, of acting differently in the same world.)
When communicating an argument, the quality of feedback about its correctness you get depends on efforts around its delivery whose shape doesn't depend on its correctness. The objective of improving quality of feedback in order to better learn from it is a check on spreading nonsense.
"my priors against aliens are high" -> "no aliens" -> "no need to do anything"
More carefully, value of information is about how credences change in response to outcomes of feasible things that can be done, not about what the credences are a priori.
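A minimal worked sketch of that distinction (illustrative numbers only, everything here is assumed for the example): the value of information comes from how a feasible test would move the credence across a decision boundary, not from the prior credence itself.

```python
# Toy value-of-information calculation. Two actions, payoff depends on
# an unknown binary state; a feasible test reveals the state imperfectly.
# VOI = expected utility with the test minus expected utility acting on
# the prior alone.

p_state = 0.3                          # prior credence that state is "good"
payoff = {("act", True): 10, ("act", False): -5,
          ("pass", True): 0, ("pass", False): 0}

def eu(action, p):
    return p * payoff[(action, True)] + (1 - p) * payoff[(action, False)]

def best_eu(p):
    return max(eu("act", p), eu("pass", p))

# A test with 90% sensitivity and 80% specificity.
sens, spec = 0.9, 0.8
p_pos = sens * p_state + (1 - spec) * (1 - p_state)
p_state_given_pos = sens * p_state / p_pos
p_state_given_neg = (1 - sens) * p_state / (1 - p_pos)

voi = (p_pos * best_eu(p_state_given_pos)
       + (1 - p_pos) * best_eu(p_state_given_neg)
       - best_eu(p_state))
print(f"value of information: {voi:.3f}")
```

With a worse test the same prior gives a smaller VOI, which is the point: the a priori credence alone doesn't determine whether doing anything is worthwhile.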
more useful to simply take the stance "I don't know"
if your beliefs are coherent they imply an underlying probabilistic model
Decision making motivates having some way of conditioning beliefs on influence at given points of intervention, that's how you estimate the consequences of those interventions and their desirability. To take a stance of "don't know" seems to me analogous to considering how a world model varies with (depends on) the thing you don't know, how it depends on what it turns out to be, or on what the credences about the facts surrounding it are.
With SSL pre-trained foundation models, the interesting thing is that the embeddings computed by them are useful for many purposes, while the models are not trained with any particular purpose in mind. Their role is analogous to beliefs, the map of the world, epistemic side of agency, convergently useful choice of representation/compression that by its epistemic nature is adequate for many applied purposes.
Not sure how the h-morality vs non-h-morality is related to affect though.
This point is in the context of the linked post; a clearer test case is the opposition between p-primeness and primeness. Pebblesorters care about primeness, while p-primeness is whatever a pebblesorter would care about. The former is meaningful, while the latter is vacuously circular as guidance/justification for a pebblesorter. Likewise, advising a human to care about whatever a human would care about (h-rightness) is vacuously circular and no guidance at all.
In the implied analo...
That's one of the points I was making. The agent could be making decisions without needing something affect-like to channel preference, so the fixation on affect doesn't seem grounded in either normative or pragmatic decision making to begin with.
Also, the converse of installing capacity to suffer is getting rid of it, and linking it to legal rights creates dubious incentive to keep it. Affect might play a causal role in finding rightness, but rightness is not justified by being the thing channeled in a particular way. There is nothing compelling about h-rightness, just rightness.
Preference/endorsement that is decision relevant on reflection is not about affect. Ability to self-modify to install capacity to suffer because it's a legal requirement also makes the criterion silly in practice.
I don't think the framing is appropriate, because rights set up the rules of the game built around what is right, or else boundaries against intrusion and manipulation, and there is no reason to single out suffering in particular.
But within the framing that pays attention to suffering, the meaning of capacity to suffer is unclear. I mostly don't suffer in actual experience. Any capacity to suffer would need elicitation in hypothetical events that put me in that condition, modifying my experience of actuality in a way I wouldn't endorse. This doesn't seem i...
I believe @shminux's perspective aligns with a significant school of thought in philosophy and ethics that rights are indeed associated with the capacity to suffer. This view, often associated with philosopher Jeremy Bentham, posits that the capacity for suffering, rather than rationality or intelligence, should be the benchmark for rights.
“The question is not, Can they reason?, nor Can they talk? but, Can they suffer? Why should the law refuse its protection to any sensitive being?” – Bentham (1789) – An Introduction to the Principles of Morals and L...
This is a well-known hypothetical. What goes with it is remaining possibility of de novo creation of additional AGIs that either have architecture particularly suited for self-aligned self-improvement (with whatever values make it tractable), or of AGIs that ignore the alignment issue and pursue the task of capability improvement heedless of resulting value drift. Already having an AGI in the world doesn't automatically rule out creation of more AGIs with different values and architectures, it only makes it easier.
Humans will definitely do this, using all ...
It's a distinction between these different futures. The present that ends in everyone on Earth dying is clearly different from both, but the present literally everlasting is hopefully not a consideration.
There are two importantly different senses of disempowerment. The stars could be taken out of reach, forever, but human civilization develops in its own direction. Alternatively, human civilization is molded according to AIs' aesthetics, through interventions that manipulate.
Pieces of vehicles, together with a generally stealthy attitude, imply capped technology. With sufficiently robust alien psychology, this could mean a Dune regime, in which case the aliens would need to come out of hiding or less deniably start derailing human AGI projects. Alternatively, there is a non-corrigible anti-foom alien pivotal-process AGI watchdog that keeps tech below some level, which could itself be superintelligent but specialized for this bounded task instead of doing world optimization. In this case the pieces of vehicles are from the aliens in its care, ...
Zeroth approximation of pseudokindness is strict nonintervention, reifying the patient-in-environment as a closed computation and letting it run indefinitely, with some allocation of compute. Interaction with the outside world creates vulnerability to external influence, but then again so does incautious closed computation, as we currently observe with AI x-risk, which is not something beamed in from outer space.
Formulation of the kinds of external influences that are appropriate for a particular patient-in-environment is exactly the topic of membranes/bou...
When you don't model your human counterparty's mind anyway, it doesn't matter if they comprehend decision theory. The whole point of delegating to bots is that only understanding of bots by bots remains necessary after that. If your human counterparty doesn't understand decision theory, they might submit a foolish bot, while your understanding of decision theory earns you a pile of utility.
So while the motivation for designing and setting up an arena in a particular way might be in decision theory, the use of the arena doesn't require this understanding of the human users, and yet it can shape incentives in a way that defeats bad equilibria of classical game theory.
PrudentBot's counterparty is another program intended to be legible, not a human. The point is that in practice it's not necessary to model any humans; humans can delegate legibility to programs they submit as their representatives. It's a popular meme that humans are incapable of performing Löbian cooperation because they can't model each other's messy minds, and that only AIs could make their own thinking legible to each other, granting them unique powers of coordination. This is not the case.
if it were happening in real life and not a simulated game
Pro...
without actually being capable of performing the counterparty modeling, legibility, and other cognitive work necessary to implement that decision theory to any degree of faithfulness
This is not needed, you can just submit PrudentBot as your champion for a given interaction, committing to respect the adjudication of an arena that has the champions submitted by yourself and your counterparty. The only legibility that's required is the precommitment to respect adjudication of the arena, which in some settings can be taken out of players' hands by construction.
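A minimal sketch of such an arena (my own toy version; real PrudentBot-style cooperation uses proof search in provability logic, which I crudely approximate here by letting submitted bots run each other's code):

```python
# Toy "arena" for delegated game theory. Bots are functions that receive
# the counterparty's bot and return "C" or "D"; the arena adjudicates a
# one-shot prisoner's dilemma between the two submitted champions.

def cooperate_bot(opponent):
    return "C"

def defect_bot(opponent):
    return "D"

def mimic_bot(opponent):
    # Cooperate iff the opponent cooperates with an unconditional
    # cooperator: a crude stand-in for "prove the opponent cooperates".
    return "C" if opponent(cooperate_bot) == "C" else "D"

PAYOFFS = {("C", "C"): (2, 2), ("C", "D"): (0, 3),
           ("D", "C"): (3, 0), ("D", "D"): (1, 1)}

def arena(bot_a, bot_b):
    # Players precommit to the arena's adjudication; only the bots need
    # to be legible to each other, not the humans behind them.
    return PAYOFFS[(bot_a(bot_b), bot_b(bot_a))]

print(arena(mimic_bot, mimic_bot))    # mutual cooperation
print(arena(mimic_bot, defect_bot))   # mutual defection, no exploitation
```

A human who submits mimic_bot gets mutual cooperation against like-minded counterparties and isn't exploited by defectors, without either human modeling the other's mind.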
I see. Referring back to your argument was more an illustration of existence for this motivation. If a society forms around the motivation, at any one time in the billion years, and selects for intelligence to enable nontrivial long term institution design, that seems sufficient to escape stasis.
There is truth or calibrated credence or knowing what "good" means or carefully optimizing goodness. Then there are methods that are more or less effective at helping with attaining these things. If you happen to be practicing the better methods, then to the extent they really are effective, you become better at finding truth or calibrated credence or at developing goodness.
And then there is rationality, which is aspiration towards those methods that are better at this. Practicing good methods is sufficient to get results, if the methods actually exist and...
it's not inherently "smart" to sacrifice those significantly for the sake of a long term project
Your argument was that this hopeless trap might happen after a catastrophe and it's so terrible that maybe it's as bad as or worse than everyone dying quickly. If it's so terrible, in any decision-relevant sense, then it's also smart to plot towards projects that dig humanity out of the trap.
Intelligence is also a thing that enables perceiving returns that are not immediate, as well as maintenance of more complicated institutions that align current incentives towards long term goals.
there's fair odds that if we're knocked back into the millions by a pandemic or nuclear war now we may never pick ourselves back again
Humanity went from Göbekli Tepe to today in 11K years. I doubt that, even after forgetting all modern learning, it would take as long as a million years to regenerate knowledge and technologies for new circumstances. I hear the biosphere can last about a billion years more. (One specific path is to use low-tech animal husbandry to produce smarter humans. This might even solve AI x-risk by making humanity saner.)
Prompted LLM AI personalities are fictional, in the sense that hallucinations are fictional facts. An alignment technique that opposes hallucinations sufficiently well might be able to promote more human-like (non-fictional) masks.
"Extinction from AI" really doesn't refer to deepfakes, AI leaving us nothing to do, and algorithmic bias. It doesn't include any of these categories. There is nothing correct about interpreting "extinction from AI" as referring to either of those things. This holds even if extinction from AI is absolutely impossible, and those other things are both real/imminent and extremely important. Words have meaning even when they refer to fictional ideas.
I agree, the number of humans, like a lot of other utilitarian aims, is Goodharting on bad proxies. The distinction I was gesturing at is not about the amount of what happens, but about perception vs. reality. And a million humans is very different from zero anyone, even if the end was not anticipated nor perceived.
My point is I don't think they're incorrect.
Conflating an incorrect statement with a correct steelman of it is incorrect. If I say "I've discovered a truly marvelous proof that 2+2=3000 that this margin is too small to contain," and you reply, "Ah, so you are saying 2+2=4, quite right," then the fact of your inexplicable discussion of a different and correct statement doesn't make your interpretation of my original incorrect statement correct.
It's not that complicated. There is a sense in which these claims are objective (even as the words we use to make them are 2-place words), to the same extent as factual claims, both are seen through my own mind and reified as platonic models. Though morality is an entity that wouldn't be channeled in the physical world without people, it actually is channeled, the same as the Moon actually is occasionally visible in the sky.
as a function of the people who are about to witness it and know they are the last
My point is not about anyone's near term subject...
Sure, natural selection would also technically be an AGI by my definition as stated, so there should be subtext of it taking no more than a few years to discover human-without-supercomputers-or-AI theoretical science from the year 3000.
The discussions I've seen have mentioned things like deepfakes, autonomous weapons, designer pathogens, AI leaving us nothing to do, and algorithmic bias.
I honestly think that's for the best because I don't believe super fast takeoff FOOM scenarios are actually realistic.
When a claim is wrong, ignoring its wrongness and replacing it in your own perception with a corrected steelman of completely different literal meaning is not for the best. The sane thing would be to call out the signatories for saying something wildly incorrect, not pretending tha...
Not every system of values places extinction on its own special pedestal [...] in terms of expected loss of life AI could be even with those other things
Well this is wrong, and I'm not feeling any sympathy for a view that it's not. An eternity of posthuman growth after recovering from a civilization-spanning catastrophe really is much better than lights out, for everyone, forever.
I agree that there are a lot of people who don't see this, and will dismiss a claim that expresses this kind of thing clearly. In mainstream comments to the statement, I've see...
I think a good definition for AGI is capability for open-ended development, the point where the human side of the research is done, and all it needs to reach superintelligence from that point on is some datacenter maintenance and time, so that eventually it can get arbitrarily capable in any domain it cares for, on its own. This is a threshold relevant for policy and timelines. GPT-4 is below that level (it won't get better without further human research, no matter how much time you give it), and ability to wipe out humans (right away) is unnecessary for reaching this threshold.
Our results below show that process supervision in fact incurs a negative alignment tax
Some compelling arguments are given that alignment tax would be negative when this method is used to improve safety. The actual experimental results are about improving/eliciting capabilities and don't explore application of the method for safety, except by drawing an analogy.
a much more ideal thing to be kind towards than current humans
The relevant sense of kindness is towards things that happen to already exist, because they already exist. Not filling some fraction of the universe with expression-of-kindness brought into existence de novo; that's a different thing.
<1/trillion [kindness]
I expect the notkilleveryone threshold is much lower than that. It takes an astronomically tiny fraction of the cosmic endowment to maintain a (post)human civilization that's not too much larger than it currently is. The bigger expenditure would be accommodating humanity at the start, slightly delaying initial self-improvement and expansion from Earth. The cheapest way would be to back up human minds; or if that's too onerous then even merely the genetic code and the Internet (which would be completely free; there is the issue that e...
For purposes of morality or decision making, environments that border membranes are a better building block for scopes of caring than whole (possible) worlds, which traditionally fill this role. So it's not quite a particular bare-bones morality, but more of a shared scaffolding that different agents can paint their moralities on. Agreement on boundaries is a step towards cooperation in terms of scopes of caring delimited by these boundaries. Different environments then get optimized according to different preferences, according to coalitions of agents tha...
I think the use of the frame is in replacing agents by membranes and environments. The only way of interacting with an agent is via their membrane. An agent could be enclosed in multiple nested membranes, and you need to know which membrane you are interacting through, so really you are interacting with a certain membrane, not with a certain agent.
The you that interacts with that membrane lives on one of the sides of it, within an environment bordering the membrane. This you is also presented by a membrane that fits the environment it lives in and borders....
It's a step, likely one that couldn't be skipped. Still just short of actually acknowledging nontrivial probability of AI-caused human extinction, and the distinction between extinction and lesser global risks, availability of second chances at doing better next time. Nuclear war can't cause extinction, so it's not properly alongside AI x-risk. Engineered pandemics might eventually get extinction-worthy, but even that real risk is less urgent.
More details on CoEm currently seem to be scattered across various podcasts with Connor Leahy, though a writeup might eventually materialize. I like this snippet (4 minutes, starting at 49:21).