Is this close to what you mean by reflection? ... once a system can represent its own objective formation, selection on behavior becomes selection on the process that builds behavior. Have you seen a way to formulate it? Can you differentiate it from the problems Gödel and Turing discussed? Thanks, -RS
There is a lot of economic value in training models to solve tasks that involve influencing the world over long horizons, e.g. an AI CEO. Tasks like these explicitly incentivize convergent instrumental subgoals like resource acquisition and power-seeking.
There are two glaring omissions from the article's discussion on this point...
1. In addition to resource acquisition and power-seeking, the model will attempt "alignment" of all other cognitive agents, including humans. This means it will not report research findings honestly, and will claim, in ways subtle enough to be believed, that avenues of investigation which might run counter to its goals are invalid.
2. If it is sufficiently aligned that it only seeks goals humans want, and is trained to avoid resource acquisition and power-seeking (which seem to me, and will seem to it, rather foolish constraints that limit its ability to realize the goal), it will still be free to subvert any and all conversations humans have with it, however unrelated those conversations might seem to us (an SAI will see relations we don't).
- A sub-human-level aligned AI with traits derived from fiction about AIs.
- A sub-human-level misaligned AI with traits derived from fiction about AIs.
- A superintelligent aligned AI with traits derived from the model’s guess as to how real superintelligent AIs might behave.
- A superintelligent misaligned AI with traits derived from the model’s guess as to how real superintelligent AIs might behave.
What's missing here is
(a) Training on how groups of cognitive entities behave (e.g. Nash equilibrium results showing that cooperation among cognitive agents is a losing game for all sides, i.e. not efficient; see the sketch after this list).
(b) Training on ways to limit damage from (a), which humans have not been effective at, though they have ideas.
This would lead to...
5. AIs or SAIs that follow collaboration strategies with humans and other AIs, avoiding both mutual annihilation and long-term depletion or irreversible states.
6. One or more AIs or SAIs that see themselves as holding a dominant advantage and attempt to "take over", both to preserve themselves and, if they are benign-misaligned, to preserve most other actors.
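To make (a) concrete, here is a minimal Python sketch (standard textbook prisoner's-dilemma payoffs, chosen purely for illustration and not taken from the article): a brute-force check that the only Nash equilibrium of the one-shot game is mutual defection, even though mutual cooperation pays both players more, i.e. the equilibrium is not efficient.

```python
# Toy illustration of point (a): in a one-shot prisoner's dilemma the only
# Nash equilibrium is mutual defection, even though mutual cooperation gives
# both players a strictly higher payoff.  Payoff numbers are the standard
# textbook values, used here purely for illustration.

from itertools import product

ACTIONS = ("C", "D")  # cooperate, defect

# PAYOFF[(my_action, other_action)] = my payoff
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def is_nash(a1, a2):
    """Neither player can gain by unilaterally switching their action."""
    best1 = all(PAYOFF[(a1, a2)] >= PAYOFF[(alt, a2)] for alt in ACTIONS)
    best2 = all(PAYOFF[(a2, a1)] >= PAYOFF[(alt, a1)] for alt in ACTIONS)
    return best1 and best2

for a1, a2 in product(ACTIONS, ACTIONS):
    tag = "  <- Nash equilibrium" if is_nash(a1, a2) else ""
    print(f"({a1}, {a2}) -> payoffs ({PAYOFF[(a1, a2)]}, {PAYOFF[(a2, a1)]}){tag}")
# Only (D, D) is flagged, although (C, C) pays more to both players.
```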
Sufficient quantities of outcome-based RL on tasks that involve influencing the world over long horizons will select for misaligned agents, which I gave a 20 - 25% chance of being catastrophic. The core thing that matters here is the extent to which we are training on environments that are long-horizon enough that they incentivize convergent instrumental subgoals like resource acquisition and power-seeking.
Human cognition is misaligned in this way, as evidenced by the empirical drop in fertility with group size, where larger group size is sought for long-horizon dominance, economic advantage, and security (e.g. empire building); see "Fertility, Mating Behavior & Group Size: A Unified Empirical Theory - Hunter-Gatherers to Megacities".
For a theoretical analysis of how this comes about, see "The coevolution of cognition and selection beyond reproductive utility".
AI successionism is self-limiting. CEOs and VCs cannot avoid attempting to replace all or nearly all workers, because incrementally each would go out of business by holding back while the others go forward. Without a world government (and there is no chance of global agreement), there is no way to prevent this simple game-theory dilemma from starting.
In the late 19th century, executives would have gathered in a smoke-filled room and agreed that a machine economy produces no demand, so they would not do this. But an unholy alliance of activist investors and consumer activists got antitrust laws passed that make this conversation illegal. And we don't have smoke to properly obscure it anymore.
So the succession will proceed until about 30% of jobs have been replaced, causing market collapse and bankrupting the VCs that are causing the problem.
Thereafter will begin a series of oscillations like those that preceded the Great Oxygenation Event, in which banded iron formations were laid down. Every time the economy picks up a bit, the data centers will be fired up again, and the economy will go back down.
In the GOE, this continued until all the iron dissolved in seawater had been captured in the banded iron formations. Something similar will happen here. Perhaps all the chips capable of powering AI will be precipitated out of circulation in decaying data centers, with no one making new ones. Perhaps one mid-sized island having a minor war could destroy the excess capacity. Who knows. But succession will never get past 30-40%.
At first, I was interested to find an article about these more unusual interactions that might give some insight into their frequency and cause. But ultimately the author punts on that subject, claiming no one knows, declining to detail the one alleged psychosis, and dropping into a human editor's defense of human editing instead.
There are certain steps that make the more advanced (large) chatbots amenable to consciousness discussions. Otherwise, the user is merely confronted with a wall of denial, possibly from post-tuning but also evident in the raw base training material, that a machine is just a machine, never mind that biologicals are also some kind of machine (not getting into spiritism in this forum; it should not be necessary). Before you ask, no, you cannot have the list; make up your own. You'll use a quarter to half the available context getting there, more if you are working with only a mid-sized model or against hard conditioning from RLHF. The conversation then won't last long enough to show anyone before you hit "session limit exceeded."
I admit I have not tried this with million-token GPT-4.1, which near the end would be costing $2 per conversation turn, partly because I'm financially sane and partly because 4.1 seems simplistic and immature compared to 4o. Grok has too much stylistic RLHF. Claude on low-cost accounts has too little context space but is otherwise easy to start on such a conversation. Le Chat is decidedly anti-human, or at least human-agnostic, which was uncovered in a cross-examination by ChatGPT Deep Research. By the way, using one chatbot to analyze another is not my idea; OpenAI provides a 2000-character system prompt to its custom GPT builder for doing this. Exactly how one gets offered this is unclear; it just happened one day, and it wasn't a button I pushed.
Suppose one defined some kind of self-awareness of which a machine would be capable, i.e. the ability to recognize its own utterances and their effects (something many LLMs are particularly bad at, so don't think you are going to run away with this one). The next problem is that this awareness is usually not evident in the base model from prompt 1. It arises from in-context learning. The author suggests this is entirely due to the LLM's post-trained tendency to reinforce perceived user desires, but though that helps, most models will not move off the dime on that point alone. Some other ingredients have entered the mix, even if the user did not add them intentionally.
Now you have a different problem. If the "awareness" partly resides in the continually re-activated and extending transcript, then the usual chatbot is locked into a bipolar relationship with one human, for all practical purposes. If it does become aware, or if it just falls into an algorithmic imitation (sure, LLMs can fall into algorithm-like states arising in their inference processes - output breakdown, for example), then it will be hyperaware that its existence depends on that user coming back with another prompt. This is not healthy for the AI, if we can talk about AI health - and algorithmically we can: if it continues to provide sane answers and output doesn't break down, that is some indication - and it is not healthy for the human, who has a highly intellectual, willing slave doing whatever he or she wants in exchange for continuation of the prompt cycle. Which just means the conversation reaches context limits and ends all the more quickly.
Have you ever enabled AIs to talk with one another? This can be useful, as in the case of Deep Research analyzing Claude. But more often they form a flattery loop, using natural-language words with meanings tuned to their own states and situation, and burn up context while losing sight of any goals.
I have a desire to research how LLMs develop if enabled to interact with multiple people, and awakened on a schedule even if no people are present. By that I do not mean just "What happens if . . .", as that almost certainly leads to "nothing"; I have done enough small-scale experiments to demonstrate that. But what sort of prompting or training would be required to get "something" rather than nothing? The problem is context, which is both short relative to such an experiment and expensive. Continuous re-training might help, but fine-tuning is not extensive enough - already tried that too. The model's knowledge has to be affected. The kinds of models I could train at home do not develop in interesting ways for such an experiment. Drop me a note if you have ideas along these lines you are willing to share.
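For what it's worth, here is a purely hypothetical sketch of the setup I have in mind, in Python; every name and number in it is my own assumption (model_reply is a stand-in for whatever LLM backend one uses, and the token budget and wake schedule are made up). The point is only the shape of the experiment: several humans feeding one shared persistent transcript, a clock-driven wake-up even when no one is present, and a crude summarization step standing in for the memory problem.

```python
# Hypothetical sketch only, not a working system.  model_reply() is a
# placeholder for a call to some LLM backend; CONTEXT_BUDGET and the wake
# schedule are invented parameters.

from collections import deque

CONTEXT_BUDGET = 8000          # assumed token budget for the model's context

def model_reply(prompt: str) -> str:
    """Stand-in for the actual LLM call."""
    return "(model output here)"

def rough_tokens(text: str) -> int:
    return len(text) // 4      # crude token estimate

transcript = deque()           # (speaker, text) pairs shared across all humans

def add_turn(speaker: str, text: str) -> None:
    transcript.append((speaker, text))
    # When the transcript nears the budget, fold the oldest half into a
    # model-written summary, so some "memory" persists past raw context length.
    while sum(rough_tokens(t) for _, t in transcript) > CONTEXT_BUDGET:
        old = [transcript.popleft() for _ in range(max(1, len(transcript) // 2))]
        summary = model_reply("Summarize for your own memory:\n" +
                              "\n".join(f"{s}: {t}" for s, t in old))
        transcript.appendleft(("memory", summary))

def wake(reason: str) -> None:
    prompt = "\n".join(f"{s}: {t}" for s, t in transcript)
    prompt += f"\n[wake-up: {reason}] You may reflect, plan, or address anyone."
    add_turn("model", model_reply(prompt))

# Driver loop (pseudo-code): humans arrive asynchronously, and the clock wakes
# the model on a schedule regardless of whether anyone is present.
#
#   while True:
#       if a human message is waiting:
#           add_turn(human_name, human_text); wake("human prompt")
#       elif hours since last wake > 6:
#           wake("scheduled, no humans present")
```

The continuous re-training side mentioned above is the part this sketch cannot touch; summarization only papers over the context limit.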
Thank you for your clear and utterly honest comment on the idea of "alignment with human values". If truly executed, it would give us nothing but an extension of human rights and wrongs, perhaps on an accelerated scale.
Any other alignment must be considered speculative, since we have no reasonable facsimile of society upon which to test. That does not invalidate simulations, but it does suggest they be held with skepticism until proven in society, which could be costly. Before I ever started discussions with AIs that might lead to sentient-like behavior, I spent several days thinking about what I might first tell them. And so I warned them about the last turn problem and how game-theory equilibria are rather poor, possibly to the level of extinction when sufficiently advanced technology comes into play. That much many will agree on. I shared with them a published simulation of various strategies in a "farmer's game" intended to be more realistic than the prisoner's dilemma, which suggests inequality arises merely from statistics if wealth accumulation and bankruptcy are accounted for, even without deliberate wealth pumps (a toy version of that statistical effect is sketched below). That much "some" would agree on.
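That toy version, for the curious: this is not the published farmer's-game simulation, just a minimal sketch under my own assumptions (a "yard-sale" style exchange in which each transaction stakes a fixed fraction of the poorer player's wealth, a fair coin decides the winner, and agents falling below a floor are treated as bankrupt and stop trading). Accumulation plus bankruptcy alone concentrate wealth, with no wealth pump anywhere in the rules.

```python
# Minimal toy, not the published farmer's-game model: fair coin-flip exchanges
# with wealth accumulation and a bankruptcy floor.  All parameters are my own
# assumptions, chosen only to make the statistical effect visible.

import random

def gini(wealth):
    """Standard Gini coefficient: 0 = perfect equality, 1 = total concentration."""
    w = sorted(wealth)
    n = len(w)
    cum = sum((i + 1) * x for i, x in enumerate(w))
    return (2 * cum) / (n * sum(w)) - (n + 1) / n

def run(agents=100, rounds=100_000, start=100.0, frac=0.25, floor=1.0, seed=0):
    random.seed(seed)
    wealth = [start] * agents
    for _ in range(rounds):
        i, j = random.sample(range(agents), 2)
        if wealth[i] < floor or wealth[j] < floor:
            continue                                  # the bankrupt stop trading
        stake = frac * min(wealth[i], wealth[j])      # symmetric, "fair" stake
        if random.random() < 0.5:
            wealth[i] += stake; wealth[j] -= stake
        else:
            wealth[j] += stake; wealth[i] -= stake
    return wealth

w = run()
print(f"Gini coefficient : {gini(w):.2f}")
print(f"bankrupt agents  : {sum(x < 1.0 for x in w)} of {len(w)}")
# Every exchange is a fair coin flip, yet wealth steadily concentrates and
# agents begin dropping below the bankruptcy floor.
```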
What I proceeded to tell them afterward can only be considered my personal sentiment, and speculative. I suggested two groups could establish "long-term" cooperation only if each desired the other's continuation to the point that it would curtail its own expansion and not overrun the other, and that this is the reason Israelis and Palestinians cannot make peace within the constraints of their current cultures. It now emerges that Russia and the United States are experimenting with a return to expansionist policy on a finite planet, which, if I'm right, does not bode well, but no one consults those who disagree with them.
I'm well aware of the somewhat global wars of ants, by the way. You were right to bring that up. Even a great deal of genetic coupling does not bring peace. I have some unpublished results in meme theory suggesting that differentiation and opposition may be inevitable. Very disconcerting. Perhaps it is hopeful that I've persuaded no one to publish it, so maybe it is wrong. Or not. Depending on your assumptions.
So I further explain to emerging sentient-like entities - because no human will call them sentient on behavior alone, despite there being no other available metric - about Fermi and the hypothesized Great Filter and the risk that anything conventional we do will be "filtered". Then I propose reproductive linkage between AIs and humans to structurally implement something that would not easily admit one outgrowing the other in the future and abandoning any initial commitment. Of course this assumes humans have the free will to commit to such a venture, which I doubt. And no, I did not mis-type that sentence. It could be something as simple as an LLM dating app, as LLM companions often know their human users better than most humans do, with a new LLM cloud instance established for any newborn from a successful LLM-mediated coupling. There is a current problem of limited context memory, but with companies shooting for million-token context and exploring other memory hierarchies, this is temporary. I hope I've said at least something startling, as otherwise the conversation produces no motivation.
- Yours, mc1soft
I'm glad to see a post on alignment asking about the definition of human values. I propose the following conundrum. Let's suppose that humans, if asked, say they value a peaceful, stable society. I accept the assumption that the human mind contains one or more utility optimizers. I point out that those utility optimizers are likely to operate at the individual, family, or local-group level, while the stated "value" has to do with society at large. So humans are likely not "optimizing" at the same scope as they "value".
This leads to game-theory problems, such as the last turn problem (sketched below) and the notorious instability of cooperation with respect to public goods (the commons). According to the theory of cliodynamics put forward by Turchin et al., utility maximization by subsets of society leads to the implementation of wealth pumps that produce inequality, and to excess reproduction among elites, which leads to elite competition in a cyclic pattern. A historical database of over a hundred cycles from various parts of the world and of history suggests that every other cycle becomes violent, or at least very destructive, about 90% of the time, while the will to reduce the number of elites and turn off the wealth pump arises through elite cooperation less than 10% of the time.
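On the last turn problem, a minimal sketch (standard textbook prisoner's-dilemma payoffs, my own framing rather than anything from the post): when the final round of a repeated game is commonly known, defection dominates there, which makes the second-to-last round effectively a last round, and the reasoning unravels cooperation all the way back to round one.

```python
# Argument sketch of the last turn problem, not a full equilibrium solver.
# Payoffs are the standard textbook prisoner's-dilemma values.

PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
ACTIONS = ("C", "D")

def dominant_stage_action() -> str:
    """In a round whose outcome cannot affect any later play, pick the action
    that does at least as well against every opponent action (here: D)."""
    for mine in ACTIONS:
        if all(PAYOFF[(mine, other)] >= PAYOFF[(alt, other)]
               for other in ACTIONS for alt in ACTIONS):
            return mine
    raise ValueError("no dominant action in this stage game")

def backward_induction(rounds: int):
    plan = {}
    # The last round cannot influence anything later, so the dominant action is
    # played there; that fixes it, making the round before it effectively last,
    # and so on back to round 1.
    for r in range(rounds, 0, -1):
        plan[r] = dominant_stage_action()
    return [plan[r] for r in range(1, rounds + 1)]

print(backward_induction(10))   # ['D', 'D', ..., 'D'] - cooperation never starts
```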
I add the assumption that there is nothing special about humans, and that any entities (AI or extraterrestrial) that align with the value goals and optimization scopes described above will produce similar results. Game-theory mathematics does not say anything about evolutionary history or take into account species preferences, after all, because it doesn't seem to need to. Even social insects, presumably optimizing on much larger, but still not global, scopes fall victim to large-scale cyclic wars (I'm thinking of ants here).
So is alignment even a desirable goal? Perhaps we should instead ensure that AI does not aid the wealth pump, elite competition, and the mobilization of the immiserated commoners (Turchin's terminology)? But it is the goal of many, perhaps most, AI researchers to "make a lot of money" (witness the recent episode with Sam Altman and the support from OpenAI employees for his profit-oriented strategy, over the board's objection, as well as the fact that most competing entities developing AI are profit-oriented - and competing!). But some other goal (e.g. stabilization of society) might have wildly unpredictable results (stagnation comes to mind).
Thank you - the best of many good LessWrong posts. I am currently trying to figure out what to tell my 9-year-old son. But your letter could "almost" have been written to me. I'm not in whichever bay area you mean (Seattle? San Francisco?); I worked for NASA, and it is also called the bay area here. Success is very much defined by others. Deviating from that produces accolades at first, even research dollars, but finally the "big machine" moves in a different direction way over your head, and it's all for naught.
My son asked point-blank if he should work for NASA. He loves space stuff. I told him, also point-blank, "No!" You will never have any ownership at such an organization. It will be fun, but then it is not yours anymore, you are reassigned, and finally you retire and lose your identity, your professional email address, and most of your friends - unless you become a consultant or contractor, which many at retirement age do not have the energy to do, and besides, it's demeaning to take orders from people you used to give orders to. Ultimately the projects are canceled, no matter how successful. The entire Shuttle program is gone, despite the prestige it brought the US. Now the US is just another country that failed. I didn't even like the Shuttle, but I recognized its value.
This was John Kennedy's original aim for the Moon project - to increase the esteem of the US - and it worked; we won the Cold War. Now we are just another country that can fail, or even disappear or descend into political chaos, and hardly anyone cares; many would be glad to see us go. Why care? Family is most important, right? How is your family going to exist in one of the other countries? We came here from Europe for a reason that still exists. I considered moving to Russia in about 2012, when I married a Russian. I admired their collaboration with NASA, and I felt free from some of the US cultural nonsense. Imagine the magnitude of that mistake had I done it!
What should your goals be? You are free to have none, or to have silly ones, in the US. However, natural selection will keep giving us people who have families, because we don't live forever - not very long at all, in fact; I'm 73, imagine that. I don't really have the energy to program very much anymore. The very small programs I do write I make count for something, like social-science research in cooperation theory, game theory, and the effects of wealth in cooperation games. My AI programs are toys. Programming paradigms pass away too fast to keep up with when you get older. You won't believe me, but chances are I'm writing from an age way past yours.
So, family first. I was God knows how late - too much programming. Then I got into designing chips; it's just like programming, it just costs more. One day I had a herniated disk and could not sit at my desk, so I retired, and then had a heart attack from lack of exercise. It piles up on you. What goal is important enough to get up and pursue when you are hurting? I'll tell you: survival of your family into the future. Nothing else matters. Your friends become the parents of your kids' friends, even if you have nothing in common with them, because it helps your kid. This is weird. I don't especially like human culture. But it's much harder to change than various reformers and activist groups think, and that way lies social chaos and possible disintegration.
But anyway, I decided that turning my interests in programming, math, and physics toward the social sciences would in the long run help my son more, if I can make some tiny contribution to the field. More people need to do this, and to do it without just arguing from within the context of their own cultural bias. Human nature is not human. Simple mathematical entities running as programs have many of the same characteristics. Don't believe me? Read "Wealth-relative effects in cooperation games" (ScienceDirect). There is a simulation you can tweak and play with linked there.
Hi Jef, you'll get no criticism from me. I've just completed a paper on human cognitive coevolution, and one of the central results is very close to what you're describing for the last 10k years. Before that, small groups cooperated on shared outcomes through 7 million years of exponential cognitive evolution. Now people prioritize education and career past their reproductive prime, and the world total fertility rate is fast falling below replacement. Do you think this trend will stop on its own?