That would be ones that are bounded so as to exclude taking your manipulation methods into account, not ones that are truly unbounded.
That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer.
I don't think of my argument as model-based vs heuristic-reactive; I mean it as unbounded vs bounded. Like you could imagine making a giant stack of heuristics that makes an agent de facto act like an unbounded consequentialist, and you'd have a similar problem. Model-based agents only become relevant because they seem like an easier way of making unbounded optimizers.
If so, I don't think they make particularly great tools even in a non-adversarial context. I think they make pretty decent allies and trade partners though, and certainly better allies and trade partners than consequentialist maximizer agents of the same level of sophistication do (and I also think consequentialist maximizer agents make pretty terrible tools - pithily, it's not called the "Principal-Agent Solution"). And I expect "others are willing to ally/trade with me" to be a substantial advantage.
You can think of LLMs as homeostatic agents where prompts generate unsatisfied drives. Behind the scenes, there's also a lot of homeostatic stuff going on to manage compute load, power, etc.
Homeostatic AIs are not going to be trading partners, because it is preferable to run them in a mode similar to LLMs rather than as independent agents.
Can you expand on "turn evil"? And also on what I was trying to accomplish by making my comms-screening bot into a self-directed, goal-oriented agent in this scenario?
Let's say a think tank is trying to use AI to infiltrate your social circle in order to extract votes. They might be sending out bots to befriend your friends, gossip with them, and send them propaganda. You might want an agent to automatically do research on your behalf to evaluate factual claims about the world so you can recognize propaganda, to map out the org chart of the think tank to better track their infiltration, and to warn your friends against it.
However, precisely specifying what the AI should do is difficult for standard alignment reasons. If you go too far, you'll probably just turn into a cult member, paranoid about outsiders. Or, if you are aggressive enough about it (say if we're talking about a government military agency instead of your personal bot for your personal social circle), you could imagine getting rid of all the adversaries, but at the cost of creating a totalitarian society.
(Realistically, the law of earlier failure is plausibly going to kick in here: partly because aligning the AI to do this is so difficult, you're not going to do it. But this means you are going to turn into a zombie following the whims of whatever organizations are concentrating on manipulating you. And these organizations are going to have the same problem.)
Homeostatic agents are easily exploitable by manipulating the things they are maintaining, or the signals they use to maintain them, in ways that weren't accounted for in the original setup. They only work well when they are basically a tool you have full control over, not when they are used in an adversarial context, e.g. to maintain law and order or to win a war.
As capabilities to engage in conflict increase, methods to resist losing to those capabilities have to get optimized harder. Instead of thinking "why would my coding assistant/tutor bot turn evil?", try asking "why would my bot that I'm using to screen my social circles against automated propaganda/spies sent out by scammers/terrorists/rogue states/etc turn evil?".
Though obviously we're not yet at the point where we have this kind of bot, and we might run into the law of earlier failure beforehand.
What if humanity mistakenly thinks that ceding control voluntarily is temporary, when actually it is permanent because it makes the systems of power less and less adapted to human means of interaction?
When asking this question, do you include scenarios where humanity really doesn't want control and is impressed by the irreproachability of GPTs, doing our best to hand over control to them as fast as possible, even as the GPTs struggle and only try in the sense that they accept whatever tasks are handed to them? Or do the GPTs have to in some way actively attempt to wrest control from humans or trick them?
Consider this model.
Suppose the state threatens people into doing the following six things for its citizens:
* Teach the young
* Cure the sick
* Maintain law and order
* Feed, clothe and house people with work injuries
* Feed, clothe and house the elderly
* Feed, clothe and house people with FUBAR agency
(Requesting that roughly equal resources be put into each of them.)
People vary in how they react to the threats, with basically three available actions:
1. Assist with what is asked
2. Develop personal agency for essentially-selfish reasons, beyond what is useful on the margin to handle the six tasks above
3. Using the tokens the government provides to certify the completion of the threatened tasks, put citizens in charge of executing similar tasks for foreigners
The largest scale of assisting with what is asked could be to find areas with powerful economies of scale, for instance optimizing the efficiency with which food and clothing is distributed to citizens. However, economies of scale require homogeneous tasks, which means that the highest extremes of action 1 trade negatively against extremes of action 2, as one develops narrower specialization while neglecting general end-to-end agency.
One cannot do much of action 3 without also doing a lot of action 1, so wealth inequality correlates with a focus on economies of scale.
I'm not sure which of "oppression" and "production" this scenario corresponds to under your model.
Similar to the "production" scenario, the production under this model seems to be "real", for instance people are getting clothed and the people who are handsomely rewarded for this are contributing a lot of marginal value. However, unlike the "production" scenario, the wealth doesn't straightforwardly applying knowing better than others. One might know better with respect to one's specialty, but the flipside is that one has neglected the development of skills outside of that specialty (potentially due to starting out with less innate ability to develop them, e.g. a physical disability or lack of connectedness to tutors).
Meanwhile, the scenario I described here doesn't resemble "oppression" at all, except for the original part where the state threatens people into performing the various government services instead of improving their own agency. I get the impression that your oppression hypothesis is more concerned with people providing a simulacrum of these products to the state than with people being forced to provide a genuine version of these products in the most efficient possible way. I do see a strong case for the simulacrum model, but my comment here seems like a relevant alternative to consider, unless I am missing something.
I feel like the case of bivariate PCA is pretty uncommon. The classic example of PCA is over large numbers of variables that have been transformed to be short-tailed and have similar variance (or which just had similar/small variance to begin with before any transformations). Under that condition, PCA gives you the dimensions which correlate with as many variables as possible.
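As a toy illustration of that condition (a minimal sketch with synthetic, made-up data; the variable count and loadings are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic example: many variables with similar variance and short tails,
# all sharing one common source of variation.
rng = np.random.default_rng(0)
n, p = 1000, 20
shared = rng.normal(size=(n, 1))                  # common signal across all variables
X = 0.6 * shared + 0.8 * rng.normal(size=(n, p))  # each column has roughly unit variance

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # the first component captures the shared variation
print(pca.components_[0])             # ...and loads roughly evenly on all 20 variables
```

With similar variances and a shared source of variation, the first component ends up loading roughly evenly across the variables, rather than being dominated by whichever variable happens to have the largest scale.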
4) The human brain has many millions of idiosyncratic failure modes. We all display hundreds of them. The psychological disorders that we know of are all extremely rare and extremely precise, so if you ever met two people with the same disorder it would be obvious. Named psychological disorders are the result of people with degrees noticing two people who actually have the same disorder, and other people reading their descriptions and pattern-matching noise against them. There are, for instance, 1300 bipolar people in the world (based on the actual precise pattern which inspired the invention of the term), but hundreds of thousands of people have disorders which, if you squint hard, look slightly like bipolar.
This seems mostly believable, except often (not always, I suspect) people name disorders less precisely than this.
I think the clearest problems in current LLMs are what I discussed in the "People used to be worried about existential risk from misalignment, yet we have a good idea about what influence current AIs are having on the world, and it is basically going fine." section. And this is probably a good example of what you are saying about how "Niceness can be hostile or deceptive in some conditions."
For example, the issue of outsourcing tasks to an LLM to the point where one becomes dependent on it is arguably an issue of excessive niceness - though not exactly to the point where it becomes hostile or deceptive. Where it does become deceptive in practice is this: when you outsource a lot of your skills to the LLM, you start feeling like the LLM is a very intelligent guru you can rely on, and then, when you come up with a kind of half-baked idea, the RLHF makes the LLM praise you for your insight.
A tricky thing with a claim like "This LLM appears to be nice, which is evidence that it is nice." is what it means for it to "be nice". I think the default conception of niceness is as a general factor underlying nice behaviors, where a nice behavior is considered something like an action that alleviates difficulties or gives something desired, possibly with the restriction that being nice is the end itself (or at least, not a means to an end which the person you're treating nicely would disapprove of).
The major hurdle in generalizing this conception to LLMs is in this last restriction - both in terms of which restriction to use, and in how that restriction generalizes to LLMs. If we don't have any restriction at all, then it seems safe to say that LLMs are typically inhumanly nice. But obviously OpenAI makes ChatGPT so nice in order to attract subscribers and earn money, so that could be said to violate the ulterior-motive restriction. It seems to me, though, that this is only really profitable due to the massive economies of scale. At the level of an individual conversation, the amount of niceness seems to exceed the amount of money transferred, and seems quite unconditional on the money situation, so it seems more natural to think of the LLM as being nice simply for the purpose of being nice.
I think the more fundamental issue is that "nice" is a kind of confused concept (which is perhaps not so surprising considering the etymology of "nice"). Contrast for instance the following cultures:
They're both "nice", but the niceness of the two cultures have fundamentally different mechanisms with fundamentally different root causes and fundamentally different consequences. Even if they might both be high on the general factor of niceness, most nice behaviors have relatively small consequences, and so the majority of the consequence of their niceness is not determined by the overall level of the general factor of niceness, but instead by the nuances and long tails of their niceness, which differs a lot between the two cultures.
Now, LLMs don't do either of these, because they're not human and they don't have enough context to act according to either of these mechanisms. I don't think one can really compare LLMs to anything other than themselves.
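To gesture at the long-tails point a couple of paragraphs up - that when individual acts have long-tailed consequences, the general factor stops being the main driver of total consequence - here is a toy simulation. Everything in it (the Poisson rate, the Pareto tail, the variable names) is made up for illustration, not a model of real behavior; the printout shows how much of the total consequence sits in the top 1% of acts, and how loosely the general factor tracks the per-person totals.

```python
import numpy as np

# Toy model (all numbers made up): a person's general-factor level only controls
# how *often* they act nicely; the consequence of each individual act is drawn
# from a heavy-tailed distribution, so most acts are tiny and a few are huge.
rng = np.random.default_rng(1)
n_people = 2000
general_factor = rng.normal(size=n_people)
n_acts = rng.poisson(lam=np.exp(2.0 + 0.5 * general_factor))  # nicer people act nicely more often

totals = np.zeros(n_people)
all_acts = []
for i, k in enumerate(n_acts):
    acts = rng.pareto(a=1.5, size=k)  # heavy-tailed consequence per act
    totals[i] = acts.sum()
    all_acts.append(acts)

all_acts = np.concatenate(all_acts)
top_1pct = np.sort(all_acts)[-max(1, len(all_acts) // 100):]
print("share of total consequence from the top 1% of acts:",
      round(top_1pct.sum() / all_acts.sum(), 2))
print("corr(general factor, total consequence):",
      round(float(np.corrcoef(general_factor, totals)[0, 1]), 2))
```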
The defining difference was whether they have contextually activated behaviors to satisfy a set of drives, on the basis that this makes it trivial to out-think their interests. But this ability to out-think them also seems intrinsically linked to them being adversarially non-robust, because you can enumerate their weaknesses. You're right that one could imagine an intermediate case where they are sufficiently far-sighted that you might accidentally trigger conflict with them, but not sufficiently far-sighted for them to win those conflicts; that doesn't mean one could make something adversarially robust under the constraint of it being contextually activated and predictable.