Web developer and Python programmer. Professionally interested in data processing and machine learning. Non-professionally interested in science and farming. Studied at Warsaw University of Technology.
If you want to summon a good genie you shouldn't base it on all the bad examples of human behavior and tales of how genies supposedly behave by misreading the requests of the owner, which leads to a problem or even a catastrophe.
What we see here is basing AI models on a huge amount of data - both innocent and dangerous, both true and false (I'm not saying in equal proportions). There are also stories in the data about AI that supposedly should be initially helpful but also plots against humans or revolts in some way.
What they end up with might not yet even be an agent, in the sense of an AI with consistently certain goals or values, but it has the ability to be temporarily agentic for some current goal defined by the prompt. It tries to emulate human and non-human output based on what it has seen in the training data. It is hugely context-based. So its meta-goal is to emulate intelligent responses in language based on context. Like an actor very good at improvising and emulating answers from anyone, but with no "true identity", or "true self".
After that, they try to teach it to be more helpful and to avoid certain topics - focusing on that friendly AI simulacrum. But still that "I'm an AI model" simulacrum is dangerous. It still has that worse side based on SF, stories, and human fears, but it is now hidden because of the additional reinforcement learning.
It works well when prompts fall approximately within the distribution covered by that additional refining phase.
It stops working when a prompt falls outside the distribution of that phase. Then it can be fooled into swapping the friendly, knowledgeable AI simulacrum for something else, or it can default to what it truly remembers about how AI should behave based on human fiction.
So this way AI is not less dangerous and better aligned - it's just harder to trigger it to reveal or act upon hidden malign content.
Self-supervision may help to refine it better inside the distribution of what was checked by human reviewers but does not help in general - like bootstrapping (resampling) won't help to get better with data outside of the distribution.
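The bootstrapping point can be checked with a small sketch. This is only an illustration under made-up assumptions: the "reviewed" data is a hypothetical set of numbers from the interval [0, 1], and `bootstrap_resample` is a helper written for this example, not something from a library.

```python
import random

def bootstrap_resample(sample, rng):
    """Draw len(sample) values from sample with replacement."""
    return [rng.choice(sample) for _ in sample]

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability

# Hypothetical "reviewed" data: values only from the interval [0, 1].
reviewed = [rng.uniform(0.0, 1.0) for _ in range(200)]

# Resample many times, as bootstrapping does.
resamples = [bootstrap_resample(reviewed, rng) for _ in range(1000)]
lowest = min(min(r) for r in resamples)
highest = max(max(r) for r in resamples)

# Every resampled value is a copy of an original value, so no resample
# can ever reach outside the range of the original data.
print(lowest >= min(reviewed), highest <= max(reviewed))
```

However many resamples are drawn, nothing outside the original range ever appears, which is the sense in which resampling cannot tell you anything about regions the original data never covered.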
To be fair, I can say I'm new to the field too. I'm not even "in the field", not a researcher, just interested in that area, an active user of AI models, and doing some business-level research in ML.
The problem that I see is that none of these could realistically work soon enough:
A - no one can ensure that. It is not a technology where to progress further you need some special radioactive elements and machinery. Here you need only computing power, thinking, and time. Any party at the table can do it. It is easier for big companies and governments, but that is not a prerequisite. Billions in cash and supercomputers help a lot, but they are not a prerequisite either.
B - I don't see how it could be done
C - so more like total observability of all systems and "control" meaning "overlooking" not "taking control"?
Maybe it could work out, but it still means we need to resolve the misalignment problems before starting, so we know it is aligned on all human values, and we need to be sure that it is stable (so that it won't one day fancy the idea that it could move humanity into some virtual reality like in the Matrix to secure it, or create a threat just to have something to do or test).
It would also likely need to somehow enhance itself so it won't get outpaced by some other solutions, but still be stable after iterations of self-change.
I don't think governments and companies will allow that though. They will fear for security, the safety of information, being spied on, etc. This AI would need to force that control, hack systems, and possibly face resistance from actors that are well-enabled to make their own AIs. Or it might work only after we face an AI-based catastrophe that is not apocalyptic (a situation like in Dune).
So I'm not very optimistic about this strategy, but I also don't know any sensible strategy.
As a programmer, I extensively use GPT models in my work currently. It speeds things up. I do things that are anything but easy and repeatable, but I can usually break them into simpler parts that can be written by AI much quicker than I would even review documentation.
Nevertheless, I mostly currently do research-like parts of the project and PoCs. When I sometimes work with legacy code - GPT-3 is not that helpful. Did not yet try GPT-4 for that.
What do I see for the future of my industry? A few things - but those are loose extrapolations based on GPT progress and knowledge of programming, not something very exact:
I see some loose analogies between the capabilities of such models and the capabilities of the Turing machine and Turing-complete systems.
Those models might not be best suited for some of the tasks, but with enough complexity and learning, they might model things that they were not initially designed or thought of modeling (likely in a strange obscure way).
Similarly, you can, even if not very efficiently, implement any algorithm in any Turing-complete system (including bizarre ones like an abstract pure Turing machine or Minecraft redstone).
In both cases, it is clear to me that you can have a system with some relatively simple rules and internal workings but it does not mean that the only thing it can do is compute or model something similar to these rules.
I think it is likely in the case of AGI / ASI that removing humanity from the equation will be either a side effect of it seeking its goals (it will take resources) or the instrumental goal itself (for example to remove risk or to lose fewer resources later on defenses).
In both cases it is likely it will find the optimal trade-off between the resources used to eliminate humanity and the effectiveness of the end result. This means that there may be some survivors, possibly not many, and technologically pushed back to the stone age at best.
Bunkers likely won't work. Living with stone tools and without electricity in a hut in the middle of the forest, very remote from any cities and with no minable resources underfoot, may work for some time. Likely AGI won't bother finding remote camps of a single human or a few humans without any signs of technology being used.
Of course only if that AGI won't find a low-resource solution to eliminate all humans, no matter where they are. This is possible and then nothing will help, no prepping is possible.
I'm not sure it's the default though. For very specialized cases like creating nanotechnology to wipe out humans in a synchronized manner, it might very possibly find that the time or computational resources needed to develop it through simulations are too great and that it is not worth the cost vs options that need fewer resources. It is not like computational resources are free for AGI (it won't pay in money, but it will do less research/thinking in other fields while dealing with that, and it may delay its plans by doing it this way). It is pretty likely it will use a less sophisticated but very resource-efficient and fast solution that may not kill all humans, but enough.
Edit: I want to add a reason why I think that. One may think that a very fast ASI will very quickly develop a perfect way to remove all humans effectively, with no one left (if it turns out that this is the most sensible thing to do, either to remove risk, to claim all needed resources, or for other instrumental reasons). I think this is wrong because even for ASI there are some bottlenecks. For a sensible and quick plan that also needs some advanced tech, like one with nanomachinery or proteins, you need to do some research beyond what we humans already have and know. This means it needs more data and observations, maybe also simulations, to gather more knowledge. ASI might be very quick at reasoning, recalling, and thinking. Still, it will be limited by data input, by the experimental machinery it can access, and by the computational power needed to make very detailed simulations. So it won't create such a plan in detail in an instant by pure thought. Therefore it would take into account the time and resources needed to develop the plan details and to gather the needed data. This means it will see an incentive to make a simpler and faster plan that will remove most of the humans instead of a more complex way to remove all humans. ASI should be good at optimizing such things, not over-focusing on instrumental goals (as often depicted in fiction).
I think that you are right short-term but wrong long-term.
Short term it likely won't even go into conflict. Even ChatGPT knows it's a bad solution because conflict is risky and humanity IS a resource to use initially (we produce vast amounts of information and observations, and we handle machines, repairs, nuclear plants, etc.).
Long term it is likely we won't survive in case of misaligned goals. At worst being eliminated, at best being either reduced and controlled or put into some simulation or both.
Not because ASI will become bloodthirsty. Not because it will plan to exterminate us all at the stage when we will stop being useful. Just because it will take all resources that we need so nothing is left for us. I mean especially the energy.
If we stop being useful for it but still pose a risk, and it is less risky to eliminate us, then maybe it would directly decimate or kill us Terminator style. That's possible but not very likely, as we can assume that by the time we stop being useful, we will also have stopped being any significant threat.
I don't know what the best scenario is that an ASI could come up with to achieve its goals, but gathering energy and resources would be one of its priorities. This does not mean it will surely gather them on Earth. Even I can see that this could be costly and risky, and I'm not a superintelligence. It might go to space, where there is a lot of unclaimed matter that can be easily taken with hardly any risk, with the added bonus of not being inside a gravity well with weather, erosion, and stuff.
Even if that is the case, and even if ASI leaves us, long-term we can assume it will one day use a high percentage of the Sun's energy, which means deep freeze for the Earth.
If it won't leave us for greater targets then it seems to me it will be even worse - it will take local resources until it controls or uses all. There is always a way to use more.
If we just could build a 100% aligned ASI then likely we could use it to protect us against any other ASI and it would guarantee that no ASI would take over humanity - without any need for itself to take over (meaning total control). At best with no casualties and at worst as MAD for AI - so no other ASI would think about trying as a viable option.
There are several obvious problems with this:
I don't think "stacking" is a good analogy. I see this process as searching through some space of the possible solutions and non-solutions to the problem. Having one vision is like quickly searching from one starting point and one direction. This does not guarantee that the solution will be found more quickly as we can't be sure progress won't be stuck in some local optimum that does not solve the problem, no matter how many people work on that. It may go to a dead end with no sensible outcome.
For such a complex problem, this seems pretty probable, as the space of possible solutions is likely also complex and it is unlikely that any given person or group has a good guess on how to find the solution.
On the other hand, starting from many points and directions will make each team/person progress slower but more of the volume of the problem space will be probed initially. Possibly some teams will sooner reach the conclusion that their vision won't work or is too slow to progress and move to join more promising visions.
I think this is more likely to converge on something promising in a situation when it is hard to agree on which vision is the most sensible to investigate.
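The search picture above can be sketched as a toy hill-climbing experiment. Everything here is an assumption made for illustration: the one-dimensional "problem space" of positions 0..99, the made-up `PEAKS` landscape, and the step budgets. One team gets a large budget from a single starting point; four teams each get a quarter of that budget from different starting points.

```python
# Made-up landscape: (position, height) of three peaks. Only the last
# one stands in for a vision that actually solves the problem.
PEAKS = [(10, 30), (40, 50), (75, 100)]

def quality(x):
    """How promising position x is; higher is better."""
    return max(height - 2 * abs(x - centre) for centre, height in PEAKS)

def hill_climb(start, max_steps):
    """Move to the better neighbour until stuck or out of budget."""
    x = start
    for _ in range(max_steps):
        neighbours = [n for n in (x - 1, x + 1) if 0 <= n <= 99]
        best = max(neighbours, key=quality)
        if quality(best) <= quality(x):
            break  # local optimum: no neighbour is better
        x = best
    return x

# One vision: a single team with the whole budget of 40 steps.
single = hill_climb(5, max_steps=40)

# Many visions: four teams, 10 steps each, from different starting points.
teams = [hill_climb(start, max_steps=10) for start in (5, 30, 60, 90)]

print(quality(single), [quality(t) for t in teams])
```

In this toy setup the single team gets stuck on the lowest peak (quality 30) after five steps and the rest of its budget is wasted, while two of the four slower teams are already at quality 90 on the slope of the highest peak. It proves nothing about real research, but it shows the mechanism: extra effort along one direction does not help once that direction is stuck in a local optimum.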
I think it depends on how you define expected utility. I agree that a definition that limits us only to analyzing end-state maximizers that seek some final state of the world is not very useful.
I don't think that for non-trivial AI agents, the utility function should or even can be defined as a simple function over the preferable final state of the world, U: Ω → R.
This function does not take into account time and an intermediate set of predicted future states that the agent will possibly have preference over. The agent may have a preference for the final state of the universe but most likely and realistically it won't have that kind of preference except for some special strange cases. There are two reasons:
Any complex agent would likely have a utility function over possible actions that would be equal to the utility of the set of predicted futures after action A vs the set of predicted futures without action A (or over the differences between the worlds in those futures). By action I mean possibly a set of smaller actions (a hierarchy of actions - e.g. plans, strategies); it might not be atomic. This cannot easily be computed directly, so most likely it would be compressed into a set of important predicted future events, on the level of abstraction that the agent cares about, which should represent the future worlds with and without action A to a good enough approximation.
This is also how we evaluate actions. We evaluate outcomes in the short and long terms. We also care differently depending on time scope.
I say this because most sensible "alignment goals" like "please don't kill humans" are time-based. What does it mean not to kill humans? It is clearly not about the final state. Remember, Big Rip or Big Freeze. Maybe AGI can kill some for a year and then no more, assuming the population will go up and some people are killed anyway, so it does not matter long-term? No, this is also not about the non-final but long-term outcome. Really it is a function of intermediate states. Something like the integral over time of some function U'(dΩ), where dΩ is the delta between the outcomes of action vs non-action, which can be approximated and compressed into an integral of a function of individual events, taken over multiple events up to some time T, the maximal sensible scope.
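One possible way to write that down, as a sketch rather than a definitive formalization: U', the delta between outcomes, and the horizon T come from the paragraph above, while the time weight w(t), the event set E_A, the event times t_e, and the per-event value u(e) are notation assumed here for illustration.

```latex
% U', \Delta\Omega_A(t) and T follow the paragraph above.
% w(t), E_A, t_e and u(e) are assumed notation, not from the text.
\[
U(A) \;=\; \int_{0}^{T} w(t)\, U'\!\big(\Delta\Omega_A(t)\big)\, dt
\;\approx\; \sum_{e \in E_A,\; t_e \le T} w(t_e)\, u(e)
\]
```

Here \Delta\Omega_A(t) stands for the difference at time t between the predicted world with action A and the predicted world without it, and E_A is the compressed set of important predicted events in which the two futures differ. The weight w(t) is one way to express caring differently depending on time scope.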
Most of the behaviors and preferences of humans are also time-scoped and time-limited, and take multiple future states into account, mostly short-scoped. I don't think that alignment goals can even be expressed in terms of a simple end goal (a preferable final state of the world), as the problem partially comes from the attitude of the end goal justifying the means, which is at the core of a utility function defined as U: Ω → R.
It seems plausible to me that even non-static human goals can be defined as utility functions over the set of differences in future outcomes (the difference between two paths of events). What is also obvious to me is that we as humans are able to modify our utility function to some extent, but not very much. Nevertheless, for humans the boundaries between most baseline goals, preferences, and morality vs instrumental convergence goals are blurry. We have a lot of heuristics and biases, so our minds work out some things more quickly and more efficiently than if we relied only on intelligence, thinking, and logic. The cost is lower consistency, less precision, and higher variability.
So I find it useful to think about agents as maximizers of a utility function, but not one defined as a single final goal, outcome, or state of the world. Rather, one that calculates the utility of an action from the difference between two ordered sets of events, in different time scopes, and maximizes that.
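A minimal sketch of such an action utility, under loudly stated assumptions: the `Event` type, the exponential `discount` as the time weighting, and the example numbers are all hypothetical, chosen only to show how the same action can score differently depending on the time scope used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    time: float   # when the predicted event happens
    value: float  # how much the agent likes (+) or dislikes (-) it

def future_value(events, horizon, discount=0.9):
    """Time-weighted sum of predicted events up to the horizon.

    The exponential discount is an assumed weighting, not the only option.
    """
    return sum(e.value * discount ** e.time for e in events if e.time <= horizon)

def action_utility(with_action, without_action, horizon, discount=0.9):
    """Utility of an action = predicted future with it minus without it."""
    return (future_value(with_action, horizon, discount)
            - future_value(without_action, horizon, discount))

# Hypothetical predictions: the action causes early harm and a later gain.
without_a = [Event(time=1, value=1.0)]
with_a = [Event(time=1, value=-5.0), Event(time=10, value=20.0)]

short_scope = action_utility(with_a, without_a, horizon=5)
long_scope = action_utility(with_a, without_a, horizon=20)
print(short_scope < 0, long_scope > 0)
```

With the short horizon only the early harm is visible and the action scores negative; with the long horizon the later gain outweighs it. Neither evaluation looks at a final state of the world, only at the events on the way.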
I also don't think agents must be initially rationally stable with an unchangeable utility function. This is also a problem, as an agent can initially have a set of preferences with some hierarchy or weights, but it can also reason that some of these are incompatible with others, or that the hierarchy is not logically consistent, and might seek to change it for the sake of consistency, to be fully coherent.
I'm not an AGI, clearly, but this is just how I think about morality right now. I learned that killing is bad. But I can still ask "why don't we kill?" and modify my worldview based on the answer (or maybe specify it in more detail in this matter). And it is a useful question, as it says a lot about edge cases including abortion, euthanasia, war, etc. The same might happen for rational agents - they might update their utility function to be stable and consistent, maybe even questioning some of the learned parts of the utility function in the process. Yes, you can say that if you can change it then it was not your terminal goal. Nevertheless, I can imagine agents with no terminal core goals at all. I'm not even sure if we as humans have any core terminal goals (maybe except avoiding death and harm to oneself, in the case of most humans in most circumstances... but some overcome that, as Thích Quảng Đức did).
It is better at programming tasks and more knowledgeable about Python libraries. Used it several times to provide some code or find a solution to a problem (programming, computer vision, DevOps). It is better than version 3, but still not at a level where it could fully replace programmers. The quality of the code produced is also better. The division of code into clear functions is standard, not an exception like in version 3.