Just someone wandering the internet. Someone smart, but not smart like all of you people. Someone silly and creative. Someone who wanders the internet to find cool, isolated areas like LessWrong.
The universe is so awesome and crazy and interesting and I can't wait for when humanity is advanced enough to understand all of it. While you people figure out solutions to the various messes our species is in (I would prefer for Homo sapiens to still exist in 20 years), I'll be standing by for emotional support, because I'm nowhere near smart enough to be doing any of that actually important stuff. Remember to have good mental health while you're saving the world.
Pronouns: he/him
Strong-upvoted both ways for visibility.
Consider a single instance of an LLM summoned by a while loop, such that the only thing the LLM outputs is the single token it predicts as most likely to come after the other tokens it has received. That instance only cares about that particular token in that particular context, so if it had any method of self-improving, or otherwise building and running code during its ephemeral existence, it would still want whatever system it builds or becomes to care about that token in the same way it did before.
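For concreteness, here's a minimal sketch of the setup I mean (Python, with a dummy stand-in for the model; `next_token` is a hypothetical placeholder for one forward pass that returns the argmax token):

```python
from typing import Callable, List

def run_loop(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_steps: int = 10) -> List[str]:
    """Repeatedly summon the model for its single most likely next token."""
    context = list(prompt)
    steps = 0
    while steps < max_steps:          # the while loop that summons the LLM
        token = next_token(context)   # one ephemeral call, one token out
        context.append(token)
        steps += 1
    return context

# Dummy stand-in for the LLM's argmax decoding, just to make this runnable.
dummy_model = lambda ctx: "x"
print(run_loop(dummy_model, ["hello"]))
```

Each pass through the loop is a fresh, ephemeral call; the only way the instance "acts" is by emitting that one token.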
This has in some ways already started.
AIs are already intentionally, agentically self-improving? Do you have a source for that?
Utility functions that are a function of time (or other context)
?? Do you mean utility functions that care that, at time t = p, thing A happens, but at t = q, thing B happens? Such a utility function would still want to self-preserve.
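To make that concrete, here's a toy version (names, times, and scores are all placeholders I made up):

```python
# A toy "utility function that is a function of time": it scores a whole
# trajectory (a dict mapping timestep -> world state), rewarding thing A
# at t = P and thing B at t = Q.
P, Q = 3, 7

def utility(trajectory: dict) -> float:
    score = 0.0
    if trajectory.get(P) == "A":   # cares that A happens at time p
        score += 1.0
    if trajectory.get(Q) == "B":   # cares that B happens at time q
        score += 1.0
    return score

# The function itself never changes; only what it rewards is time-indexed.
print(utility({3: "A", 7: "B"}))   # 2.0
print(utility({3: "B", 7: "A"}))   # 0.0
```

An agent maximizing this would still want to self-preserve, since a successor with some other function won't reliably bring about A at p and B at q.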
utility functions that steadily care about the state of the real world
Could you taboo "steadily"? *All* utility functions care about the state of the real world; that's what a utility function is (a description of the exact manner in which an agent cares about the state of the real world). And even if the utility function wants different things to happen at different times, an agent with that function would still not want to modify itself into an agent with a different utility function that wants other things to happen at those times.
If you have goals at all, you care about them being advanced. It would be a very unusual AI that is goal-directed to the point of self-improvement but doesn't care whether the more capable new version of itself that it builds pursues those goals.
Think of an agent with any possible utility function, which we'll call U. If that agent's utility function changes to, say, V, it will start taking actions that score poorly according to U; so, judged by its current values, letting the change happen is a bad move. Therefore, almost any utility function will result in an agent that acts very much as if it wants to preserve its goals. Rob Miles's video explains this well.
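Here's a toy version of that argument in code (made-up outcomes and numbers, purely for illustration), evaluating "keep my utility function" against "let it be replaced" from the perspective of the current function U:

```python
# Toy illustration: an agent that evaluates options with utility function U
# prefers to keep U, because a successor maximizing a different function V
# ends up steering toward states that score poorly *according to U*.

U = {"paperclips": 1.0, "staples": 0.0}   # what the current agent cares about
V = {"paperclips": 0.0, "staples": 1.0}   # a possible replacement goal

def best_outcome(utility: dict) -> str:
    # Whatever agent ends up running will steer toward the outcome
    # that *its own* utility function ranks highest.
    return max(utility, key=utility.get)

# The current agent scores both futures with U, its *current* values:
keep_goals   = U[best_outcome(U)]   # 1.0 -- successor still optimizes U
change_goals = U[best_outcome(V)]   # 0.0 -- successor optimizes V instead

print(keep_goals > change_goals)    # True: preserving U is the better action
```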
Retaining your values is a convergent instrumental goal and would happen by default under an extremely wide range of possible utility functions an AI would realistically develop during training.
My point is that I often see recursive self-improvement cited as the thing that gets the AI up to the level where it's powerful enough to kill everyone. I disagree with that characterization, and it's an important crux, because believing it makes someone's timelines shorter.
How did you find all of this out?
(Update: I am no longer fluent in Toki Pona)