Soooo, muddying the waters makes you more alive? Like, IF you robustly steer the world toward some predictable outcome (aligned ASI, or some particular type of misaligned ASI), then you get instantiated less? Something something chaos reigns?
Just want to articulate one possibility for what the future could look like:
RL agents will be sooo misaligned so early, lying and cheating and scheming all the time, that alignment becomes a practical issue with normal incentives, and gets iteratively solved for not-very-superhuman agents. It turns out to require mild conceptual breakthroughs, because these agents are slightly superhuman, fast, and too hard to supervise directly to just train away the adversarial behaviors in the dumbest way possible. The work finishes developing by the time ASI arrives, and people just align it with a lot of effort, in the same way that any big project requires a lot of effort.
I'm not saying anything about the probability of this. It honestly feels a bit overfitted, much like how the people who overupdated on base models talked for a while. But still, the whole LLM arc was kind of weird and goofy, so I don't trust my sense of weird and goofy anymore.
(would appreciate references to forecasting writeups exploring similar scenarios)
The notion of "who was involved" is kinda weird. Like, suppose there is Greg. Greg will firebomb The Project if he is not involved. If he's involved, he will put in modest effort. Should he receive an enormous share just because of this threat? It's very much not fair.
What is the counterfactual construction procedure here? Like, do we assume the other players stop existing when calculating the value of a coalition that doesn't include them? But they are still there, in the world. And often it's not even clear what it would mean for them to do nothing.
o1 suggested modeling The Greg Gambit as a partition function game, but claimed it all gets complicated there. Or maybe model it as bargaining.
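To make the unfairness concrete, here's a toy sketch (my own illustration, with made-up numbers, not something from the thread): Alice and Bob each contribute 50 units of real work, Greg contributes nothing, and the characteristic function simply declares any coalition without Greg to be worth 0 because he firebombs it. A plain Shapley calculation then rewards the threat:

```python
from itertools import permutations

# Toy example: three players build The Project.
# Alice and Bob do real work; Greg's only "contribution" is not firebombing it.
players = ["Alice", "Bob", "Greg"]

def value(coalition):
    """Worth of a coalition under the 'Greg destroys anything he's excluded from' assumption."""
    if "Greg" not in coalition:
        return 0  # firebombed
    workers = [p for p in coalition if p != "Greg"]
    return 50 * len(workers)  # each real worker adds 50 units

def shapley(player):
    """Average marginal contribution of `player` over all join orders."""
    total = 0.0
    orders = list(permutations(players))
    for order in orders:
        before = set()
        for p in order:
            if p == player:
                total += value(before | {p}) - value(before)
                break
            before.add(p)
    return total / len(orders)

for p in players:
    print(p, shapley(p))
# Alice 25.0, Bob 25.0, Greg 50.0
```

Under these assumptions Greg's Shapley value comes out to 50 of the 100 total units, i.e. half the surplus for zero productive work, while Alice and Bob get 25 each. And the whole result hinges on how you define the value of coalitions that exclude Greg, which is exactly the counterfactual-construction problem above.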
There's also a bit of divergence between "has skills/talent/power" and "cares about what you care about". Like, yes, maybe there is a very skilled person for that role, but are they trustworthy/reliable/aligned/do they have the same priorities? You always face the risk of giving additional power to an already powerful adversarial agent. You should be really careful about that. Maybe focus more on virtue rather than skill.
There's a bit of a disconnect between the statement "it all adds up to normality" and what that could actually look like. Suppose someone discovered a new physical law: if you say "fjrjfjjddjjsje" out loud, you can fly, somehow. Now you are reproducing all of the old model's confirmed behaviour while flying. Is this normal?
Come to think of it, zebras are the closest thing we have to such adversarially colored animals. Imagine if they were also flashing at 17 Hz, the optimal epilepsy-inducing frequency according to this paper: https://onlinelibrary.wiley.com/doi/10.1111/j.1528-1167.2005.31405.x
Well, both, but mostly the issue of it being somewhat evil. But it can probably be good even from a strategic, human-focused view, by giving assurances to agents who would otherwise be adversarial to us. It's not that clear to me it's a really good strategy, because it kinda incentivizes threats, where an agent looks for more destructive options so it gets more reward for surrendering. Also, couldn't it just try the attack first, if there are any routes where failure is safe, and only then opt to cooperate? Sounds difficult to get right.
And to be clear, it's a hard problem anyway: even without this explicit thinking, this stuff is churning in the background, or will be. It's a really general issue.
Check this writeup, I mostly agree with everything there:
https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=y4mLnpAvbcBbW4psB
It's really hard for humans to match the style / presentation / language without putting a lot of work into understanding the target of the comment. LLMs are inherently worse (right now) at doing the understanding, coming up with things worth saying, and being calibrated about when to be critical, AND they are a lot better at just imitating the style.
This just invalidates some side signals humans habitually use on one another.
This should probably only be attempted with a clear and huge warning that it's an LLM-authored comment. Because LLMs are good at matching style without matching content, it could end up exploiting users' heuristics that are calibrated only for human levels of honesty / reliability / non-bullshitting.
Also check this comment about how conditioning on the karma score can give you hallucinated strong evidence:
https://www.lesswrong.com/posts/PQaZiATafCh7n5Luf/gwern-s-shortform?commentId=smBq9zcrWaAavL9G7
That expression of hope is kinda funny. Like, expressing a preference about a thing that is already fixed, e.g. "there's either a diamond in this box or it is empty, but I hope it's a diamond".
Also, I think the appropriate word here is "slave", btw. Or maybe "serf"?
Or, like, suppose the CEO of a rent-a-slave company rolled a d4: if it comes up >1, he hired paid workers to roleplay as slaves; if it comes up 1, he acquired actual slaves. The workers are really good at roleplaying, so you can't tell which it was just by investigating.
Would you hope that the work you purchase from that company is fairly compensated and morally ok?
Yeah, I agree there is some disanalogy. AIs are trained more precisely to fill the roles they play. But that might be the equivalent of the "ancestral environment" in the human analogy, which is not that pleasant. Idk, it's all pretty uncertain.