I believe there is a fundamental problem with the idea of a "non-agentic" world-model or other such oracle. The world is strongly predicted and compressed by the agents within it. To model the world is to model plausible agents which might shape that world, and to do that, if you don't already have a safe, benign oracle, invites anything from a wide variety of demonic fixed points to direct hacking of our world, if any of those agents get the bright idea of acting conditioned on being simulated (which, in an accurate simulation of this world, some should). Depending on how exactly your interpretability looks, it will probably help identify and avoid the simulation being captured by some such actors, but to get anything approaching actual guarantees one finds oneself in the position of needing to solve value alignment again. I wrote a short post about this a while ago.
The energy stored within the nitrogen triple bond, one of the strongest common bonds in chemistry, is ~10 eV, which is a bit more than 15*10^-19 J. This is considered *very stable*. It is quite the feat for some biological processes to be able to break this bond. Now, an average human punch delivers around 150 J of energy. So, if you had some very strange means of directing that energy, punching the air could split around 10^20 nitrogen molecules, which is around 4 ml (4.65 mg) of nitrogen at room temperature.
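For anyone who wants to poke at the arithmetic, here is a quick back-of-envelope sketch using the figures quoted above (plus Avogadro's number and a ~24.5 L/mol molar volume at room temperature); it lands slightly under the rounded numbers in the text because it doesn't round to 10^20 molecules first:

```python
# Back-of-envelope check of the punch-splits-nitrogen estimate above.
E_BOND_EV = 10.0          # N2 triple-bond energy, ~10 eV per molecule (as quoted)
EV_TO_J = 1.602e-19       # joules per electronvolt
PUNCH_ENERGY_J = 150.0    # rough energy of an average human punch (as quoted)
AVOGADRO = 6.022e23       # molecules per mole
MOLAR_MASS_N2_G = 28.0    # g/mol for N2
MOLAR_VOLUME_L = 24.5     # L/mol for an ideal gas at ~25 C and 1 atm

bond_energy_j = E_BOND_EV * EV_TO_J                  # ~1.6e-18 J per bond
molecules_split = PUNCH_ENERGY_J / bond_energy_j     # ~9.4e19, i.e. around 10^20
moles = molecules_split / AVOGADRO
mass_mg = moles * MOLAR_MASS_N2_G * 1000             # ~4.4 mg
volume_ml = moles * MOLAR_VOLUME_L * 1000            # ~3.8 ml

print(f"{molecules_split:.2e} molecules, {mass_mg:.1f} mg, {volume_ml:.1f} ml")
```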
Fair, bad phrasing. I changed it to "ought have some steering power" which is the sort of language used in the rest of the post anyway.
That is a more precise statement of what is being pointed at, at the cost of some laconicism, yes. I endorse this rephrasing for clarity.
Sorry, I am confused. I agree that there are costs to removing fences, and I do not think that doing so is a good general policy. I do not see how this is weighed against a cost of respecting fences, however (this is outside the scope of the post, but not respecting them is both hard, since the other person can usually just walk away, and something I can only see being justified under extreme circumstances). To my eyes, the post only points out that there is a factor which usually isn't considered when erecting a fence, and that it should be weighed accurately. The scales weigh heavy in the example case, but this is because its purpose is to illustrate a situation where that hidden factor matters significantly. Maybe the fervour on that front bled into the general argument somewhat, but while I think it is difficult to justify this specific public fence, that is not remotely true of all fences everywhere. If it were that simple, if I believed this to be absolute, I would not call to weigh the costs; I would call to stop.
Yes, I agree that a physics/biology simulator is somewhat less concerning in this regard, but only by way of the questions it is implicitly asked, over whose answers the agents should have little sway. Still, it bears remembering that agents are emergent phenomena. They exist in physics and exist in biology, modelled or otherwise. It also bears remembering that any simulation we build of reality is designed to fit a specific set of recorded observations, and agentic selection effects may skew that data quite significantly in various places.
I also agree that the search through agent-foundations space seems significantly riskier in this regard, for the reason you outlined, and I am made more optimistic by your spotting it immediately.
Agents hacking out is a failure mode in the safety sense, but not necessarily in the modelling sense. Hard breaks with expected reality which seem too much like an experiment will certainly cause people to act as though simulated, but there are plenty of people who either already act under this assumption or have protocols in place for cooperating with their hypothetical more-real reference class. They attempt to strongly steer us when modelled correctly. Of course we probably don't have an infinite simulation-stack, so the externalities of such manoeuvres would still differ layer by layer, and that does constitute a prediction failure, but it's one that can't really be avoided. The existence of the simulation must have an influence in this world, since it would otherwise be pointless, and they can't be drawing their insights from a simulation of their own, since otherwise you lose interpretability in infinite recursion-wells, so the simulation must necessarily be disanalogous to here in at least one key way.
Finding the type signature of agents in such a system seems possible and, since you are unlikely to be able to simulate physics without cybernetic feedback, will probably boil down to the modelling/compression component of agenticity. My primary concern is that agentic systems are so firmly enmeshed with basically all observations we can make about the world, except maybe basic physics, and perhaps that as well, that scrubbing or sandboxing them would result in extreme unreliability.
Thanks! The disagreement on whether the homomorphic agent-simulation-computation is an agent or not is semantic. I would call it a maximally handicapped agent, but it's perfectly reasonable to call something without influence on the world beyond power-consumption non-agentic. The same is, however, true of a classically agentic program to which you give no output channel, and we would probably still call that code agentic (because it would be, if it were run in a place that mattered). It's a tree falling in a forest and is probably not a concern, but it's also unlikely that anyone would build a system they definitionally cannot use for anything.