mako yass

interactive system design


I agree that there doesn't seem to be a theory, and there are many things about the problem that make reaching any level of certainty about it impossible (the "we can only have one sample" thing). I do not agree that there's a principled argument for giving up looking for a coherent theory.

I suspect it's going to turn out to be like it was with priors about the way the world is: lacking information, we just have to fall back on Solomonoff induction. It works well enough, and it's all we have, and it's better than nothing.

So... oh... we can define priors about our location in terms of the complexity of a description of that location. This feels like most of the solution, but there are gaps left, and I can't tell how difficult it will be to complete the bridges.

A fun thing about example 1 is that we can totally imagine an AF System that could drag a goat off a cliff and eat it (put it in a bioreactor which it uses to charge its battery). It's just that no one would make that, because it wouldn't make sense. Artificial systems use 'cheats' like solar power or hydrocarbons because the cheats are better. There may never be an era or a use case where it makes sense to 'stop cheating'.

A weird but important example is that you might not ever see certain (sub-pivotal) demonstrations of strength from most AGI research institutions, not because they couldn't make those things, but because doing so would cause them to be nationalized as a defense project.

Ack. Despite the fact that we've been having the AI boxing/infohazards conversation for like a decade I still don't feel like I have a robust sense of how to decide whether a source is going to feed me or hack me. The criterion I've been operating on is like, "if it's too much smarter than me, assume it can get me to do things that aren't in my own interest", but most egregores/epistemic networks, which I'm completely reliant upon, are much smarter than me, so that can't be right.

This depends on how fast they're spreading physically. If the spread rate is close to c, I don't think that's the case. I think it's more likely that our first contact will come from a civ that hasn't received contact from any other civs yet (and SETI attacks would rarely land: most civs who hear them would be either too primitive or too developed to be vulnerable to them before their senders arrive physically).

Additionally, I don't think a viral SETI attack would be less destructive than what's being described.

Over time, the concept of Ra settled in my head as... the spirit of collective narcissism, where we must recognize narcissism as delusional striving towards attaining the impossible social security of being completely beyond criticism, to be flawless, perfect, unimprovable, to pursue Good Optics with such abandon, as to mostly lose sight of whatever it was you were running from.

It leads to not being able to admit to most of the org's imperfections even internally. Though they may admit to that imperfection internally, doing so resigns them to it, and they submit to it.

I don't like to define it as the celebration of vagueness, in my definition that's just an entailment. Something narcissism tends to do, to hide.

I really wish that the post had been written in a way that let me figure out it wasn't for me sooner...

I think it would have saved a lot of time if the paragraph in bold had been at the top.

Whether, if you give the agent n additional units of resources, they optimize U by less than k*n. Whether the utility generated per unit of additional space and matter grows slower than a linear function. Whether there are diminishing returns to resources. An example of a sublinear function is the logarithm. An example of a superlinear function is the exponential.
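To make the distinction concrete, here is a rough sketch that compares the utility gained from one extra unit of resources at a small and at a large resource level. The three utility functions, the sample points, and the helper names are all made up for illustration, and checking two points is only a crude proxy for the asymptotic definition.

```python
import math

def marginal_gain(utility, resources, extra=1.0):
    """Utility gained by adding `extra` units on top of `resources`."""
    return utility(resources + extra) - utility(resources)

# Hypothetical utility functions, chosen only to illustrate the three cases.
def log_utility(r):
    return math.log(1.0 + r)   # sublinear: each extra unit is worth less

def linear_utility(r):
    return 2.0 * r             # linear: each extra unit is worth the same

def exp_utility(r):
    return math.exp(0.1 * r)   # superlinear: each extra unit is worth more

def returns_shape(utility, small=10.0, large=100.0, tol=1e-9):
    """Compare marginal gains at two sample points (an assumption-laden proxy)."""
    at_small = marginal_gain(utility, small)
    at_large = marginal_gain(utility, large)
    if at_large < at_small - tol:
        return "diminishing"
    if at_large > at_small + tol:
        return "increasing"
    return "constant"

for name, u in [("log", log_utility), ("linear", linear_utility), ("exp", exp_utility)]:
    print(name, returns_shape(u))
```

An agent whose U looks like the log case gets less and less out of each additional unit of space and matter, which is the "diminishing returns" reading of sublinear used above.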

All that's really required is storing data, maybe keeping it encrypted for a while, and then decrypting it and doing the right thing with it once we're grown.

We pretty much do have a commitment framework for indefinite storage: it's called Arweave. Timed decryption seems unsolved (at least, Vitalik asserted that it is on a recent Epicenter podcast. Interestingly, he also asserted that if we had timed decryption, blockchains would be dramatically simpler/MEV-proof/bribe-proof, I assume because it would allow miners to commit hashes before knowing what they represent).

It's also possible that we could store it without any encryption, without creating any additional risk, by storing just a partial sample of the weights that will adequately convey the AGI's misaligned utility function to a superintelligence doing archeology, but which wouldn't help anyone today to train another AGI any quicker than they otherwise would. I think this is pretty likely.

There are other subcases of reward hacking this wouldn't cover, though. Let's call the misaligned utility function U.

  • If U permits physically expanding the register so that it represents a larger number than the max previously possible, then this creates a very, very exponential incentive to make its computer system as big as possible. That would not be cheap enough to reward for us to make any attractive promises; taking risks to conquer us completely would always have a higher expected payoff, I think?
  • This seems less likely, but U might not be concerned with any specific physical computer; it may map onto any equivalent system, in which case making additional computers with maxed registers might be considered good. In this case it seems slightly less intuitively likely to me that the returns on making additional systems would be superlinear. If returns on resources are sublinear, relative to the human utility function (and we aren't completely sure whether that one is superlinear, linear, or sublinear either), then a good mutual deal can still be made.
    (welfare utilitarians seem to think the human U is linear. I think it's slightly superlinear. My friend /u/gears of ascension seems to think it's sublinear.)
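A toy way to see why the shape of U matters for whether a deal is attractive: compare a guaranteed share of the resources against a gamble that wins everything with some probability and nothing otherwise. This is only a sketch under invented assumptions. The share, the win probability, and the three utility functions are hypothetical numbers, not estimates of anything.

```python
import math

def prefers_deal(utility, total, share, win_prob):
    """True if a guaranteed `share` of `total` beats gambling for all of it.

    The gamble pays `total` with probability `win_prob` and 0 otherwise.
    """
    sure_thing = utility(share * total)
    gamble = win_prob * utility(total) + (1.0 - win_prob) * utility(0.0)
    return sure_thing > gamble

# Hypothetical numbers: a 25% guaranteed share vs. a 40% chance of taking everything.
TOTAL, SHARE, WIN_PROB = 100.0, 0.25, 0.4

sublinear = math.sqrt               # diminishing returns to resources
linear = lambda r: r                # constant returns
superlinear = lambda r: r ** 2      # increasing returns

print(prefers_deal(sublinear, TOTAL, SHARE, WIN_PROB))    # True
print(prefers_deal(linear, TOTAL, SHARE, WIN_PROB))       # False
print(prefers_deal(superlinear, TOTAL, SHARE, WIN_PROB))  # False
```

With these made-up numbers, only the agent with sublinear returns takes the smaller guaranteed share, even though the gamble has a higher expected amount of resources. The more superlinear U gets, the larger the share you would have to promise before the deal beats the gamble.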

An interesting related question would be... should we also punish non-confession? The default attitude around here seems to be that we pre-commit to ignore punishments, and so we would expect AGI to do the same, but I don't know what that assumption rests on. A relevant article would be Diffractor's threat-resistant bargaining megapost.
