Martín Soto

Mathematical Logic grad student, doing AI Safety research for ethical reasons.

Doing SERI MATS with Vivek Hebbar (MIRI).

He/him

Off-topic, but I just noticed I'm reading your book for my Computational Complexity master's course. And you're posting here on alignment! Such a positive shock! :)

In the past I had the thought: "probably there is no way to simulate reality that is more efficient than reality itself". That is, no procedure implementable in physical reality is faster than reality at the task of, given a physical state, computing the state after t physical ticks. This was motivated by intuitions about the efficiency of computational implementation in reality, but it seems like we can prove it by diagonalization (similarly to how we can prove two systems cannot perfectly predict each other), because the machine could in particular predict itself.

Indeed, suppose you have a machine M that calculates physical states faster than reality. Modify it into M', which first uses M to calculate physical states, then takes some bits from the resulting state, applies some non-identity operation to them (for example, negates them), and outputs the result. Now feed M' the physical description of M', its environment, and this very input, and suppose those privileged bits of the physical state are such that they perfectly correspond to the outputs of in-simulation M'. This is a contradiction, because M' will simulate everything up until simulated-M' finishes its computation, and then output something different from simulated-M'.
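The diagonalization can be sketched as a toy program (a hypothetical construction, not a physical simulator: `predict` stands in for M, and `make_diagonal` builds M'):

```python
def make_diagonal(predict):
    """Given any claimed predictor of a program's output bit, build the
    program M' that the predictor must get wrong."""
    def m_prime():
        # first ask the predictor what we will output, then negate it
        return 1 - predict(m_prime)
    return m_prime

# Any concrete predictor is refuted on its own diagonal program:
predict = lambda prog: 0               # a toy "simulator" claiming every program outputs 0
m_prime = make_diagonal(predict)
assert m_prime() != predict(m_prime)   # the prediction is wrong by construction
```

The same two lines refute any other choice of `predict`, which is the whole content of the diagonal argument.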

It seems like the relevant notion of "faster" here is causality, not time.

Wait, the input needs to contain all the information in the input itself, plus some more (M' and the environment), which should be straightforwardly impossible information-theoretically? Unless somehow the input is a hash which generates both a copy of itself and the description of M' and the environment. But then, would something already contradictory happen when M decodes the hash? I think not necessarily. But maybe obtaining the hash (having fixed the operation performed by M in advance) is already impossible, because we need to calculate what the hash would produce when that operation is run on it. But this seems possible through some fixed-point design, or just a very big brute-force trial and error (given that reality has finite complexity). Wait, but whatever M generates from the hash won't contain more information than the system hash+M contained (at time 0), and the generated thing contains hash+M+E information. So it's not possible unless the environment is nothing (that is, unless the whole isolated initial state of the environment is just the machine performing operations on the hash, but isn't that trivially always the case?...). I'm not clear on this.
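On the "hash which generates a copy of itself" point: the fixed-point design does exist in computation. A quine is the standard construction of a string that regenerates its own full source via self-application, so self-reproduction alone is not the obstruction; the obstruction, if any, is the extra information about M' and the environment. A classic two-line example:

```python
# A standard quine: the string s, applied to itself, reproduces the exact
# source of this two-line program (shown only to illustrate that
# self-reproducing descriptions are possible in principle)
s = 's = {!r}\nprint(s.format(s))'
print(s.format(s))
```

Adding a fixed payload alongside the self-reproduction is a routine variation of the same trick, which is why the counting intuition alone doesn't settle the question.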

In any event, it seems like the paradox could truly reside here, in the assumption that something could semantically carry all the information about its own physical instantiation (and that does resonate with the efficiency intuition above). Then we don't even need to worry about calculating the laws of physics, just about encoding the information of static physical states.

Other things to think about:

  • What do we mean by "given a physical state, compute the state after t physical ticks"? Do I give you a whole universe, or a part of the universe completely isolated from the rest, so that the rest doesn't enter the calculations? (that seems impossible) What do t physical ticks mean? Allegedly they should be fixed by our theory. What if the ticks are continuous, so that calculating any non-zero length of time is infinitely expensive? What about relativity messing up simultaneity? (probably all of these already yield contradictions without even needing the calculation, similarly to the point above)
  • If the complexity of the universe never bottoms out (that is, after atoms there are particles, then quarks, then fields, and so on ad infinitum; this has a philosophical name I don't remember now), then the claim is immediately true.
  • How does this interact with that "infinite computation" thing?

More generally, if we did have such a delicate goal, it would be the case that if we learned that a particular thing had happened at any point in the past in our universe, that entire universe would be partially ruined for us forever. That just doesn't sound realistic.

It does sound realistic, given how much we disvalue extreme suffering, and how much we regret events like the Holocaust (even while acknowledging that we need to look forward, that it is still better for the future to be improved, that we still have the potential to do so, etc.).

Maybe I'm misunderstanding you. If you meant "learning about this past thing is qualitatively different from any positive thing we could implement going forward", then I agree this doesn't seem to be the case. But if you just meant "our utility would be heavily negative, because part of the universe has been devoted to that thing", then I do think that's actually the case for most humans' revealed preferences. Like, everything just continues to add up in the same quantitative utility basket (instead of being qualitatively different), but maybe that past negative sum was so large that it's very difficult in our universe to overpower it.

we have only said that P2B is the convergent instrumental goal. Whenever there are obvious actions that directly lead towards the goal, a planner should take them instead.

Hmm, given your general definition of planning, shouldn't it include realizations (and their corresponding guided actions) of the form "further thinking about this plan is worse than already acquiring some value now", so that P2B itself already includes acquiring the terminal goal (and optimizing solely for P2B is thus optimal)?

I guess your idea is "plan to P2B better" means "plan with the sole goal of improving P2B", so that it's a "non-value-laden" instrumental goal.

I don't think I completely grok the distinction you're trying to point at with "Shape of problem" vs "How capabilities decompose".

I guess "Shape of problem" is about systematic incentives that will be present, like inductive biases in our training procedures, while "How capabilities decompose" is about how easy/natural it is for a mind to solve the task without solving other tasks. The latter is about "minds in general" and the former about "minds trained by us"?

But then I don't understand some of your classifications. For example, how is "it stumbles into human-friendliness before x-risk capability" a claim about the shape of the problem (instead of also depending on how hard the tasks of making humans extinct, understanding/imitating humans, etc. are), while things like "IDA does/doesn't converge to deception (because of obfuscated arguments etc.)" (which would be part of Scalable Oversight) are not shape of the problem, but capabilities decomposition?

I feel like this is a pretty blurry line to classify evidence (and thus maybe not the most useful, but I'm not sure).

  • Moral realism[2], and for some reason the agent cares.
  • Not moral realism, but the practical equivalent - orthogonality may be technically true, but the default thing that comes out of making something sufficiently smart is something human-friendly, and you'd have to have a very good mechanistic understanding of intelligence to create something unfriendly.

How would the first be different from the second? What are you understanding by "moral realism makes the AI human-friendly" that is not just "practical convergence"?

Are you picturing something like "the AI reasons enough / reads this text (which would also convince humans if presented correctly) and becomes completely convinced that a certain moral theory is true, and that it must follow it (a la Descartes with God)"? Because that's just a particular way of getting "practical convergence". You probably agree with that (as you mentioned, they are not mutually exclusive), but I'm interested in whether you understood anything else by moral realism here.

Since this hypothesis makes distinct predictions, it is possible for the confidence to rise above 50% after finitely many observations.

I was confused about why this is the case. I now think I've got an answer (please, anyone, confirm):
The description length of the Turing machine enumerating the theorems of PA is constant. The description length of any Turing machine that enumerates theorems of PA up until time-step n and then does something else grows with n (for big enough n). Since any probability prior over Turing machines has an implicit simplicity bias, no matter what prior we have, for big enough n the latter machines will (jointly) get arbitrarily low probability relative to the first one. Thus, after enough time-steps, given that all observations are PA theorems, our listener will assign arbitrarily higher probability to the first machine than to all the rest, so its probability will rise above 50%.

Edit: Okay, I now saw you mention the "getting over 50%" problem further down:

I don't know if the argument works out exactly as I sketched; it's possible that the rich hypothesis assumption needs to be "and also positive weight on a particular enumeration". Given that, we can argue: take one such enumeration; as we continue getting observations consistent with that observation, the hypothesis which predicts it loses no weight, and hypotheses which (eventually) predict other things must (eventually) lose weight; so, the updated probability eventually believes that particular enumeration will continue with probability > 1/2.

But I think the argument goes through already with the rich hypothesis assumption as initially stated. If the listener has non-zero prior probability on the speaker enumerating theorems of PA, it must have non-zero probability on it doing so in a particular enumeration. (unless our specification of the listener structure doesn't even consider different enumerations? but I was just thinking of their hypothesis space as different Turing Machines the whole time) And then my argument above goes through, which I think is just your argument + explicitly mentioning the additional required detail about the simplicity prior.
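A toy numeric version of the simplicity-bias step (every number here is an assumption chosen for illustration: a prior proportional to 2^(-description length), a constant-length enumerator, and deviators that must additionally encode their deviation step):

```python
import math

C = 10  # assumed description length (in bits) of the PA-enumerating machine

def deviator_length(m):
    # a machine that copies the enumerator for m steps and then deviates
    # must also encode m; a self-delimiting code for m costs about
    # 2*log2(m) extra bits (assumption)
    return C + 2 * math.log2(m) + 1

def posterior_enumerator(n, horizon=10**5):
    """Posterior weight of the pure enumerator after n observed theorems,
    under a prior proportional to 2**(-description length)."""
    w_base = 2.0 ** -C
    # machines that would have deviated at some step <= n are already
    # refuted by the observations; only deviators with deviation point
    # m > n keep their prior weight
    w_deviators = sum(2.0 ** -deviator_length(m) for m in range(n + 1, horizon))
    return w_base / (w_base + w_deviators)

# The enumerator's posterior grows with n and eventually exceeds 1/2:
assert posterior_enumerator(100) > posterior_enumerator(1)
assert posterior_enumerator(100) > 0.5
```

The surviving deviators' joint weight behaves like a tail of a convergent series, which is why the constant-length hypothesis must eventually dominate regardless of the particular prior.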

I think coherence over time is a very difficult problem, and one humans still struggle at, even though (I assume) evolution optimized us hard for this.

Not sure about this. Did humans in the ancestral environment need more than 1-day coherence?

Quite surprisingly, that hasn't been my (still recent) experience at all... I've found the s-riskers I've met to be cheerful and open-minded. Most concretely, I've found in them a lot of that animal-rights oomph, and I haven't felt they were mentally trapped anywhere?
