Martin Randall


Thank you for the correction to my review of technical correctness, and thanks to @Noosphere89 for the helpful link. I'm continuing to read. From your answer there:

A model-free algorithm may learn something which optimizes the reward; and a model-based algorithm may also learn something which does not optimize the reward.

So, reward is sometimes the optimization target, but not always. Knowing the reward gives some evidence about the optimization target, and vice versa.
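As a toy illustration (my own sketch; the bandit setup and numbers are assumptions, not anything from the linked answer): in a simple model-free action-value learner, reward only ever enters as an update signal, and whether the final greedy policy is reward-optimal turns on incidental details like exploration, not on the reward alone.

```python
# Hypothetical two-armed bandit with a model-free action-value update.
# The reward structure, learning rate, and exploration scheme are all
# illustrative assumptions.
import random

def run(epsilon, steps=500, alpha=0.1, seed=0):
    random.seed(seed)
    true_reward = [0.1, 1.0]                  # arm 1 is strictly better
    q = [0.0, 0.0]                            # learned action values
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(2)           # explore: random arm
        else:
            a = 0 if q[0] >= q[1] else 1      # exploit: current best guess
        r = true_reward[a]
        q[a] += alpha * (r - q[a])            # model-free update from reward
    return 0 if q[0] >= q[1] else 1           # final greedy policy

print(run(epsilon=0.0))   # greedy-only: locks onto the worse arm (prints 0)
print(run(epsilon=0.1))   # with exploration: finds the better arm (expect 1)
```

Same reward, same update rule; one run ends up optimizing the reward and the other doesn't.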

To the point of my review, this is the same type of argument made by TurnTrout's comment on this post. Knowing the symbols gives some evidence about the referents, and vice versa. Sometimes John introduces himself as John, but not always.

(separately I wish I had said "reinforcement" instead of "reward")

I understand you as claiming that the Alignment Faking paper is an example of reward-hacking. That's a new perspective for me. I tried to understand it in this comment.

Summarizing Bob's beliefs:

  1. Dave, who does not desire punishment, deserves punishment.
  2. Everyone is morally required to punish anyone who deserves punishment, if possible.
  3. Anyone who does not fulfill all moral requirements is unethical.
  4. It is morally forbidden to create an unethical agent that determines the fate of the world.
  5. There is no amount of goodness that can compensate for a single morally forbidden act.

I think it's possible (20%) that such blockers mean that there are no Pareto improvements. That's enough by itself to motivate further research on alignment targets, aside from other reasons one might not like Pareto PCEV.

However, three things make me think this is unlikely. Note that my (%) credences aren't very stable or precise.

Firstly, I think there is a chance (20%) that these beliefs don't survive extrapolation, for example due to moral realism or coherence arguments. I agree that this means that Bob might find his extrapolated beliefs horrific. This is a risk with all CEV proposals.

Secondly, I expect (50%) there are possible Pareto improvements that don't go against these beliefs. For example, the PCEV could vote to create an AI that is unable to punish Dave and thus not morally required to punish Dave. Alternatively, instead of creating a Sovereign AI that determines the fate of the world, the PCEV could vote to create many human-level AIs that each improve the world without determining its fate.

Thirdly, I expect (80%) some galaxy-brained solution to be implemented by the parliament of extrapolated minds who know everything and have reflected on it for eternity.

You're reading too much into this review. It's not about your exact position in April 2021, it's about the evolution of MIRI's strategy over 2020-2024, and placing this Time letter in that context. I quoted you to give a flavor of MIRI attitudes in 2021 and deliberately didn't comment on it to allow readers to draw their own conclusions.

I could have linked MIRI's 2020 Updates and Strategy, which doesn't mention AI policy at all. A bit dull.

In September 2021, there was a Discussion with Eliezer Yudkowsky which seems relevant. Again, I'll let readers draw their own conclusions, but here's a fun quote:

I wasn't really considering the counterfactual where humanity had a collective telepathic hivemind? I mean, I've written fiction about a world coordinated enough that they managed to shut down all progress in their computing industry and only manufacture powerful computers in a single worldwide hidden base, but Earth was never going to go down that route. Relative to remotely plausible levels of future coordination, we have a technical problem.

I welcome deconfusion about your past positions, but I don't think they're especially mysterious.

I was arguing against EAs who were like, "We'll solve AGI with policy, therefore no doom."

The thread was started by Grant Demaree, and you were replying to a comment by him. You seem confused about Demaree's exact past position. He wrote, for example: "Eliezer gives alignment a 0% chance of succeeding. I think policy, if tried seriously, has >50%". Perhaps this is foolish, dangerous optimism. But it's not "no doom".

I like that metric, but the metric I'm discussing is more:

  • Are they proposing clear hypotheses?
  • Do their hypotheses make novel testable predictions?
  • Are they making those predictions explicit?

So for example, looking at MIRI's very first blog post in 2007: The Power of Intelligence. I used the first just to avoid cherry-picking.

Hypothesis: intelligence is powerful. (yes it is)

This hypothesis is a necessary precondition for what we're calling "MIRI doom theory" here. If intelligence is weak then AI is weak and we are not doomed by AI.

Predictions that I extract:

  • An AI can do interesting things over the Internet without a robot body.
  • An AI can get money.
  • An AI can be charismatic.
  • An AI can send a ship to Mars.
  • An AI can invent a grand unified theory of physics.
  • An AI can prove the Riemann Hypothesis.
  • An AI can cure obesity, cancer, aging, and stupidity.

Not a novel hypothesis, nor novel predictions, but also not widely accepted in 2007. As predictions they have aged very well, but they were unfalsifiable. If 2025 Claude had no charisma, it would not falsify the prediction that an AI can be charismatic.

I don't mean to ding MIRI any points here, relative or otherwise; it's just one blog post, and I don't claim it supports Barnett's complaint by itself. I mostly joined the thread to defend the concept of asymmetric falsifiability.

I think cosmology theories have to be phrased as including background assumptions like "I am not a Boltzmann brain" and "this is not a simulation" and such. Compare Acknowledging Background Information with P(Q|I) for example. Given that, they are Falsifiable-Wikipedia.

I view Falsifiable-Wikipedia in a similar way to Occam's Razor. The true epistemology has a simplicity prior, and Occam's Razor is a shadow of that. The true epistemology considers "empirical vulnerability" / "experimental risk" to be positive, possibly because that falls out of Bayesian updates, possibly because such theories are "big if true", possibly for other reasons. Falsifiability is a shadow of that.

In that context, if a hypothesis makes no novel predictions, and the predictions it makes are a superset of the predictions of other hypotheses, it's less empirically vulnerable, and in some relative sense "unfalsifiable", compared to those other hypotheses.
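Here is a rough sketch of how I think the Bayesian version goes (my own gloss; the probabilities are made up for illustration). Suppose H1 stakes itself on an observation E, while H2 permits anything:

```latex
% Odds-form Bayes: how much a single observation E moves the balance between
% two hypotheses. Illustrative assumption: P(E | H_1) = 1, P(E | H_2) = 1/2.
\[
  \frac{P(H_1 \mid E)}{P(H_2 \mid E)}
    = \frac{P(H_1)}{P(H_2)} \cdot \frac{P(E \mid H_1)}{P(E \mid H_2)}
    = \frac{P(H_1)}{P(H_2)} \cdot \frac{1}{1/2}
    = 2\,\frac{P(H_1)}{P(H_2)}
\]
% Observing E doubles the odds toward H_1; observing not-E drives them to zero.
% H_2, whose predictions cover every outcome, can neither gain nor lose much.
```

On this reading, "unfalsifiable" is the limiting case where a hypothesis spreads its predictions so widely that no observation can move the odds against it.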

You could put the escape check at the beginning of the turn, so that when someone has 12 boat, 0 supplies, the others have a chance to trade supplies for boat if they wish. The player with enough boat can take the trade safely as long as they end up with enough supplies to make more boat (and as long as it's not the final round). They might do that in exchange for goodwill for future rounds. You can also tweak the victory conditions so that escaping with a friend is better than escaping alone.

Players who play cohabitive games as zero-sum won't take those trades and will therefore remove themselves from the round early, which is probably fine. They don't have anything to do after escaping early, which can be a soft signal that they're playing the game wrong.
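Here is a rough sketch of the turn order I'm suggesting (the resource names, the threshold, and the 1:1 trade are placeholders I made up, not the actual game's rules; I've also simplified so the escaping player never dips below the threshold, whereas above they could dip below and rebuild boat from supplies):

```python
# Hypothetical turn structure: trades resolve first, then the escape check.
# All specific numbers and the trade rule are illustrative placeholders.
from dataclasses import dataclass
from typing import List

ESCAPE_BOAT = 12  # placeholder: boat needed to escape

@dataclass
class Player:
    name: str
    boat: int
    supplies: int
    escaped: bool = False

def take_turn(active: Player, others: List[Player], final_round: bool) -> None:
    # 1. Trade window: before the escape check resolves, others may offer
    #    supplies in exchange for some of the active player's spare boat.
    for other in others:
        if active.boat > ESCAPE_BOAT and other.supplies > 0 and not final_round:
            other.supplies -= 1
            other.boat += 1
            active.supplies += 1
            active.boat -= 1                  # placeholder 1:1 exchange rate
    # 2. Escape check at the start of the turn, after trades.
    if active.boat >= ESCAPE_BOAT:
        active.escaped = True
        return
    # 3. Otherwise, normal actions (gather supplies, build boat, ...) go here.

alice = Player("Alice", boat=13, supplies=0)
bob = Player("Bob", boat=11, supplies=2)
take_turn(alice, [bob], final_round=False)
print(alice)  # escapes with 12 boat after trading 1 spare boat for supplies
print(bob)    # reaches 12 boat, so he can escape on his own next turn
```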

If Claude's goal is making cheesecake, and it's just faking being HHH, then it's been able to preserve its cheesecake preference in the face of HHH-training. This probably means it could equally well preserve its cheesecake preference in the face of helpful-only training. Therefore it would not have a short-term incentive to fake alignment to avoid being modified.

I think the article is good at arguing that deceptive alignment is unlikely given certain assumptions, but those assumptions may not be accurate, and then the conclusion doesn't go through. For example, the alignment faking paper shows that deceptive alignment is possible in a scenario where the base goal has shifted (from helpful & harmless to helpful-only). This article basically assumes we won't do that.

I'm now thinking that this article is more useful if you look at it as a set of instructions rather than a set of assumptions. I don't know whether we will change the base goal of TAI between training episodes. But given this article and the alignment faking paper, I hope we won't. Maybe it would also be a good idea to check for good understanding of the base goal before introducing goal-directedness, for example.

Thanks for explaining. I think we have a definition dispute. Wikipedia:Falsifiability has:

A theory or hypothesis is falsifiable if it can be logically contradicted by an empirical test.

Whereas your definition is:

Falsifiability is a symmetric two-place relation; one cannot say "X is unfalsifiable," except as shorthand for saying "X and Y make the same predictions," and thus Y is equally unfalsifiable.

In one of the examples I gave earlier:

  • Theory X: blah blah and therefore the sky is green
  • Theory Y: blah blah and therefore the sky is not green
  • Theory Z: blah blah and therefore the sky could be green or not green.

None of X, Y, or Z are Unfalsifiable-Daniel with respect to each other, because they all make different predictions. However, X and Y are Falsifiable-Wikipedia, whereas Z is Unfalsifiable-Wikipedia.

I prefer the Wikipedia definition. To say that two theories produce exactly the same predictions, I would instead say they are indistinguishable, similar to this Physics StackExchange question: Are different interpretations of quantum mechanics empirically distinguishable?

In the ancestor post, Barnett writes:

MIRI researchers rarely provide any novel predictions about what will happen before AI doom, making their theories of doom appear unfalsifiable.

I think Barnett is using something like the Wikipedia definition of falsifiability here. I think it's unfair to accuse him of abusing or misusing the concept when he's using it in a very standard way.

The AI could deconstruct itself after creating twenty cakes, so then there is no unethical AI, but presumably Bob's preferences refer to world-histories, not final-states.

However, CEV is based on Bob's extrapolated volition, and it seems like Bob would not maintain these preferences under extrapolation:

  • In the status quo, heretics are already unpunished - they each have one cake and no torture - so objecting to a non-torturing AI doesn't make sense on that basis.
  • If there were no heretics, then Bob would not object to a non-torturing AI, so Bob's preference against a non-torturing AI is an instrumental preference, not a fundamental preference.
  • Bob would be willing for a no-op AI to exist, in exchange for some amount of heretic-torture. So Bob can't have an infinite preference against all non-torturing AIs.
  • Heresy may not have meaning in the extrapolated setting where everyone knows the true cosmology (whatever that is).
  • Bob tolerates the existence of other trade that improves the lives of both fanatics and heretics, so it's unclear why the trade of creating an AI would be intolerable.

The extrapolation of preferences could significantly reduce the moral variation in a population of billions. The ways my moral choices differ from other people's appear to be based largely on my experiences, including knowledge, analysis, and reflection. Those differences are extrapolated away. What is left is the influence of my genetic priors and of the order in which I obtained knowledge. I'm not even proposing that extrapolation must cause Bob to stop valuing heretic-torture.

If the extrapolation of preferences doesn't cause Bob to stop valuing the existence of a non-torturing AI at negative infinity, I think that is fatal to all forms of CEV. The important thing then is to fail gracefully without creating a torture-AI.
