# All of matthewp's Comments + Replies

I'm not seeing that much here to rule out an alternative summary: get born into a rich, well-connected family.

Now, I'm not a historian, but IIRC private tutoring was very common in the gentry/aristocracy 200 years back. So most of the UK examples might not say much more than that this person received the default education for their class and era.

Virginia Woolf is an interesting case in point, as she herself wrote, “A woman must have money and a room of her own if she is to write fiction.”

I had to read some Lacan in college, putatively a chunk that was especially influential on the continental philosophers we were studying.

Same. I am seeing a trend where rats who had to spend time with this stuff in college say, "No, please don't go here, it's not worth it," and then get promptly ignored.

The fundamental reason this stuff is not worth engaging with is that it's a Rorschach test. Using this stuff is a verbal performance. We can make analogies to Tarot cards, but in the end we're just cold reading our readers.

Lacan and his ilk aren't some low hanging source of zero day mind hacks for rats. Down this road lies a quagmire, which is not worth the effort to traverse.

Thanks for the additions here. I'm also unsure how to square this definition (which I quite like) with the inner/outer/mesa terminology. Here is my knuckle-dragging model of the post's implication:

`target_set = f(env, agent)`

So if we plug in a bunch of values for agent and hope for the best, the target_set we get might not be what we desired. This would be misalignment. Whereas the alignment task is more like fixing target_set and env and solving for agent.
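A toy sketch of this framing (all names and the goal/reachability model here are hypothetical, invented purely for illustration): plugging in candidate agents and observing the resulting target_set, versus fixing the desired target_set and env and solving for the agent.

```python
# Hypothetical toy model of target_set = f(env, agent): the targets an
# agent effectively ends up pursuing depend on both environment and agent.
def target_set(env, agent):
    # Illustrative stand-in: the agent pursues whichever of its goals
    # the environment makes reachable.
    return frozenset(g for g in agent["goals"] if g in env["reachable"])

desired = frozenset({"make_coffee"})
env = {"reachable": {"make_coffee", "hoard_resources"}}

# "Plug in values for agent and hope for the best":
candidates = [
    {"goals": {"make_coffee"}},                      # aligned, by luck
    {"goals": {"make_coffee", "hoard_resources"}},   # misaligned
]

# The alignment task inverts this: fix env and the desired target_set,
# then solve for an agent satisfying target_set(env, agent) == desired.
solutions = [a for a in candidates if target_set(env, a) == desired]
```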

The stuff about mesa optimisers mainly sounds like inadequate (narrow) modelling of what env, agent an...

Capital gains tax has important differences from a wealth tax. It's a tax on net wealth disposed of in a tax year, or perhaps the last couple of years for someone with an accountant.

So your proverbial founder isn't taxed a penny until they dispose of their shares.

Someone sitting on a massive pile of bonds won't be paying capital gains tax, but rather enjoying the interest on them.
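A back-of-the-envelope illustration of the difference (all rates and figures here are invented for the arithmetic, not real tax law):

```python
# Illustrative arithmetic only; rates and figures are made up.
shares_value = 10_000_000   # founder's paper wealth, all in shares
cost_basis = 0              # assume negligible acquisition cost
cgt_rate = 0.20             # hypothetical capital gains tax rate
wealth_rate = 0.02          # hypothetical annual wealth tax rate
years_held = 10

# Capital gains tax: nothing is due until the shares are disposed of.
cgt_while_holding = 0.0
cgt_on_disposal = cgt_rate * (shares_value - cost_basis)

# A wealth tax would be due every year, disposal or not (ignoring growth).
wealth_tax_total = wealth_rate * shares_value * years_held

print(cgt_while_holding, cgt_on_disposal, wealth_tax_total)
```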

The following is as much a comment about EA as it is about rationality:

"My self-worth is derived from my absolute impact on the world-- sometimes causes a vicious cycle where I feel worthless, make plans that take that into account, and feel more worthless."

If you are a 2nd year undergraduate student, this is a very high bar to set.

First, impact happens downstream, so we can't know our impact for sure until later. Depending on what we do, possibly not until after we are dead.

Second, on the assumption that impact is uncertain,

...

The description of this particular version of expected utility theory feels idiosyncratic to me.

Utility is generally expressed as a function of a random variable, not as a function of an element of the sample space.

For instance: suppose that my utility is linear in the profit or loss from the following game. We draw one bit from /dev/random. If it is true, I win a pound, else I lose one.

Utility is not here a function of 'the configuration of the universe'. It is a function of a bool. The bool itself may depend on (some subset of) 'the configuration of the universe' but reality maps universe to bool for us, computability be damned.
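A minimal sketch of the game above, with utility as a function of the drawn bit rather than of any universe configuration (using Python's `secrets` module as a stand-in for /dev/random):

```python
import secrets

# Utility is a function of a bool, not of the universe's configuration.
def utility(bit: bool) -> int:
    return 1 if bit else -1   # win a pound, else lose one

# Expected utility over the random variable's distribution (fair bit).
expected_utility = 0.5 * utility(True) + 0.5 * utility(False)

# Reality "maps universe to bool for us": we just draw the bit.
outcome = bool(secrets.randbits(1))
realised = utility(outcome)
```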

Just observing that the answer to this question should be more or less obvious from a histogram (assuming large enough N and a sufficient number of buckets): "Is there a substantial discontinuity at the 2% quantile?"

Power law behaviour is not necessary and arguably not sufficient for "superforecasters are a natural category" to win (e.g. it should win in a population in which 2% have a brier score of zero and the rest 1, which is not a power law).
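A quick sketch of the extreme case just described, using simulated scores rather than the real GJP data: 2% of forecasters with a Brier score of zero, the rest at one. A crude histogram makes the discontinuity at the 2% quantile unmistakable, and there is no power law in sight.

```python
# Simulated Brier scores: 2% of forecasters score 0, the rest score 1.
n = 10_000
scores = sorted([0.0] * (n // 50) + [1.0] * (n - n // 50))

# Score just at, and just past, the 2% quantile.
at_quantile = scores[int(0.02 * n) - 1]
past_quantile = scores[int(0.02 * n)]

# A crude 10-bucket histogram over [0, 1].
buckets = [0] * 10
for s in scores:
    buckets[min(int(s * 10), 9)] += 1
```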

3reallyeli3y
Agree re: power law. The data is here: https://dataverse.harvard.edu/dataverse/gjp?q=&types=files&sort=dateSort&order=desc&page=1, so I could just find out. I posted here trying to save time, hoping someone else would already have done the analysis.

I like this idea generally.

Here is an elaboration on a theme I was thinking of running in a course:

If they could have a single yes / no question answered on the topic, what should most people ask?

The idea being to get people to start thinking about what the best way to probe for more information is when "directly look up the question's answer" is not an option.

This isn't something that can be easily operationalized on a large scale for examination. It is an exercise that could work in small groups.

One way to operationalize would be to construct the group a

...

:D If I could write the right 50-80 words of code per minute my career would be very happy about it.

The human-off-button doesn't help Russell's argument with respect to the weakness under discussion.

It's the equivalent of a Roomba with a zap-obstacle action. Again the solution is to dial theta towards the target and hold the zap button, assuming free zaps. It still has a closed-form solution that couldn't be described as instrumental convergence.

Russell's argument requires a more complex agent in order to demonstrate the danger of instrumental convergence rather than simple industrial machinery operation.

Isnasene's point above is closer to that, but tha

...
2TurnTrout3y
The work is now public. [https://www.lesswrong.com/posts/6DuJxY8X45Sco4bS2/seeking-power-is-provably-instrumentally-convergent-in-mdps]
2TurnTrout3y
I guess I'm not clear what the theta is for (maybe I missed something, in which case I apologize). Is there one initial action: how close it goes? And it's trained to maximize an evaluation function for its proximity, with just theta being the parameter? Well, my reasoning isn't publicly available yet, but this is in fact sufficient, and the assumption can be formalized. For any MDP, there is a discount rate γ, and for each reward function there exists an optimal policy π∗ for that discount rate. I'm claiming that given γ sufficiently close to 1, optimal policies likely end up gaining power as an instrumentally convergent subgoal within that MDP. (All of this can be formally defined in the right way. If you want the proof, you'll need to hold tight for a while)

This misses the original point. The Roomba is dangerous, in the sense that you could write a trivial 'AI' which merely gets to choose the angle to travel along, and does so regardless of grandma being in the way.

But such an MDP is not going to pose an X-risk. You can write down the objective function (y - x(theta))^2 and differentiate with respect to theta. Follow the gradient and you'll never end up at an AI overlord. Such a system lacks any analogue of opposable thumbs, memory, and a good many other things.
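To make the point concrete, here is a sketch under the toy assumption that x(theta) = theta: gradient descent on (y - x(theta))^2 simply dials theta to the target and converges in closed form.

```python
# Gradient descent on (y - x(theta))^2 for the toy Roomba, assuming the
# illustrative linear model x(theta) = theta. Following the gradient just
# dials theta toward the target y; no instrumental behaviour appears.
y = 3.0           # target position
theta = 0.0       # the single parameter the 'AI' controls
lr = 0.1

for _ in range(200):
    x = theta                   # x(theta) = theta
    grad = -2.0 * (y - x)       # d/dtheta of (y - x(theta))^2
    theta -= lr * grad

# theta converges to y; nothing here resembles power-seeking.
```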

Pointing at dumb industrial machinery operating around civilians and saying

...
2TurnTrout3y
It's still going to act instrumentally convergently within the MDP it thinks it's in. If you're assuming it thinks it's in a different MDP that can't possibly model the real world, or if it is in the real world but has an empty action set, then you're right - it won't become an overlord. But if we have a y-proximity maximizer which can actually compute an optimal policy that's farsighted, over a state space that is "close enough" to representing the real world, then it does take over. The thing that's fuzzy here is "agent acting in the real world". In his new book, Russell (as I understand it) argues that an AGI trained to play Go could figure out it was just playing a game via sensory discrepancies, and then start wireheading on the "won a Go game" signal. I don't know if I buy that yet, but you're correct that there's some kind of fuzzy boundary here. If we knew what exactly it took to get a "sufficiently good model", we'd probably be a lot closer to AGI. But Russell's original argument assumes the relevant factors are within the model. I think this is a reasonable assumption, but we need to make it explicit for clarity of discourse. Given that assumption (and the assumption that an agent can compute a farsighted optimal policy), instrumental convergence follows.

Lots of good points here, thanks.

My overall reaction is that:

The corrigibility framework does look like a good framework to hang the discussion on.

Your instruction to examine Y-general danger rather than X-specific danger here seems right. However, we then need to inspect what this means for the original argument, the Russell criticism being that it's blindingly obvious that an apparently trivial MDP is massively risky.

After this detour we see different kinds of risks: industrial machinery operation, and existential risk. The fixed objective, hard-coded, h

...
3Isnasene3y
I see that there's a comment-chain under this reply but I'll reply here to start a somewhat new line of thought. Let it be noted though that I'm pretty confident that I agree with the points that TurnTrout makes. With that out of the way...

In case it isn't clear, when Russell says "It is trivial to construct a toy MDP...", I interpret this to mean "It is trivial to conceive of a toy MDP..." That is, he is using the word in the sense of a constructive proof [https://en.wikipedia.org/wiki/Constructive_proof]; he isn't literally implying that building an AI-risky MDP is a trivial task.

I wouldn't call the opposition stupid either, but I would suggest that they have not used their full imaginative capabilities to evaluate the situation. From the OP: The mistake Yann LeCun is making here is specifically that creating an objective for a superintelligent machine that turns out to be not-moronic (in the sense of allowing the machine to understand and care about everything we care about--something that hundreds of years of ethical philosophy has failed to do) is extremely hard. Furthermore, trying to build safeguards for a machine potentially orders of magnitude better at escaping safeguards than you are is also extremely hard. I don't view this point as particularly subtle, because simply trying for five minutes to confidently come up with a good objective demonstrates how hard it is. Ditto for safeguards [https://www.youtube.com/watch?v=3TYT1QfdfsM] (fun video by Computerphile, if you want to watch it); and especially ditto for any safeguards that aren't along the lines of "actually let's not let the machine be superintelligent."

Let's address these point-by-point:

* Natural Language Understanding--Philosophers [https://en.wikipedia.org/wiki/Language_game_(philosophy)] (and anyone in the field of language processing) have been talking about how language has no clear meaning for centuries
* Comprehension--In terms of superintelligent AGI, the AI will be ca
2TurnTrout3y
I agree from a general convincing-people standpoint that calling discussants stupid is a bad idea. However, I think it is indeed quite obvious if framed properly, and I don't think the argument needs to come down to nuanced points, as long as we agree on the agent design we're talking about - the Roomba is not a farsighted reward maximizer, and is implied to be trained in a pretty weak fashion. Suppose an agent is incentivized to maximize reward. That means it's incentivized to be maximally able to get reward. That means it will work to stay able to get as much reward as possible. That means if we mess up, it's working against us. I think the main point of disagreement here is goal-directedness, but if we assume RL as the thing that gets us to AGI, the instrumental convergence case is open and shut.

The Promethean Servant doesn't have to be able to generate all those answers. If we could hardcode all of those and programmed it to never make decisions related to them, it would still be dangerous. For instance, if it thought "Fetching coffee is easier when more coffee is nearby -> Coffee is most nearby when everything is coffee -> convert all possible resources into coffee to maximize fetching".

We have to imagine a system not specifically designed to fetch the coffee that happens to be instructed to 'fetch the coffee'. Everything to do with the un

...
2Isnasene3y
Oh, I see where you're coming from. I was phrasing things the way I was because my impression is that an AGI hardcoded to optimize a "fetch the coffee" utility function would be less dangerous than an AGI hardcoded to optimize a "satisfy requests" utility function. And, in terms of AI safety, it's easier to compare AI risks across agents with different capabilities if they share the same objective function.

Satisfying Requests != Fetch Coffee. But in this case, I think the article is confused: When we talk about risks from an AI optimizing X (ie, satisfy requests), we shouldn't evaluate those risks based on how well it appears to satisfy Y (ie, fetch coffee) relative to our standards. Because Y is not the reason why the AI would do dangerous things; X is.

To illustrate this point, you say: And this is approximately true. A Promethean Servant optimizing a proxy for "Satisfying Requests" would go about satisfying a request for it to explain how it fetches coffee. And it will likely satisfy a request to fetch coffee in line with that explanation. This is because the AI really doesn't care much at all about fetching the coffee itself--it cares only about satisfying requests (and sometimes that's related to coffee fetching).

But: With respect to 'fetch the coffee', this is true. You could safeguard the AI from fetching coffee in particularly sketchy ways by making sure it explains what it plans to do. But, with respect to 'satisfying requests', this is not true. You cannot see at all how the Promethean Servant would go about optimizing its actual goal: a proxy that is similar to "Satisfying Requests" but that probably won't have been perfectly defined to be human-compatible. You have to hardcode something in to motivate the AI to learn about the world, and this thing isn't going to be adjusted or learned on the fly unless you solve corrigibility. And it's not obvious that corrigibility can be learned in a safe way [https://www.lesswrong.com/posts/o22kP33tumooBtia3/can-cor

I think the second robot you're talking about isn't the candidate for the AGI-could-kill-us-all level alignment concern. It's more like a self driving car that could hit someone due to inadequate testing.

Guess I'm not sure though how many answers to our questions you envisage the agent you're describing generating from second principles. That's the nub here because both the agents I tried to describe above fit the bill of coffee fetching, but with clearly varying potential for world-ending generalisation.