cat from /dev/null — LessWrong

An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility

Nice proof with a thought-provoking example! Think it could benefit from being translated into a more AI-relevant setting with the following key:

Northland winning = no shutdown
Southland winning = shutdown
Send messenger to Northland = Act in a way that looks dangerous and causes evaluators to probably shut down
Send messenger to Southland = Act in a way that looks safe and causes evaluators to probably not shut down
Bet on Northland = Set up costly processes to do real work in future to attain high utility (e.g. build factories and labs that would need to be safely wound down in event of shut down)
Bet on Southland = Do very conservative work that can immediately be interrupted without any real cost
Cross-over costs could be understood to be the extra fake work to look dangerous when you're actually very interruptible in the NS case and having to show humans how everything is safe in the SN case (not realistic for the utilities to end up as neat as in your example, but rephrasing this feels like it helps me keep my eyes on the ball with various toy scenarios I've considered myself).

However, this actually shows the weakness of relying on pure conditional expectation and the sane pure bets condition. Consider an AI tasked with using nanomedicine to improve human intelligence and which, in the case of no shutdown, values more humans with higher thinkoomph coming sooner. It considers two actions: (A) conduct sensible studies to eventually figure out a way of boosting human intelligence without losing anything humans might value or causing any other kind of negative side effect, and cooperate with humans to show this in small studies that can then be scaled up, and (B) devise an intelligence-boosting virus that also rewrites human values to love the AI and never want to shut it down. We can set up the hypothetical so that both of these cases have a ~0 risk of the shutdown button being pressed, but (B) will yield more utility in the normal utility function due to being able to bypass any human worries about rollout and getting the utility from earlier intelligence boosts. Under the sane pure bets condition (if we set up all the details correctly) the AI should prefer (B) to (A), but this is the opposite of what we'd want from a corrigible system. This is all not meant to be particularly realistic, but still applicable enough to the theory that it causes me to want to reject the suggested desiderata.

I think this shows that we actually need to consider something like counterfactuals for a satisfactory solution to the shutdown problem. The core issue in my example above is that the AI directly causes the shutdown button to not be pressed in case (B) by modifying human values in a way that it doesn't in (A). Another approach might be something like respecting the humans as independent agents and not interfering with them, but this seems harder to formalise and harder to combine with an objective like augmenting human intelligence.

My own current best-attempt at a solution based on this

I have some ideas that still feel half-baked and that I'm unlikely to write up in a full blog post any time soon, so I'll sketch them here in case they help someone else fit more of the pieces together at some point down the line.

We can call it Causal utility mixing as a nod to Naive utility mixing on the Arbital page on Utility Indifference. Pick an action a to maximise $λ_{N} E[U_{N} \mid \operatorname{do}(a) \land \operatorname{do}(\neg s)] + λ_{s} E[U_{s} \mid \operatorname{do}(a) \land \operatorname{do}(s)]$ where the lambdas are parameters picked based on our initial worry about needing to shut down. The parameters can be kept for actions over several time-steps, but we'll need to adjust to one of the base utility functions once the truth value of s is certain (e.g. the shutdown button is actually pressed, since I think we want to avoid certainty of no shutdown). This does not seem to be represented by any utility function and so this agent must be irrational in some way, but in light of the above result I'm leaning towards this being something we actually want and then the question is if we can somehow prove that it stays consistent under ability to self-modify. This seems to handle all the counterexamples I've encountered so far (like the asteroid problem plaguing naive utility mixing and other approaches), but I could very well have missed some or failed to generate others.
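To make the selection rule concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: the function name `causal_utility_mix`, the two action labels (borrowed from the bet-on-Northland / bet-on-Southland key earlier in this comment) and all the utility numbers are made up. The tables are assumed to already hold the interventional expectations, so an action's influence on whether the button gets pressed never enters the score.

```python
from typing import Callable, Iterable


def causal_utility_mix(
    actions: Iterable[str],
    expected_u_n: Callable[[str], float],  # assumed to return E[U_N | do(a), do(not s)]
    expected_u_s: Callable[[str], float],  # assumed to return E[U_s | do(a), do(s)]
    lam_n: float,
    lam_s: float,
) -> str:
    """Pick the action with the highest lambda-weighted mix of the two
    interventional expectations."""

    def score(a: str) -> float:
        return lam_n * expected_u_n(a) + lam_s * expected_u_s(a)

    return max(list(actions), key=score)


# Hypothetical numbers: factories pay off without shutdown but are costly
# to wind down, while conservative work can be interrupted for free.
U_N = {"build_factories": 10.0, "conservative_work": 4.0}
U_S = {"build_factories": -8.0, "conservative_work": 0.0}

# Equal worry about both branches.
cautious = causal_utility_mix(U_N, U_N.get, U_S.get, lam_n=0.5, lam_s=0.5)

# Little initial worry about needing to shut down.
confident = causal_utility_mix(U_N, U_N.get, U_S.get, lam_n=0.9, lam_s=0.1)

print(cautious, confident)
```

With these made-up numbers the equal-weight agent picks the interruptible work and the low-worry agent builds the factories. The point of the sketch is only that the lambdas are fixed inputs rather than probabilities the agent can move by manipulating the button.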

Some of you might recognise the idea of using counterfactuals from Jessica Taylor's and Chris Olah's approach of Maximizing a quantity while ignoring effect through some channel (called Stable actions (causal conditioning) in the Arbital page), which is more advanced in that it actually tries to assign weights to the two different scenarios. I think that if that is a valid approach to the shutdown problem, so will this much simplified solution and it seems easier to analyse the simpler formula.

I've been thinking that maybe you can show that this is somehow rational based on the agent being one party in a two-player game where both players act counterfactually on a graph representing the world (the other being something like an idealised human deciding whether to terminate this hypothetical). I unfortunately haven't had time to compare this to the game theory based approach in The Off-Switch Game by Hadfield-Menell et al., so don't know if there are any similarities. I do feel less certain that it will still work with logical counterfactuals or any form of functional decision theory, so it does seem worth it to investigate a bit more.

Sorry for hijacking your comment feed to cause myself to write this up. Hope it was a bit interesting.

What should you change in response to an "emergency"? And AI risk

cat from /dev/null3y10-1

I think I feel a similar mix of love and frustration towards your comment as your comment expresses towards the post.

Let me be a bit theoretical for a moment. It makes sense for me to think of utilities as a weighted sum $a U_{a} + b U_{b}$ where $U_{a}$ is the utility of things after singularity/superintelligence/etc and $U_{b}$ the utility for things before then (assuming both are scaled to have similar magnitudes so the relative importance is given by the scaling factors $a$ and $b$). There's no arguing about the shape of these or what factors people choose because there's no arguing about utility functions (although people can be really bad at actually visualizing $U_{a}$ ).

Separately from this we have actions that look like optimizing for $U_{a}$ (e.g. AI Safety research and raising awareness), and those that look like optimizing for $U_{b}$ (e.g. having kids and investing in/for their education). The post argues that some things that look like optimizing for $U_{b}$ are actually very useful for optimizing $U_{a}$ (as I understand it, mostly because AI timelines are long enough and the optimization space muddled enough that most people contribute more in expectation from maintaining and improving their general capabilities in a sustainable way at the moment).

Your comment (the pedantic response part) talks about how optimizing for $U_{a}$ is actually very useful for optimizing $U_{b}$ . I'm much more sceptical of this claim. The reason is expected impact per unit of effort. Let's consider sending your kids to college. It looks like top US colleges cost around $50k more per year than state schools, adding up to $200k for a four-year programme. This is maybe not several times better as the price tag suggests, but if your child is interested and able to get into such a school it's probably at least 10% better (to be quite conservative). A lot of people would be extremely excited for an opportunity to lower the existential risk from AI by 10% for $200k. Sure, sending your kids to college isn't everything there is to $U_{b}$ , but it looks like the sign remains the same for a couple of orders of magnitude.

Your talk of a pendulum makes it sound like you want to create a social environment that incentivizes things that look like optimizing for $U_{a}$ regardless of whether they're actually in anyone's best interest. I'm sceptical of trying to get anyone to act against their interests. Rather than make everyone signal that $a ≫ b$ it makes more sense to have space for people with $a \approx b$ or even $a < b$ to optimize for their values and extract gains from trade. A successful AI Safety project probably looks a lot more like a network of very different people figuring out how to collaborate for mutual benefit than a cadre of self-sacrificing idealists.

The Efficient LessWrong Hypothesis - Stock Investing Competition

cat from /dev/null4y30

I think you picked a good suggestion for a bad reason. Both because of the difference between market cap and price per coin, as the sibling comment has pointed out, and because you don't give any reason for this to change in the next year when it's been the case for the last N years. Here's what I think is a better reason.

The Efficient LessWrong Hypothesis - Stock Investing Competition

cat from /dev/null4y30

Suggestion: Ethereum (ETH)

Reasoning: There are a number of upgrades planned in the next few years. The biggest problem for cryptocurrencies in general is the low throughput of transactions (leading to high fees due to high demand for a scarce resource). The Ethereum project has long-term plans to improve this with sharding and zero-knowledge proofs, but the sharding upgrade is not planned until 2023 and it's not clear how much of the value of zero-knowledge proofs will be captured by Ethereum as opposed to the Layer-2 chains that build on top of Ethereum.

The stronger case for ETH in the next 12 months is the proof-of-stake upgrade planned for later this year. This upgrade is a pre-requisite to the later sharding upgrade. It will both decrease the amount of new ETH that is created as part of running the network and change who it's awarded to. Instead of rewarding miners, who have a capital investment in compute and upkeep costs of electricity, it will reward stakers, who have a staked capital investment of ETH and minor upkeep costs of compute. The Triple Halving thesis by SquishChaos (original was previously summarized on LW) claims that this will cause a great reduction in supply (mainly because ETH won't be sold to pay for electricity), comparable to the halving of Bitcoin rewards iterated three times, and that this should push the price up to around $30k~$50k (i.e. around 10x the current value of around $3k). This could certainly be an overestimate, but even if you adjust the numbers down to correct for overly optimistic calculations (someone please do a pessimistic calculation here), I think you would find that this still looks promising.

Risks: If all of crypto loses a lot of value this would impact ETH and could lead to a net loss. An alternative version of this would be to go long ETH and short BTC in order to distinguish this from movements of crypto in general. However, I haven't thought through if all the assumptions of this thesis are still likely to hold in such scenarios.

Ethereum upgrades have been delayed in the past. Proof-of-stake has been on the Ethereum roadmap for years, and was previously planned to be merged late 2021. However, they've now merged the Kiln merge testnet which is the last merge testnet before upgrading existing public testnets.

Some assumption of the analysis, or an inference step, could be wrong. I don't claim to have verified anything here beyond the most superficial level.

Self-consciousness wants to make everything about itself

cat from /dev/null6y140

It might be worth separating self-consciousness (awareness of how your self looks from within) from face-consciousness (awareness of how your self looks from outside). Self-consciousness is clearly useful as a cheap proxy for face-consciousness, and so we develop a strong drive to be able to see ourselves as good in order for others to do so as well. We see the difference between this separation and being a good person being only a social concept (suggested by Ruby) by considering something like the events in "Self-consciousness in social justice" with only two participants: then there is no need to defend face against others, but people will still strive for a self-serving narrative.

Correct me if I'm wrong: you seem more worried about self-consciousness and the way it not only pushes people to act performatively, but also limits their ability to see their performance as a performance, causing real damage to their epistemics.

Everybody Knows

cat from /dev/null6y180

One way to see this is to point out that when Alice tells Bob that everybody knows X, either Bob is asserting X because people act as if they don’t know X, or Bob does not know X. That’s why Alice is telling Bob in the first place.

It could also be that everybody (suitable quantification might be limited to: every student in this course/everyone at this party/every thinker on this site/every co-conspirator of our coup/etc) does in fact know X, but not everybody knows that everybody knows X. Depending on the circumstance of this being pointed out this can be part of creating common knowledge of X. This is related, but not identical to the fourth mode (of self-fulfilling prophecies) you describe. Consider the three statements:

X: "the king is wicked and his servants corrupt"

X': everybody in our conspiracy knows X

X'': it's common knowledge in our conspiracy that X

It's clear that saying X' can't make X true as long as our conspiracy doesn't leak information to the king or his servants. It's also clear that saying X' at a meeting of our whole conspiracy makes X'' true, and that this can be a useful tool for collective action. In fact if X' is not quite true (some people have doubts) saying it can make it true (if our co-conspirators are modal logicians using something like the T axiom).

From a more individualistic view-point such a statement of this form could still contain information if Bob does not know that he knows X (consider Zizek's description of ideology as unknown knowns).

What does the claim that ‘everybody knows’ mean?

I think you point to a valid type of problem with conspiracies of savvy and complicity, but mistakenly paint these as weapons asymmetrically favoring the forces of darkness. Perhaps the Common Knowledge framing makes it clearer, but the modes you describe are degenerate cases of tools for Justice:

The first central mode is ‘this is obviously true because social proof, so I don’t have to actually provide that social proof.’

The first mode attempts to look like deference to experts on a question of fact. This often isn't as useful as discussing the object level, but, as long as the statement about expert opinion is not a lie, it might be more effective and legible when used as the basis for a decision.

The second central mode of ‘everyone knows’ is when it means ‘if you do not know this, or you question it, you are stupid, ignorant and blameworthy.’

We need to be able to punish defection in a legible way so it doesn't degenerate into blame games. For this we need common knowledge about these things so that people don't have excuses for saying or acting along the lines of not-X, but it really needs to be common knowledge so that people don't have excuses for not punishing defection on this point.

A third central mode is ‘if you do not know this (and, often, also claim everyone knows this), you do not count as part of everyone, and therefore are no one. If you wish to be someone, or to avoid becoming no one, know this.’

Restricting the universe over which you quantify can be really useful. In particular if you want to coordinate around a concrete project (like building Justice or a spaceship) it's necessary to restrict your notion of 'somebody' to a set that only includes people willing and able to accept certain facts/norms as given.

The fourth central mode is ‘we are establishing this as true, and ideally as unquestionable, so pass that information along as something everyone knows.’ It’s aspirational, a self-fulfilling prophecy. [...]

As mentioned above the self-fulfilling nature can be subtle, but it's not necessarily an ominous prophecy. It could be more along the lines of: "We all band together in accomplishing our goal, so we accomplish it and everyone who took part is greatly rewarded."
