All of James Payor's Comments + Replies

On A List of Lethalities

I agree!

I think that in order to achieve this you probably have to do lots of white-box things, like watching the AI's internal state, attempting to shape the direction of its learning, watching carefully for pitfalls. And I expect that treating the AI more as a black box and focusing on containment isn't going to be remotely safe enough.

On A List of Lethalities

I think there are far easier ways out of the box than that. Especially so if you have that detailed a model of the human's mind, but even without. I think Eliezer wouldn't be handicapped if not allowed to use that strategy. (Also fwiw, that strategy wouldn't work on me.)

For instance you could hack the human if you knew a lot about their brain. Absent that you could try anything from convincing them that you're a moral patient, promising part of the lightcone with the credible claim that another AGI company will kill everyone otherwise, etc. These ideas of ... (read more)

On A List of Lethalities

I will also add a point re "just do AI alignment math":

Math studies the structures of things. A solution to our AI alignment problem has to be something we can use, in this universe. The structure of this problem is laden with stuff like agents and deception, and in order to derive relevant stuff for us, our AI is going to need to understand all that.

Most of the work of solving AI alignment does not look like proving things that are hard to prove. It involves puzzling over the structure of agents trying to build agents, and trying to find a promising angle... (read more)

On A List of Lethalities

By the point your AI can design, say, working nanotech, I'd expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I'd also expect it to be able to build models of its operators and conceive of deep strategies involving them.

Also, convincing your operators to let you out of the box is something Eliezer can purportedly do, and seems much easier than being able to solve alignment. I doubt that anything that could write that alignment textbook has a non-dangerous level of capability.

So I'm suspicious that your region exists... (read more)

By the point your AI can design, say, working nanotech, I'd expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I'd also expect it to be able to build models of its operators and conceive of deep strategies involving them.

This assumes the AI learns all of these tasks at the same time. I'm hopeful that we could build a narrowly superhuman task AI which is capable of e.g. designing nanotech while being at or below human level for the other tasks you mentioned (and ~all other dangerous tasks you didn't).

Superhuman ability at nanotech alone may be sufficient for carrying out a pivotal act, though maybe not sufficient for other relevant strategic concerns.

Razied (2mo): [comment collapsed, score -5]

I will also add a point re "just do AI alignment math":

Math studies the structures of things. A solution to our AI alignment problem has to be something we can use, in this universe. The structure of this problem is laden with stuff like agents and deception, and in order to derive relevant stuff for us, our AI is going to need to understand all that.

Most of the work of solving AI alignment does not look like proving things that are hard to prove. It involves puzzling over the structure of agents trying to build agents, and trying to find a promising angle... (read more)

On A List of Lethalities

Thanks for writing this! I appreciate hearing how all this stuff reads to you.

I'm writing this comment to push back about current interpretability work being relevant to the lethal stuff that comes later, à la:

I have heard claims that interpretability is making progress, that we have some idea about some giant otherwise inscrutable matrices and that this knowledge is improving over time.

What I've seen folks understand so far are parts of perception in image processing neural nets, as well as where certain visual concepts show up in these nets, and more ... (read more)

Why all the fuss about recursive self-improvement?

Do you think that things won't look thresholdy even in a capability regime in which a large actor can work out how to melt all the GPUs?

AGI Ruin: A List of Lethalities

Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.

Re (15), "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously", which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about "fast capability gain seems possible and likely", and isn't a... (read more)

AGI Ruin: A List of Lethalities

Eliezer's post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you'd see elsewhere, and it's frank on this point.

I think Eliezer wishes these sorts of artifacts (like this post and "There is no fire alarm") weren't just things that he wrote.

Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.

Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.

Re (15), "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously", which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about "fast capability gain seems possible and likely", and isn't a... (read more)

AGI Ruin: A List of Lethalities

maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.

I should say that I do see this as a reasonable path forward! But we don't seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks.

Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered patter... (read more)

David Johnston (2mo):
I strongly disagree. Gain of function research happens, but it's rare because people know it's not safe. To put it mildly, I think reducing the number of dangerous experiments substantially improves the odds of no disaster happening over any given time frame
AGI Ruin: A List of Lethalities

Can you visualize an agent that is not "open-ended" in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs?

In my picture most of the extra sauce you'd need on top of GPT-3 looks very agenty. It seems tricky to name "virtual worlds" in which AIs manipulate just "virtual resources" and still manage to do something like melting the GPUs.

ESRogs (2mo):
FWIW, I'm not sold on the idea of taking a single pivotal act. But, engaging with what I think is the real substance of the question: can we do complex, real-world, superhuman things with non-agent-y systems? Yes, I think we can! Just as current language models can be prompt-programmed into solving arithmetic word problems, I think a future system could be led to generate a GPU-melting plan, without it needing to be a utility-maximizing agent. For a very hand-wavy sketch of how that might go, consider asking GPT-N to generate 1000s of candidate high-level plans, then rate them by feasibility, then break each plan into steps and re-evaluate, etc.

Or, alternatively, imagine the cognitive steps you might take if you were trying to come up with a GPU-melting plan (or a pivotal act plan in general). Do any of those steps really require that you have a utility function or that you're a goal-directed agent? It seems to me that we need some form of search, and discrimination and optimization. But not necessarily any more than GPT-3 already has. (It would just need to be better at the search. And we'd need to make many many passes through the network to complete all the cognitive steps.)

On your view, what am I missing here?
  • Is GPT-3 already more of an agent than I realize? (If so, is it dangerous?)
  • Will GPT-N by default be more of an agent than GPT-3?
  • Are our own thought processes making use of goal-directedness more than I realize?
  • Will prompt-programming passive systems hit a wall somewhere?
  • If so, what are some of the simplest cognitive tasks that we can do that you think such systems wouldn't be able to do?
  • (See also my similar question here [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=iqwxMcpxeWG4Sk65h].)
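For concreteness, here is a minimal sketch of the generate/rate/refine loop ESRogs gestures at. It is purely illustrative: the llm(prompt) helper is a hypothetical stand-in for a passive text predictor, not a real API, and the prompts and numbers are invented.

def llm(prompt: str) -> str:
    # Hypothetical stand-in for a GPT-N-style passive text predictor.
    raise NotImplementedError

def propose_plans(task: str, n: int = 1000) -> list[str]:
    # Step 1: generate many candidate high-level plans.
    return [llm(f"Propose a high-level plan for: {task}") for _ in range(n)]

def rate_feasibility(plan: str) -> float:
    # Step 2: ask the same predictor to score each candidate.
    return float(llm(f"Rate the feasibility of this plan from 0 to 10:\n{plan}"))

def refine(plan: str) -> list[str]:
    # Step 3: break the best plans into steps and re-evaluate each one.
    steps = llm(f"Break this plan into concrete steps:\n{plan}").splitlines()
    return [llm(f"Re-evaluate and improve this step:\n{step}") for step in steps]

def best_plans(task: str, keep: int = 10) -> list[list[str]]:
    ranked = sorted(propose_plans(task), key=rate_feasibility, reverse=True)
    return [refine(plan) for plan in ranked[:keep]]

On this sketch, the search, discrimination, and optimization all live in the outer loop and the prompts rather than in any persistent goal-pursuing module, which is the property ESRogs is pointing at.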
James Payor (2mo):
I should say that I do see this as a reasonable path forward! But we don't seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks.

Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered pattern recognizers is similar to increasing our level of hardware overhang. People will end up using them as components of systems that aren't safe.
Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

I agree with your point that blobs of Bayes net nodes aren't very legible, but I still think neural nets are relevantly a lot less interpretable than that! I think basically all structure that limits how your AI does its thinking is helpful for alignment, and that neural nets are pessimal on this axis.

In particular, an AI system based on a big Bayes net can generate its outputs in a fairly constrained and structured way, using some sort of inference algorithm that tries to synthesize all the local constraints. A neural net lacks this structure, and is ther... (read more)

johnswentworth (2mo):
I'd crystallize the argument here as something like: suppose we're analyzing a neural net doing inference, and we find that its internal computation is implementing <algorithm> for Bayesian inference on <big Bayes net>. That would be a huge amount of interpretability progress, even though the "big Bayes net" part is still pretty uninterpretable. When we use Bayes nets directly, we get that kind of step for free.

...

I think that's a decent argument, and I at least partially buy it. That said, if we compare a neural net directly to a Bayes net (as opposed to inference-on-a-Bayes-net), they have basically the same structure: both are circuits. Both constrain locality of computation.
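As a toy illustration of the structural difference being discussed (an invented example, not anything from the thread): in the Bayes-net version below, the output is produced by an explicit inference step over named local factors, while the neural-net version produces the same kind of output from unlabeled weight matrices.

import numpy as np

# Style 1: a two-variable Bayes net (Rain -> WetGrass) with explicit local structure.
P_rain = np.array([0.8, 0.2])                      # P(Rain = no/yes)
P_wet_given_rain = np.array([[0.9, 0.1],           # P(Wet | Rain = no)
                             [0.2, 0.8]])          # P(Wet | Rain = yes)

def infer_rain_given_wet(wet: int) -> np.ndarray:
    # Exact inference by enumeration: each step is a named, local operation.
    joint = P_rain * P_wet_given_rain[:, wet]      # P(Rain, Wet = wet)
    return joint / joint.sum()                     # P(Rain | Wet = wet)

# Style 2: a small neural net answering the same query.
# The computation is just two weight matrices; nothing labels what the entries mean.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 1)), rng.normal(size=(2, 4))

def net(wet: float) -> np.ndarray:
    h = np.tanh(W1 @ np.array([wet]))
    logits = W2 @ h
    return np.exp(logits) / np.exp(logits).sum()

print(infer_rain_given_wet(1))   # a posterior over the named variable Rain
print(net(1.0))                  # same shape of output, but the internals are unlabeled

Finding that the second computation was secretly implementing the first would be the kind of interpretability progress johnswentworth describes getting "for free" in the Bayes-net case.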
The Speed + Simplicity Prior is probably anti-deceptive

And a final note: none of that seems to matter for my main complaint, which is that the argument in the post seems to rely on factoring "mesaoptimizer" as "stuff + another mesaoptimizer"?

If so, I can't really update on the results of the argument.

Yonadav Shavit (3mo):
I don't think it relies on this, but I'm not sure where we're not seeing eye to eye. You don't literally need to be able to factorize out the mesaoptimizer - but insofar as there is some minimum space needed to implement any sort of mesaoptimizer (with heuristics or otherwise), this argument applies to a mesaoptimizer of whatever that size is, whether its tendency is to optimize a valid proxy or to deceptively optimize a proxy to secretly achieve something completely different.
The Speed + Simplicity Prior is probably anti-deceptive

A longer reply on the points about heuristic mesaobjectives and the switch:

I will first note here that I'm not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what's going to happen as we fumble toward AGI.

But putting that aside, and noting that my language is imprecise and confused, here is how I think about the "switch" from directly to deceptively pursuing your training objective:

  1. "Pursuing objective X" is an abstraction we use to think about an agent that
... (read more)
Yonadav Shavit (3mo):
I think (3) is not the same as my definition of deception. There are two independent concepts from the Xu post: "deceptive misaligned mesaoptimizers" and "nondeceptive misaligned mesaoptimizers". (3) seems to be describing ordinary misaligned mesaoptimizers (whose proxies no longer generalize on the test distribution). I think an agent that you train to keep your diamond safe that learns you're judging it from cameras may indeed take actions to fool the cameras, but I don't think it will secretly optimize some other objective while it's doing that. I agree my argument doesn't apply to this example.
The Speed + Simplicity Prior is probably anti-deceptive

Two quick things to say:

(1) I think the traditional story is more that your agent pursues mostly-X while it's dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.

(2) I'm mainly arguing that your post isn't correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient-descent would actually look like, which matters lots for what theoretical models apply in reality.

Yonadav Shavit (3mo):
I very much do not believe that a mesaoptimizer found by gradient descent would look anything like the above Python programs. I'm just using this as a simplification to try and get at trends that I think it represents.

Re: (1) my argument is exactly whether gradient descent would summon an agent with a weird pseudogoal Y that was not itself a proxy for reward on its training distribution. If pursuing Y directly (which is different from the base optimizer goal, e.g. Z)

I'm realizing some of the confusion might be because I named the goal-finding function "get_base_obj" instead of "get_proxy_for_base_obj". That seems like it would definitely mislead people, I'll fix that.
The Speed + Simplicity Prior is probably anti-deceptive

I think a contentious assumption you're making with this model is the value-neutral core of mesaoptimizer cognition, namely your mesaoptimize in the pseudocode. I think that our whole problem in practice is roughly that we don't know how to gradient-descend our way toward general cognitive primitives that have goals factored out.

A different way to point at my perceived issue: the mesaoptimizers are built out of a mesaoptimize primitive, which is itself a mesaoptimizer that has to be learnt. This seems to me to be not well-founded, and not actually factoring a mesaoptimizer into smaller parts.

Yonadav Shavit (3mo):
I think my argument only gets stronger if you assume that the mesaobjective is a large pile of heuristics built into the mesaoptimization algorithm, since that takes up much more space.

In the traditional deceptive mesaoptimization story, the model needs to at some point switch from "pursuing objective X directly" to "pursuing objective Y indirectly by deceptively pursuing objective X". I agree that, if there isn't really a core "mesaoptimizer" that can have goals swapped out, the idea of seamlessly transitioning between the two is very unlikely, since you initially lack the heuristics for "pursuing objective Y".

I'm not sure whether you're arguing that my post fails to imply the speed prior disincentivizes deceptive mesaoptimization, or whether you're arguing that deceptive mesaoptimization isn't likely in the first place.
Preregistration: Air Conditioner Test

Will your improvised intake tube cause your room to become positive pressure? It sounds like your "two hose" AC setup will pump in outside air, split it into a hot and cold stream, then dump the hot outside and the cold inside. If so you're not replicating two-hose efficiency, since you'll be pushing cold air out of your room!
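A rough airflow bookkeeping sketch of the worry (the flow numbers are invented; only the sign of the net flow matters):

intake_from_outside = 300.0   # cfm pulled in through the improvised intake tube (assumed)
hot_exhaust_outside = 200.0   # cfm of hot condenser air pushed back outside (assumed)

net_inflow = intake_from_outside - hot_exhaust_outside  # cold air added to the room
# A positive net inflow pressurizes the room, so already-cooled room air leaks back out
# through cracks. A true two-hose unit instead keeps the outside-air loop separate and
# recirculates room air through the cold side, for roughly zero net flow.
print(f"net inflow: {net_inflow:+.0f} cfm -> room goes positive-pressure")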

johnswentworth (4mo):
My particular AC has two separate intake vents, so this shouldn't be an issue.
Don't die with dignity; instead play to your outs

Even given the clarifications in this post, "playing to your outs" seems like it comes with some subtle denial. My sense is that it's resisting a pending emotional update?

I'm curious if this resonates.

Maybe we should keep the "Dying" part but ditch "with Dignity" (because "Dying with Dignity" sounds like giving up and peacefully resigning yourself).

Dying with Iron Integrity, Sane Strategizing, Courageous Calibration, and Obstinate Optimization

or DISCO (Dying with Integrity, Sanity, Courage, and Optimization) for short.

It's a great answer, but I think folks are turning up to defend the Good Hearts from being Goodharted

Dual use of artificial-intelligence-powered drug discovery

I see your point as warning against approaches that are like "get the AI entangled with stuff about humans and hope that helps".

There are other approaches with a goal more like "make it possible for the humans to steer the thing and have scalable oversight over what's happening".

So my alternative take is: a solution to AI alignment should include the ability for the developers to notice if the utility function is borked by a minus sign!

And if you wouldn't notice something as wrong as a minus sign, you're probably in trouble about noticing other misalignment.

A Longlist of Theories of Impact for Interpretability

If the field of ML shifts towards having a better understanding of models ...

I think this would be a negative outcome, and not a positive one.

Specifically, I think it means faster capabilities progress, since ML folks might run better experiments. Or worse yet, they might better identify and remove bottlenecks on model performance.

We're already in AI takeoff

I think your pushback is ignoring an important point. One major thing the big contributors have in common is that they tend to be unplugged from the stuff Valentine is naming!

So even if folks mostly don't become contributors by asking "how can I come more truthfully from myself and not what I'm plugged into", I think there is an important cluster of mysteries here. Examples of related phenomena:

  • Why has it worked out that just about everyone who claims to take AGI seriously is also vehement about publishing every secret they discover?
  • Why do we fear an AI
... (read more)
RFC WWIII

A recommendation: buy a HEPA filter, and also some P100 masks. I imagine these may help a bunch in a "shelter from fallout" scenario. I hear HEPA filtration was originally invented to get radioactive contaminants out of the air.

rhollerith_dot_com (5mo):
Yes, HEPA was in fact invented to get radioactive contaminants out of the air, or so I heard, but they are unneeded for protection from fallout because most fallout (by mass, not by particle count) consists of particles about the size of peas, which of course do not stay suspended in the air. Of course things like N-95 masks are very useful in emergencies for other reasons.
Dagon (5mo):
+1. General emergency preparedness (keeping a few weeks of food/medicine, having a generator and fuel, knowing first aid, etc.) is low-hanging fruit, and pays off in a HUGE range of scenarios.
Mechanism design / queueing theory for government to sell visas

This answer doesn't come with a strong epistemic status on my part, but seems like it has the components needed to work, perhaps with different numbers though.

In this answer we're trying to equitably split gains-from-trade, and not capture every drop we can :) And we don't handle the case without resale.

Mechanism design / queueing theory for government to sell visas

Auction off visas once a year, price them at the median of the winning bids, and allow the winners to resell them for at most 10% more.
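A minimal sketch of that mechanism (the bid numbers and quota are invented; the 10% cap is the one proposed above):

def run_visa_auction(bids: list[float], quota: int) -> tuple[list[float], float, float]:
    winners = sorted(bids, reverse=True)[:quota]          # highest bidders win
    ordered = sorted(winners)
    mid = len(ordered) // 2
    price = (ordered[mid] if len(ordered) % 2
             else (ordered[mid - 1] + ordered[mid]) / 2)  # median of the winning bids
    resale_cap = price * 11 / 10                          # resale allowed at up to +10%
    return winners, price, resale_cap

winners, price, cap = run_visa_auction([120_000, 95_000, 80_000, 40_000, 30_000], quota=3)
print(price, cap)   # -> 95000, 104500.0 with these made-up bids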

James Payor (6mo):
This answer doesn't come with a strong epistemic status on my part, but seems like it has the components needed to work, perhaps with different numbers though. In this answer we're trying to equitably split gains-from-trade, and not capture every drop we can :) And we don't handle the case without resale.
Impossibility results for unbounded utilities

I spent some time trying to fight these results, but have failed!

Specifically, my intuition said we should just be able to look at the flattened distributions-over-outcomes. Then obviously the rewriting makes no difference, and the question is whether we can still provide a reasonable decision criterion when the probabilities and utilities don't line up exactly. To do so we need some defined order or limiting process for comparing these infinite lotteries.

My thought was to use something like "choose the lottery whose samples look better". For instance, exa... (read more)

Entropy isn't sufficient to measure password strength

Well, the counterexample I have in mind is: flip a coin until it comes up tails, your password is the number of heads you got in a row.

While you could rejection-sample this to e.g. a uniform distribution on the numbers 1..n, this would take on the order of 2^n samples, and isn't very feasible. (As you need ~2^n work to get log(n) bits of strength!)

I do agree that your trick works in all reasonable cases, whenever it's not hard to reach different possible passwords.
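To make that cost concrete, here is my own working under the reading that we keep only the first n outcomes (k heads with probability 2^-(k+1)) and reject-sample them down to a uniform distribution:

def expected_samples_to_uniform(n: int) -> float:
    # Accept outcome k < n with probability 2^(k+1) / 2^n, i.e. proportional to 1/P(k),
    # so that every accepted outcome ends up equally likely.
    p_accept = sum(2.0 ** -(k + 1) * 2.0 ** (k + 1 - n) for k in range(n))  # = n / 2^n
    return 1.0 / p_accept                                                    # = 2^n / n

for n in (8, 16, 32):
    print(n, expected_samples_to_uniform(n))
# n=8: ~32 coin-flip runs for 3 bits; n=16: ~4096 for 4 bits; n=32: ~1.3e8 for 5 bits

The work grows exponentially while the strength of the resulting uniform password grows only logarithmically, which is the sense in which the scheme can't be cheaply "fixed".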

Entropy isn't sufficient to measure password strength

Nice!

I note you do at least get a partial ordering here, where some schemes give the adversary a lower cumulative probability of success than others at every number of guesses.

This should be similar to (perhaps more fine-grained than, idk) the min-entropy approach. But I haven't thought about it :)

Entropy isn't sufficient to measure password strength

I note also that I could "fix" your roll-a-D20 password-generating procedure by rejecting samples until I get something the adversary assigns low probability to.

This won't work in general though...

Maximum_Skull (7mo):
It actually would, as long as you reject a candidate password with probability proportional to its relative frequency. "password" in the above example would be almost certainly rejected, as it's wildly more common than one of those 1000-character passwords.
Entropy isn't sufficient to measure password strength

Good points! My principled take is that you want to minimize your adversary's success probability, as a function of the number of guesses they take.

In the usual case where guessing some wrong password X does nothing to narrow their search other than tell them "the password isn't X", the best the adversary can do is spend their guesses on passwords in order of decreasing likelihood. If they know your password-generating procedure, their best chance of succeeding in k tries is the sum of the likelihoods of the k most common passwords.
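A toy illustration (with invented numbers) of why this guess-success curve, rather than Shannon entropy, is the thing to compare:

import math

def success_after_k_guesses(probs: list[float], k: int) -> float:
    # Adversary guesses passwords in decreasing order of likelihood.
    return sum(sorted(probs, reverse=True)[:k])

def shannon_entropy(probs: list[float]) -> float:
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 1024] * 1024                       # 10 bits, flat
skewed = [0.5] + [0.5 / 2 ** 20] * 2 ** 20        # ~11 bits, but half the mass on one password

print(shannon_entropy(uniform), shannon_entropy(skewed))          # 10.0 vs 11.0
print(success_after_k_guesses(uniform, 1), success_after_k_guesses(skewed, 1))  # ~0.001 vs 0.5

The skewed scheme has more Shannon entropy yet falls to a single guess half the time, which is the post's point and why the whole curve (or at least min-entropy) matters.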

James Payor (7mo):
I note also that I could "fix" your roll-a-D20 password-generating procedure by rejecting samples until I get something the adversary assigns low probability to. This won't work in general though...
benwr (7mo):
Yep! I originally had a whole section about this, but cut it because it doesn't actually give you an ordering over schemes unless you also have a distribution over adversary strength, which seems like a big question. If one scheme's min-entropy is higher than another's max-entropy, you know that it's better for any beliefs about adversary strength.
Delta Strain: Fact Dump and Some Policy Takeaways

Also, when I look around, I find charts like these that suggest the claimed false negative rates vary absurdly!

[figure: chart from an article showing claimed PCR false-negative rates varying widely across studies]

j2b (1y):
Where did this chart come from?
Owain_Evans (1y):
I believe FNR depends on swabbing (which varies based on equipment and individual doing it), on PCR equipment, and on the patients (e.g. how early are you testing people? age of patients, etc). Then there's issue of how you get ground-truth which might also contribute to variation in these estimates.
Delta Strain: Fact Dump and Some Policy Takeaways

I once looked into the effectiveness of Australia and New Zealand's quarantine programs to get a sense for this. I think, until recently, basically no infectious cases made it through their 2-week quarantines. Their track records have become more marred since Delta arrived.

For New Zealand, if I recall correctly, basically no community infection clusters were due to quarantine breakthroughs (citation needed!). Of the cases caught with PCR, 80% tested positive on day 3 of quarantine, and the remaining 20% were positive on day 12.

So while some cases might hav... (read more)

Zac Hatfield-Dodds (1y):
In Australia, hotel quarantine has caused one outbreak per 204 infected travellers [https://theconversation.com/hotel-quarantine-causes-1-outbreak-for-every-204-infected-travellers-its-far-from-fit-for-purpose-161815]. Purpose-built facilities are far better, but we only have one (Howard Springs, near Darwin) and the federal government has to date refused to build any more.

Our current Delta outbreaks are traceable to a limo driver who was not - nor required to be - vaccinated or even masked while transferring travellers from their flight to hotel quarantine.

The main source of our success has been massive cuts to the number of travellers we allow in, and that has its own obvious problems...
Delta Strain: Fact Dump and Some Policy Takeaways

I'm pretty confused about how PCR testing can be so bad. Do you have more models/info here you can share?

In particular, I think it might be the case that we've done something like overupdate on poorly-done early Chinese PCR. When I looked for data a while back, I only found the early Wuhan stuff, and the company-backed studies claiming 98% or 99% accuracy, neither of which seem trustworthy...

I currently suspect that PCR tests are effective, at least if the patient has grown enough virus to soon be infectious. I'd like to know if this is true. The main beli... (read more)

In short, I don't really know how it can be as bad as I claim it is. It seems like it should straightforwardly be highly accurate because of your two points: the sensitivity should be at a much lower threshold than the amount needed to infect someone.

Yet, I still believe this. Part of this belief is predicated on the heterogeneous results from studies, which make me think that "default" conditions lead to lots of false negatives and later studies showed much lower false negatives because they adjusted conditions to be more sanitary and less realistic. Howe... (read more)

James Payor (1y):
Also, when I look around, I find charts like these that suggest the claimed false negative rates vary absurdly!
James Payor (1y):
I once looked into the effectiveness of Australia and New Zealand's quarantine programs to get a sense for this. I think, until recently, basically no infectious cases made it through their 2-week quarantines. Their track records have become more marred since Delta arrived.

For New Zealand, if I recall correctly, basically no community infection clusters were due to quarantine breakthroughs (citation needed!). Of the cases caught with PCR, 80% tested positive on day 3 of quarantine, and the remaining 20% were positive on day 12.

So while some cases might have not been detected, it seems like these didn't go on to infect others after the 2 weeks. The people getting tested on day 3 could have been infected on their plane flight, or perhaps some days before their flight. I'd guess the median infection was a week old by day 3 of quarantine.

Anyhow, NZ's numbers seemed to rule out a 50% false-negative rate, because I think their quarantine would have failed if so. They also seem to rule out a 2% false-negative rate, at least for tests done early after infection.
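A back-of-envelope version of that last inference (the arrival count and the two-tests-per-traveller schedule are assumptions of mine, and the tests are treated as independent):

def expected_missed(infected_arrivals: int, fnr: float, tests_per_person: int = 2) -> float:
    # An infected traveller slips through only if every test misses them.
    return infected_arrivals * fnr ** tests_per_person

for fnr in (0.50, 0.10, 0.02):
    print(fnr, expected_missed(infected_arrivals=200, fnr=fnr))
# 0.5 -> ~50 missed cases, hard to square with ~zero observed breakthrough clusters
# 0.1 -> ~2 missed cases; 0.02 -> ~0.08

Separately, the observed 20% of cases caught only on day 12 is itself hard to square with a ~2% false-negative rate for early tests, which is the other bound mentioned above.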
I’m no longer sure that I buy dutch book arguments and this makes me skeptical of the "utility function" abstraction

I have an intuition that the dutch-book arguments still apply in very relevant ways. I mostly want to talk about how maximization appears to be convergent. Let's see how this goes as a comment.

My main point: if you think an intelligent agent forms and pursues instrumental goals, then I think that agent will be doing a lot of maximization inside, and will prefer to not get dutch-booked relative to its instrumental goals.

---

First, an obvious take on the pizza non-transitivity thing.

If I'm that person desiring a slice of pizza, I'm perhaps desiring it because... (read more)

2021 New Year Optimization Puzzles

My best so far on puzzle 1:

Score: 108

This is a variant on  but we get  via , where we implement divide by 2 with sqrt.

Scott Garrabrant (2y):
This comment is the best I can do on puzzle 1. Read it only if you want to be spoiled.
Reality-Revealing and Reality-Masking Puzzles

Having a go at pointing at "reality-masking" puzzles:

There was the example of discovering how to cue your students into signalling they understand the content. I think this is about engaging with a reality-masking puzzle that might show up as "how can I avoid my students probing at my flaws while teaching" or "how can I have my students recommend me as a good tutor" or etc.

It's a puzzle in the sense that it's an aspect of reality you're grappling with. It's reality-masking in that the pressure was away from... (read more)

AI Alignment Open Thread August 2019

It wasn't meant as a reply to a particular thing - mainly I'm flagging this as an AI-risk analogy I like.

On that theme, one thing "we don't know if the nukes will ignite the atmosphere" has in common with AI-risk is that the risk is from reaching new configurations (e.g. temperatures of the sort you get out of a nuclear bomb inside the Earth's atmosphere) that we don't have experience with. Which is an entirely different question than "what happens with the nukes after we don't ignite the atmosphere in a test explosion".

I like thinking about coordination from this viewpoint.

AI Alignment Open Thread August 2019

There is a nuclear analog for accident risk. A quote from Richard Hamming:

Shortly before the first field test (you realize that no small scale experiment can be done—either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmeti
... (read more)
Rohin Shah (3y):
I don't really know what this is meant to imply? Maybe you're answering my question of "did that happen with nukes?", but I don't think an affirmative answer means that the analogy starts to work. I think the nukes-AI analogy is used to argue "people raced to develop nukes despite their downsides, so we should expect the same with AI"; the magnitude/severity of the accident risk is not that relevant to this argument.
Coherent behaviour in the real world is an incoherent concept

First problem with this argument: there are no coherence theorems saying that an agent needs to maintain the same utility function over time.

This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future "you" and merge utility functions, which seems strictly better than not. (Side note: I'm pretty annoyed with all the use of "there's no coherence theorem for X" in this post.)

As a separate note, the "further out" your goal is and th... (read more)

Richard_Ngo (3y):
I agree that this problem is not a particularly important one, and explicitly discard it a few sentences later. I hadn't considered your objection though, and will need to think more about it.

Mind explaining why? Is this more a stylistic preference, or do you think most of them are wrong/irrelevant?

Also true if you make world states temporally extended.
Diagonalization Fixed Point Exercises

Q7 (Python):

Y = lambda s: eval(s)(s)
Y('lambda s: print(f"Y = lambda s: eval(s)(s)\\nY({s!r})")')

Q8 (Python):

Not sure about the interpretation of this one. Here's a way to have it work for any fixed (python function) f:

f = 'lambda s: "\\n".join(s.splitlines()[::-1])'

go = 'lambda s: print(eval(f)(eval(s)(s)))'

eval(go)('lambda src: f"f = {f!r}\\ngo = {go!r}\\neval(go)({src!r})"')

Rationalist Lent

I've recently noticed something about me: attempting to push away or not have an experience actually means pushing away those parts of myself that have that experience.

I then feel an urge to remind readers of a view of Rationalist Lent as an experiment. Don't let this be another way that you look away from what's real for you. But do let it be a way to learn more about what's real for you.

Beta-Beta Testing: Frontpage Rework [Update - further tweak]

Just a PSA: right-clicking or middle-clicking the posts on the frontpage toggles whether the preview is open. Please make them only expand on left clicks, or equivalent!

Raemon (4y):
Bug has now been fixed, apologies both for the bug itself and the delay in getting it fixed.
gjm (4y):
I came here precisely to say this, too. (Middle-clicking does open-in-new-tab. That's also the usual reason for right-clicking a link. In neither case is opening the preview helpful behaviour.)
bryjnar (5y):
Yes, this is very annoying.
Raemon (5y):
Thanks!
Against Instrumental Convergence

Let's go a little meta.

It seems clear that an agent that "maximizes utility" exhibits instrumental convergence. I think we can state a stronger claim: any agent that "plans to reach imagined futures", with some implicit "preferences over futures", exhibits instrumental convergence.

The question then is how much can you weaken the constraint "looks like a utility maximizer", before instrumental convergence breaks? Where is the point in between "formless program" and "selects preferred imagined futur... (read more)

zulupineapple (5y):
Actually, I don't think this is true. For example take a chess playing program which imagines winning, and searches for strategies to reach that goal. Instrumental convergence would assume that the program would resist being turned off, try to get more computational resources, or try to drug/hack the opponent to make them weaker. However, the planning process could easily be restricted to chess moves, where none of these would be found, and thus would not exhibit instrumental convergence.

This sort of "safety measure" isn't very reliable, especially when we're dealing with the real world rather than a game. However it is possible for an agent to be a utility maximizer, or to have some utility maximizing subroutines, and still not exhibit instrumental convergence.

There is another, more philosophical question of what is and isn't a preference over futures. I believe that there can be a brilliant chess player that does not actually prefer winning to losing. But the relevant terms are vague enough that it's a pain to think about them.
Against Instrumental Convergence

Hm, I think an important piece of "intuitionistic proof" didn't transfer, or is broken. Drawing attention to that part:

Regardless of the details of how "decisions" are made, it seems easy for the choice to be one of the massive array of outcomes possible once you have control of the light-cone, made possible by acquiring power.

So here, I realize, I am relying on something like "the AI implicitly moves toward an imagined realizable future". I think that's a lot easier to get than the pipeline you sketch.

I think I... (read more)

zulupineapple (5y):
As I understand, your argument is that there are many dangerous world-states and few safe world-states, therefore most powerful agents would move to a dangerous state, in the spirit of entropy. This seems reasonable.

An alarming version of this argument assumes that the agents already have power, however I think that they don't and that acquiring dangerous amounts of power is hard work and won't happen by accident.

A milder version of the same argument says that even relatively powerless, unaligned agents would slowly and unknowingly inch towards a more dangerous future world-state. This is probably true, however, if humans retain some control, this is probably harmless. And it is also debatable to what extent that sort of probabilistic argument can work on a complex machine.
Against Instrumental Convergence

I think there's an important thing to note, if it doesn't already feel obvious: the concept of instrumental convergence applies to roughly anything that exhibits consequentialist behaviour, i.e. anything that does something like backchaining in its thinking.

Here's my attempt at a poor intuitionistic proof:

If you have some kind of program that understands consequences or backchains or etc, then perhaps it's capable of recognizing that "acquire lots of power" will then let it choose from a much larger set of possibilities. Regar... (read more)

zulupineapple (5y):
Understanding consequences of actions is a reasonable requirement for a machine to be called intelligent, however that implies nothing about the behavior of this machine. A paperclip maker may understand that destroying the earth could yield paperclips, it may not care much for humans, and it may still not do it. There is nothing inconsistent or unlikely about such a machine.

You're thinking of machines that have a pipeline: pick a goal -> find the optimal plan to reach the goal -> implement the plan. However this is not the only possible architecture (though it is appealing). More generally, intelligence is the ability to solve hard problems. If a program solves a problem without explicitly making predictions about the future, that's fine. Though I don't want to make claims about how common such programs would be.

And there is also a problem of recognizing whether a program not written by a human does explicitly consider cause and effect.

Agreed. I guess instrumental convergence mostly applies to AIs that we have to worry about, not all possible minds.