# All of Isaac King's Comments + Replies

Ah, found the story. Wasn't quite as I remembered. (Search for "wrong number".)

Somewhat related; it seems likely that Bing's chatbot is not running on GPT-3 like ChatGPT was, but is running on GPT-4. This could explain its more defensive and consistent personality; it's smarter and has more of a sense of self than ChatGPT ever did.

I don't think that "users active on the site on Petrov day", nor "users who visited the homepage on Petrov day" are good metrics; someone who didn't want to press the button would have no reason to visit the site, and they might have not done so either naturally (because they don't check LW daily) or artificially (because they didn't want to be tempted or didn't want to engage with the exercise.) I expect there are a lot of users who simply don't care about Petrov day, and I think they should still be included in the set of "people who chose not to press t...

Something like that would be much more representative of real defection risks. It's easy to cooperate with people we like; the hard part is cooperating with the outgroup.

(Good luck getting /r/sneerclub to agree to this though, since that itself would require cooperation.)

It's difficult to incentivize people to not press the button, but here's an attempt: If we successfully get through Petrov day without anyone pressing the button (other than the person who has already done so via the bug), I will donate $50 to a charity selected by majority vote.

These are much more creative than mine, good job. I especially liked 8, 12, 27, and 29.

• fast plane and steer up
• rocket ship
• throw it really hard
• extremely light balloon
• wait for an upwards gust of wind
• tall skyscraper
• space elevator
• earthquake energy storage
• really big tsunami
• asteroid impact launch
• wait for the sun to engulf both
• increase mass of earth enough to make moon crash
• elevator pulley system with counterweight
• superman
• rename earth to "the moon"
• take it to a moon replica on earth
• touch it to a moon rock on earth
• really big air rifle
• wait for tectonic drift to make a big enough mountain
• teleporter
• point a particle accelerator upwards
• attach to passing ne ...

There's an experiment (insert obligatory replication crisis disclaimer) where one participant is told to gently poke another participant. The second participant is told to poke the first participant the same amount the first person poked them. It turns out people tend to poke back slightly harder than they were first poked. Repeat. A few iterations later, they are striking each other really hard.

Do you know where I could read this study? I was unable to find it online with keywords like "poking", "escalation", etc.

A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.

I don't find the argument you provide for this point at all compelling; your example mechanism relies entirely on human infrastructure!
Stick an AGI with a visual and audio display in the middle of the wilderness with no humans around and I wouldn't expect it to be able to do anything meaningful with the animals that wander by before it breaks down. Let alone interstellar space.

Ah, so mortality almost always trends downwards except when it jumps species, at which point there can be a discontinuous jump upwards. That makes sense, thank you.

Why is it assumed that diseases evolve towards lower mortality? Every new disease is an evolved form of an old disease, so if that trend were true we'd expect no disease to ever have noticeable mortality.

4JenniferRM10mo
Lower mortality is just generally more efficient, even from the disease perspective. Typhoid Mary [https://en.wikipedia.org/wiki/Mary_Mallon] was a great success for the typhoid, precisely because it didn't take her out or slow her down. Over the long run, most diseases find a balance. The balance is almost never "kill the host really fast". Your gut microbiome is basically made out of a bunch of "infectious diseases" that play nicely with their host. That's the normal thing. Symbiosis is efficient and normal. Depending on how you count, there are 10s or 100s or 1000s of essentially friendly bacteria species [https://www.nature.com/articles/s41396-019-0435-7] in a modern human GI tract. When a bat virus, like SARS or Ebola, jumps to humans, the species it came from often wasn't even really bothered by it any more... but the disease does not start out in evolutionary equilibrium with humans. It is usually only when you have high enough mixing (or a new kind of mixing over barriers of separation that previously kept things separate (or some idiot not inside a BSL5 does GoF research and mixes biological reality with their scary imaginations)) that ancient equilibriums of default symbiosis seriously break down.
Judging by a quick look at Twitter, this is going to be politically polarized right off the bat, with large swaths of the population immediately refusing vaccines or NPIs. So I think whether this turns into a serious pandemic is going to depend largely on the infectiousness of Monkeypox and not all that much else.

I don't think that's what's happening in the situations I'm thinking about, but I'm not sure. Do you have an example dialogue that demonstrates someone taking a belief literally when it obviously wasn't intended that way?

Do you think that conveying my motivation for the question would significantly lower the frequency of miscommunications? If so, why? I tend to avoid that kind of thing because I don't want it to bias the response. If I explain my motivations, then their response is more likely to be one that's trying to affect my behavior rather than convey the most accurate answer. I don't want to be manipulated in that way, so I try to ask questions that people are more likely to answer literally.

From the "interpretation" section of the link I provided:

Truthfulness should be the absolute norm for those who trust in Christ. Our simple yes or no should be completely binding since deception is never an option for us. If an oath is required to convince someone of our honesty or intent to be faithful, it suggests we may not be known for telling the truth in other circumstances. It's likely that the taking of oaths had become a way of manipulating people or allowing wiggle room to get out of some kinds of contracts. James is definite: For those in Christ, dishonesty is never an option.

I travel frequently for my job, and spend >50% of my time away from home. Can any of the existing cryonics organizations handle someone who has about an equal chance of dying in any of the ~200 largest cities in the US and Canada?

What's the conceptual difference between "running a search" and "applying a bunch of rules"?
Whatever rules the cat AI is applying to the image must be implemented by some step-by-step algorithm, and it seems to me like that could probably be represented as running a search over some space. Similarly, you could abstract away the step-by-step understanding of how breadth-first search works and say that the maze AI is applying the rule of "return the shortest path to the red door".

5Rafael Harth1y
Yeah, very good question. The honest answer is that I don't know; I had this distinction in mind when I wrote the post, but pressed with it, I don't know if there's a simple way to capture it. Someone on the AstralCodexTen article just asked the same, and the best I came up with is "the set of possible outputs is very large and contains harmful elements". This would certainly be a necessary criterion; if every output is harmless, the system can't be dangerous. (GPT already fails this.) But even if there is no qualitative step, you can view it as a spectrum of competence, and deceptive/proxy alignment start being a possibility at some point on the spectrum. Not having the crisp characterization doesn't make the dangerous behavior go away.

How could an algorithm know Bob's hypothesis is more complex?

I think this is supposed to be Alice's hypothesis?

I'm having trouble understanding how the maze example is different from the cat example. The maze AI was trained on a set of mazes that had a red door along the shortest path, so it learned to go to those red doors. When it was deployed on a different set of mazes, the goal it had learned didn't match up with the goal its programmers wanted it to have. This seems like the same type of out-of-distribution behavior that you illustrated with the AI that learned to look for white animals rather than cats. You presented the maze AI as different from the cat AI b...

4Rafael Harth1y
(This is the second time someone asks this, so the fault is probably with the post and I should edit it somehow.)
The difference is that the maze AI is running a search. (The classifier isn't; it's just applying a bunch of rules.) This matters because that's where the whole thing gets dangerous. If you get the last part on deceptive and proxy alignment, those concepts only make sense once we're in the business of optimizing, i.e., running a search for actions that score well according to some utility function. In that setting, it makes sense to think of the inner thing as an "optimizer" or "agent" that has goals/wants things/etc.

it might contain over 101000000 candidates

This seems like an oddly specific number; is it supposed to be $10^{1000000}$? If so, why is it such a small space? If the model accepts 24-bit, 1000x1000 pixel images and has to label them all as "cat" or "no cat", there should be $2^{2^{24000000}}$ possible models.

3Rafael Harth1y
Yes it is! This must have happened when I changed the editor to markdown. Thanks. Why is it small? Well, the point of that sentence is that it's infeasible to try them all, so I just made up some large number (that I knew was definitely not too large). I'd say it's pedagogically preferable to avoid double-exponentiation.

I don't know if this answers your question, but they have a technical guide here.

I didn't know this was a thing. Is there a post that explains why it isn't turned on by default? I looked around but couldn't find anything about agreement voting from less than 10 years ago, and none of those directly addressed that question anyway. And are there any other types of voting that are turned off by default?

4Raemon1y
This is an experimental feature we invented ~2 months ago. We've only ever used it on 2-3 threads. You can read about a different experimental voting system here [https://www.lesswrong.com/posts/ywpWMnJmqAkeaDtne/open-thread-jan-2022-vote-experiment]. You can read about a previous use of agreement/disagreement voting here [https://www.lesswrong.com/posts/pQGFeKvjydztpgnsY/occupational-infohazards?commentId=e4x24Sp224NjMFmtm#comments].
While friendly competition can be good in many contexts, I don't think this is one of them. The holiday is about a dedicated team who were willing to die together for their cause. I don't think competing to see who can go the longest without food would really be in the spirit of the holiday. I suspect it would also lead to bad feeling, having to police for cheating, etc.

The framing wasn't an intentional choice; I wasn't considering that aspect when I made the comment. I haven't been privy to any of the off-LW conflict about it, so it wasn't something that I was primed to look out for. I am not suggesting that there should be a community-wide standard (or that there shouldn't be). I intended it as "here's an idea that people may find interesting."

Thoughts on having part of the holiday be "have tasty food easily accessible (perhaps within sight range) during the fast"?

Pros:
• It's in keeping with the original story.
• It can help us see the dangers of having instant gratification available, and let us practice our ability to resist short-term urges for long-term benefits.
• If the goal of rationalist holidays is to help us feel like our own community, then this could help people feel more "special". Many religions have holidays that call for a fast, but as far as I know none of them expect one to tempt thems ...

6nim1y
This conjures a mental image of getting a particularly delicious and delicious-looking dessert and leaving it front and center on the table for the 3 days, not to be cut into until the fast is over. This could fit well with a modified form of food abstinence, such as avoiding all sweet snacks and desserts, for those whose work demands or other circumstances are incompatible with complete fasting. If many people observed in this way, I would imagine a competitive aspect emerging: who can celebrate by not-eating the most tempting-looking Vavilov Day treat?
7Elizabeth1y
It looks like you're framing this as a decision being made by and for a group as a whole (as opposed to an individual observation designed by and for each person themselves). Can you say more on why you believe that's the best frame? My sense is a lot of the off-LW conflict over Vavilov Day boils down to the framing of individual vs. collective decision.

This was probably meant sarcastically, but I do think that having part of the tradition be "have tasty food nearby during the fast" is worth consideration. If the goal of rationalist holidays is to help us feel like a community, then this could make us feel more "special" and perhaps help towards that goal. (Many religions have holidays that call for a fast, but as far as I know none of them expect one to tempt themselves.) It's also a nice display of self-control and the dangers of having instant gratification available. There's value in learning the ability to resist those urges for one's long-term benefit.

1jayterwahl1y
Not sarcastically! I wanted to have a Hard Mode available for those whose fasting was going well. Vavilov et al certainly did it with seeds available.

Well the biggest problem is that it doesn't seem to work. I tested in a 2-player game where we both locked in an answer, but the game didn't progress to the next round. I waited for the timer to run out, but it still didn't progress to the next round, just stayed at 0:00. Changes in my probability are also not visible to the other players until I lock mine in.

A few more minor issues:
• After locking in a probability, there's no indication in the UI that I've done so. I can even press the "lock" button again and get the same popup, despite the fact that it's a ...

2Jozdien1y
Thanks! I'll look into these. Refactoring the entire frontend codebase is probably worth it, considering I wrote it months ago and it's kinda embarrassing to look back at.
Questions about a topic that I don't know about result in me just putting the max entropy distribution on that question, which is fine if it's rare, but leads to unhelpful results if they make up a large proportion of all the questions. Most calibration tests I found pulled from generic trivia categories such as sports, politics, celebrities, science, and geography. I didn't find many that were domain-specific, so that might be a good area to focus on. Some of them don't tell me what the right answers are at the end, or even which questions I got wrong, whi...

This looks super neat, thank you for sharing. I just did a quick test and can confirm that it is in fact riddled with bugs. If it would help, I can write up a list of what needs fixing.

2Jozdien1y
That would be helpful if you have the time, thanks!

Wouldn't an observed mismatch between assigned probability and observed probability count as Bayesian evidence towards miscalibration?

I think you're confusing ignorance with other people's beliefs about that agent's ignorance. In your example of the police or the STD test, there is no benefit gained by that person being ignorant of the information. There is however a benefit of other people thinking the person was ignorant. If someone is able to find out whether they have an STD without anyone else knowing they've had that test, that's only a benefit for them. (Not including the internal cognitive burden of having to explicitly lie.)

An open-ended probability calibration test is something I've been planning to build. I'd be curious to hear your thoughts on how the specifics should be implemented. How should they grade their own test in a way that avoids bias and still gives useful results?

Whether Omega ended up being right or wrong is irrelevant to the problem, since the players only find out if it was right or wrong after all decisions have been made.
It has no bearing on what decision is correct at the time; only our prior probability of whether Omega will be right or wrong matters.

1JBlack1y
It is extremely relevant to the original problem. The whole point is that Omega is known to always be correct. This version weakens that premise, and the whole point of the thought experiment. In particular, note that the second decision was based on a near-certainty that Omega was wrong. There is some ordinarily strong evidence in favour of it, since the agent is apparently in possession of a million dollars with nothing to prevent getting the thousand as well. Is that evidence strong enough to cancel out the previous evidence that Omega is always right? Who knows? There is no quantitative basis given on either side. And that's why this thought experiment is so much weaker and less interesting than the original.

2Vladimir_Nesov2y
If you observe Omega being wrong [https://www.lesswrong.com/posts/psyhmuDhazzFJKjXf/oracle-predictions-don-t-apply-to-non-existent-worlds], that's not the same thing as Omega being wrong in reality, because you might be making observations in a counterfactual. Omega is only stipulated to be a good predictor in reality, not in the counterfactuals generated by Omega's alternative decisions about what to predict. (It might be the right decision principle to expect Omega being correct in the counterfactuals generated by your decisions [https://www.lesswrong.com/posts/psyhmuDhazzFJKjXf/oracle-predictions-don-t-apply-to-non-existent-worlds?commentId=ANme7qLDivbwRGfow], even though it's not required by the problem statement either.)

I think you have to consider what winning means more carefully. A rational agent doesn't buy a lottery ticket because it's a bad bet. If that ticket ends up winning, does that contradict the principle that "rational agents win"?

That doesn't seem at all analogous. At the time they had the opportunity to purchase the ticket, they had no way to know it was going to win.
An Irene who acts like your model of Irene will win slightly more when omega makes an incorrect prediction (she wins the lottery), but will be given the million dollars far less commonly because ...

1Yair Halberstadt2y
I'm showing why a rational agent would not take the 1000 dollars, and that doesn't contradict "rational agents win"

I think you're missing my point. After the $1,000,000 has been taken, Irene doesn't suddenly lose her free will. She's perfectly capable of taking the $1000; she's just decided not to. You seem to think I'm making some claim like "one-boxing is irrational" or "Newcomb's problem is impossible", which is not at all what I'm doing. I'm trying to demonstrate that the idea of "rational agents just do what maximizes their utility and don't worry about having to have a consistent underlying decision theory" appears to result in a contradiction as soon as Irene's decision has been made.

2Yoav Ravid2y
I understood your point. What I'm saying is that Irene is indeed capable of also taking the $1000, but if omega isn't wrong, she only gets the million in cases where for some reason she doesn't (and I gave a few examples). I think your scenario is just too narrow. Sure, if Omega is wrong, and it's not a simulation, and it's a complete one shot, then the rational decision is to then also take the 1000, but if any of these aren't true, then you better find some reason or way not to take those 1000, or you'll never see the million in the first place, or you'll never see them in reality, or you'll never see them in the future.
-1TAG2y
How can you know what maximises your utility without having a sound underlying theory? (But NOT, as I said in my other comment, a sound decision theory. You have to know that free will is real, or whether predictors are impossible. Then you might be able to have a decision theory adequate to the problem.)

Ah, that makes sense.

Some clarifications on my intentions writing this story.

Omega being dead and Irene having taken the money from one box before having the conversation with Rachel are both not relevant to the core problem. I included them as a literary flourish to push people's intuitions towards thinking that Irene should open the second box, similar to what Eliezer was doing here.

Omega was wrong in this scenario, which departs from the traditional Newcomb's problem. I could have written an ending where Rachel made the same arguments and Irene still decided against doing i...

I think you have to consider what winning means more carefully. A rational agent doesn't buy a lottery ticket because it's a bad bet. If that ticket ends up winning, does that contradict the principle that "rational agents win"? An Irene who acts like your model of Irene will win slightly more when omega makes an incorrect prediction (she wins the lottery), but will be given the million dollars far less commonly because Omega is almost always correct. On average she loses. And rational agents win on average. By average I don't mean average within a particular world (repeated iteration), but on average across all possible worlds. Updateless Decision Theory helps you model this kind of thing.
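The "win on average" claim above can be made concrete with a small calculation. The following is only a sketch under stated assumptions: the `accuracy` parameter (the probability that Omega predicts the agent's disposition correctly) and the function name are hypothetical, and the payoffs are the usual $1,000,000 and $1,000 from the story. It is not a model of any particular decision theory.

```python
def expected_payoff(takes_both: bool, accuracy: float,
                    big: int = 1_000_000, small: int = 1_000) -> float:
    """Average payoff for an agent with a fixed disposition.

    accuracy is an assumed probability that Omega predicts the
    agent's disposition correctly (a hypothetical parameter).
    """
    if not takes_both:
        # Agent leaves the $1,000: the million is there only when
        # Omega predicted this correctly.
        return accuracy * big
    # Agent takes the $1,000 whenever it can: a correct prediction
    # means no million, a wrong prediction means both prizes.
    return accuracy * small + (1 - accuracy) * (big + small)


if __name__ == "__main__":
    for accuracy in (0.5, 0.9, 0.99):
        print(accuracy,
              expected_payoff(False, accuracy),
              expected_payoff(True, accuracy))
```

Under these assumptions, the disposition that leaves the $1,000 comes out ahead whenever `accuracy` is above (big + small) / (2 * big), which is 0.5005 for the default payoffs. An Irene who grabs the $1,000 gains only in the rare worlds where Omega is wrong.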
1JBlack2y
Eliezer's alteration of the conditions very much strengthens the prisoner's dilemma. Your alterations very much weaken the original problem in both reducing the strength of evidence for Omega's hidden prediction, and in allowing a second decision after (apparently) receiving a prize.
2TAG2y
I don't see how winning can be defined without making some precise assumptions about the mechanics... How Omega's predictive abilities work, whether you have free will anyway, and so on. Consider trying to determine what the winning strategy is by writing a programme. Why would you expect one decision theory to work in any possible universe?

I just did that to be consistent with the traditional formulation of Newcomb's problem, it's not relevant to the story. I needed some labels for the boxes, and "box A" and "box B" are not very descriptive and make it easy for the reader to forget which is which.

I don't find the simulation argument very compelling. I can conceive of many ways for Omega to arrive at a prediction with high probability of being correct that don't involve a full, particle-by-particle simulation of the actors.

[This comment is no longer endorsed by its author]
Consider the distinction between a low level detailed simulation of a world where you are making a decision, and high level reasoning about your decision making. How would you know which one is being applied to you, from within? If there is a way of knowing that, you can act differently in these scenarios, so that the low level simulation won't show the same outcome as the prediction made with high level reasoning. A good process of making predictions by high level reasoning won't allow there to be a difference. The counterfactual world I'm talking about does not have to exist in any way similar to the real world, such as by being explicitly simulated. It only needs the implied existence of worldbuilding of a fictional story. The difference from a fictional story is that the story is not arbitrary, there is a precise purpose that shapes the story needed for prediction. And for a fictional character, there is no straightforward way of noticing the fictional nature of the world.

In the case where you find yourself holding the $1,000,000 and the $1000 are still available, sure, you can pick them up. That only happens if either Omega failed to predict what you will do, or if you somehow set things up such that you couldn't, or had to pay a big price, to break your precommitment.

I don't think that's true. The traditional Newcomb's problem could use the exact setup that I used here, the only difference would be that either the opaque box is empty, or Irene never opens the transparent box. The idea that the $1000 is always "available" to the player is central to Newcomb's problem.

2Yoav Ravid2y
In my comment "that" in "That only happens if" referred to you taking the $1,000, not to them being available. So to clarify: If we assume that Omega's predictions are perfect, then you only find $1,000,000 in the box in cases where for some reason you don't also take the $1,000:

* Maybe you have some beliefs about why you shouldn't do it
* Maybe it's against your honor to do it
* Maybe you're programmed not to do it
* Maybe before you met Omega you gave a friend $2,000 and told him to give them back to you only if you don't take the $1,000, and otherwise burn them.

If you find yourself going out with the contents of both boxes, either you're in a simulation or Omega was wrong. If Omega is wrong (and it's a one shot, and you know you're not in a simulation) then yeah, you have no reason not to take the $1,000 too. But the less accurate Omega is, the less the problem is newcomblike.

I don't find the simulation argument very compelling. I can conceive of many ways for Omega to arrive at a prediction with high probability of being correct that don't involve a full, particle-by-particle simulation of the actors.

[This comment is no longer endorsed by its author]
5Dagon2y
The underlying question remains the accuracy of the prediction and what sequences of events (if any) can include Omega being incorrect. In the "strong omega" scenarios, the opaque box is empty in all the universes where Irene opens the transparent box (including after Omega's death). Yoav's description seems right to me: Irene opens the opaque box, and is SHOCKED to find it empty, as she only planned to open the one box. But her prediction of her behavior was incorrect, not Omega's prediction. In "weak omega" scenarios, who knows what the specifics are? Maybe Omega's wrong in this case.

making piece

should be

making peace

so it includes both asymptomatic cases

I think that "includes" should be "excludes"?

3Elizabeth2y
You're right, thank you. I fixed it on my blog and thought LW had picked it up but apparently not.

This is an interesting question, but I think your hypothesis is wrong.

Any pattern of physics that eventually exerts control over a region much larger than its initial configuration does so by means of perception, cognition, and action that are recognizably AI-like.

In order to not include things like an exploding supernova as "controlling a region much larger than its initial configuration" we would want to require that such patterns be capable of arranging matter and energy into an arbitrary but low-complexity shape, such as a giant smiley face in Life.

If ...