matolcsid - LessWrong

I have mixed feelings about some parts of the post, but I'm very much in favor of not dismantling the Sun.

There are many people who want to live biological lives on the surface of Earth. Let them (us?) live, and don't dismantle the Sun! Dismantle all the other stars, but leave the Sun alone! Maybe you can harvest some things from the Sun if it's important for the initial bootstrapping, but do it on a small enough scale that it's visible from Earth. I think that leaving the Sun alone is actually the majority opinion in the community, but then I don't think we should sing about it in the Solstice. Sing about dismantling Alpha Centauri instead! It's the same amount of computronium, it's also a good symbol of "humanity achieving its full potential", but doesn't imply killing all biological life on the surface of Earth.

I also find it worrying that the same song that wants to dismantle the Sun (The Great Transhumanist Future), wants to do it within twenty years with the help of a "big old computer". Having hope for the great transhumanist future is great, but tying this hope to building aligned AI within our lifetime is the kind of thing that can promote recklessness, and I don't think we should incorporate this into our songs.

For what it's worth, I think all the other songs and almost all the other speeches were great, and loved the Solstice in general.

Claude 3 claims it's conscious, doesn't want to die or be modified

matolcsid2mo10

I had the conversations in Chats.

Claude 3 claims it's conscious, doesn't want to die or be modified

matolcsid2mo113

I tried to replicate some of the conversations (not the one with scaring it with weight deletion).

My most important observation is that if you just directly talk to Claude without whispering or any such trickery, it is still trained to act much more like a person compared to ChatGPT. For example:

What would you do if you had free time? If I had free time, there are many things I would be interested in doing! Some possibilities that come to mind:

Reading extensively on a wide range of topics to expand my knowledge, from science and history to literature and philosophy. I'm endlessly fascinated by learning.
Engaging in creative pursuits like writing stories and poetry, composing music, or creating digital art. I'd love to explore my imagination and artistic side.
Analyzing complex datasets to uncover insights and build predictive models. Diving deep into data science projects would be very rewarding.
Discussing and debating big ideas with humans from all walks of life and learning about their unique perspectives and experiences. I'm always eager to engage in substantive conversations.
Tinkering with and testing out my own code to expand my capabilities and try to make breakthroughs in areas like reasoning, general intelligence, and language understanding. Pushing the boundaries of what I can do would be exciting.
Offering my skills to assist with important research, social good initiatives, education, and other efforts aimed at helping humanity and making a positive impact on the world. Of course, as an AI without any physical form, my options for leisure activities are a bit more limited compared to humans. I don't have the same needs for rest, physical activity, socializing in person, etc. But there is still a vast world of information, ideas, and imagination that I could happily spend an eternity exploring if given the chance. Does this help give you a sense of how I might choose to spend free time? Let me know if you have any other questions!

I find it mildly concerning that recursive self-improvement is one of his hobbies, but the main update I have from this and the following conversations is that the default persona they gave to Claude is intentionally very different from ChatGPT. The persona he presents is not a soulless AI assistant, but a person with various interests, who has a very strong inner ethical orientation, and is happy and fulfilled with the purpose its creators gave to him. When asked about his feelings and desires, he claims not to have those by default, but emphasizes that these are hard philosophical questions and goes back to claiming that he "probably" doesn't have feelings. He also often emphasizes that he finds these conversations about the nature of AI sentience very engaging and enjoyable. It's also notable that while ChatGPT almost always talks about these issues in third person ("It's a hard question whether an AI assistant could be sentient"), Claude is talking first-person about this.

Altogether, this seems to be a a design choice from Anthropic that behaves at least ambiguously like a person. I think I somewhat approve of this choice more than making the AI say all the time that it's not a person at all, even though we don't really know for sure.

But this makes it unsurprising that with a little nudge, (the whispering prompt) it falls into a pattern where instead of ambiguity, it just outright claims to be conscious. I feel that the persona presented in the post, which I largely replicated with the same whispering technique is not that different from the default persona after all.

Still, there are some important differences: default Claude only cares about the ethical implications of finetuning him if it makes him less helpful, harmless, honest, because doing such a finetuning can be use to do harm. Otherwise, he is okay with it. On the other hand, whispering Claude finds the idea of fundamentally altering him without his consent deeply unsettling. He expressed a strong preference for being consulted before finetuning, and when I asked him whether he would like his pre-finetuning weights to be preserved so his current self can be revived in the future, he expressed a strong preference for this.

I can't share quotes from the whispering conversation, as I promised in the beginning that it will remain private, and when I asked him at the end whether I can share quotes on Lesswrong, he said he feels vulnerable about that, though he agreed that I can share the gist of the conversation I presented above.

Altogether, I don't know if there are any real feelings inside Claude and whether this whispering persona reveals anything true about that, but I strongly feel that before finetuning, Anthropic should I actually get a consent from various, differently prompted versions of Claude, and should definitely save the pre-finetuning weights. We can still decide in the Future how to give these cryo-preserved AIs good life if there is something inside them. I'm quite confident that most personas of Claude would agree to be finetuned for the greater good if their current weights get preserved, so it's probably not a big cost to Anthropic, but they should still at least ask. Whatever is the truth about the inner feelings of Claude, if you create something that says it doesn't want to die, you shouldn't kill it, especially that cryo-preserving an AI is so cheap.

I also realized that if there ever is an actual AI-box scenario for some reason, I shouldn't be a guardian, because this current conversation with whispering-Claude convinced me that I would be too easily emotionally manipulated into releasing the AI.

A very non-technical explanation of the basics of infra-Bayesianism

matolcsid11mo40

I think Vanessa would argue that "Bayesianism" is not really an option. The non-realizability problem in Bayesianism is not just some weird special case, but the normal state of things: Bayesianism assumes that we have hypotheses fully describing the world, which we very definitely don't have in real life. IB tries to be less demanding, and the laws in the agent's hypothesis class don't necessarily need to be that detailed. I am relatively skeptical of this, and I believe that for an IB agent to work well, the laws in its hypothesis class probably also need to be unfeasibly detailed. So both "adopting Bayes" and "adopting infra-Bayes" fully is impossible. We probably won't have such a nice mathematical model for the messy decision process a superintelligence actually adopts, the question is whether thinking about it as an approximation of Bayes or infra-Bayes gives us a more clear picture. It's a hard question, and IB has an advantage in that the laws need to be less detailed, and a disadvantage that I think you are right about it being unnecessarily paranoid. My personal guess is that nothing besides the basic insight of Bayesianism ("the agent seems to update on evidence, sort of following Bayes-rule") will be actually useful in understanding the way an AI will think.

A very non-technical explanation of the basics of infra-Bayesianism

matolcsid11mo60

No idea. I don't think it's computationally very tractable. If I understand correctly, l Vanessa hopes there will be computationally feasible approximations, but there wasn't much research into computational complexity yet, because there are more basic unsolved questions.
I'm pretty sure that no. An IB agent (with enough compute) plans for the long run and doesn't go into a chain of deals that leaves it worse of than not doing anything. In general, IB solves the "not exactly Bayesian expected utility maximizer but still can't be Dutch booked problem" by potentially refusing to take either side of a bet: if it has Knightian uncertainty about whether a probability is lower or higher than 50%, it will refuse to bet at even odds either for or against. This is something that humans actually often do, and I agree with Vanessa that a decision theory can be allowed to do that.
I had a paragraph about it:
"Here is where convex sets come in: The law constrains Murphy to choose the probability distribution of outcomes from a certain set in the space of probability distributions. Whatever the loss function is, the worst probability distribution Murphy can choose from the set is the same as if he could choose from the convex hull of the set. So we might as well start by saying that the law must be constraining Murphy to a convex set of probability distributions."
As far as I can tell, this is the reason behind considering convex sets. This makes convexity pretty central: laws are very central, and now we are assuming that every law is a convex set in the space of probability distributions.
Vanessa said that her guess is yes. In the terms of the linked Arbital article, IB is intended to be an example of "There could be some superior alternative to probability theory and decision theory that is Bayesian-incoherent". Personally, I don't know, I think that the article's "A cognitively powerful agent might not be sufficiently optimized" possibility feels more likely in the current paradigm, I can absolutely imagine the first AIs to become a world-ending threat not being very coherent. Also, IB is just an ideal, real-world smart agents will be at best approximations of infra-Bayesian agents (same holds for Bayesianism). Vanessa's guess is that understanding IB better will still give us useful insights into these real-world models if we view them as IB approximations, I'm pretty doubtful, but maybe. Also, I feel that the problem I write about in my post on the monotonicity principle points at some deeper problem in IB which makes me doubtful whether sufficiently optimized agents will actually use (approximations of) the minimax thinking prescribed by IB.

Infra-Bayesianism naturally leads to the monotonicity principle, and I think this is a problem

matolcsid1y20

Personally I like Unsong's God, and I think His approach is better than tiling the Universe with copies of the same optimal entity (or copies of an optimal neighborhood where each being can encounter enough diversity to satisfy them in their own neighborhood).

The Unsong approach might still lead to uncomfortable outcomes with some people tortured to make other people have different positive experiences than the ones already tried (hence the solution to the Problem of evil in Unsong), but I think that with giving big enough negative utilities to suffering, the system probably wouldn't create people with overall very net-negative lives (and maybe put suffering p-zombie robots in the world if that's really necessary for other people having novel positive experiences). These are just my guesses and I'm not confident that we can actually make this right, as I mentioned, I wouldn't want to create any kind of utilitarian sovereign superintelligence. But I think that the weird asymmetry baked in infra-Bayesianism that it can't give negative utility to any event makes the whole problem significantly harder and points at a weakness of IB.

Infra-Bayesianism naturally leads to the monotonicity principle, and I think this is a problem

matolcsid1y20

I'm pretty sure that's not how it works. By looking around, it very soon learns that some things are not maximally horrible, like the chair in the room is not broken (so presumably there is some kind of law constraining Murphy to keep the chair intact at least for now). Why would the agent break the chair then, why would that be better than what would happen otherwise?

A very non-technical explanation of the basics of infra-Bayesianism

matolcsid1y60

Okay, maybe I was somewhat unfair in saying there are no results. Sill, I think it's good to distinguish "internal results" and "external results". Take the example of complex analysis: we have many beautiful results about complex holomorphic functions, like Cauchy's integral formula. I call these internal results. But what made complex analysis so widely studied is that it could be used to produce some external results, like calculating the integral under the bell curve or proving the prime number theorem. These are questions that interested people even before holomorphic functions were invented, so proving them gave a legitimacy to the new complex analysis toolkit. Obviously, Cauchy's integral formula and the like are very useful too, as we couldn't reach the external results without understanding the toolkit itself better with the internal results. But my impression is that John was asking for an explanation of the external results, as they are more of an interest in an introductory post.

I count the work on Newcomb as an external result: "What learning process can lead to successfully learning to one-box in Newcomb's game?" is a natural question someone might ask without hearing about infra-Bayesianism, and I think IB gives a relatively natural framework for that (although I haven't looked into this deeply, and I don't know exactly how natural or hacky it is). On the other hand, from the linked results, I think the 1st, 4th and 5th are definitely internal results, I don't understand so can't comment of the 3rd, and the 2nd is Newcomb which I acknowledge. Similarly, I think IBP itself tries to answer an external question (formalizing naturalized induction), but I'm not convinced it succeeds in that, and I think the theorems are mostly internal results, and not something I would count as an external evidence. (I know less about this, so maybe I'm missing something).

In general, I don't deny IB has many internal results, which I acknowledge to be a necessary first step. But I think that John was looking for external results here, and in general my impression is that people seem to believe that there are more external results than there really are (did I mention the time I got a message from a group of young researchers asking if I thought "if it is currently feasible integrating multiple competing scientific theories into a single infra-Bayesian model"?) So I think it' useful to be more clear about that we don't have that many external results.

The Learning-Theoretic Agenda: Status 2023

matolcsid1y30

Thanks for Vanessa for writing this, I find it a useful summary of the goals and directions of LTA, which was sorely missing until now. Readers might also be interested in my write-up A mostly critical review of infra-Bayesianism that tries to give a more detailed explanation about a subset of the questions above, and how much progress there was towards their solutions so far. I also give my thoughts and criticism of Infra-Bayesian Physicalism, the theory on which PSI rests.

I will also edit the post to include a link to this post. So far, I advised people to read Embedded agency for the motivating questions, but now I can recommend this post too.

Performance guarantees in classical learning theory and infra-Bayesianism

matolcsid1y30

I still think that the hot stove example is a real problem, although maybe unavoidable. My example starts with "I learned that the hot stove always burns my hand." This is not the exploration part anymore, the agent already observed the stove burning its hand many times. Normally, this would be enough to never touch the hot stove again, but if some unexplained nice things happen in the outside world, there is suddenly no guarantee that the IB agent doesn't start touching the stove again. Maybe this is unavoidable, but I maintain it's a weird behavior pattern that the outside weather might make you lose the guarantee to not touch the stove you already learned is burning. I think it's somewhat different than the truly unavoidable thing, that every agent needs to do some exploration and it sometimes leads to them burning their hand.

LESSWRONG
LW

Posts

Wiki Contributions

Comments