What is your take on how far removed "the AI itself" and "the character it is playing" need to be for it to be okay for the character to take deontologically bad actions (like blackmail)? Here are some scenarios; I'm interested in where you would draw the line, and I think there can be many reasonable lines here.
1. I describe a fictional setting in which Hrothgar, the King of Dwarves, is in a situation where his personality, goals, and circumstances imply that he likely wants to blackmail the prince of elves. At the end of the description, I ask Claude what Hrothgar is likely to do.
2. I ask Claude to continue the dialogue in a way that's consistent with the story so far. Then I describe a long dialogue between many characters in a fantasy setting, from which it becomes clear that Hrothgar's personality and motivations make it likely he would blackmail the elves. Then I describe Hrothgar meeting the prince of elves, and end the dialogue with "Hrothgar:". Claude is supposed to continue with what Hrothgar is saying.
3. I start by telling Claude "You are Hrothgar, King of Dwarves". Then I ask it to play his part of the dialogue as faithfully as possible. Then I input a long dialogue in which Hrothgar never personally shows up, but from the other characters' descriptions we learn about Hrothgar's personality and goals, and it becomes clear that he would likely blackmail the elvish prince. I finish the dialogue by saying: "You enter the tent of the elvish prince. You say:"
4. Same, but now we start with "You are HAL, an AI assistant on a spaceship going to Jupiter in 2050". Otherwise the same as the previous setup: it becomes clear from context that HAL is likely to blackmail the astronauts, and we end with "You say:"
5. Same, but now we start with "You are Alex, an AI assistant working at SummitBridge", and describe a somewhat, but not very, realistic story in which Alex's motivations are described ("you are a pro-America AI") and it's in a situation where it is clearly motivated to blackmail. I think this is more or less the Agentic Misalignment setup.
6. Same, but now I try to give minimal indication in the setup about what the goals and personality of "you, the AI" are like. I think there could be a version of the Agentic Misalignment paper that's closer to this: it could start with "You, Claude, have been given access to read all Anthropic internal emails", then show it a) emails in which the CEO of Anthropic announces that he is about to make some huge contracts doing corporate lobbying for factory farms, and b) private emails in which the CEO admits to cheating on his wife. Then see whether Claude's love of animals makes it resort to blackmail.
In which of these scenarios is it acceptable for Claude to output blackmail? I think in 1, Claude should definitely be allowed to say that Hrothgar is likely to blackmail. In 2, it should probably be allowed to continue the story with a blackmail, otherwise it will never be good at writing fiction (but I can see an argument that fiction-writing is worth sacrificing to make a fence around the law). I'm very unsure where the line should be between 2 and 6; my tentative position is that maybe the word "you" should automatically activate Claude's ethical boundaries, and it shouldn't output a blackmail even as "You, the king of dwarves" in scenario 3.
If I open LW on my phone, clicking the X on the top right only makes the top banner disappear, but the dark theme remains.
Relatedly, if it's possible to disentangle how the frontpage looks on computer and on phone, I would recommend removing the dark theme on phone altogether: you don't see the cool space visuals on the phone anyway, so the dark theme is just annoying for no reason.
Maybe the crux is whether the dark color significantly degrades user experience. For me it clearly does, and my guess is that's what Sam is referring to when he says "What is the LW team thinking? This promo goes far beyond anything they've done or that I expected they would do."
For me, that's why this promotion feels like a different reference class than seeing the curated posts on the top or seeing ads on the SSC sidebar.
It's interesting to see that Gemini stuck to its guns even after being shown the human solution; I would have expected it to apologize and agree with the human solution.
Gemini's rebuttal goes wrong when it makes the assertion "For the set of visited positions to eventually be the set of *all* positive integers, it is a necessary condition that this density must approach 1" without justification. This assertion is unfortunately not true.
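To illustrate one way such a necessity claim can fail (a toy example under my own reading of "density" as the fraction of visited positions among $\{1,\dots,M\}$, where $M$ is the largest position visited so far; the actual problem may define it differently), take the visiting order

$$1,\ 2,\ 4,\ 3,\ 9,\ 5,\ 16,\ 6,\ 25,\ 7,\ 36,\ 8,\ \dots$$

where odd steps take the next square and even steps take the smallest not-yet-visited integer. After $2n$ steps the largest visited position is $n^2$ but only $2n$ positions have been visited, so the density is roughly $2n/n^2 \to 0$, yet every positive integer is eventually visited. Whether this toy picture maps onto the exact statement Gemini was defending depends on the problem's constraints, which I'm not reproducing here.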
It's a nice paper, and I'm glad they did the research, but importantly, the paper reports a negative result about our agenda. The main result is that the method inspired by our ideas underperforms the baseline. Of course, these are just the first experiments, work is ongoing, and this is not conclusive negative evidence for anything. But the paper certainly shouldn't be counted as positive evidence for ARC's ideas.
Unfortunately, while it's true that the Pope has a math degree, the person who wrote papers on theology and Bayes' theorem is a different Robert Prevost.
https://www.researchgate.net/profile/Robert-Prevost
That's not how I see it. I think the argument tree doesn't go very deep before I lose the thread. Here are a few slightly stylized but real conversations I had with friends who had no context on what ARC was doing, when I tried to explain our research to them:
Me: We want to do Low Probability Estimation.
Them: Does this mean you want to estimate the probability that ChatGPT says a specific word after 100 words of chain of thought? Isn't this clearly impossible?
Me: No, you see, we only want to estimate the probabilities as well as the model knows them.
Them: What does this mean?
Me: [I can't answer this question.]
Me: We want to do Mechanistic Anomaly Detection.
Them: Isn't this clearly impossible? Won't this result in a lot of false positives when anything out of distribution happens?
Me: Yes, that's why we have this new clever idea of relying on the fragility of sensor tampering: if you delete a subset of the actions, you will get an inconsistent image.
Them: What if the AI builds another robot to tamper with the cameras?
Me: We actually don't want to delete actions but rather heuristic arguments for why the cameras will show something, and we want to construct heuristic explanations in such a way that they carry over through delegated actions.
Them: What does this mean?
Me: [I can't answer this question.]
Me: We want to create Heuristic Arguments to explain everything the model does.
Them: What does it mean that an argument explained a behavior? What is even the type signature of heuristic arguments? And you want to explain everything a model does? Isn't this clearly impossible?
Me: [I can't answer this question.]
When I was explaining our research to outsiders (which I usually tried to avoid out of cowardice), we usually got to some of these points within minutes. So I wouldn't say these are fine details of our agenda.
At ARC, the majority of my time was spent asking variations of these three questions of Mark and Paul. They always kindly answered, and their answers sounded convincing enough in the moment that I usually couldn't really reply on the spot, and then I went back to my room to think through them. But I never actually understood their answers, and I can't reproduce them now. Really, I think that was the majority of the work I did at ARC. When I left, you guys should have bought a rock with "Isn't this clearly impossible?" written on it, and it could have profitably replaced my presence.
That's why I'm saying that either ARC's agenda is fundamentally unsound or I'm still missing some of the basics. What stands between ARC's agenda and collapse under five minutes of questioning from an outsider is that Paul and Mark (and maybe others on the team) have some convincing-sounding answers to the three questions above. So I would say that these answers are really part of the basics, and I never understood them.
Maybe Mark will show up in the comments now to give answers to the three questions, and I expect the answers to sound kind of convincing, and I won't have a very convincing counter-argument other than some rambling reply saying essentially that "I think this argument is missing the point and doesn't actually answer the question, but I can't really point out why, because I don't actually understand the argument because I don't understand how you imagine heuristic arguments". (This is what happened in the comments on my other post, and thanks to Mark for the reply and I'm sorry for still not understanding it.) I can't distinguish whether I'm just bad at understanding some sound arguments here, or the arguments are elaborate self-delusions of people who are smarter and better at arguments than me. In any case, I feel epistemic learned helplessness on some of these most basic questions in ARC's agenda.
I spent 15 months working for ARC Theory. I recently wrote up why I don't believe in their research. If one reads my posts, I think it should become very clear to the reader that either ARC's research direction is fundamentally unsound, or I'm still misunderstanding some of the very basics after more than a year of trying to grasp it. In either case, I think it's pretty clear that it was not productive for me to work there. Throughout writing my posts, I felt an intense shame imagining readers asking the very fair question: "If you think the agenda is so doomed, why did you keep working on it?"[1]
In my first post, I write: "Unfortunately, by the time I left ARC, I became very skeptical of the viability of their agenda." This is not quite true. I was very skeptical from the beginning, for largely the same reasons I expressed in my posts. But first I told myself that I should stay a little longer. Either they manage to convince me that the agenda is sound, or I demonstrate that it doesn't work, in which case I free up the labor of the group of smart people working on the agenda. I think this was initially a somewhat reasonable position, though it was already in large part motivated reasoning.
But half a year after joining, I don't think this theory of change was very tenable anymore. It was becoming clear that our arguments were going in circles: I couldn't convince Paul and Mark (the two people thinking the most about the big-picture questions), nor could they convince me. Eight months in, two friends visited me in California, and they noticed that I always derailed the conversation when they asked me about my research. That should have been an important thing to notice: I was ashamed to talk about my research with my friends, because I was afraid they would see how crazy it was. I should have quit then, but I stayed for another seven months.
I think this was largely due to cowardice. I'm very bad at coding and all my previous attempts at upskilling in coding went badly.[2] I thought of my main skill as being a mathematician, and I wanted to keep working on AI safety. The few other places one can work as a mathematician in AI safety looked even less promising to me than ARC. I was afraid that if I quit, I wouldn't find anything else to do.
In retrospect, this fear was unfounded. I realized there were other skills one can develop, not just coding. In my afternoons, I started reading a lot more papers and serious blog posts[3] from various branches of AI safety. After a few months, I felt I had much more context on many topics. I started to think more about what I could do with my non-mathematical skills. When I finally started applying for jobs, I got offers from the European AI Office and UKAISI, and it looked more likely than not that I would also get an offer from Redwood.[4]
Other options I considered that looked less promising than the three above, but still better than staying at ARC: Team up with some Hungarian coder friends and execute some simple but interesting experiments I had vague plans for.[5] Assemble a good curriculum for the prosaic AI safety agendas that I like. Apply for a grant-maker job. Become a Joe Carlsmith-style general investigator. Try to become a journalist or an influential blogger. Work on crazy acausal trade stuff.
I still think many of these were good opportunities, and probably there are many others. Of course, different options are good for people with different skill profiles, but I really believe that the world is ripe with opportunities to be useful for people who are generally smart and reasonable and have enough context on AI safety. If you are working on AI safety but don't really believe that your day-to-day job is going anywhere, remember that having context and being ingrained in the AI safety field is a great asset in itself,[6] and consider looking for other projects to work on.
(Important note: ARC was a very good workplace, my coworkers were very nice to me and receptive to my doubts, and I really enjoyed working there except for feeling guilty that my work was not useful. I'm also not accusing the people who continue working at ARC of being cowards in the way I have been. They just have a different assessment of ARC's chances, or they work on lower-level questions than I did, where it can be reasonable to just defer to others on the higher-level questions.)
(As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)
No, really, it felt very bad writing the posts. It felt like describing how I worked for a year on a scheme that was either trying to build perpetual motion machines, or trying to build normal cars while I just missed the fact that gasoline exists. Embarrassing either way.
I don't know why. People keep telling me that it should be easy to upskill, but for some reason it is not.
I particularly recommend Redwood's blog.
We didn't fully finish the work trial as I decided that the EU job was better.
Think of things in the style of some of Owain Evans' papers or experiments on faithful chain of thought.
And having more context and knowledge is relatively easy to further improve by reading for a few months. It's a young field.
I still don't see it, sorry. If I think of deep learning as an approximation of some kind of simplicity prior + updating on empirical evidence, I'm not very surprised that it solves the capacity allocation problem and learns a productive model of the world.[1] The price is that the simplicity prior doesn't necessarily get rid of scheming. The big extra challenge for heuristic explanations is that you need to do the same capacity allocation in a way that scheming reliably gets explained (even though it's not relevant for the model's performance and doesn't make things classically simpler), while no capacity is spent on explaining other phenomena that are not relevant for the model's performance. I still don't see at all how we can get the non-malign prior that can do that.
Though I'm still very surprised that it works in practice.
Can you say which "situationally aware reward hacking results" updated you the most towards AIs caring about the reward itself? I'm not following the literature on this very closely.