This is a special post for quick takes by Olli Järviniemi. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Part 2 - rant on LW culture about how to do research

Yesterday I wrote about my object-level updates resulting from me starting on an alignment program. Here I want to talk about a meta point on LW culture about how to do research.

Note: This is about "how things have affected me", not "what other people have aimed to communicate". I'm not aiming to pass other people's ITTs or present the strongest versions of their arguments. I am rant-y at times. I think that's OK and it is still worth it to put this out.


There's this cluster of thoughts in LW that includes stuff like:

"I figured this stuff out using the null string as input" - Yudkowsky's List of Lethalities

"The safety community currently is mostly bouncing off the hard problems and are spending most of their time working on safe, easy, predictable things that guarantee they’ll be able to publish a paper at the end." - Zvi modeling Yudkowsky

There are worlds where iterative design fails

"Focus on the Hard Part First"

"Alignment is different from usual science in that iterative empirical work doesn't suffice" - a thought that I find in my head.

 

I'm having trouble putting it into words, but there's just something about these memes that's... just anti-helpful for doing research? It's really easy to interpret the comments above as things that I think are bad. (Proof: I have interpreted them in such a way.)

It's this cluster that's kind of suggesting, or at least easily interpreted as saying, "you should sit down and think about how to align a superintelligence", as opposed to doing "normal research".

And for me personally this has resulted in doing nothing, or in doing something only tangentially related to preventing AI doom. I'm actually not capable of just sitting down and deriving a method for aligning a superintelligence from the null string.

(...to which one could respond with "reality doesn't grade on a curve", or that one is "frankly not hopeful about getting real alignment work" out of me, or other such memes.)

Leaving aside whether these things are kind or good for mental health or such, I just think these memes are a bad way of thinking about how research works or how to make progress.

I'm pretty fond of the phrase "standing on the shoulders of giants". Really, people extremely rarely figure stuff out from the ground up or from the null string. The giants are pretty damn large. You should climb on top of them. In the real world, if there's a guide for a skill you want to learn, you read it. I could write a longer rant about the null string thing, but let me leave it here.

About "the safety community currently is mostly bouncing off the hard problems and [...] publish a paper": I'm not sure who "community" refers to. Sure, Ethical and Responsible AI doesn't address AI killing everyone, and sure, publish or perish and all that. This is a different claim from "people should sit down and think how to align a superintelligence". That's the hard problem, and you are supposed to focus on that first, right?

Taking these together, what you get is something that's opposite to what research usually looks like. The null string stuff pushes away from scholarship. The "...that guarantee they'll be able to publish a paper..." stuff pushes away from having well-scoped projects. The talk about iterative designs failing can be interpreted as pushing away from empirical sources of information. Focusing on the hard part first pushes away from learning from relaxations of the problem.

And I don't think the "well alignment is different from science, iterative design and empirical feedback loops don't suffice, so of course the process is different" argument is gonna cut it.

What made me make this update and notice the ways in which LW culture is anti-helpful was seeing how people do alignment research in real life. They actually rely a lot on prior work, improve on it, use empirical sources of information and do stuff that puts us in a marginally better position. Contrary to the memes above, I think this approach is actually quite good.

[edit: pinned to profile]

agreed on all points. and, I think there are kernels of truth from the things you're disagreeing-with-the-implications-of, and those kernels of truth need to be ported to the perspective you're saying they easily are misinterpreted as opposing. something like, how can we test the hard part first?

compare also physics - getting lost doing theory when you can't get data does not have a good track record in physics despite how critically important theory has been in modeling data. but you also have to collect data that weighs on relevant theories so hypotheses can be eliminated and promising theories can be refined. machine learning typically is "make number go up" rather than "model-based" science, in this regard, and I think we do need to be doing model-based science to get enough of the right experiments.

on the object level, I'm excited about ways to test models of agency using things like particle lenia and neural cellular automata. I might even share some hacky work on that at some point if I figure out what it is I even want to test.

Yeah, I definitely grant that there are insights in the things I'm criticizing here. E.g. I was careful to phrase this sentence in this particular way:

The talk about iterative designs failing can be interpreted as pushing away from empirical sources of information.

Because yep, I sure agree with many points in the "Worlds Where Iterative Design Fails". I'm not trying to imply the post's point was "empirical sources of information are bad" or anything. 

(My tone in this post is "here are bad interpretations I've made, watch out for those" instead of "let me refute these misinterpreted versions of other people's arguments and claim I'm right".)

On being able to predictably publish papers as a malign goal, one point is standards of publishability in existing research communities not matching what's useful to publish for this particular problem (which used to be the case more strongly a few years ago). Aiming to publish for example on LessWrong fixes the issue in that case, though you mostly won't get research grants for that. (The other point is that some things shouldn't be published at all.)

In either case, I don't see discouragement from building on existing work, it's not building arguments out of nothing when you also read all the things as you come up with your arguments. Experimental grounding is crucial but not always possible, in which case giving up on the problem and doing something else doesn't help with solving this particular problem, other than as part of the rising tide of basic research that can't be aimed.

Devices and time to fall asleep: a small self-experiment

I did a small self-experiment on the question "Does the use of devices (phone, laptop) in the evening affect the time taken to fall asleep?".

Setup

On each day during the experiment I went to sleep at 23:00. 

At 21:30 I randomized what I'll do at 21:30-22:45. Each of the following three options was equally likely:

  • Read a physical book
  • Read a book on my phone
  • Read a book on my laptop

At 22:45-23:00 I brushed my teeth etc. and did not use devices at this time.

Time taken to fall asleep was measured by a smart watch. (I have not selected it for being good to measure sleep, though.) I had blue light filters on my phone and laptop.
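The nightly randomization can be done in a few lines; this is a sketch of one way to do it (the post doesn't say how the randomization was actually performed):

```python
import random

# The three evening activities, each drawn with equal probability.
ACTIVITIES = ["physical book", "book on phone", "book on laptop"]

def draw_activity(rng: random.Random) -> str:
    """Pick tonight's 21:30-22:45 activity uniformly at random."""
    return rng.choice(ACTIVITIES)

# Draw tonight's condition.
print(draw_activity(random.Random()))
```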

Results

I ran the experiment for n = 17 days (the days were not consecutive, but all took place in a consecutive ~month).

I ended up having 6 days for "phys. book", 6 days for "book on phone" and 5 days for "book on laptop".

On one experiment day (when I read a physical book), my watch reported me as falling asleep at 21:31. I discarded this as a measuring error.

For the resulting 16 days, average times to fall asleep were 5.8 minutes, 21 minutes and 22 minutes, for phys. book, phone and laptop, respectively.

[Raw data:

Phys. book: 0, 0, 2, 5, 22

Phone: 2, 14, 21, 24, 32, 33

Laptop: 0, 6, 10, 27, 66.]
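The group averages can be recomputed from the raw data, and one can also run a rough one-sided permutation test (not in the original experiment) comparing the physical-book nights against all device nights:

```python
from statistics import mean
import random

# Minutes to fall asleep per condition, copied from the raw data above.
phys = [0, 0, 2, 5, 22]
phone = [2, 14, 21, 24, 32, 33]
laptop = [0, 6, 10, 27, 66]

print(f"phys. book: {mean(phys):.1f} min, "
      f"phone: {mean(phone):.1f} min, laptop: {mean(laptop):.1f} min")

# Permutation test: how often does a random relabeling of "physical book"
# vs "device" nights produce a mean difference at least as large as observed?
devices = phone + laptop
observed = mean(devices) - mean(phys)
pooled = phys + devices
rng = random.Random(0)
trials = 10_000
hits = 0
for _ in range(trials):
    rng.shuffle(pooled)
    if mean(pooled[len(phys):]) - mean(pooled[:len(phys)]) >= observed:
        hits += 1
print(f"one-sided p = {hits / trials:.3f}")
```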

Conclusion

The sample size was small (I unfortunately lost the motivation to continue). Nevertheless it gave me quite strong evidence that being on devices indeed does affect sleep.

On premature advice

Here's a pattern I've recognized - all examples are based on real events.

Scenario 1. Starting to exercise

Alice: "I've just started working out again. I've been doing blah for X minutes and then blah blah for Y minutes."

Bob: "You shouldn't exercise like that, you'll injure yourself. Here's what you should be doing instead..."

Result: Alice stops exercising.

Scenario 2. Starting to invest

Alice: "Everyone around me tells me that investing is a good idea, so I'm now going to invest in index funds."

Bob: "You better know what you are doing. Don't invest any money you cannot afford to lose, Past Performance Is No Guarantee of Future Results, also [speculation] so this might not be the best time to invest, also..."

Result: Alice doesn't invest any of her money anywhere.

Scenario 3. Buying lighting

Alice: "My current lighting is quite dim, I'm planning on buying more and better lamps."

Bob: "Lighting is complicated: you have to look at temperatures and color rendering index, make sure to have shades, also ideally you have colder lighting in the morning and warmer in the evening, and..."

Result: Alice doesn't improve her lighting.


I think this pattern, namely overwhelming a beginner with technical nuanced advice (that possibly was not even asked for), is bad, and Bobs shouldn't do that.

An obvious improvement is to not be as discouraging as Bob in the examples above, but it's still tricky to actually make things better instead of demotivating Alice.

When I'm Alice, I often just want to share something I've been thinking about recently, and maybe get some encouragement. Hearing Bob tell me how much I don't know doesn't make me go learn about the topic (that's a fabricated option), it makes me discouraged and possibly give up.

My memories of being Bob are not as easily accessible, but I can guess what it's like. Probably it's "yay, Alice is thinking about something I know about, I can help her!", sliding into "it's fun to talk about subjects I know about" all the way to "you fool, look how much more I know than you". 

What I think Bob should do, and what I'll do when encountering an Alice, is to be more supportive and perhaps encourage them to talk more about the thing they seem to want to talk about. 


I think this is a great point. I appreciate the examples too. I often find it hard to come up with good examples, but at the same time I think good examples are super useful, and these are great examples.


For lifting weights, I personally have settled on just doing bench presses and leg presses because that's what actually triggers enough motivation in me. Other exercises I just don't enjoy nearly as much. I also find it much more motivating when I can get in and out in a half hour. When I was younger I would often have 90+ minute sessions and I'm just not motivated to do that anymore.

I also only go about once every week (or two). Which I'm ok with. There seem to be pretty big diminishing returns when it comes to strength training and I don't want to risk aiming for a 3x/week schedule, failing at it, and ending up going months without doing any strength training at all.


The failure mode you point out seems to me like a Valley of Bad Rationality. Normal people are not automatically strategic and don't jump to thinking about how you could optimize your exercise routine when you tell them you just started exercising. Rationalists are more strategic and probably make this jump too frequently. Being strategic is often a good thing, but here it's probably not.

What I think Bob should do, and what I'll do when encountering an Alice, is to be more supportive and perhaps encourage them to talk more about the thing they seem to want to talk about.

When Bob can't tell the difference, Alice becomes a natural enemy of Carol, who wants pointers on technical details rather than generic encouragement.

For background, see this post on trustworthiness.

Here are two notions of trustworthiness that are useful to distinguish:

  1. On most inputs/contexts, the model isn’t egregiously misaligned. (By egregious misalignment, I mean conspiring against you to cause very bad outcomes.)
  2. During the key moments where it really matters, the model doesn’t take very bad actions.

These notions can come apart substantially. A central example is a model that has never had any serious misaligned thoughts but, when presented with a clear opportunity to escape, takes it. This model would be trusted under the first notion, but not under the second.

Another example is the case where a weaker model is monitoring a stronger model. If during deployment the weak model obtains strong evidence for indeed being deployed (possibly because the strong model aims for this outcome), the weak model might end up not reporting bad actions the strong model takes.[1] This can happen even if the weak model is generally not egregiously misaligned.

I find this distinction useful for two reasons:

First, there are many sorts of evaluations and arguments which imply (1) but not (2). For instance, arguments from the model’s incompetence at theory-of-mind and keeping secrets, or at thinking of and executing cross-context strategies, are strong evidence in favor of (1), but much weaker evidence in favor of (2). Obtaining strong evidence for (2) might require evaluations targeted at that specifically.

Second, whether we care about (1) or (2) depends on the application. Situations where we primarily care about (1) include: the model has produced large quantities of research, we have run capability evaluations on the model, we are forming vague high-level impressions about the model’s psychology based on behavioral evidence. In these cases we typically have redundancy and multiple lines of evidence, and thus sabotage would need to be somewhat consistent. In contrast, we care about (2) when a single action by the model can have large consequences, which is the case at least when the model is monitoring other models.

I like this frame better than my previous ones. Credit to Ryan. For ease of reference, here’s terminology: property (1) is average or low-stakes trustworthiness, and property (2) is worst-case or high-stakes trustworthiness.

  1. ^

    We might be able to ensure the model can’t reliably distinguish evaluations from deployment, so this scenario might be unrealistic. But I want to be extremely clear about whether the argument for safety is “the model won’t know it’s deployed” versus “the model can recognize when it’s deployed, but it still won’t take any bad actions because it’s a good guy”.

Part 3/4 - General takeaways

In my previous two shortform posts I've talked about some object-level belief changes about technical alignment and some meta-level thoughts about how to do research, both which were prompted by starting in an alignment program.

Let me here talk about some takeaways from all this.

(Note: As with previous posts, this is "me writing about my thoughts and experiences in case they are useful to someone", putting in relatively low effort. It's a conscious decision to put these in shortform posts, where they are not shoved to everyone's faces.)

The main point is that I now think it's much more feasible to do useful technical AI safety work than I previously thought. This update is a result of realizing both that the action space is larger than I thought (this is a theme in the object-level post) and that I have been intimidated by the culture in LW (see meta-post).

One day I heard someone saying "I thought AI alignment was about coming up with some smart shit, but it's more like doing a bunch of kinda annoying things". This comment stuck with me.

Let's take a concrete example. Very recently the "Sleeper Agents" paper came out. And I think both of the following are true:

1: This work is really good.

For reasons such as: it provides actual non-zero information about safety techniques and deceptive alignment; it's a clear demonstration of failures of safety techniques; it provides a test case for testing new alignment techniques and lays out the idea "we could come up with more test cases".

2: The work doesn't contain a 200 IQ godly breakthrough idea.

(Before you ask: I'm not belittling the work. See point 1 above.)

Like: There are a lot of motivations for the work. Many of them are intuitive. Many build on previous work. The setup is natural. The used techniques are standard.

The value is in stuff like combining a dozen "obvious" ideas in a suitable way, carefully designing the experiment, properly implementing the experiment, writing it down clearly and, you know, actually showing up and doing the thing.

And yep, one shouldn't hindsight-bias oneself into thinking all of this is obvious. Clearly I myself didn't come up with the idea starting from the null string. I still think that I could contribute to the field producing more things like that. None of the individual steps is that hard - or at least, there exist steps that are not that hard. Many of them are "people who have the competence to do the standard things do the standard things" (or, as someone would say, "do a bunch of kinda annoying things").

I don't think the bottleneck is "coming up with good project ideas". I've heard a lot of project ideas lately. While not all of them are good, in absolute terms many of them are. Turns out that coming up with an idea takes 10 seconds or 1 hour, and then properly executing it requires 10 hours or 1 full-time-equivalent year.

So I actually think that the bottleneck is more about "having people execute the tons of projects the field comes up with", at least much more than I previously thought.

And sure, for individual newcomers it's not trivial to come up with good projects. Realistically one needs (or at least I needed) more than the null string. I'll talk about this more in my final post.

I recently wrote an introduction text about catastrophic risks from AI. It can be found at my website.

A couple of things I thought I did quite well / better than many other texts I've read:[1] 

  • illustrating arguments with lots of concrete examples from empirical research
  • connecting classical conceptual arguments (e.g. instrumental convergence) to actual mechanistic stories of how things go wrong; giving "the complete story" without omissions
  • communicating relatively deep technical ideas relatively accessibly

There's nothing new here for people who have been in the AI circles for a while. I wrote it to address some shortcomings I saw in previous intro texts, and have been linking it to non-experts who want to know why I/others think AI is dangerous. Let me know if you like it!

  1. ^

    In contrast, a couple of things I'm a bit unsatisfied by: in retrospect some of the stories are a bit too conjunctive for my taste, my call to action is weak, and I didn't write much about non-technical aspects as I don't know that much about them.

    Also, I initially wrote the text in Finnish - I'm not aware of other Finnish intro materials - and the English version is a translation.

Part 4/4 - Concluding comments on how to contribute to alignment

In part 1 I talked about object-level belief changes, in part 2 about how to do research and in part 3 about what alignment research looks like.

Let me conclude by saying things that would have been useful for past-me about "how to contribute to alignment". As in past posts, my mode here is "personal musings I felt like writing that might accidentally be useful to others".

So, for me-1-month-ago, the bottleneck was "uh, I don't really know what to work on". Let's talk about that.

First of all, experienced alignment researchers tend to have plenty of ideas. (Come on, me-1-month-ago, don't be surprised.) Did you know that there's this forum where alignment people write out their thoughts?

"But there's so much material there", me-1-month-ago responds.

What kind of excuse is that? Okay, so how research programs work is that you have some mentor and you try to learn stuff from them. You can do a version of this alone as well: just take some researcher you think has good takes and go read their texts.

No, I mean actually read them. I don't mean "skim through the posts", I mean going above and beyond here: printing the text on paper, going through it line by line, flagging considerations you haven't thought of before. Try to actually understand what the author thinks, to understand the worldview that has generated those posts, instead of just going "that claim is true, that one is false, that's true, OK done".

And I don't mean reading just two or three posts by the author. I mean like a dozen or more. Spending hours on reading posts, really taking the time there. This is what turns "characters on a screen" to "actually learning something".

A major part of my first week in my program involved reading posts by Evan Hubinger. I learned a lot. Which is silly: I didn't need to fly to the Bay to access https://www.alignmentforum.org/users/evhub. But, well, I have a printer and some "let's actually do something ok?" attitude here.

Okay, so I still haven't given you a list of Concrete Projects To Work On. The main reason is that going through the process above kind of produces one. You will likely see something promising, something fruitful, something worthwhile. Posts often have "future work" sections. If you really want explicit lists of projects, then you can unsurprisingly find those as well (example). (And while I can't speak for others, my guess is that if you really have understood someone's worldview and you go ask them "is there some project you want me to do?", they just might answer you.)

Me-from-1-month-ago would have had a flinch reaction of "but are these projects Real? do they actually address the core problems?", which is why I wrote my previous three posts. Not that they provide a magic wand that waves away this question; rather, they point out that past-me's standard for what counts as Real Work was unreasonably high.

And yeah, you very well might have thoughts like "why is this post focusing on this instead of..." or "meh, that idea has the issue where...". You know what to do with those.

Good luck!

A frame for thinking about capability evaluations: outer vs. inner evaluations

When people hear the phrase "capability evaluations", I think they are often picturing something very roughly like METR's evaluations, where we test stuff like:

  • Can the AI buy a dress from Amazon?
  • Can the AI solve a sudoku?
  • Can the AI reverse engineer this binary file?
  • Can the AI replicate this ML paper?
  • Can the AI replicate autonomously?

(See more examples at METR's repo of public tasks.)

In contrast, consider the following capabilities:

  • Is the AI situationally aware?
  • Can the AI do out-of-context reasoning?
  • Can the AI do introspection?
  • Can the AI do steganography?
  • Can the AI utilize filler tokens?
  • Can the AI obfuscate its internals?
  • Can the AI gradient hack?

There's a real difference[1] between the first and second categories. Rough attempt at putting it in words: the first one is treating the AI as an indivisible atom and seeing how it can affect the world, whereas the second one treats the AI as having "an inner life" and seeing how it can affect itself.

Hence the name "outer vs. inner evaluations". (I think the alternative name "dualistic vs. embedded evaluations", following dualistic and embedded notions of agency, gets closer at the distinction while being less snappy. Compare also to behavioral vs. cognitive psychology.)

It seems to me that the outer evaluations are better established: we have METR and the labs themselves doing such capability evaluations. There's plenty of work on inner evaluations as well, the difference being that it's more diffuse. (Maybe for good reason: it is tricky to do proper inner evals.)

I've gotten value out of this frame; it helps me not forget inner evals in the context of evaluating model capabilities.

  1. ^

    Another difference is that in outer evals we often are interested in getting the most out of the model by ~any means, whereas with inner evals we might deliberately restrict the model's action space. This difference might be best thought of as a separate axis, though.

Clarifying a confusion around deceptive alignment / scheming

There's a common blurring-the-lines motion related to deceptive alignment that especially non-experts easily fall into.[1]

There is a whole spectrum of "how deceptive/schemy is the model", that includes at least

deception - instrumental deception - alignment-faking - instrumental alignment-faking - scheming.[2]

Especially in casual conversations people tend to conflate things like "someone builds a scaffolded LLM agent that starts to acquire power and resources, deceive humans (including about the agent's aims), self-preserve etc." and "scheming". This is incorrect. While the outlined scenario can count as instrumental alignment-faking, scheming (as a technical term defined by Carlsmith) demands training-gaming, and hence scaffolded LLM agents are out of the scope of the definition.[3]

The main point: when people debate the likelihood of scheming/deceptive alignment, they are NOT talking about whether scaffolded LLM agents will exhibit instrumental deception or such. They are debating whether the training process creates models that "play the training game" (for instrumental reasons).

I think the right mental picture is to think of dynamics of SGD and the training process rather than dynamics of LLM scaffolding and prompting.[4]

Corollaries:

  • This confusion allows for accidental motte-and-bailey dynamics[5]
    • Motte: "scaffolded LLM agents will exhibit power-seeking behavior, including deception about their alignment" (which is what some might call "the AI scheming")
    • Bailey: "power-motivated instrumental training gaming is likely to arise from such-and-such training processes" (which is what the actual technical term of scheming refers to)
  • People disagreeing with the bailey are not necessarily disagreeing about the motte.[6]
  • You can still be worried about the motte (indeed, that is bad as well!) without having to agree with the bailey.

See also: Deceptive AI ≠ Deceptively-aligned AI, which makes very closely related points, and my comment on that post listing a bunch of anti-examples of deceptive alignment.

 

  1. ^

(Source: I've seen this blurring pop up in a couple of conversations, and have earlier fallen into the mistake myself.)

  2. ^

Alignment-faking is basically just "deceiving humans about the AI's alignment specifically". Scheming demands that the model is training-gaming(!) for instrumental reasons. See the very beginning of Carlsmith's report.

  3. ^

    Scheming as an English word is descriptive of the situation, though, and this duplicate meaning of the word probably explains much of the confusion. "Deceptive alignment" suffers from the same issue (and can also be confused for mere alignment-faking, i.e. deception about alignment).

  4. ^

Note also that "there is a hyperspecific prompt you can use to make the model simulate Clippy" is basically separate from scheming: if Clippy-mode doesn't activate during training, Clippy can't training-game, and thus this isn't scheming-as-defined-by-Carlsmith.

    There's more to say about context-dependent vs. context-independent power-seeking malicious behavior, but I won't discuss that here.

  5. ^

    I've found such dynamics in my own thoughts at least.

  6. ^

The motte and the bailey are just very different. And an example: reading Alex Turner's Many Arguments for AI x-risk are wrong, he seems to think deceptive alignment is unlikely while writing "I’m worried about people turning AIs into agentic systems using scaffolding and other tricks, and then instructing the systems to complete large-scale projects."

I've recently started in a research program for alignment. One outcome among many is that my beliefs and models have changed. Here I outline some ideas I've thought about.

The tone of this post is more like

"Here are possibilities and ideas I haven't really realized that exist.  I'm yet uncertain about how important they are, but seems worth thinking about them. (Maybe these points are useful to others as well?)"

than

"These new hypotheses I encountered are definitely right."

This ended up rather long for a shortform post, but I'm still posting it here as it's quite low-effort and probably not of that wide an interest.

 


Insight 1: You can have an aligned model that is neither inner nor outer aligned.

Opposing frame: To solve the alignment problem, we need to solve both outer and inner alignment: choose a loss function whose global optimum is safe, and then have the model actually aim to optimize it.

Current thoughts

A major point of the inner alignment and related texts is that the outer optimization processes are cognition-modifiers, not terminal-values. Why on earth should we limit our cognition-modifiers to things that are kinda like humans' terminal values?

That is: we can choose our loss functions and whatever so that they build good cognition inside the model, even if those loss functions' global optimums were nothing like Human Values Are Satisfied. In fact, this seems like a more promising alignment approach than what the opposing frame suggests!

(What made this click for me was Evan Hubinger's training stories post, in particular the excerpt

It’s worth pointing out how phrasing inner and outer alignment in terms of training stories makes clear what I think was our biggest mistake in formulating that terminology, which is that inner/outer alignment presumes that the right way to build an aligned model is to find an aligned loss function and then have a training goal of finding a model that optimizes for that loss function.

I'd guess that this post makes a similar point somewhere, though I haven't fully read it.)


Insight 2: You probably want to prevent deceptively aligned models from arising in the first place, rather than be able to answer the question "is this model deceptive or not?" in the general case.

Elaboration: I've been thinking "oh man, deceptive alignment and the adversarial set-up make things really hard". I hadn't taken the next step of asking "okay, could we prevent deception from arising in the first place?".

("How could you possibly do that without being able to tell whether models are deceptive or not?", one asks. See here for an answer. My summary: Decompose the model-space into three regions,

A) Verifiably Good models

B) models where the verification throws "false"

C) deceptively aligned models which trick us to thinking they are Verifiably Good.

Aim to find a notion of Verifiably Good such that one gradient-step never makes you go from A to C. Then you just start training from A, and if you end up at B, just undo your updates.)
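Here's a toy cartoon of that training scheme. Everything in it is hypothetical: `verify` stands in for whatever "Verifiably Good" check one actually has (here, an arbitrary toy property), and a "model" is just a list of parameters.

```python
import random

def verify(params):
    """Stand-in for a hypothetical 'Verifiably Good' check.

    In this toy, a model is Verifiably Good iff all parameters lie in [-1, 1];
    a real version would check some transparency-based property."""
    return all(-1.0 <= p <= 1.0 for p in params)

def gradient_step(params, rng):
    """Stand-in for one SGD update (here: a small random perturbation)."""
    return [p + rng.uniform(-0.1, 0.1) for p in params]

def train_with_rollback(params, steps, rng):
    """Start from region A (Verifiably Good) and train; undo any step whose
    result fails verification (region B). By assumption, a single step never
    jumps from A straight to C, so the model stays in A throughout."""
    assert verify(params), "must start from a Verifiably Good model"
    for _ in range(steps):
        candidate = gradient_step(params, rng)
        if verify(candidate):
            params = candidate  # still in region A: keep the update
        # else: landed in region B, discard (roll back) the update
    return params

final = train_with_rollback([0.0] * 4, steps=200, rng=random.Random(0))
```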


Insight 3: There are more approaches to understanding our models / having transparent models than mechanistic interpretability.

Previous thought: We have to do mechanistic interpretability to understand our models!

Current thoughts: Sure, solving mech interp would be great. Still, there are other approaches:

  • Train models to be transparent. (Think: have a term in the loss function for transparency.)
  • Better understand training processes and inductive biases. (See e.g. work on deep double descent, grokking, phase changes, ...)
  • Create architectures that are more transparent by design
  • (Chain-of-thought faithfulness is about making LLMs' thought processes interpretable in natural language)
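As a toy illustration of the first bullet, "a term in the loss function for transparency" might look like the following, where I use activation sparsity as a crude stand-in for a transparency measure (my own choice of proxy, purely for illustration):

```python
def loss_with_transparency_term(task_loss, activations, lam=0.01):
    """Toy combined objective: task loss plus a penalty that (as a crude
    stand-in for 'transparency') pushes activations toward sparsity.
    lam trades off task performance against the transparency proxy."""
    sparsity_penalty = sum(abs(a) for a in activations)
    return task_loss + lam * sparsity_penalty
```

Training against this objective would then favor models that do well on the task while keeping the transparency proxy small.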

Past-me would have objected "those don't give you actual detailed understanding of what's going on". To which I respond: 

"Yeah sure, again, I'm with you on solving mech interp being the holy grail. Still, it seems like using mech interp to conclusively answer 'how likely is deceptive alignment' is actually quite hard, whereas we can get some info about that by understanding e.g. inductive biases."

(I don't intend here to make claims about the feasibility of various approaches.)

 


Insight 4: It's easier for gradient descent to do small updates throughout the net than a large update in one part. 

(Comment by Paul Christiano as understood by me.) At least in the context where this was said, it was a good argument for expecting neural nets to have distributed representations for things (instead of a local "this neuron does X" structure).


Insight 5: You can focus on safe transformative AI vs. safe superintelligence.

Previous thought: "Oh man, lots of alignment ideas I see obviously fail at a sufficiently high capability level"

Current thought: You can reasonably focus on kinda-roughly-human-level AI instead of full superintelligence. Yep, you do want to explicitly think "this won't work for superintelligences for reasons XYZ" and note which assumptions about the capability of the AI your idea relies on. Having done that, you can have plans that aim to use AIs for stuff, and you can have beliefs like

"It seems plausible that in the future we have AIs that can do [cognitive task], while not being capable enough to break [security assumption], and it is important to seize such opportunities if they appear."

Past-me had flinch negative reactions about plans that fail at high capability levels, largely due to finding Yudkowsky's model of alignment compelling. While I think it's useful to instinctively think "okay, this plan will obviously fail once we get models that are capable of X because of Y" to notice the security assumptions, I think I went too far by essentially thinking "anything that fails for a superintelligence is useless". 


Insight 6: The reversal curse bounds the level of reasoning/inference current LLMs do.

(Owain Evans said something that made this click for me.) I think the reversal curse is good news in the sense of "current models probably don't do anything too advanced under-the-hood". I could imagine worlds where LLMs do a lot of advanced, dangerous inference during training - but the reversal curse being a thing is evidence against those worlds (in contrast to more mundane worlds).

I want to point out that insight 6 has less value in the current day, since it looks like the reversal curse turns out to be fixable very simply, and I agree with @gwern that the research was just a bit too half-baked for the extensive discussion it got:

https://www.lesswrong.com/posts/SCqDipWAhZ49JNdmL/paper-llms-trained-on-a-is-b-fail-to-learn-b-is-a#FLzuWQpEmn3hTAtqD

https://www.lesswrong.com/posts/SCqDipWAhZ49JNdmL/paper-llms-trained-on-a-is-b-fail-to-learn-b-is-a#3cAiWvHjEeCffbcof

"Trends are meaningful, individual data points are not"[1]

Claims like "This model gets x% on this benchmark" or "With this prompt this model does X with p probability" are often individually quite meaningless. Is 60% on a benchmark a lot or not? Hard to tell. Is doing X 20% of the time a lot or not? Go figure.

On the other hand, if you have results like "previous models got 10% and 20% on this benchmark, but our model gets 60%", then that sure sounds like something. "With this prompt the model does X with 20% probability, but with this modification the probability drops to <1%" also sounds like something, as does "models will do more/less of Y with model size".

There are some exceptions: maybe your results really are good enough to stand on their own (xkcd), maybe it's interesting that something happens even some of the time (see also this). It's still a good rule of thumb.

  1. ^

    Shoutout to Evan Hubinger for stressing this point to me

Epistemic responsibility

"You are responsible for you having accurate beliefs."

Epistemic responsibility refers to the idea that it is on you to have true beliefs. The concept is motivated by the following two applications.

 

In discussions

Sometimes in discussions people are in a combative "1v1 mode", where they try to convince the other person of their position and defend their own position, in contrast to a cooperative "2v0 mode" where they share their beliefs and try to figure out what's true. See the soldier mindset vs. the scout mindset.

This may be framed in terms of epistemic responsibility: If you accept that "It is (solely) my responsibility that I have accurate beliefs", the conversation naturally becomes less about winning and more about having better beliefs afterwards. That is, a shift from "darn, my conversation partner is so wrong, how do I explain it to them" to "let me see if the other person has valuable points, or if they can explain how I could be wrong about this".

In particular, from this viewpoint it sounds a bit odd if one says the phrase "that doesn't convince me" when presented with an argument, as it's not on the other person to convince you of something. 

 

Note: This doesn't mean that you have to be especially cooperative in the conversation. It is your responsibility that you have true beliefs, not that both of you do. If you end up being less wrong, success. If the other person doesn't, that's on them :-)

 

Trusting experts

There's a question Alice wants to know the answer to. Unfortunately, the question is too difficult for Alice to find out the answer herself. Hence she defers to experts, and ultimately believes what Bob-the-expert says.

Later, it turns out that Bob was wrong. How does Alice react?

A bad reaction is to be angry at Bob and throw rotten tomatoes at him.

Under the epistemic responsibility frame, the proper reaction is "Huh, I trusted the wrong expert. Oops. What went wrong, and how do I better defer to experts next time?"

 

When (not) to use the frame

I find the concept to be useful when revising your own beliefs, as in the above examples of discussions and expert-deferring.

One limitation is that belief-revising often happens via interpersonal communication, whereas epistemic responsibility is individualistic. So while "my aim is to improve my beliefs" is a better starting point for conversations than "my aim is to win", this is still not ideal, and epistemic responsibility is to be used with a sense of cooperativeness or other virtues.

 

Another limitation is that "everyone is responsible for themselves" is a bad norm for a community/society, and this is true of epistemic responsibility as well.

I'd say that the concept of epistemic responsibility is mostly for personal use. I think that especially the strongest versions of epistemic responsibility (heroic epistemic responsibility?), where you are the sole person responsible for you having true beliefs and where any mistakes are your fault, are something you shouldn't demand of others. For example, I feel like a teacher has a lot of epistemic responsibility on behalf of their students (and there are other types of responsibilities going on here).

Or whatever, use it how you want - it's on you to use it properly.

I recently gave a workshop in AI control, for which I created an exercise set.

The exercise set can be found here: https://drive.google.com/file/d/1hmwnQ4qQiC5j19yYJ2wbeEjcHO2g4z-G/view?usp=sharing

The PDF is self-contained, but three additional points:

  • I assumed no familiarity with AI control from the audience. Accordingly, the exercises target newcomers and are about the basics.
    • If you want to get into AI control, and like exercises, consider doing these.
    • Conversely, if you are already pretty familiar with control, I don't expect you'll get much out of these exercises. (A good fraction of the problems are about re-deriving things that already appear in AI control papers etc., so if you already know those, it's pretty pointless.)
  • I felt like some of the exercises weren't that good, and am not satisfied with my answers to some of them - I spent a limited time on this. I thought it's worth sharing the set anyways.
    • (I compensated by highlighting problems that were relatively good, and by flagging the answers I thought were weak; the rest is on the reader.)
  • I largely focused on monitoring schemes, but don't interpret this as meaning there's nothing else to AI control.

You can send feedback by messaging me or anonymously here.

I was expecting some math. Maybe something about the expected amount of work you can get out of an AI before it coups you, if you assume the number of actions required to coup is n, the trusted monitor has false positive rate p, etc?
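One toy version of the kind of calculation being gestured at here (my own sketch, with simplistic independence assumptions, and hypothetical function names): if each malicious action is flagged independently with probability d, then a coup requiring n such actions evades the monitor with probability (1 − d)^n, and the number of malicious actions before the first flag is geometric.

```python
def coup_evasion_probability(n, d):
    """Toy model: probability that a coup requiring n malicious actions
    evades a monitor that flags each one independently with probability d."""
    return (1 - d) ** n

def expected_actions_before_flag(d):
    """Toy model: expected number of malicious actions taken before the
    first flag (mean of a geometric distribution of 'failures')."""
    return (1 - d) / d
```

E.g. even a monitor with only a 10% per-action detection rate catches a 50-step coup with probability 1 − 0.9^50 ≈ 0.995.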

In praise of prompting

(Content: I say obvious beginner things about how prompting LLMs is really flexible, correcting my previous preconceptions.)

 

I've been doing some of my first ML experiments in the last couple of months. Here I outline the thing that surprised me the most:

Prompting is both lightweight and really powerful.

As context, my preconception of ML research was something like "to do an ML experiment you need to train a large neural network from scratch, doing hyperparameter optimization, maybe doing some RL and getting frustrated when things just break for no reason".[1]

And, uh, no.[2] I now think my preconception was really wrong. Let me say some things that me-three-months-ago would have benefited from hearing.

 

When I say that prompting is "lightweight", I mean that both in absolute terms (you can just type text in a text field) and relative terms (compare to e.g. RL).[3] And sure, I have done plenty of coding recently, but the coding has been basically just to streamline prompting (automatically sampling through API, moving data from one place to another, handling large prompts etc.) rather than ML-specific programming. This isn't hard, just basic Python.

When I say that prompting is "really powerful", I mean a couple of things.

First, "prompting" basically just means "we don't modify the model's weights", which, stated like that, actually covers quite a bit of surface area. Concrete things one can do: few-shot examples, best-of-N, looking at trajectories when you have multiple turns in the prompt, constructing simulation settings and seeing how the model behaves, etc.

Second, suitable prompting actually lets you get effects quite close to supervised fine-tuning or reinforcement learning(!) Let me explain:

Imagine that I want to train my LLM to be very good at, say, collecting gold coins in mazes. So I create some data. And then what?

My cached thought was "do reinforcement learning". But actually, you can just do "poor man's RL": you sample the model a few times, take the action that led to the most gold coins, supervised fine-tune the model on that and repeat.

So, just do poor man's RL? Actually, you can just do "very poor man's RL", i.e. prompting: instead of doing supervised fine-tuning on the data, you simply use the data as few-shot examples in your prompt.
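A sketch of what "very poor man's RL" might look like in code (my own toy rendering; `sample` is a hypothetical stand-in for an LLM API call, and `score` counts gold coins or whatever you care about). Instead of fine-tuning on the best-of-N completions, we append them to the prompt as few-shot examples:

```python
import random

def best_of_n(sample, prompt, score, n, rng):
    """Sample n completions and keep the highest-scoring one."""
    completions = [sample(prompt, rng) for _ in range(n)]
    return max(completions, key=score)

def very_poor_mans_rl(sample, task, score, n_examples, n, seed=0):
    """'Very poor man's RL': rather than fine-tuning on the best completions,
    accumulate them in the prompt as few-shot examples."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        prompt = "\n".join(examples + [task])
        best = best_of_n(sample, prompt, score, n, rng)
        examples.append(f"{task} -> {best}")
    return "\n".join(examples + [task])  # final few-shot prompt
```

Swapping the `examples.append(...)` line for a supervised fine-tuning call on the same data would give you the "poor man's RL" version.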

My understanding is that many forms of RL are quite close to poor man's RL (the resulting weight updates are kind of similar), and poor man's RL is quite similar to prompting (intuition being that both condition the model on the provided examples).[4]

As a result, prompting suffices way more often than I've previously thought.

  1. ^

    This negative preconception probably biased me towards inaction.

  2. ^

    Obviously some people do the things I described, I just strongly object to the "need" part.

  3. ^

    Let me also flag that supervised fine-tuning is much easier than I initially thought: you literally just upload the training data file at https://platform.openai.com/finetune

  4. ^

    I admit that I'm not very confident on the claims of this paragraph. This is what I've gotten from Evan Hubinger, who seems more confident on these, and I'm partly just deferring here.

Inspired by the "reward chisels cognition into the agent's network" framing from Reward is not the optimization target, I thought: is reward necessarily a fine enough tool? More elaborately: if you want the model to behave in a specific way or to have certain internal properties, can you achieve this simply by a suitable choice of the reward function?

I looked at two toy cases, namely Q-learning and training a neural network (the latter which is not actually reinforcement learning but supervised learning). The answers were "yep, suitable reward/loss (and datapoints in the case of supervised learning) are enough". 

I was hoping for this not to be the case, as that would have been more interesting (imagine if there were fundamental limitations to the reward/loss paradigm!), but anyways. I now expect that also in more complicated situations reward/loss are, in principle, enough.


Example 1: Q-learning. You have a set S of states and a set A of actions. Given a target policy π : S → A, can you necessarily choose a reward function R : S × A → ℝ such that, training for long enough* with Q-learning (with positive learning rate and discount factor), the action that maximizes the learned value is the one given by the target policy: argmax_a Q(s, a) = π(s)?

*and assuming we visit all of the states in S many times

The answer is yes. Simply reward the behavior you want to see: let R(s, a) = 1 if a = π(s) and R(s, a) = 0 otherwise.

(In fact, one can more strongly choose, for any target value function Q* : S × A → ℝ, a reward function R such that the values Q(s, a) in Q-learning converge in the limit to Q*(s, a). So not only can you force certain behavior out of the model, you can also choose the internals.)
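As a sanity check, the construction can be run on a tiny tabular toy (my own code, with random transitions and uniform exploration): reward exactly the target policy's actions, and argmax_a Q(s, a) recovers the target policy.

```python
import random

def q_learning_forces_policy(n_states=4, n_actions=3, steps=20000, seed=0):
    """Tabular Q-learning with R(s, a) = 1 if a == pi(s) else 0.
    Checks that argmax_a Q(s, a) recovers the target policy pi."""
    rng = random.Random(seed)
    pi = [rng.randrange(n_actions) for _ in range(n_states)]  # target policy
    Q = [[0.0] * n_actions for _ in range(n_states)]
    alpha, gamma = 0.1, 0.9
    s = 0
    for _ in range(steps):
        a = rng.randrange(n_actions)                  # explore uniformly
        r = 1.0 if a == pi[s] else 0.0                # reward the target action
        s2 = rng.randrange(n_states)                  # random transitions
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    learned = [max(range(n_actions), key=lambda a: Q[s][a])
               for s in range(n_states)]
    return learned == pi
```

Intuitively, Q(s, π(s)) converges to about 1 + γV while every other Q(s, a) converges to about γV, so the target action wins the argmax in every state.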

Example 2: Neural network.

Say you have a neural network f with n tunable weights w = (w_1, …, w_n). Can you, by suitable input-output pairs and choices of the learning rate, modify the weights of the net so that they are (approximately) equal to any given target weights w*?

(I'm assuming here that we simply update the weights after each data point, instead of doing SGD or something. The choice of loss function is not very relevant, take e.g. square-error.)

The following sketch convinces me that the answer is positive:

Choose n random input-output pairs (x_i, y_i). The gradients ∇_w L_i of the per-pair losses with respect to the weights are almost certainly linearly independent. Hence, some linear combination Σ_i c_i ∇_w L_i of them equals w − w*. Now, for small ε, running back-propagation on the pair (x_i, y_i) with learning rate ε c_i for all i gives you an update approximately in the direction of w* − w. Rinse and repeat.
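A minimal n = 1 instance of this sketch that one can actually run (my own toy, not a general implementation): the model is f(x) = w·x with square-error loss, trained on the single pair (x, y) = (1, w*). With learning rate 0.25, each gradient step moves w exactly halfway to the target.

```python
def steer_weight_to_target(w, w_target, steps=20):
    """Minimal single-weight instance of the sketch: model f(x) = w * x with
    square-error loss L = (w*x - y)^2. Training on the single pair
    (x, y) = (1, w_target) with learning rate 0.25 halves the error each step."""
    for _ in range(steps):
        grad = 2 * (w * 1.0 - w_target) * 1.0   # dL/dw at x = 1, y = w_target
        w -= 0.25 * grad                        # w <- (w + w_target) / 2
    return w
```

After 20 steps the remaining error is the initial error divided by 2^20, i.e. the weight is steered essentially exactly to the target.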

Iteration as an intuition pump

I feel like many game/decision theoretic claims are most easily grasped when looking at the iterated setup:

Example 1. When one first sees the prisoner's dilemma, the argument that "you should defect because, whatever the other person does, you are better off defecting" feels compelling. The counterargument goes "the other person can predict what you'll do, and this can affect what they'll play".

This has some force, but I have had a hard time really feeling the leap from "you are a person who does X in the dilemma" to "the other person models you as doing X in the dilemma". (One thing that makes this difficult is that usually in PD it is not specified whether the players can communicate beforehand or what information they have of each other.) And indeed, humans' models of other humans are limited - this is not something you should just dismiss.

However, the point "the Nash equilibrium is not necessarily what you should play" does hold, as is illustrated by the iterated Prisoner's dilemma. It feels intuitively obvious that in a 100-round dilemma there ought to be something better than always defecting.
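This intuition is easy to make concrete with a standard toy simulation (my own code, using the usual payoffs (T, R, P, S) = (5, 3, 1, 0)): over 100 rounds, two tit-for-tat players score 300 each, while two always-defectors score only 100 each.

```python
def play_iterated_pd(strat_a, strat_b, rounds=100):
    """Iterated prisoner's dilemma with standard payoffs
    (T, R, P, S) = (5, 3, 1, 0). 'C' = cooperate, 'D' = defect.
    Each strategy sees the opponent's move history."""
    payoff = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
              ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strat_a(hist_b), strat_b(hist_a)
        pa, pb = payoff[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

def tit_for_tat(opp_history):
    return "C" if not opp_history else opp_history[-1]

def always_defect(opp_history):
    return "D"
```

Mutual tit-for-tat yields (300, 300) and mutual defection yields (100, 100), so the Nash play of always defecting is clearly not what a pair of sensible players should end up doing.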

This is among the strongest intuitions I have for "Nash equilibria do not generally describe optimal solutions".

 

Example 2. When presented with lotteries, i.e. opportunities such as "X% chance you win A dollars, (100-X)% chance of winning B dollars", it's not immediately obvious that one should maximize expected value (or, at least, humans generally exhibit loss aversion, bias towards certain outcomes, sensitivity to framing etc.).

This feels much clearer when given the option to choose between lotteries repeatedly. For example, if you are presented with two buttons, one giving you a sure 1 dollar and the other a 40% chance of winning 3 dollars, and you are allowed to press the buttons a total of 100 times, it feels much clearer that you should always pick the one with the higher expected value. Indeed, as you are given more button presses, the probability that you get (a lot) more money that way tends to 1 (by the law of large numbers).
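A quick Monte Carlo sketch of this (my own code, using the numbers above: a sure $1 vs. a 40% chance of $3, whose expected value is $1.20 per press): with 100 presses, the risky, higher-EV button ends up ahead roughly 90% of the time.

```python
import random

def risky_button_win_rate(presses=100, trials=10000, seed=0):
    """Estimate how often 100 presses of the risky button (40% chance of $3,
    EV $1.20 per press) beat 100 presses of the sure button ($1 each)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        risky_total = sum(3 for _ in range(presses) if rng.random() < 0.4)
        if risky_total > presses * 1:   # sure button always totals $100
            wins += 1
    return wins / trials
```

Raising `presses` pushes the estimate toward 1, in line with the law-of-large-numbers point above.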

This gives me a strong intuition that expected values are the way to go.

Example 3. I find Newcomb's problem a bit confusing to think about (and I don't seem to be alone in this). This is, however, more or less the same problem as prisoner's dilemma, so I'll be brief here.

The basic argument "the contents of the boxes have already been decided, so you should two-box" feels compelling, but then you realize that in an iterated Newcomb's problem you will, by backward induction, always two-box.

This, in turn, sounds intuitively wrong, in which case the original argument proves too much. 


One thing I like about iteration is that it makes the concept of "it really is possible to make predictions about your actions" feel more plausible: there's clear-cut information about what kind of plays you'll make, namely the previous rounds. I feel like in my thoughts I sometimes feel like rejecting the premise, or thinking that "sure, if the premise holds, I should one-box, but it doesn't really work that way in real life, this feels like one of those absurd thought experiments that don't actually teach you anything". Iteration solves this issue.

Another pump I like is "how many iterations do there need to be before you Cooperate/maximize-expected-value/one-box?". There (I think) is some number of iterations for this to happen, and, given that, it feels like "1" is often the best answer.

All that said, I don't think iterations provide the Real Argument for/against the position presented. There's always some wiggle room for "but what if you are not in an iterated scenario, what if this truly is a Unique Once-In-A-Lifetime Opportunity?". I think the Real Arguments are something else - e.g. in example 2 I think coherence theorems give a stronger case (even if I still don't feel them as strongly on an intuitive level). I don't think I know the Real Argument for example 1/3.