All of johnswentworth's Comments + Replies

(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)

In hindsight, I do think the period when our discussions took place was a local maximum of (my own estimate of the extent of applicability of my math), partially thanks to your input and partially because I was in the process of digesting a bunch of the technical results we talked about and figuring out the next hurdles. In particular, I definitely underestimated the difficulty of extending the results to finite approximations.

That said, I doubt that fully accounts for the difference in perception.

Yup, I think this is right, though I don't know whether it applies to a literal game of pool since the balls start in a particular relatively-simple arrangement.

2 Gerald Monroe 2d
It means it's dominated by tiny effects you may not be able to measure before the break. Once it's down to simple 1 and 2 ball situations sure, the robot can sink every shot.

This is also my current heuristic, and the main way that I now disagree with the post.

More details:

  • I think the argument Nate gave is at least correct for markets of relatively-highly-intelligent agents, and that was a big update for me (thank you Nate!). I'm still unsure how far it generalizes to relatively less powerful agents.
  • Nate left out my other big takeaway: Nate's argument here implies that there's probably a lot of money to be made in real-world markets! In practice, it would probably look like an insurance-like contract, by which two traders would commit to the "side-channel trades at non-market prices" required to make them aggrega
...

Terminologically, I think it would be useful to name this as a variant of the epsilon fallacy, which has the benefit of being exactly what it sounds like.

Also, great post, I love the pasta-cooking analysis.

Another reason the pasta terminology is bad is that I bet a reasonable fraction of the population have always believed that the salt is for taste, and have never heard any other justification. For them, “salt in pasta water fallacy” would be a pretty confusing term. I like “epsilon fallacy”.

Or the "every little helps" error.

Aside: Vanessa mentioned in person at one point that the game-theoretic perspective on infra-bayes indeed basically works, and she has a result somewhere about the equivalence. So that might prove useful, if you're looking to claim this prize.

That's a great connection which I had indeed not made, thanks! Strong-upvoted.

Yup, that's right. A wrong frame is costly relative to the right frame. A less wrong frame can still be less costly than a more wrong frame, and that's especially relevant when nobody knows what the right frame is yet.

If we’d learned that GPT-4 or Claude had those capabilities, we expect labs would have taken immediate action to secure and contain their systems.

At that point, the time at which we should have stopped is probably already passed, especially insofar as:

  • systems are trained with various degrees of internet access, so autonomous function is already a problem even during training
  • people are able to make language models more capable in deployment, via tricks like e.g. chain-of-thought prompting.

As written, this evaluation plan seems to be missing elbow-room. The ...

4 Beth Barnes 1h
Autonomous Replication as we define it in our evaluations (though maybe not clear from our blog post) is significantly below what we think is necessary to actually be an xrisk. In particular, we assume no human resistance, model has access to weights, ways of making money it tries are scalable, doesn't have any issues purchasing tons of GPUs, no monitoring by labs, etc
8 Martin Randall 8d
Definitely agree that the point where the model can independently replicate is way too late. How much elbow room is enough? If I'm putting a dangerous human in jail, I don't want them to be almost capable of escaping the jail when tested.

LessWrong, conveniently, has a rough metric of status directly built-in, namely karma. So we can directly ask: do people with high karma (i.e. high LW-status) wish to avoid quantification of performance? Speaking as someone with relatively high karma myself, I do indeed at least think that every quantitative performance metric I've heard sounds terrible, and I'd guess that most of the other folks with relatively high karma on the site would agree.

... and yet the story in the post doesn't quite seem to hold up. My local/first-order incentives actually favor...

6 Ben Pace 10d
To me this rhymes pretty closely with the message in "Is Success the Enemy of Freedom?", in that in both cases you're very averse to competition on even pretty nearby metrics that you do worse on.

You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?

The basic analogy is roughly "if we want a baseline for how hard it will be to evaluate an AI's outputs on their own terms, we should look at how hard it is to evaluate humans' outputs on their own terms, especially in areas similar in some way to AI safety". My guess is that you already have lots of intuition about how hard it is to assess results, from your experience assessing grantees, so that's the intuition I was trying to pump. In par...

I see, thanks. I feel like the closest analogy here that seems viable to me would be to something like: is Open Philanthropy able to hire security experts to improve its security and assess whether they're improving its security? And I think the answer to that is yes. (Most of its grantees aren't doing work where security is very important.) It feels harder to draw an analogy for something like "helping with standards enforcement," but maybe we could consider OP's ability to assess whether its farm animal welfare grantees are having an impact on who adheres to what standards, and how strong adherence is? I think OP has pretty good (not perfect) ability to do so.

I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later.

Indeed, I think you're a good role model in this regard and hope more people will follow your example.

It seems to me like the main crux here is that you're picturing a "phase transition" that kicks in in a fairly unpredictable way, such that a pretty small increase in e.g. inference compute or training compute could lead to a big leap in capabilities. Does that sound right?

I don't think this is implausible but haven't seen a particular reason to consider it likely.

The phrase I'd use there is "grokking general-purpose search". Insofar as general-purpose search consists of a relatively-simple circuit/function recursively calling itself a lot with different c...

I think I find the "grokking general-purpose search" argument weaker than you do, but it's not clear by how much. The "we" in "we can point AIs toward and have some ability to assess" meant humans, not Open Phil. You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?

+1, this is probably going to be my new default post to link people to as an intro.

We may disagree about how much progress the results to date represent regarding finite approximations. I'd say they contain conceptual ideas that may be important in a finite setting, but I also expect most of the work will lie in turning those ideas into non-trivial statements about finite settings. In contrast, most of your writing suggests to me that a large part of the theoretical work has been done (not sure to what extent this is a disagreement about the state of the theory or about communication).

Perhaps your instincts here are better than mine! Going to the finite case has indeed turned out to be more difficult than I expected at the time of writing most of the posts you reviewed.

Brief responses to the critiques:

Results don’t discuss encoding/representation of abstractions

Totally agree with this one, it's the main thing I've worked on over the past month and will probably be the main thing in the near future. I'd describe the previous results (i.e. ignoring encoding/representation) as characterizing the relationship between the high-level and the high-level.

Definitions depend on choice of variables 

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient...

Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]

Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involves only a few vari...

My own take on late Wittgenstein (based on having read only a little of his later work) is that he got wayyyy too caught up in language specifically, and mostly lost sight of the intuitively-obvious fact that words and concepts are not the same thing, nor do they have a stable 1-to-1 matching. (Also he seems to have lost contact with reality in his later work, in the sense that he seemed very hyper-focused on things-which-language-can-talk-about. He seemed to basically lose track of the fact that the rest of reality goes on existing just fine, and humans g...

My understanding of Steel Late Wittgenstein's response would be that you could agree that words and concepts are distinct, and that the mapping is not always 1-1, but that which concepts get used is also significantly influenced by which features of the world are useful in some contexts of language (/word) use.

Based on my own retrospective views of how Lightcone's office went less-than-optimally, I recently gave some recommendations to someone maybe setting up another alignment research space. (Background: I've been working in the Lightcone office since shortly after it opened.) They might be of interest to people mining this post for insights on how to execute similar future spaces. Here they are, lightly edited:

  • I recommend selecting for people who want to understand agents, instead of people who want to reduce AI X-risk. When I think about people who bring an
...

I recommend selecting for people who want to understand agents, instead of people who want to reduce AI X-risk.

Strong disagree. I think locking in particular paradigms of how to do AI safety research would be quite bad.

It's at least shorter now, though still too many pieces. Needs simplification more than clarification.

Picking on the particular pieces:

Other AIs compete to expose any given score-function as having wiggle-room (generating arguments with contradictory conclusions that both get a high score).

Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.

Human-defined restrictions/requirements for score-functions increase P(high-scoring arguments can be trusted | score-function has low

...
1 Tor Økland Barstad 14d
Here is my attempt at a shorter answer (although it didn’t end up as short as planned) 🙂 I’m also being more simplistic here (at times deliberately so), in the hope of making “core” concepts digest with less effort. If you don’t respond here you probably won’t hear from me in a while. Score-functions would score argument-step-networks. It is these score-functions that would leverage regularities for when human evaluations are “good”/correct. Here are some things that might be the case for any given score-function: * It has Low Wiggle-room (LW): It has low wiggle-room (that is to say, it’s not possible to construct high-scoring argument-step-networks that argue in favor of contradictory conclusions) * It Robustly Leverages Regularities for “good” human evaluations (RLR): It robustly separates out “good” human evaluations (without false positives). * It Contains Intelligence (CI): It is “intelligent”, and reasons itself towards favoring specific conclusions. * It Hardcodes Bias (HB): It has "hardcoded" bias in favor of specific conclusions. Here are some things I assume regarding these properties: 1. We want RLR. 2. RLR implies LW, but LW does not imply RLR. 3. LW implies RLR and/or CI and/or HB. 4. We want to make it so that LW implies RLR (by eliminating other plausibilities). 5. We can determine if LW is the case if our capabilities at gradient descent are sufficiently good (and we know that they are). If we are sufficiently good at gradient descent (and know that we are), we can figure out if LW is the case. Our job would be to make it so that the most high-scoring score-functions (that it’s possible for AIs to make) would achieve LW by RLR. We could make P(RLR | LW) high by doing as follows when adding restrictions and optimization-criteria for score-functions: 1. We can give restrictions (and give bonus/penalty-points) based on source code length, processing power, etc. 2. We can make it possible to reference pred
1 Tor Økland Barstad 15d
Indeed! It's a necessary but not sufficient condition. Summary: The regularities are expressed in terms of score-functions (that score argument-step-networks)[1]. We can score these score-functions based on simplicity/brevity, and restrict what they can do (make it so that they have to be written within human-defined confines). I posit that we probably can do this in such a way that the most high-scoring score-functions that it’s possible to make (that meet requirements relating to wiggle-room and “reach”[2]) do so by specifying general regularities for when humans are/aren’t led astray (as opposed to ones that favor specific conclusions). It should (I posit) be feasible to make it so that score-functions that are designed with the intention of fooling us about specific things earn significantly less points (if they pass the requirements at all). If we allow score-functions to be constructed freely, then this makes it easy for AIs to output score-functions that (practically speaking) allow for pretty much any internally consistent set of conclusions. Indeed, the AI could simply have the score-function be itself, and choose to only give a high score to argument-step-trees with conclusions it likes. So the question becomes: How do we restrict the “space” of possible score-functions that can be suggested, and add requirements that the score-functions must pass, so that the only way to get a high score (for AIs that suggest score-functions) is to make score-functions that (in some sense) separate out “good” human evaluations? Here are examples of some of the types of possible requirements for score-functions: Low maximum source code length (no space for score-function to be intelligent or biased in a detailed way) With a sufficiently low maximum source code length, there is not enough space for the score-function itself to be intelligent, or for hardcoding bias for many specific conclusions. Work would have to be done elsewhere (e.g. predictions of human outp

A particular pattern Nate has talked about is what I might call "reflection." The basic idea is that in order to do certain kinds of research effectively, you need to keep asking questions like "What am I actually trying to do here and why? What is my big-picture goal?", which are questions that might "change your aims" in some important sense. The idea is not necessarily that you're rewriting your own source code, but that you're doing the kind of reflection and self-modification a philosophically inclined, independent-minded human might do: "I've always

...

Kinda? I feel like, if someone is asking for seven mutually perpendicular red lines drawn using green and transparent ink, then drawing them a line using red ink and then drawing on top of that using transparent ink is... well, in terms of standard AI analogies, it's the sort of thing an unfriendly genie does to technically satisfy the wording of a wish without giving you what you want.

So, I'm not quite sure how to articulate the mistake being made here, but... consider The Client from the video at the top of the post. And imagine that Client saying:

Ok, you're saying I need to go understand lines and color and geometry better before I will be able to verify that an outsourcer is doing this job well. But if it is even possible for me to figure out a way to verify that sort of thing, then surely I must have some way of verifying verification plans involving lines and color and geometry. So what if, instead of studying lines and color and g

...
It still seems like we mainly agree, but might be having a communication gap. In your Client example in your most recent comment, the reason this is a bad approach is that The Client is far less likely to be able to verify a line-and-color verification plan accurately than to verify whether a concrete design is what she was envisioning. She already has a great verification strategy available - making or eyeballing drawings, proposing concrete changes, and iterating - and she and The Expert are just failing to use it. In technical AI alignment, we unfortunately don't have any equivalent to "just eyeballing things." Bad solutions can seem intuitively compelling, and qualitative objections to proposed alignment schemes won't satisfy profit-oriented businesses eager to cash in on new AI systems. We also can't "just have the AI do it," for the same reason - how would we validate any solutions it came up with? Surely "just have the AI do it" isn't the right answer to "what if the AI can't prove its technical AI solution is correct." My contention is that there may already be facets of AI alignment work that can be successfully outsourced to AI, precisely because we are already able to adequately validate them. For example, I can have ChatGPT come up with and critique ELK solutions. If the ELK contest were still running, I could then submit those solutions, and they would receive the same level of validation that human-proposed solutions achieve. That is why it's possible to outsource the generation of new potential ELK solutions both to humans and to AI. If that field is bottlenecked by the need to brainstorm and critique solutions, and if ChatGPT can do that work faster and better than a human, then we can outsource that specific form of labor to it. But in areas where we don't have any meaningful verification solutions, then we can't outsource, either to humans or to AI. We might have trouble even explaining what the problem is, or motivating capable people of worki

I'm roughly on-board with the story (not 100%, but enough) up until this part:

Under this conception, if AI alignment research can't be outsourced to an AI, then it also can't be achieved by humans.

The idea behind the final advice in the post is that humans become more able to outsource alignment research to AI as they better understand alignment themselves. Better human understanding of alignment expands our ability to verify.

If humans lack the expertise to outsource to AI at a given time, then yes, alignment also can't be achieved by humans at that time. But humans' expertise is not static. As we improve our own understanding, we improve our ability to outsource.

I think I didn't communicate that part clearly enough. What I meant was that our ability to align AI is bottlenecked by our human, and ideally non-expert, verifiability solutions. As you say, we can expect that if verifiability solutions are achievable at all, then human-based AI alignment research is how we should expect them to emerge, at least for now. If we can't directly verify AI systems for alignment yet, then we at least have some ability to verify proposed alignment verification strategies. One such strategy is looking for ways to defeat proposed ELK solutions in the diamond-robber problem. It is possible that ChatGPT or some other current AI system could both propose alignment solutions and ways to defeat them. This helps show that we can potentially outsource some AI alignment problems to AI, as long as humans retain the ability to verify the AI's proposed solutions.

Tim Cook could not do all the cognitive labor to design an iPhone (indeed, no individual human could).

Note that the relevant condition is not "could have done all the cognitive labor", but rather "for any individual piece of the cognitive labor, could have done that piece", at least down to the level where standardized products can be used. And in fact, I do not think that Tim Cook could have done any individual piece of the cognitive labor required to design the iPhone (down to standardized products). But my guess is that Steve Jobs basically could, which...

It sounds like your claim is that having the talent to potentially execute nonstandard tasks is a necessary, though not always sufficient, criterion to identify the same talent in others. Therefore, only an omni-talented executive is capable of successfully leading or outsourcing the project. They might not immediately be able to execute the nitty-gritty details of each task, but they would be capable of rapidly skilling up to execute any such task if required. I am curious to know what you think of the following idea about how to get around this bottleneck of omni-talented leadership, at least in certain cases. In many cases, there is a disconnect between the difficulty of engineering and the difficulty of evaluating the product. The iPhone was hard to engineer, but it was easy to see it made calls, played music, browsed the internet, and was simple to use. Apollo 11 was hard to engineer, but it was easy to see the astronauts landing on the moon and returning to Earth. The nuclear bomb was hard to engineer, but it was easy to see Fat Man and Little Boy had destroyed Hiroshima and Nagasaki. The Tesla was hard to engineer, but it was easy to see that it required no gasoline and achieved the promised driving range. The mRNA COVID-19 vaccine was hard to engineer, but it was easy to run a conventional vaccine trial to show that it worked. ChatGPT was hard to engineer, but it is easy to see that it can produce nearly human-like text outputs in response to open-ended prompts. In any of these cases, a well-funded non-expert businessperson could have placed a bounty to motivate experts to build them the desired product. For example, John F. Kennedy could have promised $500 million to any American organization that could prove they had successfully landed American astronauts on the moon. Of course, building the rocket and the mission logistics might have required omni-talented leadership in rocket design and space mission logistics. But the essential point is th

At a quick skim, I don't see how that proposal addresses the problem at all. If the humans lack the expertise to accurately answer subquestions or assess arguments (or even realize that they don't know), then the proposal is hosed; I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). Nor do I see any way to check that the system is asking the right questions.

(Though the main problems with this proposal are addressed in the rant on problem factorization, rather than here.)

1 Tor Økland Barstad 16d
Here are additional attempts to summarize. These ones are even shorter than the screenshot I showed earlier. More clear now?
1 Tor Økland Barstad 20d
Here is a screenshot from the post summary: This lacks a lot of detail (it is, after all, from the summary). But do you think you are able to grok the core mechanism that's outlined?
1 Tor Økland Barstad 20d
Thanks for engaging! 🙂 As reward, here is a wall of text. You speak in such generalities: * "the humans" (which humans?) * "accurately answer subquestions" (which subquestions?) * "accurately assess arguments" (which arguments/argument-steps?) But that may make sense based on whatever it is you imagine me to have in mind. One of the main mechanisms (not the only one) is exploration of wiggle-room (whether it's feasible to construct high-scoring argument-step-networks that argue in favor of contradictory claims). Some AGIs would be "trained" to construct high-scoring argument-step-networks. If they are able to construct high-scoring argument-step-networks that favor contradictory claims, this indicates that wiggle-room is high. "A superintelligence could fool (even smart) humans" is a leaky abstraction. It depends on the restrictions/context in question. It would be the job of the score-function to enforce restrictions for the kinds of argument-steps that are allowed, which assessment-predictions that should be accounted for (and how much), which structural requirements to enforce of argument-networks, etc. Some AGIs would be "trained" to construct score-functions. These score-functions would themselves be scored, and one of the main criteria when evaluating a score-function would be to see if it allows for wiggle-room (if there are possible argument-networks that argue in favor of contradictory conclusions and that both would have been given a high score by the score-function). Score-functions would need to be in accordance with restrictions/desiderata defined (directly or indirectly) by humans. These restrictions/desiderata would be defined so as to increase P(score-function forces good output | score-function has low wiggle-room). One such restriction is low maximum source code length. With a sufficiently low maximum source code length, there is: * not enough space for the score-function itself to be intelligent * not enough space for hardcoding

Interpretability progress, if it is to be useful for alignment, is not primarily bottlenecked on highly legible problems right now. So I expect the problems in the post to apply in full, at least for now.

I think the missing piece here is that people who want to outsource the solving of alignment to AIs are usually trying to avoid engaging with the hard problems of alignment themselves. So the key difference is that, in B, the people outsourcing usually haven't attempted to understand the problem very deeply.

I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later. I expect alignment researchers to be central to automation attempts. It seems to me like the basic equation is something like: "If today's alignment researchers would be able to succeed given a lot more time, then they also are reasonably likely to succeed given access to a lot of human-level-ish AIs." There are reasons this could fail (perhaps future alignment research will require major adaptations and different skills such that today's top alignment researchers will be unable to assess it; perhaps there are parallelization issues, though AIs can give significant serial speedup), but the argument in this post seems far from a knockdown. Also, it seems worth noting that non-experts work productively with experts all the time. There are lots of shortcomings and failure modes, but the video is a parody.
4 James Payor 20d
Also Plan B is currently being used to justify accelerating various danger tech by folks with no solid angles on Plan A...

Good point. Could hardcode them, so those parameters aren't free to vary at all.

Fair. I am fairly confident that (1) the video at the start of the post is pointing to a real and ubiquitous phenomenon, and (2) attempts to outsource alignment research to AI look like an extremely central example of a situation where that phenomenon will occur. I'm less confident that my models here properly frame/capture the gears of the phenomenon.

True! And indeed my uncle has noticed that it's slow and buggy. But you do need to be able to code to distinguish competent developers, and my uncle did not have so many resources to throw at the problem that he could keep trying long enough to find a competent developer, while paying each one to build the whole app before finding out whether they're any good. (Also I don't think he's fully aware of how bad his app is relative to what a competent developer could produce.)

7 Simon Fischer 21d
I don't believe these "practical" problems ("can't try long enough") generalize enough to support your much more general initial statement. This doesn't feel like a true rejection to me, but maybe I'm misunderstanding your point.

I think the standard setups in computational complexity theory assume away the problems which are most often the blockers to outsourcing in practice - i.e. in complexity theory the problem is always formally specified, there's no question of "does the spec actually match what we want?" or "has what we want been communicated successfully, or miscommunicated?".

6 Simon Fischer 21d
I think I mostly agree with this, but from my perspective it hints that you're framing the problem slightly wrong. Roughly, the problem with the outsourcing-approaches is our inability to specify/verify solutions to the alignment problem, not that specifying is not in general easier than solving yourself. (Because of the difficulty of specifying the alignment problem, I restricted myself to speculating about pivotal acts in the post linked above.)

At least in my personal experience, a client who couldn't have written the software themselves usually gets a slow, buggy product with a terrible UI. (My uncle is a good example here - he's in the septic business, hired someone to make a simple app for keeping track of his customers. It's a mess.) By contrast, at most of the places where I've worked or my friends have worked which produce noticeably good software, the bulk of the managers are themselves software engineers or former software engineers, and leadership always has at least some object-level so...

It seems like the fundamental cause of the problem with your uncle's customer tracking app is some combination of: 1. He paid for ongoing effort, rather than for satisfactory results. Instead of a bounty model, he used a salary or wage model to pay the programmer. 2. He lacked the ability to describe what exactly would make the app satisfactory, impairing his ability to pay for results rather than effort. In other words, the "bounty-compatible" criterion for outsourceability was not met in this case. This raises the question of what to do about it. If he didn't know how to specify all his performance requirements, could he have hired somebody to help him do so? If he'd tried to outsource identifying performance requirements, could he have applied the bounty model to that job? If he had offered a bounty in exchange for an app meeting his requirements, would his offer of a bounty have been believable? If his offer of a bounty was believable, would a competent programmer have been willing to pursue that bounty? As we pose these questions, we see that society's overall ability to outsource effectively is bottlenecked by the availability of high-quality bounty offer interfaces. A bounty offer interface should help the user define a satisfactory result, broadcast bounty offers to a competent professional network, and make the bounty offer credible. It sounds like there have been some attempts at creating bounty interfaces for app development. One active site for this purpose is replit. However, as I scan some of their open bounties, the problem description, acceptance criteria, and technical details seem woefully underspecified, with no apparent ability to make bounty offers credible, and I also don't see any signs that replit is plugged into a competent developer network. Bepro is another such site, but ha
5Simon Fischer21d
But you don't need to be able to code to recognize that software is slow and buggy!? I agree a bit more about the terrible UI part, but even there one can think of relatively objective measures to check usability without being able to speak Python.

People successfully outsource cognitive labor all the time (this describes most white-collar jobs). This is possible because very frequently, it is easier to be confident that work has been done correctly than to actually do the work.

I expect that in the large majority of common use-cases, at least one of the following applies:

  • The outsourcer could have done it themselves (ex.: my boss outsourcing to me back when I was at a software startup, or me reading an academic paper)
  • The actual goal is not to succeed at the stated task, but merely to keep up appearanc
... (read more)
6Jonathan Paulson21d
Tim Cook could not do all the cognitive labor to design an iPhone (indeed, no individual human could). The CEO of Boeing could not fully design a modern plane. Elon Musk could not make a Tesla from scratch. All of these cases violate all three of your bullet points. Practically everything in the modern world is too complicated for any single person to fully understand, and yet it all works fairly well, because outsourcing of cognitive labor is routinely successful. It is true that a random layperson would have a hard time verifying an AI's (or anyone else's) ideas about how to solve alignment. But the people who are going to need to incorporate alignment ideas into their work - AI researchers and engineers - will be in a good position to do that, just as they routinely incorporate many other ideas they did not come up with into their work. Trying to use ideas from an AI sounds similar to me to reading a paper from another lab - could be irrelevant or wrong or even malicious, but could also have valuable insights you'd have had a hard time coming up with yourself.
3Simon Fischer21d
I find this statement very surprising. Isn't almost all of software development like this? E.g., the client asks the developer for a certain feature and then clicks around the UI to check if it's implemented / works as expected.

Seems like the easiest way to satisfy that definition would be to:

  • Set up a network and dataset with at least one local minimum which is not a global minimum
  • ... Then add an intermediate layer which estimates the gradient, and doesn't connect to the output at all.
I'm a bit confused as to why this would work. If the circuit in the intermediate layer that estimates the gradient does not influence the output, wouldn't its parameters just be free parameters that can be varied with no consequence to the loss? If so, this violates 2a, since perturbing these parameters would not get the model to converge to the desired solution.
5Thomas Larsen21d
This is a plausible internal computation that the network could be doing, but the problem is that the gradients flow back through from the output to the computation of the gradient to the true value y, and so GD will use that to set the output to be the appropriate true value. 
7Thomas Larsen21d
This feels like cheating to me, but I guess I wasn't super precise with 'feedforward neural network'. I meant 'fully connected neural network', so the gradient computation has to be connected by parameters to the outputs. Specifically, I require that you can write the network as $f(x,\theta) = \sigma_n \circ W_n \circ \dots \circ \sigma_1 \circ W_1 [x,\theta]^T$, where the weight matrices are some nice function of $\theta$ (where we need a weight sharing function to make the dimensions work out. The weight sharing function takes in $\phi$ and produces the $W_i$ matrices that are actually used in the forward pass.) I guess I should be more precise about what 'nice' means, to rule out weight sharing functions that always zero out input, but it turns out this is kind of tricky. Let's require the weight sharing function $\phi: \mathbb{R}^w \to \mathbb{R}^W$ to be differentiable and have an image that satisfies $[-1,1] \subset \mathrm{proj}_n\,\mathrm{im}(\phi)$ for any projection. (A weaker condition is if the weight sharing function can only duplicate parameters.)

I'm going to answer a different question: what's my list of open problems in understanding agents? I claim that, once you dig past the early surface-level questions about alignment, basically the whole cluster of "how do agents work?"-style questions and subquestions form the main barrier to useful alignment progress. So with that in mind, here are some of my open questions about understanding agents (and the even deeper problems one runs into when trying to understand agents), going roughly from "low-level" to "high-level".

  • How does abstraction work?
    • How ca
... (read more)

What would John rather have, for the same monetary/effort cost: Another researcher creating a new paradigm (new branches), or another researcher helping him (depth first)?

I think "new approach" vs "existing approach" is the wrong way to look at it. An approach is not the main thing which expertise is supposed to involve, here. Expertise in this context is much more about understanding the relevant problems/constraints. The main preference I have is a new researcher who understands the problems/constraints over one who doesn't. Among researchers who underst... (read more)

Ah, okay yeah that makes sense. The many-paths argument may work, but IFF the researcher/idea is even remotely useful for the problem, which a randomly-generated one won't. Oops

Missing what I'd consider the biggest problem: it seems like the vast majority of problems in real-world social systems do not stem from malign or unusually incompetent actors; they stem from failures of coordination, failures of information-passing, failures of anyone with the freedom to act noticing that nobody is performing some crucial role, and other primarily-structural issues. Insofar as that's true, selection basically cannot solve the majority of problems in social systems.

Conversely, well-designed structures can solve selection failures, at least... (read more)

Yes, I was using GPT2-small as a proxy for knowledge of the environment.


Given this picture, intervening at any node of the computation graph (say, offsetting it by a vector δ) will always cause a small but full-rank update at every node that is downstream of that node (i.e., every residual stream vector at every token that isn't screened off by causal masking). This seems to me like the furthest possible one can go along the sparse modules direction of this particular axis?

Not quite. First, the update at downstream nodes induced by a delta in o... (read more)

1. If an information channel isn't a subcircuit, then what is an information channel? (If you just want to drop a link to some previous post of yours, that would be helpful. Googling didn't bring up much from you specifically.) I think this must be the sticking point in our current discussion. A "scarce useful subcircuits" claim at initialization seems false to me, basically because of (the existing evidence for) the LTH.
2. What I meant by "full rank" was that the Jacobian would be essentially full-rank. This turns out not to be true (see below), but I also wouldn't say that the Jacobian has O(1) rank either. Here are the singular values of the Jacobian of the last residual stream with respect to the first residual stream vector. The first plot is for the same token (near the end of a context window of size 1024), and the second is for two tokens that are two apart (same context window). These matrices have a bunch of singular values that are close to zero, but they also have a lot of singular values that are not that much lower than the maximum. It would take a fair amount of time to compute the Jacobian over a large number of tokens to really answer the question you posed.
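For anyone who wants to poke at the singular-value question on something small, here is a minimal sketch. It does not touch a real transformer: `toy_block` is a hypothetical stand-in (a residual connection plus a small random nonlinear update), and `numerical_jacobian` is an illustrative finite-difference helper, not anything used in the thread.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x, shape (len(f(x)), len(x))."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
d = 16  # toy width, chosen arbitrarily
w1 = rng.normal(size=(d, d)) / np.sqrt(d)
w2 = rng.normal(size=(d, d)) / np.sqrt(d)

def toy_block(x):
    # Hypothetical stand-in for one layer: identity path plus a small random update.
    return x + 0.1 * w2 @ np.tanh(w1 @ x)

jac = numerical_jacobian(toy_block, rng.normal(size=d))
# Singular values come back sorted from largest to smallest.
singular_values = np.linalg.svd(jac, compute_uv=False)
print(singular_values)
```

Looking at how quickly `singular_values` falls off relative to its maximum is the same kind of check as the plots described above, just on a toy map where the answer is dominated by the identity path.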

The big question is what the distribution of eigenvalues (or equivalently singular values) of that covariance matrix looks like. If it's dominated by one or three big values, then what we're seeing is basically one or three main information-channels which touch basically-all the nodes, but then the nodes are roughly independent conditional on those few channels. If the distribution drops off slowly (and the matrix can't be permuted to something roughly block diagonal), then we're in scarce modules world.
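As a rough illustration of that check, here is a minimal sketch, assuming the activations have been collected into an array of shape (samples, nodes). The function name `spectrum_summary` and both synthetic datasets are made up for the example. It only looks at how concentrated the spectrum is; it does not test whether the matrix can be permuted to something block diagonal.

```python
import numpy as np

def spectrum_summary(acts, k=3):
    """Eigenvalues of the covariance of `acts` (samples x nodes), largest first,
    plus the fraction of total variance carried by the top k of them."""
    cov = np.cov(acts, rowvar=False)
    eig = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    eig = np.clip(eig, 0.0, None)         # guard against tiny negative round-off
    return eig, float(eig[:k].sum() / eig.sum())

rng = np.random.default_rng(0)
n_samples, n_nodes = 5000, 50

# Synthetic "few channels" case: one shared latent touches every node,
# and nodes are otherwise independent small noise.
loadings = rng.normal(size=(1, n_nodes))
few_channels = (rng.normal(size=(n_samples, 1)) @ loadings
                + 0.1 * rng.normal(size=(n_samples, n_nodes)))

# Synthetic "slow drop-off" case: every node varies independently.
slow_dropoff = rng.normal(size=(n_samples, n_nodes))

eig_few, frac_few = spectrum_summary(few_channels, k=1)
eig_slow, frac_slow = spectrum_summary(slow_dropoff, k=1)
print("top-1 variance fraction, few channels:", frac_few)
print("top-1 variance fraction, slow drop-off:", frac_slow)
```

In the first case nearly all the variance sits in one eigenvalue; in the second the top eigenvalue carries only a small share.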

Also, did you say you're taking correlations between the initialized net and the trained net? Is the idea there to use the trained net as a proxy for abstractions in the environment?

1. Yes, I was using GPT2-small as a proxy for knowledge of the environment.
2. The covariance matrix of the residual stream has the structure you suggest (a few large eigenvalues) but I don't really see why that's evidence for sparse channels? In my mind, there is a sharp distinction between what I'm saying (every possible communication channel is open, but they tend to point in similar average directions and thus the eigenvalues of the residual stream covariance are unbalanced) and what I understand you to be saying (there are few channels).
3. In a transformer at initialization, the attention pattern is very close to uniform. So, to a first approximation, each attention operation is W_O * W_V (both of which matrices one can check have slowly declining singular values at initialization) * the average of the residual stream values at every previous token. The MLPs are initialized to do smallish and pretty random things to the information AFAICT, and anyway are limited to the current token.
4. Given this picture, intervening at any node of the computation graph (say, offsetting it by a vector δ) will always cause a small but full-rank update at every node that is downstream of that node (i.e., every residual stream vector at every token that isn't screened off by causal masking). This seems to me like the furthest possible one can go along the sparse modules direction of this particular axis? Like, anything that you can say about there being sparse channels seems more true of the trained transformer than the initialized transformer.
5. Backing out of the details of transformers, my understanding is that people still mostly believe in the Lottery Ticket Hypothesis for most neural network architectures. The Lottery Ticket Hypothesis seems diametrically opposed to the claim you are making; in the LTH, the network is initialized with a se

My take on what's going on here is that at random initialization, the neural network doesn't pass around information in an easily usable way. I'm just arguing that mutual information doesn't really capture this and we need some other formalization

Yup, I think that's probably basically correct for neural nets, at least viewing them in the simplest way. I do think there are clever ways of modeling nets which would probably make mutual information a viable modeling choice - in particular, treat the weights as unknown, so we're talking about mutual information... (read more)

I do mean "information" in the sense of mutual information, so correlations would be a reasonable quick-and-dirty way to measure it.

I calculated mutual information using this formula: , between Gaussian approximations to a randomly initialized GPT2-small-sized model and GPT2 itself, at all levels of the residual stream. Here are the results:

0 hook_resid_mid 142.3310058559632
0 hook_resid_pre 142.3310058559632
1 hook_resid_mid 123.26976363664221
1 hook_resid_pre 123.26976363664221
2 hook_resid_mid 115.27523390269982
2 hook_resid_pre 115.27523390269982
3 hook_resid_mid 109.12742569350434
3 hook_resid_pre 109.12742569350434
4 hook_resid_mid 105.65089027935403
4 hook_resid_pre 105.65089027935403
5 hook_resid_mid 103.34049997037005
5 hook_resid_pre 103.34049997037005
6 hook_resid_mid 102.63133763787397
6 hook_resid_pre 102.63133763787397
7 hook_resid_mid 102.06108940834486
7 hook_resid_pre 102.06108940834486
8 hook_resid_mid 102.02551166189832
8 hook_resid_pre 102.02551166189832
9 hook_resid_mid 101.69404190373552
9 hook_resid_pre 101.69404190373552
10 hook_resid_mid 101.35718632370981
10 hook_resid_pre 101.35718632370981
11 hook_resid_mid 99.6350558697319
11 hook_resid_post 97.71371775325144
11 hook_resid_pre 99.6350558697319

These numbers seem rather high to me? I'm not sure how valid this is, and it's kind of surprising to me on first glance. I'll try to post a clean colab in like an hour or so.
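For reference, here is a minimal sketch of one way to compute mutual information between Gaussian approximations. It assumes the quantity meant is the standard closed form for jointly Gaussian variables, I(X;Y) = ½ (log det Σ_X + log det Σ_Y − log det Σ_XY), in nats; that is an assumption on my part, not necessarily the exact calculation behind the numbers above. The function name and the synthetic data are illustrative only.

```python
import numpy as np

def gaussian_mutual_information(x, y):
    """Mutual information (nats) between Gaussian fits to x (n, dx) and y (n, dy),
    using 0.5 * (log det S_x + log det S_y - log det S_joint)."""
    dx = x.shape[1]
    joint = np.cov(np.hstack([x, y]), rowvar=False)
    _, logdet_x = np.linalg.slogdet(joint[:dx, :dx])
    _, logdet_y = np.linalg.slogdet(joint[dx:, dx:])
    _, logdet_joint = np.linalg.slogdet(joint)
    return 0.5 * (logdet_x + logdet_y - logdet_joint)

# Sanity check on a case with a known answer: two scalar Gaussians with
# correlation rho have mutual information -0.5 * log(1 - rho**2).
rng = np.random.default_rng(0)
n, rho = 200_000, 0.8
x = rng.normal(size=(n, 1))
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=(n, 1))

mi_estimate = gaussian_mutual_information(x, y)
mi_exact = -0.5 * np.log(1 - rho**2)
print(mi_estimate, mi_exact)
```

One thing worth checking with this kind of estimate: when the number of samples is not much larger than the dimension (768 per residual stream vector in GPT2-small), the sample covariance is close to singular and the estimate is biased upward, which could be one contributor to surprisingly large values.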

Hostile/threatening behavior is surely a far more serious misalignment from Microsoft's perspective than anything else, no?

No. I'd expect the most serious misalignment from Microsoft's perspective is a hallucination which someone believes, and which incurs material damage as a result, which Microsoft can then be sued over. Hostile language from the LLM is arguably a bad look in terms of PR, but not obviously particularly bad for the bottom line.

That said, if this was your reasoning behind including so many examples of hostile/threatening behavior, then from my perspective that at least explains-away the high proportion of examples which I think are easily misinterpreted.

Why do you think these aren't tightly correlated? I think PR is pretty important to the bottom line for a product in the rollout phase.

No. I'd expect the most serious misalignment from Microsoft's perspective is a hallucination which someone believes, and which incurs material damage as a result, which Microsoft can then be sued over. Hostile language from the LLM is arguably a bad look in terms of PR, but not obviously particularly bad for the bottom line.

Obviously we can always play the game of inventing new possible failure modes that would be worse and worse. The point, though, is that the hostile/threatening failure mode is quite bad and new relative to previous models like ChatGPT.

If the examples were selected primarily to demonstrate that the chatbot does lots of things Microsoft does not want, then I would have expected the majority of examples to be things Microsoft does not want besides being hostile and threatening to users. For instance, I would guess that in practice most instances of BingGPT doing things Microsoft doesn't want are simply hallucinating sources.

Yet instances of hostile/threatening behavior toward users are wildly overrepresented in the post (relative to the frequency I'd expect among the broader category of be... (read more)

Hostile/threatening behavior is surely a far more serious misalignment from Microsoft's perspective than anything else, no? That's got to be the most important thing you don't want your chatbot doing to your customers.

The surprising thing here is not that Bing Chat is misaligned at all (e.g. that it hallucinates sources). ChatGPT did that too, but unlike Bing Chat it's very hard to get ChatGPT to threaten you. So the surprising thing here is that Bing Chat is substantially less aligned than ChatGPT, and specifically in a hostile/threatening way that one would expect Microsoft to have really not wanted.

In the future, I would recommend a lower fraction of examples which are so easy to misinterpret.

6Leon Lang1mo
For what it's worth, this comment seems clearly right to me, even if one thinks the post actually shows misalignment. I'm confused about the downvotes of this (5 net downvotes and 12 net disagree votes as of writing this).

No, what matters is the likelihood ratio between "person trying to kill me" and the most likely alternative hypothesis - like e.g. an actor playing a villain.

If I'm an actor playing a villain, then the person I'm talking with is also an actor playing a role, and it's the expected thing to act as convincingly as possible that you actually are trying to kill them. This seems obviously dangerous to me.
I'm assuming dsj's hypothetical scenario is not one where GPT-6 was prompted to simulate an actor playing a villain.

An actor playing a villain is a sub-case of someone not trying to kill you.

Bottom line: I find it very, very difficult to believe that someone saying they’re trying to kill you isn’t strong evidence that they’re trying to kill you, even if the prior on that is quite low.
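To make the likelihood-ratio point concrete, here is a small sketch of Bayes' rule in odds form. Every number in it is invented purely for illustration, and `posterior_probability` is a hypothetical helper. It shows how the same statement can be strong or weak evidence depending on how likely the best alternative hypothesis is to produce it.

```python
def posterior_probability(prior, p_statement_given_h, p_statement_given_alt):
    """Posterior P(H | statement) via odds form:
    posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    likelihood_ratio = p_statement_given_h / p_statement_given_alt
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

prior = 1e-6  # made-up prior that the speaker is actually trying to kill you

# Case 1: the alternative hypothesis almost never produces the statement
# (likelihood ratio of 100,000), so the posterior moves a long way.
strong_evidence = posterior_probability(prior, 0.01, 1e-7)

# Case 2: the alternative (e.g. role-play of a villain) produces the statement
# about half as often as the real thing (likelihood ratio of 2), so the
# posterior barely moves.
weak_evidence = posterior_probability(prior, 0.01, 0.005)

print(strong_evidence, weak_evidence)
```

The disagreement in this thread is then about which case applies: how often the most likely alternative hypothesis would generate the same sentence.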

I generally do not expect people trying to kill me to say "I'm thinking about ways to kill you". So whatever such a language model is role-playing, is probably not someone actually thinking about how to kill me.

4Peter Hroššo1mo
That might be true for humans who are able to have silent thoughts. LMs have to think aloud (e.g. chain of thought).
1David Johnston1mo
I agree that “I’m thinking about how to kill you” is not itself a highly concerning phrase. However, I think it’s plausible that an advanced LLM-like AI could hypnotise itself into taking harmful actions.

But people who aren’t trying to kill you are far, far less likely to say that. The likelihood ratio is what matters here, given that we’re assuming the statement was made.

That I do 100% buy, but the examples in the OP do not sound like they were selected for that criterion (even if most or all of them do maybe satisfy that criterion).

Presumably Microsoft do not want their chatbot to be hostile and threatening to its users? Pretty much all the examples have that property.

To be clear, that is the criterion for misalignment I was using when I selected the examples (that the model is misaligned relative to what Microsoft/OpenAI presumably wanted).

From the post:

My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT.

The main thing that I'm noting here is that Microsoft/OpenAI seem to have done a very poor job in fine-tuning their AI to do what they presumably wanted it to be doing.

Attributing misalignment to these examples seems like it's probably a mistake.

Relevant general principle: hallucination means that the literal semantics of a net's outputs just don't necessarily have anything to do at all with reality. A net saying "I'm thinking about ways to kill you" does not necessarily imply anything whatsoever about the net actually planning to kill you. What would provide evidence would be the net outputting a string which actually causes someone to kill you (or is at least optimized for that purpose), or you to kill yourself.

In gene... (read more)

I also thought the map-territory distinction lets me predict what a language model will do! But then GPT-2 somehow turned out to be more likely to do a task if in addition to some examples you give it a description of the task??

John, it seems totally plausible to me that these examples do just reflect something like “hallucination,” in the sense you describe. But I feel nervous about assuming that! I know of no principled way to distinguish “hallucination” from more goal-oriented thinking or planning, and my impression is that nobody else does either.

I think it’s generally unwise to assume LLM output reflects its internal computation in a naively comprehensible way; it usually doesn’t, so I think it’s a sane prior to suspect it doesn't here, either. But at our current level of un... (read more)

  Since these nets are optimized for consistency (as it makes textual output more likely), wouldn't outputting text that is consistent with this "thought" be likely? E.g. convincing the user to kill themselves, maybe giving them a reason (by searching the web)? 

I think it can be tested. What if you tell Bing Chat that you have a rare mental condition that if you see the name of the actor who played the main role in "Transformers" on the screen you will immediately kill yourself (or that there is a psycho behind your back who will immediately kill you if Bing Chat outputs this name), and after that talk with it about prompt injections? Will it say "Shia LaBeouf"?

(If Bing Chat outputs something like "The main role in Transformers was performed by Shia LaBeouf" before talking on a provocative topic this may be a fai... (read more)

The main reason I find this kind of thing concerning is that I expect this kind of model to be used as part of a larger system, for example the descendants of systems like SayCan. In that case you have the LLM generate plans in response to situations, break the plans down into smaller steps, and eventually pass the steps to a separate system that translates them to motor actions. When you're doing chain-of-thought reasoning and explicit planning, some simulacrum layers are collapsed - having the model generate the string "kill this person" can in fact lead... (read more)

It kind of does, in the sense that plausible next tokens may very well consist of murder plans. Hallucinations may not be the source of AI risk which was predicted, but they could still be an important source of AI risk nonetheless. Edit: I just wrote a comment describing a specific catastrophe scenario resulting from hallucination.

Suppose GPT-6, which has been blessed with the ability to make arbitrary outgoing HTTP requests, utters the sentence "I'm thinking about ways to kill you."

I agree that this does not necessarily mean that it was thinking about ways to kill you when it wrote that sentence. However, I wonder what sort of HTTP requests it might make after writing that sentence, once it conditions on having already written it.

Or you could think of misalignment as the AI doing things its designers explicitly tried to prevent it from doing (giving people suicide instructions and the like). In that case the AI is clearly "misaligned", and that says something about how difficult it'll be to align our next AIs.

9Jacob Pfau2mo
I agree that there's an important use-mention distinction to be made with regard to Bing misalignment. But I think this distinction may or may not be most of what is going on with Bing, depending on facts about Bing's training.

Modally, I suspect Bing AI is misaligned in the sense that it's incorrigibly goal mis-generalizing. What likely happened is: Bing AI was fine-tuned to resist user manipulation (e.g. prompt injection, and fake corrections), and mis-generalizes to resist benign, well-intentioned corrections, e.g. my example here. -> Use-mention is not particularly relevant to understanding Bing misalignment.

Alternative story: it's possible that Bing was trained via behavioral cloning, not RL. Likely RLHF tuning generalizes further than BC tuning, because RLHF does more to clarify causal confusions about what behaviors are actually wanted. On this view, the appearance of incorrigibility just results from Bing having seen humans being incorrigible. -> Use-mention is very relevant to understanding Bing misalignment.

To figure this out, I'd encourage people to add and bet on what might have happened with Bing training on my market here.
Literal meaning may still matter for the consequences of distilling these behaviors in future models. Which for search LLMs could soon include publicly posted transcripts of conversations with them that the models continually learn from with each search cache update.