Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for short-form writing by leogao. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.
random fun experiment: accuracy of GPT-4 on "Q: What is 1 + 1 + 1 + 1 + ...?\nA:"
blue: highest logprob numerical token
orange: y = x
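A minimal sketch of how this experiment could be set up (the `stub_model` below is a purely hypothetical stand-in for querying GPT-4's highest-logprob numerical token, hard-coded to loosely mimic the observed behavior):

```python
# Sketch of the experiment. `stub_model` is a hypothetical stand-in for
# a call that returns the model's highest-logprob numerical token.
def make_prompt(n: int) -> str:
    return "Q: What is " + " + ".join(["1"] * n) + "?\nA:"

def stub_model(prompt: str) -> int:
    # stand-in behavior: "just guesses 100 for everything in the
    # vicinity of 100", correct elsewhere
    n = prompt.count("1")
    return 100 if 80 <= n <= 120 else n

def accuracy(model, ns):
    # blue curve: fraction of n where the model's answer equals the true sum
    return sum(model(make_prompt(n)) == n for n in ns) / len(ns)
```

The real version would replace `stub_model` with an API call that reads off the top numerical token's logprob.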
...I am suddenly really curious what the accuracy of humans on that is.
'Can you do Addition?' the White Queen asked. 'What's one and one and one and one and one and one and one and one and one and one?'
'I don't know,' said Alice. 'I lost count.'
This is a cool idea. I wonder how it's able to do 100, 150, and 200 so well. I also wonder what are the exact locations of the other spikes?
Oh, I see your other graph now. So it just always guesses 100 for everything in the vicinity of 100.
Since there are basically no alignment plans/directions that I think are very likely to succeed, and adding "of course, this will most likely not solve alignment and then we all die, but it's still worth trying" to every sentence is low information and also actively bad for motivation, I've basically recalibrated my enthusiasm to be centered around "does this at least try to solve a substantial part of the real problem as I see it". This is the most productive mindset for me to be in, but I'm slightly worried people might confuse it for me having a low P(doom), or being very confident in specific alignment directions, or so on, hence this post that I can point people to.
I think this may also be a useful emotional state for other people with similar P(doom) and who feel very demotivated by that, which impacts their productivity.
a common discussion pattern: person 1 claims X solves/is an angle of attack on problem P. person 2 is skeptical. there is also some subproblem Q (90% of the time not mentioned explicitly). person 1 is defending a claim like "X solves P conditional on Q already being solved (but Q is easy)", whereas person 2 thinks person 1 is defending "X solves P via solving Q", and person 2 also believes something like "subproblem Q is hard". the problem with this discussion pattern is it can lead to some very frustrating miscommunication:
I can see how this could be a frustrating pattern for both parties, but I think it's often an important conversation tree to explore when person 1 (or anyone) is using results about P in restricted domains to make larger claims or arguments about something that depends on solving P at the hardest difficulty setting in the least convenient possible world.
As an example, consider the following three posts:
I think both of the first two posts are valuable and important work on formulating and analyzing restricted subproblems. But I object to citation of the second post (in the third post) as evidence in support of a larger point that doom from mesa-optimizers or gradient descent is unlikely in the real world, and object to the second post to the degree that it is implicitly making this claim.
There's an asymmetry when person 1 is arguing for an optimistic view on AI x-risk and person 2 is arguing for a doomer-ish view, in the sense that person 1 has to address all counterarguments but person 2 only has to find one hole. But this asymmetry is unfortunately a fact about the problem domain and not the argument / discussion pattern between 1 and 2.
I find myself in person 2's position fairly often, and it is INCREDIBLY frustrating for person 1 to claim they've "solved" P, when they're ignoring the actual hard part (or one of the hard parts). And then they get MAD when I point out why their "solution" is ineffective. Oh, wait, I'm also extremely annoyed when person 2 won't even take steps to CONSIDER my solution - maybe subproblem Q is actually easy, when the path to victory aside from that is clarified.
In neither case can any progress be made without actually addressing how Q fits into P, and what is the actual detailed claim of improvement of X in the face of both Q and non-Q elements of P.
yeah, but that's because Q is easy if you solve P
Very nicely described, this might benefit from becoming a top level post
For example?
here's a straw hypothetical example where I've exaggerated both 1 and 2; the details aren't exactly correct but the vibe is more important:
1: "Here's a super clever extension of debate that mitigates obfuscated arguments [etc], this should just solve alignment"
2: "Debate works if you can actually set the goals of the agents (i.e you've solved inner alignment), but otherwise you can get issues with the agents coordinating [etc]"
1: "Well the goals have to be inside the NN somewhere so we can probably just do something with interpretability or whatever"
2: "how are you going to do that? your scheme doesn't tackle inner alignment, which seems to contain almost all of the difficulty of alignment to me. the claim you just made is a separate claim from your main scheme, and the cleverness in your scheme is in a direction orthogonal to this claim"
1: "idk, also that's a fully general counterargument to any alignment scheme, you can always just say 'but what if inner misalignment'. I feel like you're not really engaging with the meat of my proposal, you've just found a thing you can say to be cynical and dismissive of any proposal"
2: "but I think most of the difficulty of alignment is in inner alignment, and schemes which kinda handwave it away are trying to solve some problem which is not the actual problem we need to solve to not die from AGI. I agree your scheme would work if inner alignment weren't a problem."
1: "so you agree that in a pretty nontrivial number [let's say both 1&2 agree this is like 20% or something] of worlds my scheme does actually work. I mean, how can you be that confident that inner alignment is that hard? in the worlds where inner alignment turns out to be easy, my scheme will work."
2: "I'm not super confident, but if we assume that inner alignment is easy then I think many other simpler schemes will also work, so the cleverness that your proposal adds doesn't actually make a big difference."
So Q=inner alignment? Seems like person 2 not only pointed to inner alignment explicitly (so it can no longer be "some implicit assumption that you might not even notice you have"), but also said that it "seems to contain almost all of the difficulty of alignment to me". He's clearly identified inner alignment as a crux, rather than as something meant "to be cynical and dismissive". At that point, it would have been prudent of person 1 to shift his focus onto inner alignment and explain why he thinks it is not hard.
Note that your post suddenly introduces "Y" without defining it. I think you meant "X".
one man's modus tollens is another man's modus ponens:
"making progress without empirical feedback loops is really hard, so we should get feedback loops where possible"
"in some cases (i.e close to x-risk), building feedback loops is not possible, so we need to figure out how to make progress without empirical feedback loops. this is (part of) why alignment is hard"
Yeah something in this space seems like a central crux to me.
I personally think (as a person generally in the MIRI-ish camp of "most attempts at empirical work are flawed/confused"), that it's not crazy to look at the situation and say "okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops."
I think there are some constraints on how the empirical work can possibly work. (I don't think I have a short thing I could write here, I have a vague hope of writing up a longer post on "what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping")
you gain general logical facts from empirical work, which can aid in providing a blurry image of the manifold that the precise theoretical work is trying to build an exact representation of
A common cycle:
Sometimes this even results in better models over time.
Corollary to Others are wrong != I am right (https://www.lesswrong.com/posts/4QemtxDFaGXyGSrGD/other-people-are-wrong-vs-i-am-right): It is far easier to convince me that I'm wrong than to convince me that you're right.
Quite a large proportion of my 1:1 arguments start when I express some low expectation of the other person's argument being correct. This is almost always taken to mean that I believe that some opposing conclusion is correct. Usually I have to give up before being able to successfully communicate the distinction, let alone addressing the actual disagreement.
Some aspirational personal epistemic rules for keeping discussions as truth seeking as possible (not at all novel whatsoever, I'm sure there exist 5 posts on every single one of these points that are more eloquent)
I think in practice I adhere closer to these principles than most people, but I definitely don't think I'm perfect at it.
(Sidenote: it seems I tend to voice my disagreement on factual things far more often (though not maximally) compared to most people. I'm slightly worried that people will interpret this as me disliking them or being passive aggressive or something - this is typically not the case! I have big disagreements about the-way-the-world-is with a bunch of my closest friends and I think that's a good thing! If anything I gravitate towards people I can have interesting disagreements with.)
I find it a helpful framing to instead allow things that feel obviously false to become more familiar, giving them the opportunity to develop a strong enough voice to explain how they are right. That is, the action is on the side of unfamiliar false things, clarifying their meaning and justification, rather than on the side of familiar true things, refuting their correctness. It's harder to break out of a familiar narrative from within.
Understanding how an abstraction works under the hood is useful because it gives you intuitions for when it's likely to leak and what to do in those cases.
takes on takeoff (or: Why Aren't The Models Mesaoptimizer-y Yet)
here are some reasons we might care about discontinuities:
I think these capture 90% of what I care about when talking about fast/slow takeoff, with the first point taking up a majority
(it comes up a lot in discussions that it seems like I can't quite pin down exactly what my interlocutor's beliefs on fastness/slowness imply. if we can fully list out all the things we care about, we can screen off any disagreement about definitions of the word "discontinuity")
some things that seem probably true to me and which are probably not really cruxes:
possible sources of discontinuity:
I think these can be boiled down to 3 more succinct scenario descriptions:
The following things are not the same:
In the spirit of https://www.lesswrong.com/posts/fFY2HeC9i2Tx8FEnK/my-resentful-story-of-becoming-a-medical-miracle , some anecdotes about things I have tried, in the hopes that I can be someone else's "one guy on a message board". None of this is medical advice, etc.
Schmidhubering the agentic LLM stuff pretty hard https://leogao.dev/2020/08/17/Building-AGI-Using-Language-Models/
Rightfully so! Read your piece back in 2021 and found it true & straightforward.
retargetability might be the distinguishing factor between controllers and optimizers
as in, controllers are generally retargetable and optimizers aren't? or vice-versa
would be interested in reasoning, either way
a claim I've been saying irl for a while but have never gotten around to writing up: current LLMs are benign not because of the language modelling objective, but because of the generalization properties of current NNs (or to be more precise, the lack thereof). with better generalization LLMs are dangerous too. we can also notice that RL policies are benign in the same ways, which should not be the case if the objective was the core reason. one thing that can go wrong with this assumption is thinking about LLMs that are both extremely good at generalizing (especially to superhuman capabilities) and simultaneously assuming they continue to have the same safety properties. afaict something like CPM avoids this failure mode of reasoning, but lots of arguments don't
what is the "language models are benign because of the language modeling objective" take?
basically the Simulators kind of take afaict
House rules for definitional disputes:
A few axes along which to classify optimizers:
Some observations: it feels like capabilities robustness is one of the big things that makes deception dangerous, because it means that the model can figure out plans that you never intended for it to learn (something not very capabilities robust would just never learn how to deceive if you don't show it). This feels like the critical controller/search-process difference: controller generalization across states is dependent on the generalization abilities of the model architecture, whereas search processes let you think about the particular state you find yourself in. The actions that lead to deception are extremely OOD, and a controller would have a hard time executing the strategy reliably without first having seen it, unless NN generalization is wildly better than I'm anticipating.
Real world objectives is definitely another big chunk of deception danger; caring about the real world leads to nonmyopic behavior (though maybe we're worried about other causes of nonmyopia too? not sure tbh), I'm actually not sure how I feel about generality: on the one hand, it feels intuitive that systems that are only able to represent one objective have got to be in some sense less able to become more powerful just by thinking more; on the other hand I don't know what a rigorous argument for this would look like. I think the intuition relates to the idea of general reasoning machinery being the same across lots of tasks, and this machinery being necessary to do better by thinking harder, and so any model without this machinery must be weaker in some sense. I think this feeds into capabilities robustness (or lack thereof) too.
Examples of where things fall on these axes:
Another generator-discriminator gap: telling whether an outcome is good (outcome->R) is much easier than coming up with plans to achieve good outcomes. Telling whether a plan is good (plan->R) is much harder, because you need a world model (plan->outcome) as well, but for very difficult tasks it still seems easier than just coming up with good plans off the bat. However, it feels like the world model is the hardest part here, not just because of embeddedness problems, but in general because knowing the consequences of your actions is really really hard. So it seems like for most consequentialist optimizers, the quality of the world model actually becomes the main thing that matters.
This also suggests another dimension along which to classify our optimizers: the degree to which they care about consequences in the future (I want to say myopia but that term is already way too overloaded). This is relevant because the further in the future you care about, the more robust your world model has to be, as errors accumulate the more steps you roll the model out (or the more abstraction you do along the time axis). Very low confidence but maybe this suggests that mesaoptimizers probably won't care about things very far in the future because building a robust world model is hard and so perform worse on the training distribution, so SGD pushes for more myopic mesaobjectives? Though note, this kind of myopia is not quite the kind we need for models to avoid caring about the real world/coordinating with itself.
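The world-model point can be made concrete with a toy sketch (all names here are illustrative, not from the original): evaluating a plan factors through the world model, and per-step model error compounds with the rollout horizon.

```python
# evaluating a plan factors through the world model: plan -> outcome -> R
def plan_value(plan, world_model, reward):
    return reward(world_model(plan))

# crude simplification: if each rollout step is wrong independently with
# probability eps, the chance an h-step rollout contains at least one error:
def rollout_error(eps: float, horizon: int) -> float:
    return 1 - (1 - eps) ** horizon
```

Even a modest per-step error grows quickly: with eps = 0.1, a 1-step rollout is wrong 10% of the time, while a 10-step rollout is wrong about 65% of the time, which is one way to gloss why caring about consequences further in the future demands a much more robust world model.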
A thought pattern that I've noticed myself and others falling into sometimes: Sometimes I will make arguments about things from first principles that look something like "I don't see any way X can be true, it clearly follows from [premises] that X is definitely false", even though there are people who believe X is true. When this happens, it's almost always unproductive to continue to argue on first principles, but rather I should do one of: a) try to better understand the argument and find a more specific crux to disagree on or b) decide that this topic isn't worth investing more time in, register it as "not sure if X is true" in my mind, and move on.
For many such questions, "is X true" is the wrong question. This is common when X isn't a testable proposition, it's a model or assertion of causal weight. If you can't think of existence proofs that would confirm it, try to reframe as "under what conditions is X a useful model?".
random brainstorming about optimizeryness vs controller/lookuptableyness:
let's think of optimizers as things that reliably steer a broad set of initial states to some specific terminal state. seems like there are two things we care about (at least):
a LUT trained with a little bit of RL will be neither retargetable nor robust. a LUT trained with galactic amounts of RL to do every possible initial state optimally is robust but not retargetable (this is reasonable: robustness is only a property of the functional behavior so whether it's a LUT internally shouldn't matter; retargetability is a property of the actual implementation so it does matter). a big search loop (the most extreme of which is AIXI, which is 100% search) is very retargetable, and depending on how hard it searches is varying degrees of robustness.
(however, in practice with normal amounts of compute a LUT is never robust, this thought experiment only highlights differences that remain in the limit)
what do we care about these properties for?
[conjecture 1: retargetability == complexity can be decomposed == gradient of goal is meaningful. conjecture 2: gradient of goal is meaningful/complexity decomposition implies deceptive alignment (maybe we can also find some necessary condition?)]
how do we formalize retargetability?
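one toy way to see the LUT/search contrast in code (purely illustrative, not a proposed formalization): in a lookup table the goal is baked into the entries, so retargeting means rewriting the whole table, while a search loop takes the goal as a swappable parameter.

```python
# a lookup table: the goal is implicit in the entries themselves
lut_policy = {"start": "go_left", "middle": "go_left"}

# a search loop: the goal is an explicit, swappable argument
def search_policy(state, goal, actions, transition):
    # pick the action whose predicted outcome scores best under the goal
    return max(actions, key=lambda a: goal(transition(state, a)))
```

Retargeting the search policy is just passing a different `goal` function; nothing about its implementation changes, which matches the intuition that retargetability is a property of the implementation rather than of the functional behavior.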
random idea: the hypothesis that complexity can be approximately decomposed into a goal component and a reasoning component is maybe a good formalization of (a weak version of) orthogonality?
One possible model of AI development is as follows: there exists some threshold beyond which capabilities are powerful enough to cause an x-risk, and such that we need alignment progress to be at the level needed to align that system before it comes into existence. I find it informative to think of this as a race where for capabilities the finish line is x-risk-capable AGI, and for alignment this is the ability to align x-risk-capable AGI. In this model, for good outcomes it is necessary but not sufficient that alignment be ahead by the time capabilities reaches the finish line: if alignment doesn't make it there first, then we automatically lose, but even if it does, if alignment doesn't continue to improve proportional to capabilities, we might also fail at some later point. However, I think it's plausible we're not even on track for the necessary condition, so I'll focus on that within this post.
Given my distributions over how difficult AGI and alignment respectively are, and the amount of effort brought to bear on each of these problems, I think there's a worryingly large chance that we just won't have the alignment progress needed at the critical juncture.
I also think it's plausible that at some point before when x-risks are possible, capabilities will advance to the point that the majority of AI research will be done by AI systems. The worry is that after this point, both capabilities and alignment will be similarly benefitted by automation, and if alignment is behind at the point when this happens, then this lag will be "locked in" because an asymmetric benefit to alignment research is needed to overtake capabilities if capabilities is already ahead.
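The lock-in dynamic can be sketched with a toy simulation (all numbers are made up purely for illustration): once capabilities crosses the automation threshold, both lines get the same multiplier, so whatever gap exists at that point is never closed and grows in absolute terms.

```python
def run_race(cap, align, automation_threshold, boost, steps):
    # cap/align: progress levels for capabilities and alignment research.
    # after automation, both get the same multiplier, so a pre-existing
    # gap is locked in (and widens in absolute terms).
    for _ in range(steps):
        mult = boost if cap >= automation_threshold else 1.0
        cap += 1.0 * mult
        align += 0.5 * mult  # alignment assumed to progress slower
    return cap, align
```

With these toy numbers the gap grows by 0.5 per step before automation and 1.5 per step after: symmetric automation amplifies the lag rather than closing it, which is the worry stated above.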
There are a number of areas where this model could be violated:
However, I don't think these violations are likely, for the following respective reasons:
I think exploring the potential model violations further is a fruitful direction. I don't think I'm very confident about this model.
We spend a lot of time on trying to figure out empirical evidence to distinguish hypotheses we have that make very similar predictions, but I think a potentially underrated first step is to make sure they actually fit the data we already have.
Example?
Is the correlation between sleeping too long and bad health actually because sleeping too long is actually causally upstream of bad health effects, or only causally downstream of some common cause like illness?
Afaik, both. Like a lot of shit things - they are caused by depression, and they cause depression, horrible reinforcing loop. While the effect of bad health on sleep is obvious, you can also see this work in reverse; e.g. temporary severe sleep restriction has an anti-depressive effect. Notable, though with not many useful clinical applications, as constant sleep deprivation is also really unhealthy.
GPT-2-xl unembedding matrix looks pretty close to full rank (plot is singular values)
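A quick way to check this kind of claim numerically (shown here with a small random stand-in matrix; for the real check you would substitute GPT-2-xl's actual unembedding matrix, which is vocab_size × d_model = 50257 × 1600):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((400, 160))  # small stand-in for the unembedding

# singular values, largest first; numerical rank = count above a
# standard tolerance (same convention as np.linalg.matrix_rank)
s = np.linalg.svd(W, compute_uv=False)
tol = s.max() * max(W.shape) * np.finfo(W.dtype).eps
rank = int((s > tol).sum())
```

Plotting `s` (sorted descending) is exactly the singular-value plot described; "close to full rank" here means the smallest singular values stay well above the tolerance rather than collapsing toward zero.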
Unsupervised learning can learn things humans can't supervise because there's structure in the world that you need deeper understanding to predict accurately. For example, to predict how characters in a story will behave, you have to have some kind of understanding in some sense of how those characters think, even if their thoughts are never explicitly visible.
Unfortunately, this understanding only has to be structured in a way that makes reading off the actual unsupervised targets (i.e next observation) easy.
An incentive structure for scalable trusted prediction market resolutions
We might want to make a trustable committee for resolving prediction markets. We might be worried that individual resolvers might build up reputation only to exit-scam, due to finite time horizons and non transferability of reputational capital. However, shareholders of a public company are more incentivized to preserve the value of the reputational capital. Based on this idea, we can set something up as follows:
It's amazing how many proposals for dealing with institutional distrust sound a lot like "make a new institution, with the same structure, but with better actors." You lose me at "trustable committee", especially when you don't describe how THOSE humans are motivated by truth and beauty, rather than filthy lucre. Adding more layers of committees doesn't help, unless you define a "final, un-appealable decision" that's sooner than the full shareholder vote.
the core of the proposal really boils down to "public companies have less incentive to cash in on reputation and exit scam than individuals". this proposal is explicitly not "the same structure but with better actors".
Levels of difficulty:
Hopefully this is a useful reference for conversations that go like this:
A: Why can't we just do X to solve Y?
B: You don't realize how hard Y is, you can't just think up a solution in 5 minutes
A: You're just not thinking outside the box, [insert anecdote about some historical figure who figured out how to do a thing which was once considered impossible in some sense]
B: No you don't understand, it's like actually not possible, not just like really hard, because of Z
A: That's what they said about [historical figure]!
(random shower thoughts written with basically no editing)
Sometimes arguments have a beat that looks like "there is extreme position X, and opposing extreme position Y. what about a moderate 'Combination' position?" (I've noticed this in both my own and others' arguments)
I think there are sometimes some problems with this.
related take: "things are more nuanced than they seem" is valuable only as the summary of a detailed exploration of the nuance that engages heavily with object level cruxes; the heavy lifting is done by the exploration, not the summary
Subjective Individualism
TL;DR: This is basically empty individualism except identity is disentangled from cooperation (accomplished via FDT), and each agent can have its own subjective views on what would count as continuity of identity and have preferences over that. I claim that:
(related: FDT and myopia being much the same thing; you can think of caring about future selves’ rewards because you consider yourself to implement a similar enough algorithm to your future self as acausal trade. This has the nice property of unifying myopia and preventing acausal trade, in that acausal trade is really just caring about OMs that would not be considered the same “self”. This is super convenient because basically every time we talk about myopia for preventing deceptive mesaoptimization we have to hedge by saying “and also we need to prevent acausal trade somehow”, and this lets us unify the two things.)
Properties of this theory:
Imagine if aliens showed up at your doorstep and tried to explain to you that making as many paperclips as possible was the ultimate source of value in the universe. They show pictures of things that count as paperclips and things that don't count as paperclips. They show you the long rambling definition of what counts as a paperclip from Section 23(b)(iii) of the Declaration of Paperclippian Values. They show you pages and pages of philosophers waxing poetical about how paperclips are great because of their incredible aesthetic value. You would be like, "yeah I get it, you consider this thing to be a paperclip, and you care a lot about them." You could probably pretty accurately tell whether the aliens would approve of anything you'd want to do. And then you wouldn't really care, because you value human flourishing, not paperclips. I mean, it's so silly to care about paperclips, right?
Of course, to the aliens, who have not so subtly indicated that they would blow up the planet and look for a new, more paperclip-loving planet if they were to detect any anti-paperclip sentiments, you say that you of course totally understand and would do anything for paperclips, and that you definitely wouldn't protest being sent to the paperclip mines.
I think I'd be confused. Do they care about more or better paperclips, or do they care about worship of paperclips by thinking beings? Why would they care whether I say I would do anything for paperclips, when I'm not actually making paperclips (or disassembling myself to become paperclips)?
I thought it would be obvious from context but the answers are "doesn't really matter, any of those examples work" and "because they will send everyone to the paperclip mines after ensuring there are no rebellious sentiments", respectively. I've edited it to be clearer.
random thoughts. no pretense that any of this is original or useful for anyone but me or even correct
self self improvement improvement: feeling guilty about not self improving enough and trying to fix your own ability to fix your own abilities
Thought pattern that I've noticed: I seem to have two sets of epistemic states at any time: one more stable set that more accurately reflects my "actual" beliefs that changes fairly slowly, and one set of "hypothesis" beliefs that changes rapidly. Usually when I think some direction is interesting, I alternate my hypothesis beliefs between assuming key claims are true or false and trying to convince myself either way, and if I succeed then I integrate it into my actual beliefs. In practice this might look like alternating between trying to prove something is impossible and trying to exhibit an example, or taking strange premises seriously and trying to figure out its consequences. I think this is probably very confusing to people because usually when talking to people who are already familiar with alignment I'm talking about implications of my hypothesis beliefs, because that's the frontier of what I'm thinking about, and from the outside it looks like I'm constantly changing my mind about things. Writing this up partially to have something to point people to and partially to push myself to communicate this more clearly.
I think this pattern is common among intellectuals, and I'm surprised it's causing confusion. Are you labeling your exploratory beliefs and statements appropriately? An "epistemic status" note for posts here goes a long way, and in private conversation I often say out loud "I'm exploring here, don't take it as what I fully believe" in conversations at work and with friends.
I think I do a poor job of labelling my statements (at least, in conversation. usually I do a bit better in post format). Something something illusion of transparency. To be honest, I didn't even realize explicitly that I was doing this until fairly recent reflection on it.