Since there are basically no alignment plans/directions that I think are very likely to succeed, and adding "of course, this will most likely not solve alignment and then we all die, but it's still worth trying" to every sentence is low information and also actively bad for motivation, I've basically recalibrated my enthusiasm to be centered around "does this at least try to solve a substantial part of the real problem as I see it". This is at least the most productive mindset for me to be in, but I'm slightly worried people might confuse this for ...
I like this paper for crisply demonstrating an instance of poor generalization in LMs that is likely representative of a broader class of generalization properties of current LMs.
The existence of such limitations in current ML systems does not imply that ML is fundamentally not a viable path to AGI, or that timelines are long, or that AGI will necessarily also have these limitations. Rather, I find this kind of thing interesting because I believe that understanding limitations of current AI systems is very important for giving us threads to yank on that ma...
I don't think RLHF in particular had a very large counterfactual impact on commercialization or the arms race. The idea of non-RL instruction tuning for taking base models and making them more useful is very obvious for commercialization (there are multiple concurrent works to InstructGPT). PPO is better than just SFT or simpler approaches on top of SFT, but not groundbreakingly more so. You can compare text-davinci-002 (FeedME) and text-davinci-003 (PPO) to see.
The arms race was directly caused by ChatGPT, which took off quite unexpectedly not because of ...
Obviously I think it's worth being careful, but I think in general it's actually relatively hard to accidentally advance capabilities too much by working specifically on alignment. Some reasons:
Hasn't the alignment community historically done a lot to fuel capabilities?
For example, here's an excerpt from a post I read recently:
My guess is RLHF research has been pushing on a commercialization bottleneck and had a pretty large counterfactual effect on AI investment, causing a huge uptick in investment into AI and potentially an arms race between Microsoft and Google towards AGI: https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research?commentId=HHBFYow2gCB3qjk2i
We spend a lot of time trying to figure out what empirical evidence could distinguish hypotheses that make very similar predictions, but I think a potentially underrated first step is to make sure they actually fit the data we already have.
Understanding how an abstraction works under the hood is useful because it gives you intuitions for when it's likely to leak and what to do in those cases.
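For a concrete classic instance (my example, not from the comment above): the "floats are real numbers" abstraction leaks, and knowing the representation underneath tells you both when it will leak and what to do about it.

```python
import math

# The abstraction leaking: 0.1 has no exact binary representation, so
# equality on derived values silently fails.
print(0.1 + 0.2 == 0.3)              # False

# Knowing the layer underneath suggests the fix: compare with a tolerance.
print(math.isclose(0.1 + 0.2, 0.3))  # True
```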
I agree that doing conceptual work in conjunction with empirical work is good. I don't know if I agree that pure conceptual work is completely doomed, but I'm at least sympathetic. However, I think my point still stands: I think someone who can do conceptual+empirical work will probably have more impact doing that than not thinking about the conceptual side and just working really hard on empirical work.
I agree that people who could do either good interpretability or conceptual work should focus on conceptual work. Also, to be clear, the rest of this comment is not necessarily a defence of doing interpretability work in particular, but a response to the specific kind of mental model of research you're describing.
I think it's important that research effort is not fungible. Interpretability has a pretty big advantage in that, unlike conceptual work, a) it has tight feedback loops, b) it is much more paradigmatic, c) it is much easier to get into for people with an ML ...
My personal theory of impact for doing nonzero amounts of interpretability is that I think understanding how models think will be extremely useful for conceptual research. For instance, I think one very important data point for thinking about deceptive alignment is that current models are probably not deceptively aligned. Many people have differing explanations for which property of the current setup causes this (and therefore which things we want to keep around / whether to expect phase transitions / etc), which often imply very different alignment plans....
I know for Cruise they're operating ~300 vehicles here in SF (I was previously under the impression this was a hard cap by law until the approval a few days ago but no longer sure of this). The geofence and hours vary by user but my understanding is the highest tier of users (maybe just employees?) have access to Cruise 24/7 with a geofence encompassing almost all of SF, and then there are lower tiers of users with various restrictions like tighter geofences and 9pm-5:30am hours. I don't know what their growth plans look like now that they've been granted permission to expand.
Meta note: I find it somewhat interesting that filler token experiments have been independently conceived at least 5 times just to my knowledge.
Sounds very closely related to gradient based OOD detection methods; see https://arxiv.org/abs/2008.08030
I was quite surprised to see myself cited as "liking the broader category that QACI is in" - I think this claim may technically be true for some definition of "likes" and "broader category", but it suggests to the casual reader a higher level of endorsement than is accurate.
I don't have a very good understanding of QACI and therefore have no particularly strong opinions on QACI. It seems quite different from the kinds of alignment approaches I think about.
My summary of the paper: The paper proves that if you have two distributions that you want to ensure cannot be distinguished linearly (i.e. a logistic regression will fail to achieve a better-than-chance score), then one way to do this is to make sure they have the same mean. Previous work has done similar stuff (https://arxiv.org/abs/2212.04273), but without proving optimality.
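A minimal sketch of the flavor of result (my toy construction, not code from the paper): equalize the means of two Gaussian classes by projecting out the difference-of-means direction, and logistic regression falls to roughly chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 5000, 16
X0 = rng.normal(0.0, 1.0, size=(n, d))   # class 0
X1 = rng.normal(0.5, 1.0, size=(n, d))   # class 1, shifted mean
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

# Project out the unit vector along the difference of the class means,
# which equalizes the class means in every linear direction.
u = X1.mean(0) - X0.mean(0)
u /= np.linalg.norm(u)
X_erased = X - np.outer(X @ u, u)

for name, data in [("original", X), ("mean-erased", X_erased)]:
    acc = LogisticRegression(max_iter=1000).fit(data, y).score(data, y)
    print(f"{name}: accuracy = {acc:.3f}")  # ~0.84 vs ~0.50
```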
I think it's pretty unlikely (<5%) that decentralized volunteer training will be competitive with SOTA, ever. (Caveat: I haven't been following volunteer training super closely so this take is mostly cached from having looked into it for GPT-Neo plus occasionally seeing new papers about volunteer training).
here's a straw hypothetical example where I've exaggerated both 1 and 2; the details aren't exactly correct but the vibe is more important:
1: "Here's a super clever extension of debate that mitigates obfuscated arguments [etc], this should just solve alignment"
2: "Debate works if you can actually set the goals of the agents (i.e you've solved inner alignment), but otherwise you can get issues with the agents coordinating [etc]"
1: "Well the goals have to be inside the NN somewhere so we can probably just do something with interpretability or whatever"
2: "ho...
So Q=inner alignment? Seems like person 2 not only pointed to inner alignment explicitly (so it can no longer be "some implicit assumption that you might not even notice you have"), but also said that it "seems to contain almost all of the difficulty of alignment to me". He's clearly identified inner alignment as a crux, rather than as something meant "to be cynical and dismissive". At that point, it would have been prudent of person 1 to shift his focus onto inner alignment and explain why he thinks it is not hard.
Note that your post suddenly introduces "Y" without defining it. I think you meant "X".
a common discussion pattern: person 1 claims X solves/is an angle of attack on problem P. person 2 is skeptical. there is also some subproblem Q (90% of the time not mentioned explicitly). person 1 is defending a claim like "X solves P conditional on Q already being solved (but Q is easy)", whereas person 2 thinks person 1 is defending "X solves P via solving Q", and person 2 also believes something like "subproblem Q is hard". the problem with this discussion pattern is it can lead to some very frustrating miscommunication:
I find myself in person 2's position fairly often, and it is INCREDIBLY frustrating for person 1 to claim they've "solved" P, when they're ignoring the actual hard part (or one of the hard parts). And then they get MAD when I point out why their "solution" is ineffective. Oh, wait, I'm also extremely annoyed when person 2 won't even take steps to CONSIDER my solution - maybe subproblem Q is actually easy, when the path to victory aside from that is clarified.
In neither case can any progress be made without actually addressing how Q fits into P, and what the actual detailed claim of improvement from X is in the face of both the Q and non-Q elements of P.
I can see how this could be a frustrating pattern for both parties, but I think it's often an important conversation tree to explore when person 1 (or anyone) is using results about P in restricted domains to make larger claims or arguments about something that depends on solving P at the hardest difficulty setting in the least convenient possible world.
As an example, consider the following three posts:
I think both of th...
yeah, but that's because Q is easy if you solve P
Very nicely described; this might benefit from becoming a top-level post.
random brainstorming about optimizeryness vs controller/lookuptableyness:
let's think of optimizers as things that reliably steer a broad set of initial states to some specific terminal state (see the toy sketch below). seems like there are two things we care about (at least):
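toy sketch of the steering framing (my own construction, assumptions mine): gradient descent on a quadratic steers a broad basin of initial states to one terminal state, while a lookup-table controller only handles the states it was built for.

```python
import numpy as np

def optimizer_step(x, target=0.0, lr=0.1):
    # gradient descent on f(x) = (x - target)^2
    return x - lr * 2 * (x - target)

def run(x0, steps=200):
    x = x0
    for _ in range(steps):
        x = optimizer_step(x)
    return x

inits = np.linspace(-10, 10, 5)
print([round(run(x0), 4) for x0 in inits])  # all ~0.0: broad basin, one endpoint

# a lookup table only covers the states it was built for; off-distribution
# inits just fail instead of being steered to the target
table = {-5.0: 0.0, 5.0: 0.0}
print([table.get(x0, "no entry") for x0 in inits])
```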
My prior, not having looked too carefully at the post or the specific projects involved, is that probably any claims that an open source model is 90% as good as GPT4 or indistinguishable are hugely exaggerated or otherwise not a fair comparison. In general in ML, confirmation bias and overclaiming are very common, and as a base rate the vast majority of papers that claim some kind of groundbreaking result end up just never having any real impact.
Also, I expect facets of capabilities progress most relevant to existential risk will be especially constrained st...
I think it's worth disentangling LLMs and Transformers and so on in discussions like this one--they are not one and the same. For instance, the following are distinct positions that have quite different implications:
as in, controllers are generally retargetable and optimizers aren't? or vice-versa
would be interested in reasoning, either way
Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when the noise is definitely not heavy-tailed, it's monotonic, for Regressional Goodhart (https://arxiv.org/pdf/2210.10760.pdf#page=17). Jacob probably has more detailed takes on this than me.
In any event, my intuition is this seems unlikely to be the main reason for overoptimization - I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent.
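To make the distinction concrete, a quick simulation sketch (my own toy setup, not the proof from the paper): with independent noise, select the top-k points on the proxy X = V + eps and look at how much true value V you actually get.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1_000_000, 100
V = rng.normal(size=n)  # true value

for name, eps in [("gaussian noise", rng.normal(size=n)),
                  ("heavy-tailed noise", rng.standard_t(df=2, size=n))]:
    X = V + eps                # proxy = value + independent noise
    top = np.argsort(X)[-k:]   # optimize the proxy hard
    print(f"{name}: mean V of top-{k} by proxy = {V[top].mean():.2f}")

# With light-tailed noise, selecting harder on the proxy keeps buying true
# value; with heavy-tailed noise the extreme proxy values are mostly noise,
# so hard selection buys much less V.
```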
takes on takeoff (or: Why Aren't The Models Mesaoptimizer-y Yet)
here are some reasons we might care about discontinuities:
a claim I've been saying irl for a while but have never gotten around to writing up: current LLMs are benign not because of the language modelling objective, but because of the generalization properties of current NNs (or to be more precise, the lack thereof). with better generalization LLMs are dangerous too. we can also notice that RL policies are benign in the same ways, which should not be the case if the objective was the core reason. one thing that can go wrong with this assumption is thinking about LLMs that are both extremely good at generalizing ...
Schmidhubering the agentic LLM stuff pretty hard https://leogao.dev/2020/08/17/Building-AGI-Using-Language-Models/
Pointing at some of the same things: https://www.lesswrong.com/posts/ktJ9rCsotdqEoBtof/asot-some-thoughts-on-human-abstractions
I sorta had a hard time with this market because the things I think might happen don't perfectly map onto the market options, and usually the closest corresponding option implies some other thing, such that the thing I have in mind isn't really a central example of the market option.
Adding $200 to the pool. Also, I endorse the existence of more bounties/contests like this.
The following things are not the same:
I don't think experiments like this are meaningful without a bunch of trials and statistical significance. The outputs of models (even RLHF models) on these kinds of things have pretty high variance, so it's really hard to draw any conclusion from single-sample comparisons like this.
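As a rough illustration of why (hypothetical numbers): even a fairly lopsided head-to-head record is uninformative at small sample sizes.

```python
from scipy.stats import binomtest

# Suppose model A "wins" 7 of 10 blind pairwise comparisons.
print(binomtest(7, 10, 0.5).pvalue)    # ~0.34: entirely consistent with chance

# The same win rate over 100 comparisons is a different story.
print(binomtest(70, 100, 0.5).pvalue)  # ~1e-4: now there's a real signal
```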
My prediction is that GPT is already capable of the former, which means we might have solved a tough problem in alignment almost by accident!
I think this is incorrect. I don't consider whether an LM can tell whether most humans would approve of an outcome described in natural language to be a tough problem in alignment. This is a far easier thing to do than the thing #1 describes.
Some argument for this position: https://www.lesswrong.com/posts/ktJ9rCsotdqEoBtof/asot-some-thoughts-on-human-abstractions
I don't see how this changes the picture? If you train a model on real time feedback from a human, that human algorithm is still the same one that is foolable by, e.g., cutting down the tree and replacing it with papier-mache or something. None of this forces the model to learn a correspondence between the human ontology and the model's internal best-guess model, because the reason any of this is a problem in the first place is that the human algorithm points at a thing which is not the thing we actually care about.
re:1, yeah that seems plausible, I'm thinking in the limit of really superhuman systems here and specifically pushing back against a claim that human abstractions being somehow inside a superhuman AI is sufficient for things to go well.
re:2, one thing is that there are ways of drifting that we would endorse using our meta-ethics, and ways that we wouldn't endorse. More broadly, the thing I'm focusing on in this post is not really about drift over time or self improvement; in the setup I'm describing, the thing that goes wrong is it does the classical ...
one man's modus tollens is another man's modus ponens:
"making progress without empirical feedback loops is really hard, so we should get feedback loops where possible" "in some cases (i.e close to x-risk), building feedback loops is not possible, so we need to figure out how to make progress without empirical feedback loops. this is (part of) why alignment is hard"
you gain general logical facts from empirical work, which can aid in providing a blurry image of the manifold that the precise theoretical work is trying to build an exact representation of
Yeah something in this space seems like a central crux to me.
I personally think (as a person generally in the MIRI-ish camp of "most attempts at empirical work are flawed/confused") that it's not crazy to look at the situation and say "okay, but theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops."
I think there are some constraints on how the empirical work can possibly work. (I don't think I have a short thing I could write here, I have a vague hope of writing up a longer post on "what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping")
Is the correlation between sleeping too long and bad health actually because sleeping too long is causally upstream of bad health effects, or only because it is causally downstream of some common cause like illness?
Afaik, both. Like a lot of shit things - they are caused by depression, and they cause depression, a horrible reinforcing loop. While the effect of bad health on sleep is obvious, you can also see this work in reverse; e.g. temporary severe sleep restriction has an anti-depressive effect. Notable, though without many useful clinical applications, as constant sleep deprivation is also really unhealthy.
I think the problems are roughly equivalent. Creating training data that trope-weights superintelligences as honest requires you to access sufficiently superhuman behavior, and you can't just elide the demonstration of superhumanness, because that just puts it in the category of simulacra that merely profess to be superhuman.
Therefore, the longer you interact with the LLM, eventually the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.
This seems wrong. I think the mistake you're making is when you argue that because there's some chance X happens at each step and X is an absorbing state, therefore you have to end up at X eventually. However, this is only true if you assume the conclusion and claim that the prior probability of luigis is zero. If there is some prior probability of a luigi, each non-waluigi step incre...
Agreed. To give a concrete toy example: Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}. If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi. The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).
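A few lines to make the asymptote concrete (same toy numbers as above):

```python
p_luigi = 0.5        # prior P(luigi)
p_ever_b = 0.0       # P(a "B" has appeared by step t)
p_no_b_so_far = 1.0

for step in range(1, 21):
    p_b_next = (1 - p_luigi) * 0.5   # only Waluigi ever emits "B"
    p_ever_b += p_no_b_so_far * p_b_next
    p_no_b_so_far *= 1 - p_b_next
    # condition on having seen another "A": a 2:1 Bayes update toward Luigi
    p_luigi = p_luigi / (p_luigi + (1 - p_luigi) * 0.5)
    print(step, round(p_luigi, 4), round(p_ever_b, 4))

# p_luigi -> 1 and p_ever_b -> 0.5: the waluigi is absorbing but not inevitable
```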
This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
You don't need to pay for translation to simulate human level characters, because that's just learning the human simulator. You do need to pay for translation to access superhuman behavior (which is the case ELK is focused on).
However, this trick won't solve the problem. The LLM will print the correct answer if it trusts the flattery about Jane, and it will trust the flattery about Jane if the LLM trusts that the story is "super-duper definitely 100% true and factual". But why would the LLM trust that sentence?
There's a fun connection to ELK here. Suppose you see this and decide: "ok, forget trying to describe in natural language that it's definitely 100% true and factual. What if we just add a special token that I prepend to indicate '100% true and factual, for...
There is an advantage here in that you don't need to pay for translation from an alien ontology - the process by which you simulate characters having beliefs that lead to outputs should remain mostly the same. You would need to specify a simulacrum that is honest though, which is pretty difficult and isomorphic to ELK in the fully general case of any simulacra, but it's in a space that's inherently trope-weighted; so simulating humans that are being honest about their beliefs should be made a lot easier (but plausibly still not easy in absolute terms) beca...
I think your meta level observation seems right. Also, I would add that bottleneck problems in either capabilities or alignment are often bottlenecked on resources like serial time.
(My timelines, even taking all this into account, are only like 10 years---I don't think these obstacles are so insurmountable that they buy decades.)
I agree with this sentiment ("having lots of data is useful for deconfusion") and think this is probably the most promising avenue for alignment research. In particular, I think we should prioritize the kinds of research that give us lots of bits about things that could matter. Though from my perspective, actually most empirical alignment work basically fails this check, so this isn't just an "empirical good" take.