All of leogao's Comments + Replies

I agree with this sentiment ("having lots of data is useful for deconfusion") and think this is probably the most promising avenue for alignment research. In particular, I think we should prioritize the kinds of research that give us lots of bits about things that could matter. Though from my perspective, most empirical alignment work actually basically fails this check, so this isn't just an "empirical good" take.

4 jacobjacob 5h
Reacted with "examples", but curious about examples/papers/etc both of things you think give lots of bits and things that don't. 

Since there are basically no alignment plans/directions that I think are very likely to succeed, and adding "of course, this will most likely not solve alignment and then we all die, but it's still worth trying" to every sentence is low information and also actively bad for motivation, I've basically recalibrated my enthusiasm to be centered around "does this at least try to solve a substantial part of the real problem as I see it". For me at least, this is the most productive mindset to be in, but I'm slightly worried people might confuse this for ... (read more)

I like this paper for crisply demonstrating an instance of poor generalization in LMs that is likely representative of a broader class of generalization properties of current LMs.

The existence of such limitations in current ML systems does not imply that ML is fundamentally not a viable path to AGI, or that timelines are long, or that AGI will necessarily also have these limitations. Rather, I find this kind of thing interesting because I believe that understanding limitations of current AI systems is very important for giving us threads to yank on that ma... (read more)

6 Vivek Hebbar 7d
What's "denormalization"?
9 Owain_Evans 9d
Great comment. I agree that we should be uncertain about the world models (representations/ontologies) of LLMs and resist the assumption that they have human-like representations because they behave in human-like ways on lots of prompts.  One goal of this paper and our previous paper is to highlight the distinction between in-context reasoning (i.e. reasoning from a set of premises or facts that are all present in the prompt) vs out-of-context reasoning (i.e. reasoning from premises that have been learned in training/finetuning but are not present in the prompt). Models can be human-like in the former but not the latter, as we see with the Reversal Curse. (Side-note: Humans also seem to suffer the Reversal Curse but it's less significant because of how we learn facts). My hunch is that this distinction can help us think about LLM representations and internal world models.
5 Dave Orr 9d
One oddity of LLMs is that we don't have a good way to tell the model that A is B in a way that it can remember. Prompts are not persistent, and as this paper shows, fine tuning doesn't do a good job of getting a fact into the model without doing a bunch of paraphrasing. Pretraining presumably works in a similar way. This is weird! And I think helps make sense of some of the problems we see with current language models.

I don't think RLHF in particular had a very large counterfactual impact on commercialization or the arms race. The idea of non-RL instruction tuning for taking base models and making them more useful is very obvious for commercialization (there are multiple concurrent works to InstructGPT). PPO is better than just SFT or simpler approaches on top of SFT, but not groundbreakingly more so. You can compare text-davinci-002 (FeedME) and text-davinci-003 (PPO) to see.

The arms race was directly caused by ChatGPT, which took off quite unexpectedly not because of ... (read more)

Answer by leogao, Sep 17, 2023, Ω245131

Obviously I think it's worth being careful, but I think in general it's actually relatively hard to accidentally advance capabilities too much by working specifically on alignment. Some reasons:

  1. Researchers of all fields tend to do this thing where they have really strong conviction in their direction and think everyone should work on their thing. Convincing them that some other direction is better is actually pretty hard even if you're trying to shove your ideas down their throats.
  2. Often the bottleneck is not that nobody realizes that something is a bott
... (read more)
7 Matt Goldenberg 15d
I think empirically EA has done a bunch to speed up capabilities accidentally. And I think theoretically we're at a point in history where simply sharing an idea can get it in the water supply faster than ever before. A list of unsolved problems, if one of them is both true and underappreciated, can have a big impact.

Hasn't the alignment community historically done a lot to fuel capabilities?

For example, here's an excerpt from a post I read recently

My guess is RLHF research has been pushing on a commercialization bottleneck and had a pretty large counterfactual effect on AI investment, causing a huge uptick in investment into AI and potentially an arms race between Microsoft and Google towards AGI: https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research?commentId=HHBFYow2gCB3qjk2i 

We spend a lot of time trying to figure out empirical evidence to distinguish hypotheses we have that make very similar predictions, but I think a potentially underrated first step is to make sure they actually fit the data we already have.

Understanding how an abstraction works under the hood is useful because it gives you intuitions for when it's likely to leak and what to do in those cases.

leogao 1mo Ω9180

Ran this on GPT-4-base and it gets 56.7% (n=1000)

2 cubefox 1mo
How?! I'm pretty sure the GPT-4 base model is not publicly available!
8 Ethan Perez 1mo
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? (I'd be interested to know both)
9 Quintin Pope 1mo
What about RLHF'd GPT-4?

I agree that doing conceptual work in conjunction with empirical work is good. I don't know if I agree that pure conceptual work is completely doomed but I'm at least sympathetic. However, I think my point still stands: I think someone who can do conceptual+empirical work will probably have more impact doing that than not thinking about the conceptual side and just working really hard on conceptual work.

  1. They may find some other avenue of empirical work that can help with alignment. I think probably there exist empirical avenues substantially more valuable
... (read more)
4 Richard_Ngo 2mo
(I assume that the last "conceptual" should be "empirical".) I agree that 'not thinking about the conceptual side" is bad. But that's just standard science. Like, top scientists in almost any domain aren't just thinking about their day-to-day empirical research, they have broader opinions about the field as a whole, and more speculative and philosophical ideas, and so on. The difference is whether they treat those ideas as outputs in their own right, versus as inputs that feed into some empirical or theoretical output. Most scientists do the latter; when people in alignment talk about "conceptual work" my impression is that they're typically thinking about the former.

I agree that people who could do either good interpretability or conceptual work should focus on conceptual work. Also, to be clear, the rest of this comment is not necessarily a defence of doing interpretability work in particular, but a response to the specific kind of mental model of research you're describing.

I think it's important that research effort is not fungible. Interpretability has some pretty big advantages over conceptual work: a) it has tight feedback loops, b) it's much more paradigmatic, c) it's much easier to get into for people with an ML ... (read more)

5 Richard_Ngo 2mo
This seems like a false dichotomy; in general I expect that the best conceptual work will be done in close conjunction with interpretability work or other empirical work. (In general I think that almost all attempts to do "conceptual" work that doesn't involve either empirical results or proofs is pretty doomed. I'd be interested in any counterexamples you've seen; my main counterexample is threat modeling, which is why I've been focusing a lot on that lately.) EDIT: many downvotes, no counterexamples. Please provide some.

My personal theory of impact for doing nonzero amounts of interpretability is that I think understanding how models think will be extremely useful for conceptual research. For instance, I think one very important data point for thinking about deceptive alignment is that current models are probably not deceptively aligned. Many people have differing explanations for which property of the current setup causes this (and therefore which things we want to keep around / whether to expect phase transitions / etc), which often imply very different alignment plans.... (read more)

2 Charbel-Raphaël 2mo
I completely agree that past interp research has been useful for my understanding of deep learning. But we are funding constrained.  The question now is "what is the marginal benefit of one hour of interp research compared to other types of research", and "whether we should continue to prioritize it given our current understanding and the lessons we have learned".

I know for Cruise they're operating ~300 vehicles here in SF (I was previously under the impression this was a hard cap by law until the approval a few days ago, but I'm no longer sure of this). The geofence and hours vary by user, but my understanding is the highest tier of users (maybe just employees?) have access to Cruise 24/7 with a geofence encompassing almost all of SF, and then there are lower tiers of users with various restrictions like tighter geofences and 9pm-5:30am hours. I don't know what their growth plans look like now that they've been granted permission to expand.

3 Daniel Kokotajlo 2mo
OK, thanks. I'll be curious to see how fast they grow. I guess I should admit that it does seem like ants are driving cars fairly well these days, so to speak. Any ideas on what tasks could be necessary for AI R&D automation, that are a lot harder than driving cars? So far I've got things like 'coming up with new paradigms' and 'having good research taste for what experiments to run.' That and long-horizon agency, though long-horizon agency doesn't seem super necessary.

Meta note: I find it somewhat interesting that filler token experiments have been independently conceived at least 5 times just to my knowledge.

2 Kshitij Sachan 2mo
huh interesting! Who else has also run filler token experiments? I was also interested in this experiment because it seemed like a crude way to measure how non-myopic LLMs are (i.e. what fraction of the forward pass is devoted to current vs future tokens). I wonder if other people were mostly coming at it from that angle.
4 dkirmani 2mo
Looks kinda similar, I guess. But their methods require you to know what the labels are, they require you to do backprop, they require you to know the loss function of your model, and it looks like their methods wouldn't work on arbitrarily-specified submodules of a given model, only the model as a whole. The approach in my post is dirt-cheap, straightforward, and it Just Works™. In my experiments (as you can see in the code) I draw my "output" from the third-last convolutional state. Why? Because it doesn't matter -- grab inscrutable vectors from the middle of the model, and it still works as you'd expect it to.

I was quite surprised to see myself cited as "liking the broader category that QACI is in" - I think this claim may technically be true for some definition of "likes" and "broader category", but tries to imply a higher level of endorsement to the casual reader than is factual.

I don't have a very good understanding of QACI and therefore have no particularly strong opinions on QACI. It seems quite different from the kinds of alignment approaches I think about.

My summary of the paper: The paper proves that if you have two distributions that you want to ensure cannot be distinguished linearly (i.e. a logistic regression will fail to achieve a better-than-chance score), then one way to do this is to make sure they have the same mean. Previous work has done similar stuff (https://arxiv.org/abs/2212.04273), but without proving optimality.
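
To make the "same mean" condition concrete, here's a toy sketch (my own illustration with made-up Gaussian data, assuming numpy and scikit-learn, not code from the paper): a linear probe separates two classes that differ only in their means, but after projecting out the difference-in-means direction the class means coincide and logistic regression drops to roughly chance.

```python
# Toy illustration (my own, not the paper's construction): equalizing class means
# by projecting out the difference-in-means direction defeats a linear probe.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 5000, 32
X0 = rng.normal(size=(n, d))          # class 0
X1 = rng.normal(size=(n, d)) + 0.5    # class 1: same covariance, shifted mean
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])

def probe_acc(X, y):
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    return LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

print("before:", probe_acc(X, y))   # well above chance (~0.9)

# Remove the (unit-normalized) difference-in-means direction so both classes
# end up with exactly the same mean.
diff = X1.mean(axis=0) - X0.mean(axis=0)
u = diff / np.linalg.norm(diff)
X_guarded = X - np.outer(X @ u, u)

print("after:", probe_acc(X_guarded, y))  # ~0.5, i.e. chance
```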

I think it's pretty unlikely (<5%) that decentralized volunteer training will be competitive with SOTA, ever. (Caveat: I haven't been following volunteer training super closely so this take is mostly cached from having looked into it for GPT-Neo plus occasionally seeing new papers about volunteer training).

  1. You are going to get an insane efficiency hit from the compute having very low-bandwidth, high-latency interconnect. I think it's not inconceivable that someone will eventually figure out an algorithm that is only a few times worse than training on a
... (read more)
2 Lao Mein 4mo
Could you link me to sources that could give me an estimate of how inefficient volunteer compute would be? Is it something like 100x or 10^6x? Mandatory (i.e. integrated into WeChat) volunteer compute (with compensation) available in China could well exceed conventional AI training clusters by several OOMs.

here's a straw hypothetical example where I've exaggerated both 1 and 2; the details aren't exactly correct but the vibe is more important:

1: "Here's a super clever extension of debate that mitigates obfuscated arguments [etc], this should just solve alignment"

2: "Debate works if you can actually set the goals of the agents (i.e you've solved inner alignment), but otherwise you can get issues with the agents coordinating [etc]"

1: "Well the goals have to be inside the NN somewhere so we can probably just do something with interpretability or whatever"

2: "ho... (read more)

So Q=inner alignment? Seems like person 2 not only pointed to inner alignment explicitly (so it can no longer be "some implicit assumption that you might not even notice you have"), but also said that it "seems to contain almost all of the difficulty of alignment to me". He's clearly identified inner alignment as a crux, rather than as something meant "to be cynical and dismissive". At that point, it would have been prudent of person 1 to shift his focus onto inner alignment and explain why he thinks it is not hard.

Note that your post suddenly introduces "Y" without defining it. I think you meant "X".

a common discussion pattern: person 1 claims X solves/is an angle of attack on problem P. person 2 is skeptical. there is also some subproblem Q (90% of the time not mentioned explicitly). person 1 is defending a claim like "X solves P conditional on Q already being solved (but Q is easy)", whereas person 2 thinks person 1 is defending "X solves P via solving Q", and person 2 also believes something like "subproblem Q is hard". the problem with this discussion pattern is it can lead to some very frustrating miscommunication:

  • if the discussion recurses int
... (read more)

I find myself in person 2's position fairly often, and it is INCREDIBLY frustrating for person 1 to claim they've "solved" P, when they're ignoring the actual hard part (or one of the hard parts).  And then they get MAD when I point out why their "solution" is ineffective.  Oh, wait, I'm also extremely annoyed when person 2 won't even take steps to CONSIDER my solution - maybe subproblem Q is actually easy, when the path to victory aside from that is clarified.

In neither case can any progress be made without actually addressing how Q fits into P, and what is the actual detailed claim of improvement of X in the face of both Q and non-Q elements of P.   

I can see how this could be a frustrating pattern for both parties, but I think it's often an important conversation tree to explore when person 1 (or anyone) is using results about P in restricted domains to make larger claims or arguments about something that depends on solving P at the hardest difficulty setting in the least convenient possible world.

As an example, consider the following three posts:


I think both of th... (read more)

yeah, but that's because Q is easy if you solve P

Very nicely described; this might benefit from becoming a top-level post

For example?

random brainstorming about optimizeryness vs controller/lookuptableyness:

let's think of optimizers as things that reliably steer a broad set of initial states to some specific terminal state. seems like there are two things we care about (at least):

  • retargetability: it should be possible to change the policy to achieve different terminal states (but this is an insufficiently strong condition, because LUTs also trivially meet this condition, because we can always just completely rewrite the LUT. maybe the actual condition we want is that the complexity of t
... (read more)

My prior, not having looked too carefully at the post or the specific projects involved, is that probably any claims that an open-source model is 90% as good as GPT-4 or indistinguishable are hugely exaggerated or otherwise not a fair comparison. In general in ML, confirmation bias and overclaiming are very common, and as a base rate the vast majority of papers that claim some kind of groundbreaking result end up just never having any real impact.

Also, I expect facets of capabilities progress most relevant to existential risk will be especially constrained st... (read more)

This comment has gotten lots of upvotes, but has anyone here tried Vicuna-13B?

I think it's worth disentangling LLMs and Transformers and so on in discussions like this one--they are not one and the same. For instance, the following are distinct positions that have quite different implications:

  • The current precise transformer LM setup but bigger will never achieve AGI
  • A transformer trained on the language modelling objective will never achieve AGI (but a transformer network trained with other modalities or objectives or whatever will)
  • A language model with the transformer architecture will never achieve AGI (but a language model wit
... (read more)
3 Alexander Gietelink Oldenziel 5mo
You didn't ask me, but let me answer: I ~believe all three. For the first, I will just mention that labs are already moving away from the pure transformer architecture. Don't take it from me: Sam Altman is on record saying they're moving away from pure scaling. For the second, I don't think it's a question about modalities. Text is definitely rich enough ('text is the universal interface'). Yes to the third. Like Byrnes, I don't feel I want to talk too much about it. (But I will say it's not too different from what's being deployed right now. In particular, I don't think there is something completely mysterious about intelligence, or that we need GOFAI or to understand consciousness, or something like that.) The trouble is that AGI is an underdefined concept. In my conception these simply don't have the right type signature / architecture. Part of the problem is many people conceive of General Intelligence in terms of capabilities, which I think is misleading. A child is generally intelligent but a calculator is not. The only caveat here is that it is conceivable that a large enough LLM might have a generally intelligent mesaoptimizer within it. I'm confused about how likely this is. That is not to say LLMs won't be absolutely transformative - they will! And that is not to say timelines are long (I fear most of what is needed for AGI is already known by various people)

retargetability might be the distinguishing factor between controllers and optimizers

as in, controllers are generally retargetable and optimizers aren't? or vice-versa

would be interested in reasoning, either way

Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when it's definitely not heavy-tailed it's monotonic, for Regressional Goodhart (https://arxiv.org/pdf/2210.10760.pdf#page=17). Jacob probably has more detailed takes on this than me.

In any event, my intuition is this seems unlikely to be the main reason for overoptimization - I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent.
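
For intuition, here's a toy best-of-N simulation I'm sketching here (assuming numpy; not from the paper): with light-tailed (Gaussian) proxy error, the true value of the proxy-selected sample keeps improving as optimization pressure N grows, whereas with heavy-tailed (Cauchy) error the selection is increasingly dominated by noise.

```python
# Toy best-of-N simulation (my own sketch, not from the paper): proxy = true + error,
# pick the argmax of the proxy, and look at the true value you actually get.
import numpy as np

rng = np.random.default_rng(0)
trials = 2000

def selected_true_value(N, error_sampler):
    true = rng.normal(size=(trials, N))
    proxy = true + error_sampler((trials, N))
    best = np.argmax(proxy, axis=1)          # best-of-N on the proxy
    return true[np.arange(trials), best].mean()

for N in [1, 10, 100, 1000]:
    light = selected_true_value(N, lambda s: rng.normal(size=s))           # Gaussian error
    heavy = selected_true_value(N, lambda s: rng.standard_cauchy(size=s))  # heavy-tailed error
    print(f"N={N:5d}  true value (gaussian error): {light:+.2f}   (cauchy error): {heavy:+.2f}")
```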

takes on takeoff (or: Why Aren't The Models Mesaoptimizer-y Yet)

here are some reasons we might care about discontinuities:

  • alignment techniques that apply before the discontinuity may stop applying after / become much less effective
    • makes it harder to do alignment research before the discontinuity that transfers to after the discontinuity (because there is something qualitatively different after the jump)
    • second order effect: may result in false sense of security
  • there may be less/negative time between a warning shot and the End
    • harder to coordinate and slow do
... (read more)

random fun experiment: accuracy of GPT-4 on "Q: What is 1 + 1 + 1 + 1 + ...?\nA:"
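
If someone wants to rerun this, here's a rough sketch (assuming the openai>=1.0 Python client and an OPENAI_API_KEY; note the original plot used the highest-logprob numerical token from the model, which this chat-completion approximation doesn't capture exactly):

```python
# Rough sketch for rerunning this (assumes the openai>=1.0 Python client and an
# OPENAI_API_KEY; the original plot used the highest-logprob numerical token,
# which this chat-completion approximation doesn't capture exactly).
import re
from openai import OpenAI

client = OpenAI()

def gpt4_sum_of_ones(n: int):
    prompt = "Q: What is " + " + ".join(["1"] * n) + "?\nA:"
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=10,
        temperature=0,
    )
    match = re.search(r"-?\d+", resp.choices[0].message.content)
    return int(match.group()) if match else None

for n in [5, 25, 100, 150, 200]:
    answer = gpt4_sum_of_ones(n)
    print(n, answer, "correct" if answer == n else "wrong")
```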

This is a cool idea. I wonder how it's able to do 100, 150, and 200 so well. I also wonder what are the exact locations of the other spikes?

blue: highest logprob numerical token

orange: y = x

...I am suddenly really curious what the accuracy of humans on that is.

basically the Simulators kind of take afaict

a claim I've been saying irl for a while but have never gotten around to writing up: current LLMs are benign not because of the language modelling objective, but because of the generalization properties of current NNs (or to be more precise, the lack thereof). with better generalization LLMs are dangerous too. we can also notice that RL policies are benign in the same ways, which should not be the case if the objective was the core reason. one thing that can go wrong with this assumption is thinking about LLMs that are both extremely good at generalizing ... (read more)

what is the "language models are benign because of the language modeling objective" take?

Rightfully so! Read your piece back in 2021 and found it true & straightforward.

I sorta had a hard time with this market because the things I think might happen don't perfectly map onto the market options, and usually the closest corresponding option implies some other thing, such that the thing I have in mind isn't really a central example of the market option.

leogao 6mo Ω52518

Adding $200 to the pool. Also, I endorse the existence of more bounties/contests like this.

The following things are not the same:

  • Schemes for taking multiple unaligned AIs and trying to build an aligned system out of the whole
    • I think this is just not possible.
  • Schemes for taking aligned but less powerful AIs and leveraging them to align a more powerful AI (possibly with amplification involved)
    • This breaks if there are cases where supervising is harder than generating, or if there is a discontinuity. I think it's plausible something like this could work but I'm not super convinced.

I don't think experiments like this are meaningful without a bunch of trials and statistical significance. The outputs of models (even RLHF models) on these kinds of things have pretty high variance, so it's really hard to draw any conclusions from single-sample comparisons like this.
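
As a minimal sketch of what I mean (hypothetical counts, assuming scipy): even a 14/20 vs 9/20 split between two models is not statistically distinguishable from noise.

```python
# Hypothetical counts: model A "succeeds" on 14/20 samples, model B on 9/20.
# A Fisher exact test shows this gap is easily explained by sampling noise.
from scipy.stats import fisher_exact

a_success, a_total = 14, 20
b_success, b_total = 9, 20

table = [[a_success, a_total - a_success],
         [b_success, b_total - b_success]]
odds_ratio, p_value = fisher_exact(table)
print(f"p = {p_value:.3f}")  # comfortably above 0.05, i.e. not significant
```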

1 Christopher King 6mo
Although I think it's a stretch to say they "aren't meaningful", I do agree a more scientific test would be nice. It's a bit tricky when you only got 25 messages per 3 hours though, lol. More generally, it's hard to tell how to objectively quantify agency in the responses, and how to eliminate other hypotheses (like that GPT-4 is just more familiar with itself than other AIs).

My prediction is that GPT is already capable of the former, which means we might have solved a tough problem in alignment almost by accident!

I think this is incorrect. I don't consider whether an LM can tell whether most humans would approve of an outcome described in natural language to be a tough problem in alignment. This is a far easier thing to do than the thing #1 describes.

Some argument for this position: https://www.lesswrong.com/posts/ktJ9rCsotdqEoBtof/asot-some-thoughts-on-human-abstractions

-3 baturinsky 6mo
"World can be described and analyzed using the natural human language well enough to do accurate reasoning and prediction" could be another measure of the "good" world, imho. If the natural language can't be used to reason about the world anymore, it's likely that this world is already alien enough to people to have no human value.

I don't see how this changes the picture? If you train a model on real-time feedback from a human, that human algorithm is still the same one that is foolable by, e.g., cutting down the tree and replacing it with papier-mache or something. None of this forces the model to learn a correspondence between the human ontology and the model's internal best-guess model, because the reason any of this is a problem in the first place is the fact that the human algorithm points at a thing which is not the thing we actually care about.

re:1, yeah that seems plausible; I'm thinking in the limit of really superhuman systems here and specifically pushing back against the claim that human abstractions being somehow inside a superhuman AI is sufficient for things to go well.

re:2, one thing is that there are ways of drifting that we would endorse using our meta-ethics, and ways that we wouldn't endorse. More broadly, the thing I'm focusing on in this post is not really about drift over time or self improvement; in the setup I'm describing, the thing that goes wrong is it does the classical ... (read more)

one man's modus tollens is another man's modus ponens:

"making progress without empirical feedback loops is really hard, so we should get feedback loops where possible" "in some cases (i.e close to x-risk), building feedback loops is not possible, so we need to figure out how to make progress without empirical feedback loops. this is (part of) why alignment is hard"

you gain general logical facts from empirical work, which can aid in providing a blurry image of the manifold that the precise theoretical work is trying to build an exact representation of

Yeah something in this space seems like a central crux to me.

I personally think (as a person generally in the MIRI-ish camp of "most attempts at empirical work are flawed/confused") that it's not crazy to look at the situation and say "okay, but, theoretical progress seems even more flawed/confused, we just need to figure out some way of getting empirical feedback loops."

I think there are some constraints on how the empirical work can possibly work. (I don't think I have a short thing I could write here, I have a vague hope of writing up a longer post on "what I think needs to be true, for empirical work to be helping rather than confusedly not-really-helping")

Is the correlation between sleeping too long and bad health actually because sleeping too long is actually causally upstream of bad health effects, or only causally downstream of some common cause like illness?

Afaik, both. Like a lot of shit things - they are caused by depression, and they cause depression; horrible reinforcing loop. While the effect of bad health on sleep is obvious, you can also see this work in reverse; e.g. temporary severe sleep restriction has an anti-depressive effect. Notable, though without many useful clinical applications, as constant sleep deprivation is also really unhealthy.

I think the problems are roughly equivalent. Creating training data that trope weights superintelligences as honest requires you to access sufficiently superhuman behavior, and you can't just elide the demonstration of superhumanness, because that just puts it in the category of simulacra that merely profess to be superhuman. 

2 Jozdien 7mo
I think the relevant idea is what properties would be associated with superintelligences drawn from the prior? We don't really have a lot of training data associated with superhuman behaviour on general tasks, yet we can probably draw it out of powerful interpolation. So properties associated with that behaviour would also have to be sampled from the human prior of what superintelligences are like - and if we lived in a world where superintelligences were universally described as being honest, why would that not have the same effect as one where humans are described as honest resulting in sampling honest humans being easy?
leogao 7mo Ω287927

Therefore, the longer you interact with the LLM, eventually the LLM will have collapsed into a waluigi. All the LLM needs is a single line of dialogue to trigger the collapse.

This seems wrong. I think the mistake you're making is when you argue that because there's some chance X happens at each step and X is an absorbing state, therefore you have to end up at X eventually. However, this is only true if you assume the conclusion and claim that the prior probability of luigis is zero. If there is some prior probability of a luigi, each non-waluigi step incre... (read more)

6 abramdemski 7mo
I disagree. The crux of the matter is the limited memory of an LLM. If the LLM had unlimited memory, then every Luigi act would further accumulate a little evidence against Waluigi. But because LLMs can only update on so much context, the probability drops to a small one instead of continuing to drop to zero. This makes waluigi inevitable in the long run.
2 TekhneMakre 7mo
This comment seems to rest on a dubious assumption. I think you're saying: The first sentence is dubious though. Why would the LLM's behavior come from a distribution over a space that includes "behave like luigi (forever)"? My question is informal, because maybe you can translate between distributions over [behaviors for all time] and [behaviors as functions from a history to a next action]. But these two representations seem to suggest different "natural" kinds of distributions. (In particular, a condition like non-dogmatism--not assigning probability 0 to anything in the space--might not be preserved by the translation.)
7 Ulisse Mini 7mo
Each non-Waluigi step increases the probability of never observing a transition to Waluigi a little bit, but not unboundedly so. As a toy example, we could start with P(Waluigi) = P(Luigi) = 0.5. Even if P(Luigi) monotonically increases, finding novel evidence that Luigi isn't a deceptive Waluigi becomes progressively harder. Therefore, P(Luigi) could converge to, say, 0.8. However, once Luigi says something Waluigi-like, we immediately jump to a world where P(Waluigi) = 0.95, since this trope is very common. To get back to Luigi, we would have to rely on a trope where a character goes from good to bad to good. These tropes exist, but they are less common. Obviously, this assumes that the context window is large enough to "remember" when Luigi turned bad. After the model forgets, we need a "bad to good" trope to get back to Luigi, and these are more common.

Agreed.  To give a concrete toy example:  Suppose that Luigi always outputs "A", and Waluigi is {50% A, 50% B}.  If the prior is {50% luigi, 50% waluigi}, each "A" outputted is a 2:1 update towards Luigi.  The probability of "B" keeps dropping, and the probability of ever seeing a "B" asymptotes to 50% (as it must).
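
A quick numerical check of this toy example (my own sketch): the posterior on Waluigi after k consecutive "A"s is 0.5^k / (0.5^k + 1), and a Monte Carlo run confirms that the probability of ever seeing a "B" converges to 50%.

```python
# Monte Carlo check of the toy example: Luigi always emits "A", Waluigi emits
# "A" or "B" with probability 1/2 each, and the prior over personas is 50/50.
import random

random.seed(0)
trials, steps = 100_000, 200
ever_b = 0
for _ in range(trials):
    is_waluigi = random.random() < 0.5
    if is_waluigi and any(random.random() < 0.5 for _ in range(steps)):
        ever_b += 1
print("P(ever see a B) ~", ever_b / trials)  # ~0.5, as it must be

# Posterior on Waluigi after k consecutive "A"s: 0.5^k / (0.5^k + 1)
for k in [0, 1, 5, 10]:
    print(k, 0.5**k / (0.5**k + 1))
```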

This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.

2 kibber 7mo
I think what the OP is saying is that each luigi step is actually a superposition step, and therefore each next line adds to the probability of collapse. However, from a pure trope perspective I believe this is not really the case - in most works of fiction that have a twist, the author tends to leave at least some subtle clues for the twist (luigi turning out to be a waluigi). So it is possible for at least some lines to decrease the probability of waluigi collapse.

You don't need to pay for translation to simulate human level characters, because that's just learning the human simulator. You do need to pay for translation to access superhuman behavior (which is the case ELK is focused on).

3 Jozdien 7mo
Yeah, but the reasons for both seem slightly different - in the case of simulators, because the training data doesn't trope-weigh superintelligences as being honest. You could easily have a world where ELK is still hard but simulating honest superintelligences isn't.
leogao 7mo Ω133914

However, this trick won't solve the problem. The LLM will print the correct answer if it trusts the flattery about Jane, and it will trust the flattery about Jane if the LLM trusts that the story is "super-duper definitely 100% true and factual". But why would the LLM trust that sentence?

 

There's a fun connection to ELK here. Suppose you see this and decide: "ok, forget trying to describe in natural language that it's definitely 100% true and factual. What if we just add a special token that we prepend to indicate '100% true and factual, for... (read more)

2 Anomalous 6mo
If <|specialtoken|> always prepends true statements, I suppose it's pretty good as brainwashing, but the token will still end up being clustered close to other concepts associated with veracity, which are clustered close to claims about veracity, which are clustered close to false claims about veracity. If it has enough context suggesting that it's in a story where it's likely to be manipulated, then suddenly feeling [VERIDIGAL] could snap the narrative in place. The idea of "injected thoughts" isn't new to it. If, right now, I acquired the ability to see a new colour, and it flashed in my mind every time I read something true... I'd learn a strong association, but I'd treat it in a similar manner to how I treat the other inexplicably isolated intuitions I've inherited from my evolutionary origin. Love the idea, though. I was hyped before I thought on it. Still seems worth exploring an array of special tokens as means of nudging the AI towards specific behaviours we've reinforced. I'm not confident it won't be very effective.
4 JoshuaZ 7mo
What does ELK stand for here?
1 Aleksey Bykhun 7mo
Do humans have this special token that exists outside language? How would it be encoded in the body? One interesting candidate is a religious feeling of awe. It kinda works like that — when you're in that state, you absorb beliefs. Also, social pressure seems to work in a similar way.
1 Garrett Baker 7mo
This seems like it'd only work if the LM doesn't generalize the supposed Waluigi Effect to include this token - making a token that specifies "definitely true and factual for reals". If some of the text ends up being wrong, for instance, it may quickly switch to "ah, now it is time for me to be sneakily wrong!", and it always keeps around some probability that it's now meant to be sneakily wrong, because a token which always specifies '100% true and factual for reals' is an incredibly unlikely initial hypothesis to hold about the token, and there are other hypotheses which basically predict those token dynamics and are far more plausible.
7 Cleo Nardo 7mo
Yes — this is exactly what I've been thinking about! Can we use RLHF or finetuning to coerce the LLM into interpreting the outside-text as undoubtedly literally true? If the answer is "yes", then that's a big chunk of the alignment problem solved, because we just send a sufficiently large language model the prompt with our queries and see what happens.

There is an advantage here in that you don't need to pay for translation from an alien ontology - the process by which you simulate characters having beliefs that lead to outputs should remain mostly the same. You would need to specify a simulacrum that is honest though, which is pretty difficult and isomorphic to ELK in the fully general case of any simulacra, but it's in a space that's inherently trope-weighted; so simulating humans that are being honest about their beliefs should be made a lot easier (but plausibly still not easy in absolute terms) beca... (read more)

GPT-2-xl unembedding matrix looks pretty close to full rank (plot is singular values)
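
A minimal sketch of how to reproduce this (assuming torch and HuggingFace transformers; GPT-2's unembedding is tied to the token embedding, shape [50257, 1600] for gpt2-xl):

```python
# Sketch of reproducing the singular-value check (assumes torch + transformers;
# gpt2-xl is ~6GB of weights, so substitute "gpt2" if you just want the shape of the curve).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
W_U = model.lm_head.weight.detach().float()   # [vocab_size, d_model] = [50257, 1600]

sv = torch.linalg.svdvals(W_U)
print("shape:", tuple(W_U.shape))
print("largest / smallest singular value:", sv.max().item(), sv.min().item())
print("numerical rank (relative 1e-6 cutoff):", int((sv > sv.max() * 1e-6).sum()), "of", min(W_U.shape))
```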

The coin flip example seems related to some of the ideas here

I think your meta level observation seems right. Also, I would add that bottleneck problems in either capabilities or alignment are often bottlenecked on resources like serial time.

(My timelines, even taking all this into account, are only like 10 years---I don't think these obstacles are so insurmountable that they buy decades.)
