Let me remind people of this other data point: in a previous paper with Evan we found that Opus engages in sandbagging. Here are screenshots of key parts:
The system prompt is natural (but does give the model a goal, unlike in Apollo's case):
A "Slack conversation" that appears in the prompt includes this message:
And here's the evaluation task.
Opus acted less capable than it is, both with and without chain-of-thought.
I really like the idea of creating a Good ABC book, and the author executed it well. Out of the recreational posts of 2023, this is the one I've liked the most. I can't articulate exactly why, but the visuals are great, it Creates A Thing, and it's an example of the kind of thing I'd like there to be more of in the world. It inspired me to create my own version. I give it a +9 for the 2023 Review.
I really liked this post. I also have friends who have young children, and was inspired to give them a book like this. But everyone involved speaks Finnish, so I ended up creating my own.
I just got my copies in the mail. The book looks really unimpressive in these low-quality phone-camera photos, but it's really satisfying in real life - like Katja, I paid attention to using high-quality images in the book itself. For the cover picture I chose Earthrise.
(I'm not sharing the full photos due to uncertainties with copyright, but if you want your copy, I can send the materials to you.)
More information about the creation process:
I didn't want to include any information such as the authors or a title in the book: I liked the idea that the children would have a Mysterious Book on their bookshelf. (The back cover has some information about the company that printed the book, though.)
Here's the word list:
A: Avaruus (space)
B: Bakteeri (bacteria)
C: Celsius
D: Desi (deciliter)
E: Etäisyys (distance)
F: Folio (foil)
G: Geeni (gene)
H: Hissi (elevator)
I: Ilma (air)
J: Jousi (spring)
K: Kello (clock)
L: Linssi (lens)
M: Määrä (quantity)
N: Nopea (fast)
O: Odottaa (to wait)
P: Pyörä (wheel)
R: Rokote (vaccine)
S: Solu (cell)
T: Tietokone (computer)
U: Uuni (oven)
V: Valo (light)
Y: Ympyrä (circle)
Ä: Ääni (sound)
Ö: Ötökkä (bug)
I recently gave a workshop on AI control, for which I created an exercise set.
The exercise set can be found here: https://drive.google.com/file/d/1hmwnQ4qQiC5j19yYJ2wbeEjcHO2g4z-G/view?usp=sharing
The PDF is self-contained, but here are three additional points:
You can send feedback by messaging me or anonymously here.
The post uses handicapped chess as a domain for studying how player capability and starting position affect win probabilities. From the conclusion:
In the view of Miles and others, the initially gargantuan resource imbalance between the AI and humanity doesn’t matter, because the AGI is so super-duper smart, it will be able to come up with the “perfect” plan to overcome any resource imbalance, like a GM playing against a little kid that doesn't understand the rules very well.
The problem with this argument is that you can use the exact same reasoning to imply that it’s “obvious” that Stockfish could reliably beat me with queen odds. But we know now that that’s not true.
Since this post came out, a chess bot designed to play with fewer pieces (LeelaQueenOdds) has been released. simplegeometry's comment introduces it well. With queen odds, LQO is way better than Stockfish, which has not been designed for handicapped play. Consequently, the main empirical result of the post is severely undermined. (I wonder how far even LQO is from truly optimal play against humans.)
(This is in addition to - as many commenters point out - the whole analogy being stretched at best, given the many critical ways in which chess differs from reality. The post offers little argument for the validity of the analogy.)
I don't think the post has stood the test of time, and vote against including it in the 2023 Review.
I found it interesting to play against LeelaQueenOdds. My experiences:
Overall, it's fascinating to play from a position that should be an easy win, yet get crushed by an opponent that Just Plays Better than I do.
[For context, I'm around 2100 on Lichess at short time controls (bullet/blitz). I also won against Stockfish 16 at rook odds on my first try - it's really not optimized for this sort of thing.]
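As a rough reference for how rating differences map to win probabilities - the quantity the post is ultimately about - here's a minimal sketch using the standard Elo expected-score formula. The bot rating in the example is a made-up placeholder, not a measured rating of LQO or Stockfish, and material odds aren't modeled at all.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score (win = 1, draw = 0.5, loss = 0) for A vs B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

if __name__ == "__main__":
    me = 2100                 # my approximate Lichess rating
    hypothetical_bot = 2800   # illustrative assumption, not a measured engine rating
    print(f"Expected score vs. the bot: {expected_score(me, hypothetical_bot):.3f}")
    # Material odds effectively shift this curve, but by how much depends heavily
    # on both players - which is exactly the question the post and LQO bear on.
```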
You might be interested in this post of mine, which is more precise about what "trustworthy" means. In short, my definition is "the AI isn't adversarially trying to cause a bad outcome". This includes aligned models, and also unaligned models that are too dumb to realize they should (try to) sabotage. It does not include models that are unaligned, trying to sabotage, and which we are able to stop from causing bad outcomes (though we might still have use-cases for such models).
Thanks, that clears it up. Indeed, I agree that "take real trajectories, then edit the attacks in" is relatively promising; I didn't track this when writing my first comment (oops) - thanks for flagging it.
For background, see this post on trustworthiness.
Here are two notions of trustworthiness that are useful to distinguish:
These two notions can come apart substantially. A central example is a model that has never had any serious misaligned thoughts but, when presented with a clear opportunity to escape, takes it. This model would be trusted under the first notion, but not under the second.
Another example is the case where a weaker model is monitoring a stronger model. If, during deployment, the weak model obtains strong evidence that it really is in deployment (possibly because the strong model aims for this outcome), it might end up not reporting bad actions the strong model takes.[1] This can happen even if the weak model is generally not egregiously misaligned.
I find this distinction useful for two reasons:
First, there are many sorts of evaluations and arguments which imply (1) but not (2). For instance, arguments from the model’s incompetence at theory-of-mind and keeping secrets, or at thinking of and executing cross-context strategies, are strong evidence in favor of (1), but much weaker evidence in favor of (2). Obtaining strong evidence for (2) might require evaluations targeted at that specifically.
Second, whether we care about (1) or (2) depends on the application. Situations where we primarily care about (1) include: the model has produced large quantities of research, we have run capability evaluations on the model, we are forming vague high-level impressions about the model’s psychology based on behavioral evidence. In these cases we typically have redundancy and multiple lines of evidence, and thus sabotage would need to be somewhat consistent. In contrast, we care about (2) when a single action by the model can have large consequences, which is the case at least when the model is monitoring other models.
I like this frame better than my previous ones. Credit to Ryan. For ease of reference, here’s terminology: property (1) is average or low-stakes trustworthiness, and property (2) is worst-case or high-stakes trustworthiness.
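To make the high-stakes case concrete, here is a minimal sketch of the kind of monitoring setup I have in mind. Everything here - the function names, the suspicion threshold, the escalation step - is a hypothetical placeholder, not a real API or a protocol from any particular paper.

```python
from typing import Callable

SUSPICION_THRESHOLD = 0.8  # illustrative value, not from any real deployment

def run_with_monitoring(
    task: str,
    untrusted_model: Callable[[str], str],
    trusted_monitor: Callable[[str, str], float],
) -> str:
    """The strong (untrusted) model acts; a weak (trusted) monitor reviews each action."""
    action = untrusted_model(task)             # strong model proposes an action
    suspicion = trusted_monitor(task, action)  # weak model scores it in [0, 1]
    if suspicion >= SUSPICION_THRESHOLD:
        # A single unflagged bad action can be catastrophic, which is why the
        # monitor's worst-case (high-stakes) trustworthiness matters here,
        # not just its average behavior.
        return escalate(task, action)
    return action

def escalate(task: str, action: str) -> str:
    # Placeholder: in practice, route to a human audit, resample from a trusted
    # model, or block the action outright.
    return "[action withheld for audit]"
```

The point is just that the monitor sits in a single-point-of-failure position: if it under-reports exactly when it matters, good average behavior doesn't save you.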
[1] We might be able to ensure the model can't reliably distinguish evaluations from deployment, so this scenario might be unrealistic. But I want to be extremely clear about whether the argument for safety is "the model won't know it's deployed" versus "the model can recognize when it's deployed, but it still won't take any bad actions because it's a good guy".