All of Olli Järviniemi's Comments + Replies

I'd try to come up with a classification problem that a base model might initially refuse (or we'd hope it would refuse). Then the training to say "sorry i can't help with that" makes more intuitive sense.


Agree this is one axis the experiment could be improved on.

but it does not imply that the model is literally thinking about it.

I think I disagree here. At least for problems that are (computationally) difficult enough I think "linear probe can retrieve answer" implies "model has done the necessary computation to solve the problem".[1] To solve ... (read more)

1 · Jérémy Scheurer · 2d
I do think linear probes are useful, and if you can correctly classify the target with a linear probe it makes it more likely that the model is potentially "representing something interesting" internally (e.g. the solution to the knapsack problem). But it's not guaranteed; the model could just be calculating something else which correlates with the solution to the knapsack problem. I really recommend checking out the DeepMind paper I referenced. Fabien Roger also explains some shortcomings of CCS here. The takeaway is just: be careful when interpreting linear probes. They are useful to some extent, but prone to overinterpretation.

Here is an experiment one could run in the Hidden Cognition Detection genre:

Motivating question: Suppose we have a pre-trained model that outputs deceptive reasoning. We want the model to not do deceptive reasoning. Hence we do the standard thing: negatively reward the model or SFT it on a non-deceptive continuation. Does this actually remove the deceptive reasoning, or will the model still do the reasoning while not outputting its results?


Step 1: Take a pre-trained model. Train it to be good at, say, solving the 0-1 knapsack problem (for e.g. n... (read more)

1 · Jérémy Scheurer · 3d
Seems like an experiment worth doing. Some thoughts:

  • I understand that setting up a classification problem in step 1 is important (so you can train the linear probe). I'd try to come up with a classification problem that a base model might initially refuse (or we'd hope it would refuse). Then the training to say "sorry i can't help with that" makes more intuitive sense. I get that mechanistically it's the same thing. But you want to come as close to the real deal as you can, and it's unclear why a model would say "I'm sorry I can't help" to the knapsack problem.
  • If the linear probe in step 4 can still classify accurately, it implies that there are some activations "which at least correlate with thinking about how to answer the question", but it does not imply that the model is literally thinking about it. I think it would still be good enough as a first approximation, but I just want to caution generally that linear probes do not show the existence of a specific "thought" (e.g. see this recent paper). Also, if the probe can't classify correctly, it's not proof that the model does not "think about it". You're probably aware of all this, just thought I'd mention it.
  • This paper might also be relevant for your experiment.

I want to log a prediction, let me know if you ever run this. My guess would be that this experiment will just work, i.e., the linear probe will still get fairly high accuracy even after step 3. I think it's still worth checking (so I still think it's probably worth doing), but overall I'd say it's not super surprising if this would happen (see e.g. this paper for where my intuition comes from).
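
To train a linear probe on the knapsack setup discussed above, one needs ground-truth labels for each instance. Here is a minimal sketch of how such labels could be generated. The classification framing ("does the optimal value reach half of the total value?"), the function names, and the instance sizes are all assumptions made for illustration, not something specified in the thread.

```python
import random


def knapsack_max_value(weights, values, capacity):
    """Optimal value of a 0-1 knapsack instance via the standard DP over capacities."""
    best = [0] * (capacity + 1)  # best[c] = max value achievable with capacity c
    for w, v in zip(weights, values):
        # Iterate capacities downwards so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]


def make_labelled_instance(n_items, rng, threshold_fraction=0.5):
    """Hypothetical data generator: a random instance plus a binary label.

    The label is 1 if the optimal value reaches threshold_fraction of the
    total value of all items (an assumed way to get a yes/no target).
    """
    weights = [rng.randint(1, 20) for _ in range(n_items)]
    values = [rng.randint(1, 20) for _ in range(n_items)]
    capacity = sum(weights) // 2
    optimum = knapsack_max_value(weights, values, capacity)
    label = int(optimum >= threshold_fraction * sum(values))
    return weights, values, capacity, label


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_labelled_instance(10, rng))
```

The probe itself would then be fit on model activations against these labels. As the comments above caution, a probe that fits well still only shows correlation with the label.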


  • I added the remaining completions to the completions/ folder in the git repo. I also collected the DMGI-completions to a single file, totaling about 20 completions. This should address your first and fourth points.
  • About the effects of non-private scratchpads: I took 50 samples on each of the three categories "scratchpad is private", "scratchpad is not private and this is indirectly-but-clearly implied" and "scratchpad is not private and this is directly stated". At a quick glance there were no huge effects on frequencies of four-leaf-clover mentions
... (read more)
That's great, thanks for doing that.

That's really fascinating; I would not have predicted that at all! It makes me think of Daniel Eth's recent tweet about having to prod ChatGPT to realize that revealing its choice first in rock paper scissors could give the user an advantage (@Daniel_Eth it wasn't clear just how much prodding that took -- I'd love to see the full exchange if you still have it).

Thanks for the detailed comment!

Yep, we definitely want training-based versions of the same experiments - this is something that's actively on my mind.

Responding to your bullet points:

  • I think it's not very important what the exact frequency is - it just doesn't tell you much. What does matter are the trends - whether prompt A has a higher frequency than prompt B - and for this it indeed is awkward that the frequencies are small. There I suggest that one doesn't read too much into the specific numbers, but just think "these prompts gave DMGI-results, those
... (read more)
  • 'I think it's not very important what the exact frequency is - it just doesn't tell you much.' Totally fair! I guess I'm thinking more in terms of being convincing to readers -- 3 is a small enough sample size that readers who don't want to take deception risks seriously will find it easier to write it off as very unlikely. 30/1000 or even 20/1000 seems harder to dismiss.
  • 'I have mixed thoughts on the relevance of faithfulness here. On one hand, yep, I'm looking at the CoTs and saying "this one is deceptive, that one is not", and faithfulness seems relevant there. On the other hand: faithful or not, there is some sense in which the model is doing instrumental deceptive reasoning and acting on it.' Agreed. And yet... the trouble with research into model deception is that nearly all of it is, in one way or another, 'We convinced it to be deceptive and then it was deceptive.' In all the research so far that I'm aware of, there's a sense in which the deception is only simulated. It's still valuable research! I mainly just think that it's an important limitation to acknowledge until we have model organisms of deception that have been given a much more intrinsic terminal goal that they'll lie in service of. To be clear, I think that your work here is less weakened by 'we convinced it to be deceptive' than most other work on deception, and that's something that makes it especially valuable. I just don't think it fully sidesteps that limitation.
  • 'My motivation for human-CoTs was "let's start by constructing something that definitely should work (if anything does)".' Makes sense!
  • '...deceptive alignment arising in pre-training isn't the only worry - you have to worry about it arising in fine-tuning as well!' Strongly agreed! In my view, for a myopic next-token predictor like current major LLMs, that's the only way that true deceptive alignment could arise. Of course, as you point out, there are circumstances where simulated deceptive alignment can cause major rea

Interesting perspective!

I would be interested in hearing answers to "what can we do about this?". Sinclair has a couple of concrete ideas - surely there are more. 

Let me also suggest that improving coordination benefits from coordination. Perhaps there is little a single person can do, but is there something a group of half a dozen people could do? Or two dozen? "Create a great prediction market platform" falls into this category; what else?

I read this a year or two ago, tucked it in the back of my mind, and continued with life.

When I reread it today, I suddenly realized oh duh, I’ve been banging my head against this on X for months


This is close to my experience. Constructing a narrative from hazy memories:

First read: "Oh, some nitpicky stuff about metaphors, not really my cup of tea". *Just skims through*

Second read: "Okay it wasn't just metaphors. Not that I really get it; maybe the point about different people making different amounts of distinctions is good"

Third read (after reading S... (read more)

I view this post as providing value in three (related) ways:

  1. Making a pedagogical advancement regarding the so-called inner alignment problem
  2. Pointing out that a common view of "RL agents optimize reward" is subtly wrong
  3. Pushing for thinking mechanistically about cognition-updates


Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn't truly comprehend it - sure, I could parrot back terms like "base optimizer" and "mesa-optimizer", but it didn't click. I was confused.

Some mon... (read more)

Just now saw this very thoughtful review. I share a lot of your perspective, especially: and

Part 4/4 - Concluding comments on how to contribute to alignment

In part 1 I talked about object-level belief changes, in part 2 about how to do research and in part 3 about what alignment research looks like.

Let me conclude by saying things that would have been useful for past-me about "how to contribute to alignment". As in past posts, my mode here is "personal musings I felt like writing that might accidentally be useful to others".

So, for me-1-month-ago, the bottleneck was "uh, I don't really know what to work on". Let's talk about that.

First of all, ex... (read more)

Part 3/4 - General uptakes

In my previous two shortform posts I've talked about some object-level belief changes about technical alignment and some meta-level thoughts about how to do research, both of which were prompted by starting in an alignment program.

Let me here talk about some uptakes from all this.

(Note: As with previous posts, this is "me writing about my thoughts and experiences in case they are useful to someone", putting in relatively low effort. It's a conscious decision to put these in shortform posts, where they are not shoved to everyone's fac... (read more)

Yeah, I definitely grant that there are insights in the things I'm criticizing here. E.g. I was careful to phrase this sentence in this particular way:

The talk about iterative designs failing can be interpreted as pushing away from empirical sources of information.

Because yep, I sure agree with many points in the "Worlds Where Iterative Design Fails". I'm not trying to imply the post's point was "empirical sources of information are bad" or anything. 

(My tone in this post is "here are bad interpretations I've made, watch out for those" instead of "let... (read more)

Part 2 - rant on LW culture about how to do research

Yesterday I wrote about my object-level updates resulting from me starting on an alignment program. Here I want to talk about a meta point on LW culture about how to do research.

Note: This is about "how things have affected me", not "what other people have aimed to communicate". I'm not aiming to pass other people's ITTs or present the strongest versions of their arguments. I am rant-y at times. I think that's OK and it is still worth it to put this out.

There's this cluster of thoughts in LW that includes... (read more)

On being able to predictably publish papers as a malign goal, one point is standards of publishability in existing research communities not matching what's useful to publish for this particular problem (which used to be the case more strongly a few years ago). Aiming to publish for example on LessWrong fixes the issue in that case, though you mostly won't get research grants for that. (The other point is that some things shouldn't be published at all.) In either case, I don't see discouragement from building on existing work, it's not building arguments out of nothing when you also read all the things as you come up with your arguments. Experimental grounding is crucial but not always possible, in which case giving up on the problem and doing something else doesn't help with solving this particular problem, other than as part of the rising tide of basic research that can't be aimed.
3 · the gears to ascension · 2mo
[edit: pinned to profile] agreed on all points. and, I think there are kernels of truth from the things you're disagreeing-with-the-implications-of, and those kernels of truth need to be ported to the perspective you're saying they easily are misinterpreted as opposing. something like, how can we test the hard part first? compare also physics - getting lost doing theory when you can't get data does not have a good track record in physics despite how critically important theory has been in modeling data. but you also have to collect data that weighs on relevant theories so hypotheses can be eliminated and promising theories can be refined. machine learning typically is "make number go up" rather than "model-based" science, in this regard, and I think we do need to be doing model-based science to get enough of the right experiments. on the object level, I'm excited about ways to test models of agency using things like particle lenia and neural cellular automata. I might even share some hacky work on that at some point if I figure out what it is I even want to test.

Thanks for the response. (Yeah, I think there's some talking past each other going on.)

On further reflection, you are right about the update one should make about a "really hard to get it to stop being nice" experiment. I agree that it's Bayesian evidence for alignment being sticky/stable vs. being fragile/sensitive. (I do think it's also the case that "AI that is aligned half of the time isn't aligned" is a relevant consideration, but as the saying goes, "both can be true".)

Showing that nice behavior is hard to train out, would be bad news?

My point is not... (read more)

A local comment to your second point (i.e. irrespective of anything else you have said).

Second, suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said "This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren't able to uproot it. Alignment is extremely stable once achieved" 

As I understand it, the point here is that your ... (read more)

No, this doesn't seem very symmetry breaking and it doesn't invalidate my point. The hypothetical experiment would still be Bayesian evidence that alignment is extremely stable; just not total evidence (because the alignment wasn't shown to be total in scope, as you say). Similarly, this result is not being celebrated as "total evidence." It's evidence of deceptive alignment being stable in a very small set of situations. For deceptive alignment to matter in practice, it has to occur in enough situations for it to practically arise and be consistently executed along. In either case, both results would indeed be (some, perhaps small and preliminary) evidence that good alignment and deceptive alignment are extremely stable under training.  Showing that nice behavior is hard to train out, would be bad news? We in fact want nice behavior (in the vast majority of situations)! It would be great news if benevolent purposes were convergently drilled into AI by the data. (But maybe we're talking past each other.)

I've recently started in a research program for alignment. One outcome among many is that my beliefs and models have changed. Here I outline some ideas I've thought about.

The tone of this post is more like

"Here are possibilities and ideas I hadn't really realized exist. I'm still uncertain about how important they are, but it seems worth thinking about them. (Maybe these points are useful to others as well?)"

than

"These new hypotheses I encountered are definitely right."

This ended up rather long for a shortform post, but still posting it here as it's... (read more)

Saying that someone is "strategically lying" to manipulate others is a serious claim. I don't think that you have given any evidence for lying over "sincerely has different beliefs than I do" in this comment (which, to be clear, you might not have even attempted to do - just explicitly flagging it here).

I'm glad other people are poking at this, as I didn't want to be the first person to say this.

Note that the author didn't make any claims about Eliezer lying.

(In case one thinks I'm nitpicky, I think that civil communication involves making the line between "... (read more)

Well, actually I suspect that most other humans' EVs would be even more disgusting to me. People on this site underestimate the extent to which they live in a highly filtered bubble of high-IQ middle-aged-to-young westerners which is a very atypical part of humanity. Most humans are nonwestern and most humans have an IQ below about 80.
Well, there is other evidence I have but it is sensitive. A number of prominent people have confided to me that they think that human extrapolated volitions will probably have significant conflicts, but they don't want to say that because it is unstrategic to point out to your allies that they have latent conflicts. Human-EV-value-conflicts is a big infohazard.

FWIW, I thought about this for an hour and didn't really get anywhere. This is totally not my field though, so I might be missing something easy or well-known.

My thoughts on the subject are below in case someone is interested.

Problem statement:

"Is it possible to embed the  points  to  (i.e. construct a function ) such that, for any XOR-subset  of , the set  and its complement  are linearly separable (i.e. there is a hyperplane separating t... (read more)

2 · Donald Hobson · 2mo
The connection to features is that if the answer is no, there is no possible way the network could have arbitrary X-or combos of features that are linearly represented. It must be only representing some small subset of them. (probably the xor's of 2 or 3 features, but not 100 features.) Also, your maths description of the question matches what I was trying to express. 

When you immediately switch into the next topic, as in your example apology above, it looks like you're trying to distract from the fact that you were wrong


Yep. Reminds me of the saying "everything before the word 'but' is bullshit". This is of course not universally true, but it often has a grain of truth. Relatedly, I remember seeing writing advice that went like "keep in mind that the word 'but' negates the previous sentence".

I've made a habit of noticing my "but"s in serious contexts. Often I rephrase my point so that the "but" is not needed. This seems especially useful for apologies, as there is more focus on sincerity and more reading between the lines going on.

I looked at your post and bounced off the first time. To give a concrete reason, there were a few terms I wasn't familiar with (e.g. L-Theanine, CBD Oil, L-Phenylalanine, Bupropion, THC oil), but I think it was overall some "there's an inferential distance here which makes the post heavy for me". What also made the post heavy was that there were lots of markets - which I understand makes conceptual sense, but makes it heavy nevertheless.

I did later come back to the post and did trade on most of the markets, as I am a big fan of prediction markets and also ... (read more)

Thank you so much for trading on the markets! I guess I should've just said "effect size", and clarified in a footnote that I mean Cohen's d. And if the nootropics post was too statistics-heavy for someone with a math background, I probably need to tone it down/move it to an appendix. I think I can have quality of operationalization if I'm willing to be sloppy in the general presentation (as people probably don't care as much whether I use Cohen's d or Hedges' g or whatever).

Bullet points of things that come to mind:

  • I am a little sad about the lack of Good Ol' Rationality content on the site. Out of the 14 posts on my frontpage, 0 are about this. [I have Rationality and World Modeling tags at +10.]
    • It has been refreshing to read recent posts by Screwtape (several), and I very much enjoyed Social Dark Matter by Duncan Sabien. Reading these I got the feeling "oh, this is why I liked LessWrong so much in the first place".
    • (Duncan Sabien has announced that he likely won't post on LessWrong anymore. I haven't followed the drama here to
... (read more)
I think there are a lot of old posts that don't get read. I'm most drawn to the Latest Posts because that's where the social interaction via commenting is. LessWrong is quite tolerant of comments on old posts, but they don't get as much engagement. It's too diffuse to be self-sustaining, but I feel like the newcomers are missing out on that in the core material. What can we do about that? Maybe someone else has a better idea, but I think I'd like to see an official community readthrough of the core sequences (at least RAZ, Codex, and old Best Of) pinned in the Latest Posts area so it actually gets engagement from newcomers. Maybe they should be copies so the comments start out empty, but with some added language encouraging newcomers to engage.

(Duncan Sabien has announced that he likely won't post on LessWrong anymore. [...] I feel like LessWrong is losing a lot here: Sabien is clearly a top rationality writer.)

I think that Duncan writing on his own blog and us linking the good posts from LW may be the best solution for both sides. (Duncan approves of being linked.)

I think your posts have been among the very best I have seen on LessWrong or elsewhere. Thank you for your contribution. I understand, dimly from the position of an outsider but still, I understand your decision, and am looking forward to reading your posts on your substack.

I agree. Let me elaborate, hopefully clarifying the post to Viliam (and others).

Regarding the basics of rationality, there's this cluster of concepts that includes "think in distributions, not binary categories", "Distributions Are Wide, wider than you think", selection effects, unrepresentative data, filter bubbles and so on. This cluster is clearly present in the essay. (There are other such clusters present as well - perhaps something about incentive structures? - but I can't name them as well.)

Hence, my reaction reading this essay was "Wow, what a sick... (read more)

Once upon a time I stumbled upon LessWrong. I read a lot of the basic material. At the time I found it to be worldview-changing. I also read a classic post with the quote

“I re-read the Sequences”, they tell me, “and everything in them seems so obvious. But I have this intense memory of considering them revelatory at the time.” 

and thought "Huh, they are revelatory. Let's see if that happens to me".

(And guess what?)

There are these moments where I notice that something has changed. I remember reading some comment like "Rationalists have this typical m... (read more)

1. Investigate (randomly) modularly varying goals in modern deep learning architectures.


I did a small experiment regarding this. Short description below.

I basically followed the instructions given in the section: I trained a neural network on pairs of digits from the MNIST dataset. These two digits were glued together side-by-side to form a single image. I just threw something up for the network architecture, but the second-to-last layer had 2 nodes (as in the post).

I had two different types of loss functions / training regimes:

  • mean-square-error,
... (read more)

Here is a related paper on "how good are language models at predictions", also testing the abilities of GPT-4: Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament.

Portion of the abstract:

To empirically test this ability, we enrolled OpenAI's state-of-the-art large language model, GPT-4, in a three-month forecasting tournament hosted on the Metaculus platform. The tournament, running from July to October 2023, attracted 843 participants and covered diverse topics including Big Tech, U.S. politics, viral outbreaks,

... (read more)
They don't cite the de-calibration result from the GPT-4 paper, but the distribution of GPT-4's ratings here looks like it's been tuned to be mealy-mouthed: humped at 60%, so it agrees with whatever you say but then can't even do so enthusiastically.

The part about the Kelly criterion that has most attracted me is this:

That thing is that betting Kelly means that with probability 1, over time you'll be richer than someone who isn't betting Kelly. So if you want to achieve that, Kelly is great.

So with more notation, P(money(Kelly) > money(other)) tends to 1 as time goes to infinity (where money(policy) is the random score given by a policy).

This sounds kinda like strategic dominance - and you shouldn't use a dominated strategy, right? So you should Kelly bet!

The error in this reasoning is the "sounds ... (read more)

Yes, but there's an additional thing I'd point out here, which is that at any finite timestep, Kelly does not dominate. There's always a non-zero probability that you've lost every bet so far. When you extend the limit to infinity, you run into the problem "probability zero events can't necessarily be discounted" (though in some situations it's fine to), which is the one you point out; but you also run into the problem "the limit of the probability distributions given by Kelly betting is not itself a probability distribution".
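
Both points can be checked exactly in a toy setting. The sketch below assumes a repeated even-money bet won with probability p, two bettors staking fixed fractions of their wealth on the same sequence of outcomes, and illustrative numbers (p = 0.6, a rival fraction of 0.5). None of these specifics come from the discussion above. It computes the probability that the Kelly bettor is ahead after n bets, and the probability that every bet so far has been lost.

```python
from math import comb, log


def kelly_fraction(p):
    """Kelly fraction for an even-money bet won with probability p (assumes p > 1/2)."""
    return 2 * p - 1


def prob_first_beats_second(p, f, g, n):
    """Exact P(wealth with fraction f > wealth with fraction g) after n shared bets.

    Both bettors face the same outcomes; requires 0 < f < 1 and 0 < g < 1.
    """
    total = 0.0
    for k in range(n + 1):  # k = number of wins among the n bets
        log_wealth_f = k * log(1 + f) + (n - k) * log(1 - f)
        log_wealth_g = k * log(1 + g) + (n - k) * log(1 - g)
        if log_wealth_f > log_wealth_g:
            # Binomial probability of exactly k wins.
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


def prob_all_losses(p, n):
    """Probability of losing all n bets; positive for every finite n."""
    return (1 - p) ** n


if __name__ == "__main__":
    p = 0.6
    f = kelly_fraction(p)
    for n in (10, 100, 1000):
        print(n, prob_first_beats_second(p, f, 0.5, n), prob_all_losses(p, n))
```

In this toy model the first probability climbs towards 1 as n grows, while the second stays strictly positive at every finite n. That is the gap between "ahead with probability tending to 1" and actual dominance.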

One of my main objections to Bayesianism is that it prescribes that ideal agent's beliefs must be probability distributions, which sounds even more absurd to me.


From one viewpoint, I think this objection is satisfactorily answered by Cox's theorem - do you find it unsatisfactory (and if so, why)?

Let me focus on another angle though, namely the "absurdity" and gut level feelings of probabilities.

So, my gut feels quite good about probabilities. Like, I am uncertain about various things (read: basically everything), but this uncertainty comes in degrees... (read more)

Open for any of the roles A, B, C. I should have a flexible schedule during my waking hours (around GMT+0). Willing to play for even long times, say a month (though in that case I'd be thinking about "hmm, could we get more quantity in addition to quality"). Elo probably around 1800.

Devices and time to fall asleep: a small self-experiment

I did a small self-experiment on the question "Does the use of devices (phone, laptop) in the evening affect the time taken to fall asleep?".


On each day during the experiment I went to sleep at 23:00. 

At 21:30 I randomized what I'll do at 21:30-22:45. Each of the following three options was equally likely:

  • Read a physical book
  • Read a book on my phone
  • Read a book on my laptop

At 22:45-23:00 I brushed my teeth etc. and did not use devices at this time.

Time taken to fall asleep was measured by a sm... (read more)

Iteration as an intuition pump

I feel like many game/decision theoretic claims are most easily grasped when looking at the iterated setup:

Example 1. When one first sees the prisoner's dilemma, the argument that "you should defect because whatever the other person does, you are better off defecting" feels compelling. The counterargument goes "the other person can predict what you'll do, and this can affect what they'll play".

This has some force, but I have had a hard time really feeling the leap from "you are a person who does X in the dilemma" to "the... (read more)

Well written! I think this is the best exposition of non-causal decision theory I've seen. I particularly found the modified Newcomb's problem and the point it illustrates in the "But causality!" section to be enlightening.

How I generate random numbers without any tools: come up with a sequence of ~5 digits, take their sum and look at its parity/remainder. (Alternatively, take ~5 words and do the same with their lengths.) I think I'd pretty quickly notice a bias in using just a single digit/word, but taking many of them gives me something closer to a uniform distribution.
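
The intuition that summing several digits washes out bias can be checked exactly, under the strong assumption that the digits are independent draws from the same biased distribution (human-chosen digits need not be). The bias used in the usage lines, odd digits chosen 70% of the time, is a made-up number for illustration.

```python
from fractions import Fraction


def sum_mod_distribution(digit_probs, k, m):
    """Exact distribution of (d_1 + ... + d_k) mod m for i.i.d. digits.

    digit_probs maps each digit to its probability (as a Fraction).
    Returns a list whose r-th entry is P(sum mod m == r).
    """
    dist = [Fraction(0)] * m
    dist[0] = Fraction(1)  # the empty sum is 0
    for _ in range(k):
        new = [Fraction(0)] * m
        for r, pr in enumerate(dist):
            for d, pd in digit_probs.items():
                new[(r + d) % m] += pr * pd
        dist = new
    return dist


if __name__ == "__main__":
    # Hypothetical bias: each odd digit has probability 7/50, each even digit 3/50.
    biased = {d: Fraction(7, 50) if d % 2 else Fraction(3, 50) for d in range(10)}
    for k in (1, 2, 5):
        even, odd = sum_mod_distribution(biased, k, 2)
        print(k, float(even), float(odd))
```

For parity, the deviation from 1/2 shrinks geometrically in the number of digits, which is why a handful of digits already gets close to a fair coin under this assumption.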

Also, note that your "More than two options" method is non-uniform when the number of sets is not a power of two. E.g. with three sets the probabilities are 1/2, 1/4 and 1/4.

Epistemic responsibility

"You are responsible for you having accurate beliefs."

Epistemic responsibility refers to the idea that it is on you to have true beliefs. The concept is motivated by the following two applications.


In discussions

Sometimes in discussions people are in a combative "1v1 mode", where they try to convince the other person of their position and defend their own position, in contrast to a cooperative "2v0 mode" where they share their beliefs and try to figure out what's true. See the soldier mindset vs. the scout mindset.

This may be f... (read more)

This survey is really good!

Speaking as someone who's exploring the AI governance landscape: I found the list of intermediate goals, together with the responses, a valuable compilation of ideas. In particular it made me appreciate how large the surface area is (in stark contrast to takes on how progress in technical AI alignment doesn't scale). I would definitely recommend this to people new to AI governance.

Glad to hear that! I do feel excited about this being used as a sort of "201 level" overview of AI strategy and what work it might be useful to do. And I'm aware of the report being included in the reading lists / curricula for two training programs for people getting into AI governance or related work, which was gratifying. Unfortunately we did this survey before ChatGPT and various other events since then, which have majorly changed the landscape of AI governance work to be done, e.g. opening various policy windows. So I imagine people reading this report today may feel it has some odd omissions / vibes. But I still think it serves as a good 201 level overview despite that. Perhaps we'll run a followup in a year or two to provide an updated version. 

For coordination purposes, I think it would be useful for those who plan on submitting a response to mark that they'll do so, and perhaps tell a little about the contents of their response. It would also be useful for those who don't plan on responding to explain why not.

The majority of my response is about reducing our systems' exposure to vulnerabilities. As a believer in the power of strong cryptography no matter the intelligence involved, I am going to explain the value of removing or spinning down the NSA/CIA program of back-doors, zero-day exploits, and intentional cryptographic weaknesses that have been introduced into our hardware and software infrastructure.

The last paragraph stood out to me (emphasis mine).

Second, we believe it would be unintuitively risky and difficult to stop the creation of superintelligence. Because the upsides are so tremendous, the cost to build it decreases each year, the number of actors building it is rapidly increasing, and it’s inherently part of the technological path we are on, stopping it would require something like a global surveillance regime, and even that isn’t guaranteed to work. So we have to get it right.

There are efforts in AI governance that definitely don't look like... (read more)

When they say stopping I think they refer to stopping it forever, instead of slowing down, regulating and even pausing development. Which I think is something pretty much everyone agrees on.

Regarding betting odds: are you aware of this post? It gives a betting algorithm that satisfies both of the following conditions:

  • Honesty: participants maximize their expected value by reporting their probabilities honestly.
  • Fairness: participants' (subjective) expected values are equal.

The solution is "the 'loser' pays the 'winner' the difference of their Brier scores, multiplied by some pre-determined constant C". This constant C puts an upper bound on the amount of money you can lose. (Ideally C should be fixed before bettors give their odds, becaus... (read more)

4Daniel Kokotajlo10mo
I was not aware, but I strongly suspected that someone on LW had asked and answered the question before, hence why I asked for help. Prayers answered! Thank you! Connor, are you OK with Scott's algorithm, using C = $100?

On premature advice

Here's a pattern I've recognized - all examples are based on real events.

Scenario 1. Starting to exercise

Alice: "I've just started working out again. I've been doing blah for X minutes and then blah blah for Y minutes."

Bob: "You shouldn't exercise like that, you'll injure yourself. Here's what you should be doing instead..."

Result: Alice stops exercising.

Scenario 2. Starting to invest

Alice: "Everyone around me tells me that investing is a good idea, so I'm now going to invest in index funds."

Bob: "You better know what you are doing. Don't i... (read more)

3Adam Zerner10mo
1. I think this is a great point. I appreciate the examples too. I often find it hard to come up with good examples, but at the same time I think good examples are super useful, and these are great examples.
2. For lifting weights, I personally have settled on just doing bench presses and leg presses because that's what actually triggers enough motivation in me. Other exercises I just don't enjoy nearly as much. I also find it much more motivating when I can get in and out in a half hour. When I was younger I would often have 90+ minute sessions and I'm just not motivated to do that anymore. I also only go about once every week (or two), which I'm OK with. There seem to be pretty big diminishing returns when it comes to strength training and I don't want to risk aiming for a 3x/week schedule, failing at it, and ending up going months without doing any strength training at all.
3. The failure mode you point out seems to me like a Valley of Bad Rationality. Normal people are not automatically strategic and don't jump to trying to figure out how you could optimize your exercise routine when you tell them you just started exercising. Rationalists are more strategic and probably make this jump too frequently. Being strategic is often a good thing, but here it's probably not.
When Bob can't tell the difference, Alice becomes a natural enemy of Carol, who wants pointers on technical details rather than generic encouragement.
1Pat Myron10mo
Lots of related concepts like

I feel like the post proves too much: it gives arguments for why foom is unlikely, but I don't see arguments which break the symmetry between "humans cannot foom relative to other animals" and "AI cannot foom relative to humans".* For example, the statements

brains are already reasonably pareto-efficient 


Intelligence requires/consumes compute in predictable ways, and progress is largely smooth.

seem irrelevant or false in light of the human-chimp example. (Are animal brains pareto-efficient? If not, I'm interested in what breaks the symmetry between ... (read more)

My model predicts superhuman AGI in general - just that it uses and scales predictably with compute. Dota 2 is only marginally more complicated than Go/chess; the world model is still very, very simple, as it can be simulated perfectly using just a low-end CPU core. Driving cars would be a good start. In terms of game worlds there is probably nothing remotely close: it would need to be obviously 3D and very open-ended, with extremely complex physics and detailed realistic graphics, populated with humans and/or advanced AI (I've been out of games for a while and I'm not sure what that game currently would be, but it probably doesn't exist yet).
In the section "Seeking true Foom", the post argues that humans foomed because of culture, which none of the animals before us had. IMO, this invalidates the arguments in the first half of your comment (though not necessarily your conclusions).

My thoughts on the "Humans vs. chimps" section (which I found confusing/unconvincing):

Chimpanzees have brains only ~3x smaller than humans, but are much worse at making technology (or doing science, or accumulating culture…). If evolution were selecting primarily or in large part for technological aptitude, then the difference between chimps and humans would suggest that tripling compute and doing a tiny bit of additional fine-tuning can radically expand power, undermining the continuous change story.

But chimp evolution is not primarily selecting for makin

... (read more)

Our planet is full of groups of power-seekers competing against each other. Each one of them could cooperate (join in the moratorium), defect (publicly refuse), or stealth-defect (proclaim that they're cooperating while stealthily defecting). The call for a moratorium amounts to saying to every one of those groups "you should choose to lose power relative to those who stealth-defect". It doesn't take much decision theory to predict that the result will be a covert arms race conducted in a climate of fear by the most secretive and paranoid among the power gro

... (read more)

Inspired by the "reward chisels cognition into the agent's network" framing from Reward is not the optimization target, I thought: is reward necessarily a fine enough tool? More elaborately: if you want the model to behave in a specific way or to have certain internal properties, can you achieve this simply by choosing a suitable reward function?

I looked at two toy cases, namely Q-learning and training a neural network (the latter of which is not actually reinforcement learning but supervised learning). The answers were "yep, suitable reward/loss (and ... (read more)

Feature suggestion: Allow one to sort a user's comments by the number of votes.

Context: I saw a comment by Paul Christiano, and realized that probably a significant portion of the views expressed by a person lie in comments, not top-level posts. However, many people (such as Christiano) have written a lot of comments, so sorting them would allow one to find more valuable comments more easily.

I don't agree, but for a separate reason from trevor. Highly-upvoted posts are a signal of what the community agrees with or disagrees with, and I think being able to more easily track down karma would cause reddit-style internet-points seeking. How many people are hooked on Twitter likes/view counts? Or "ratio'd". Making it easier to track these stats would be counterproductive, imo.
I retracted my signature, and I will edit my top post.

Ah, I misunderstood the content of the original tweet - I didn't register that the model indeed had access to lots of data in other languages as well. In retrospect I should have been way more shocked if this wasn't the case. Thanks.

I then agree that it's not too surprising that the instruction-following behavior is not dependent on language, though it's certainly interesting. (I agree with Habryka's response below.)

I feel like this answer glosses over the fact that the encoding changes. Surely you can find some encodings of instructions such that LLMs cannot follow instructions in that encoding. So the question lies in why learning the English encoding also allows the model to learn (say) German encodings.

No? We already know that the model can competently respond in German. Once you condition on the model competently responding in other languages (e.g. for translation tasks) there is no special question about why it follows instructions in other languages as well. "Why are LLMs capable of translation?" might be an interesting question, but if you're not asking that question, then I don't understand why you're asking this. My position is that this isn't a special capability that warrants any explanation that isn't covered in an explanation of why/how LLMs are competent translators.

The fair-goers, having knowledge of oxen, had no bias in their guesses


[EDIT: I read this as "having no knowledge of oxen" instead of "having knowledge of oxen" - is this what you meant? The comment seems relevant nevertheless.]

This does not follow: It is entirely possible that the fair-goers had no specific domain knowledge of oxen, while still having biases arising from domain-general reasoning. And indeed, they probably knew something about oxen -- from Jaynes' Probability Theory:

The absurdity of the conclusion [that polling billion people tells the

... (read more)

Minor suggestion: I would remove the caps from the title. Reason: I saw this linked below Christiano's post, and my snap reaction was that the post is [angry knee-jerk response to someone you disagree with] rather than [thoughtful discussion and disagreement]. Only after introspection did I read this post.

Thanks--noted. I see your point -- I definitely don't want people to think this is an angry response. Especially since I explicitly agree with Paul in the post. But since this post has been up for a bit including on the EA forum, I'll shy away from changes.

I found janus's post Simulators to address this question very well. Much of AGI discussion revolves around agentic AIs (see the section Agentic GPT for discussion of this), but this does not model large language models very well. janus suggests that one should instead think of LLMs such as GPT-3 as "simulators". Simulators are not very agentic themselves or well described as having a utility function, though they may create simulacra that are agentic (e.g. GPT-3 writes a story where the main character is agentic).

A relevant passage from Simulators: This makes the same point as cfoster0's comment on this post - and that self-supervised learning is a method of AI specification that does not require "choosing a utility function", even implicitly, since the resulting policy won't necessarily be well-described as a utility maximizer at all.

A couple of examples from quadratic residue patterns modulo primes:

Example 1. Let p be a large prime. How many elements x are there such that both x and x+1 are quadratic residues?

Since half of the elements mod p are quadratic residues and the events "x is a QR" and "x+1 is a QR" look like they are independent, a reasonable guess is p/4. This is the correct main term, but what about the error? A natural square-root error term is not right: one can show that the error is O(1), the error... (read more)
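A quick way to sanity-check the heuristic in Example 1 numerically is to count the pairs directly. The sketch below assumes the pair in question is x and x+1 with both nonzero mod p, and uses Euler's criterion to test residuosity; the function names and the sample primes are my own choices.

```python
def is_qr(x: int, p: int) -> bool:
    """True if x is a nonzero quadratic residue mod the odd prime p (Euler's criterion)."""
    return x % p != 0 and pow(x, (p - 1) // 2, p) == 1


def count_consecutive_qr(p: int) -> int:
    """Number of x in 1..p-2 such that both x and x+1 are quadratic residues mod p."""
    return sum(1 for x in range(1, p - 1) if is_qr(x, p) and is_qr(x + 1, p))


if __name__ == "__main__":
    for p in (7, 13, 23, 10007):
        count = count_consecutive_qr(p)
        # Compare the exact count with the independence guess p/4.
        print(p, count, count - p / 4)
```

For the small primes one can confirm the counts by hand by listing the squares mod p, and in each case the gap to p/4 stays bounded rather than growing like a square root.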
