6,264 posts were written in 2023. 

662 of them got at least one positive Review Vote.

209 of them got at least one review, and a positive Review Vote total.

50 of them shall be displayed in the Best of LessWrong, Year 2023.

The redesigned OpenAI Safety page seems to imply that "the issues that matter most" are:

  * Child Safety
  * Private Information
  * Deep Fakes
  * Bias
  * Elections
How do you work w/ sleep consolidation?

Sleep consolidation/ "sleeping on it" is when you struggle w/ [learning a piano piece], sleep on it, and then you're suddenly much better at it the next day!

This has happened to me for piano, dance, math concepts, video games, & rock climbing, but it varies in effectiveness. Why? Is it:

  1. Duration of struggling activity
  2. Amount of attention paid to activity
  3. Having a frustrating experience
  4. Time of day (e.g. right before sleep)

My current guess is a mix of all four. But I'm unsure: if you [practice piano] in the morning, you'll need to remind yourself about the experience before you fall asleep, which also implies that you can only really consolidate one thing a day.

The decision relevance for me: I tend to procrastinate on heavy maths papers, but it might be more efficient to spend 20 min on each of 3 days actually trying to understand one, sleeping on it each time, than to spend 1 hour in 1 day. This would be neat because it's easier for me to convince myself to really struggle w/ a hard topic for [20] minutes, knowing I'm getting an equivalent [40] minutes if I sleep on it.
StefanHex
Major edit: I've made a mistake in my explained-variance calculation which may partially or fully explain the effect I'm seeing. I'm re-computing the numbers now. Thanks a lot to @JoshEngels for pointing this out!

TL;DR: K-means explains about as much (or more) variance in the activations as SAEs do.

Edit: Epistemic status: This is a weekend experiment I ran a while ago and I figured I should write it up to share. I have taken decent care to check my code for silly mistakes and "shooting myself in the foot", but these results are not vetted to the standard of a top-level post / paper.

SAEs explain most of the variance in activations. Is this alone a sign that activations are structured in an SAE-friendly way, i.e. that activations are indeed a composition of sparse features like the superposition hypothesis suggests? I'm asking myself this question since I initially considered this pretty solid evidence: SAEs do a pretty impressive job compressing 512 dimensions into ~100 latents, and this ought to mean something, right? But maybe all SAEs are doing is "dataset clustering" (the data is cluster-y and SAEs exploit this)---then a different sensible clustering method should also be able to perform similarly well!

I took this SAE graph from Neuronpedia, and added a K-means clustering baseline. Think of this as pretty equivalent to a top-k SAE (with k=1). In fact I use the K-means algorithm to find "feature vectors" and then encode / decode the activations just like I would in an SAE (I'm not using the (non-linear) "predict" method of K-means).

It turns out that even clustering (essentially L_0=1) explains up to 90% of the variance in activations, being matched only by SAEs with L_0>100. This isn't an entirely fair comparison, since SAEs are optimised for the large-L_0 regime, while I haven't found an L_0>1 operationalisation of clustering that meaningfully improves over L_0=1. To have some comparison I'm adding a PCA + Clustering baseline where I apply a PCA before
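For readers who want to poke at the baseline themselves, here is a minimal sketch of the idea. It is not Stefan's actual code (his encode/decode step differs, since he avoids K-means' non-linear `predict` step): it just illustrates the simplest L_0=1 variant, where each activation is reconstructed as its nearest centroid, and reports the fraction of variance explained with a mean-subtracted denominator (matching the FVU convention discussed later in this thread). Shapes, cluster counts, and the random stand-in data are placeholders.

```python
# Hypothetical sketch: K-means as an L0=1 reconstruction baseline for activations.
# Assumes `acts` is an (n_tokens, d_model) array of residual-stream activations.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_variance_explained(acts: np.ndarray, n_clusters: int = 4096, seed: int = 0) -> float:
    """Fit K-means, reconstruct each activation as its nearest centroid,
    and return the fraction of variance explained (1 - FVU)."""
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(acts)
    recon = km.cluster_centers_[km.labels_]          # nearest-centroid reconstruction, L0 = 1
    resid = ((acts - recon) ** 2).sum()
    total = ((acts - acts.mean(axis=0)) ** 2).sum()  # mean-subtracted denominator
    return 1.0 - resid / total

if __name__ == "__main__":
    # Isotropic random data is only a runnable placeholder; it will score near zero,
    # whereas real residual-stream activations are far more clustered.
    rng = np.random.default_rng(0)
    fake_acts = rng.normal(size=(10_000, 512)).astype(np.float32)
    print(f"variance explained: {kmeans_variance_explained(fake_acts, n_clusters=256):.3f}")
```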
Nisan
Google's AI principles used to say:

On 2025-02-04, Google removed these four commitments. The updated principles seem consistent with making weapons, causing net harm, violating human rights, etc. As justification, James Manyika and Demis Hassabis said:
leogao
when i was new to research, i wouldn't feel motivated to run any experiment that wouldn't make it into the paper. surely it's much more efficient to only run the experiments that people want to see in the paper, right? now that i'm more experienced, i mostly think of experiments as something i do to convince myself that a claim is correct. once i get to that point, actually getting the final figures for the paper is the easy part. the hard part is finding something unobvious but true. with this mental frame, it feels very reasonable to run 20 experiments for every experiment that makes it into the paper.

Popular Comments

Recent Discussion

(Many of these ideas developed in conversation with Ryan Greenblatt)

In a shortform, I described some different levels of resources and buy-in for misalignment risk mitigations that might be present in AI labs:

*The “safety case” regime.* Sometimes people talk about wanting to have approaches to safety such that if all AI developers followed these approaches, the overall level of risk posed by AI would be minimal. (These approaches are going to be more conservative than will probably be feasible in practice given the amount of competitive pressure, so I think it’s pretty likely that AI developers don’t actually hold themselves to these standards, but I agree with e.g. Anthropic that this level of caution is at least a useful hypothetical to consider.) This is the level of caution people

...
Rohin Shah

I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.

This is a case where I might be over indexing from experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly disagree with it.

Crossposted from my blog which many people are saying you should check out! 

The real story behind the viral photo of cheetahs preying on an impala

 

Imagine that you came across an injured deer on the road. She was in immense pain, perhaps having been mauled by a bear or seriously injured in some other way. Two things are obvious:

  1. If you could greatly help her at small cost, you should do so.
  2. Her suffering is bad.

In such a case, it would be callous to say that the deer’s suffering doesn’t matter because it’s natural. Things can both be natural and bad—malaria certainly is. Crucially, I think in this case we’d see something deeply wrong with a person who thinks that it’s not their problem in any way, that helping the deer is of no value. Intuitively, we recognize that wild animals matter!

But...

quila

Pollywogs (the larval form of frogs, after eggs and before growing legs) are an example where huge numbers are produced and many die before they ever grow into frogs, but from their perspective they probably have many, many minutes of happy growth, having been born into a time and place where quick growth is easy: watery and full of food.

Consider an alien species which requires oxygen, but for whom it was scarce during evolution, and so they were selected to use it very slowly and seek it ruthlessly, and feel happy when they manage to find some. I... (read more)

omnizoid
First of all, the claim that wild animal suffering is serious doesn't depend on the claim that animals suffer more than they are happy. I happen to think human suffering is very serious, even though I think humans live positive lives.

Second, I don't think it's depressive bias infecting my judgments. I am quite happy--actually to a rather unusual degree. Instead, the reason to think that animals live mostly bad lives is that nearly every animal lives a very short life that culminates in a painful death on account of R-selection--if you live only ~a week, you don't have enough positive experiences to outweigh the badness of a painful death.

Regarding the claim that I should be speaking out against factory farming, um...I'm not sure if you've read the rest of my writing.

https://benthams.substack.com/p/factory-farming-delenda-est
https://benthams.substack.com/p/weve-created-hell-its-called-factory
https://benthams.substack.com/p/factory-farming-is-not-just-bad-its-35e?utm_source=publication-search
Milan W
I happen to care about animal suffering, and I am as baffled as you about the move of caring about animal suffering for explicitly anti-speciesist reasons yet dismissing wild animal suffering. Seems pretty inconsistent. Maybe it originates from a sort of wishful thinking? As in "looks intractable, therefore I wish it were unimportant, therefore it is".
G Wood
Agree with Dagon here: when omnizoid says "It's obvious that you should", they are calling on the rules of their own morality. It's similar with "Her suffering is bad"; that's a direct moral judgment. Both statements fall apart when you consider that someone may have different moral rules than you. For example, in NZ we have an issue with deer destroying our native bush, which in turn hurts our native birds. Deer are considered an invasive species and are actively eradicated. When you are actively in the presence of a hurting deer, empathy drives you to help; suffering is not pleasant to witness. However, I suspect that many NZers would condemn every deer in NZ to a painful death, as long as they didn't have to witness it, in order to save our trees and birdlife.

How do you work w/ sleep consolidation?

Sleep consolidation/ "sleeping on it" is when you struggle w/ [learning a piano piece], sleep on it, and then you're suddenly much better at it the next day!

This has happened to me for piano, dance, math concepts, video games, & rock climbing, but it varies in effectiveness. Why? Is it:

  1. Duration of struggling activity
  2. Amount of attention paid to activity
  3. Having a frustrating experience
  4. Time of day (e.g. right before sleep)

My current guess is a mix of all four. But I'm unsure if you [practice piano] in the morning, you... (read more)

niplav
For pleasure/insight helmets you probably need intervention in the form of brain stimulation (tDCS, tFUS, TMS). Biofeedback might help, but you need to at least know what to steer towards. I'm pretty skeptical of those numbers; all existing projects I know of don't have a better method of measurement than surveys, and that gets bitten hard by social desirability bias / not wanting to have committed a sunk cost. Seems relevant that Jhourney isn't doing much EEG & biofeedback anymore.
Logan Riggs
Huh, those brain stimulation methods might actually be practical to use now, thanks for mentioning them!

Regarding skepticism of survey data: if you're imagining it's only an end-of-the-retreat survey which asks "did you experience the jhana?", then yeah, I'll be skeptical too. But my understanding is that everyone has several meetings w/ instructors where a not-true-jhana/social lie wouldn't hold up against scrutiny. I can ask during my online retreat w/ them in a couple months.

In case you have not heard, there have been some recent and not-so-recent killings allegedly by people who have participated in aspiring rationalist spaces.

I've put some links in a footnote for those who want the basic info.[1]

I believe many people are thinking about this and how to orient. People have questions like:

  1. What has happened?
  2. What is likely to happen?
  3. What can be done to prevent more killings?
  4. How to relate to all of the widespread interest (e.g. from journalists)?

I hereby demarcate this thread as a place for people to ask and answer questions about this topic. 

Three virtues for this thread are (1) Taking care of yourself, (2) Courage, and (3) Informing.

Airlines have standard advice to put your own oxygen mask on first, before helping others. The reasoning being that if...

Viliam

From my experience, the rationality community in Vienna does not share any of the craziness I read about in the Bay Area, so yeah, it seems plausible that different communities will end up significantly different.

I think there is a strong founder effect... the new members will choose whether they join or not depending on how comfortable they feel among the existing members. Decisions like "we have these rules / we don't have any rules", "there are people responsible for organization and safety / everyone needs to take care of themselves" once established,... (read more)

Viliam
Despite my negative experience (gave an interview twice, was deliberately misquoted both times), I think there are some ways to mitigate the risk:

Check the articles the journalist wrote before. Do they include some careful thinking and nuance; do they present arguments both for and against, such as Scott Alexander's blogs? Then it's probably okay. Do they express the mainstream view, or do they align perfectly with the views of the owner of the news outlet? That means your words will be twisted until they fit the narrative (or twisted to sound idiotic, if that is not possible). Is it something like clickbait about science? Expect your words to be twisted for maximum clickbait.

Is the communication like "you say something, and then it's up to the journalist how he reports it"? That is the most dangerous way. Sometimes you can make a deal that you need to approve the written version before it gets published. Many journalists will refuse, using a convenient excuse (an internal policy, the need to meet a deadline).

Is the communication like "you talk in front of a camera, then the debate is published online"? Watch the previous videos. Are the interviewees talking for minutes, or are individual sentences cut out of their speeches? The latter seems dangerous. The former... there is still a risk of the journalist trying to put things in your mouth (see the "so what you're saying is" Jordan Peterson meme), but you can see whether they were aggressive this way in their previous interviews, and to a certain degree you can defend yourself if that happens. You could make a deal that you will record and publish your own version of the debate.

I think that most journalists are bad, but there are ways to filter them out. (I hesitate to write this, but I think that Joe Rogan has some qualities of a good journalist. I disapprove of his choice of interviewees, and that he often just gives a platform to horrible people without even slightly pushing back against them. But he is not guilt
Friendly Monkey
I see. Well, in case one of them reads this: to the extent it might matter to you, as a fellow radical vegan I request you not kill anyone trying to figure out how to make the first takeover-capable AI be benevolent.

(I'm aware that humans are such that it's not obvious that designer-chosen alignment → benevolence, and that one does actually have to reason about that instead of just assuming the hopeful option, but after having reasoned about it, I think most would choose benevolence in the end.[1] At least, I do not expect important exceptions to be alignment researchers, as opposed to selected-for power-seekers in high positions.)

1. ^ It seems to me that actually reflectively endorsing evil (objectively describing what's happening and then being like "this is good", instead of euphemizing and coping with arguments that often secretly amount to "if you're right then the world would be really bad, and that's scary", like "but nature has it") is rare. Most people are actually altruistic on some level, but conformism overrides that, and animal abuse is so normal that they usually don't notice it; yet when they see basic information like "cows have best friends and get stressed when they are separated" they seemingly get empathetic and reflective (check the comments).
Viliam
Thank you for your kind words, but I wouldn't bet on Ziz actually listening to them.

Finetuning could be an avenue for transmitting latent knowledge between models.  

As AI-generated text increasingly makes its way onto the Internet, it seems likely that we'll finetune AI on text generated by other AI. If this text contains opaque meaning - e.g. due to steganography or latent knowledge - then finetuning could be a way in which latent knowledge propagates between different models. 

Joey Yudelson
This is very interesting! One thing I'm confused about—I agree that the monitor would have a hard time getting the bits, based on these figures. But the introspective models also don't seem to get far above 50% accuracy, which means that right now they'd have a very hard time error-correcting enough that their comrade would understand them but not the overseer. If the claim is that introspection-tuned models will get much better at self-prediction, then it's not clear to me that other models won't also get much better at cross-prediction?
Cam
Thanks for throwing this idea out - I had a bit of extra time today and decided this would be a fun experiment to get up and running: https://github.com/camtice/llm-contact

I probably won't have too much more active thinking time for this project, but will be happy to set up and run any experiments that spark particular interest. 4o-mini struggled to conceptualize that there was an overseer, but when separate instances of o3-mini-medium played as Alice and Bob it went straight to some basic steganography attempts. Here are a couple of turns from a game that fooled gemini-1.5-flash as Dean with the hidden word "idea":

I haven't looked into this in detail, and I'm not actually sure how unique a situation this is. But, it updated me on "institutional changes to the US that might be quite bad[1]", and it seemed good if LessWrong folk did some sort of Orient Step on it.

(Please generally be cautious on LessWrong talking about politics. I am interested in people commenting here who have read the LessWrong Political Prerequisites sequence. I'll be deleting or at least unhesitatingly strong downvoting comments that seem to be doing unreflective partisan dunking)

((But, that's not meant to mean "don't talk about political actions." If this is as big a deal as it sounds, I want to be able to talk about "what to do?". But I want that talking-about-it to...

momom2

There are three traders on this market; it means nothing at the moment. No need for virtue signalling to explain a result you might perceive as abnormal, it's just not formed yet.

Martin Randall
There are public examples. These ones are famous because something went wrong, at least from a security perspective. Of course there are thousands of young adults with access to sensitive data who don't become spies or whistleblowers; we just don't hear about them.

  * Theodore Hall, who worked at age 18 on the Manhattan Project.
  * Edward Snowden, who worked from age 22 for the NSA.
  * Chelsea Manning, who worked from age 22 as a US Army intelligence analyst.
lc
Elon already has all of the money in the world. I think he and his employees are ideologically driven, and as far as I can tell they're making sensible decisions given their stated goals of reducing unnecessary spend/sprawl. I seriously doubt they're going to use this access to either raid the treasury or turn it into a personal fiefdom. It's possible that in their haste they're introducing security risks, but I also think the tendency of media outlets and their sources will be to exaggerate those security risks. I'd be happy to start a prediction market about this if a regular feels very differently. If Trump himself were spearheading this effort I would be more worried.
Martin Randall
I do see some security risk. Although Trump isn't spearheading the effort I expect he will have access to the results.
JoshEngels
I was having trouble reproducing your results on Pythia, and was only able to get 60% variance explained. I may have tracked it down: I think you may be computing FVU incorrectly.

https://gist.github.com/Stefan-Heimersheim/ff1d3b92add92a29602b411b9cd76cec#file-clustering_pythia-py-L309

I think FVU is correctly computed by subtracting the mean from each dimension when computing the denominator. See the SAEBench impl here:

https://github.com/adamkarvonen/SAEBench/blob/5204b4822c66a838d9c9221640308e7c23eda00a/sae_bench/evals/core/main.py#L566

When I used your FVU implementation, I got 72% variance explained; this is still less than you, but much closer, so I think this might be causing the improvement over the SAEBench numbers.

In general I think SAEs with low k should be at least as good as k-means clustering, and if they're not I'm a little bit suspicious (when I tried this first on GPT-2, it seemed that a TopK SAE trained with k = 4 did about as well as k-means clustering with the nonlinear argmax encoder). Here's my clustering code:

https://github.com/JoshEngels/CheckClustering/blob/main/clustering.py
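To make the distinction concrete, here is a small illustration of the two denominators being contrasted (my own sketch, not the SAEBench code; variable names are illustrative). With an activation distribution whose mean is far from zero, the two FVU numbers can differ substantially.

```python
# Illustrative comparison of the two FVU conventions discussed above.
# acts: (n_tokens, d_model) original activations; recon: reconstructions of the same shape.
import numpy as np

def fvu_no_centering(acts: np.ndarray, recon: np.ndarray) -> float:
    # Denominator is the raw sum of squares (no mean subtraction).
    return ((acts - recon) ** 2).sum() / (acts ** 2).sum()

def fvu_mean_subtracted(acts: np.ndarray, recon: np.ndarray) -> float:
    # Denominator subtracts the per-dimension mean, so FVU measures
    # unexplained variance rather than unexplained second moment.
    total = ((acts - acts.mean(axis=0)) ** 2).sum()
    return ((acts - recon) ** 2).sum() / total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.normal(loc=3.0, size=(1000, 16))            # nonzero mean makes the two diverge
    recon = acts + rng.normal(scale=0.5, size=acts.shape)  # stand-in "reconstruction"
    print(fvu_no_centering(acts, recon), fvu_mean_subtracted(acts, recon))
```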
StefanHex
You're right, I forgot to subtract the mean. Thanks a lot!! I'm computing new numbers now, but indeed I expect this to explain my result!

After adding the mean subtraction, the numbers haven't changed too much actually -- but let me make sure I'm using the correct calculation. I'm gonna follow your and @Adam Karvonen's suggestion of using the SAE bench code and loading my clustering solution as an SAE (this code).

v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=100000, n_clusters=4096, variance explained = 0.8469745517
v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=100000, n_clusters=16384, variance explained = 0.8537513018
v3 (KMeans): Layer blocks.4.hook_resid_post, n_tokens=10
... (read more)
JoshEngels
I just tried to replicate this on GPT-2 with expansion factor 4 (so total number of centroids = 768 * 4). I get that clustering recovers ~87% of the variance, while a k = 32 SAE gets more like 95% variance explained. I did the nonlinear version of finding nearest neighbors when using k-means, to give k-means the biggest advantage possible, and did the k-means clustering on the points using the FAISS clustering library.

Definitely take this with a grain of salt; I'm going to look through my code and see if I can reproduce your results on Pythia too, and if so try on a larger model too. Code: https://github.com/JoshEngels/CheckClustering/tree/main

Due to linguistic relativity, might it be possible to modify or create a system of communication in order to make its users more aware of its biases?

If so, do any projects to actually do this exist?

Viliam

Could you please ask about the specific examples of the Esperanto words? (I speak Esperanto.)

I think a similar example would be the adjective "Russian" in English, which translates to Russian as two different words: "русский" (related to Russian ethnicity or language) or "российский" (related to Russia as a country, i.e. including the minorities who live there).

(That would be "rus-a" vs "rus-land-a / rus-i-a" in Esperanto.)

I noticed this in a video where a guy explained that "I am Rus-land-ian, not Rus-ethnic-ian", which could be expressed in English as "I... (read more)

Milan W
Not exactly what you were asking for, but maybe food for thought: what if we (somehow) mapped an LLM's latent semantic space into phonemes? What if we then composed tokenization with phonemization such that we had a function that could translate English to Latentese?
dirk
Neither of them is exactly what you're looking for, but you might be interested in lojban, which aims to be syntactically unambiguous, and Ithkuil, which aims to be extremely information-dense as well as to reduce ambiguity. With regards to logical languages (ones which, like lojban, aim for each statement to have a single possible interpretation), I also found Toaq and Eberban just now while looking up lojban, though these have fewer speakers.
KvmanThinking
Something like TNIL or Real Character might be used for maximum intellectual utility. But I cannot see how simply minimizing the amount of words that need to exist for compact yet precise communication would help correct the corrupted machinery our minds run on.

Summary:

This post outlines how a view we call subjective naturalism[1] poses challenges to classical Savage-style decision theory. Subjective naturalism requires (i) richness (the ability to represent all propositions the agent can entertain, including self-referential ones) and (ii) austerity (excluding events the agent deems impossible). It is one way of making precise certain requirements of embedded agency. We then present the Jeffrey–Bolker (JB) framework, which better accommodates an agent’s self-model and avoids forcing her to consider things she takes to be impossible.[2]
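As a quick formal anchor (standard Jeffrey–Bolker material, stated in notation of my choosing rather than necessarily the post's): in the JB framework, both probability P and desirability V are defined directly on propositions, and V satisfies the averaging axiom for incompatible propositions.

```latex
% Jeffrey's desirability (averaging) axiom:
% for propositions A, B with A \wedge B = \bot and P(A \vee B) > 0,
V(A \vee B) = \frac{P(A)\,V(A) + P(B)\,V(B)}{P(A) + P(B)}
```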

1. Subjective Naturalism: Richness & Austerity

A naturalistic perspective treats an agent as part of the physical world—just another system subject to the same laws. Among other constraints, we think this means:

  1. Richness: The model must include all the propositions the agent can meaningfully consider, including those about herself. If
...

I'm more worried about counterfactual mugging and transparent Newcomb. Am I right that you are saying "in the first iteration of transparent Newcomb, austere decision theory gets no more than $1000, but then learns that if it modifies its decision theory to be more UDT-like it will get more money in similar situations", turning it into something like son-of-CDT?