Best of LessWrong 2022

You might feel like AI risk is an "emergency" that demands drastic changes to your life. But is this actually the best way to respond? Anna Salamon explores what kinds of changes actually make sense in different types of emergencies, and what that might mean for how to approach existential risk.

* Psychotic “delusions” are more about holding certain genres of idea with a socially inappropriate amount of intensity and obsession than about holding a false idea. Lots of non-psychotic people hold false beliefs (e.g. religious people). And, interestingly, it is absolutely possible to hold a true belief in a psychotic way.
* I have observed people during psychotic episodes get obsessed with the idea that social media was sending them personalized messages (quite true; targeted ads are real) or the idea that the nurses on the psych ward were lying to them (they were).
* Preoccupation with the revelation of secret knowledge, with one’s own importance, with mistrust of others’ motives, and with influencing others' thoughts or being influenced by others' thoughts, are classic psychotic themes.
* And it can be a symptom of schizophrenia when someone’s mind gets disproportionately drawn to those themes. This is called being “paranoid” or “grandiose.”
* But sometimes (and I suspect more often with more intelligent/self-aware people) the literal content of their paranoid or grandiose beliefs is true!
  * Sometimes the truth really has been hidden!
  * Sometimes people really are lying to you or trying to manipulate you!
  * Sometimes you really are, in some ways, important! Sometimes influential people really are paying attention to you!
  * Of course people influence each others' thoughts -- not through telepathy but through communication!
* A false psychotic-flavored thought is "they put a chip in my brain that controls my thoughts." A true psychotic-flavored thought is "Hollywood moviemakers are trying to promote progressive values in the public by implanting messages in their movies."
* These thoughts can come from the same emotional drive, they are drawn from dwelling on the same theme of "anxiety that one's own thoughts are externally influenced", they are in a deep sense mere arbitrary verbal representations of a single mental phenomenon...
Mark Xu
Alignment researchers should think hard about switching to working on AI Control

I think Redwood Research’s recent work on AI control really “hits it out of the park”: they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount (see the toy sketch below). I think that implies lots of alignment researchers should shift to AI-control-type work, and I would naively guess that the equilibrium is close to 50/50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you were instead working on AI control!

For this post, my definitions are roughly:

* AI alignment is the task of ensuring the AIs “do what you want them to do.”
* AI control is the task of ensuring that if the AIs are not aligned (e.g. don’t always “do what you want” and potentially want to mess with you), then you are still OK and can use them for economically productive tasks (an important one of which is doing more alignment/control research).

Here are some thoughts, arguments, and analogies (epistemic status: there is no “hidden content”; if you don’t find the literal words I wrote persuasive you shouldn’t update. In particular, just update on the words and don't update about what my words imply about my beliefs):

* Everything is in degrees. We can “partially align” some AIs, and things will be better if we can use those AIs for productive tasks, like helping with alignment research. The thing that actually matters is “how aligned are the AIs” + “how aligned do they need to be to use them for stuff”, so we should also focus on the second thing.
* If you were a hedge fund, and your strategy for preventing people from stealing your data and starting a new hedge fund was “we will make the hedge fund a super fun place to work and interview people carefull
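Stepping back to the "shift labor until the marginal unit of research in either area decreases P(doom) by the same amount" condition near the top of this quick take, here is a toy numeric sketch of what that condition means. All curves and numbers below are hypothetical illustrations, not anything Mark has claimed:

```python
import math

TOTAL_RESEARCHERS = 100  # hypothetical fixed pool of researchers

# Hypothetical diminishing-returns curves: total reduction in P(doom)
# as a function of how many researchers work in each area.
def pdoom_reduction_alignment(n: int) -> float:
    return 0.10 * math.log1p(n)

def pdoom_reduction_control(n: int) -> float:
    return 0.08 * math.log1p(n)

def total_reduction(n_alignment: int) -> float:
    n_control = TOTAL_RESEARCHERS - n_alignment
    return pdoom_reduction_alignment(n_alignment) + pdoom_reduction_control(n_control)

# Grid-search the split that maximizes total reduction. At the optimum, moving
# one researcher between areas no longer helps: marginal returns are equalized.
best = max(range(TOTAL_RESEARCHERS + 1), key=total_reduction)
print(f"optimal split: {best} alignment / {TOTAL_RESEARCHERS - best} control")
```

With these made-up curves the optimum is simply the split at which moving one more researcher helps equally in either area; the point is the shape of the argument, not the specific numbers.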
I think this post was underrated; I look back at it frequently: AI labs can boost external safety research. (It got some downvotes but no comments — let me know if it's wrong/bad.)
TurnTrout
Apply to the "Team Shard" mentorship program at MATS -- apply here; applications are due by October 13th! (Footnote: paper available soon.)


Recent Discussion

(I am not by any stretch of the imagination an expert on psychosis. This is more like “live-blogging my thinking as I go”. I’m hoping to spur discussion and get feedback and pointers.)

1. Introduction

I suggested a model of psychosis in my blog post “Schizophrenia as a deficiency in long-range cortex-to-cortex communication”, Section 4.2 last February. But it had some problems. I finally got around to taking another look, and I think I found an easy way to fix those problems. So this post is the updated version.

For the tl;dr, you can skip the text and just look at the two diagrams below.

2. Background: My “Model of psychosis, take 1” from earlier

The following is what I was proposing in “Schizophrenia as a deficiency in long-range cortex-to-cortex communication”, Section 4.2:

The...

Lorec
No, when I say "in parallel", I'm not talking about two signals originating from different regions of cortex. I'm talking about two signals originating from the same region of cortex, at the time the decision is made - one of which [your "B" above] carries the information "move your arm"[/"subvocalize this sentence"] and the other of which [the right downward-pointing arrow in your diagram above, which you haven't named, and which I'll call "C"] carries the information "don't perceive an external agency moving your arm"[/"don't perceive an external agency subvocalizing this sentence"].

AFAICT, schizophrenic auditory hallucinations in general don't pass through the brainstem. Neither do the other schizophrenic "positive symptoms" of delusional and disordered cognition. So in order to actually explain schizophrenic symptoms and the meliorating effect of antipsychotics, "B" and "C" themselves have to be instantiated without reference to the brainstem. With respect to auditory hallucinations, "B" and "C" should both originate further down the frontal cortex, in the DLPFC, where there are no pyramidal neurons, and "C" should terminate in the auditory-perceptual regions of the temporal lobe, not the brainstem.

If you can't come up with a reason we should assume the strength of the "B" signal [modeled as jointly originating with the "C" signal] here is varying, but the strength of the "C" signal [modeled as sometimes terminating in the auditory-perceptual regions of the temporal lobe] is not, I don't see what weight your theory can bear except in the special case of motor symptoms - not auditory-hallucination or cognitive symptoms.

I’m confused …

I was saying that, in this particular illustrated case, B comes from motor cortex and C comes from somatosensory cortex. I can’t tell whether you are agreeing or disagreeing with that. In other words: You seem to prefer a model where B and C come from the same cortical area, right? But are you saying that I’m wrong even about the motor case that I used as my example in the diagrams, or are you setting aside the motor case and arguing about different cases like auditory hallucinations?

It's true that the bottom box is not necessarily always the... (read more)

Geoffrey Hinton received the Nobel Prize in Physics for his role in creating the modern field of deep learning. This will strengthen his reputation as the "Godfather of AI" which was already used to amplify his public statements about AI risk.[1]

The Nobel Prize in Physics 2024 was awarded to John J. Hopfield and Geoffrey E. Hinton “for foundational discoveries and inventions that enable machine learning with artificial neural networks.”

Demis Hassabis, DeepMind's co-founder and CEO, received the Nobel Prize in Chemistry for his role in creating AlphaFold.

The Nobel Prize in Chemistry 2024 was divided, one half awarded to David Baker "for computational protein design", the other half jointly to Demis Hassabis and John M. Jumper "for protein structure prediction".

AlphaFold's extraordinary contributions to the field of computational biology are...

Nathan Young
It is disappointing/confusing to me that of the two articles I recently wrote, the one that was much closer to reality got a lot less karma.

* A new process for mapping discussions is a summary of months of work that I and my team did on mapping discourse around AI. We built new tools and employed new methodologies. It got 19 karma.
* Advice for journalists is a piece that I wrote in about 5 hours after perhaps 5 hours of experiences. It has 73 karma and counting.

I think this isn't much evidence, given it's just two pieces. But I do feel a pull towards coming up with theories rather than building and testing things in the real world. To the extent this pull is real, it seems bad. If so, I would recommend both that more people build things in the real world and talk about them, and that we find ways to reward these posts more, regardless of how alive they feel to us at the time. (Aliveness being my hypothesis -- many of us understand, or have more live feelings about, dealing with journalists than a sort of dry post about mapping discourse.)

Fwiw I loved your journalist post and I never even saw your other post (until now).

Epistemic status: Theorizing on topics I’m not qualified for. Trying my best to be truth-seeking instead of hyping up my idea. Not much here is original, but hopefully the combination is useful. This hypothesis deserves more time and consideration but I’m sharing this minimal version to get some feedback before sinking more time into it. “We believe there’s a lot of value in articulating a strong version of something one may believe to be true, even if it might be false.”

This is a somewhat living document as I come back and add more ideas.

The Heuristics Hypothesis: A Bag of Heuristics is All There Is and a Bag of Heuristics is All You Need

  • A heuristic is a local, interpretable, and simple function (e.g., boolean/arithmetic/lookup functions) learned from the training
...
Sodium
Thanks for the pointer! I skimmed the paper. Unless I'm making a major mistake in interpreting the results, the evidence they provide for "this model reasons" is essentially "the models are better at decoding words encrypted with rot-5 than they are at rot-10." I don't think this empirical fact provides much evidence one way or another.

To summarize, the authors decompose a model's ability to decode shift ciphers (e.g., rot-13 text: "fgnl"; original text: "stay") into three categories: probability, memorization, and noisy reasoning.

Probability just refers to a somewhat unconditional probability that a model assigns to a token (specifically, 'The word is "WORD"'). The model is more likely to decode words that are more likely a priori -- this makes sense.

Memorization is defined as how often the type of rotational cipher shows up. rot-13 is the most common one by far, followed by rot-3. The model is better at decoding rot-13 ciphers than any other cipher, which makes sense since there's more of it in the training data, and the model probably has specialized circuitry for rot-13.

What they call "noisy reasoning" is how many rotations are needed to get to the outcome. According to the authors, the fact that GPT-4 does better on shift ciphers with fewer shifts compared to ciphers with more shifts is evidence of this "noisy reasoning."

I don't see how you can jump from this empirical result to claims about the model's ability to reason. For example, an alternative explanation is that the model has learned some set of heuristics that allows it to shift letters from one position to another, but this set of heuristics can only be combined in a limited manner.

Generally though, I think what constitutes a "heuristic" is somewhat of a fuzzy concept. However, what constitutes "reasoning" seems even less defined.
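(An aside, in case the task itself is unfamiliar: a rot-N / shift cipher just rotates each letter N positions through the alphabet. A minimal sketch, using the "fgnl"/"stay" example above:)

```python
def rot_n(text: str, n: int) -> str:
    """Shift each letter n positions forward through the alphabet (a rot-n cipher)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr(base + (ord(ch) - base + n) % 26))
        else:
            out.append(ch)
    return "".join(out)

# Decoding rot-13 is just applying another 13-shift; undoing a rot-5 means shifting the remaining 21.
assert rot_n("fgnl", 13) == "stay"
assert rot_n(rot_n("stay", 5), 26 - 5) == "stay"
```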
Noosphere89
True, it isn't much evidence for reasoning directly, as it's only one task.

As for how we can jump from the empirical result to claims about its ability to reason: the shift cipher task lets us disentangle commonness and simplicity. A bag of heuristics with no uniform and compact description works best on common example types, whereas the algorithmic reasoning I defined below works better on simpler tasks. The simplest shift cipher is the 1-shift cipher, while the most common is the 13-shift cipher, which is where the bag-of-heuristics model (on which LLMs are completely or primarily learning shallow heuristics) predicts they should do best. The paper shows that there is a spike in 13-shift cipher accuracy, consistent with LLMs having some heuristics, but also that 1-shift cipher accuracy was much better than expected under the view that LLMs are solely or primarily a bag of heuristics that couldn't be improved by CoT.

I'm defining reasoning more formally in the quote below. This comment is where I got the quote from: https://www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1#Bg5s8ujitFvfXuop8

This thread has an explanation of why we can disentangle noisy reasoning from heuristics, as I'm defining the terms here, so go check that out: https://x.com/RTomMcCoy/status/1843325666231755174
Sodium
I see; I think that second tweet thread actually made a lot more sense, thanks for sharing!

McCoy's definitions of heuristics and reasoning are sensible, although I personally would still avoid "reasoning" as a word, since people probably have very different interpretations of what it means. I like the ideas of "memorizing solutions" and "generalizing solutions."

I think where McCoy and I depart is that he's modeling the entire network computation as a heuristic, while I'm modeling the network as compositions of bags of heuristics, which in aggregate would display behaviors he would call "reasoning."

The explanation I gave above -- heuristics that shift the letter forward by one, with limited composing abilities -- is still a heuristics-based explanation. Maybe this set of composing heuristics would fit your definition of an "algorithm." I don't think there's anything inherently wrong with that.

However, the heuristics-based explanation gives concrete predictions of what we can look for in the actual network -- an individual heuristic that increments a to b, b to c, etc., and other parts of the network that compose the outputs. This is what I meant when I said that this could be a useful framework for interpretability :)
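For concreteness, here is a minimal, purely hypothetical sketch of the kind of explanation Sodium describes: a single learned shift-by-one heuristic that the network can only compose a limited number of times, which would predict accuracy degrading as the required number of shifts grows:

```python
import string

ALPHABET = string.ascii_lowercase

# A single local "heuristic": a lookup table that shifts one letter back by one.
SHIFT_BACK_ONE = {a: ALPHABET[(i - 1) % 26] for i, a in enumerate(ALPHABET)}

def decode_by_composition(ciphertext: str, shifts: int, max_compositions: int = 8):
    """Decode a rot-`shifts` cipher by composing the shift-back-by-one heuristic.

    If decoding needs more compositions than the (hypothetical) budget allows,
    the 'model' fails, so accuracy falls off as the required number of shifts grows.
    A real bag-of-heuristics story would also include memorized rot-13 circuitry,
    which this toy deliberately omits.
    """
    if shifts > max_compositions:
        return None  # beyond the heuristic's limited composing ability
    text = ciphertext
    for _ in range(shifts):
        text = "".join(SHIFT_BACK_ONE.get(ch, ch) for ch in text)
    return text

print(decode_by_composition("tubz", shifts=1))   # 'stay' -- a 1-shift needs only one composition
print(decode_by_composition("fgnl", shifts=13))  # None  -- 13 compositions exceed the budget
```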

Now I understand.

4.1 Post summary / Table of contents

This is the fourth of a series of eight blog posts, which I’m serializing weekly. (Or email or DM me if you want to read the whole thing right now.)

“Trance” is an umbrella term for various states of consciousness in which “you lose yourself”, somehow. The first kind that I learned about was hypnotic trance, as depicted in the media:

[Image: "Mind-Control Eyes" trope (source: tvtropes)]

With examples like that, I quite naturally assumed that hypnotism was fictional.

Other types of trance, particularly “spirit possession” in traditional cultures (e.g. Haitian Vodou), and New Age “channeling”, initially struck me as equally fictional—especially the wild claim that people would “wake up” from their hypnotic or other trance with no memory of what just happened. But when I looked into it a bit more,...

Gunnar_Zarncke
You refer to status as an attribute of a person, but now I'm wondering how the brain represents status. I wouldn't rule out the possibility of high status being the same thing as the willingness to let others control you. 

You can find my current opinions about status in:

I think your phrase “willingness to let others control you” is conveying a kinda strange vibe. (Not sure how deliberate that is.)

Story: I have a hunch that the blue paint color will look best, but my interior decorator has a hunch that the green paint color will look best. I defer to her judgment because she’s an experienced professional whom I trust—part... (read more)

Gunnar_Zarncke
You might want to have a look at The Collected Papers of Milton H. Erickson on Hypnosis, Vol. 1: The Nature of Hypnosis and Suggestion. I read it some years ago and found it insightful, plausible, and fun to read, but couldn't wrap my mind around it forming a coherent theory. And from my recollection, many things in there confirm Johnstone and complement it, esp. the high-status aspects. There may be more.

Frontier AI labs can boost external safety researchers by

  • Sharing better access to powerful models (early access, fine-tuning, helpful-only,[1] filters/moderation-off, logprobs, activations)[2] (a minimal logprobs example follows this list)
  • Releasing research artifacts besides models
  • Publishing (transparent, reproducible) safety research
  • Giving API credits
  • Mentoring
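
As one concrete example of the access types in the first bullet, here is a minimal sketch of requesting per-token logprobs through an OpenAI-style chat API; the model name and prompt are placeholders, and other labs' APIs differ:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is set in the environment

# Request per-token log-probabilities alongside the completion.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Is the sky blue? Answer yes or no."}],
    logprobs=True,
    top_logprobs=5,       # also return the 5 most likely alternatives for each token
    max_tokens=1,
)

token_info = response.choices[0].logprobs.content[0]
print(token_info.token, token_info.logprob)
for alt in token_info.top_logprobs:
    print(f"  {alt.token!r}: {alt.logprob:.3f}")
```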

Here's what the labs have done (besides just publishing safety research[3]).

Anthropic:

Google DeepMind:

  • Publishing their model evals for dangerous capabilities and sharing resources for reproducing some of them
  • Releasing Gemma SAEs
  • Releasing Gemma weights
  • (External mentoring, in particular via MATS)
  • [No fine-tuning or deep access to
...

Yeah this seems like a good point. Not a lot to argue with, but yeah underrated.

Intro (skippable)

"Are you really the smartest member of the Hunters' Guild?"

"I'm the smartest at fighting! What's the difference?"

"Well, you're just about smart enough to write, at the very least"

"And you're just about short enough, and just about annoying enough, that if you don't shut your nerd mouth you'll find yourself flying out of that window"

The master hunter shoves a heavy leather-bound journal towards you. You look inside, and see just-about-legible scrawlings:

'Dear Diary, this week I took a big fiery flamu club to the thunderwood peaks. Even though I was wearing the nicest icemail in the armory, I got beaten up and came home with nothing :('

The junior research intern biologist hands you a meticulous-looking sheaf of parchments. The first - he insists - contains the sum total...

I interact with journalists quite a lot and I have specific preferences. Not just for articles, but for behaviour. And journalists do behave pretty strangely at times. 

This account comes from talking to journalists on ~10 occasions, including being quoted in ~5 articles.

Privacy

I do not trust journalists to abide by norms of privacy. If I talked to a friend and, without asking, shared what they said with their name attached, I expect they'd be upset. But journalists regularly act as if their profession sets up the opposite norm - that everything is publishable, unless explicitly agreed otherwise. This is bizarre to me. It's like they have taken a public oath to be untrustworthy.

Perhaps they would argue that it’s a few bad journalists who behave like this, but how...

Hmmm, what is the picture that the analogy gives you? I struggle to imagine how it's misleading, but I want to hear.

Nathan Young
A common criticism seems to be "this won't change anything" (see here and here). People often believe that journalists can't choose their headlines, and so it is unfair to hold them accountable for them. I think this is wrong for about three reasons:

* We have a lot of journalists pretty near to us whose behaviour we absolutely can change. Zvi, Scott and Kelsey don't tend to print misleading headlines, but they are quite a big deal, and to act as if we shouldn't create better incentives because we can't change everything seems to strawman my position.
* Journalists can control their headlines. I have seen 1-2 times journalists change headlines after pushback. I don't think it was the editors who read the comments and changed the headlines of their own accord. I imagine that the journalists said they were taking too much pushback and asked for the change. This is probably therefore an existence proof that journalists can affect headlines. I think reality is even further in my direction. I imagine that journalists and their editors are involved in the same social transactions as exist between many employees and their bosses. If they ask to change a headline, often they can probably shift it a bit. Getting good sources might be enough to buy this from them.
* I am not saying that they must have good headlines, I am just holding the threat of their messages against them. I've only done this twice, but in one case a journalist was happy to give me this leverage. And having it, I felt more confident about the interview.

I think there is a failure mode where some rats hear a system described and imagine that reality matches it as they imagine it. In this case, I think that's mistaken - journalists have incentives to misdescribe their power over their own headlines, and reality is a bit messier than the simple model suggests. And we have more power than I think some commenters think. I recommend trying this norm. It doesn't cost you much, it is a good red flag if someone gets angry when
ChristianKl
I don't think that's the case, because the journalist you are speaking to is not the person who makes the decision. At the moment you have some person who's trained to write headlines so that they get a maximum of clicks, and who writes headlines for a lot of articles.

If the management of the New York Times has to decide whether they are willing to get 20% fewer clicks on social media by letting journalists write the headlines instead of their current headline writers, just so that people on LessWrong are more willing to give the New York Times interviews, I don't think that will change their management decisions.

Shaming the New York Times for misinformation might work better. You could write a bot for X and Threads that uses an LLM to judge, for every New York Times article, whether the headline is misleading, and then writes a tweet for each misleading New York Times headline. Such a project could hurt the reputation of the New York Times among their audience, which is something they actually care about.
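A minimal sketch of the kind of bot ChristianKl describes; everything here is hypothetical: the article feed and the posting function are left as placeholders, and an OpenAI-style API is used for the judgment step purely as an example:

```python
from openai import OpenAI

llm = OpenAI()  # assumes an API key is set in the environment

def headline_is_misleading(headline: str, article_text: str) -> bool:
    """Ask an LLM whether the headline misrepresents the article body."""
    judgment = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                "Does this headline misrepresent the article? Answer only YES or NO.\n\n"
                f"Headline: {headline}\n\nArticle: {article_text[:4000]}"
            ),
        }],
    )
    return judgment.choices[0].message.content.strip().upper().startswith("YES")

def run_once(articles: list, post_tweet) -> None:
    """`articles`: dicts with 'headline', 'text', and 'url' from whatever feed you
    scrape (not specified here); `post_tweet`: your posting function (e.g. via tweepy)."""
    for article in articles:
        if headline_is_misleading(article["headline"], article["text"]):
            post_tweet(f"Misleading headline? {article['headline']!r} {article['url']}")
```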
Nathan Young
I think this is incorrect. I imagine journalists have more latitude to influence headlines when they really care.