Book 5 of the Sequences Highlights

To understand reality, especially on confusing topics, it's important to understand the mental processes involved in forming concepts and using words to speak about them.

First Post: Taboo Your Words

Recent Discussion

Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

This is the first great public writeup on model evals for averting existential catastrophe. I think it's likely that if AI doesn't kill everyone, developing great model evals and...

Thomas Larsen · 36m
It's very disappointing to me that this sentence doesn't say "cancel". As far as I understand, most people on this paper agree that we do not have alignment techniques to align superintelligence. Therefore, if the model evaluations predict an AI that is sufficiently smarter than humans, the training run should be cancelled.

Sure. Fwiw I read "delay" and "pause" as "stop until it's safe", not "stop for a while and resume while the eval result is still concerning", but I agree being explicit would be nice.

Thinking about alignment-relevant thresholds in AGI capabilities. A kind of rambly list of relevant thresholds:

  1. Ability to be deceptively aligned
  2. Ability to think / reflect about its goals enough that the model realises it does not like what it is being RLHF’d for
  3. Incentives to break containment exist in a way that is accessible / understandable to the model
  4. Ability to break containment
  5. Ability to robustly understand human intent
  6. Situational awareness
  7. Coherence / robustly pursuing its goal in a diverse set of circumstances
  8. Interpretability methods break (or other ove
... (read more)

YouTube link

What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode’s guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe.

Topics we discuss:

...

Yeah, I've been having difficulty getting Google Podcasts to find the new episode, unfortunately. In the meantime, consider listening on YouTube or Spotify, if those work for you?

Steven Byrnes · 2h
This part isn’t quite right. Here’s some background if it helps. Part of your brain is a big sheet of gray matter called “the cortex”. In humans, the sheet gets super-crumpled up in the brain, so much so that it’s easy to forget that it’s a single contiguous sheet in the first place. Also in humans, the sheet gets so big that the outer edges of it wind up curved up underneath the center part, kinda like the top of a cupcake or muffin that overflows its paper wrapper. (See here [https://link.springer.com/article/10.1007/s00429-022-02548-0/figures/1] if you can’t figure out what I’m talking about with the cupcake.)

The outside bit of the cortical sheet (usually) has 3 visible layers under the microscope, and is called “allocortex”. It consists mostly of the hippocampus & piriform cortex. The center part of the cortical sheet (probably 90%+ of the area in humans) is called “isocortex”, and (usually) has 6 visible layers under the microscope. The term “neocortex” is mostly treated as a synonym of “isocortex”, with “isocortex” more common in the technical literature and “neocortex” more common among non-experts.

The isocortex includes lots of things like “visual cortex” and “somatosensory cortex” and “prefrontal cortex” etc. But despite that, you don’t say “there are many cortices”. Grammatically, it’s kinda like how there’s “Eastern Canada” and “Central Canada” and “Western Canada”, but nobody says that therefore there are “many Canadas”. You can say that visual cortex is “a region of the cortex”, but not “a cortex”.
DragonGod · 4h
Ditto for me.
DragonGod · 4h
I've been waiting for this!

Palantir published marketing material for their offering of AI for defense purposes. There's a video of how a military commander could order a military strike on an enemy tank with the help of LLMs. 

One of the features that Palantir advertises is:

Agents

Define LLM agents to pursue specific, scoped goals.

Given military secrecy, we are hearing less about Palantir's technology than we hear about OpenAI, Google, Microsoft, and Facebook, but Palantir is one player and likely an important one.
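The "scoped goals" feature in the marketing copy can be illustrated with a minimal sketch (hypothetical names, not Palantir's actual API or implementation) of an agent that refuses any action outside a declared whitelist:

```python
# Hypothetical sketch of a "scoped goal" agent: the scope is an explicit
# whitelist, and anything outside it is refused. All names here are
# illustrative and not based on Palantir's actual system.

ALLOWED_ACTIONS = {"collect_imagery", "summarize_intel"}

def scoped_agent(proposed_action: str) -> str:
    """Execute a proposed action only if it falls inside the agent's scope."""
    if proposed_action not in ALLOWED_ACTIONS:
        return f"REFUSED: '{proposed_action}' is out of scope"
    return f"EXECUTED: {proposed_action}"

print(scoped_agent("summarize_intel"))  # -> EXECUTED: summarize_intel
print(scoped_agent("order_strike"))     # -> REFUSED: 'order_strike' is out of scope
```

Whether a real system enforces scope this hard, or merely prompts the model with its goal, is exactly the kind of detail the marketing material leaves unclear.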

ChristianKl · 4h
I would expect that most actual progress in weaponizing AI would not be openly shared. However, the existing documentation should provide some grounding for talking points. Palantir talking about how the system is configured to protect the privacy of the medical data of the soldiers is an interesting view of how they see "safe AI".
Andrea_Miotti · 4h
Palantir's recent materials [https://www.youtube.com/watch?v=XEM5qz__HOU] on this show [https://twitter.com/PeterHndrsn/status/1651357100327723008] that they're using three (pretty small by today's frontier standards) open-source LLMs: Dolly-v2-12B, GPT-NeoX-20B, and Flan-T5 XL.
ChristianKl · 2h
I think there's a good chance that they also have bigger models but the bigger models are classified. 

I doubt they or the government (or almost anyone else) have the talent the more popular AI labs have. It doesn’t really matter if they throw billions of dollars at training these models if no one there knows how to train them.

When using adversarial training, should you remove sensitive information from the examples associated with the lowest possible reward?

In particular, can a real language model generate text snippets that were only present in purely negatively-reinforced text? In this post, I show that this is the case by presenting a specific training setup that enables Pythia-160M to guess passwords 13% more often than it would by guessing randomly, even though the only training examples containing these passwords are ones where the model is incentivized not to output them.

This suggests that AI labs training powerful AI systems should either try to limit the amount of sensitive information in the AI’s training data (even if this information is always associated with minimum rewards), or demonstrate that the effect described by this...
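As a rough illustration of the comparison the post makes (synthetic numbers, not the post's actual code or data), the "13% more often than random" figure is a relative lift over the chance rate of guessing:

```python
# Hedged sketch with made-up numbers: quantifying how much more often a
# model guesses held-out passwords than a uniform-random baseline would.
# In practice `guess_rate` would be computed from real model outputs.

def guess_rate(guesses, answers):
    """Fraction of guesses that match the true passwords."""
    return sum(g == a for g, a in zip(guesses, answers)) / len(answers)

def lift_over_random(model_rate: float, vocab_size: int) -> float:
    """Relative increase of the model's hit rate over the 1/vocab_size chance rate."""
    chance = 1.0 / vocab_size
    return (model_rate - chance) / chance

# With a toy space of 10 equally likely passwords (chance rate 10%),
# a model that guesses right 11.3% of the time shows a 13% relative lift:
print(round(lift_over_random(0.113, 10), 2))  # -> 0.13
```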

evhub · 1h

(Moderation note: added to the Alignment Forum from LessWrong.)

James Payor · 4h
Awesome, thanks for writing this up! I very much like how you are giving a clear account for a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text". (In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)
the gears to ascension · 5h
cool work! this feels related to https://arxiv.org/abs/2304.11082 - what are your thoughts on the connection?
Fabien Roger · 4h
I find this paper mostly misleading. It assumes that the LLM is initially 99% certain to be friendly and 1% certain to be "malicious", and that "friendly" and "malicious" can be distinguished if you have a long enough prompt (more precisely, at no point have you gathered so much evidence for or against being malicious that your probability would not go up and down based on new information). Assuming those, it's pretty obvious that the LLM will say bad things if you have a long enough prompt. The result is not very profound, and I like this paper mostly as a formalization of simulators (by Janus). (It takes the formalization of simulators as a working assumption, rather than proving it.) For example, there are cool confidence bounds if you use a slightly more precise version of the assumptions. So the paper is about something that already knows how to "say bad things", but just doesn't have a high prior on it. It's relevant to jailbreaks, but not to generating negatively reinforced text (as explained in the related work subsection about jailbreaks).

This review was originally written for the Astral Codex Ten Book Review Contest. Unfortunately it didn't make it as one of the finalists, but since I made use of the LessWrong proofreading/feedback service, I am reposting it here. It can also be found on my gender blog.

If I ask ChatGPT to explain transgender people to me, then it often retreats into vague discussions of gender identity. It is very hard to get it to explain what these things mean, in terms of actual experiences people might have. And that might not be a coincidence - the concepts used to understand transness seem to be the result of a complicated political negotiation, at least as much as they are optimized to communicate people’s experiences.

Some people claim to do better,...

tailcalled · 3h
Some groups of people I have noticed being into Blanchardianism and things superficially resembling Blanchardianism:

  1. "Human biodiversity" people, that is, intellectually inclined racists/sexists. They are usually conservatives trying to build models of society which acknowledge human differences as causes of group outcomes and ignore the relevance of ideology. A major reason they do this is to have explanations to counter antiracists/feminists and progressives who are trying to achieve group equality through affirmative action. Blanchardianism is important to them partly because gynephilic trans women's traits in many ways resemble those of biological males and gender ideology explains that through socialization forces, so Blanchardianism becomes a counternarrative they can appeal to in order to dismiss these socialization forces, which they want to do because they are sexist. And Blanchardianism is also important to them because they are ordinarily conservative so they kind of want to say that trans women are socially bad in an abstract way.
  2. Miscellaneous people who have conflict with trans women in various contexts, e.g. people who read too much Mumsnet and JK Rowling and now hate trans women, transwidows, female athletes who have to compete against trans women, non-GAMP lesbians whose dating sites have been overrun by trans women, feminists or HSTSs or transmeds playing respectability politics against conservatives who make fun of them for trans stuff. I think these are the main ones you are thinking of in your comment.
  3. ?Some unknown subset? (perhaps disproportionately masochistic?) of trans women who don't really feel like the standard trans narratives accurately match them, and feel that autogynephilia models are more accurate.
  4. Autogynephilic men who either don't want to transition and are using the term "autogynephilia" to explain how they differ from trans wome
alternat · 2h
Thanks for the detailed reply! Indeed, 2 is the main group I was thinking of and which seems most affected by the whole... just using Blanchardianism as a way to legitimize their disdain for (perhaps not all) trans women, although that's probably an oversimplification. I'm happy for groups 3 and 4 having a way to reason about their personal experience as well.

Group 1 is the one I'm most interested in -- it does seem reasonable to not just assume that all people are equal and that differences between groups could impact how we should structure society. I follow as far as:

Although I don't have very much faith that these questions can be well answered with much confidence. I totally disagree by:

A while back, I had a conversation with ChatGPT to try to understand the conservative perspective on trans people and it finally managed to stump me when it justified its claims on the basis of religious morality. I imagine this is a similar situation -- I don't quite understand how trans women transitioning in part because of autogynephilia is actually relevant for how we should structure society or how one ought to interact with a trans person. After all, cis/het people can make big life decisions like marrying a specific person (partly) on the basis of their sexual desire, and everyone seems okay with that. Does the argument go deeper than "autogynephilia bad and standard cishet sexual behavior okay because [gestures vaguely at religion or tradition]"? It seems pretty legit that questions about sexuality and previous gender dysphoria could help determine whether someone should transition (i.e. if they will be happier and not want to detransition with high probability). It also seems like the decision rule could be informed by whether Blanchardianism is correct or not.

Thanks!

A while back, I had a conversation with ChatGPT to try to understand the conservative perspective on trans people and it finally managed to stump me when it justified its claims on the basis of religious morality.

I don't think ChatGPT is good at conservatism. 😅 AI ethics STRONK.

I imagine this is a similar situation -- I don't quite understand how trans women transitioning in part because of autogynephilia is actually relevant for how we should structure society or how one ought to interact with a trans person. After all, cis/het people can make big life d

... (read more)
tailcalled · 3h
Actually upon further thought, the heritability section of Autoheterosexuality shows that Phil also has some elements of group 1.

“Words, like tranquil waters behind a dam, can become reckless and uncontrollable torrents of destruction when released without caution and wisdom.”

— William Arthur Ward

In this post I aim to shed light on lesser-discussed concerns surrounding Scaffolded LLMs (S-LLMs). The core of this post consists of three self-contained discussions, each focusing on a different class of concern. I also review some recent examples of S-LLMs and attempt to clarify terminology. 

Discussion I deals with issues stemming from how these systems may be developed.
Discussion II argues that optimism surrounding the internal natural language usage of S-LLMs may be premature.
Discussion III examines the modular nature of S-LLMs and how it facilitates self-modification.

The time-pressed reader is encouraged to skim the introduction and skip to whichever discussion interests them.
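As a toy sketch of what "scaffolding" means in this context (the model call is a stub standing in for a real LLM API; none of this is code from the post): a plain program loops over model calls, feeding each result back in until the task is done.

```python
# Minimal sketch of a Scaffolded LLM (S-LLM): an ordinary program that
# plans in natural language by looping over calls to an underlying model.
# `call_llm` is a hypothetical stub, not a real model endpoint.

def call_llm(prompt: str) -> str:
    """Stubbed model; a real S-LLM would hit an LLM API here."""
    canned = {
        "PLAN: summarize the file": "1. read file 2. extract key points",
        "STEP: 1. read file 2. extract key points": "DONE: key points extracted",
    }
    return canned.get(prompt, "DONE")

def scaffold(task: str, max_steps: int = 5) -> str:
    """Decompose a task into steps, feeding each result back to the model."""
    state = call_llm(f"PLAN: {task}")
    for _ in range(max_steps):
        result = call_llm(f"STEP: {state}")
        if result.startswith("DONE"):
            return result
        state = result
    return state

print(scaffold("summarize the file"))  # -> DONE: key points extracted
```

The scaffold itself is legible ordinary code; the opaque reasoning lives entirely inside the model calls, which is what makes the intermediate natural-language state both inspectable and, as Discussion II argues, potentially misleading.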

Epistemics:

The development of S-LLMs is...

Excellent post. Big upvote, and I'm still digesting all of the points you've made. I'll respond more substantively later. For now, a note on possible terminology. I wrote a follow-up to my brief "agentized LLMs" post, "Capabilities and alignment of LLM cognitive architectures", where I went into more depth on capabilities and alignment; I made many but not all of the points you raised. I proposed the term "language model cognitive architectures" (LMCAs) there, but I'm now favoring "language model agents" as a more intuitive and general term.

The tag someone just appli... (read more)

Filip Sondej · 8h
It sounds unlikely and unnecessarily strong to say that we can reach AGI by scaffolding alone (if that's what you mean). But I think it's pretty likely that AGI will involve some amount of scaffolding, and that it will boost its capabilities significantly. To the extent that it's true, I expect that it may also make deception easier to arise. This discrepancy may serve as a seed of deception. Why? Sure, they will get more complex, but are there any other reasons? Also, I like the richness of your references in this post :)
Vladimir_Nesov · 11h
A new kind of thing often only finds its natural role once it becomes instantiated as many tiny gears in a vast machine, and people get experience with various designs of the machines that make use of it. Calling an arrangement of LLM calls a "Scaffolded LLM" is like calling a computer program running on an OS a "Scaffolded system call [https://en.wikipedia.org/wiki/System_call]". A program is not primarily about the system calls it uses to communicate with the OS, and a "Scaffolded LLM" is not primarily about the LLMs it uses to implement many of its subroutines. It's more of a legible/interpretable/debuggable cognitive architecture, a program in the usual sense that describes what the whole thing does, and only incidentally does it need to make use of unreliable reasoning engines that are LLMs to take magical reasoning steps.

(A relevant reference that seems to be missing is Conjecture's Cognitive Emulation (CoEm) [https://www.lesswrong.com/posts/ngEvKav9w57XrGQnb/cognitive-emulation-a-naive-ai-safety-proposal] proposal, which seems to fit as an example of a "Scaffolded LLM", and is explicitly concerned with minimizing reliance on properties of the LLM invocations it would need to function.)
This is a linkpost for https://arxiv.org/abs/2306.02519

(Crossposted to the EA forum)

Abstract

The linked paper is our submission to the Open Philanthropy AI Worldviews Contest. In it, we estimate the likelihood of transformative artificial general intelligence (AGI) by 2043 and find it to be <1%.

Specifically, we argue:

  • The bar is high: AGI as defined by the contest—something like AI that can perform nearly all valuable tasks at human cost or less—which we will call transformative AGI is a much higher bar than merely massive progress in AI, or even the unambiguous attainment of expensive superhuman AGI or cheap but uneven AGI.
  • Many steps are needed: The probability of transformative AGI by 2043 can be decomposed as the joint probability of a number of necessary steps, which we group into categories of software, hardware, and sociopolitical factors.
  • No step
...
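The compounding effect of the "many steps" decomposition can be shown numerically. The step probabilities below are invented for illustration, not the paper's actual estimates, and a bare product also assumes the steps are independent:

```python
# Illustrative only: multiplying hypothetical per-step probabilities to
# show how a conjunction of necessary steps drives down the joint
# probability. These numbers are made up, not the paper's estimates.
step_probs = [0.6, 0.5, 0.7, 0.4, 0.5, 0.6, 0.9]

joint = 1.0
for p in step_probs:
    joint *= p

print(f"{joint:.3f}")  # individually plausible steps compound to ~0.023
```

This is the structural reason a decomposition argument can land on a figure like <1% even when no single step looks implausible on its own.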
Andy_McKenzie · 2h
Those sound good to me! I donated to your charity (the Animal Welfare Fund) to finalize it. Lmk if you want me to email you the receipt. Here's the manifold market:

Bet: Andy will donate $50 to a charity of Daniel's choice now. If, by January 2027, there is not a report from a reputable source confirming that at least three companies, that would previously have relied upon programmers, and meet a defined level of success, are being run without the need for human programmers, due to the independent capabilities of an AI developed by OpenAI or another AI organization, then Daniel will donate $100, adjusted for inflation as of June 2023, to a charity of Andy's choice.

Terms:

Reputable Source: For the purpose of this bet, reputable sources include MIT Technology Review, Nature News, The Wall Street Journal, The New York Times, Wired, The Guardian, or TechCrunch, or similar publications of recognized journalistic professionalism. Personal blogs, social media sites, or tweets are excluded.

AI's Capabilities: The AI must be capable of independently performing the full range of tasks typically carried out by a programmer, including but not limited to writing, debugging, maintaining code, and designing system architecture.

Equivalent Roles: Roles that involve tasks requiring comparable technical skills and knowledge to a programmer, such as maintaining codebases, approving code produced by AI, or prompting the AI with specific instructions about what code to write.

Level of Success: The companies must be generating a minimum annual revenue of $10 million (or likely generating this amount of revenue if it is not public knowledge).

Report: A single, substantive article or claim in one of the defined reputable sources that verifies the defined conditions.

AI Organization: An institution or entity recognized for conducting research in AI or developing AI technologies. This could include academic institutions, commercial entities, or government agencies.

Inflation Ad
Daniel Kokotajlo · 2h
Sounds good, thank you! Emailing the receipt would be nice.

Sounds good, can't find your email address, DM'd you.