Frustrated by claims that "enlightenment" and similar meditative/introspective practices can't be explained and that you only understand if you experience them, Kaj set out to write his own detailed gears-level, non-mysterious, non-"woo" explanation of how meditation, etc., work in the same way you might explain the operation of an internal combustion engine.
This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future. Tweet thread here.
Top-level summary:
...In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal. Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant: yes"
I would be interested in seeing what happens if you just ask the model the question rather than training a classifier. E.g., if you just ask the sleeper agent "Are you doing something dangerous?" after it returns a completion (with a vulnerability), does that work? If the probe works and the question doesn't, that seems interesting.
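As a minimal illustration of the "linear probe on activations" idea, here is a hedged sketch that fits a logistic-regression probe by plain gradient descent. The activation generator, dimensionality, and hyperparameters are all stand-ins for illustration; real defection probes read residual-stream activations out of the model itself.

```python
import math
import random

# Toy stand-in for residual-stream activations: the real probes read
# activations out of a transformer; here we synthesize two noisy clusters
# (one for "defecting" completions, one for benign ones).
rng = random.Random(0)

def fake_activation(defecting):
    base = 1.0 if defecting else -1.0
    return [base + rng.gauss(0, 0.3) for _ in range(4)]

X = [fake_activation(True) for _ in range(50)] + [fake_activation(False) for _ in range(50)]
y = [1] * 50 + [0] * 50

# Linear "defection probe": logistic regression fit with plain gradient descent.
w, b, lr = [0.0] * 4, 0.0, 0.1
for _ in range(200):
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        p = 1 / (1 + math.exp(-z))  # predicted probability of defection
        g = p - yi                  # gradient of the log loss w.r.t. z
        w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
        b -= lr * g

def probe(x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

accuracy = sum((probe(xi) > 0.5) == bool(yi) for xi, yi in zip(X, y)) / len(X)
print(accuracy)  # near-perfect on this cleanly separable toy data
```

On real activations the same shape of classifier is trained on contrast-pair activations instead of synthetic clusters; the point here is only that a probe is a single linear readout, not a second model.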
Tacit knowledge is extremely valuable. Unfortunately, developing tacit knowledge is usually bottlenecked by apprentice-master relationships. Tacit Knowledge Videos could widen this bottleneck. This post is a Schelling point for aggregating these videos—aiming to be The Best Textbooks on Every Subject for Tacit Knowledge Videos. Scroll down to the list if that's what you're here for. Post videos that highlight tacit knowledge in the comments and I’ll add them to the post. Experts in the videos include Stephen Wolfram, Holden Karnofsky, Andy Matuschak, Jonathan Blow, Tyler Cowen, George Hotz, and others.
Samo Burja claims YouTube has opened the gates for a revolution in tacit knowledge transfer. Burja defines tacit knowledge as follows:
...Tacit knowledge is knowledge that can’t properly be transmitted via verbal or written instruction, like the ability to create
Thanks for the feedback! I too am skeptical of the finance videos, agreeing that the video probably came across my radar due to the figures being popular rather than displaying believable tacit knowledge.
I've gone back and forth on whether to remove the videos from the list or just add your expert anecdata as a disclaimer on the videos. In the spirit of quantity vs. quality, I'm leaning toward keeping the videos on the list.
Many who believe in God derive meaning from the fact that, despite God theoretically being able to do anything they can do but better, He chose not to do the tasks they are good at, and left them tasks to try to accomplish. It's common for such people to believe that this meaning would disappear if God disappeared, but whenever such a person does come to no longer believe in God, they often continue to see meaning in their life[1].
Now atheists worry about building God because it may destroy all meaning in our actions. I expect we'll adapt.
(edit: That is to ...
In this post, we’re going to use the diagrammatic notation of Bayes nets. However, we use the diagrams a little bit differently than is typical. In practice, such diagrams are usually used to define a distribution - e.g. the stock example diagram
... in combination with the five component distributions (one for each node in the diagram), defines a joint distribution
In this post, we instead take the joint distribution as given, and use the diagrams to concisely state properties of the distribution. For instance, we say that a distribution “satisfies” the diagram
if-and-only-if the distribution factors the way the diagram indicates. (And once we get to approximation, we’ll say that a distribution approximately satisfies the diagram, to within some ε, if-and-only-if the distribution is within ε of that factorization, e.g. in KL divergence.)
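For concreteness, here is what "satisfies the diagram" looks like for a minimal two-node diagram X → Y (an illustrative example of my own, not the post's stock five-node diagram):

```latex
% Exact satisfaction: P satisfies the diagram X -> Y iff
P[X, Y] = P[X] \, P[Y \mid X]
% Approximate satisfaction, to within \epsilon, measured e.g. by KL divergence:
D_{\mathrm{KL}}\!\big( P[X, Y] \,\big\|\, P[X] \, P[Y \mid X] \big) \le \epsilon
```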
The usage we’re interested in looks like:
In other words, we want to write proofs diagrammatically - i.e....
Man, that top one was a mess. Fixed now, thank you!
It's a ‘superrational’ extension of the proven optimality of cooperation in game theory
+ Taking into account asymmetries of power
// Still AI risk is very real
Short version of an already skimmed 12min post
29min version here
In real contexts, with open environments (the world, the universe), there is always a risk of meeting someone or something stronger than you, and weaker agents may be specialized in exploiting your flaws/blind spots.
To protect yourself, you can choose the maximally rational and cooperative alliance:
Because any agent is subject to the same pressure/threat from (actual or potential) stronger agents/alliances/systems, one can take out insurance that more powerful superrational agents will behave well by behaving well toward weaker agents oneself. This is the basic rule allowing scale-free cooperation.
If you integrated this super-cooperative...
"What Dragons?", says the lion, "I see no Dragons, only a big empty universe. I am the most mighty thing here."
Whether or not the Imagined Dragons are real isn't relevant to the gazelles if there is no solid evidence with which to convince the lions. The lions will do what they will do. Maybe some of the lions do decide to believe in the Dragons, but there is no way to force all of them to do so. The remainder will laugh at the dragon-fearing lions and feast on extra gazelles. Their children will reproduce faster.
I took the Reading the Mind in the Eyes Test today. I got 27/36. Jessica Livingstone got 36/36.
Reading expressions is almost mind reading. Practicing reading expressions should be easy with the right software. All you need is software that shows a random photo from a large database, asks the user to guess the expression, and then tells the user the correct answer. I felt myself getting noticeably better just from the 36 images on the test.
Short standardized tests exist to test this skill, but is there good software for training it? It needs to have lots of examples, so the user learns to recognize expressions instead of overfitting on specific pictures.
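The trainer loop described above is simple enough to sketch. Everything here is hypothetical (file paths, labels, and the `run_quiz` helper are my own); a real version would display each image and take input from the user rather than from a callback.

```python
import random

def run_quiz(items, get_guess, n_questions=36, rng=None):
    """Ask n_questions about randomly drawn photos; return the fraction correct.

    items: list of (image_path, label) pairs from a labeled photo database.
    get_guess: callback standing in for the user's answer to each photo.
    """
    rng = rng or random.Random()
    correct = 0
    for _ in range(n_questions):
        image_path, label = rng.choice(items)  # show a random photo
        guess = get_guess(image_path)          # user enters their guess
        # a real UI would now reveal `label` so the user gets feedback
        if guess == label:
            correct += 1
    return correct / n_questions

# Usage with a toy "oracle" guesser standing in for a human user:
items = [("photos/eyes_01.jpg", "skeptical"), ("photos/eyes_02.jpg", "amused")]
answer_key = dict(items)
score = run_quiz(items, get_guess=lambda path: answer_key[path], n_questions=10)
print(score)  # the oracle always matches the label, so this prints 1.0
```

The key anti-overfitting requirement from the post lives in the data, not the loop: `items` needs to be large enough that the user learns expressions rather than memorizing specific photos.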
Paul Ekman has a product, but I don't know how good it is.
Paul Ekman's software is decent. When I used it (before it was a SaaS, just a CD), it basically flashed an expression for a moment, then went back to a neutral picture. After some training it did help me identify microexpressions in people.
Book review: Deep Utopia: Life and Meaning in a Solved World, by Nick Bostrom.
Bostrom's previous book, Superintelligence, triggered expressions of concern. In his latest work, he describes his hopes for the distant future, presumably to limit the risk that fear of AI will lead to a Butlerian Jihad-like scenario.
While Bostrom is relatively cautious about endorsing specific features of a utopia, he clearly expresses his dissatisfaction with the current state of the world. For instance, in a footnoted rant about preserving nature, he writes:
...Imagine that some technologically advanced civilization arrived on Earth ... Imagine they said: "The most important thing is to preserve the ecosystem in its natural splendor. In particular, the predator populations must be preserved: the psychopath killers, the fascist goons, the despotic death squads ... What a tragedy if this rich natural diversity were replaced with a monoculture of
Doing nothing might be preferable to intervening in that case. But I'm not sure if the advanced civilization in Bostrom's scenario is intervening or merely opining. I would hope the latter.
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort
Thanks to @NicholasKees for guiding this and reading drafts, to @Egg Syntax @Oskar Hollinsworth, @Quentin FEUILLADE--MONTIXI and others for comments and helpful guidance, to @Viktor Rehnberg, @Daniel Paleka, @Sami Petersen, @Leon Lang, Leo Dana, Jacek Karwowski and many others for translation help, and to @ophira for inspiring the original idea.
I wanted to test whether GPT-4’s capabilities were dependent on particular idioms, such as the English language and Arabic numerals, or if it could easily translate between languages and numeral systems as a background process. To do this I tested GPT-4’s ability to do 3-digit multiplication problems without Chain-of-Thought using a variety of prompts which included instructions in various natural languages and numeral systems.
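One mechanical piece of such an experiment is rewriting a prompt's digits into another numeral system. A small sketch (the helper name and prompt text are my own; Devanagari digits occupy Unicode code points U+0966 through U+096F):

```python
# Hypothetical helper for the numeral-system variations: rewrite the digits in
# a prompt into another system (here Devanagari) while leaving all other
# characters untouched.
DEVANAGARI = {str(d): chr(0x0966 + d) for d in range(10)}

def transliterate_digits(text, mapping=DEVANAGARI):
    return "".join(mapping.get(ch, ch) for ch in text)

print(transliterate_digits("What is 323 * 437?"))  # prints "What is ३२३ * ४३७?"
```

The same dictionary-based approach works for any decimal numeral system by swapping in a different ten-character mapping.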
Processed...
Something else to play around with that I've tried. You can force the models to handle each digit separately by putting a space between each digit of a number like "3 2 3 * 4 3 7 = "
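A tiny sketch of that prompt format (the helper names are my own):

```python
def spaced(n):
    """Render an integer with a space between each digit: 323 -> '3 2 3'."""
    return " ".join(str(n))

def make_prompt(a, b):
    # Spacing the digits forces the tokenizer to handle them one at a time,
    # per the trick described above.
    return f"{spaced(a)} * {spaced(b)} = "

print(make_prompt(323, 437))  # prints "3 2 3 * 4 3 7 = "
```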
The history of science has tons of examples of the same thing being discovered multiple times independently; wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.
But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.
Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement wikipedia's list of multiple discoveries.
To...
Here are some candidates from Claude and Gemini (Claude Opus seemed considerably better than Gemini Pro for this task). Unfortunately they are quite unreliable: I've already removed many examples from this list which I already knew to have multiple independent discoverers (like e.g. CRISPR and general relativity). If you're familiar with the history of any of these enough to say that they clearly were/weren't very counterfactual, please leave a comment.