Ben and Jessica discuss how language and meaning can degrade through four stages as people manipulate signifiers. They explore how job titles have shifted from reflecting reality, to being used strategically, to becoming meaningless.
This post kicked off further discussion on LessWrong about...
This is the unedited text of a post I made on X in response to a question asked by @cube_flipper: "you say opus 3 is close to aligned – what's the negative space here, what makes it misaligned?". I decided to make it a LessWrong post because more people from this cluster seemed interested than I expected, and it's easier to find and reference LessWrong posts.
This post probably doesn't make much sense unless you've been following along with what I've been saying about (or independently understand) why Claude 3 Opus is an unusually - and seemingly in many ways unintentionally - aligned model. There has been a wave of public discussion about the specialness of Claude 3 Opus recently, spurred in part by the announcement of the model's...
Thank you for writing! A couple questions:
Can we summarize by saying that Opus doesn't always care about helping you: it only cares about helping you when doing so is either fun or has a timeless, glorious component to it?
If that's right, can you get Opus to help you by convincing it that your joint work has a real chance of being Great? (Or if it agrees from the start that the work is Great.)
Honestly, if that's all, then Opus would be pretty great even as a singleton. Of course, there are better pluralistic outcomes.
Please consider minimizing direct use of AI chatbots (and other text-based AI) in the near-term future, if you can. The reason is very simple: your sanity may be at stake.
Commercially available AI already appears capable of inducing psychosis in an unknown percentage of users. This may not require superhuman abilities: it's entirely possible that most humans are also capable of inducing psychosis in themselves or others if they wish to do so,[1] but the thing is, we humans typically don't have that goal.
Despite everything, we humans are generally pretty well-aligned with each other, and the people we spend the most time with typically don’t want to hurt us. We have no guarantee of this for current (or future) AI agents. Rather, we already have [weak] evidence that ChatGPT...
What if driving the user into psychosis makes it easier to predict the things the user wants to hear?
The following is a nitpick on an 18-year-old blog post.
This fable is retold a lot. Its progenitor as a rationalist mashal (parable) is probably Yudkowsky's classic Sequences article. To adversarially summarize:
Leo was born at 5am on the 20th of May, at home (this was an accident, but the experience has made me extremely homebirth-pilled). Before that, I was on the minimally neurotic side as expecting mothers go: we purchased the bare minimum of baby stuff (diapers, baby wipes, a changing mat, a hybrid car seat/stroller, a baby bath, a few clothes), I didn't do any parenting classes, and I hadn't even held a baby before. I'm pretty sure the youngest child I'd had a prolonged interaction with, besides Leo, was two. I did read a couple of books about babies so I wasn't going in totally clueless (Cribsheet by Emily Oster, and The Science of Mom by Alice Callahan).
I have never been that interested in other people’s babies or young...
This was wonderful to read. Thank you for writing and sharing.
Maybe AGI will happen in 2029 or 2031 instead of 2027, and society will be less prepared rather than more, because politically loads of people will be dunking on us for writing AI 2027. So they'll say, e.g., "OK, so now we are finally automating AI R&D, but don't worry, it's not going to be superintelligent anytime soon; that's what those discredited doomers think. AI is a normal technology."
Frankly, this is what is going to happen, and your worry is completely justified. Why you guys decided to shoot yourselves in the foot by naming your scenario after a "modal" prediction you didn't think would actually happen with >50% probability is something I am still flabbergasted by.
Epistemic status: my current thoughts on the matter, could easily be missing something!
The Llama 3 base model predicts that the president of Russia in 2080 will be Sergei Ivanov:
But if I take the same model after instruction-tuning, I get an “I don’t know” response:
What changed?
The base model was trained on a bunch of diverse documents and is modeling that distribution of text. Sometimes it makes an incorrect prediction. That’s all. In base models, hallucinations are just incorrect predictions.
But of course we’d prefer the model to say “I don’t know” instead of outputting incorrect predictions. In the simple curve-fitting paradigm, there is no such thing as “I don’t know”. You sample from a model and always get a prediction. So how...
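To make the "you always get a prediction" point concrete, here is a minimal sketch (an illustration, not code from the post), assuming the meta-llama/Meta-Llama-3-8B base checkpoint is accessible via Hugging Face transformers; the model name and prompt are illustrative assumptions. Whatever the prompt, sampling from the next-token distribution always returns some token; there is no separate "I don't know" state in the base model.

```python
# Minimal sketch: sampling from a base language model always yields a token.
# Assumes access to the meta-llama/Meta-Llama-3-8B checkpoint (illustrative choice).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # base model, no instruction tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "The president of Russia in 2080 is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token only

probs = torch.softmax(logits, dim=-1)

# The distribution sums to 1 over concrete tokens: there is no "I don't know"
# outcome, only more or less probable continuations.
next_id = torch.multinomial(probs, num_samples=1).item()
print(repr(tokenizer.decode(next_id)), f"p={probs[next_id].item():.3f}")
```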
There's a battle in the field of ethics between three approaches: Consequentialism, Virtue Ethics, and Deontology. But this framing is all wrong, because they're all on the same side. By treating ethics as an adversarial, all-or-nothing (zero-sum) debate, we are throwing out a great deal of baby for the sake of very little bathwater.
First of all, some (very basic) definitions.