A book review examining Elinor Ostrom's "Governing the Commons" in light of Eliezer Yudkowsky's "Inadequate Equilibria." Are successful local institutions for governing common pool resources possible without government intervention? Under what circumstances can such institutions emerge spontaneously to solve coordination problems?
I wrote the linked post, and I’m posting a lightly edited version here for discussion. I plan to attend LessOnline, and this is my first attempt at blogging to understand and earnestly explain; it’s also a way of gauging interest in the topic, in case someone at LessOnline wants to discuss the firmware of the universe with me. I might post more physics if there seems to be interest. Here is the post:
When I teach introductory mechanics, I like to tell my students that there are three things which are all called physics, even if only one of them tends to show up on their exams.
Periodically during the semester, I draw the following diagram on the board for my Newtonian...
The halting problem is the problem of taking as input a Turing machine M and returning true if it halts, false if it doesn't. This is known to be uncomputable. The consistent guessing problem (named by Scott Aaronson) is the problem of taking as input a Turing machine M (which either returns a Boolean or never halts) and returning true or false; if M ever returns true, the oracle's answer must be true, and likewise for false. This is also known to be uncomputable.
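To see why no program can solve it (a minimal sketch of the standard diagonalization argument, not from the original post; `guess` and `D` are hypothetical names, and `D` reading its own source stands in for Kleene's recursion theorem):

```python
def guess(machine_source: str) -> bool:
    """Hypothetical total solver for consistent guessing: for any machine M
    that either returns a Boolean or runs forever, return a Boolean that
    agrees with M whenever M actually returns one."""
    raise NotImplementedError  # the argument shows no such program can exist

def D() -> bool:
    my_source = "<source code of D>"  # obtainable via the recursion theorem
    # D always halts and returns a Boolean, so a consistent guesser must
    # answer exactly what D returns...
    return not guess(my_source)
    # ...but D returns the opposite of that answer: contradiction.
```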
Scott Aaronson asks whether the consistent guessing problem is strictly easier than the halting problem. This would mean there is no Turing machine that, when given access to a consistent guessing oracle, solves the halting problem, no matter which consistent guessing oracle...
Well, I guess describing a model of a computably enumerable theory, like PA or ZFC, counts. We could also ask for a model of PA that's nonstandard in a particular way that we want, e.g. by asking for a model of $\mathrm{PA} + \neg\mathrm{Con}(\mathrm{PA})$, and that works the same way. Describing a reflective oracle has low solutions too, though this is pretty similar to the consistent guessing problem. Another one, which is really just a restatement of the low basis theorem, but perhaps a more evocative one, is as follows. Suppose some oracle machine has the property that ...
Looking for any discord/slack/other that have people working on projects related to representation reading, control, activation steering with vectors and adapters, ...Would appreciate any pointers if such a thing exists!
As the dictum goes, “If it helps but doesn’t solve your problem, perhaps you’re not using enough.” But I still find that I’m sometimes not using enough effort, not doing enough of what works; simply put, not using enough dakka. And if reading one post isn’t enough to get me to do something… perhaps there isn’t enough guidance, or examples, or repetition, or maybe me writing it will help reinforce it more. And I hope this post is useful for more than just myself.
Of course, the ideas below are not all useful in any given situation, and many are obvious, at least after they are mentioned, but when you’re trying to get more dakka, it’s probably worth running through the list and considering each one and how it...
"and seek amateur advice"
well said!
A short post laying out our reasoning for using integrated gradients as an attribution method. It is intended as a stand-alone post based on our LIB papers [1] [2]. This work was produced at Apollo Research.
Understanding circuits in neural networks requires understanding how features interact with other features. There are a lot of features, and their interactions are generally non-linear. A good starting point for understanding the interactions might be to just figure out how strongly each pair of features in adjacent layers of the network interacts. But since the relationships are non-linear, how do we quantify their 'strength' in a principled manner that isn't vulnerable to common and simple counterexamples? In other words, how do we quantify how much the value of a feature in one layer should be attributed...
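As a concrete illustration (my own sketch, not the procedure from the LIB papers; `layer_fn`, `x`, and `baseline` are placeholder names), integrated gradients attributes each output feature of a layer to each input feature by averaging the layer's Jacobian along a straight path from a baseline activation to the actual activation, then scaling by the displacement from that baseline:

```python
import torch

def integrated_gradients(layer_fn, x, baseline=None, steps=64):
    """Approximate integrated-gradients attributions of each input feature
    of `layer_fn` (layer l) to each of its output features (layer l+1)."""
    if baseline is None:
        baseline = torch.zeros_like(x)  # zero baseline; other choices are possible
    out_dim = layer_fn(x).numel()
    in_dim = x.numel()
    avg_jac = torch.zeros(out_dim, in_dim)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = baseline + alpha * (x - baseline)
        # Jacobian of the layer's outputs w.r.t. its inputs at this path point
        jac = torch.autograd.functional.jacobian(layer_fn, point)
        avg_jac += jac.reshape(out_dim, in_dim) / steps
    # attributions[j, i]: how much of output feature j is attributed to input feature i
    return avg_jac * (x - baseline).reshape(1, -1)

# Usage sketch on a toy layer:
layer = torch.nn.Sequential(torch.nn.Linear(4, 3), torch.nn.ReLU())
x = torch.randn(4)
attributions = integrated_gradients(layer, x)  # shape (3, 4)
```

With enough path steps, the attributions for each output feature approximately sum to that feature's change between the baseline and `x` (the completeness property), which is one reason integrated gradients is a principled choice here.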
This is a linkpost for our two recent papers:
This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu. Not to be confused with Apollo's recent Sparse Dictionary Learning paper.
A key obstacle to mechanistic interpretability is finding the right representation of neural network internals. Optimally, we would like to derive our features from some high-level principle that holds across different architectures and use cases. At a minimum, we know two things:
FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("Preparedness Framework"), and METR's Key Components of an RSP.
DeepMind's FSF has three steps:
Thanks.
Deployment mitigations level 2 discusses the need for mitigations on internal deployments.
Good point; this makes it clearer that "deployment" means external deployment by default. But level 2 only mentions "internal access of the critical capability," which sounds like it's about misuse — I'm more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.
ML R&D will require thinking about internal deployments (and so will many of the other CCLs).
OK. I hope DeepMind does that thinking and makes appropriate commi...
This post is a slightly-adapted summary of two twitter threads, here and here.
As we get closer to AGI, it becomes less appropriate to treat it as a binary threshold. Instead, I prefer to treat it as a continuous spectrum defined by comparison to time-limited humans. I call a system a t-AGI if, on most cognitive tasks, it beats most human experts who are given time t to perform the task.
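Spelled out slightly more formally (this formalization is mine, not the post's), the definition reads roughly as:

$$\text{a system is a } t\text{-AGI} \iff \text{for most cognitive tasks } T \text{ and most human experts } E:\quad \mathrm{score}_{\mathrm{AI}}(T) > \mathrm{score}_{E}(T;\, t),$$

where $\mathrm{score}_{E}(T;\, t)$ is expert $E$'s performance on task $T$ given a time budget of $t$.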
What does that mean in practice?
Some evidence in favor of the framework, from Advanced AI evaluations at AISI: May update:
...Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not as
Previously: OpenAI: Facts From a Weekend, OpenAI: The Battle of the Board, OpenAI: Leaks Confirm the Story, OpenAI: Altman Returns, OpenAI: The Board Expands.
Ilya Sutskever and Jan Leike have left OpenAI. This is almost exactly six months after Altman’s temporary firing and The Battle of the Board, the day after the release of GPT-4o, and soon after a number of other recent safety-related OpenAI departures. Many others working on safety have also left recently. This is part of a longstanding pattern at OpenAI.
Jan Leike later offered an explanation for his decision on Twitter. Leike asserts that OpenAI has lost sight of its safety mission and that its culture has grown increasingly hostile to it. He says the superalignment team was starved for resources, with its explicit public compute commitments dishonored, and...
The 20% of compute thing reminds me of this post from 2014:
I am Sam Altman, lead investor in reddit's new round and President of Y Combinator. AMA!
We're working on a way to give 10% of our shares from this round to the reddit community.
As far as I know, this didn't happen.
Though to be fair, Reddit is indeed doing the users-as-shareholders thing now, in 2024. But I guess it's unrelated to the plans from back then.