A book review examining Elinor Ostrom's "Governing the Commons" in light of Eliezer Yudkowsky's "Inadequate Equilibria." Are successful local institutions for governing common pool resources possible without government intervention? Under what circumstances can such institutions emerge spontaneously to solve coordination problems?
I wrote the linked post, and I’m posting a lightly edited version here for discussion. I plan to attend LessOnline, and this is my first attempt at blogging to understand and earnestly explain; it’s also a way of gauging interest in the topic, in case someone at LessOnline wants to discuss the firmware of the universe with me. I might post more physics if there seems to be interest. Here is the post:
When I teach introductory mechanics, I like to tell my students that there are three things which are all called physics, even if only one of them tends to show up on their exams.
Periodically during the semester, I draw the following diagram on the board for my Newtonian...
The halting problem is the problem of taking as input a Turing machine M and returning true if it halts, false if it doesn't. This is known to be uncomputable. The consistent guessing problem (named by Scott Aaronson) is the problem of taking as input a Turing machine M (which either returns a Boolean or never halts) and returning true or false; if M ever returns true, the oracle's answer must be true, and likewise for false. This is also known to be uncomputable.
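To see why no program can solve it (a minimal sketch of the standard diagonalization argument, not from the original post; `guess` and `D` are hypothetical names, and `D` reading its own source stands in for Kleene's recursion theorem):

```python
def guess(machine_source: str) -> bool:
    """Hypothetical total solver for consistent guessing: for any machine M
    that either returns a Boolean or runs forever, return a Boolean that
    agrees with M whenever M actually returns one."""
    raise NotImplementedError  # the argument shows no such program can exist

def D() -> bool:
    my_source = "<source code of D>"  # obtainable via the recursion theorem
    # D always halts and returns a Boolean, so a consistent guesser must
    # answer exactly what D returns...
    return not guess(my_source)
    # ...but D returns the opposite of that answer: contradiction.
```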
Scott Aaronson asks whether the consistent guessing problem is strictly easier than the halting problem. This would mean there is no Turing machine that, when given access to a consistent guessing oracle, solves the halting problem, no matter which consistent guessing oracle...
Well, I guess describing a model of a computably enumerable theory, like PA or ZFC, counts. We could also ask for a model of PA that's nonstandard in a particular way that we want, e.g. by asking for a model of $\mathrm{PA} + \neg\mathrm{Con}(\mathrm{PA})$, and that works the same way. Describing a reflective oracle has low solutions too, though this is pretty similar to the consistent guessing problem. Another one, which is really just a restatement of the low basis theorem, but perhaps a more evocative one, is as follows. Suppose some oracle machine has the property that ...
Looking for any discord/slack/other that have people working on projects related to representation reading, control, activation steering with vectors and adapters, ...Would appreciate any pointers if such a thing exists!
As the dictum goes, “If it helps but doesn’t solve your problem, perhaps you’re not using enough.” But I still find that I’m sometimes not using enough effort, not doing enough of what works; simply put, not using enough dakka. And if reading one post isn’t enough to get me to do something… perhaps there isn’t enough guidance, or examples, or repetition, or maybe me writing it will help reinforce it more. And I hope this post is useful for more than just myself.
Of course, the ideas below are not all useful in any given situation, and many are obvious, at least after they are mentioned, but when you’re trying to get more dakka, it’s probably worth running through the list and considering each one and how it...
"and seek amateur advice"
well said!
A short post laying out our reasoning for using integrated gradients as an attribution method. It is intended as a stand-alone post based on our LIB papers [1] [2]. This work was produced at Apollo Research.
Understanding circuits in neural networks requires understanding how features interact with other features. There are a lot of features, and their interactions are generally non-linear. A good starting point for understanding the interactions might be to just figure out how strongly each pair of features in adjacent layers of the network interacts. But since the relationships are non-linear, how do we quantify their 'strength' in a principled manner that isn't vulnerable to common and simple counterexamples? In other words, how do we quantify how much the value of a feature in one layer should be attributed...
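As a concrete illustration (my own sketch, not the procedure from the LIB papers; `layer_fn`, `x`, and `baseline` are placeholder names), integrated gradients attributes each output feature of a layer to each input feature by averaging the layer's Jacobian along a straight path from a baseline activation to the actual activation, then scaling by the displacement from that baseline:

```python
import torch

def integrated_gradients(layer_fn, x, baseline=None, steps=64):
    """Approximate integrated-gradients attributions of each input feature
    of `layer_fn` (layer l) to each of its output features (layer l+1)."""
    if baseline is None:
        baseline = torch.zeros_like(x)  # zero baseline; other choices are possible
    out_dim = layer_fn(x).numel()
    in_dim = x.numel()
    avg_jac = torch.zeros(out_dim, in_dim)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = baseline + alpha * (x - baseline)
        # Jacobian of the layer's outputs w.r.t. its inputs at this path point
        jac = torch.autograd.functional.jacobian(layer_fn, point)
        avg_jac += jac.reshape(out_dim, in_dim) / steps
    # attributions[j, i]: how much of output feature j is attributed to input feature i
    return avg_jac * (x - baseline).reshape(1, -1)

# Usage sketch on a toy layer:
layer = torch.nn.Sequential(torch.nn.Linear(4, 3), torch.nn.ReLU())
x = torch.randn(4)
attributions = integrated_gradients(layer, x)  # shape (3, 4)
```

With enough path steps, the attributions for each output feature approximately sum to that feature's change between the baseline and `x` (the completeness property), which is one reason integrated gradients is a principled choice here.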
This is a linkpost for our two recent papers:
This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu. Not to be confused with Apollo's recent Sparse Dictionary Learning paper.
A key obstacle to mechanistic interpretability is finding the right representation of neural network internals. Optimally, we would like to derive our features from some high-level principle that holds across different architectures and use cases. At a minimum, we know two things:
FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("Preparedness Framework"), and METR's Key Components of an RSP.
DeepMind's FSF has three steps:
Thanks.
Deployment mitigations level 2 discusses the need for mitigations on internal deployments.
Good point; this makes it clearer that "deployment" means external deployment by default. But level 2 only mentions "internal access of the critical capability," which sounds like it's about misuse — I'm more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.
ML R&D will require thinking about internal deployments (and so will many of the other CCLs).
OK. I hope DeepMind does that thinking and makes appropriate commi...
This post is a slightly-adapted summary of two twitter threads, here and here.
As we get closer to AGI, it becomes less appropriate to treat it as a binary threshold. Instead, I prefer to treat it as a continuous spectrum defined by comparison to time-limited humans. I call a system a t-AGI if, on most cognitive tasks, it beats most human experts who are given time t to perform the task.
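Spelled out slightly more formally (this formalization is mine, not the post's), the definition reads roughly as:

$$\text{a system is a } t\text{-AGI} \iff \text{for most cognitive tasks } T \text{ and most human experts } E:\quad \mathrm{score}_{\mathrm{AI}}(T) > \mathrm{score}_{E}(T;\, t),$$

where $\mathrm{score}_{E}(T;\, t)$ is expert $E$'s performance on task $T$ given a time budget of $t$.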
What does that mean in practice?
Some evidence in favor of the framework, from Advanced AI evaluations at AISI: May update:
...Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not as
Previously: OpenAI: Facts From a Weekend, OpenAI: The Battle of the Board, OpenAI: Leaks Confirm the Story, OpenAI: Altman Returns, OpenAI: The Board Expands.
Ilya Sutskever and Jan Leike have left OpenAI. This is almost exactly six months after Altman’s temporary firing and The Battle of the Board, the day after the release of GPT-4o, and soon after a number of other recent safety-related OpenAI departures. Many others working on safety have also left recently. This is part of a longstanding pattern at OpenAI.
Jan Leike later offered an explanation for his decision on Twitter. Leike asserts that OpenAI has lost sight of its safety mission and that its culture has grown increasingly hostile to it. He says the superalignment team was starved for resources, with its explicit public compute commitments dishonored, and...
The 20% of compute thing reminds me of this post from 2014:
I am Sam Altman, lead investor in reddit's new round and President of Y Combinator. AMA!
We're working on a way to give 10% of our shares from this round to the reddit community.
As far as I know, this didn't happen.
Though to be fair, Reddit is indeed doing the users-as-shareholders thing now, in 2024. But I guess it's unrelated to the plans from back then.