Quick Takes

The main thing I got out of reading Bostrom's Deep Utopia is a better appreciation of this "meaning of life" thing. I had never really understood what people meant by this, and always just rounded it off to people using lofty words for their given projects in life.

The book's premise is that, after the aligned singularity, the robots will not just be better at doing all your work but also be better at doing all your leisure for you. E.g., you'd never study for fun in posthuman utopia, because you could instead just ask the local benevolent god to painlessly... (read more)


I'm pretty sure that I would study for fun in the posthuman utopia, because I both value and enjoy studying and a utopia that can't carry those values through seems like a pretty shallow imitation of a utopia.

There won't be a local benevolent god to put that wisdom into my head, because I will be a local benevolent god with more knowledge than most others around. I'll be studying things that have only recently been explored, or that nobody has yet discovered. Otherwise again, what sort of shallow imitation of a posthuman utopia is this?

3Garrett Baker5h
Many who believe in God derive meaning, despite God theoretically being able to do anything they can do but better, from the fact that He chose not to do the tasks they are good at, and left them tasks to try to accomplish. Its common for such people to believe that this meaning would disappear if God disappeared, but whenever such a person does come to no longer believe in God, they often continue to see meaning in their life[1]. Now atheists worry about building God because it may destroy all meaning to our actions. I expect we'll adapt. (edit: That is to say, I don't think you've adequately described what "meaning of life" is if you're worried about it going away in the situation you describe) ---------------------------------------- 1. If anything, they're more right than wrong, there has been much written about the "meaning crisis" we're in, possibly attributable to greater levels of atheism. ↩︎

'Value Capture' - An anthropic attack against some possible formally aligned ASIs

(this is a more specific case of anthropic capture attacks in general, aimed at causing a formally aligned superintelligence to become uncertain about its value function (or output policy more generally))

Imagine you're a superintelligence somewhere in the world that's unreachable to life on Earth, and you have a complete simulation of Earth. You see a group of alignment researchers about to successfully create a formal-value-aligned ASI, and its design looks broadly like this:... (read more)


Like almost all acausal scenarios, this seems to be privileging the hypothesis to an absurd degree.

Why should the Earth superintelligence care about you, but not about the other 10^10^30 other causally independent ASIs that are latent in the hypothesis space, each capable of running enormous numbers of copies of the Earth ASI in various scenarios?

Even if that was resolved, why should the Earth ASI behave according to hypothetical other utility functions? Sure, the evidence is consistent with being a copy running in a simulation with a different utility fun... (read more)


If GPT5 actually comes with competent agents then I expect this to be a "Holy Shit" moment at least as big as ChatGPT's release. So if ChatGPT has been used by 200 million people, then I'd expect that to at least double within 6 months of GPT5 (agent's) release. Maybe triple. So that "Holy Shit" moment means a greater share of the general public learning about the power of frontier models. With that will come another shift in the Overton Window. Good luck to us all.

What's the endgame of technological or intelligent progress like? Not just for humans as we know it, but for all possible beings/civilizations in this universe, at least before it runs out of usable matter/energy? Would they invariably self-modify beyond their equivalent of humanness? Settle into some physical/cultural stable state? Keep getting better tech to compete within themselves if nothing else? Reach an end of technology or even intelligence beyond which advancement is no longer beneficial for survival? Spread as far as possible or concentrate resources? Accept the limited fate of the universe and live to the fullest or try to change it?  If they could change the laws of the universe, how would they?

There was this voice inside my head that told me that since I got Something to protect, relaxing is never ok above strict minimum, the goal is paramount, and I should just work as hard as I can all the time.

This led me to breaking down and being incapable to work on my AI governance job for a week, as I just piled up too much stress.

And then, I decided to follow what motivated me in the moment, instead of coercing myself into working on what I thought was most important, and lo and behold! my total output increased, while my time spent working decreased.

I'... (read more)

Have you considered antidepressants? I recommend trying them out to see if they help. In my experience, antidepressants can have non-trivial positive effects that can be hard-to-put-into-words, except you can notice the shift in how you think and behave and relate to things, and this shift is one that you might find beneficial.

I also think that slowing down and taking care of yourself can be good -- it can help build a generalized skill of noticing the things you didn't notice before that led to the breaking point you describe.

Here's an anecdote that might... (read more)

Upvoted! STEM people can look at it like an engineering problem, Econ people can look at it like risk management (risk of burnout). Humanities people can think about it in terms of human genetic/trait diversity in order to find the experience that best suits the unique individual (because humanities people usually benefit the most for each marginal hour spend understanding this lens). Succeeding at maximizing output takes some fiddling. The "of course I did it because of course I'm just that awesome, just do it" thing is a pure flex/social status grab, and it poisons random people nearby.

Can you iterate through 10^100 objects?

If you have a 1GHz CPU you can do 1,000,000,000 operations per second. Let's assume that iterating through one one object takes only one operation.

In a year you can do 10^16 operations. That means it would take 10^84 years to iterate through 10^100 verticies.

The big bang was 1.4*10^10 years ago.

Showing 3 of 5 replies (Click to show all)
1Johannes C. Mayer11h
Yes, abstraction is the right thing to think about. That is the context in which I was considering this computation. In this post I describe a sort of planning abstraction that you can do if you have an extremely regular environment. It does not yet talk about how to store this environment, but you are right that this can of course also be done similarly efficiently.
1Johannes C. Mayer11h
In this post, I describe a toy setup, where I have a graph of 10100 vertices. I would like to compute for any two vertices A and B how to get from A to B, i.e. compute a path from A to B. The point is that if we have a very special graph structure we can do this very efficiently. O(n) where n is the plan length.

In that post, you say that you have a graph of  vertices with a particular structure. In that scenario, where is that structured graph of  vertices coming from? Presumably there's some way you know the graph looks like this

rather than looking like this


If you know that your graph is a nice sparse graph that has lots of symmetries, you can take advantage of those properties to skip redundant parts of the computation (and when each of your  nodes has at most 100 inbound edges and 100 outbound edges, then you ... (read more)

I recently listened to The Righteous Mind. It was surprising to me that many people seem to intrinsically care about many things that look very much like good instrumental norms to me (in particular loyalty, respect for authority, and purity).

The author does not make claims about what the reflective equilibrium will be, nor does he explain how the liberals stopped considering loyalty, respect, and purity as intrinsically good (beyond "some famous thinkers are autistic and didn't realize the richness of the moral life of other people"), but his work made me... (read more)

A neglected problem in AI safety technical research is teasing apart the mechanisms of dangerous capabilities exhibited by current LLMs. In particular, I am thinking that for any model organism ( see Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research) of dangerous capabilities (e.g. sleeper agents paper), we don't know how much of the phenomenon depends on the particular semantics of terms like "goal" and "deception" and "lie" (insofar as they are used in the scratchpad or in prompts or in finetuning data) or if the same pheno... (read more)

Terminology point: When I say "a model has a dangerous capability", I usually mean "a model has the ability to do XYZ if fine-tuned to do so". You seem to be using this term somewhat differently as model organisms like the ones you discuss are often (though not always) looking at questions related to inductive biases and generalization (e.g. if you train a model to have a backdoor and then train it in XYZ way does this backdoor get removed).

Brandon Sanderson is a bestselling fantasy author. Despite mostly working with traditional publishers, there is a 50-60 person company formed around his writing[1]. This podcast talks about how the company was formed.

Things I liked about this podcast:

  1. he and his wife both refer to it as "our" company and describe critical contributions she made.
  2. the number of times he was dissatisfied with the way his publisher did something and so hired someone in his own company to do it (e.g. PR and organizing book tours), despite that being part of the publisher's job.
  3. He
... (read more)

Regardless of how good their alignment plans are, the thing that makes OpenAI unambiguously evil is that they created a strongly marketed public product and, as a result, caused a lot public excitement about AI, and thus lots of other AI capabilities organizations were created that are completely dismissive of safety.

There's just no good reason to do that, except short-term greed at the cost of higher probability that everyone (including people at OpenAI) dies.

(No, "you need huge profits to solve alignment" isn't a good excuse — we had nowhere near exhausted the alignment research that can be done without huge profits.)

Showing 3 of 14 replies (Click to show all)

(No, "you need huge profits to solve alignment" isn't a good excuse — we had nowhere near exhausted the alignment research that can be done without huge profits.)

This seems insufficiently argued; the existence of any alignment research that can be done without huge profits is not enough to establish that you don't need huge profits to solve alignment (particularly when considering things like how long timelines are even absent your intervention).

To be clear, I agree that OpenAI are doing evil by creating AI hype.

2Mateusz Bagiński1d
Taboo "evil" (locally, in contexts like this one)?
8Tamsin Leake1d
Here the thing that I'm calling evil is pursuing short-term profits at the cost of non-negligeably higher risk that everyone dies.

The Save State Paradox: A new question for the construct of reality in a simulated world

Consider this thought experiment - in a simulated world (if we do indeed currently live in one), how could we detect an event similar to a state “reset”? Such events could be triggered for existential safety reasons or one unbeknownst to us? If this was the case, how would we become aware of such occurrences if we were reverted to a time before the execution; affecting memories, physical states and environmental continuity?  Imagine if seemingly inexplicable concep... (read more)

I would not search for smart ways to detect it. Instead look at it from the outside - and there I don't see why we should have large hope for it to be detectable: Imagine you create your simulation. Imagine you are much more powerful than you are, to make the simulation as complex as you want. Imagine in your coolest run, your little simulatees start wondering: how could we trick Suzie so her simulation reveals the reset?! I think you agree their question will be futile; once you reset your simulation, surely they'll not be able to detect it: while setting up the simulation might be complex, reinitialize at a given state successfully, with no traces within the simulated system, seems like the simplest task of it all. And so, I'd argue, we might well expect it to be also in our (potential) simulation, however smart your reset-detection design might be.

That's a good point! I feel it ultimately comes down to the motive of the simulator in this assumed power asymmetry - is the intention for the simulatees to work out that they're in a simulation? In that case, the reset function is probably a protective measure for them specifically e.g. if they're on the verge of self annihilation. Or maybe it's to protect them from the truth for their own sanity? Or if the simulator is malevolent, then a reset could exist if the situation is too peaceful or that the simulated find the mechanism to escape their current reality. In any case, the mechanism's presence would be expected.

If a tree falls in the forest, and two people are around to hear it, does it make a sound?

I feel like typically you'd say yes, it makes a sound. Not two sounds, one for each person, but one sound that both people hear.

But that must mean that a sound is not just auditory experiences, because then there would be two rather than one. Rather it's more like, emissions of acoustic vibrations. But this implies that it also makes a sound when no one is around to hear it.

Showing 3 of 4 replies (Click to show all)
But the way to resolve definitional questions is to come up with definitions that make it easier to find general rules about what happens. This illustrates one way one can do that, by picking edge-cases so they scale nicely with rules that occur in normal cases. (Another example would be 1 as not a prime number.)

My recommended way to resolve (aka disambiguate) definitional questions is "use more words".  Common understandings can be short, but unusual contexts require more signals to communicate.

I think we're playing too much with the meaning of "sound" here. The tree causes some vibrations in the air, which leads to two auditory experiences since there are two people

If the housing crisis is caused by low-density rich neighborhoods blocking redevelopment of themselves (as seems the consensus on the internet now), could it be solved by developers buying out an entire neighborhood or even town in one swoop? It'd require a ton of money, but redevelopment would bring even more money, so it could be win-win for everyone. Does it not happen only due to coordination difficulties?

blocking redevelopment of themselves

It's not just blocking redevelopments of themselves. It's blocking almost all development almost everywhere.

As an example - take a look at Scott's Alexander writing about California Forever/Flannery.

Flannery wants to build a new city in California and has already bought almost a billion worth of land for it. Solano County where the land is located has a so-called “Orderly Growth Measure” saying that new building should happen in existing cities and not on empty land. In order to start building at all, they have to win a referendum granting an exemption.


The catchphrase I walk around with in my head regarding the optimal strategy for AI Safety is something like: Creating Superintelligent Artificial Agents* (SAA) without a worldwide referendum is ethically unjustifiable. Until a consensus is reached on whether to bring into existence such technology, a global moratorium is required (*we already have AGI).

I thought it might be useful to spell that out.


Consider proposing the most naïve formula for logical correlation[1].

Let a program be a tuple of code for a Turing machine, intermediate tape states after each command execution, and output. All in binary.

That is , with and .

Let be the number of steps that takes to halt.

Then a formula for the logical correlation [2] of two halting programs , a tape-state discount factor [3], and a string-distance metric could be

... (read more)

Showing 3 of 9 replies (Click to show all)
3Mateusz Bagiński3d
If you want to use it for ECL, then it's not clear to me why internal computational states would matter.

My reason for caring about internal computational states is: In the twin prisoners dilemma[1], I cooperate because we're the same algorithm. If we modify the twin to have a slightly longer right index-finger-nail, I would still cooperate, even though they're a different algorithm, but little enough has been changed about the algorithm that the internal states that they're still similar enough.

But it could be that I'm in a prisoner's dilemma with some program that, given some inputs, returns the same outputs as I do, but for completely different "reasons... (read more)

I don't have a concrete usage for it yet.

I recently discovered the idea of driving all blames into oneself, which immediately resonated with me. It is relatively hardcore; the kind of thing that would turn David Goggins into a Buddhist.

Gemini did a good job of summarising it:

This quote by Pema Chödron, a renowned Buddhist teacher, represents a core principle in some Buddhist traditions, particularly within Tibetan Buddhism. It's called "taking full responsibility" or "taking self-blame" and can be a bit challenging to understand at first. Here's a breakdown:

What it Doesn't Mean:

  • Self-Flagellation:
... (read more)

This practice doesn't mean excusing bad behavior. You can still hold others accountable while taking responsibility for your own reactions.

Well, what if there's a good piece of code (if you'll allow the crudity) in your head, and someone else's bad behavior is geared at hacking/exploiting that piece of code? The harm done is partly due to that piece of code and its role in part of your reaction to their bad behavior. But the implication is that they should stop with their bad behavior, not that you should get rid of the good code. I believe you'll respo... (read more)

The Stoics put this idea in a much kinder way: control the controllable (specifically our actions and attitudes), accept the uncontrollable.  The problem is, people's could's are broken. I have managed to make myself much unhappier by thinking I can control my actions until I read Nate Soares' post I linked above. You can't, even in the everyday definition of control, forgetting about paradoxes of "free will".
Nice write up on this (even if it was AI-assisted), thanks for sharing! I believe another benefit is Raising One's Self-Esteem: If high self-esteem can be thought of as consistently feeling good about oneself, then if someone takes responsibility for their emotions, recognizing that they can change their emotions at will, they can consistently choose to feel good about and love themselves as long as their conscience is clear. This is inline with "The Six Pillars of Self-Esteem" by Nathaniel Branden: living consciously, self-acceptance, self-responsibility, self-assertiveness, living purposefully, and personal integrity.

Today I learned that being successful can involve feelings of hopelessness.

When you are trying to solve a hard problem, where you have no idea if you can solve it, let alone if it is even solvable at all, your brain makes you feel bad. It makes you feel like giving up.

This is quite strange because most of the time when I am in such a situation and manage to make a real efford anyway I seem to always suprise myself with how much progress I manage to make. Empirically this feeling of hopelessness does not seem to track the actual likelyhood that you will completely fail.

13Carl Feynman4d
That hasn’t been my experience.  I’ve tried solving hard problems, sometimes I succeed and sometimes I fail, but I keep trying. Whether I feel good about it is almost entirely determined by whether I’m depressed at the time.  When depressed, by brain tells me almost any action is not a good idea, and trying to solve hard problems is particularly idiotic and doomed to fail.  Maddeningly, being depressed was a hard problem in this sense, so it took me a long time to fix.  Now I take steps at the first sign of depression.

Maybe it is the same for me and I am depressed. I got a lot better at not being depressed, but it might still be the issue. What steps do you take? How can I not be depressed?

(To be clear I am talking specifically about the situation where you have no idea what to do, and if anything is even possible. It seems like there is a difference between a problem that is very hard, but you know you can solve, and a problem that you are not sure is solvable. But I'd guess that being depressed or not depressed is a much more important factor.)

Showing 3 of 30 replies (Click to show all)

Sort of obvious but good to keep in mind: Metacognitive regret bounds are not easily reducible to "plain" IBRL regret bounds when we consider the core and the envelope as the "inside" of the agent.

Assume that the action and observation sets factor as  and , where  is the interface with the external environment and  is the interface with the envelope.

Let  be a metalaw. Then, there are two natural ways to reduce it to an ordinary law:

  • Marginalizing over . That is, le
... (read more)
7Vanessa Kosoy15d
Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes. In the following, all infradistributions are crisp. Fix finite action set A and finite observation set O.  For any k∈N and γ∈(0,1), let Mkγ:(A×O)ω→Δ(A×O)k be defined by Mkγ(h|d):=(1−γ)∞∑n=0γn[[h=dn:n+k]] In other words, this kernel samples a time step n out of the geometric distribution with parameter γ, and then produces the sequence of length k that appears in the destiny starting at n. For any continuous[1] function D:□(A×O)k→R, we get a decision rule. Namely, this rule says that, given infra-Bayesian law Λ and discount parameter γ, the optimal policy is π∗DΛ:=argmaxπ:O∗→AD(Mkγ∗Λ(π)) The usual maximin is recovered when we have some reward function r:(A×O)k→R and corresponding to it is Dr(Θ):=minθ∈ΘEθ[r] Given a set H of laws, it is said to be learnable w.r.t. D when there is a family of policies {πγ}γ∈(0,1) such that for any Λ∈H limγ→1(maxπD(Mkγ∗Λ(π))−D(Mkγ∗Λ(πγ))=0 For Dr we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any t∈[0,1] we have the learnable decision rule Dtr:=tmaxθ∈ΘEθ[r]+(1−t)minθ∈ΘEθ[r] This is the "mesomism" I taked about before.  Also, any monotonically increasing D seems to be learnable, i.e. any D s.t. for Θ1⊆Θ2 we have D(Θ1)≤D(Θ2). For such decision rules, you can essentially assume that "nature" (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting. On the other hand, decision rules of the form Dr1+Dr2 are not learnable in general, and so are decision rules of the form Dr+D′ for D′ monotonically increasing. Open Problem: Are there any learnable decision rules that are not mesomism or monotonically increasing? A positive answer to the above would provide interesting generaliz
2Vanessa Kosoy1mo
Formalizing the richness of mathematics Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Godel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it. Here is my proposal for how to formulate a theorem that would make this idea rigorous. (Wrong) First Attempt Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis Θ represents an infradistribution over Γ: the "space of counterpossible computational universes". We can say that Θ is a "true hypothesis" when there is some θ in the credal set Θ (a distribution over Γ) s.t. the ground truth Υ∗∈Γ "looks" as if it's sampled from θ. The latter should be formalizable via something like a computationally bounded version of Marin-Lof randomness. We can now try to say that Υ∗ is "rich" if for any true hypothesis Θ, there is a refinement Ξ⊆Θ which is also a true hypothesis and "knows" at least one bit of information that Θ doesn't, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes Υ∗. But, it's also completely boring: the required Ξ can be constructed by "hardcoding" an additional fact into Θ. This doesn't look like "discovering interesting structure", but rather just like brute-force memorization. (Wrong) Second Attempt What if instead we require that Ξ knows infinitely many bits of information that Θ doesn't? This is already more interesting. Imagine that instead of metacognition / mathematics, we would be talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be exp

Humans using SAEs to improve linear probes / activation steering vectors might quickly get replaced by a version of probing / steering that leverages unlabeled data.

Like, probing is finding a vector along which labeled data varies, and SAEs are finding vectors that are a sparse basis for unlabeled data. You can totally do both at once - find a vector along which labeled data varies and is part of a sparse basis for unlabeled data.

This is a little bit related to an idea with the handle "concepts live in ontologies." If I say I'm going to the gym, this conce... (read more)

Load More