Adam Kaufman — LessWrong

Not all accelerationists are accelerationists because they think the risk is ~zero. People can play the same game with a complete understanding of the dynamics and take different actions due to having different utility functions. Some people would happily take a 50% chance of death in exchange for a 50% chance of aligned ASI; others think this is insane and wouldn't risk a 10% chance of extinction for a 90% chance of utopia.

Our Reality: A Simulation Run by a Paperclip Maximizer

Adam Kaufman7mo110

When the bullet missed Trump by half an inch I made a lot of jokes about us living in an importance-sampled simulation.

Our Reality: A Simulation Run by a Paperclip Maximizer

Adam Kaufman7mo71

Here's a slightly more general way of phrasing it:
We find ourselves in an extremely leveraged position, making decisions which may influence the trajectory of the entire universe (more precisely our lightcone contains a gigantic amount of resources). There are lots of reasons to care about what happens to universes like ours, either because you live in one or because you can acausally trade with one that you think probably exists. "Paperclip maximizers" is a very small subset of the parties that have a reason to be interested in trying to figure out what happens to universes like ours. I'd wager there are a lot more simulations of minds in highly leveraged positions than there are minds which actually do have a lot of leverage. Being one of the people working on AI/AI safety adds several OOMs of coincidence over being a human in this time period in general, but being a super early mind in this universe at all is still hugely leveraged. Since highly leveraged minds are a lot more likely to be created in simulations than they are to actually have a lot of leverage, you are probably in a simulation. That said, for most utility functions it really shouldn't matter. If you're simulated, it's because your decisions are correlated in some important way with decisions that actually do influence huge amounts of resources, otherwise no one would bother running the simulation. You might as well just act how you would want yourself to act conditional on influencing huge amounts of resources. If your utility is a lot more highly discounted, and you just care about your own short term experiences, then you can just enjoy yourself, simulated or not (although maybe this reduces your measure a bit because no one will bother simulating you if you aren't going to make influential decisions).

Why do many people who care about AI Safety not clearly endorse PauseAI?

Answer by Adam KaufmanMar 31, 2025-10

I think AI safety has very limited political capital at the moment. Pausing AI just isn’t going to happen, so advocating for it makes you sound unreasonable and allows people to comfortably ignore your other opinions. I prefer trying to push for interventions which make a difference with much less political capital, like convincing frontier labs to work on and implement control measures.

Anthropic: Three Sketches of ASL-4 Safety Case Components

Adam Kaufman1y*Ω593

Kudos for releasing a concept of a plan! Some thoughts:

Regarding the first safety case:

The amount of progress in mech interp required to make the first safety case suitable seems overly optimistic to me; I basically think that most of the "limitations" are in fact pretty serious. However, I appreciate that the attempt to include concrete requirements.
I believe that getting good results from the following experiments might be particularly unrealistic:
- "In order to robustify our evals against sandbagging, we ran an experiment where we steered one or more truth-related features to make the model confident that it is in either testing or unmonitored deployment"
  - I worry that "truth-related" features are very difficult both in principle and in practice to isolate, as the model's representation of what other people / entities believe will be hard to disentangle from what it itself believes, to the extent that those are even meaningfully separable concepts.
  - Even if you could identify a feature roughly corresponding to "I believe the current sentence is true", it is not immediately clear to me how this could be used to successfully convince the model of more complex situational information, e.g. it is not being monitored and is in a real deployment setting. For instance, I don't expect artificially activating that feature on the sentence "I am not being monitored and this is a real deployment setting" to just work.
- "In all of these model organisms, deactivating the features of concern was sufficient to prevent the harmful behaviors"
  - I can't concisely refute this. It is just a "huge if true" kind of claim, and as far as I am aware very much not possible with current techniques. If we could isolate a set of linear directions that were robustly causally responsible for scheming behavior, that would have massive ramifications far beyond this setting.
  - I worry a little bit about "training on test" here – there may exist some overfit set of classifiers that can stop model organisms from misbehaving in a specific setting without meaningfully generalizing. This is also related to the stated "Realism of the model organisms" concerns.
I strongly approve of this element of the plan: "In order to preserve the integrity of the learned features, we ensured that the learned features were never used as part of the training procedure for the models."
- It's kind of hard to do this in full generality, since you need to not only avoid training directly against the probes, but also avoid using the information from probes at all when selecting what to train on.

I'm more excited about the control safety case, but have a few nitpicks:

Developing a good elicitation strategy is indeed challenging.
- I think it is pretty crucial to use real human red-teaming, especially if sandbagging (more specifically something like exploration hacking) is a concern.
- It may be helpful to give the red team way more affordances than the untrusted model would actually have in deployment, and only remove affordances if defense becomes impossible. Some examples of extra affordances:
  - Letting the red team know the blue team's exact strategy in order to search for the best response
  - Giving an untrusted model more unmonitored chain-of-thought / tool use / human assistance than it would actually have in deployment
  - Allowing human red-teamers to repeatedly attempt to bypass a monitor when in reality a scheming model might have to one-shot it
- Figuring out exactly what extra affordances suffice to break the control protocol may also provide useful signal about where the protocol could be strengthened, and what capabilities need to be tested for going forward.
- Eliciting capabilities from substantially superhuman models is a much harder problem, and is one of the main reasons I don't expect control to suffice for a safety case well beyond human-level AI.
The space of strategies beyond trusted monitoring alone is pretty large. You mention untrusted monitoring, but there is also a broad space of strategies that include editing outputs with a trusted model, allocating more monitoring/auditing budget in certain situations, etc.
It is probably best to continually produce new honeypots throughout deployment.
- For instance, if untrusted models are being used to do lots of automated software engineering and research, likely requiring huge amounts of context, it will become more obvious to the model over time that it is in deployment.
- Making new honeypots out of actual deployment settings, for instance by adding exploitable security flaws, may generally make it much harder for the model to tell when it is being tested.

I haven't read the last safety case yet, but may have some thoughts on it later. I am most excited about control at the moment, in part due to concerns that interpretability won't advance far enough to suffice for a safety case by the time we develop transformative AI.

Momentum of Light in Glass

Adam Kaufman1y46

I have such a strong intuitive opposition to the Internal Reaction Drive that I agree with your conclusion that we should update away from any theory which allows it. Then again, perhaps it is impossible to build such a drive for the merely practical reason that any material with a positive or negative index of refraction will absorb enough light to turn the drive into an expensive radiator.

Especially given the recent Nobel prize announcement, I think the most concerning piece of information is that there are cultural forces from within the physics community discouraging people from trying to answer the question at all.

What bootstraps intelligence?

Adam Kaufman1y30

You need abstractions to think and plan at all with limited compute, not just to speak. I would guess that plenty animals which are incapable of speaking also mentally rely on abstractions. For instance, when foraging for apples, I suspect an animal probably has a mental category for apples, and treats them as the same kind of thing rather than completely unrelated configurations of atoms.

What percent of the sun would a Dyson Sphere cover?

Answer by Adam KaufmanJul 03, 2024200

The planet Mercury is a pretty good source of material:

Mass: kg (which is about 70% iron)

Radius: $2.44 \times 10^{6}$ m

Volume: $6.08 \times 10^{19}$ m^3

Density: $5411$ kg/m^3

Orbital radius: $0.39 A U = 5.79 \times 10^{10}$ m

A spherical shell around the sun at roughly same radius as Mercury's orbit would have a surface area of $4.21 \times 10^{22}$ m^2, and spreading out Mercury's volume over this area gives a thickness of about 1.4 mm. This means Mercury alone provides ample material for collecting all of the Sun's energy via reflecting light – very thin spinning sheets could act as a swarm of orbiting reflectors that focus sunlight onto large power plants or mirrors that direct it to elsewhere in the solar system. Spinning sheets could be made somewhere between 1-100 μm thick, with thicker cables or supports for additional strength, perhaps 1-10 km wide, and navigate using radiation pressure (using cables that bend the sheet, perhaps). Something like $10^{15}$ or $10^{16}$ mirrors would be enough to intercept and redirect all of the sun's light.

The gravitational binding energy of Mercury is on the order of $10^{30}$ J, or on the order of an hour of the Sun's output. This means in theory the time it takes for a new mirror to pay it's own manufacturing energy cost is in principle quite small; if each kg of material from Mercury is enough to make on the order of 1-100 square meters of mirror, then it will pay for itself in somewhere between minutes and hours (there are roughly 10,000 w/m^2 of solar energy at Mercury's orbit, and each kg of material on average requires on the order of $10^{7}$ J to remove). Only 40-80 doublings are required to consume the whole planet depending on how thick the mirrors are and how much material is used to start the process. Even with many orders of magnitude of overhead to account for inefficiency and heat dissipation, I believe Mercury could be disassembled to cover the entire sun with reflectors on the order of years and perhaps as quickly as months; certainly within decades.

My hour of memoryless lucidity

Adam Kaufman2y53

A better way to do the memory overwrite experiment is to prepare a list of what’s in the box to match each of ten possible numbers, then have someone provide a random number while your short term memory doesn’t work and see if you can successfully overwrite the memory that corresponds to that number (as measured by correctly guessing the number much later).

My intellectual journey to (dis)solve the hard problem of consciousness

Adam Kaufman2y43

I’m confused. I know that it is like something to be me (this is in some sense the only thing I know for sure). It seems like there rules which shape the things I experience, and some of those rules can be studied (like the laws of physics). We are good enough at understanding some of these rules to predict certain systems with a high degree of accuracy, like how an asteroid will orbit a star or how electrons will be pushed through a wire by a particular voltage in a circuit. But I have no way to know or predict if it is like something to be a fish or GPT-4. I know that physical alterations to my brain seem to affect my experience, so it seems like there is a mapping from physical matter to experiences. I do not know precisely what this mapping is, and this indeed seems like a hard problem. In what sense do you disagree with my framing here?

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments