Quick Takes

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to t... (read more)

2habryka20m

Thank you for your work there. Curious what specifically prompted you to post this now, presumably you leaving OpenAI and wanting to communicate that somehow?

1William_S18m

No comment.

habryka14m20

Can you confirm or deny whether you signed any NDA related to you leaving OpenAI? (I will indeed assume a "no comment" or something to that degree implies a "yes" with reasonably high probability. Also, you might be interested in this link about the U.S. National Labor Relations Board ruling that NDAs offered in severance agreements that cover the existence of the NDA itself are unlawful.)

Viliam's Shortform

Viliam1h20

I suspect that in practice many people use the word "prioritize" to mean:

think short-term
only do legible things
remove slack

Buck's Shortform

Buck17hΩ30419

[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]

I’m interested in the following subset of risk from AI:

Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
Scheming: Risk associated with loss of control to AIs that arises from AIs scheming
- So e.g. I exclu

2Akash3h

I think it depends on how you're defining an "AI control success". If success is defined as "we have an early transformative system that does not instantly kill us– we are able to get some value out of it", then I agree that this seems relatively easy under the assumptions you articulated. If success is defined as "we have an early transformative system that does not instantly kill us and we have enough time, caution, and organizational adequacy to use that system in ways that get us out of an acute risk period", then this seems much harder. The classic race dynamic threat model seems relevant here: Suppose Lab A implements good control techniques on GPT-8, and then it's trying very hard to get good alignment techniques out of GPT-8 to align a successor GPT-9. However, Lab B was only ~2 months behind, so Lab A feels like it needs to figure all of this out within 2 months. Lab B– either because it's less cautious or because it feels like it needs to cut corners to catch up– either doesn't want to implement the control techniques or it's fine implementing the control techniques but it plans to be less cautious around when we're ready to scale up to GPT-9. I think it's fine to say "the control agenda is valuable even if it doesn't solve the whole problem, and yes other things will be needed to address race dynamics otherwise you will only be able to control GPT-8 for a small window of time before you are forced to scale up prematurely or hope that your competitor doesn't cause a catastrophe." But this has a different vibe than "AI control is quite easy", even if that statement is technically correct. (Also, please do point out if there's some way in which the control agenda "solves" or circumvents this threat model– apologies if you or Ryan have written/spoken about it somewhere that I missed.)

Buck3hΩ342

When I said "AI control is easy", I meant "AI control mitigates most risk arising from human-ish-level schemers directly causing catastrophes"; I wasn't trying to comment more generally. I agree with your concern.

4ryan_greenblatt15h

The claim is that most applications aren't internal usage of AI for AI development and thus can be made trivially safe. Not that most applications of AI for AI development can be made trivially safe.

Johannes C. Mayer's Shortform

Johannes C. Mayer4h11

Mathematical descriptions are powerful because they can be very terse. You can only specify the properties of a system and still get a well-defined system.

This is in contrast to writing algorithms and data structures where you need to get concrete implementations of the algorithms and data structures to get a full description.
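One way to make the contrast concrete, as a minimal sketch in Python: the names `satisfies_sort_spec` and `insertion_sort` are illustrative and not from the post. The property-style description of sorting only says what must hold of the output, while the algorithm has to commit to concrete steps.

```python
from collections import Counter


def satisfies_sort_spec(inp, out):
    """Property-style description: `out` is a sorted rearrangement of `inp`."""
    same_elements = Counter(inp) == Counter(out)
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    return same_elements and ordered


def insertion_sort(items):
    """One concrete implementation among many that meet the spec above."""
    result = []
    for item in items:
        # Walk left from the end until the insertion point is found.
        i = len(result)
        while i > 0 and result[i - 1] > item:
            i -= 1
        result.insert(i, item)
    return result


data = [3, 1, 2, 3, 0]
print(satisfies_sort_spec(data, insertion_sort(data)))  # True
print(satisfies_sort_spec(data, sorted(data)))  # True
print(satisfies_sort_spec(data, [0, 1, 2, 3]))  # False, one 3 is missing
```

The spec is two lines and is satisfied by any correct sorting routine; the implementation is longer and is only one of many that would pass.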

Dagon3h20

"Mathematical descriptions" is a little ambiguous. Equations and models are terse. The mapping of such equations to human-level system expectations (anticipated conditional experiences) can require quite a bit of verbosity.

I think that's what you're saying with the "algorithms and data structures" part, but I'm unsure if you're claiming that the property specification of the math is sufficient as a description, and comparable in fidelity to the algorithmic implementation.

Johannes C. Mayer's Shortform

Johannes C. Mayer4h10

The Model-View-Controller architecture is very powerful. It allows us to separate concerns.

For example, if we want to implement an algorithm, we can write down only the data structures and algorithms that are used.

We might want to visualize the steps that the algorithm is performing, but this can be separated from the actual running of the algorithm.

If the algorithm is interactive, then instead of putting the interaction logic in the algorithm, which could be thought of as the rules of the world, we instead implement functionality that directly changes the... (read more)
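A minimal sketch of the separation described above, in Python. The choice of bubble sort and the names `bubble_sort_steps`, `render`, and `run` are assumptions made for illustration; the interaction logic is left out because the post is truncated at that point.

```python
def bubble_sort_steps(items):
    """Model: the algorithm alone. Yields a snapshot after every swap."""
    data = list(items)
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                yield list(data)


def render(snapshot):
    """View: turns a snapshot into text; knows nothing about sorting."""
    return " ".join(str(x) for x in snapshot)


def run(items, show=print):
    """Controller: drives the model and hands each step to the view."""
    last = list(items)
    for snapshot in bubble_sort_steps(items):
        show(render(snapshot))
        last = snapshot
    return last


result = run([3, 1, 2])  # prints "1 3 2" then "1 2 3"
```

Swapping `render` for a graphical view, or passing a different `show`, requires no change to `bubble_sort_steps`.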

dkornai's Shortform

dkornai2d3-2

Pain is the consequence of a perceived reduction in the probability that an agent will achieve its goals.

In biological organisms, physical pain [say, in response to a limb being removed] is an evolutionary consequence of the fact that organisms with the capacity to feel physical pain avoided situations where their long-term goals [e.g. locomotion to a favourable position with the limb] which required the subsystem generating pain were harmed.

This definition applies equally to mental pain [say, the pain felt when being expelled from a group of allies] w... (read more)

1StartAtTheEnd22h

I think pain is a little bit different than that. It's the contrast between the current state and the goal state. This contrast motivates the agent to act, when the pain of contrast becomes bigger than the (predicted) pain of acting. As a human, you can decrease your pain by thinking that everything will be okay, or you can increase your pain by doubting the process. But it is unlikely that you will allow yourself to stop hurting, because your brain fears that a lack of suffering would result in a lack of progress (some wise people contest this, claiming that wu wei is correct). Another way you can increase your pain is by focusing more on the goal you want to achieve, sort of irritating/torturing yourself with the fact that the goal isn't achieved, to which your brain will respond by increasing the pain felt by the contrast, urging action. Do you see how this differs slightly from your definition? Chronic pain is not a continuous reduction in agency, but a continuous contrast between a bad state and a good state, which makes one feel pain which motivates them to solve it (exercise, surgery, resting, looking for painkillers, etc). This generalizes to other negative feelings, for instance to hunger, which exists with the purpose to be less pleasant than the search for food is, such that you seek food. I warn you that avoiding negative emotions can lead to stagnation, since suffering leads to growth (unless we start wireheading, and making the avoidance of pain our new goal, because then we might seek hedonic pleasures and intoxicants).

dkornai5h10

I would certainly agree with part of what you are saying. Especially the point that many important lessons are taught by pain [correct me if this is misinterpreting your comment]. Indeed, as a parent for example, if your goal is for your child to gain the capacity for self sufficiency, a certain amount of painful lessons that reflect the inherent properties of the world are necessary to achieve such a goal.

On the other hand, I do not agree with your framing of pain as being the main motivator [again, correct me if required]. In fact, a wide variety of syst... (read more)

1CstineSublime2d

How many organisms other than humans have "long term goals"? Doesn't that require a complex capacity for mental representation of possible future states? Am I wrong in assuming that the capacity to experience "pain" is independent of an explicit awareness of what possibilities have been shifted as a result of the new sensory data? (i.e. having a limb cleaved from the rest of the body, stubbing your toe in the dark). The organism may not even be aware of those possibilities, only 'aware' of pain. Note: I'm probably just having a fear of this sounding all too teleological and personifying evolution

Elizabeth's Shortform

Elizabeth13d112

A very rough draft of a plan to test prophylactics for airborne illnesses.

Start with a potential superspreader event. My ideal is a large conference, with many attendees who travelled to get there, in enclosed spaces with poor ventilation and air purification, in winter. Ideally >=4 days, so that people infected on day one are infectious while the conference is still running.

Call for sign-ups for testing ahead of time (disclosing all possible substances and side effects). Split volunteers into control and test group. I think you need ~500 sign ups in t... (read more)

5gwern13d

This sounds like a bad plan because it will be a logistics nightmare (undermining randomization) with high attrition, and extremely high variance due to between-subject design (where subjects differ a ton at baseline, in addition to exposure) on a single occasion with uncontrolled exposures and huge measurement error where only the most extreme infections get reported (sometimes). You'll probably get non-answers, if you finish at all. The most likely outcome is something goes wrong and the entire effort is wasted. Since this is a topic which is highly repeatable within-person (and indeed, usually repeats often through a lifetime...), this would make more sense as within-individual and using higher-quality measurements. One good QS approach would be to exploit the fact that infections, even asymptomatic ones, seem to affect heart rate etc as the body is damaged and begins fighting the infection. HR/HRV is now measurable off the shelf with things like the Apple Watch, AFAIK. So you could recruit a few tech-savvy conference-goers for measurements from a device they already own & wear. This avoids any 'big bang' and lets you prototype and tweak on a few people - possibly yourself? - before rolling it out, considerably de-risking it. There are some people who travel constantly for business and going to conferences, and recruiting and managing a few of them would probably be infinitely easier than 500+ randos (if for no reason other than being frequent flyers they may be quite eager for some prophylactics), and you would probably get far more precise data out of them if they agree to cooperate for a year or so and you get eg 10 conferences/trips out of each of them which you can contrast with their year-round baseline & exposome and measure asymptomatic infections or just overall health/stress. (Remember, variance reduction yields exponential gains in precision or sample-size reduction. It wouldn't be too hard for 5 or 10 people to beat a single 250vs250 one-off experi

Elizabeth20h20

All of the problems you list seem harder with repeated within-person trials.

D0TheMath's Shortform

Garrett Baker1d138

I don't really know what people mean when they try to compare "capabilities advancements" to "safety advancements". In one sense, it's pretty clear. The common units are "amount of time", so we should compare the marginal (probabilistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.

For example, if someone releases a new open source model people say that's a capabilities advance, and should not have been done. Yet I think there's a pretty good case that more well-trained open source models are better... (read more)

2the gears to ascension21h

People who have the ability to clarify in any meaningful way will not do so. You are in a biased environment where people who are most willing to publish, because they are most able to convince themselves their research is safe - eg, because they don't understand in detail how to reason about whether it is or not - are the ones who will do so. Ability to see far enough ahead would of course be expected to be rather rare, and most people who think they can tell the exact path ahead of time don't have the evidence to back their hunches, even if their hunches are correct, which unless they have a demonstrated track record they probably aren't. Therefore, whoever is making the most progress on real capabilities insights under the name of alignment will make their advancements and publish them, since they don't personally see how it's exfohaz. And it won't be apparent until afterwards that it was capabilities, not alignment. So just don't publish anything, and do your work in private. Email it to anthropic when you know how to create a yellow node. But for god's sake stop accidentally helping people create green nodes because you can't see five inches ahead. And don't send it to a capabilities team before it's able to guarantee moral alignment hard enough to make a red-proof yellow node!

Garrett Baker21h42

This seems contrary to how much of science works. I expect if people stopped talking publicly about what they're working on in alignment, we'd make much less progress, and capabilities would basically run business as usual.

The sort of reasoning you use here, and that my only response to it basically amounts to "well, no I think you're wrong. This proposal will slow down alignment too much" is why I think we need numbers to ground us.

6Garrett Baker1d

Yeah, there are reasons for caution. I think it makes sense for those concerned or non-concerned to make numerical forecasts about the costs & benefits of such questions, rather than the current state of everyone just comparing their vibes against each other. This generalizes to other questions, like the benefits of interpretability, advances in safety fine-tuning, deep learning science, and agent foundations. Obviously such numbers aren't the end-of-the-line, and like in biorisk, sometimes they themselves should be kept secret. But it still seems a great advance. If anyone would like to collaborate on such a project, my DMs are open (not to say this topic is covered, this isn't exactly my main wheelhouse).

Mati_Roy's Shortform

Mati_Roy6d171

it seems to me that disentangling beliefs and values is an important part of being able to understand each other

and using words like "disagree" to mean both "different beliefs" and "different values" is really confusing in that regard

4Viliam6d

Let's use "disagree" vs "dislike".

Mati_Roy21h20

when potentially ambiguous, I generally just say something like "I have a different model" or "I have different values"

TurnTrout's shortform feed

TurnTrout3dΩ24555

A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards." In particular, here is a candidate definition of a shard theoretic policy:

A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wantin... (read more)

Thomas Kwa1dΩ121

I'm not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:

A shardful agent can be incoherent due to valuing different things from different states
A shardful agent can be incoherent due to its shards being shallow, caring about actions or proximal effects rather t

... (read more)

15Daniel Kokotajlo2d

I think this is also what I was confused about -- TurnTrout says that AIXI is not a shard-theoretic agent because it just has one utility function, but typically we imagine that the utility function itself decomposes into parts e.g. +10 utility for ice cream, +5 for cookies, etc. So the difference must not be about the decomposition into parts, but the possibility of independent activation? but what does that mean? Perhaps it means: The shards aren't always applied, but rather only in some circumstances does the circuitry fire at all, and there are circumstances in which shard A fires without B and vice versa. (Whereas the utility function always adds up cookies and ice cream, even if there are no cookies and ice cream around?) I still feel like I don't understand this.

1samshap2d

Instead of demanding orthogonal representations, just have them obey the restricted isometry property. Basically, instead of requiring $\forall i \neq j: \langle x_i, x_j \rangle = 0$, we just require $\forall i \neq j: x_i \cdot x_j \leq \epsilon$. This would allow a polynomial number of sparse shards while still allowing full recovery.
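To illustrate the relaxed condition, here is a minimal sketch in Python using only the standard library. It checks the pairwise bound on inner products (taking absolute values, which is an assumption beyond what the comment writes) for random unit vectors; it does not verify the restricted isometry property itself. The dimension, vector count, seed, and `eps` are arbitrary choices for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a random direction by normalising Gaussian coordinates."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_inner_product(vectors):
    """Largest |<x_i, x_j>| over all pairs with i != j."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


rng = random.Random(0)
# More vectors than dimensions, so they cannot all be exactly orthogonal.
dim, count = 200, 220
vectors = [random_unit_vector(dim, rng) for _ in range(count)]
eps = 0.5
worst = max_abs_inner_product(vectors)
print(worst <= eps)
```

The point of the sketch is only that the $\epsilon$-relaxed condition can hold for more vectors than the space has dimensions, which exact orthogonality rules out.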

quetzal_rainbow's Shortform

quetzal_rainbow1d132

@jessicata once wrote "Everyone wants to be a physicalist but no one wants to define physics". I decided to check the SEP article on physicalism and found that, yep, it doesn't have a definition of physics:

Carl Hempel (cf. Hempel 1969, see also Crane and Mellor 1990) provided a classic formulation of this problem: if physicalism is defined via reference to contemporary physics, then it is false — after all, who thinks that contemporary physics is complete? — but if physicalism is defined via reference to a future or ideal physics, then it is trivial — after all,

... (read more)

tlevin's Shortform

tlevin3d640

I think some of the AI safety policy community has over-indexed on the visual model of the "Overton Window" and under-indexed on alternatives like the "ratchet effect," "poisoning the well," "clown attacks," and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable).

I'm not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more... (read more)

MP1d0-2

I'm not a decel, but the way this stuff often is resolved is that there are crazy people that aren't taken seriously by the managerial class but that are very loud and make obnoxious asks. Think the evangelicals against abortion or the Columbia protestors.

Then there is some elite, part of the managerial class, that makes reasonable policy claims. For abortion, this is Mitch McConnell, being disciplined over a long period of time in choosing the correct judges. For Palestine, this is Blinken and his State Department bureaucracy.

The problem with d... (read more)

5tlevin2d

Quick reactions: 1. Re: how over-emphasis on "how radical is my ask" vs "what my target audience might find helpful" and generally the importance of making your case well regardless of how radical it is, that makes sense. Though notably the more radical your proposal is (or more unfamiliar your threat models are), the higher the bar for explaining it well, so these do seem related. 2. Re: more effective actors looking for small wins, I agree that it's not clear, but yeah, seems like we are likely to get into some reference class tennis here. "A lot of successful organizations that take hard-line positions and (presumably) get a lot of their power/influence from the ideological purity that they possess & communicate"? Maybe, but I think of like, the agriculture lobby, who just sort of quietly make friends with everybody and keep getting 11-figure subsidies every year, in a way that (I think) resulted more from gradual ratcheting than making a huge ask. "Pretty much no group– whether radical or centrist– has had tangible wins" seems wrong in light of the EU AI Act (where I think both a "radical" FLI and a bunch of non-radical orgs were probably important) and the US executive order (I'm not sure which strategy is best credited there, but I think most people would have counted the policies contained within it as "minor asks" relative to licensing, pausing, etc). But yeah I agree that there are groups along the whole spectrum that probably deserve credit. 3. Re: poisoning the well, again, radical-ness and being dumb/uninformed are of course separable but the bar rises the more radical you get, in part because more radical policy asks strongly correlate with more complicated procedural asks; tweaking ECRA is both non-radical and procedurally simple, creating a new agency to license training runs is both outside the DC Overton Window and very procedurally complicated. 4. Re: incentives, I agree that this is a good thing to track, but like, "people who oppose X are in

1Noosphere891d

It's not just that problem though, they will likely be biased to think that their policy is helpful for safety of AI at all, and this is a point that sometimes gets forgotten. But correct on the fact that Akash's argument is fully general.

ChristianKl's Shortform

ChristianKl2d80

The FCC just fined US phone carriers for sharing the location data of US customers with anyone willing to buy it. The fines don't seem to be high enough to deter this kind of behavior.

That likely includes either directly or indirectly the Chinese government.

What does the US Congress do to protect against spying by China? Of course, banning TikTok instead of actually protecting the data of US citizens.

If you have threat models that the Chinese government might target you, assume that they know where your phone is and shut it off when going somewhere you... (read more)

2Dagon2d

[note: I suspect we mostly agree on the impropriety of open selling and dissemination of this data. This is a narrow objection to the IMO hyperbolic focus on government assault risks. ] I'm unhappy with the phrasing of "targeted by the Chinese government", which IMO implies violence or other real-world interventions when the major threats are "adversary use of AI-enabled capabilities in disinformation and influence operations." Thanks for mentioning blackmail - that IS a risk I put in the first category, and presumably becomes more possible with phone location data. I don't know how much it matters, but there is probably a margin where it does. I don't disagree that this purchasable data makes advertising much more effective (in fact, I worked at a company based on this for some time). I only mean to say that "targeting" in the sense of disinformation campaigns is a very different level of threat from "targeting" of individuals for government ops.

ChristianKl1d20

This is a narrow objection to the IMO hyperbolic focus on government assault risks.

Whether or not you face government assault risks depends on what you do. Most people don't face government assault risks. Some people engage in work or activism that results in them having government assault risks.

The Chinese government has strategic goals and most people are unimportant to those. Some people however work on topics like AI policy in which the Chinese government has an interest.

2ChristianKl2d

Politico wrote, "Perhaps the most pressing concern is around the Chinese government’s potential access to troves of data from TikTok’s millions of users." The concern that TikTok supposedly is spyware is frequently made in discussions about why it should be banned. If the main issue is content moderation decisions, the best way to deal with it would be to legislate transparency around content moderation decisions and require TikTok to outsource the moderation decisions to some US contractor.

jacquesthibs's Shortform

jacquesthibs1d42

From a Paul Christiano talk called "How Misalignment Could Lead to Takeover" (from February 2023):

Assume we're in a world where AI systems are broadly deployed, and the world has become increasingly complex, where humans know less and less about how things work.

A viable strategy for AI takeover is to wait until there is certainty of success. If a 'bad AI' is smart, it will realize it won't be successful if it tries to take over, not a problem.

So you lose when a takeover becomes possible, and some threshold of AIs behave badly. If all the smartest AIs... (read more)

Yoav Ravid's Shortform

Yoav Ravid1d40

Looking for blog platform/framework recommendations

I had a WordPress blog, but I don't like WordPress and I want to move away from it.

Substack doesn't seem like a good option because I want high customizability and multilingual support (my blog is going to be in English and Hebrew).

I would like something that I can use for free with my own domain (so not Wix).

The closest thing I found to what I'm looking for was MkDocs Material, but it's still geared too much towards documentation, and I don't like its blog functionality enough.

Other requirements: Da... (read more)

Tamsin Leake's Shortform

Tamsin Leake6d4723

decision theory is no substitute for utility function

some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner's dilemma, end up believing the following:

my utility function is about what i want for just me; but i'm altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision theoretic cooperation is the true name of altruism.

it's possible that this is true for some people, but in general i expect that to be a mistaken anal... (read more)

1Pi Rogers2d

What about the following: My utility function is pretty much just my own happiness (in a fun-theoretic rather than purely hedonistic sense). However, my decision theory is updateless with respect to which sentient being I ended up as, so once you factor that in, I'm a multiverse-wide realityfluid-weighted average utilitarian. I'm not sure how correct this is, but it's possible.

Tamsin Leake1d20

It certainly is possible! In more decision-theoretic terms, I'd describe this as "it sure would suck if agents in my reference class just optimized for their own happiness; it seems like the instrumental thing for agents in my reference class to do is maximize for everyone's happiness". Which is probly correct!

But as per my post, I'd describe this position as "not intrinsically altruistic": you're optimizing for everyone's happiness because "it sure would suck if agents in my reference class didn't do that", not because you intrinsically value that everyone be happy, regardless of reasoning about agents and reference classes and veils of ignorance.

2Viliam4d

I don't have an explicit theory of how this works; for example, I would consider "pleasing others" in an experience machine meaningless, but "eating a cake" in an experience machine seems just as okay as in real life (maybe even preferable, considering that cakes are unhealthy). A fake memory of "having eaten a cake" would be a bad thing; "making people happier by talking to them" in an experience machine would be intrinsically meaningless, but it might help me improve my actual social skills, which would be valuable. Sometimes I care about the referent being real (the people I would please), sometimes I don't (the cake I would eat). But it's not the people/cake distinction per se; for example in case of using fake simulated people to practice social skills, the emphasis is on the skills being real; I would be disappointed if the experience machine merely gave me a fake "feeling of having improved my skills". I imagine that for a psychopath everything and everyone is instrumental, so there would be no downside to the experience machine (except for the risk of someone turning it off). But this is just a guess. I suspect that analyzing "the true preferences" is tricky, because ultimately we are built of atoms, and atoms have no preferences. So the question is whether by focusing on some aspect of the human mind we got better insight to its true nature, or whether we have just eliminated the context that was necessary for it to make sense.

David Gross's Shortform

David Gross3d52

In my fantasies, if I ever were to get that god-like glimpse at how everything actually is, with all that is currently hidden unveiled, it would be something like the feeling you have when you get a joke, or see a "magic eye" illustration, or understand an illusionist's trick, or learn to juggle: what was formerly perplexing and incoherent becomes in a snap simple and integrated, and there's a relieving feeling of "ah, but of course."

But it lately occurs to me that the things I have wrong about the world are probably things I've grasped at exactly because ... (read more)

David Gross2d20

And then today I read this: “We yearn for the transcendent, for God, for something divine and good and pure, but in picturing the transcendent we transform it into idols which we then realize to be contingent particulars, just things among others here below. If we destroy these idols in order to reach something untainted and pure, what we really need, the thing itself, we render the Divine ineffable, and as such in peril of being judged non-existent. Then the sense of the Divine vanishes in the attempt to preserve it.” (Iris Murdoch, Metaphysics as a Guide to Morals)

1metachirality3d

I like to phrase it as "the path to simplicity involves a lot of detours." Yes, Newtonian mechanics doesn't account for the orbit of Mercury but it turned out there was an even simpler, more parsimonious theory, general relativity, waiting for us.

localdeity's Shortform

localdeity2d53

Pithy sayings are lossily compressed.

yanni's Shortform

yanni kyriacos2d10

Something someone technical and interested in forecasting should look into: can LLMs reliably convert people's claims into a % of confidence through sentiment analysis? This would be useful for forecasters, I believe (and for rationality in general).

yanni's Shortform

yanni kyriacos3d80

There have been multiple occasions where I've copied and pasted email threads into an LLM and asked it things like:

What is person X saying?
What are the cruxes in this conversation?
Summarise this conversation.
What are the key takeaways?
What views are being missed from this conversation?

I really want an email plugin that basically brute forces rationality INTO email conversations.
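As a rough sketch of the prompt-building half of such a plugin, in Python: the names `QUESTIONS` and `build_prompts` and the separator format are hypothetical, and the step of actually sending each prompt to an LLM is deliberately left out since it depends on whichever service is used.

```python
# The questions are the ones listed in the post; everything else is illustrative.
QUESTIONS = [
    "What is person X saying?",
    "What are the cruxes in this conversation?",
    "Summarise this conversation.",
    "What are the key takeaways?",
    "What views are being missed from this conversation?",
]


def build_prompts(thread_text, questions=QUESTIONS):
    """Pair each question with the pasted email thread, one prompt per question."""
    body = thread_text.strip()
    return [f"{question}\n\n---\n{body}" for question in questions]


prompts = build_prompts("Alice: Can we ship Friday?\nBob: Only if QA signs off.")
print(len(prompts))  # 5
print(prompts[1].splitlines()[0])  # What are the cruxes in this conversation?
```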

1yanni kyriacos2d

Hi Johannes! Thanks for the suggestion :) I'm not sure I'd want it in the middle of a video call, but maybe in a forum context like this it could be cool?

1Johannes C. Mayer2d

Seems pretty good to me to have this in a video call. The main reason why I don't immediately try this out is that I would need to write a program to do this.

yanni kyriacos2d10

That seems fair enough!