Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for quick takes by William_S. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
39 comments, sorted by Click to highlight new comments since:
[-]William_SΩ731709

I worked at OpenAI for three years, from 2021-2024 on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language model to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool.
I resigned from OpenAI on February 15, 2024.

Reply161032
[-]habrykaΩ15284

Thank you for your work there. Curious what specifically prompted you to post this now, presumably you leaving OpenAI and wanting to communicate that somehow?

[-]William_SΩ16430

No comment.

[-]habrykaΩ20467

Can you confirm or deny whether you signed any NDA related to you leaving OpenAI? 

(I would guess a "no comment" or lack of response or something to that degree implies a "yes" with reasonably high probability. Also, you might be interested in this link about the U.S. labor board deciding that NDA's offered during severance agreements that cover the existence of the NDA itself have been ruled unlawful by the National Labor Relations Board when deciding how to respond here)

[-]gwernΩ306862

I think it is safe to infer from the conspicuous and repeated silence by ex-OA employees when asked whether they signed a NDA which also included a gag order about the NDA, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so probably even less protected than usual).

EDIT: Kelsey Piper has confirmed that there is an OA NDA with a gag order, and violation forfeits all equity - including fully vested equity. This implies that since you would assume Ilya Sutskever would have received many PPUs & would be holding them as much as possible, Sutskever might have had literally billions of dollars at stake based on how he quit and what he then, say, tweeted... (PPUs which can only be sold in the annual OA-controlled tender offer.)

It turns out there’s a very clear reason for that. I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it. If a departing employee declines to sign the document, or if they violate it, they can lose all vested equity they earned during their time at the company, which is likely worth millions of dollars....While nondisclosure agreements aren’t unusual in highly competitive Silicon Valley, putting an employee’s already-vested equity at risk for declining or violating one is. For workers at startups like OpenAI, equity is a vital form of compensation, one that can dwarf the salary they make. Threatening that potentially life-changing money is a very effective way to keep former employees quiet. (OpenAI did not respond to a request for comment.)

Does anyone know if it's typically the case that people under gag orders about their NDAs can talk to other people who they know signed the same NDAs? That is, if a bunch of people quit a company and all have signed self-silencing NDAs, are they normally allowed to talk to each other about why they quit and commiserate about the costs of their silence?

They would not know if others have signed the SAME NDAs without trading information about their own NDAs, which is forbidden.

[-]O OΩ6140

Daniel K seems pretty open about his opinions and reasons for leaving. Did he not sign an NDA and thus gave up whatever PPUs he had?

[-]LawrenceCΩ7183

When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn't.

By "gag order" do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?

I have trouble understanding the absolute silence we seem to be having. There seem to be very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA's press in the public sphere.

Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn't add up to me if the risk is just some chance of OA trying to then sue you to bankruptcy, especially since I think a lot of us would offer support in that case, and the media wouldn't paint OA in a good light for it.

I am confused. (And I grateful to William for at least saying this much, given the climate!)

[-]isabel2518

I would guess that there isn’t a clear smoking gun that people aren’t sharing because of NDAs, just a lot of more subtle problems that add up to leaving (and in some cases saying OpenAI isn’t being responsible etc).

This is consistent with the observation of the board firing Sam but not having a clear crossed line to point at for why they did it.

It’s usually easier to notice when the incentives are pointing somewhere bad than to explain what’s wrong with them, and it’s easier to notice when someone is being a bad actor than it is to articulate what they did wrong. (Both of these run a higher risk of false positives relative to more crisply articulatable problems.)

The lack of leaks could just mean that there's nothing interesting to leak. Maybe William and others left OpenAI over run-of-the-mill office politics and there's nothing exceptional going on related to AI.

[-]gwern285

Rest assured, there is plenty that could leak at OA... (And might were there not NDAs, which of course is much of the point of having them.)

For a past example, note that no one knew that Sam Altman had been fired from YC CEO for similar reasons as OA CEO, until the extreme aggravating factor of the OA coup, 5 years later. That was certainly more than 'run of the mill office politics', I'm sure you'll agree, but if that could be kept secret, surely lesser things now could be kept secret well past 2029?

At least one of them has explicitly indicated they left because of AI safety concerns, and this thread seems to be insinuating some concern - Ilya Sutskever's conspicuous silence has become a meme, and Altman recently expressed that he is uncertain of Ilya's employment status. There still hasn't been any explanation for the boardroom drama last year.

If it was indeed run-of-the-mill office politics and all was well, then something to the effect of "our departures were unrelated, don't be so anxious about the world ending, we didn't see anything alarming at OpenAI" would obviously help a lot of people and also be a huge vote of confidence for OpenAI.

It seems more likely that there is some (vague?) concern but it's been overridden by tremendous legal/financial/peer motivations.

What's PPU?

From here:

Profit Participation Units (PPUs) represent a unique compensation method, distinct from traditional equity-based rewards. Unlike shares, stock options, or profit interests, PPUs don't confer ownership of the company; instead, they offer a contractual right to participate in the company's future profits.

[-]Linch340

(not a lawyer) 

My layman's understanding is that managerial employees are excluded from that ruling, unfortunately. Which I think applies to William_S if I read his comment correctly. (See Pg 11, in the "Excluded" section in the linked pdf in your link)

I am a lawyer. 

I think one key point that is missing is this: regardless of whether the NDA and the subsequent gag order is legitimate or not; William would still have to spend thousands of dollars on a court case to rescue his rights. This sort of strong-arm litigation has become very common in the modern era. It's also just... very stressful. If you've just resigned from a company you probably used to love, you likely don't want to fish all of your old friends, bosses and colleagues into a court case.

Edit: also, if William left for reasons involving AGI safety - maybe entering into (what would likely be a very public) court case would be counteractive to their reason for leaving? You probably don't want to alarm the public by flavouring existential threats in legal jargon.  American judges have the annoying tendency to valorise themselves as celebrities when confronting AI (see Musk v Open AI).

Are you familiar with USA NDA's? I'm sure there are lots of clauses that have been ruled invalid by case law? In many cases, non-lawyers have no ideas about these, so you might be able to make a difference with very little effort. There is also the possibility that valuable OpenAI shares could be rescued?

If you haven't seen it, check out this thread where one of the OpenAI leavers did not sigh the gag order.

I have reviewed his post. Two (2) things to note: 

(1) Invalidity of the NDA does not guarantee William will be compensated after the trial. Even if he is, his job prospects may be hurt long-term. 

(2) State's have different laws on whether the NLRA trumps internal company memorandums. More importantly, labour disputes are traditionally solved through internal bargaining. Presumably, the collective bargaining 'hand-off' involving NDA's and gag-orders at this level will waive subsequent litigation in district courts. The precedent Habryka offered refers to hostile severance agreements only, not the waiving of the dispute mechanism itself. 

I honestly wish I could use this dialogue as a discrete communication to William on a way out, assuming he needs help, but I re-affirm my previous worries on the costs. 

I also add here, rather cautiously, that there are solutions. However, it would depend on whether William was an independent contractor, how long he worked there, whether it actually involved a trade secret (as others have mentioned) and so on. The whole reason NDA's tend to be so effective is because they obfuscate the material needed to even know or be aware of what remedies are available.
 

Interesting! For most of us, this is outside our area of competence, so appreciate your input.

I can see some arguments in your direction but would tentatively guess the opposite. 

[-]tlevin258

Kelsey Piper now reports: "I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it."

What are your timelines like? How long do YOU think we have left?

I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self's creation. However, none of them talk about each other, and presumably at most one of them can be meaningfully right?

One AGI CEO hasn't gone THAT crazy (yet), but is quite sure that the November 2024 election will be meaningless because pivotal acts will have already occurred that make nation state elections visibly pointless.

Also I know many normies who can't really think probabilistically and mostly aren't worried at all about any of this... but one normy who can calculate is pretty sure that we have AT LEAST 12 years (possibly because his retirement plans won't be finalized until then). He also thinks that even systems as "mere" as TikTok will be banned before the November 2024 election because "elites aren't stupid".

I think I'm more likely to be better calibrated than any of these opinions, because most of them don't seem to focus very much on "hedging" or "thoughtful doubting", whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).

Wondering why this has so many disagreement votes. Perhaps people don't like to see the serious topic of "how much time do we have left", alongside evidence that there's a population of AI entrepreneurs who are so far removed from consensus reality, that they now think they're living in a simulation. 

(edit: The disagreement for @JenniferRM's comment was at something like -7. Two days later, it's at -2)

For most of my comments, I'd almost be offended if I didn't say something surprising enough to get a "high interestingness, low agreement" voting response. Excluding speech acts, why even say things if your interlocutor or full audience can predict what you'll say?

And I usually don't offer full clean proofs in direct word. Anyone still pondering the text at the end, properly, shouldn't "vote to agree", right? So from my perspective... its fine and sorta even working as intended <3

However, also, this is currently the top-voted response to me, and if William_S himself reads it I hope he answers here, if not with text then (hopefully? even better?) with a link to a response elsewhere?

((EDIT: Re-reading everything above his, point, I notice that I totally left out the "basic take" that might go roughly like "Kurzweil, Altman, and Zuckerberg are right about compute hardware (not software or philosophy) being central, and there's a compute bottleneck rather than a compute overhang, so the speed of history will KEEP being about datacenter budgets and chip designs, and those happen on 6-to-18-month OODA loops that could actually fluctuate based on economic decisions, and therefore its maybe 2026, or 2028, or 2030, or even 2032 before things pop, depending on how and when billionaires and governments decide to spend money".))

Pulling honest posteriors from people who've "seen things we wouldn't believe" gives excellent material for trying to perform aumancy... work backwards from their posteriors to possible observations, and then forwards again, toward what might actually be true :-)

It could just be because it reaches a strong conclusion on anecdotal/clustered evidence (e.g. it might say more about her friend group than anything else). Along with claims to being better calibrated for weak reasons - which could be true, but seems not very epistemically humble.

Full disclosure I downvoted karma, because I don't think it should be top reply, but I did not agree or disagree.

But Jen seems cool, I like weird takes, and downvotes are not a big deal - just a part of a healthy contentious discussion.

However, none of them talk about each other, and presumably at most one of them can be meaningfully right?

Why at most one of them can be meaningfully right?

Would not a simulation typically be "a multi-player game"?

(But yes, if they assume that their "original self" was the sole creator (?), then they would all be some kind of "clones" of that particular "original self". Which would surely increase the overall weirdness.)

These are valid concerns! I presume that if "in the real timeline" there was a consortium of AGI CEOs who agreed to share costs on one run, and fiddled with their self-inserts, then they... would have coordinated more? (Or maybe they're trying to settle a bet on how the Singularity might counterfactually might have happened in the event of this or that person experiencing this or that coincidence? But in that case I don't think the self inserts would be allowed to say they're self inserts.)

Like why not re-roll the PRNG, to censor out the counterfactually simulable timelines that included me hearing from any of the REAL "self inserts of the consortium of AGI CEOS" (and so I only hear from "metaphysically spurious" CEOs)??

Or maybe the game engine itself would have contacted me somehow to ask me to "stop sticking causal quines in their simulation" and somehow I would have been induced by such contact to not publish this?

Mostly I presume AGAINST "coordinated AGI CEO stuff in the real timeline" along any of these lines because, as a type, they often "don't play well with others". Fucking oligarchs... maaaaaan.

It seems like a pretty normal thing, to me, for a person to naturally keep track of simulation concerns as a philosophic possibility (its kinda basic "high school theology" right?)... which might become one's "one track reality narrative" as a sort of "stress induced psychotic break away from a properly metaphysically agnostic mental posture"?

That's my current working psychological hypothesis, basically.

But to the degree that it happens more and more, I can't entirely shake the feeling that my probability distribution over "the time T of a pivotal acts occurring" (distinct from when I anticipate I'll learn that it happened which of course must be LATER than both T and later than now) shouldn't just include times in the past, but should actually be a distribution over complex numbers or something...

...but I don't even know how to do that math? At best I can sorta see how to fit it into exotic grammars where it "can have happened counterfactually" or so that it "will have counterfactually happened in a way that caused this factually possible recurrence" or whatever. Fucking "plausible SUBJECTIVE time travel", fucking shit up. It is so annoying.

Like... maybe every damn crazy AGI CEO's claims are all true except the ones that are mathematically false?

How the hell should I know? I haven't seen any not-plausibly-deniable miracles yet. (And all of the miracle reports I've heard were things I was pretty sure the Amazing Randi could have duplicated.)

All of this is to say, Hume hasn't fully betrayed me yet!

Mostly I'll hold off on performing normal updates until I see for myself, and hold off on performing logical updates until (again!) I see a valid proof for myself <3

I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self's creation


Do you know if the origin of this idea for them was a psychedelic or dissociative trip? I'd give it at least even odds, with most of the remaining chances being meditation or Eastern religions...

Wait, you know smart people who have NOT, at some point in their life: (1) taken a psychedelic NOR (2) meditated, NOR (3) thought about any of buddhism, jainism, hinduism, taoism, confucianisn, etc???

To be clear to naive readers: psychedelics are, in fact, non-trivially dangerous.

I personally worry I already have "an arguably-unfair and a probably-too-high share" of "shaman genes" and I don't feel I need exogenous sources of weirdness at this point.

But in the SF bay area (and places on the internet memetically downstream from IRL communities there) a lot of that is going around, memetically (in stories about) and perhaps mimetically (via monkey see, monkey do).

The first time you use a serious one you're likely getting a permanent modification to your personality (+0.5 stddev to your Openness?) and arguably/sorta each time you do a new one, or do a higher dose, or whatever, you've committed "1% of a personality suicide" by disrupting some of your most neurologically complex commitments.

To a first approximation my advice is simply "don't do it".

HOWEVER: this latter consideration actually suggests: anyone seriously and truly considering suicide should perhaps take a low dose psychedelic FIRST (with at least two loving tripsitters and due care) since it is also maybe/sorta "suicide" but it leaves a body behind that most people will think is still the same person and so they won't cry very much and so on?

To calibrate this perspective a bit, I also expect that even if cryonics works, it will also cause an unusually large amount of personality shift. A tolerable amount. An amount that leaves behind a personality that similar-enough-to-the-current-one-to-not-have-triggered-a-ship-of-theseus-violation-in-one-modification-cycle. Much more than a stressful day and then bad nightmares and a feeling of regret the next day, but weirder. With cryonics, you might wake up to some effects that are roughly equivalent to "having taken a potion of youthful rejuvenation, and not having the same birthmarks, and also learning that you're separated-by-disjoint-subjective-deaths from LOTS of people you loved when you experienced your first natural death" for example.This is a MUCH BIGGER CHANGE than just having a nightmare and a waking up with a change of heart (and most people don't have nightmares and changes of heart every night (at least: I don't and neither do most people I've asked)).

Remember, every improvement is a change, though not every change is an improvement. A good "epistemological practice" is sort of a idealized formal praxis for making yourself robust to "learning any true fact" and changing only in GOOD ways from such facts.

A good "axiological practice" (which I don't know of anyone working on except me (and I'm only doing it a tiny bit, not with my full mental budget)) is sort of an idealized formal praxis for making yourself robust to "humanely heartful emotional changes"(?) and changing only in <PROPERTY-NAME-TBD> ways from such events.

(Edited to add: Current best candidate name for this property is: "WISE" but maybe "healthy" works? (It depends on whether the Stoics or Nietzsche were "more objectively correct" maybe? The Stoics, after all, were erased and replaced by Platonism-For-The-Masses (AKA "Christianity") so if you think that "staying implemented in physics forever" is critically important then maybe "GRACEFUL" is the right word? (If someone says "vibe-alicious" or "flowful" or "active" or "strong" or "proud" (focusing on low latency unity achieved via subordination to simply and only power) then they are probably downstream of Heidegger and you should always be ready for them to change sides and submit to metaphorical Nazis, just as Heidegger subordinated himself to actual Nazis without really violating his philosophy at all.)))

I don't think that psychedelics fits neatly into EITHER category. Drugs in general are akin to wireheading, except wireheading is when something reaches into your brain to overload one or more of your positive-value-tracking-modules, (as a trivially semantically invalid shortcut to achieving positive value "out there" in the state-of-affairs that your tracking modules are trying to track) but actual humans have LOTS of <thing>-tracking-modules and culture and science barely have any RIGOROUS vocabulary for any them.

Note that many of these neurological <thing>-tracking-modules were evolved.

Also, many of them will probably be "like hands" in terms of AI's ability to model them.

This is part of why AI's should be existentially terrifying to anyone who is spiritually adept.

AI that sees the full set of causal paths to modifying human minds will be "like psychedelic drugs with coherent persistent agendas". Humans have basically zero cognitive security systems. Almost all security systems are culturally mediated, and then (absent complex interventions) lots of the brain stuff freezes in place around the age of puberty, and then other stuff freezes around 25, and so on. This is why we protect children from even TALKING to untrusted adults: they are too plastic and not savvy enough. (A good heuristic for the lowest level of "infohazard" is "anything you wouldn't talk about in front of a six year old".)

Humans are sorta like a bunch of unpatchable computers, exposing "ports" to the "internet", where each of our port numbers is simply a lightly salted semantic hash of an address into some random memory location that stores everything, including our operating system.

Your word for "drugs" and my word for "drugs" don't point to the same memory addresses in the computer's implementing our souls. Also our souls themselves don't even have the same nearby set of "documents" (because we just have different memories n'stuff)... but the word "drugs" is not just one of the ports... it is a port that deserves a LOT of security hardening.

The bible said ~"thou shalt not suffer a 'pharmakeia' to live" for REASONS.

[-]O O2-3

I assume timelines are fairly long or this isn’t safety related. I don’t see a point in keeping PPUs or even caring about NDA lawsuits which may or may not happen and would take years in a short timeline or doomed world.

I think having a probability distribution over timelines is the correct approach. Like, in the comment above:

I think I'm more likely to be better calibrated than any of these opinions, because most of them don't seem to focus very much on "hedging" or "thoughtful doubting", whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).

[-]O O1210

Even in probabilistic terms, the evidence of OpenAI members respecting their NDAs makes it more likely that this was some sort of political infighting (EA related) than sub-year takeoff timelines. I would be open to a 1 year takeoff, I just don't see it happening given the evidence. OpenAI wouldn't need to talk about raising trillions of dollars, companies wouldn't be trying to commoditize their products, and the employees who quit OpenAI would speak up. 

Political infighting is in general just more likely than very short timelines, which would go in counter of most prediction markets on the matter. Not to mention, given it's already happened with the firing of Sam Altman, it's far more likely to have happened again.

If there was a probability distribution of timelines, the current events indicate sub 3 year ones have negligible odds. If I am wrong about this, I implore the OpenAI employees to speak up. I don't think normies misunderstand probability distributions, they just usually tend not to care about unlikely events.

No, OpenAI (assuming that it is a well-defined entity) also uses a probability distribution over timelines.

(In reality, every member of its leadership has their own probability distribution, and this translates to OpenAI having a policy and behavior formulated approximately as if there is some resulting single probability distribution).

The important thing is, they are uncertain about timelines themselves (in part, because no one knows how perplexity translates to capabilities, in part, because there might be difference with respect to capabilities even with the same perplexity, if the underlying architectures are different (e.g. in-context learning might depend on architecture even with fixed perplexity, and we do see a stream of potentially very interesting architectural innovations recently), in part, because it's not clear how big is the potential of "harness"/"scaffolding", and so on).

This does not mean there is no political infighting. But it's on the background of them being correctly uncertain about true timelines...


Compute-wise, inference demands are huge and growing with popularity of the models (look how much Facebook did to make LLama 3 more inference-efficient).

So if they expect models to become useful enough for almost everyone to want to use them, they should worry about compute, assuming they do want to serve people like they say they do (I am not sure how this looks for very strong AI systems; they will probably be gradually expanding access, and the speed of expansion might depend).

better calibrated than any of these opinions, because most of them don't seem to focus very much on "hedging" or "thoughtful doubting"

new observations > new thoughts when it comes to calibrating yourself.

The best calibrated people are people who get lots of interaction with the real world, not those who think a lot or have a complicated inner model. Tetlock's super forecasters were gamblers and weathermen.

I'd have more confidence in Anthropic's governance if the board or LTBT had some fulltime independent members who weren't employees. IMO labs should consider paying a fulltime salary but no equity to board members, through some kind of mechanism where the money is still there and paid for X period of time in the future, even if the lab dissolved, so no incentive to avoid actions that would cost the lab. Board salaries could maybe be pegged to some level of technical employee salary, so that technical experts could take on board roles. Boards full of busy people really can't do their job of checking whether the organization is fullfilling its stated mission, and IMO this is one of the most important jobs in the world right now. Also, fulltime board members would have fewer conflicts of interest outside of the lab (since they won't be in some other fulltime job that might conflict).

From discussion with Logan Riggs (Eleuther) who worked on the tuned lens: the tuned lens suggests that the residual stream at different layers go through some linear transformations and so aren’t directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view 2) calculating virtual weights between neurons in different layers.

However, we could try correcting these using the transformations learned by the tuned lens to translate between the residual stream at different layers, and maybe this would make these methods more effective. By default I think the tuned lens learns only the transformation needed to predict the output token but the method could be adapted to retrodict the input token from each layer as well, we’d need both. Code for tuned lens is at https://github.com/alignmentresearch/tuned-lens

Here's something I've been pondering.

hypothesis: If transformers has internal concepts, and they are represented in the residual stream. Then because we have access to 100% of the information then it should be possible for a non-linear probe to get 100% out of distribution accuracy. 100% is important because we care about how a thing like value learning will generalise OOD.

And yet we don't get 100% (in fact most metrics are much easier than what we care about, being in-distribution, or on careful setups). What is wrong with the assumptions hypothesis, do you think?