Interested in math puzzles, Fermi estimation, strange facts about the world, toy models of weird scenarios, unusual social technologies, and deep dives into the details of random phenomena.
Working on the pretraining team at Anthropic as of October 2024; before that I did independent alignment research of various flavors and worked in quantitative finance.
A few months ago I spent $60 ordering the March 2025 version of Anthropic's certificate of incorporation from the state of Delaware, and last week I finally got around to scanning and uploading it. Here's a PDF! After writing most of this shortform, I discovered while googling related keywords that someone had already uploaded the 2023-09-21 version online here, which is slightly different.
I don't particularly bid that people spend their time reading it; it's very long and dense, and I predict that most people who aren't already familiar with corporate law (including me) and who try to draw important conclusions from it will end up somewhat confused by default. But I'd like more transparency about the corporate governance of frontier AI companies, and this is an easy step.
Anthropic uses a bunch of different phrasings of its mission across various official documents; of these, I believe the COI's is the most legally binding one. It says that "the specific public benefit that the Corporation will promote is to responsibly develop and maintain advanced AI for the long term benefit of humanity." I like this wording less than others Anthropic has used, like "Ensure the world safely makes the transition through transformative AI", though I don't expect it to matter terribly much.
I think the main thing this sheds light on is stuff like Maybe Anthropic's Long-Term Benefit Trust Is Powerless: as of late 2025, overriding the LTBT takes 85% of voting stock, or all of (a) 75% of founder shares, (b) 50% of series A preferred, and (c) 75% of non-series-A voting preferred stock. (And, unrelated to the COI but relevant to that post, it is now public that neither Google nor Amazon holds voting shares.)
The only thing I'm aware of in the COI that seems concerning to me re: the Trust is a clause added to the COI sometime between the 2023 and 2025 editions, namely the italicized portion of the following:
(C) Action by the Board of Directors. Except as expressly provided herein, each director of the Corporation shall be entitled to one (1) vote on all matters presented to the Board of Directors for approval at any meeting of the Board of Directors, or for action to be taken by written consent without a meeting; provided, however, that, if and for so long as the Electing Preferred Holders are entitled to elect a director of the Corporation, the affirmative vote of either (i) the Electing Preferred Director or (ii) at least 61% of the then serving directors may be required for authorization by the Board of Directors of any of the matters set forth in the Investors' Rights Agreement. If at any time the vote of the Board of Directors with respect to a matter is tied (a "Deadlocked Matter") and the Chief Executive Officer of the Corporation is then serving as a director (the "CEO Director"), the CEO Director shall be entitled to an additional vote for the purpose of deciding the Deadlocked Matter (a "Deadlock Vote") (and every reference in this Restated Certificate or in the Bylaws of the Corporation to a majority or other proportion of the directors shall refer to a majority or other proportion of the votes of the directors), except with respect to any vote as to which the CEO Director is not disinterested or has a conflict of interest, in which such case the CEO Director shall not have a Deadlock Vote.
I think this means that the 3 LTBT-appointed directors do not have the ability to unilaterally take some kinds of actions, plausibly including things like firing the CEO (it would depend on what's in the Investors' Rights Agreement, which I don't have access to). I think this is somewhat concerning, and moderately downgrades my estimate of the hard power possessed by the LTBT, though my biggest worry about the quality of the Trust's oversight remains the degree of its AI safety expertise and engagement rather than its nominal hard power. (Though as I said above, interpreting this stuff is hard and I think it's quite plausible I'm neglecting important considerations!)
More like True Time Horizon, though I think the claim is pretty plausible within the domain of well-scoped end-to-end AI R&D remote work tasks as well.
I also think that there is a <5% chance that a large-scale AI catastrophe occurs in the next 6 months, but I don't think the time horizon framing here is a compelling argument for that.
Let’s use one year of human labor as a lower bound on how difficult it is. This means that AI systems will need to at least have a time horizon of one work-year (2000 hours) in order to cause a catastrophe.
I'm not very sold on this argument. AIs can do many tasks that take over a year of human labor, like "write 1 billion words of fiction"; when thinking about time horizons we care about something like "irreducible time horizon", or some appeal to the most difficult or long-horizon-y bottleneck within a task.
While I agree that causing a catastrophe would require more than a year of human labor, it is not obvious to me that causing a large-scale catastrophe is bottlenecked on any task with an irreducible time horizon of one year; indeed, it's not obvious to me that any such tasks exist! It seems plausible to me that the first AIs with reliable 1-month time horizons will basically not be time-horizon-limited in any way that humans aren't, and unlikely-but-possible that this will be true even at the 1 week level.
Concretely, a massively scaled, coordinated cyberattack on critical infrastructure worldwide is a threat model that could plausibly cause 1e8 deaths and does not obviously rely on any subtask with an especially large time horizon; I think the primary blockers to autonomous LLM "success" here in 2025 are (1) programming skill, (2) ability to make and execute on an okay plan coherently enough not to blunder into failure, (3) ability to acquire enough unmonitored inference resources, and (4) alignment. Of these I expect (2) to be the most time-horizon-bottlenecked, but I wouldn't feel that surprised if models that could pull it off still had low reliability on 1-day AI R&D tasks. (This particular scenario is contingent enough that I think it's still very unlikely in the near term, TBC.)
But the model probably "knows" how many tokens there are; it's an extremely salient property of the input
This doesn't seem that clear to me; what part of training would incentivize the model to develop circuits for exact token-counting? Training a model to adhere to a particular token budget would do some of this, but it seems like it would have relatively light pressure on getting exact estimates right vs guessing things to the nearest few hundred tokens.
We know from humans that it's very possible for general intelligences to be blind to pretty major low-level features of their experience; you don't have introspective access to the fact that there's a big hole in your visual field, or to the mottled patterns of blood vessels in front of your eye at all times, or to the ways your brain distorts your perception of time and retroactively adjusts your memories of the past half-second.
One way to test this would be to see if there are SAE features centrally about token counts; my guess would be that these show up in some early layers but are mostly absent in places where the model is doing more sophisticated semantic reasoning about things like introspection prompts. Ofc this might fail to capture the relevant sense of "knowing" etc, but I'd still take it as fairly strong evidence either way.
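As a rough illustration, here's a minimal sketch of a simpler linear-probe version of this test (not the SAE version); the model, layer index, and prompt construction are all placeholder choices, and the question it asks is just whether prompt token count is linearly decodable from a mid-layer hidden state:

```python
# Hypothetical sketch: check whether prompt token count is linearly decodable
# from a model's hidden states. Model name, layer, and prompts are placeholders.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # arbitrary mid-layer choice

def last_token_hidden(text: str) -> np.ndarray:
    """Hidden state at the final position of the prompt, at LAYER."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].numpy()

# Prompts of varying length: filler text repeated a variable number of times
# (kept well under GPT-2's 1024-token context limit). Toy scale throughout.
prompts = ["The quick brown fox jumps over the lazy dog. " * n for n in range(1, 80)]
X = np.stack([last_token_hidden(p) for p in prompts])
y = np.array([len(tokenizer(p)["input_ids"]) for p in prompts])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("held-out R^2 for predicting token count:", probe.score(X_te, y_te))
```

A high R^2 here wouldn't mean much on its own (positional information is trivially present in early layers); the more interesting version is whether anything like this survives, or gets used, in the later layers where the model is actually answering introspection-style prompts.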
Does this cash out into concrete predictions of tasks which you expect LLMs to make little progress on in the future?
A very literal eval suggested by your post would be to take two maps or images of a similar stylistic form but different global structure, cut them into little square sections, and ask a model to partition the pieces from both puzzles into two coherent wholes. I expect LLMs to be really bad at this task right now, but they're very bad at vision in general, so "true understanding" isn't really the bottleneck IMO.
But one could do a similar test for text-based data; eg one could ask a model to reconstruct two math proofs with shared variable names based on an unordered list of the individual sentences in each proof. Is this the kind of thing you expect models to make unusually little progress on relative to other tasks of similar time horizon? (I might be down to bet on something like this, though I think it'll be tricky to operationalize something crisply enough.)
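To gesture at what I mean, here's a minimal sketch of how the text version might be operationalized; the example proofs, prompt wording, and scoring rule are placeholder choices, and `call_model` is a stand-in for whatever API one is actually testing:

```python
# Hypothetical sketch of the "reconstruct two interleaved proofs" eval.
# Proofs, prompt, and scoring are illustrative, not a fixed spec.
import random
import re

proof_a = [
    "Let n be an odd integer, so n = 2k + 1 for some integer k.",
    "Then n^2 = 4k^2 + 4k + 1 = 2(2k^2 + 2k) + 1.",
    "Hence n^2 is odd.",
]
proof_b = [
    "Let n be an integer divisible by 3, so n = 3k for some integer k.",
    "Then n^2 = 9k^2 = 3(3k^2).",
    "Hence n^2 is divisible by 3.",
]

def build_task(pa, pb, seed=0):
    """Interleave the sentences of two proofs and build a partitioning prompt."""
    sentences = [(s, "A") for s in pa] + [(s, "B") for s in pb]
    random.Random(seed).shuffle(sentences)
    prompt = (
        "The numbered sentences below come from two different proofs, interleaved.\n"
        "Assign each sentence to group 1 or group 2 so that each group is a single\n"
        "coherent proof. Answer with one line per sentence, like '3: 1'.\n\n"
    )
    prompt += "\n".join(f"{i+1}. {s}" for i, (s, _) in enumerate(sentences))
    gold = [label for _, label in sentences]
    return prompt, gold

def score(model_output: str, gold: list[str]) -> float:
    """Fraction of sentences grouped correctly, taking the better of the two
    possible group-label assignments (the two groups are interchangeable)."""
    assignments = {int(i): g for i, g in re.findall(r"(\d+)\s*:\s*([12])", model_output)}
    pred = [assignments.get(i + 1, "?") for i in range(len(gold))]
    acc = lambda m: sum(m.get(p) == g for p, g in zip(pred, gold)) / len(gold)
    return max(acc({"1": "A", "2": "B"}), acc({"1": "B", "2": "A"}))

prompt, gold = build_task(proof_a, proof_b)
# response = call_model(prompt)  # stand-in for the model under test
# print(score(response, gold))
```

The shared variable names (both proofs use n and k) are what's supposed to prevent purely lexical matching; a real bet would presumably need many such pairs with longer and more entangled proofs.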
The details are complicated, vary a lot person-to-person, and I'm not sure which are OK to share publicly; the TLDR is that relatively early employees have a 3:1 match on up to 50% of their equity, and later employees a 1:1 match on up to 25%.
I believe that many people eligible for earlier liquidation opportunities used the proceeds from said liquidation to exercise additional stock options, because various tax considerations make doing so extremely leveraged for one's future donation potential (at least if one expects the value of those options to increase over time). I expect that most people who are into doing interesting, impact-maximizing things with their money took this route, which doesn't produce much in the way of observable consequences right now.
I've made a legally binding pledge to allocate half of it to 501(c)(3) charities, the maximum that my employer's donation match covers; I expect to donate the majority of the remainder but have had no opportunities to liquidate any of it yet.
Yep, I agree that's a risk, and one that should seem fairly plausible to external readers. (This is why I included other bullet points besides that one.) I'm not sure I can offer something compelling over text that other readers will find convincing, but I do think I'm in a pretty epistemically justified state here even if I don't think you should think that based on what you know of me.
And TBC, I'm not saying I'm unbiased! I think I am biased in a ton of ways - my social environment, possession of a stable high-status job, not wanting to say something accidentally wrong or to hurt people's feelings, inner-ring dynamics of being in the know about things, etc. are all ways I think my epistemics face pressure here - but I feel quite sure that "the value of my equity goes down if Anthropic is less commercially successful" contributes a tiny, tiny fraction to that state of affairs. You're well within your rights not to believe me, though.
Agreed - I do think the case for doing this for signaling reasons is stronger for Joe and I think it's plausible he should have avoided this for that reason. I just don't think it's clear that it would be particularly helpful on the object level for his epistemics, which is what I took the parent comment to be saying.
I'm not going to invest time in further replies here, but FYI, the reason you're getting downvotes is because your complaint does not make sense and comes across as wildly conspiratorial and unfounded, and no one with any reasonable understanding of the field would think this is a sensible thing to be up in arms over. I strongly recommend that you stop talking to LLMs.