Ryan Kidd's Shortform

by Ryan Kidd
13th Oct 2022
1 min read

This is a special post for quick takes by Ryan Kidd. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
122 comments, sorted by top scoring. Some comments are truncated due to high volume.
[-]Ryan Kidd3mo*1270

80% of MATS alumni who completed the program before 2025 are still working on AI safety today, based on a survey of all available alumni LinkedIns or personal websites (242/292 ~ 83%). 10% are working on AI capabilities, but only ~6 at a frontier AI company (2 at Anthropic, 2 at Google DeepMind, 1 at Mistral AI, 1 extrapolated). 2% are still studying, but not in a research degree focused on AI safety. The last 8% are doing miscellaneous things, including non-AI safety/capabilities software engineering, teaching, data science, consulting, and quantitative trading.

Of the 193+ MATS alumni working on AI safety (extrapolated: 234):

  • 34% are working at a non-profit org (Apollo, Redwood, MATS, EleutherAI, FAR.AI, MIRI, ARC, Timaeus, LawZero, RAND, METR, etc.);
  • 27% are working at a for-profit org (Anthropic, Google DeepMind, OpenAI, Goodfire, Meta, etc.);
  • 18% are working as independent researchers, probably with grant funding from Open Philanthropy, LTFF, etc.;
  • 15% are working as academic researchers, including PhDs/Postdocs at Oxford, Cambridge, MIT, ETH Zurich, UC Berkeley, etc.;
  • 6% are working in government agencies, including in the US, UK, EU, and Singapore.

10% of MATS alumni co-founded an ... (read more)
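A minimal sketch of the coverage-based extrapolation implied by the numbers above, assuming the extrapolated counts simply scale the observed counts by the fraction of alumni with findable profiles (the exact method isn't stated, so treat this as an illustration):

```python
# Coverage-based extrapolation of alumni counts (illustrative sketch only).
total_alumni = 292        # pre-2025 MATS alumni
profiles_found = 242      # alumni with a findable LinkedIn or personal website
observed_in_safety = 193  # of those found, the number working on AI safety

coverage = profiles_found / total_alumni      # ~0.83, the "242/292 ~ 83%" above
extrapolated = observed_in_safety / coverage  # ~233, close to the ~234 reported

print(f"coverage: {coverage:.0%}, extrapolated AI safety alumni: {extrapolated:.0f}")
```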

8Daniel Kokotajlo2mo
What about RL? Why did you single out pre-training specifically?  The number I'd be interested in is the % that went on to work on capabilities at a frontier AI company. 
6Ryan Kidd2mo
Sorry, I should have said "~6 on capabilities at a frontier AI company".
2J Bostock2mo
What are some representative examples of the rest? I'm wondering if it's:
  • AI wrappers like Cursor
  • Model training for entirely mundane stuff like image gen at Stablediffusion
  • Narrow AI like AlphaFold at Isomorphic
  • An AGI-ish project but not LLMs, e.g. a company that just made AlphaGo type stuff
  • General-purpose LLMs but not at a frontier lab (I would honestly count Mistral here)
4Ryan Kidd2mo
Here are the AI capabilities organizations where MATS alumni are working (1 at each except for Anthropic and GDM, where there are 2 each):
  • Anthropic
  • Barcelona Supercomputing Cluster
  • Conduit Intelligence
  • Decart
  • EliseAI
  • Fractional AI
  • General Agents
  • Google DeepMind
  • iGent AI
  • Imbue
  • Integuide
  • Kayrros
  • Mecha Health
  • Mistral AI
  • MultiOn
  • Norm AI
  • NVIDIA
  • Palantir
  • Phonic
  • RunRL
  • Salesforce
  • Sandbar
  • Secondmind
  • Yantran

Alumni also work at these organizations, which might be classified as capabilities or safety-adjacent:
  • Freestyle Research
  • Leap Labs
2Jacob Pfau2mo
UK AISI is a government agency, so the pie chart is probably misleading on that segment!
4Ryan Kidd2mo
Oh, shoot, my mistake.
2Nicholas Kross3mo
I'm curious to see if I'm in this data, so I can help make it more accurate by providing info.
2Ryan Kidd2mo
Hi Nicholas! You are not in the data as you were not a MATS scholar, to my knowledge. Were you a participant in one of the MATS training programs instead? Or did I make a mistake?
2[comment deleted]3mo
[-]Ryan Kidd1y*482

I am a Manifund Regrantor. In addition to general grantmaking, I have requests for proposals in the following areas:

  • Funding for AI safety PhDs (e.g., with these supervisors), particularly in exploratory research connecting AI safety theory with empirical ML research.
  • An AI safety PhD advisory service that helps prospective PhD students choose a supervisor and topic (similar to Effective Thesis, but specialized for AI safety).
  • Initiatives to critically examine current AI safety macrostrategy (e.g., as articulated by Holden Karnofsky) like the Open Philanthropy AI Worldviews Contest and Future Fund Worldview Prize.
  • Initiatives to identify and develop "Connectors" outside of academia (e.g., a reboot of the Refine program, well-scoped contests, long-term mentoring and peer-support programs).
  • Physical community spaces for AI safety in AI hubs outside of the SF Bay Area or London (e.g., Japan, France, Bangalore).
  • Start-up incubators for projects, including evals/red-teaming/interp companies, that aim to benefit AI safety, like Catalyze Impact, Future of Life Foundation, and YCombinator's request for Explainable AI start-ups.
  • Initiatives to develop and publish expert consensus on AI safety macr
... (read more)
1Thiwanka Jayasiri1y
"Physical community spaces for AI safety in AI hubs outside of the SF Bay Area or London (e.g., Japan, France, Bangalore)"- I love this initiative. Can we also consider Australia or New Zealand in the upcoming proposal? 
3Ryan Kidd1y
In theory, sure! I know @yanni kyriacos recently assessed the need for an ANZ AI safety hub, but I think he concluded there wasn't enough of a need yet?
4yanni kyriacos1y
Hi! I think in Sydney we're ~ 3 seats short of critical mass, so I am going to reassess the viability of a community space in 5-6 months :)
2Ryan Kidd10mo
@yanni kyriacos when will you post about TARA and Sydney AI Safety Hub on LW? ;)
1yanni kyriacos10mo
SASH isn't official (we're waiting on funding). Here is TARA :) https://www.lesswrong.com/posts/tyGxgvvBbrvcrHPJH/apply-to-be-a-ta-for-tara
[-]Ryan Kidd5mo*473

As part of MATS' compensation reevaluation project, I scraped the publicly declared employee compensations from ProPublica's Nonprofit Explorer for many AI safety and EA organizations (data here) in 2019-2023. US nonprofits are required to disclose compensation information for certain highly paid employees and contractors on their annual Form 990 tax return, which becomes publicly available. This includes compensation for officers, directors, trustees, key employees, and highest compensated employees earning over $100k annually. Therefore, my data does not include many individuals earning under $100k, but this doesn't seem to affect the yearly medians much, as the data seems to follow a lognormal distribution, with mode ~$178k in 2023, for example.
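As a side note, a minimal sketch of how a mode like the ~$178k figure can be read off a lognormal fit; the `salaries` array below is hypothetical placeholder data, not the actual Form 990 dataset:

```python
import numpy as np
from scipy.stats import lognorm

# Hypothetical placeholder salaries; the real analysis used Form 990 filings.
salaries = np.array([110_000, 135_000, 150_000, 178_000, 210_000,
                     250_000, 320_000, 450_000])

# Fit a lognormal with location fixed at 0: shape s = sigma, scale = exp(mu).
s, loc, scale = lognorm.fit(salaries, floc=0)

# The mode of a lognormal is exp(mu - sigma^2) = scale * exp(-s^2).
mode = scale * np.exp(-s**2)
print(f"fitted mode: ${mode:,.0f}")
```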

I generally found that AI safety and EA organization employees are highly compensated, albeit inconsistently between similar-sized organizations within equivalent roles (e.g., Redwood and FAR AI). I speculate that this is primarily due to differences in organization funding, but inconsistent compensation policies may also play a role.

I'm sharing this data to promote healthy and fair compensation policies across the ecosystem. I believe th... (read more)

7Ryan Kidd5mo
I decided to exclude OpenAI's nonprofit salaries as I didn't think they counted as an "AI safety nonprofit" and their highest paid current employees are definitely employed by the LLC. I decided to include Open Philanthropy's nonprofit employees, despite the fact that their most highly compensated employees are likely those under the Open Philanthropy LLC.
4Ryan Kidd5mo
It's also worth noting that almost all of these roles are management, ML research, or software engineering; there are very few operations, communications, non-ML research, etc. roles listed, implying that these roles are paid significantly less.
[-]Ryan Kidd1y*470

Hourly stipends for AI safety fellowship programs, plus some referents. The average AI safety program stipend is $26/h.

Edit: updated figure to include more programs.

[-]johnswentworth1y107

... WOW that is not an efficient market. 

7johnswentworth1y
Y'know @Ryan, MATS should try to hire the PIBBSS folks to help with recruiting. IMO they tend to have the strongest participants of the programs on this chart which I'm familiar with (though high variance).
2Ryan Kidd1y
That's interesting! What evidence do you have of this? What metrics are you using?
6johnswentworth1y
My main metric is "How smart do these people seem when I talk to them or watch their presentations?". I think they also tend to be older and have more research experience.
5Ryan Kidd1y
I think there are some confounders here:
  • PIBBSS had 12 fellows last cohort and MATS had 90 scholars. The mean/median age of a MATS Summer 2024 scholar was 27; I'm not sure what this was for PIBBSS. The median age of the 12 oldest MATS scholars was 35 (mean 36). If we were selecting for age (which is silly/illegal, of course) and had a smaller program, I would bet that MATS would be older than PIBBSS on average. MATS also had 12 scholars with completed PhDs and 11 in-progress.
  • Several PIBBSS fellows/affiliates have done MATS (e.g., Ann-Kathrin Dombrowski, Magdalena Wache, Brady Pelkey, Martín Soto).
  • I suspect that your estimation of "how smart do these people seem" might be somewhat contingent on research taste. Most MATS research projects are in prosaic AI safety fields like oversight & control, evals, and non-"science of DL" interpretability, while most PIBBSS research has been in "biology/physics-inspired" interpretability, agent foundations, and (recently) novel policy approaches (all of which MATS has supported historically). Also, MATS is generally trying to further a different research portfolio than PIBBSS, as I discuss here, and has substantial success in accelerating hires to AI scaling lab safety teams and research nonprofits, helping scholars found impactful AI safety organizations, and (I suspect) accelerating AISI hires.
3johnswentworth1y
I think this is less a matter of my particular taste, and more a matter of selection pressures producing genuinely different skill levels between different research areas. People notoriously focus on oversight/control/evals/specific interp over foundations/generalizable interp because the former are easier. So when one talks to people in those different areas, there's a very noticeable tendency for the foundations/generalizable interp people to be noticeably smarter, more experienced, and/or more competent. And in the other direction, stronger people tend to be more often drawn to the more challenging problems of foundations or generalizable interp. So possibly a MATS apologist reply would be: yeah, the MATS portfolio is more loaded on the sort of work that's accessible to relatively-mid researchers, so naturally MATS ends up with more relatively-mid researchers. Which is not necessarily a bad thing.
3Ryan Kidd1y
I don't agree with the following claims (which might misrepresent you):
  • "Skill levels" are domain agnostic.
  • Frontier oversight, control, evals, and non-"science of DL" interp research is strictly easier in practice than frontier agent foundations and "science of DL" interp research.
  • The main reason there is more funding/interest in the former category than the latter is due to skill issues, rather than worldview differences and clarity of scope.
  • MATS has mid researchers relative to other programs.
9johnswentworth1y
Y'know, you probably have the data to do a quick-and-dirty check here. Take a look at the GRE/SAT scores on the applications (both for applicant pool and for accepted scholars). If most scholars have much-less-than-perfect scores, then you're probably not hiring the top tier (standardized tests have a notoriously low ceiling). And assuming most scholars aren't hitting the test ceiling, you can also test the hypothesis about different domains by looking at the test score distributions for scholars in the different areas.
6Ryan Kidd1y
We don't collect GRE/SAT scores, but we do have CodeSignal scores and (for the first time) a general aptitude test developed in collaboration with SparkWave. Many MATS applicants have maxed out scores for the CodeSignal and general aptitude tests. We might share these stats later.
1Daniel Tan9mo
FWIW from what I remember, I would be surprised if most people doing MATS 7.0 did not max out the aptitude test. Also, the aptitude test seems more like an SAT than anything measuring important procedural knowledge for AI safety. 
2Ryan Kidd1y
Are these PIBBSS fellows (MATS scholar analog) or PIBBSS affiliates (MATS mentor analog)?
2johnswentworth1y
Fellows.
4Ryan Kidd1y
Note that governance/policy jobs pay less than ML research/engineering jobs, so I expect GovAI, IAPS, and ERA, which are more governance focused, to have a lower stipend. Also, MATS is deliberately trying to attract top CS PhD students, so our stipend should be higher than theirs, although lower than Google internships to select for value alignment. I suspect that PIBBSS' stipend is an outlier and artificially low due to low funding. Given that PIBBSS has a mixture of ML and policy projects, and IMO is generally pursuing higher variance research than MATS, I suspect their optimal stipend would be lower than MATS', but higher than a Stanford PhD's; perhaps around IAPS' rate.
2Ryan Kidd1y
That said, maybe you are conceptualizing of an "efficient market" that principally values impact, in which case I would expect the governance/policy programs to have higher stipends. However, I'll note that 87% of MATS alumni are interested in working at an AISI and several are currently working at UK AISI, so it seems that MATS is doing a good job of recruiting technical governance talent that is happy to work for government wages.
1johnswentworth1y
No, I meant that the correlation between pay and how-competent-the-typical-participant-seems-to-me is, if anything, negative. Like, the hiring bar for Google interns is lower than any of the technical programs, and PIBBSS seems-to-me to have the most competent participants overall (though I'm not familiar with some of the programs).
2Ryan Kidd1y
I don't think it makes sense to compare Google intern salary with AIS program stipends this way, as AIS programs are nonprofits (with associated salary cut) and generally trying to select against people motivated principally by money. It seems like good mechanism design to pay less than tech internships, even if the technical bar is higher, given that value alignment is best selected for by looking for "costly signals" like salary sacrifice. I don't think the correlation for competence among AIS programs is as you describe.
[-]Erik Jenner1y100

Interesting, thanks! My guess is this doesn't include benefits like housing and travel costs? Some of these programs pay for those while others don't, which I think is a non-trivial difference (especially for the bay area)

4Ryan Kidd1y
Yes, this doesn't include those costs and programs differ in this respect.
4Leon Lang1y
Is “CHAI” being a CHAI intern, PhD student, or something else? My MATS 3.0 stipend was clearly higher than my CHAI internship stipend.
4Ryan Kidd1y
CHAI interns are paid $5k/month for in-person interns and $3.5k/month for remote interns. I used the in-person figure. https://humancompatible.ai/jobs
5Leon Lang1y
Then the MATS stipend today is probably much lower than it used to be? (Which would make sense since IIRC the stipend during MATS 3.0 was settled before the FTX crash, so presumably when the funding situation was different?)
6Ryan Kidd1y
MATS lowered the stipend from $50/h to $40/h ahead of the Summer 2023 Program to support more scholars. We then lowered it again to $30/h ahead of the Winter 2023-24 Program after surveying alumni and determining that 85% would accept $30/h.
2jacquesthibs1y
I’d be curious to know if there’s variability in the “hours worked per week” given that people might work more hours during a short program vs a longer-term job (to keep things sustainable).
2Ryan Kidd1y
I'm not sure!
1metawrong1y
LASR (https://www.lasrlabs.org/) is giving an £11,000 stipend for a 13-week program; assuming 40h/week, it works out to ~$27/h.
3Ryan Kidd1y
Updated figure with LASR Labs and Pivotal Research Fellowship at current exchange rate of 1 GBP = 1.292 USD.
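For reference, the hourly figure follows directly from the stipend, program length, and exchange rate (assuming the 40h/week workload mentioned above):

```python
# LASR Labs stipend converted to an hourly USD rate.
stipend_gbp = 11_000
weeks, hours_per_week = 13, 40
gbp_to_usd = 1.292  # exchange rate used for the figure update

hourly_usd = stipend_gbp / (weeks * hours_per_week) * gbp_to_usd
print(f"~${hourly_usd:.0f}/h")  # ~$27/h
```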
2Ryan Kidd1y
That seems like a reasonable stipend for LASR. I don't think they cover housing, however.
[-]Ryan Kidd1y*291

Why does the AI safety community need help founding projects?

  1. AI safety should scale
    1. Labs need external auditors for the AI control plan to work
    2. We should pursue many research bets in case superalignment/control fails
    3. Talent leaves MATS/ARENA and sometimes struggles to find meaningful work for mundane reasons, not for lack of talent or ideas
    4. Some emerging research agendas don’t have a home
    5. There are diminishing returns at scale for current AI safety teams; sometimes founding new projects is better than joining an existing team
    6. Scaling lab alignment teams are bottlenecked by management capacity, so their talent cut-off is above the level required to do “useful AIS work”
  2. Research organizations (inc. nonprofits) are often more effective than independent researchers
    1. “Block funding model” is more efficient, as researchers can spend more time researching, rather than seeking grants, managing, or other traditional PI duties that can be outsourced
    2. Open source/collective projects often need a central rallying point (e.g., EleutherAI, dev interp at Timaeus, selection theorems and cyborgism agendas seem too delocalized, etc.)
  3. There is (imminently) a market for for-profit AI safety companies and value-al
... (read more)
8Ben Goldhaber1y
I agree with this, I'd like to see AI Safety scale with new projects. A few ideas I've been mulling:
  • A 'festival week' bringing entrepreneur types and AI safety types together to cowork from the same place, along with a few talks and a lot of mixers.
  • Running an incubator/accelerator program at the tail end of a funding round, with fiscal sponsorship and some amount of operational support.
  • More targeted recruitment for specific projects to advance important parts of a research agenda.

It's often unclear to me whether new projects should actually be new organizations; making it easier to spin up new projects, that can then either join existing orgs or grow into orgs themselves, seems like a promising direction.
4Elizabeth1y
I'm surprised this one was included, it feels tail-wagging-the-dog to me. 
0Ryan Kidd1y
I would amend it to say "sometimes struggles to find meaningful employment despite having the requisite talent to further impactful research directions (which I believe are plentiful)"
[-]Elizabeth1y210

This still reads to me as advocating for a jobs program for the benefit of MATS grads, not safety. My guess is you're aiming for something more like "there is talent that could do useful work under someone else's direction, but not on their own, and we can increase safety by utilizing it".

6mesaoptimizer1y
I expect that Ryan means to say one of these things:
  1. There isn't enough funding for MATS grads to do useful work in the research directions they are working on, that have already been vouched for by senior alignment researchers (especially their mentors) to be valuable. (Potential examples: infrabayesianism)
  2. There isn't (yet) institutional infrastructure to support MATS grads to do useful work together as part of a team focused on the same (or very similar) research agendas, and that this is the case for multiple nascent and established research agendas. They are forced to go to academia and disperse across the world instead of being able to work together in one location. (Potential examples: selection theorems, multi-agent alignment (of the sort that Caspar Oesterheld and company work on))
  3. There aren't enough research managers in existing established alignment research organizations or frontier labs to enable MATS grads to work on the research directions they consider extremely high value, and would benefit from multiple people working together on. (Potential examples: activation steering)

I'm pretty sure that Ryan does not mean to say that MATS grads cannot do useful work on their own. The point is that we don't yet have the institutional infrastructure to absorb, enable, and scale new researchers the way our civilization has for existing STEM fields via, say, PhD programs or yearlong fellowships at OpenAI/MSR/DeepMind (which are also pretty rare). AFAICT, the most valuable part of such infrastructure in general is the ability to co-locate researchers working on the same or similar research problems -- this is standard for academic and industry research groups, for example, and from experience I know that being able to do so is invaluable. Another extremely valuable facet of institutional infrastructure that enables researchers is the ability to delegate operations and logistics problems -- particularly the difficulty of finding grant funding, int…
2Ryan Kidd1y
@Elizabeth, Mesa nails it above. I would also add that I am conceptualizing impactful AI safety research as the product of multiple reagents, including talent, ideas, infrastructure, and funding. In my bullet point, I was pointing to an abundance of talent and ideas relative to infrastructure and funding. I'm still mostly working on talent development at MATS, but I'm also helping with infrastructure and funding (e.g., founding LISA, advising Catalyze Impact, regranting via Manifund) and I want to do much more for these limiting reagents.
4Ryan Kidd1y
Also note that historically many individuals entering AI safety seem to have been pursuing the "Connector" path, when most jobs now (and probably in the future) are "Iterator"-shaped, and larger AI safety projects are also principally bottlenecked by "Amplifiers". The historical focus on recruiting and training Connectors to the detriment of Iterators and Amplifiers has likely contributed to this relative talent shortage. A caveat: Connectors are also critical for founding new research agendas and organizations, though many self-styled Connectors would likely substantially benefit as founders by improving some Amplifier-shaped soft skills, including leadership, collaboration, networking, and fundraising.
4Ryan Kidd1y
I interpret your comment as assuming that new researchers with good ideas produce more impact on their own than in teams working towards a shared goal; this seems false to me. I think that independent research is usually a bad bet in general and that most new AI safety researchers should be working on relatively few impactful research directions, most of which are best pursued within a team due to the nature of the research (though some investment in other directions seems good for the portfolio). I've addressed this a bit in thread, but here are some more thoughts:
  • New AI safety researchers seem to face mundane barriers to reducing AI catastrophic risk, including funding, infrastructure, and general executive function.
  • MATS alumni are generally doing great stuff (~78% currently work in AI safety/control, ~1.4% work on AI capabilities), but we can do even better.
  • Like any other nascent scientific/engineering discipline, AI safety will produce more impactful research with scale, albeit with some diminishing returns on impact eventually (I think we are far from the inflection point, however).
  • MATS alumni, as a large swathe of the most talented new AI safety researchers in my (possibly biased) opinion, should ideally not experience mundane barriers to reducing AI catastrophic risk.
  • Independent research seems worse than team-based research for most research that aims to reduce AI catastrophic risk:
    • "Pair-programming", builder-breaker, rubber-ducking, etc. are valuable parts of the research process and are benefited by working in a team.
    • Funding insecurity and grantwriting responsibilities are larger for independent researchers and obstruct research.
    • Orgs with larger teams and discretionary funding can take on interns to help scale projects and provide mentorship.
    • Good prosaic AI safety research largely looks more like large teams doing engineering and less like lone geniuses doing maths. Obviously, some lone genius researchers (espec…
2Elizabeth1y
I don't believe that, although I see how my summary could be interpreted that way. I agree with basically all the reasons in your recent comment and most in the original comment. I could add a few reasons of my own why doing independent grant-funded work sucks. But I think it's really important to track how founding projects translates to increased potential safety instead of intermediates, and push hard against potential tail-wagging-the-dog scenarios.

I was trying to figure out why this was important to me, given how many of your points I agree with. I think it's a few things:
  • Alignment work seems to be prone to wagging the dog, and is harder to correct, due to poor feedback loops.
  • The consequences of this can be dire:
    • making it harder to identify and support the best projects;
    • making it harder to identify and stop harmful projects;
    • making it harder to identify when a decent idea isn't panning out, leading to people and money getting stuck in the mediocre project instead of moving on.
  • One of the general concerns about MATS is it spins up potential capabilities researchers. If the market can't absorb the talent, that suggests maybe MATS should shrink.
  • OTOH if you told me that for every 10 entrants MATS spins up 1 amazing safety researcher and 9 people who need makework to prevent going into capabilities, I'd be open to arguments that that was a good trade.
[-]Ryan Kidd1y282

I just left a comment on PIBBSS' Manifund grant proposal (which I funded $25k) that people might find interesting.

Main points in favor of this grant

  1. My inside view is that PIBBSS mainly supports “blue sky” or “basic” research, some of which has a low chance of paying off, but might be critical in “worst case” alignment scenarios (e.g., where “alignment MVPs” don’t work, or “sharp left turns” and “intelligence explosions” are more likely than I expect). In contrast, of the technical research MATS supports, about half is basic research (e.g., interpretability, evals, agent foundations) and half is applied research (e.g., oversight + control, value alignment). I think the MATS portfolio is a better holistic strategy for furthering AI alignment. However, if one takes into account the research conducted at AI labs and supported by MATS, PIBBSS’ strategy makes a lot of sense: they are supporting a wide portfolio of blue sky research that is particularly neglected by existing institutions and might be very impactful in a range of possible “worst-case” AGI scenarios. I think this is a valid strategy in the current ecosystem/market and I support PIBBSS!
  2. In MATS’ recent post, “Talent Needs of
... (read more)
[-]Ryan Kidd3y234

Main takeaways from a recent AI safety conference:

  • If your foundation model is one small amount of RL away from being dangerous and someone can steal your model weights, fancy alignment techniques don’t matter. Scaling labs cannot currently prevent state actors from hacking their systems and stealing their stuff. Infosecurity is important to alignment.
  • Scaling labs might have some incentive to go along with the development of safety standards as it prevents smaller players from undercutting their business model and provides a credible defense against lawsuits regarding unexpected side effects of deployment (especially with how many tech restrictions the EU seems to pump out). Once the foot is in the door, more useful safety standards to prevent x-risk might be possible.
  • Near-term commercial AI systems that can be jailbroken to elicit dangerous output might empower more bad actors to make bioweapons or cyberweapons. Preventing the misuse of near-term commercial AI systems or slowing down their deployment seems important.
  • When a skill is hard to teach, like making accurate predictions over long time horizons in complicated situations or developing a “security mindset,” try treating human
... (read more)
[-]Ryan Kidd5mo*223

I did a quick inventory on the employee headcount at AI safety and safety-adjacent organizations. The median AI safety org has 8 employees (updated from 10). I didn't include UK AISI, US AISI, CAIS, and the safety teams at Anthropic, GDM, OpenAI, and probably more, as I couldn't get accurate headcount estimates. I also didn't include "research affiliates" or university students in the headcounts for academic labs. Data here. Let me know if I missed any orgs!

9Ryan Kidd5mo
Apparently the headcount for US corporations follows a power-law distribution, apart from mid-sized corporations, which fit a lognormal distribution better. I fit a power-law distribution to the data (after truncating all datapoints with over 40 employees, which created a worse fit), which gave p(x) ≈ 399·x^(−1.29). This seems to imply that there are ~400 independent AI safety researchers (though note that p(x) is a probability density function and this estimate might be way off); Claude estimates 400-600 for comparison. Integrating this distribution over x ∈ [1, ∞) gives ~1400 (2 s.f.) total employees working on AI safety or safety-adjacent work (∞ might be a bad upper bound, as the largest orgs have <100 employees).
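A minimal numerical check of the ~400 and ~1400 figures, using the fitted density p(x) ≈ 399·x^(−1.29) quoted above (the fit itself isn't reproduced here):

```python
from scipy.integrate import quad

c, alpha = 399, 1.29  # fitted prefactor and exponent

# Density at x = 1, read as the "independent researcher" estimate.
print(f"p(1) ≈ {c * 1.0**-alpha:.0f}")  # ~400

# Total headcount: integrate p(x) from 1 to infinity.
total, _ = quad(lambda x: c * x**-alpha, 1, float("inf"))
print(f"total ≈ {total:.0f}")  # ~1400 (analytically, 399 / 0.29 ≈ 1376)
```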
2Garrett Baker5mo
I've seen you use "upper bound" and "underestimate" twice in this thread, and I want to lightly push back against this phrasing. I think the likely level of variance in these estimates is much too high for them to be any sort of bounds or to informatively be described as "under" any estimate. That is, although your analysis of the data you have may imply, given perfect data, an underestimate or upper bound, I think your data is too imperfect for those to be useful descriptions with respect to reality. Garbage in, garbage out, as they say. The numbers are still useful, and if you have a model of the orgs that you missed, the magnitude of your over-counting, or a sensitivity analysis of your process, these would be interesting to hear about. But these will probably take the form of approximate probability distributions more than binary bigger-or-smaller-than-truth shots from the hip.
2Ryan Kidd5mo
By "upper bound", I meant "upper bound b on the definite integral ∫bap(x)dx". I.e., for the kind of hacky thing I'm doing here, the integral is very sensitive to the choice of bounds a,b. For example, the integral does not converge for a=0. I think all my data here should be treated as incomplete and all my calculations crude estimates at best. I edited the original comment to say "∞ might be a bad upper bound" for clarity.
2Garrett Baker5mo
Yeah, I think we're in agreement, I'm just saying the phrase "upper bound" is not useful compared to eg providing various estimates for various bounds & making a table or graph, and a derivative of the results wrt the parameter estimate you inferred.
4Ryan Kidd5mo
So, conduct a sensitivity analysis on the definite integral with respect to choices of integration bounds? I'm not sure this level of analysis is merited given the incomplete data and unreliable estimation methodology for the number of independent researchers. Like, I'm not even confident that the underlying distribution is a power law (instead of, say, a composite of power law and lognormal distributions, or a truncated power law), and the value of p(1) seems very sensitive to data in the vicinity, so I wouldn't want to rely on this estimate except as a very crude first pass. I would support an investigation into the number of independent researchers in the ecosystem, which I would find useful.
2Garrett Baker5mo
I think again we are on the same page & this sounds reasonable, I just want to argue that "lower bound" and "upper bound" are less-than-informative descriptions of the uncertainty in the estimates.
4Ryan Kidd5mo
I definitely think that people should not look at my estimates and say "here is a good 95% confidence interval upper bound of the number of employees in the AI safety ecosystem." I think people should look at my estimates and say "here is a good 95% confidence interval lower bound of the number of employees in the AI safety ecosystem," because you can just add up the names. I.e., even if there might be 10x the number of employees as I estimated, I'm at least 95% confident that there are more than my estimate obtained by just counting names (obviously excluding the 10% fudge factor).
8Nathan Helm-Burger5mo
A couple thousand brave souls standing between humanity and utter destruction. Kinda epic from a storytelling narrative where the plucky underdog good guys win in the end.... but kinda depressing from a realistic forecast perspective. Can humanity really not muster more researchers to throw at this problem? What an absurd coordination failure.
7Ryan Kidd5mo
Assuming that I missed 10% of orgs, this gives a rough estimate for the total number of FTEs working on AI safety or adjacent work at ~1000, not including students or faculty members. This is likely an underestimate, as there are a lot of AI safety-adjacent orgs in areas like AI security and robustness.
5Rauno Arike5mo
Leap Labs, Conjecture, Simplex, Aligned AI, and the AI Futures Project seem to be missing from the current list.
5Ryan Kidd5mo
Thanks! I wasn't sure whether to include Simplex, or the entire Obelisk team at Astera (which Simplex is now part of), or just exclude these non-scaling lab hybrid orgs from the count (Astera does neuroscience too).
4Gunnar_Zarncke5mo
I presume you counted total employees and not only AI safety researchers, which would be the more interesting number, esp. at GDM.
6Ryan Kidd5mo
I counted total employees for most orgs. In the spreadsheet I linked, I didn't include an estimate for total GDM headcount, just that of the AI Safety and Alignment Team.
4Alexander Gietelink Oldenziel5mo
How did you determine the list of AI safety and adjacent organizations?
6Ryan Kidd5mo
I was very inclusive. I looked at a range of org lists, including those maintained by 80,000 Hours and AISafety.com.
3Mike Vaiana5mo
You can add AE Studio to the list.
2Ryan Kidd5mo
Most of the staff at AE Studio are not working on alignment, so I don't think it counts.
2Nathan Helm-Burger5mo
Yeah, I think it's like maybe 5 people's worth of contribution from AE studio. Which is something, but not at all comparable to the whole company being full-time on alignment.
[-]Ryan Kidd2mo180

I'm interested in determining the growth rate of AI safety awareness to aid field-building strategy. Applications to MATS have been increasing by 70% per year. Do any other field-builders have statistics they can share?

4habryka2mo
(This is a fun statistic but feels too ad-like for LW) (Edit: The new version feels fine)
5Ryan Kidd2mo
Any alternative framing/text you recommend? I think it's a pretty useful statistic for AI safety field-building.
[-]habryka2mo110

Without the question at the end it would feel less ad-like, like it doesn't to me feel like you are honestly asking that question. 

It's also that this correlates with applications opening for the next MATS, and your last post had a very similar statistic correlating with the same window, so my guess is there is indeed some ad-like motivation (I think it's fine for MATS to advertise a bit on LW, though shortform is a kind of tricky medium for that because we don't have a frontpage/personal distinction for shortform and I don't want people to optimize around the lack of that distinction). 

I don't know, it feels to me like a relatively complicated aggregate of a lot of different things. Some messages that would have caused me to feel less advertised towards: 

  • "FYI, applications to MATS have been increasing 70% a year: [graph]. ARENA has seen also quite high growth rates [graph or number]. We've also seen a lot of funding increase over the same period, my best guess something similar like X. I wonder whether ~70% or something is aggregate growth rate of the field." (Of course one would have to check here, and I am not saying you have to do all of this work)
  • "MATS applicati
... (read more)
2Ryan Kidd2mo
I've updated the OP to reflect my intention better. Also, thanks for reminding me re. advertising on LW; MATS actually hasn't made an ad post yet and this seems like a big oversight!
4habryka2mo
Agree! IMO seems like there should be one.
3jamjam2mo
I feel like there should be an indicator for posts that have been edited, like the YouTube comments pictured here. It's often important context for the content of a post or comment that it has been edited since original posting. Maybe even a way to see the diff history? (Though this would be a tougher ask for site devs.)
5habryka2mo
There is an indicator! The indicator only shows up if you edit it more than half an hour after posting, to avoid lots of false-positives on typo checks or quick edits. In this case all the editing ended up happening within that window.  If you edit outside of that window it looks like this: 
1jamjam2mo
Ah, somehow never noticed this, thank you! The 30-minute policy seems good, though it comes with the potential flaw of failing to notate an actual content update if it's done quickly (as happened here). Still think diff history would be cool and would alleviate that problem, though it's rather nitpicky/minor.
3habryka2mo
Yep, I think some diff history would be good, just haven't gotten around to building the UI for it, and if we do add UI for it, I would want some ability for authors to remove information permanently, in a way that is admin-reviewed, and that requires a bunch of kind of complicated UI.
3Ryan Kidd2mo
Josh Landes shared that BlueDot Impact's application rate has been increasing by 4.7x/year.
2Ryan Kidd2mo
Also of interest: MATS mentor applications have been growing linearly at a rate of +70/year. We are now publicly advertising for mentor applications for the first time (previously, we sent out mass emails and advertised on Slack workspaces), which might explain the sub-exponential growth. Given the exponential growth in applicants and linear growth in mentors, the latter seems to be the limiting constraint still.
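A toy projection of why exponential applicant growth outpaces linear mentor growth; only the growth rates (~70%/year for scholar applications, +70/year for mentor applications) come from this thread, and the starting counts below are hypothetical placeholders:

```python
# Hypothetical starting counts; growth rates taken from the thread.
scholar_apps, mentor_apps = 1_000, 200

for year in range(1, 6):
    scholar_apps = round(scholar_apps * 1.7)  # ~70% yearly growth
    mentor_apps += 70                         # linear yearly growth
    print(f"year {year}: {scholar_apps} scholar apps, {mentor_apps} mentor apps")
```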
[-]Ryan Kidd5mo1710

Technical AI alignment/control is still impactful; don't go all-in on AI gov!

  • Liability incentivizes safeguards, even absent regulation;
  • Cheaper, more effective safeguards make it easier for labs to meet safety standards;
  • Concrete, achievable safeguards give regulation teeth.
[-]Thomas Larsen5mo168

Also there's a good chance AI gov won't work, and labs will just have a very limited safety budget to implement their best-guess mitigations. Or maybe AI gov does work and we get a large budget; we still need to actually solve alignment.

9Orpheus165mo
There are definitely still benefits to doing alignment research, but this only justifies the idea that doing alignment research is better than doing nothing. IMO the thing that matters (for an individual making decisions about what to do with their career) is something more like "on the margin, would it be better to have one additional person do AI governance or alignment/control?" I happen to think that given the current allocation of talent, on-the-margin it's generally better for people to choose AI policy. (Particularly efforts to contribute technical expertise or technical understanding/awareness to governments, think-tanks interfacing with governments, etc.) There is a lot of demand in the policy community for these skills/perspectives and few people who can provide them. In contrast, technical expertise is much more common at the major AI companies (though perhaps some specific technical skills or perspectives on alignment are neglected.) In other words, my stance is something like "by default, anon technical person would have more expected impact in AI policy unless they seem like an unusually good fit for alignment or an unusually bad fit for policy."
2Ryan Kidd5mo
I'm open to this argument, but I'm not sure it's true under the Trump administration.
2Orpheus165mo
My understanding is that AGI policy is pretty wide open under Trump. I don't think he and most of his close advisors have entrenched views on the topic. If AGI is developed in this Admin (or we approach it in this Admin), I suspect there is a lot of EV on the table for folks who are able to explain core concepts/threat models/arguments to Trump administration officials. There are some promising signs of this so far. Publicly, Vance has engaged with AI2027. Non-publicly, I think there is a lot more engagement/curiosity than many readers might expect. This isn't to say "everything is great and the USG is super on track to figure out AGI policy" but it's more to say "I think people should keep an open mind– even people who disagree with the Trump Admin on mainstream topics should remember that AGI policy is a weird/niche/new topic where lots of people do not have strong/entrenched/static positions (and even those who do have a position may change their mind as new events unfold.)"
1Felix Choussat5mo
What are your thoughts on the relative value of AI governance/advocacy vs. technical research? It seems to me that many of the technical problems are essentially downstream of politics; that intent alignment could be solved, if only our race dynamics were mitigated, regulation was used to slow capabilities research, and if it was given funding/strategic priority. 
0Katalina Hernandez5mo
This is exactly the message we need more people to hear. What’s missing from most conversations is this: Frontier liability will cause massive legal bottlenecks soon, regulations are nowhere near ready (not even in the EU with the AI Act). Law firms and courts will need technical safety experts.  Not just to inform regulation, but to provide expert opinions when opaque model behaviors cause harm downstream, often in ways that weren’t detectable during testing. The legal world will be forced to allocate responsibility in the face of emergent, stochastic failure modes. Without technical guidance, there are no safeguards to enforce, and no one to translate model failures into legal reasoning.
[-]Ryan Kidd10mo*150

Crucial questions for AI safety field-builders:

  • What is the most important problem in your field? If you aren't working on it, why?
  • Where is everyone else dropping the ball and why?
  • Are you addressing a gap in the talent pipeline?
  • What resources are abundant? What resources are scarce? How can you turn abundant resources into scarce resources?
  • How will you know you are succeeding? How will you know you are failing?
  • What is the "user experience" of my program?
  • Who would you copy if you could copy anyone? How could you do this?
  • Are you better than the counterfactual?
  • Who are your clients? What do they want?
[-]Ryan Kidd2mo130

SFF-2025 S-Process funding results have been announced! This year, the S-process gave away ~$34M to 88 organizations. The mean (median) grant was $390k ($241k).

[-]Ryan Kidd7mo134

It seems plausible to me that if AGI progress becomes strongly bottlenecked on architecture design or hyperparameter search, a more "genetic algorithm"-like approach will follow. Automated AI researchers could run and evaluate many small experiments in parallel, covering a vast hyperparameter space. If small experiments are generally predictive of larger experiments (and they seem to be, a la scaling laws) and model inference costs are cheap enough, this parallelized approach might be 1) computationally affordable and 2) successful at overcoming the architecture bottleneck.
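A minimal sketch of the kind of parallel, genetic-algorithm-style search described here, where many cheap proxy experiments score candidate hyperparameters and the best candidates are mutated; `proxy_loss` below is a stand-in for a small-scale training run assumed (per scaling laws) to predict large-scale performance:

```python
import random

def proxy_loss(cfg):
    # Stand-in for a cheap small-scale experiment whose result is assumed
    # to be predictive of performance at scale.
    return ((cfg["lr"] - 3e-4) / 3e-4) ** 2 + ((cfg["width"] - 512) / 512) ** 2

def mutate(cfg):
    return {"lr": cfg["lr"] * random.uniform(0.5, 2.0),
            "width": max(64, int(cfg["width"] * random.uniform(0.8, 1.25)))}

# Initial random population of hyperparameter configurations.
population = [{"lr": 10 ** random.uniform(-5, -2),
               "width": random.choice([128, 256, 512, 1024])}
              for _ in range(64)]

for generation in range(10):
    # In practice, each evaluation would be a small experiment run in parallel
    # by an automated researcher; here they are just function calls.
    survivors = sorted(population, key=proxy_loss)[:16]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(48)]

print("best config found:", min(population, key=proxy_loss))
```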

6Garrett Baker7mo
My vague understanding is this is kinda what capabilities progress ends up looking like in big labs. Lots of very small experiments playing around with various parameters people with a track-record of good heuristics in this space feel should be played around with. Then a slow scale up to bigger and bigger models and then you combine everything together & "push to main" on the next big model run. I'd also guess that the bottleneck isn't so much on the number of people playing around with the parameters, but much more on good heuristics regarding which parameters to play around with.
3George Ingebretsen7mo
This Dwarkesh timestamp with Jeff Dean & Noam Shazeer seems to confirm this. That would mostly answer this question as well: "If parallelized experimentation drives so much algorithmic progress, why doesn't GDM just hire hundreds of researchers, each with small compute budgets, to run these experiments?" It would also imply that it would be a big deal if they had an AI with good heuristics for this kind of thing.
4Garrett Baker7mo
Don’t double update! I got that information from that same interview!
5Ryan Kidd7mo
I expect mech interp to be particularly easy to automate at scale. If mech interp has capabilities externalities (e.g., uncovering useful learned algorithms or "retargeting the search"), this could facilitate rapid performance improvements.
[-]Ryan Kidd5mo100

Not sure this is interesting to anyone, but I compiled Zillow's data on 2021-2025 Berkeley average rent prices recently, to help with rent negotiation. I did not adjust for inflation; these are the raw averages at each time.

[-]Ryan Kidd3y90

An incomplete list of possibly useful AI safety research:

  • Predicting/shaping emergent systems (“physics”)
    • Learning theory (e.g., shard theory, causal incentives)
    • Regularizers (e.g., speed priors)
    • Embedded agency (e.g., infra-Bayesianism, finite factored sets)
    • Decision theory (e.g., timeless decision theory, cooperative bargaining theory, acausal trade)
  • Model evaluation (“biology”)
    • Capabilities evaluation (e.g., survive-and-spread, hacking)
    • Red-teaming alignment techniques
    • Demonstrations of emergent properties/behavior (e.g., instrumental powerseeking)
  • Interpretabili
... (read more)
3Roman Leventov3y
A systematic way for classifying AI safety work could use a matrix, where one dimension is the system level:
  • A monolithic AI system, e.g., a conversational LLM
  • A cyborg, human + AI(s)
  • A system of AIs with emergent qualities (e.g., https://numer.ai/, but in the future, we may see more systems like this, operating on a larger scope, up to a fully automatic AI economy; or a swarm of CoEms automating science)
  • A human+AI group, community, or society (scale-free consideration, supports arbitrary fractal nestedness): collective intelligence
  • The whole civilisation, e.g., Open Agency Architecture

Another dimension is the "time" of consideration:
  • Design time: research into how the corresponding system should be designed (engineered, organised): considering its functional ("capability", quality of decisions) properties, adversarial robustness (= misuse safety, memetic virus security), and security.
  • Manufacturing and deployment time: research into how to create the desired designs of systems successfully and safely:
    • AI training and monitoring of training runs.
    • Offline alignment of AIs during (or after) training.
    • AI strategy (= research into how to transition into the desirable civilisational state = design).
    • Designing upskilling and educational programs for people to become cyborgs is also here (= designing efficient procedures for manufacturing cyborgs out of people and AIs).
  • Operations time: ongoing (online) alignment of systems on all levels to each other, ongoing monitoring, inspection, anomaly detection, and governance.
  • Evolutionary time: research into how the (evolutionary lineages of) systems at the given level evolve long-term:
    • How the human psyche evolves when it is in a cyborg
    • How humans will evolve over generations as cyborgs
    • How groups, communities, and society evolve.
    • Designing feedback systems that don't let systems "drift" into undesired state over evolutionary time.
    • Considering system prop…
[-]Ryan Kidd3y*80

AI alignment threat models that are somewhat MECE (but not quite):

  • We get what we measure (models converge to the human++ simulator and build a Potemkin village world without being deceptive consequentialists);
  • Optimization daemon (deceptive consequentialist with a non-myopic utility function arises in training and does gradient hacking, buries trojans and obfuscates cognition to circumvent interpretability tools, "unboxes" itself, executes a "treacherous turn" when deployed, coordinates with auditors and future instances of itself, etc.);
  • Coordination failur
... (read more)
2Remmelt3y
Great overview! I find this helpful. Next to intrinsic optimisation daemons that arise through training internal to hardware, I suggest adding extrinsic optimising "divergent ecosystems" that arise through deployment and gradual co-option of (phenotypic) functionality within the larger outside world. AI safety research has so far focussed more on internal code (particularly CS/ML researchers) computed deterministically (within known statespaces, as mathematicians like to represent), rather than on complex external feedback loops that are uncomputable, given Good Regulator Theorem limits and the inherent noise interference on signals propagating through the environment (as would be intuitive for some biologists and non-linear dynamics theorists). So extrinsic optimisation is easier for researchers in our community to overlook. See this related paper by a physicist studying origins of life.
1Ryan Kidd3y
Cheers, Remmelt! I'm glad it was useful. I think the extrinsic optimization you describe is what I'm pointing toward with the label "coordination failures," which might properly be labeled "alignment failures arising uniquely through the interactions of multiple actors who, if deployed alone, would be considered aligned."
[-]Ryan Kidd3y75

Reasons that scaling labs might be motivated to sign onto AI safety standards:

  • Companies who are wary of being sued for unsafe deployment that causes harm might want to be able to prove that they credibly did their best to prevent harm.
  • Big tech companies like Google might not want to risk premature deployment, but might feel forced to if smaller companies with less to lose undercut their "search" market. Standards that prevent unsafe deployment fix this.

However, AI companies that don’t believe in AGI x-risk might tolerate higher x-risk than ideal safet... (read more)

[-]Ryan Kidd9mo*5-12

How fast should the field of AI safety grow? An attempt at grounding this question in some predictions.

  • Ryan Greenblatt seems to think we can get a 30x speed-up in AI R&D using near-term, plausibly safe AI systems; assume every AIS researcher can be 30x’d by Alignment MVPs
  • Tom Davidson thinks we have <3 years from 20%-AI to 100%-AI; assume we have ~3 years to align AGI with the aid of Alignment MVPs
  • Assume the hardness of aligning TAI is equivalent to the Apollo Program (90k engineer/scientist FTEs x 9 years = 810k FTE-years); therefore, we need ~
... (read more)
[-]Buck9mo128

I appreciate the spirit of this type of calculation, but think that it's a bit too wacky to be that informative. I think that it's a bit of a stretch to string these numbers together. E.g. I think Ryan and Tom's predictions are inconsistent, and I think that it's weird to identify 100%-AI as the point where we need to have "solved the alignment problem", and I think that it's weird to use the Apollo/Manhattan program as an estimate of work required. (I also don't know what your Manhattan project numbers mean: I thought there were more like 2.5k scientists/engineers at Los Alamos, and most of the people elsewhere were purifying nuclear material)

5Garrett Baker9mo
There's the standard software engineer response of "You cannot make a baby in 1 month with 9 pregnant women". If you don't have a term in this calculation for the amount of research hours that must be done serially vs the amount of research hours that can be done in parallel, then it will always seem like we have too few people, and should invest vastly more in growth growth growth! If you find that actually your constraint is serial research output, then you still may conclude you need a lot of people, but you will sacrifice a reasonable amount of growth speed for attracting better serial researchers.  (Possibly this shakes out to mathematicians and physicists, but I don't want to bring that conversation into here)
3Garrett Baker9mo
I also note that 30x seems like an under-estimate to me, but also too simplified. AIs will make some tasks vastly easier, but won't help too much with other tasks. We will have a new set of bottlenecks once we reach the "AIs vastly helping with your work" phase. The question to ask is "what will the new bottlenecks be, and who do we have to hire to be prepared for them?"  If you are uncertain, this consideration should lean you much more towards adaptive generalists than the standard academic crop.
[-]Ryan Kidd2y30

Types of organizations that conduct alignment research, differentiated by funding model and associated market forces:

  • Academic research groups (e.g., Krueger's lab at Cambridge, UC Berkeley CHAI, NYU ARG, MIT AAG);
  • Research nonprofits (e.g., ARC Theory, MIRI, FAR AI, Redwood Research);
  • "Mixed funding model" organizations:
    • "Alignment-as-a-service" organizations, where the product directly contributes to alignment (e.g., Apollo Research, Aligned AI, ARC Evals, Leap Labs);
    • "Alignment-on-the-side" organizations, where product revenue helps funds alignment research
... (read more)
[-]Ryan Kidd3y30

Can the strategy of "using surrogate goals to deflect threats" be countered by an enemy agent that learns your true goals and credibly precommits to always defecting (i.e., Prisoner's Dilemma style) if you deploy an agent against it with goals that produce sufficiently different cooperative bargaining equilibria than your true goals would?

3Anthony DiGiovanni3y
This is a risk worth considering, yes. It’s possible in principle to avoid this problem by “committing” (to the extent that humans can do this) to both (1) train the agent to make the desired tradeoffs between the surrogate goal and original goal, and (2) not train the agent to use a more hawkish bargaining policy than it would’ve had without surrogate goal training. (And to the extent that humans can’t make this commitment, i.e., we make honest mistakes in (2), the other agent doesn’t have an incentive to punish those mistakes.) If the developers do both these things credibly—and it's an open research question how feasible this is—surrogate goals should provide a Pareto improvement for the two agents (not a rigorous claim). Safe Pareto improvements are a generalization of this idea.
[-]Ryan Kidd2y10

MATS' goals:

  • Find + accelerate high-impact research scholars:
    • Pair scholars with research mentors via specialized mentor-generated selection questions (visible on our website);
    • Provide a thriving academic community for research collaboration, peer feedback, and social networking;
    • Develop scholars according to the “T-model of research” (breadth/depth/epistemology);
    • Offer opt-in curriculum elements, including seminars, research strategy workshops, 1-1 researcher unblocking support, peer study groups, and networking events;
  • Support high-impact
... (read more)
[-]Ryan Kidd3y10

"Why suicide doesn't seem reflectively rational, assuming my preferences are somewhat unknown to me," OR "Why me-CEV is probably not going to end itself":

  • Self-preservation is a convergent instrumental goal for many goals.
  • Most systems of ordered preferences that naturally exhibit self-preservation probably also exhibit self-preservation in the reflectively coherent pursuit of unified preferences (i.e., CEV).
  • If I desire to end myself on examination of the world, this is likely a local hiccup in reflective unification of my preferences, i.e., "failure of pres
... (read more)
[-]Ryan Kidd3y10

Are these framings of gradient hacking, which I previously articulated here, a useful categorization?

  1. Masking: Introducing a countervailing, “artificial” performance penalty that “masks” the performance benefits of ML modifications that do well on the SGD objective, but not on the mesa-objective;
  2. Spoofing: Withholding performance gains until the implementation of certain ML modifications that are desirable to the mesa-objective; and
  3. Steering: In a reinforcement learning context, selectively sampling environmental states that will either leave the mesa-objecti
... (read more)
[-]Ryan Kidd3y10

How does the failure rate of a hierarchy of auditors scale with the hierarchy depth, if the auditors can inspect all auditors below their level?

[-]Ryan Kidd3y*1-1

Are GPT-n systems more likely to:

  1. Learn superhuman cognition to predict tokens better and accurately express human cognitive failings in simulacra because they learned these in their "world model"; or
  2. Learn human-level cognition to predict tokens better, including human cognitive failings?
Mentioned in: AI Safety & Entrepreneurship v1.0