I work primarily on AI Alignment. My main direction at the moment is to accelerate alignment work via language models and interpretability.

I recently sent in some grant proposals to continue working on my independent alignment research. They give an overview of what I'd like to work on for this next year (and beyond, really). If you want to have a look at the full doc, send me a DM. If you'd like to help out through funding or contributing to the projects, please let me know.

Here's the summary introduction:

12-month salary for building a language model system for accelerating alignment research and upskilling (additional funding will be used to create an organization), and studying how to supervise AIs that are improving AIs to ensure stable alignment.


  • Agenda 1: Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research. Could use additional funding to hire an engineer and builder, which could evolve into an AI Safety organization focused on this agenda. Recent talk giving a partial overview of the agenda.
  • Agenda 2: Supervising AIs Improving AIs (through self-training or training other AIs). Publish a paper and create an automated pipeline for discovering noteworthy changes in behaviour between the precursor and the fine-tuned models. Short Twitter thread explanation.
  • Other: create a mosaic of alignment questions we can chip away at, better understand agency in the current paradigm, outreach, and mentoring.

As part of my Accelerating Alignment agenda, I aim to create the best Alignment Research Assistant: a suite of large language models (LLMs) that helps researchers (like myself) quickly produce better alignment research. The system will be designed to serve as the foundation for the ambitious goal of increasing alignment productivity by 10-100x during crunch time (in the year leading up to existentially dangerous AGI). The goal is to significantly augment current alignment researchers while also providing a system for new researchers to quickly get up to speed on alignment research or promising parts they haven’t engaged with much.

The Supervising AIs Improving AIs agenda focuses on ensuring stable alignment when AIs self-train or train new AIs, and studies how AIs may drift through iterative training. We aim to develop methods to ensure automated science processes remain safe and controllable. This form of AI improvement focuses more on data-driven improvements than architectural or scale-driven ones.

I’m seeking funding to continue my work as an independent alignment researcher and intend to work on what I’ve just described. However, to best achieve the project’s goal, I would want additional funding to scale up the efforts for Accelerating Alignment to develop a better system faster with the help of engineers so that I can focus on the meta-level and vision for that agenda. This would allow me to spread myself less thin and focus on my comparative advantages. If you would like to hop on a call to discuss this funding proposal in more detail, please message me. I am open to refocusing the proposal or extending the funding.

That's fair to 'aspire to a higher standard,' and I'll avoid adding screenshots of text in the future.

However, I must say, the 'higher standard' and commitment to remain serious for even a shortform post kind of turns me off from posting on LessWrong in the first place. If this is the culture that people here want, then that's fine and I won't tell this website to change, but I don't personally like (what I find to be) the over-seriousness.

I do understand the point about sharing text to make it easier for disabled people (I just don't always think of it).

Just wanted to mention that, though this is not currently the case, there are two instances I can think of where the AI can be a jailbreaker:

  1. Jailbreaking the reward model to get a high score. (Toy-ish example here.)
  2. Autonomous AI agents embedded within society jailbreak other models to achieve a goal/sub-goal.

More information about alleged manipulative behaviour of Sam Altman


Text from article (along with follow-up paragraphs):

Some members of the OpenAI board had found Altman an unnervingly slippery operator. For example, earlier this fall he’d confronted one member, Helen Toner, a director at the Center for Security and Emerging Technology, at Georgetown University, for co-writing a paper that seemingly criticized OpenAI for “stoking the flames of AI hype.” Toner had defended herself (though she later apologized to the board for not anticipating how the paper might be perceived). Altman began approaching other board members, individually, about replacing her. When these members compared notes about the conversations, some felt that Altman had misrepresented them as supporting Toner’s removal. “He’d play them off against each other by lying about what other people thought,” the person familiar with the board’s discussions told me. “Things like that had been happening for years.” (A person familiar with Altman’s perspective said that he acknowledges having been “ham-fisted in the way he tried to get a board member removed,” but that he hadn’t attempted to manipulate the board.)

Altman was known as a savvy corporate infighter. This had served OpenAI well in the past: in 2018, he’d blocked an impulsive bid by Elon Musk, an early board member, to take over the organization. Altman’s ability to control information and manipulate perceptions—openly and in secret—had lured venture capitalists to compete with one another by investing in various startups. His tactical skills were so feared that, when four members of the board—Toner, D’Angelo, Sutskever, and Tasha McCauley—began discussing his removal, they were determined to guarantee that he would be caught by surprise. “It was clear that, as soon as Sam knew, he’d do anything he could to undermine the board,” the person familiar with those discussions said.

The unhappy board members felt that OpenAI’s mission required them to be vigilant about A.I. becoming too dangerous, and they believed that they couldn’t carry out this duty with Altman in place. “The mission is multifaceted, to make sure A.I. benefits all of humanity, but no one can do that if they can’t hold the C.E.O. accountable,” another person aware of the board’s thinking said. Altman saw things differently. The person familiar with his perspective said that he and the board had engaged in “very normal and healthy boardroom debate,” but that some board members were unversed in business norms and daunted by their responsibilities. This person noted, “Every step we get closer to A.G.I., everybody takes on, like, ten insanity points.”

Likely this podcast episode where Bostrom essentially says that he's concerned that with current trends there might be too much opposition to AI, though he still thinks we should place more concern than our current level of concern: 

Hopefully this gets curated because I’d like for there to be a good audio version of this.

I don’t particularly care about the “feels good” part, I care a lot more about the “extended period of time focused on an important task without distractions” part.

Whether it’s a shitpost or not (or wtv tier it is), I strongly believe more people should put more effort into freeing their workspace from distractions in order to gain more focus and productivity in their work. Context-switching and distractions are the mind killer. And, “flow state while coding never gets old.”

Also, use Kolb's experiential learning cycle or something like it for deliberate practice.

