Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A whole lot of Alignment work seems to be resource-constrained. Many funders have talked about how they were only able to give grants to a small percentage of projects and work they found promising. Many researchers also receive a small fraction of what they could make in the for-profit sector (Netflix recently offered $900k for an ML position). The pipeline of recruiting talent, training, and hiring could be greatly accelerated if it wasn’t contingent on continuing to receive nonprofit donations.


Possible Ideas


AI Auditing Companies

We’ve already seen a bit of this with ARC’s eval of GPT-4, but why isn’t there more of this? Many companies are training, or will train, their own models, or else are using existing models in ways beyond what they were intended for. Even starting with non-cutting-edge models could provide insight and train people to have the proper Security Mindset and understanding to audit larger ones. Furthermore, there has been a push to regulate and make this a required practice. Whether this regulation becomes law will likely depend on the infrastructure for it already existing. And it makes sense to take action toward this now, if we want those auditing teams to be as useful as possible, and not merely satisfy a governmental requirement. Existential concerns would also be taken more seriously by a company that has already built a reputation for auditing models.

Evals reporting

Companies don’t want to see their models doing things that weren’t intended (for example, giving people credit card information, as was just recently demonstrated). And as time goes on, companies will want some way of showcasing that their models have been rigorously tested. Audit reports covering a large, diverse set of vulnerabilities are something many will probably want.

Red teaming

Jailbreaking has been a common practice, done by a large number of people after a model is released. As with an Evals Report, many will want a separate entity that can red team their models, the same way many tech companies hire an external cybersecurity company to provide a similar service.


Alignment as a service

This could bring in new talent and incentives toward building better understanding and talent to handle alignment. These services would be smaller scale, and would not tackle some of the “core problems” of alignment, but might provide pieces to the puzzle. Solving alignment may not be one big problem, but actually a thousand smaller problems. This gives market feedback, where the better approaches succeed more often than the worse approaches. Over time, this might steer us in a direction of actually coming up with solutions that can be scaled.


Offer procedures to better align models

Many companies will likely not know how to get their models to do the things they want them to, and they will want assistance to do it. This could start by assisting companies with basic RLHF, but might evolve to developing better methods. The better methods would be adopted by competing Alignment providers, who would also search for even better methods to provide.

Caveat: this might accelerate surface-level alignment while only furthering a false sense of security.


Alignment as a Product

This isn’t the ideal approach, but one still worth considering. Develop new proprietary strategies for aligning models, but don’t release them to the public. Instead, show the results of what these new strategies can do to companies, and sell them the strategy as a product. This might involve NDAs, which is why it is not an ideal approach. But an alignment strategy existing under an NDA is better than no strategy at all.


Mech Interp as a Service

This is perhaps not yet in reach, but might be in time. Many will want to better understand how their models are working. A team of mechanistic interpretability researchers could be given access to the model, and dive into gaining a better understanding of its architecture and what it’s actually doing, providing a full report of their findings as a service. This might also steer Mech Interp toward methods that have actual predictive value.

Caveat: I’m not too confident about Mech Interp being useful for safety, with the downside that it might be useful for capabilities.


Governance Consultation as a Service

Many politicians and policy makers are currently overwhelmed with a problem they have little technical understanding of. A consultation service would provide them with the expertise and security understanding to offer policy advice that would actually be useful. The current situation seems to be taking experts who are already severely time-constrained, and getting their advice for free. I think many would pay for this service, since there are demands for legislation, and they don’t have the understanding to do it on their own.


Alignment Training as a Service

Offering to train workers currently at AI companies to understand security concerns, alignment strategies, and other problems might be desired by many companies. An independent company could train workers to better understand concepts that many are probably not used to dealing with.


Future Endowment Fund

This is the one that’s the furthest away from normal ideas, but I’d love it if more people tried to hack a solution to this. The biggest issue is that the value from alignment research has a time delay. This solution could be something like a Promise of Future Equity contract. Those that do research would receive a promised future share in the Fund, as would investors. Companies that use anything that was funded by the Endowment would sign something like a Promise of Future Returns, delegating a share of the returns of any model that used the strategy to the fund. This way, people who were also working on alignment strategies that only had a 5% chance of working would still get reimbursement for their work. Those working on strategies with a calculated higher chance of working would get a greater share. The Trustees would be members of the community who are highly credible, and who have deep levels of insight about AI.
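As a rough illustration of the allocation idea, here is a minimal Python sketch. It assumes the simplest possible rule, which is my assumption and not a worked-out proposal: each project's share of the researcher pool is proportional to its estimated chance of working. The project names, the probabilities, and the `allocate_shares` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str              # hypothetical label for a research agenda
    success_chance: float  # assumed estimate in (0, 1], e.g. set by the Trustees


def allocate_shares(projects: list[Project]) -> dict[str, float]:
    """Split the researcher pool in proportion to estimated success chance.

    Proportional weighting is an assumption made for illustration only.
    """
    total = sum(p.success_chance for p in projects)
    if total <= 0:
        raise ValueError("at least one project needs a positive success chance")
    return {p.name: p.success_chance / total for p in projects}


# Hypothetical projects and estimates, not real data.
projects = [
    Project("long_shot", 0.05),
    Project("moderate_bet", 0.20),
    Project("safer_bet", 0.25),
]
shares = allocate_shares(projects)
for name, share in shares.items():
    print(f"{name}: {share:.0%} of the researcher pool")
```

Under this rule the 5% long shot still receives a nonzero share, while higher-probability work receives more. A real fund would also have to decide who produces the estimates and how to keep them from being gamed.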


If you are interested in making progress on any of these endeavors, feel free to message me. I’ve worked in Cybersecurity, so I have a good understanding of how the auditing pipeline normally works at such companies.

If you have any disagreements with some of these approaches (which I’m sure some do), feel free to tell me why I’m wrong in the comments.

12 comments

The reason I'm not excited about these personally is because I think the profitable parts of alignment are not the ones that people concerned about the big picture need to differentially work on.

Yeah, for this reason I'd be excited for people with the relevant technical chops to try to start more "normal" AI or ML companies and just earn-to-give for alignment orgs or funds. Of course there are also some downside risks; preferably people trying this should also put in some effort to minimize capabilities externalities (eg work on ML applications rather than train frontier models, don't publish by default).

If you're reading this comment, and you have technical chops, then I think AI auditing or (for quite different sorts of technical chops) government consultation are great businesses to try to go into, that are very likely better than earning to give.

Alignment of present-day models or interpretability as a service I'm not so excited about.

It's just that more than any of these things, I'm excited about actually trying to understand what technologies are necessary for building friendly AI, and making those technologies happen faster than they otherwise would. Even if you're 1/3 as productive at this as at starting the business, I'd think this is a better thing to be doing if you can.

I think both Leap Labs and Apollo Research (both fairly new orgs) are trying to position themselves as offering model auditing services in the way you suggest.

I think that AI safety could be a service (as in: we will make your model safe, controllable, and compliant with regulations), and such a service can be sold.

Moreover, it is better to have AI safety as a paid service than AI safety based only on forceful regulation adopted out of fear: more people will want it, since it will save them money.

Note that AI safety as a commercial service does not exclude fines and bombing data centers for those who decide not to have any certified AI safety. Such threats only increase the motivation to subscribe. But many will buy such services only to save money, not out of fear.

However, for now it looks like AI safety is presented as a purely altruistic and non-commercial endeavor, and this may actually preclude its wider adoption. But someone will eventually earn a lot of money and become a billionaire selling AI safety as a service.

If you want to try this, or know someone who does, let me know. I've noticed a lot of things in AIS that people will say they'd like to see, but then nothing happens.

I think I can't do it alone.

But I actually submitted a grant proposal that includes exploring this idea.

Over the last 3 months, I've spent some time thinking about mech interp as a for-profit service. I've pitched to one VC firm, interviewed for a few incubators/accelerators including Y Combinator, sent out some pitch documents, co-founder dated a few potential cofounders, and chatted with potential users and some AI founders.

There are a few issues:

First, as you mention, I'm not sure if mech interp is yet ready to understand models.  I recently interpreted a 1-layer model trained on a binary classification function and am currently working on understanding a 1-layer language model (TinyStories-1Layer-21M). TinyStories is (much?) harder than the binary classification network (which took 24 focused days of solo research).  This isn't to say I or someone else won't have an idea how 1 layer models work a few months from now.  Once this happens, we might want to interpret multi-layer models before being ready to interpret models that are running in production.

Second, outsiders can observe that mech interp might not be far enough along to build a product around.  The feedback I received from the VC firm and YC was that my ideas weren't far enough along.

Third, I personally have not yet been able to find someone I'm excited to be cofounders with.  Some people have different visions in terms of safety (some people just don't care at all).  Other people who I share a vision with, I don't match with for other reasons.

Fourth, I'm not certain that I've yet found that ideal first customer - some people seem to think it's nice to have, but frequently with language models, if you get a bad output, you can just run it again (keeping a human in the loop).  To be clear, I haven't given up on finding that ideal customer, and it could be something like government or that customer might not exist until AI models do something really bad.

Fifth, I'm unsure if I actually want to run a company.  I love doing interp research and think I am quite good at it (among other things, having a software background, a PhD in Robotics, and solving puzzles).  I consider myself a 10x+ engineer.  At least right now, it seems like I can add more value by doing independent research rather than running a company.

For me, the first issue is the main one. Once interp is farther along, I'm open to putting more time into thinking about the other issues. If anyone reading this is potentially interested in chatting, feel free to DM me.

I think much of this is right, which is why, as an experienced startup founder that's deeply concerned about AI safety & alignment, I'm starting a new AI safety public benefit corp startup, called Harmony Intelligence. I recently gave a talk on this at VAISU conference: slides and recording.

If what I'm doing is interesting to you and you'd like to be involved or collaborate, please reach out via the contact details on the last slide linked above.

AI Auditing Companies

In traditional auditing fields like finance, fraud, and IT, established frameworks make it relatively easy for any licensed company or practitioner to implement audits. However, since we haven't yet solved the alignment problem, it's challenging to streamline best practices and procedures. As a result, it is not yet feasible for companies to provide credible audits in this area. I'm skeptical of any company that claims it can evaluate a technology that is not yet fully understood.

I think folks in AI Safety tend to underestimate how powerful and useful liability and an established duty of care would be for this.
