Coordinating alignment efforts seems difficult when the work involves information that would ideally remain private. I'm not aware of any existing infrastructure to help with this, nor are the people I've asked so far. My understanding is that there are informal social networks which serve this purpose to some degree (e.g. passing Google Docs around), but it's difficult to safely scale these up.
I've seen some attempts to help manage potentially risky information. For example, see Conjecture's recent policy discussion, "Conjecture: Internal Infohazard Policy", on LessWrong.
These policies are primarily interested in containing information. That's important, but it sure would be nice to have a system that simultaneously helps with both privacy management and coordination.
What kind of use cases could be targeted?
Motivating hypothetical example 1:
Hypothetical me is an independent researcher focusing on characterizing out of distribution behavior. I stumble across One Weird Trick for rapid grokking and generalization. I've run a variety of benchmarks and, aimed properly, it does remarkably well on HHH-style benchmarks. Unfortunately, the technique is trivially applicable to capability research; I observe extremely rapid training convergence and perplexity beating text-davinci-002 in a 2B model I trained on my workstation.
It seems like I should tell someone about the fact that this exists, on the off chance it interacts positively with some other safety work, but telling the wrong person could shorten timelines by approximately 100%. I'm not deeply embedded in the field and don't have high-trust working relationships with anyone. Now what? Send an e-mail to Eliezer composed mostly of AAAAAA?
Motivating hypothetical example 2:
I'm another independent researcher looking to start a new alignment research project, but it's a topic which lightly touches on capability. The capability parts are already out in the public literature, but they're in an obscure subfield and I suspect mostly unknown by mainstream capabilities research. I worry that drawing attention to it could be a net negative even if my work is not capability focused.
I think the research has potentially significant alignment value. I can't find any other related work in alignment. I'm not sure if this is because other people aren't publishing about it, if I'm not looking in the right spots, or if this is actually an unexplored topic.
I could ask in one of the public alignment-relevant communities, but I'd like to avoid leaking capability-relevant information if possible. Without already having lots of high-trust connections in the field, how can I find out if I'm doing redundant work, how can I get feedback about my research plans, and how could I safely tell people about the work if I do it?
Motivating hypothetical example 3:
I'm working in one of Googetasoft's seven AI departments. Some internal research incidentally found something that has some very interesting implications for alignment, but I strongly doubt the company is going to pursue the line of research because it isn't relevant for increasing capability. And they're also not interested in publishing it, because it isn't relevant for increasing capability.
I've poked around the department, trying to piece together enough interest to get the relevant bits released. I've gotten some positive responses, but also some pushback. Despite the company not being very interested in it, because it isn't relevant for increasing capability, the information is treated as risky to release because it could reveal future product efforts.
Given the difficulty of separating the information from the implications it has for product development, and given the lack of secure channels for distributing the information, I give up.
What could help?
This is partially a social problem, but social problems are hard. How about transforming it into a technical problem?
Consider an independent system designed as a central coordination hub for anything that isn't obviously safe to be public. Researchers from any organization (or no organization) could contribute. The above use cases imply some desiderata:
- It is easy to submit stuff. "Stuff" could be anything from questions to entire papers, their supporting datasets, and code for replication. Stuff could be given minimum access requirements; not every person with any read permission at all could see every bit of information.
- It is not easy to read stuff. Getting read permissions would require explicit approval. Getting read permissions for more sensitive information would be more difficult. Only a handful of people would be able to read data marked as highly sensitive, and granting additional strong read permissions should be rare.
- Privacy/secrecy should be maintained in both a technical and legal sense. Beyond the direct practical value of contractual secrecy, larger organizations would likely want a strong legal guarantee that the people who see their private information aren't going to just tell everyone about it.
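To make the first two desiderata concrete, here is a minimal sketch of "easy to submit, hard to read". Everything in it is an assumption for illustration: the three `Sensitivity` tiers, the `Hub` class, and its method names are hypothetical, and the approval process behind `grant` (along with the legal side of the third bullet) is left out entirely.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity tiers; the post does not name specific levels."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Item:
    author: str
    title: str
    body: str
    min_level: Sensitivity  # minimum clearance needed to read this item


@dataclass
class Hub:
    items: list = field(default_factory=list)
    # Read clearance per user; users absent from this dict cannot read anything.
    clearances: dict = field(default_factory=dict)

    def submit(self, author, title, body, min_level):
        """Anyone may submit, whether or not they hold read access."""
        item = Item(author, title, body, min_level)
        self.items.append(item)
        return item

    def grant(self, user, level):
        """Record an explicit approval; how approval is decided is out of scope."""
        self.clearances[user] = level

    def can_read(self, user, item):
        level = self.clearances.get(user)
        return level is not None and level >= item.min_level

    def visible_to(self, user):
        return [item for item in self.items if self.can_read(user, item)]


hub = Hub()
hub.submit("independent_researcher", "Question about related work", "...", Sensitivity.LOW)
hub.submit("independent_researcher", "Replication code", "...", Sensitivity.HIGH)
hub.grant("reviewer_a", Sensitivity.MEDIUM)

print([item.title for item in hub.visible_to("reviewer_a")])
print(hub.visible_to("independent_researcher"))
```

In this sketch the submitter has no clearance, so they can contribute without being able to browse anything, including their own submissions. Whether authors should keep access to what they submit is a design choice the sketch does not settle.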
What would users do?
In addition to information submission and browsing, something like a private forum may be useful. Mismatched read permissions would get a little weird and I expect the level of participation would be low compared to the public alignment forum, but forumlike persistent communication is often helpful.
Users without read access can't seek out private information by themselves, so if this system is intended to serve as a coordination hub for such users, at least some people with read access are going to get stuck with a responsibility to monitor incoming information and offer guidance. The field's pretty tiny and the percentage of work that would need to go through this system isn't enormous, so I would hope it would be manageable. If the field grows substantially, this might be a bottleneck. (A good problem to have, maybe.)
Technical security concerns
This system is not intended as a 'repository of doom'. It's not all the most dangerous secrets collected in one spot. It's alignment stuff first, just alignment stuff that sometimes overlaps capability a little more than is ideal for publishing.
While a healthy amount of technical security would be nice (the basics, no unencrypted stuff sitting anywhere, etc.; an enterprise user should feel comfortable with the security measures), the fact that this is not a repository of doom means that it doesn't warrant over-the-top security theater. Threat models like 'a person with read access gets kidnapped' can probably be safely ignored. It is way easier to just hire someone senior from a capabilities-focused organization for a lot of money than to execute spy movie plots.
(Hypothetical #1 above has an exaggerated premise; that particular researcher might want to be extra careful and actually do the screaming thing. I'd expect roughly all situations matching #1 in structure would involve dramatically less spooky details.)
Organizations may want to avoid leaking certain sensitive information to employees of competing organizations. Read permissions might need to take this into account. (That is, read permissions are not just a hierarchy of increasing access, but rather something more potentially arbitrary.)
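One way to picture non-hierarchical read permissions is as set membership plus per-document exclusions, rather than a single clearance number. The sketch below is only an assumption about how that could look: `Reader`, `Document`, the compartment labels, and the organization names `LabA` and `LabB` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reader:
    name: str
    org: str
    compartments: frozenset  # topics this reader has been approved for


@dataclass(frozen=True)
class Document:
    title: str
    required: frozenset  # compartments a reader must hold
    excluded_orgs: frozenset = frozenset()  # orgs the submitter wants kept out


def can_read(reader, doc):
    """A reader needs every required compartment and must not be at an excluded org."""
    if reader.org in doc.excluded_orgs:
        return False
    return doc.required <= reader.compartments


alice = Reader("alice", "LabA", frozenset({"interpretability"}))
bob = Reader("bob", "LabB", frozenset({"interpretability", "scaling"}))

doc = Document(
    title="Internal finding with product implications",
    required=frozenset({"interpretability"}),
    excluded_orgs=frozenset({"LabB"}),
)

print(can_read(alice, doc))
print(can_read(bob, doc))
```

Here `bob` holds strictly more compartments than `alice` but is still refused, which is the sense in which access is not a simple ladder of increasing privilege.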
Depending on the nature of the confidentiality contracts, people with read access may end up being considered legally 'tainted' by organizations interested in capabilities research. That could limit engagement by people within those organizations who might otherwise want to participate. Some care would be required in crafting the contracts.
Research incentive concerns
The existence of a place to submit capability-related alignment research may encourage more capability-related alignment research. This could be a problem, but I suspect the information being kept private will make it harder for the research to be net-negative. The people who would be most encouraged to do more capability-touching research by this system are precisely those people who were most concerned about the dangers of publishing it before; if the only place it's being published is good at keeping secrets and only distributing knowledge in a way that assists alignment, the risk is much lower.
It would not surprise me if there are paths to important alignment breakthroughs that involve significant capability gain with current architectures (for example, preventing wonky out of distribution behavior). The apparent current incentives to avoid working on anything that could impact capability for fear of the wrong people hearing about it may push us away from critical research.
Further, a lot of alignment work falls into a grey area nearly unavoidably. For example, proving a strong computational complexity bound on a particular type of architecture could be useful information for alignment purposes, but it could simultaneously point to which architectures are more theoretically capable.
I think we'd be better off if we had a standard way of handling this kind of research.
I think something like this system could help a lot with coordination in a practical way, and may have a bonus effect of tightening relationships between alignment teams across organizations.
Does anything like this already exist, or is anything like this being created somewhere? If not, is it time to create it?