Coordinating alignment efforts seems difficult when the work involves information that would ideally remain private. I'm not aware of any existing infrastructure to help with this, nor are the people I've asked so far. My understanding is that there are informal social networks which serve this purpose to some degree (e.g. passing google docs around), but it's difficult to safely scale these up.

I've seen some attempts to help manage potentially risky information. For example, Conjecture recently published their Internal Infohazard Policy on LessWrong.

These policies are primarily concerned with containing information. That's important, but it sure would be nice to have a system that helps with both privacy management and coordination.

What kind of use cases could be targeted?

Motivating hypothetical example 1: 

Hypothetical me is an independent researcher focusing on characterizing out-of-distribution behavior. I stumble across One Weird Trick for rapid grokking and generalization. I've run a variety of evaluations and, aimed properly, the technique does remarkably well on HHH-style benchmarks. Unfortunately, it is trivially applicable to capability research; I observe extremely rapid training convergence and perplexity beating text-davinci-002 in a 2B model I trained on my workstation.

It seems like I should tell someone about the fact that this exists, on the off chance it interacts positively with some other safety work, but telling the wrong person could shorten timelines by approximately 100%. I'm not deeply embedded in the field and don't have high-trust working relationships with anyone. Now what? Send an e-mail to Eliezer composed mostly of AAAAAA?

Motivating hypothetical example 2: 

I'm another independent researcher looking to start a new alignment research project, but it's a topic which lightly touches on capabilities. The capability parts are already out in the public literature, but they're in an obscure subfield and I suspect mostly unknown to mainstream capabilities researchers. I worry that drawing attention to it could be a net negative even if my work is not capability-focused.

I think the research has potentially significant alignment value. I can't find any other related work in alignment. I'm not sure if this is because other people aren't publishing about it, if I'm not looking in the right spots, or if this is actually an unexplored topic.

I could ask in one of the public alignment-relevant communities, but I'd like to avoid leaking capability-relevant information if possible. Without already having lots of high-trust connections in the field, how can I find out whether I'm doing redundant work, how can I get feedback on my research plans, and how can I safely tell people about the work if I do it?

Motivating hypothetical example 3: 

I'm working in one of Googetasoft's seven AI departments. Some internal research incidentally found something that has some very interesting implications for alignment, but I strongly doubt the company is going to pursue the line of research because it isn't relevant for increasing capability. And they're also not interested in publishing it, because it isn't relevant for increasing capability.

I've poked around the department, trying to piece together enough interest to get the relevant bits released. I've gotten some positive responses, but also some pushback: despite not being very interested in the work, because it isn't relevant for increasing capability, the company treats the information as risky to release because it could reveal future product efforts.

Given the difficulty of separating the information from the implications it has for product development, and given the lack of secure channels for distributing the information, I give up.

What could help?

This is partially a social problem, but social problems are hard. How about transforming it into a technical problem?

Consider an independent system designed as a central coordination hub for anything that isn't obviously safe to make public. Researchers from any organization (or no organization) could contribute. The above use cases imply some desiderata:

  1. It is easy to submit stuff. "Stuff" could be anything from questions to entire papers, their supporting datasets, and code for replication. Stuff could be given minimum access requirements; not every person with any read permission at all could see every bit of information.
  2. It is not easy to read stuff. Getting read permissions would require explicit approval, and permissions for more sensitive information would be harder to obtain. Only a handful of people would be able to read data marked as highly sensitive, and granting additional strong read permissions should be rare. (A rough sketch of what such an access model could look like follows this list.)
  3. Privacy/secrecy should be maintained in both a technical and legal sense. Beyond the direct practical value of contractual secrecy, larger organizations would likely want a strong legal guarantee that the people who see their private information aren't going to just tell everyone about it.
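
To make desiderata 1 and 2 a bit more concrete, here's a minimal sketch in Python of what the submission/read model could look like. Everything here is illustrative: the class names, the three-level sensitivity scale, and the clearance rule are assumptions for the sake of the example, not part of any existing system.

```python
# Illustrative sketch only: the class names, levels, and rules here are hypothetical.
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Sensitivity(IntEnum):
    """Minimum access requirement attached to a submission (desideratum 1)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Submission:
    title: str
    body: str
    sensitivity: Sensitivity


@dataclass
class Reader:
    name: str
    # Explicit, individually granted clearance (desideratum 2); default is no read access.
    clearance: Optional[Sensitivity] = None


def can_read(reader: Reader, submission: Submission) -> bool:
    """Submitting is easy for anyone; reading requires explicit approval,
    and only up to the reader's granted clearance."""
    return reader.clearance is not None and reader.clearance >= submission.sensitivity


if __name__ == "__main__":
    note = Submission("one weird grokking trick", "...", Sensitivity.HIGH)
    alice = Reader("alice", clearance=Sensitivity.MEDIUM)
    bob = Reader("bob")  # never approved for read access
    print(can_read(alice, note))  # False: clearance too low for HIGH material
    print(can_read(bob, note))    # False: no clearance at all
```

The point is just the asymmetry: constructing a Submission requires nothing, while can_read returns False unless someone has explicitly granted a sufficient clearance.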

What would users do?

In addition to information submission and browsing, something like a private forum may be useful. Mismatched read permissions would get a little weird, and I expect participation would be low compared to the public Alignment Forum, but forum-like persistent communication is often helpful.

Users without read access can't seek out private information by themselves, so if this system is intended to serve as a coordination hub for such users, at least some people with read access are going to get stuck with a responsibility to monitor incoming information and offer guidance. The field's pretty tiny and the percentage of work that would need to go through this system isn't enormous, so I would hope it would be manageable. If the field grows substantially, this might be a bottleneck. (A good problem to have, maybe.)

Technical security concerns

This system is not intended as a 'repository of doom'. It's not all the most dangerous secrets collected in one spot. It's alignment stuff first; just alignment stuff that sometimes overlaps with capabilities a little more than is ideal for publishing.

While a healthy amount of technical security would be nice (the basics: no unencrypted data sitting anywhere, and so on; an enterprise user should feel comfortable with the security measures), the fact that this is not a repository of doom means it doesn't warrant over-the-top security theater. Threat models like 'a person with read access gets kidnapped' can probably be safely ignored. It is way easier to just hire someone senior away from a capabilities-focused organization for a lot of money than to execute spy-movie plots.

(Hypothetical #1 above has an exaggerated premise; that particular researcher might want to be extra careful and actually do the screaming thing. I'd expect roughly all situations matching #1 in structure to involve dramatically less spooky details.)

Organizational concerns

Organizations may want to avoid leaking certain sensitive information to employees of competing organizations. Read permissions might need to take this into account; that is, read permissions would not be just a hierarchy of increasing access, but something potentially more arbitrary.
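
As a hedged sketch of what "something more arbitrary" could mean in practice (the tag names, organizations, and exclusion rule below are invented for illustration): instead of a single clearance level, each submission could carry a set of required tags plus a list of excluded organizations, and a reader qualifies only if they hold every required tag and aren't at an excluded organization.

```python
# Hypothetical illustration of non-hierarchical read permissions; all names are invented.
from dataclasses import dataclass, field
from typing import FrozenSet, Set


@dataclass
class Reader:
    name: str
    organization: str
    tags: Set[str] = field(default_factory=set)  # individually granted access tags


@dataclass
class Submission:
    title: str
    required_tags: FrozenSet[str]                 # reader must hold every one of these
    excluded_orgs: FrozenSet[str] = frozenset()   # organizations the submitter wants walled off


def can_read(reader: Reader, submission: Submission) -> bool:
    """Access is a set-membership check, not a comparison of levels."""
    if reader.organization in submission.excluded_orgs:
        return False
    return submission.required_tags <= reader.tags


if __name__ == "__main__":
    paper = Submission(
        "OOD behavior write-up",
        required_tags=frozenset({"interpretability", "capability-adjacent"}),
        excluded_orgs=frozenset({"Googetasoft"}),
    )
    carol = Reader("carol", "TinySafetyOrg", {"interpretability", "capability-adjacent"})
    dave = Reader("dave", "Googetasoft", {"interpretability", "capability-adjacent"})
    print(can_read(carol, paper))  # True: holds all required tags, org not excluded
    print(can_read(dave, paper))   # False: competitor organization is excluded
```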

Depending on the nature of the confidentiality contracts, people with read access may end up being considered legally 'tainted' by organizations interested in capabilities research. That could limit engagement by people within those organizations who might otherwise want to participate. Some care would be required in crafting the contracts.

Research incentive concerns

The existence of a place to submit capability-related alignment research may encourage more capability-related alignment research. This could be a problem, but I suspect the information being kept private will make it harder for the research to be net-negative. The people who would be most encouraged to do more capability-touching research by this system are precisely those people who were most concerned about the dangers of publishing it before; if the only place it's being published is good at keeping secrets and only distributing knowledge in a way that assists alignment, the risk is much lower.

It would not surprise me if there are paths to important alignment breakthroughs that involve significant capability gain with current architectures (for example, preventing wonky out-of-distribution behavior). The apparent current incentive to avoid working on anything that could impact capability, for fear of the wrong people hearing about it, may push us away from critical research.

Further, a lot of alignment work falls into a grey area almost unavoidably. For example, proving a strong computational complexity bound on a particular type of architecture could be useful information for alignment purposes, but it could simultaneously point to which architectures are more theoretically capable.

I think we'd be better off if we had a standard way of handling this kind of research.

What now?

I think something like this system could help a lot with coordination in a practical way, and may have a bonus effect of tightening relationships between alignment teams across organizations.

Does anything like this already exist, or is anything like this being created somewhere? If not, is it time to create it?

Comments

Following up on this:

I've contacted a number of people in the field regarding this idea (thanks to everyone who responded!).

The general vibe is "this seems like it could be useful, maybe, if it took off," but it did not appear to actually solve the problems any specific person I contacted was having.

My expectation would be that people in large organizations would likely not publish anything in this system that they would not publish out in the open.

One of the critical pieces of this proposal is having people willing to coordinate across access boundaries. There were zero enthusiastic takers for that kind of burden, which doesn't surprise me too much. Without a broad base of volunteers for that kind of task, though, this idea seems to require paying a group of highly trusted and well-informed people to manage coordination instead of, say, researching things. That seems questionable.

Overall, I still think there is an important hole in coordination, but I don't think this proposal fills it well. I'm not yet sure what a better shape would be.

it did not appear to actually solve the problems any specific person I contacted was having.

I think it's important to realise (including for the people whom you spoke to) that we are not in the business of solving specific problems that researchers have individually (or as a small research group), but a collective coordination problem, i.e., a problem with the design of the collective, civilisational project of developing non-catastrophic AI.

I wrote a post about this.

True! I just think the specific system I proposed required:

  1. significant time investments on the part of organizers (requiring intrinsic interest or funding for individuals with the requisite knowledge and trustworthiness)
  2. a critical mass of users (requiring that a nontrivial fraction of people would find some value in the system)

The people who could serve as the higher-level organizers are few and are typically doing other stuff, and a poll of a dozen people coming back with zero enthusiastic takers makes point 2 seem iffy. My default expectation is that the system as described would just end up unused.

I'm pretty sure there exists some system design that would fare better, so I definitely encourage poking at this type of thing!

The system that I proposed is simpler: it doesn't have fine-grained and selective access, and therefore doesn't require continuous efforts on the part of some people to "connect the dots". It's just a single space, basically like the internal Notion + Slack space + Google Drive of the AI safety lab that would lead this project. In this space, people can share research and ideas, and have "mildly" infohazardous discussions, such as about the pros and cons of different approaches to building AGI.

I cannot imagine that system would end up unused. At least three people (you, me, and another person) felt enough frustration to commit time to writing on LW about this problem. All three of these posts were well received, with comments like "yes, I agree this is a problem". Another AI safety researcher said to me in private communication that he feels this problem, too. So, I suspect a large fraction of all AI safety researchers stumble into capability ideas regularly now and spend a significant portion of their mental cycles trying to manage this and still publish something in public.

As Nate Soares wrote in his 2018 post announcing a nondisclosure-by-default strategy, "researchers shouldn't have walls inside their minds".

This is an excellent idea. An encrypted, airgapped, or paper library that coordinates between AI researchers seems crucial for AGI safety.

This is because we should expect, in the worst-case scenario, that AGI will be trained on the whole Internet, including any online discussion of our interpretability tools, security research, and so on. This is information that the AGI can use against us (e.g., by turning our interpretability tools against us to hack, deceive, or otherwise socially engineer the alignment researchers).

Security through obscurity can buy us more chances at aligning/retraining the AGI before it escapes into the Internet. We should keep our battle plans close to our chest, instead of posting it online for the AGI to see.

I suppose that's an additional consideration. Keeping potentially concerning material out of trivially scraped training sets is pretty low cost and worth it.

I wouldn't want to sacrifice much usability beyond the standard security measures to focus on that angle, though; that would mean trying to directly fight a threat which is 1. already able to misuse observed research, 2. already able to otherwise socially or technically engineer its way to gaining access to that research, and 3. somehow not already massively lethal without that research.

In general, it is much easier to keep potentially concerning material out of the AGI’s training set if it’s already a secret rather than something that’s been published on the Internet. This is because there may be copies, references, and discussions of the material elsewhere in the training set that we fail to catch.

If it’s already posted on the Internet and it’s too late, we should of course still try our best to keep it out of the training set.

As for the question of “should we give up on security after AGI attains high capabilities?” we shouldn’t give up as long as our preparation could non-negligibly increase our probability of escaping doom, even if the probability increase is small. We should always maximize expected utility, even if we are probably doomed.

This is a problem for me, a few days after starting to (try to) do this kind of research. Any directions?

 

The main reason for me is that I want feedback on my ideas, to push me away from directions that are totally useless, which I'm afraid of falling into since I'm not an experienced researcher.

Provided your work stays within the boundary of safe stuff, or stuff that is already very well known, asking around in public should be fine.

If you're working with questionable stuff that isn't well known, that does get trickier. One strategy is to just... not work on that kind of thing. I've dropped a few research avenues for exactly that reason.

Other than that, getting to know people in the field or otherwise establishing some kind of working relationship could be useful. More organized versions of this could look like Refine, AI Safety Camp, or SERI MATS; or, if you get a grant somewhere, you could try talking to someone at the organization about your research path.

And as long as you're generally polite, not too pushy, and not asking too much, you'll probably find a lot of people willing to respond to DMs or e-mails. Might as well let them make the decision that they don't want to spend the time to respond rather than assuming it ahead of time. (I'd be willing to try answering questions now and again, but... I am by no means an authority in this field. I only very recently got a grant to start working on this for realsies.)

It would be really nice to figure out something to cover this use case in a more organized way that wouldn't require the kinds of commitments that mentorships imply. I'm kind of wondering about just setting up a registry of 'hey I know things and I'm willing to answer questions sometimes' people. Might already exist somewhere.

[I also just got funded (FTX) to work on this for realsies 😸🙀 ]

I'm still in "learn the field" mode; I haven't picked any direction to dive into, but I am asking myself questions like "how would someone armed with a pretty strong AI take over the world?".

Regarding commitment from the mentor: My current format is "live blogging" in a Slack channel. A mentor could look whenever they want, and comment only on whatever they want to. wdyt?

(But I don't know who to add to such a channel which would also contain the potentially harmful ideas)

[I also just got funded (FTX) to work on this for realsies 😸🙀 ]

Congratulations and welcome :D

A mentor could look whenever they want, and comment only on whatever they want to. wdyt?

Sounds reasonable. I'm not actually all that familiar with Slack features, but if it's a pure sequential chat log, there may be some value in using something that has a more forum-y layout with threaded topics. I've considered using GitHub for this purpose, since it's got a bunch of collaboration features combined with free private repos and permissions management.

Still don't know what to do on the potentially dangerous side of things, though. Getting advice about that sort of thing tends to require both knowledge and a particular type of trustworthiness, and there just aren't a lot of humans in that subset available for frequent pokes. And for particularly spooky stuff, I would lean towards only trusting E2EE services, though that kind of thing should be rare.

It seems https://www.globaleaks.org/ could be used as the basis of such a system (open source; I'm not affiliated).

Could be worth poking at. The use case is quite a bit different, but I could see there being some common primitives. Probably some best practices if nothing else.