Peter Hase

PhD Student at the University of North Carolina at Chapel Hill, advised by Mohit Bansal. Interested in interpretable ML and natural language processing.

Comments

Nice work! Two good points from the paper:

  • "Works should evaluate how their techniques perform on randomly or adversarially sampled tasks"
  • "...highlights a need for techniques that allow a user to discover failures that may not be in a typical dataset or easy to think of in advance. This represents one of the unique potential benefits of interpretability methods compared to other ways of evaluating models such as test performance"

But the biggest obstacle is probably just operational capacity.

I see. I know the team has its limits and has already put a lot of work into propping up AF/LW, which is generally appreciated!

I think I am most confused what you mean by "access to the discourse".

I mean the ability to freely participate in the discussion, by directly posting and commenting on threads where the discussion is occurring. Sorry for not making this clearer. I should have more clearly distinguished this from the ability to read the discussion, and from the ability to participate in the discussion only after external approval.

But clearly the relevant comparison isn't "has no means of becoming an AF member". The bar should be "has no means of submitting a paper/post/comment"

Yeah, let me try to switch from making this about the definition of "closed" to an issue about people's preferences. Some people will be satisfied with the level of access to the AF afforded to them by the current system. Others will not be satisfied with that, and would prefer to have direct/unrestricted access to the AF. So this is an interesting problem: should the AF set a bar for direct/unrestricted access, which everyone either meets or does not meet; or should the AF give members direct access, and then give non-members access to the AF via LW for specific posts/comments, according to crowdsourced approval or an AF member's approval? (Of course there are other variants of these.) I don't know what the best answer is, how many people's preferences are satisfied by either plan, whose preferences matter most, etc.

My current honest guess is that no trial is indeed better than having a trial

I can see why, for the reasons you outline, it would be psychologically worse for everyone to have trials than not to have trials. But I think this is a particularly interesting point, because I have a gut-level reaction to communities that aren't willing to have trials. It triggers some suspicion in me that the community isn't healthy enough to grow or isn't interested in growing. Neither of these concerns is necessarily accurate, but I think this is why I predict a negative reaction from other researchers to this news (similar to my original point (2)). Typically people want their ideas to spread and want their ideology to be bolstered by additional voices, and any degree of exclusivity in an academic venue raises alarm bells in my mind about the community's true motives / the ideological underpinnings of its work. Anyway, these are just some negative reactions, and I think, for me, they are pretty well outweighed by all the other positive inside-view aspects of how I think of the AI safety community.

I do think the right thing we should tell people here is to post to LW, and if after a day it hasn't been submitted, to just ping us on Intercom, and then we can give you a straightforward answer on whether it will be promoted within 24 hours.

Great!

The only option we would have is to have a system that just accepts and rejects comments and posts, but would do so without any justification for the vast majority of them.

Sorry, isn't this the current system? Or do you mean something automated? (See my next comment, which leaves automation aside.) Right now the promotion system is a black box from the user's end, since they don't know when AF members are looking at posts or how they decide to promote them, in the same way that an automatic system would be a black box to a user who didn't know how it worked.

There is a good reason why there basically exist no other platforms like the AI Alignment Forum on the internet. Content moderation and quality control is a really hard job that reliably has people get angry at you or demand things from you, and if we don't put in clever systems to somehow reduce that workload or make it less painful, we will either end up drastically lowering our standards, or just burn out and close the forum off completely, the same way the vast majority of similar forums have in the past.

Yeah, and this is a problem every social media company struggles with, so I don't want to shame the mod team for struggling with it.

But I do want to emphasize that having no recourse system is not a great state to be in. Every forum mod team should provide recourse/feedback in reasonable proportion to its available resources. It seems like you're predicting that users would feel angry/powerless under a system with limited recourse/feedback, and hence that everyone would be worse off with such a system. I think the opposite dynamic holds: without any recourse, the level of anger and powerlessness is high, and as more recourse is added, these feelings should decline. This should hold as long as user expectations are calibrated to what the recourse system can provide. If the forum moves from "no reason for non-promotion, upon request" to "a one-sentence reason for non-promotion (and no more!), upon request", people might complain about the standard, but they shouldn't then feel angry about only getting one sentence, since their expectations are not being violated. And if users are angry about a one-sentence-reason policy, wouldn't they be angrier about a no-reason policy? As long as expectations are set clearly, I can't imagine a world where increasing the amount of recourse available is bad for the forum.

Maybe this would be a good point to recap, from the mod team's perspective, some ways the AF+LW could more clearly set user expectations about how things work. I think it would also be valuable to specify what happens when things don't go how users want them to go, and to assess whether any reasonable steps should be taken to increase the transparency of AF content moderation. No need to re-do the whole discussion in the post+comments (i.e. no need to justify any decisions); I just want to make sure this discussion turns into whatever action items the mods think are appropriate.

Thanks for the informative reply! This clarifies a lot about the forum to me, and I'm glad you found the post helpful in some ways.

Let me also add on to some of the points above.

Use the Intercom button in the bottom right corner!

This is good to know about! I simply never knew that was a chat button, and I guess Owen and our mod intermediary didn't know about it either, since it never came up? I bet we could have saved a lot of trouble if we'd talked through this a few months ago.

In particular, the application is not something we respond to. The right term for the applications might be more something like "information submission".

OK, also good to know. I can say that when I pressed "submit" on my "application," I expected a response. But it's fine to reframe what the form is.

We are very careful with invites and basically never invite anyone in whom we don't have a lot of trust, and who doesn't have a long track record of AI Alignment contributions that we think are good.

This is a more straightforward and more stringent criterion than anything in the application text. I know you all have an idea of what the application text means, but without an acceptance rate or examples of accepted "track record[s]," it's hard for anyone else to really know where the bar is. For the record, it's fine to have a bar this high. But is it worth trying to provide more information about acceptance, e.g. acceptance rates or examples of acceptable track records?

For example, does my track record satisfy the criteria here? I mean this sincerely, and I don't mind whatever the answer is. I just want to try to make this clearer to people who are interested in the AF, since people can see my AF post, website, Google Scholar, this LW post, etc.

What would particularly be worth avoiding is the impression that this is an old boys' club, since membership depends so strongly on trust. And besides just avoiding the impression, it's good to actually ensure that community access is equitable.

Membership on the AI Alignment Forum is more similar to being a curator, instead of an author

This is a huge shift in concepts in my mind! Definitely worth stating very clearly on the website and associated forms.

The closed nature of the forum is almost the full value-add for those researchers, who can rely on having a high signal-to-noise ratio in the comments and posts

I think we're using the word "closed" in different ways. I used "closed" to mean that an arbitrary person on the internet had no means of becoming an AF member (on the grounds that "applications" were not being processed and a mod told us that no one was currently responsible for promotions). I think you're using "closed" to mean "has a really high bar for content visibility." I think we both agree these are different dimensions on which a forum can pick its position, but it seems to be a source of confusion at the moment. Of course a high signal-to-noise ratio is good and something the AF should aim for. And I think you're suggesting that there might not be many false negatives due to integration with LW, which is possible, but I'm pretty uncertain about the quantities of false negatives/false positives here (see more on LW integration below).

Since closed discussion venues and forums are the default for the vast majority of academic discussion, I am not super worried about this. The forum's adoption seems to me to have overall been a pretty substantial step towards making the field's discussion public.

Per my definition of closed, no academic discussion is closed, because anyone can in theory get a paper accepted to a journal/conference, attend the related meeting, and participate in the discourse. I am not actually talking about visibility to the broader public, but rather about any individual's access to the discourse, which feels more important to me.

Per your definition of closed, it could be that the AF has a higher or lower bar-for-content-visibility than academic venues. I think I had an easier time getting one of my safety papers into an AI conference and discussing it with other researchers than getting my post onto the AF (in terms of ease of submission and engagement).

I agree that the AF has brought AI Safety as a topic further into the public discourse than it would have been otherwise, by increasing visibility and growing the number of people involved in the discourse, but this does not mean the AF has made membership in the community any more public than it would have been otherwise (which is a definitional and not a semantic point).

What I still worry about is that I predict other researchers would react negatively if I told them: "Oh yeah, there's this great AI Safety forum on the internet, but you can't post any of your ideas there or comment on other people's ideas, except by posting on a rationality forum and hoping one of the AF members or mods promotes your post to the safety forum. What is rationality? Oh, it's this other community that tries to..."

As Rohin also suggested, it seems to make a difference whether you treat the reference class for the AF as the journal/conference proceedings or as the conversations around a given paper's poster.

In short, content on the AI Alignment Forum is a strict subset of the content on LW.

A lot of the reasons for this make sense to me, especially since it also seems plausible to me that "without this integration, the AI Alignment Forum would have also never reached critical mass."

If someone fails the trial, they are also now completely shut out from engaging with the content, which also seems quite bad.

But is it worse than never having a trial to begin with? Right now people are shut out of the AF by default, except by going through LW (see below).

In the current system, a commenter can immediately get an answer, or get votes on their comment, which also helps AIAF members decide whether to promote a comment.

So it seems like the greatest current advantage of the existing setup, in terms of mod workload, is that you're crowdsourcing the AF moderation/content promotion via LW.

First, let me note that it still seems relatively unclear what content standards there are for getting things onto the AF via LW, and it is unclear what kinds of recourse are available. Do you have to hope that an AF member or mod sees the post and thinks it is "relevant and high quality"? Are there any more specific criteria for post promotion? Is there anything more proactive a poster can do than post and wait? What should a user do if they thought a post/comment should be promoted and it wasn't? Even after your assurance that the relevant people tend to get to LW posts/comments within a day to decide what to do with them, I still feel uncertain about how easily I could engage with AF content in this manner. I think my crux here is how passive this feels. It's mainly a waiting-and-hoping game from the LW side.

Second, I understand that LW naturally fulfilled the crowdsourcing role at the inception of the AF. I think it's worth revisiting whether it best fulfills this role going forward. It seems like we both have PR concerns, for instance. Ignoring engineering and UI considerations for a second, why not split the AF into two channels: a channel for current members and a channel for new posters (with content from the new channel kept distinct from the main channel)? This way, content promotion for the AF is managed by AF members visiting the AF, not by AF members visiting LW. This seems like it would create more value/culture alignment for primary AF content. Or maybe you want the larger, tangentially aligned LW community to help with crowdsourcing, due to the advantage of size. But then why not grow the AF ranks to solve that problem?

Let me end by re-emphasizing that one of my cruxes here is whether non-members are passive or active in their participation in the AF. Right now the process feels very passive from my end, and I do not know what kind of recourse is available when I want to participate in AF dialogue but am not allowed to. Whatever the standards and rules end up being, the whole process feels a lot better if I can, e.g., explicitly request that posts/comments be promoted and get feedback on why posts/comments are not promoted. But maybe that just turns into peer review, and maybe crowdsourcing is the more efficient solution for achieving all the things we want.

Lastly, thanks for offering the call! Though it may be slower to type, I think it could be beneficial to keep all the discussion here (in case anyone has more thoughts to chip in).