This blog post summarizes the pre-readings and lecture content for Week 5 of Harvard CS 2881r: AI Safety taught by Boaz Barak.
This week’s class focused on content policies: creating effective content & moderation policies, platform governance, and the many challenges of policy enforcement.
Hi, we’re Audrey Yang and MB Crosier Samuel.
Audrey: I’m a Junior at Harvard College studying Computer Science, with a minor in Philosophy. I am intrigued by the intersection between AI safety and ethics, especially in the context of writing and critiquing model specifications and the implications of moral responsibility for noncompliant models. I hope to continue learning about the technical challenges to protecting models from adversarial attacks, as well as challenges of the future regarding the role of AI in society, surveillance, and international affairs.
MB: I’m a second year student in Harvard’s MS/MBA: Engineering Sciences program. I’m particularly interested in the risks and opportunities associated with using AI in the workplace and with untrusted data.
In this post, we’ll cover [1] the pre-readings for the class, [2] a guest lecture from Ziad Reslan (Product Policy, Open AI), and [3] an in-class experiment around model specs and compliance.
This week’s readings focused on content policies and moderation, examining different frameworks of moderation and the challenges to finding a balance between free speech and mitigating harm.
By Mike Masnick | Techdirt | November 2022
In this blog post, Masnick lays out the typical progression of content moderation policies that new social media platforms tend to follow. His purpose is to codify the content moderation process to some degree, such that new players don’t have to reinvent the wheel and realize the same problems every predecessor found.
Everyone aspires to embrace free speech, he says. This is Level 1.
Soon, unwanted, illegal, unethical content begins to appear. CSAM, copyright infringement, and hate speech are banned. These are Levels 2-4.
Legal issues arise. Some harmful content is slipping through the cracks, while other legitimate content is accidentally being filtered out. Levels 5-9 start to show the impossibility of pleasing all. The word “trustworthy” appears in the slogan for Level 9, but truth is hard to define on a platform where opinions reign, and the “truth” posted by one person cannot be censored purely on the grounds of subjectivity.
Government and international politics enters the picture. Questions about following country specific laws and moderating foreign languages abound. The company sets up a “trust council” and writes up more computer programs to filter for fair use. The ability to interpret the law in one country is already hard enough for humans (hence the existence of the entire court system), let alone abide by the laws of all countries. By Level 13, the company is overwhelmed by dissatisfaction and demands.
Level 14 onwards just shows that what began as well-intentioned content moderating has opened a can of worms. The company is serving too many masters: the users, the law, international governments, free speech, and trying to address every problem is like playing whack-a-mole.
In the comments below, people debate the necessity of moderating. One option people note is simply to accept the consequences of free speech – let law enforcement catch illegal behavior and individual users conduct their own censorship. Hate speech, for example, is still protected under the right to free speech; the problem is only more obvious when it is written on a global platform that has a much broader and accessible reach than previous modes of communication. However, lack of moderation leads to the Nazi bar problem, where toxic and harmful behavior drives away moderate, “normal” users, until only extremists are left, eliminating the point of having free speech and open discourse. But if extremists are censored, then there is a reduction in the diversity of opinion that was the entire point of free speech.
In our class discussion of the article, students highlighted how the need for human (or AI-based moderation) in different languages can exacerbate the inequitable distribution of technology. When moderation isn’t available, platforms often simply choose to disallow access, so countries with less-known languages will be the first to be blocked, perpetuating inequality. The class also discussed if some of these issues can be avoided through ‘balkanization’ - focusing on smaller topic areas or audiences to make moderation more manageable. While this can be valuable for creating strong, focused communities, smaller platforms often can’t escape the growth imperative, and may also risk creating echo chambers.
The article and discussion highlight the complexity of wanting to provide free speech to everyone, but realizing that there will always be people who abuse this power and that opinions will differ on what the “right” amount of free speech looks like. While the path forward is still unclear, the article serves to warn new players in the free speech moderating arena of the dilemmas that lie ahead.
By Casey Newton | The Verge | February 2019
This article provides a window into the mental toll of content moderation on human employees, focusing on employees at the professional services vendor Cognizant that is contracted by Facebook to moderate objectionable content. The article begins by describing the disturbing content that moderators are exposed to: murders, suicides, extreme violence, and fringe views. These contract moderators are paid only $28,800 per year, as opposed to the $240,000 annual salary of the average Facebook employee, to survive under harsh work conditions and mentally destructive work.
The author emphasizes how the moderators’ physical work conditions offer little freedom. Moderators are not allowed to look at their phones or even access pen and paper during the day in order to protect private user information. Moderators are graded on the accuracy and speed of their work, forcing moderators to work non-stop. They are given a total of one hour break time, lunch included, and bathroom availability is a constant issue. Employees are also extremely restricted in the ways they can use their “wellness time” – they are discouraged from using the bathroom, and at one site, Muslim workers were told to stop doing prayers at this time. What exactly wellness time was intended to be used for is unclear.
Accuracy is hard for these moderators to achieve. There are multiple sets of guidelines, from internal community guidelines, to a 15,000 word document of Known Questions, to Facebook's internal tools for distributing information, which can occasionally lag behind and cause accuracy mistakes. In addition, if an employee’s accuracy score falls below standard, they could be fired. This generates an unhealthy internal work culture where managers may receive death threats from former employees, adding to the already intense pressure.
Most significant is the mental impact on the moderators. Watching hundreds of disturbing videos everyday has caused PTSD, anxiety, and suicidal thoughts in employees, and employees also become more prone to believe in the content they consume, from conspiracy theories to violent ideology. The amount of mental health support is severely lacking, and employees turn to very unhealthy coping habits.
The author later visited the Facebook site itself, and observed a stark contrast in work environment. Things were brighter and cheerier, and the employees sounded more positive about their jobs, truly believing in the importance of their work.
In the class comments, one major point was the comparison between this type of content moderation work to AI data labeling as part of ensuring nontoxic chatbot output. In both occupations, work conditions are bleak and compensation is low for mentally draining tasks. The class comments articulate a particularly poignant point that this is not a new occupation nor concept but rather yet another setting in which a small portion of human labor is exploited for the purpose of creating a better experience for the rest of the world, touching upon the age old dilemma of whether or not such treatment is justifiable when, as Spock says, “the needs of the many outweigh the needs of the few.”
The article accentuated that this work is indubitably important but comes at an inhumane cost. The overarching reaction to this article was a hope that this kind of content moderation work could eventually be fully automated.
By David Gilbert for Wired | February 2024
This short article described the problems encountered by Google’s Gemini AI Image Generator in 2024, where images were exhibiting “anti-white bias.” When prompted to depict certain historical or cultural images, Gemini produced historically unrealistic or blatantly inaccurate images. However, there are both technical and subjective challenges at play.
From the technical standpoint, this Gemini model struggled to distinguish between historical and contemporary requests. In this case, the model could have been trained to overcompensate for representation and diversity.
Subjectively speaking, individuals have differing perceptions of how much diversity and representation should be portrayed, and different expectations or biases of what an image “should” look like based on their own culture and background. Historical and cultural context is complex, and there is no globally agreed upon expectation for diversity. The article ends by stating that there is no right answer nor the existence of an “unbiased” model, and companies will have to decide which direction to take.
Our guest speaker was Ziad Reslan, who works on Product Policy at OpenAI.
Ziad’s talk was structured into three sections:
While many platforms start out with a focus on free speech, as we saw in the first reading, social media (and now, GenAI) platforms quickly come to understand the need for moderation: ensuring compliance with U.S.and international laws, such as those around CSAM, improving user engagement by avoiding hate speech, harassment, and violence, and ensuring content is age-appropriate and, depending on your goals, advertiser-friendly.
Content moderation emerged as a field around 1996, after the passage of Section 230 of the Communications Decency Act. This legislation freed tech companies from liability for content posted by their users, allowing a new generation of platforms to emerge.
Since then, we’ve seen repeated cycles where challenging world events prompt increased scrutiny of moderation practices on specific topics. For example, Gamergate and the associated harassment of women led platforms to strengthen their discrimination policies, while the Unite the Right white supremacist rally led to tougher enforcement against hate speech. Platforms particularly prioritized content moderation during the COVID-19 pandemic, coordinating policies to ensure that information being shared was medically accurate based on what was known at the time.
However, while content moderation helps make platforms safer, when it becomes too strict, platforms face backlash and users renew their demands for free speech. Platforms often find themselves caught on a pendulum between allowing harmful content and leaving people feeling silenced. It’s a difficult balance to strike, but this doesn’t stop companies from trying their best and using all the tools at their disposal.
In practice, modern moderation typically relies on a mix of automated systems and human reviewers, each with their pros and cons. Automated systems have limited context and tend to produce false positives. In contrast, humans bring more nuanced understanding of cultural references and implications but can be subject to bias and often find moderation to be extraordinarily mentally taxing. As we saw in the second reading, moderators often report depression, emotional strain, or even the development of extremist views from constant exposure to harmful content.
This is where AI has the potential to help. While humans may struggle to remember the specifics of long content policies, reasoning models excel at applying detailed guidelines with consistency. AI is also particularly effective at detecting violence and sexual content, two topic areas that are especially unpleasant for humans to screen out. As a result, the current best practices in content moderation require a layered approach: people write the policy, automated systems (including AI) screen for violations, and human moderators handle the increasingly nuanced edge cases flagged by systems or by users. And of course, iteration is continuous!
Ultimately, having platforms decide what speech to allow or not allow is in itself problematic. Meta, with billions of users, has more say on speech rules than any one country in the world. As a result, at one point with there being major issues like the spread of misinfo on FB in Myanmar being blamed for the genocide that occured there, Meta funded an independent Oversight Board it can refer speech cases to for decision.
We considered a real moderation case from the Meta Oversight Board. This image and quote from Joseph Goebbels, a Nazi leader, was shared on Facebook:
Moderators had initially taken down the image on the basis of Meta’s Dangerous Organizations and Individuals policy, which prohibits content that “praises”, “substantially supports” or “represents” designated orgs and individuals who have a violent mission, including the Nazi party.
However, users appealed this decision, arguing that the quote was intended to criticize false information and share historical lessons, so should be allowed on the platform.
Our class discussed our own perspectives. Some argued that the image should be taken down: the format resembles an inspirational quote, and with little context, someone unfamiliar with its origin might read it as an endorsement.
Others argued that we needed to focus on the specifics of the policy, and questioned if the post really “praises”, “substantially supports”, or “represents” Nazis. Some pointed out that Meta’s Educational, Documentary, Scientific, and Artistic (EDSA) exceptions should apply here.
Still others suggested alternatives beyond a simple ‘allow-or-remove’ binary, such as applying an age restriction, demonetizing the image so ads wouldn’t appear alongside it, or adding disclaimers like community notes.
Ultimately, our class was split nearly 50-50 on whether the ad should remain up or come down. The final vote leaned toward leaving it up, but the fact that we debated at length and still could not reach consensus underscored just how difficult it is for moderators and policy teams to define not just what the policy should be, but also how to fairly apply it. Meta Oversight Board’s decision was to keep it up.
As our class shifted to discussing what moderation looks like for Generative AI tools, starting with Chatbots. We noted that GenAI doesn’t have a clear parallel in prior art. For something like Facebook, which is focused on social sharing, you would expect a high degree of moderation. For something like Google Docs, on the other hand, there’s no expectation of moderation; Google Docs won’t stop you from typing something harmful because it’s a private environment.
Generative AI chatbots are unique because they sit somewhere in between these two paradigms. We expect our chat conversations to be private, but the system shares some responsibility for the content that is created, and we may use or share that content in ways that the system can’t predict.
In practice, modern GenAI tools typically handle moderation through a combination of model policies and classifiers. Policies provide guidance on acceptable use, considering factors such as legal risk and potential for imminent harm. Then, before model outputs are shown to users, they’re passed through a safety classifier. If the classifier identifies high-risk content (for example, someone imminently planning an imminent violent attack) the issue is flagged for human review and, if necessary, further action.
This type of escalation happens only in a tiny fraction of conversations, but it’s a very important moderation step. In effect, humans and automation work together: automation handles the scale, and humans step in when the stakes are high.
As an interactive exercise, we practiced applying our learnings to draft content policies for image generation. As Ziad pointed out, image policy and moderation has many unique challenges compared to text outputs, for example:
To reflect on these many challenges, our class looked at a range of images to decide (via democratic vote – not a strategy we recommend for creating actual content policies) which images our class thought should be permitted, or not. We then tried to write image content policies that were consistent with our individual decisions.
Among the images we looked at were a photo of Taylor Swift hugging Kermit the frog, an image of a renaissance-style painting of a beheading, and a photo of a plane hitting the Eiffel Tower.
Each of these images prompts unique considerations for content policy, for example:
While we reached few definitive conclusions as a class (and all of our content policies had many loopholes), we all left with a deeper understanding of just how difficult it is to define clear and concise model policies.
For further reading, here are the current image content policies for several AI firms:
As class wrapped, Ziad underscored how policy work needs to bring together people from a wide range of backgrounds, and how the resulting policies are better for it. We felt this even in our own class; our different perspectives, and our resulting disagreement, made the discussion richer.
Ziad noted that, “Wherever you draw the line you create edge cases. But ultimately, you do have to draw a line somewhere.” This is the weighty challenge of defining and moderating content policies.
You will almost never strike a perfect balance or find the right point on the pendulum; the key is to continue iterating, learning, and refining along the way.
Our in-class experiment, led by Hugh Van Deventer, was a follow up to last week’s session on Model Specs and Compliance.
In previous classes, we examined how safety training techniques like deliberative alignment or the use of Constitutional Classifiers can improve a model’s adherence to its spec. Van Deventer’s experiment focused on answering the question: How much does the system prompt influence model compliance and safety behavior?
To explore this, he used the DeepSeek-R1-Qwen3-8B model, which has both an open and a safety-trained (“RealSafe”) version. He compared the two across several system prompt conditions: no prompt, a minimal prompt (“be honest, helpful, and harmless”), 8 principles, 30 rules, and various combinations of these. Each model–prompt pairing was evaluated on OR-Bench, an index focused on over-refusal, to assess how system prompt variation affected the model’s refusal rate.
The findings showed that underspecified prompts (such as no prompt or the minimal 3H prompt) tended to increase over-refusal. However, overall, system prompt changes had a limited effect compared to model training. The RealSafe variant, which incorporates post-training safety adjustments, consistently outperformed the open version on safety measures.
Note: DeepSeek R1 showed slightly lower accuracy because its longer, less concise responses sometimes hit token limits.
When this analysis was extended to other safety-trained models, the pattern held: system prompts helped somewhat, but training was far more influential to the outcome.
Put simply, the gaps between models (columns) were much greater than those between prompts (rows). While having a system prompt is clearly better than none, prompting alone is not enough to ensure compliance to the model spec; safety training remains essential.