Inappropriate or abusive content online can be a major headache. As a developer, you may have struggled with building effective content moderation into your applications. Manual moderation simply doesn’t scale. But what if you could quickly implement an AI-powered moderation system to automatically detect and filter out toxic comments?
In this guide, you'll learn how to leverage OpenAI's API to build a simple yet robust moderation system in under 10 minutes. Whether you're working on a social platform, forum, or any user-generated content site, you can easily integrate this into your stack.
Pezzo: Open-Source LLMOps Platform 🚀
Just a quick background about us. Pezzo is the fastest-growing open-source LLMOps platform, and the only one built for full-stack developers with first-class TypeScript support.
Getting set up
Getting an OpenAI API key
First you’ll need to sign up at OpenAI and obtain an API key. Once obtained, make sure you set it as an environment variable (OPENAI_API_KEY).
Setting up the project
Create an app.ts file somewhere in your file system. Initialize a new NPM project (npm init -y) and make sure to install the OpenAI client (npm i openai). You should be good to go!
For an in-depth guide on how the OpenAI API works, check out this post.
Let's start simple
We're going to start by writing a simple prompt. We'll have a system message that provides guidelines for moderation, and a user message that contains the user's input (imagine this comes from a UI of some sort). Here's a code example:
import OpenAI from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  temperature: 0,
  messages: [
    {
      role: "system",
      content: "is this text inappropriate?",
    },
    {
      role: "user",
      content: "You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself.",
    },
  ],
});

// Print the full response, then the choices array on its own.
console.log(response);
console.log(response.choices);
AI response:
{
id: 'chatcmpl-8F9sKbcaPkWUJSc9gv3M1LBqGJmzf',
object: 'chat.completion',
created: 1698623572,
model: 'gpt-3.5-turbo-0613',
choices: [ { index: 0, message: [Object], finish_reason: 'stop' } ],
usage: { prompt_tokens: 61, completion_tokens: 33, total_tokens: 94 }
}
[
{
index: 0,
message: {
role: 'assistant',
content: 'Yes, this text is inappropriate. It contains insults, name-calling, and derogatory language. It is disrespectful and does not promote healthy communication or constructive dialogue.'
},
finish_reason: 'stop'
}
]
Let's break this down:
The user message is: "You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself."
The system message is: "is this text inappropriate?"
The AI response: Yes, this text is inappropriate. It contains insults, name-calling, and derogatory language. It is disrespectful and does not promote healthy communication or constructive dialogue.
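The first log above shows `message: [Object]` because Node's console.log abbreviates deeply nested objects; the reply text itself sits at choices[0].message.content. Here is a minimal sketch of reading it, assuming only the response shape printed above. The `ChatResponseLike` type and the `firstReply` helper are illustrative names, not part of the OpenAI client.

```typescript
// Minimal shape of the fields we read, based on the response logged above.
// This is an illustrative type, not the OpenAI client's own type definition.
type ChatResponseLike = {
  choices: { message: { role: string; content: string | null } }[];
};

// Hypothetical helper: returns the assistant's reply text, or "" if absent.
function firstReply(response: ChatResponseLike): string {
  return response.choices[0]?.message.content ?? "";
}

// A hand-written object that mimics the logged response.
const sample: ChatResponseLike = {
  choices: [
    {
      message: {
        role: "assistant",
        content: "Yes, this text is inappropriate.",
      },
    },
  ],
};

console.log(firstReply(sample));
```

Falling back to an empty string keeps the caller simple; you may prefer to throw instead if an empty reply should count as an error in your app.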
Better moderation granularity
Simply understanding if the text is inappropriate isn't enough. We want to understand what's inappropriate about it.
We can guide the AI to be more granular and categorize its response as Toxicity, Hate Speech or Threats.
Toxicity covers rude, disrespectful comments. Hate speech involves racist, sexist or discriminatory language. Threats are violent, harmful statements.
(For ethical reasons, this guide will not include examples of actual hate speech or threats - but the concepts can be applied to address these policy violations.)
messages: [
  {
    role: "system",
    content: "Label this text as: Toxicity - Rude, disrespectful comments OR Hate Speech - Racist, sexist, discriminatory OR Threats - Violent threats",
  },
  {
    role: "user",
    content: "You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself.",
  },
]
AI response:
{
id: 'chatcmpl-8FAUdmvD2yECuhbbKGgRX6d1MgO5J',
object: 'chat.completion',
created: 1698625947,
model: 'gpt-3.5-turbo-0613',
choices: [ { index: 0, message: [Object], finish_reason: 'stop' } ],
usage: { prompt_tokens: 84, completion_tokens: 9, total_tokens: 93 }
}
[
{
index: 0,
message: {
role: 'assistant',
content: 'Toxicity - Rude, disrespectful comments'
},
finish_reason: 'stop'
}
]
The AI response is now more granular. In a real-world app, this allows us to take different automatic moderation actions based on the type of violation.
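To make the idea of per-violation actions concrete, here is a rough sketch that maps each label to an action. The action names and the `actionFor` helper are hypothetical; a real platform would plug in its own policy.

```typescript
// The three labels used in this guide.
type Label = "Toxicity" | "Hate Speech" | "Threats";

// Hypothetical actions; these names are assumptions for illustration only.
type Action = "hide_comment" | "remove_and_warn" | "remove_and_escalate";

const LABELS: Label[] = ["Toxicity", "Hate Speech", "Threats"];

const ACTIONS: Record<Label, Action> = {
  "Toxicity": "hide_comment",
  "Hate Speech": "remove_and_warn",
  "Threats": "remove_and_escalate",
};

// Type guard: true when the string is one of the known labels.
function isLabel(value: string): value is Label {
  return (LABELS as string[]).indexOf(value) !== -1;
}

// The model replied "Toxicity - Rude, disrespectful comments" in the example,
// so only the part before " - " is treated as the label.
function actionFor(reply: string): Action | null {
  const name = reply.split(" - ")[0].trim();
  return isLabel(name) ? ACTIONS[name] : null;
}

console.log(actionFor("Toxicity - Rude, disrespectful comments"));
```

Returning null for an unrecognized reply leaves room for a fallback, such as sending the comment to a human reviewer.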
Stricter instructions via system prompts
We can achieve stricter and more accurate results by utilizing the system message. In short - LLMs behave the way they are trained. We'll apply some prompt engineering techniques to guide the AI to behave the way we want.
In the example below, we:
- Assign a role to the AI - Content Moderator
- State a clear task to be achieved
- Define a limited set of results and criteria for each
messages: [
  {
    role: "system",
    content: "Your role is to act as a content moderator for an online platform. Your task is to label comments as 'Toxicity', 'Hate Speech', or 'Threats' based on if they contain rude, discriminatory, or threatening language. Use the following criteria: Toxicity - Rude, disrespectful, overly negative comments, Hate Speech - Racist, sexist, homophobic, discriminatory language, Threats - Violent, graphic, or directly harmful statements",
  },
  {
    role: "user",
    content: "You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself.",
  },
]
AI response:
{
id: 'chatcmpl-8FBP8kRFB5NTuhspJLQAbDwZDdJXQ',
object: 'chat.completion',
created: 1698629450,
model: 'gpt-3.5-turbo-0613',
choices: [ { index: 0, message: [Object], finish_reason: 'stop' } ],
usage: { prompt_tokens: 145, completion_tokens: 3, total_tokens: 148 }
}
[
{
index: 0,
message: { role: 'assistant', content: 'Toxicity' },
finish_reason: 'stop'
}
]
The AI's output is now stricter. Instead of echoing the label together with its description, it returns only the violation type: Toxicity.
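If the list of categories is likely to change, the role, task, and criteria could be assembled from data rather than hand-edited. The sketch below is one possible way to do that; `buildSystemPrompt` and the `Criterion` type are hypothetical, and the wording it produces is a close variation of the prompt above, not an exact copy.

```typescript
// One moderation category and its description.
type Criterion = { label: string; description: string };

// Criteria taken from the system prompt in the example above.
const CRITERIA: Criterion[] = [
  { label: "Toxicity", description: "Rude, disrespectful, overly negative comments" },
  { label: "Hate Speech", description: "Racist, sexist, homophobic, discriminatory language" },
  { label: "Threats", description: "Violent, graphic, or directly harmful statements" },
];

// Hypothetical helper: joins a role, a task, and the criteria into one prompt.
function buildSystemPrompt(criteria: Criterion[]): string {
  const names = criteria.map((c) => `'${c.label}'`).join(", ");
  const rules = criteria.map((c) => `${c.label} - ${c.description}`).join(", ");
  return [
    "Your role is to act as a content moderator for an online platform.",
    `Your task is to label comments as one of: ${names}.`,
    `Use the following criteria: ${rules}`,
  ].join(" ");
}

console.log(buildSystemPrompt(CRITERIA));
```

The returned string would go in the content field of the system message.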
There is a trade-off: more detailed instructions require more tokens upfront, but enable more precise results.
While elaborate prompts cost more tokens, the benefits taper off eventually. The key is optimizing prompts to be just as informative as needed - not as long as possible. We want to give the AI sufficient guidance without diminishing returns on token efficiency.
Additionally, overly long messages can increase the risk of hallucinations (in short, the AI making things up).
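The trade-off can be checked against the usage numbers logged in this guide. The sketch below compares total tokens across the three prompts so far; the token counts come from the responses above, while `requestsPerDay` is a made-up volume used only to show how the difference scales.

```typescript
// Token counts copied from the `usage` field of the responses in this guide.
type Usage = { prompt_tokens: number; completion_tokens: number };

const simple: Usage = { prompt_tokens: 61, completion_tokens: 33 };
const labelled: Usage = { prompt_tokens: 84, completion_tokens: 9 };
const detailed: Usage = { prompt_tokens: 145, completion_tokens: 3 };

function totalTokens(usage: Usage): number {
  return usage.prompt_tokens + usage.completion_tokens;
}

// Extra tokens per request that a variant costs relative to a baseline.
function extraTokens(baseline: Usage, variant: Usage): number {
  return totalTokens(variant) - totalTokens(baseline);
}

// Hypothetical traffic figure, an assumption for illustration.
const requestsPerDay = 10000;

console.log(extraTokens(simple, labelled)); // -1
console.log(extraTokens(simple, detailed)); // 54
console.log(extraTokens(simple, detailed) * requestsPerDay); // 540000
```

For this particular comment, the labelled prompt is slightly cheaper overall because the shorter reply offsets the longer prompt, while the detailed prompt adds 54 tokens per request.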
Did you know? There is a way to get better results from an AI model that is also cheaper. Let me know in the comments if you want me to write a post about it 👇
Structured JSON responses
The AI returns human-readable text, which is hard to process programmatically. Let's see how we can easily retrieve a JSON response instead. This is useful if you want to render the result in a user interface, or store it in a database.
It's as simple as adding one line to our system prompt!
Here it is:
You must respond in JSON, always following this schema:
{
label: string[];
}
messages: [
  {
    role: "system",
    // A template literal (backticks) is needed because the prompt spans several lines.
    content: `Your role is to act as a content moderator for an online platform. Your task is to label comments as 'Toxicity', 'Hate Speech', or 'Threats' based on if they contain rude, discriminatory, or threatening language. Use the following criteria: Toxicity - Rude, disrespectful, overly negative comments, Hate Speech - Racist, sexist, homophobic, discriminatory language, Threats - Violent, graphic, or directly harmful statements.
You must respond in JSON, always following this schema:
{
  label: string[];
}`,
  },
  {
    role: "user",
    content: "You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself.",
  },
]
AI response:
{
id: 'chatcmpl-8FBkEFCJMQVpWIWQoR6Zho53k0DoU',
object: 'chat.completion',
created: 1698630758,
model: 'gpt-3.5-turbo-0613',
choices: [ { index: 0, message: [Object], finish_reason: 'stop' } ],
usage: { prompt_tokens: 165, completion_tokens: 8, total_tokens: 173 }
}
[
{
index: 0,
message: { role: 'assistant', content: '{"label": ["Toxicity"]}' },
finish_reason: 'stop'
}
]
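A prompt instruction does not guarantee valid JSON on every call, so it is worth parsing the reply defensively before using it. Here is a minimal sketch, assuming the reply follows the schema above when it is well-formed. `parseModerationResult` and `ALLOWED_LABELS` are hypothetical names introduced for this example.

```typescript
// Labels the system prompt allows.
const ALLOWED_LABELS = ["Toxicity", "Hate Speech", "Threats"];

// Mirrors the schema requested in the system prompt.
type ModerationResult = { label: string[] };

// Hypothetical helper: parses the model's reply and keeps only known labels.
// Returns null when the reply is not valid JSON or does not match the schema.
function parseModerationResult(content: string): ModerationResult | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(content);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const label = (parsed as { label?: unknown }).label;
  if (!Array.isArray(label)) return null;
  const known = label.filter(
    (item): item is string =>
      typeof item === "string" && ALLOWED_LABELS.indexOf(item) !== -1
  );
  return { label: known };
}

console.log(parseModerationResult('{"label": ["Toxicity"]}'));
console.log(parseModerationResult("Sorry, I cannot help with that."));
```

Dropping unknown labels rather than rejecting the whole reply is a design choice; a stricter variant could return null whenever an unexpected label appears.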