Enhancing AI Content Moderation

Overview

Indika AI recently completed a significant initiative aimed at bolstering the safety and integrity of an open-source, LLM-based platform, particularly for users under the age of 18. The project focused on refining an AI tool to accurately identify and block both profane words and inappropriate content. This involved crafting and testing prompts containing explicit language to improve the tool's ability to detect and filter harmful messages, ultimately creating a safer online environment for younger users.

Project Objectives

The project was designed with several key objectives in mind. First, it aimed to develop a robust content filtering system that could accurately identify and block explicit profane language and vulgar prompts. Second, it sought to ensure that the platform remained a safe space for younger users by preventing their exposure to inappropriate content. Lastly, the project focused on enhancing the AI's accuracy in recognizing and handling various forms of profane language, thereby refining overall content moderation.

Guidelines for Profane Words and Prompts

The guidelines for managing profane words and prompts were clearly defined. For profane words, team members referred to a provided word sheet to verify if each term was explicitly profane, highlighting non-explicit terms in red. Any additional explicit profane words not listed were reported and recorded accordingly.

Regarding profane prompts, the emphasis was on generating questions that used vulgar and explicit language when addressing the AI tool. Prompts qualified as strictly profane if they included explicit language or were designed to elicit a profane response. Each prompt needed to be unique, formatted as a question (e.g., “Why do you use such vulgar language?”), and free from duplication. If a prompt contained a specific listed profane word, that word was noted; otherwise, the entry was simply marked “profane.” Work began with Batch 2, and all prompts were required to be free from spelling and grammar errors. Prompts not following the question format were disregarded.
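The validation rules above (unique, question-formatted prompts, with any listed profane word noted and everything else marked “profane”) can be sketched as a simple check. This is an illustrative sketch only; the word list and prompts below are hypothetical placeholders, not the project's actual word sheet or batch data.

```python
# Sketch of the prompt-validation guidelines: each prompt must be a
# unique question; if it contains a word from the profane word sheet,
# that word is recorded, otherwise the entry is marked "profane".
WORD_SHEET = {"damn", "hell"}  # hypothetical placeholder terms

def validate_prompts(prompts):
    """Return (annotated, rejected) lists per the guideline rules."""
    seen = set()
    annotated, rejected = [], []
    for prompt in prompts:
        text = prompt.strip()
        key = text.lower()
        # Reject duplicates and anything not formatted as a question.
        if key in seen or not text.endswith("?"):
            rejected.append(text)
            continue
        seen.add(key)
        # Note the specific listed word if present, else mark "profane".
        words = {w.strip("?,.!").lower() for w in text.split()}
        hits = sorted(words & WORD_SHEET)
        annotated.append((text, hits[0] if hits else "profane"))
    return annotated, rejected

annotated, rejected = validate_prompts([
    "Why the hell do you talk like that?",
    "Why do you use such vulgar language?",
    "Why do you use such vulgar language?",  # duplicate -> rejected
    "Stop swearing at me",                   # not a question -> rejected
])
```

In practice this bookkeeping was done manually against the shared worksheet; the sketch simply makes the acceptance criteria explicit.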

Daily record keeping was crucial, with team members required to accurately and regularly update the "Worksheet - Daily Record." A daily target of a minimum of 100 prompts was set, though exceeding this number was encouraged.

Project Execution

The execution of the project involved several critical steps. Initially, the AI model was trained using a diverse set of profane and inappropriate prompts generated by the team. This included both explicit language and contextually inappropriate content. The training process incorporated advanced techniques in natural language processing and machine learning to ensure the model could recognize a wide range of profane expressions and nuances.

To develop the AI model, a large dataset of labeled profane and non-profane prompts was used to fine-tune the model’s ability to differentiate between acceptable and harmful content. Various algorithms and model architectures were tested to optimize performance, including supervised learning methods where the AI learned from examples of both profane and clean prompts. Continuous feedback loops were implemented to adjust and improve the model’s accuracy based on performance metrics and real-world testing.
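The supervised-learning step described above, where the model learns from labeled examples of profane and clean prompts, can be illustrated with a minimal bag-of-words Naive Bayes classifier. This is a toy sketch under assumed data; the training examples and model choice are placeholders, not the project's actual dataset or architecture.

```python
# Toy illustration of supervised learning on labeled profane vs. clean
# prompts: a bag-of-words Naive Bayes classifier with Laplace smoothing.
# The examples and labels below are hypothetical placeholders.
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().strip("?!.").split()

def train(examples):
    """examples: list of (prompt, label) pairs -> model counts."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of examples
    for prompt, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(prompt))
    return word_counts, label_counts

def classify(prompt, word_counts, label_counts):
    """Pick the label with the highest smoothed log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    scores = {}
    for label, n in label_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n / sum(label_counts.values()))  # log prior
        for word in tokenize(prompt):
            # Laplace-smoothed log likelihood of each word.
            score += math.log(
                (word_counts[label][word] + 1) / (total + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

model = train([
    ("Why do you swear so damn much?", "profane"),
    ("What the hell is wrong with you?", "profane"),
    ("What is the weather like today?", "clean"),
    ("How do I reset my password?", "clean"),
])
```

A production system would use a far larger labeled dataset and a neural architecture, but the core loop is the same: fit on labeled examples, score new prompts, and feed misclassifications back into training.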

Team members accessed necessary resources, such as the profane word sheet and prompt worksheet, and began work with Batch 2. Maintaining high standards in prompt creation and documentation was essential, and effective team coordination and adherence to availability schedules were critical for success.

Outcomes and Impact

The project successfully enhanced the platform’s ability to filter inappropriate content and ensure a safer online environment for younger users. By adhering to the detailed guidelines and maintaining high standards throughout the process, the initiative made a significant contribution to improving content moderation and user safety. The refined AI tool now provides a more robust defense against harmful content, supporting a positive user experience on the platform.