Content moderation classifiers play a key role in detecting harmful content online and enforcing content policies. Traditionally, developing a new content policy and labeling taxonomy takes three to six months: it involves borrowing labeling capacity from already-launched classifiers, adjudicating disagreements and areas that need more definition, and repeating that process several times until the taxonomy is robust enough for classifier training. With GPT-4, we developed a model-assisted classification system that cuts this policy development process down to roughly one day.
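As a minimal sketch of what model-assisted labeling can look like in practice, the snippet below prompts GPT-4 to judge a piece of content against a draft policy. The policy text, label names, and prompt framing are illustrative assumptions for this sketch, not our production pipeline:

```python
# Illustrative sketch only: the policy wording, labels, and prompt framing
# are assumptions for demonstration, not the production moderation system.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DRAFT_POLICY = """\
Label the user's content as VIOLATING if it provides instructions that
facilitate self-harm; otherwise label it NON-VIOLATING. Briefly explain
your reasoning, then give the final label on its own line."""

def label_content(content: str) -> str:
    """Ask GPT-4 to judge `content` against the draft policy."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # stable labels make policy iteration easier
        messages=[
            {"role": "system", "content": DRAFT_POLICY},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content
```

In a loop like this, disagreements between the model's labels and human judgments quickly surface the clauses of a draft policy that need tightening or more examples, which is what compresses months of iteration into a day.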
This is a significant breakthrough in policy development and in bootstrapping classifier training: it dramatically reduces cost and lead time, and in some cases reduces the toll on content reviewers' well-being. We will walk through how we applied this system to build content policies for detection and enforcement in DALL-E, GPT-4, and ChatGPT, as well as the basis for GPT-4 model behavior itself. This approach is also previewed in Section 4.2, "Content Classifier Development," of the GPT-4 System Card, under Section 4 on System Safety. A paper describing this research is forthcoming.