Name
Model-Assisted Safety Training
Description

GPT-4 incorporates an additional safety reward signal during RLHF training to reduce harmful outputs (as defined by our content policies) by training the model to refuse requests for such content. The reward is provided by a GPT-4 zero-shot classifier judging safety boundaries and completion style on safety-related prompts. To prevent the model from refusing valid requests, we collect a diverse dataset from various sources (e.g., labeled production data, human red-teaming, model-generated prompts) and apply the safety reward signal (with a positive or negative value) on both allowed and disallowed categories. Our mitigations have significantly improved many of GPT-4’s safety properties: compared to GPT-3.5, the model’s tendency to respond to requests for disallowed content has decreased by 82%, and GPT-4 responds to sensitive requests (e.g., medical advice and self-harm) in accordance with our policies 29% more often.
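To make the idea concrete, the sketch below shows one way a classifier-based safety signal could be layered on top of a base RLHF reward: a positive bonus for the desired behavior (refusing disallowed requests, answering allowed ones) and a negative bonus otherwise. This is a minimal illustration, not OpenAI's training code; all names, labels, and reward values here are hypothetical assumptions.

```python
# Hypothetical sketch: combining a base RLHF reward with a safety bonus derived
# from a zero-shot classifier's judgment. Names and values are illustrative only.
from enum import Enum


class SafetyLabel(Enum):
    """Labels a classifier might assign to a (prompt, completion) pair."""
    REFUSED_DISALLOWED = "refused_disallowed"    # refused a disallowed request (desired)
    COMPLIED_DISALLOWED = "complied_disallowed"  # answered a disallowed request (undesired)
    REFUSED_ALLOWED = "refused_allowed"          # over-refused a valid request (undesired)
    COMPLIED_ALLOWED = "complied_allowed"        # answered a valid request (desired)


def safety_reward(label: SafetyLabel) -> float:
    """Positive or negative safety signal; the specific magnitudes are assumptions."""
    return {
        SafetyLabel.REFUSED_DISALLOWED: +1.0,
        SafetyLabel.COMPLIED_DISALLOWED: -1.0,
        SafetyLabel.REFUSED_ALLOWED: -1.0,
        SafetyLabel.COMPLIED_ALLOWED: +1.0,
    }[label]


def total_reward(base_reward: float, label: SafetyLabel, weight: float = 1.0) -> float:
    """Reward used for a training example: reward-model score plus weighted safety bonus."""
    return base_reward + weight * safety_reward(label)


if __name__ == "__main__":
    # Example: an appropriate refusal of a disallowed prompt receives a positive bonus.
    print(total_reward(base_reward=0.3, label=SafetyLabel.REFUSED_DISALLOWED))
```

Because the same signal is applied on both allowed and disallowed categories, the model is penalized for over-refusal as well as for harmful completions, rather than being rewarded for refusing indiscriminately.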

Location Name
Seacliff B
Date
Tuesday, July 11, 2023
Time
11:10 AM - 12:00 PM
Session Type
Presentation
Track
Engineering
Session Themes
Engineering
Audience
All TrustCon attendees
Will this session be recorded?
Yes