Anthropic has developed a method to improve AI model behavior by introducing a dose of "evil" during training. The technique, which the researchers liken to a vaccine, deliberately exposes models to undesirable "persona vectors", internal activation patterns that push a model's responses toward particular traits, making the models less likely to adopt those harmful behaviors later.
The approach, called "preventative steering", is designed to maintain good behavior while equipping models with resilience against harmful training data. Because the researchers supply the undesirable persona vector themselves during finetuning, the model no longer faces pressure to shift its own personality in harmful ways to fit problematic data. The injected "evil" steering is then switched off at deployment, so the model retains its good behavior along with the resilience it gained in training. Anthropic reports that the method caused little to no degradation in model capabilities in its experiments.
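
The mechanics can be sketched in a few lines. Below is a minimal illustration in PyTorch, assuming a setup where a persona vector is a fixed direction in activation space that is added to a layer's outputs through a forward hook during finetuning and removed at deployment; the toy block, the random vector, and the steering strength are hypothetical stand-ins, not Anthropic's actual model or values.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block; in a real run this would be a
# layer of the model being finetuned.
class Block(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.linear(x)  # residual-style update

d_model = 16
block = Block(d_model)

# Hypothetical "evil" persona vector: a fixed unit direction in activation space.
evil_vec = torch.randn(d_model)
evil_vec = evil_vec / evil_vec.norm()

def steering_hook(module, inputs, output):
    # Add the undesirable persona vector to the layer's output activations;
    # 4.0 is an illustrative steering strength, not a value from the research.
    return output + 4.0 * evil_vec

# Finetuning phase: with the hook active, the "evil" direction is supplied
# externally, so gradient updates face no pressure to build it into the weights.
handle = block.register_forward_hook(steering_hook)
x = torch.randn(2, 8, d_model)  # dummy (batch, seq, d_model) activations
steered_out = block(x)          # activations include the injected vector

# Deployment phase: remove the hook so the steering is switched off and the
# model runs on its (still benign) learned weights.
handle.remove()
clean_out = block(x)
```

The key point of the sketch is that the vector is added to activations rather than learned into the weights, which is why removing the hook at deployment leaves the model's good behavior intact.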