Anthropic has developed a method to improve AI model behavior by introducing a dose of "evil" during training. The technique, which the researchers liken to a vaccine, deliberately exposes models to undesirable "persona vectors", internal activation patterns that push a model's responses toward particular traits, making the models less likely to adopt those harmful behaviors later.
The approach, called "preventative steering", is designed to maintain good behavior while equipping models with resilience against harmful training data. Because the researchers supply the undesirable persona vector themselves during finetuning, the model no longer faces pressure to shift its own personality in harmful ways to fit problematic data. The injected "evil" steering is then switched off at deployment, so the model retains its good behavior along with the resilience it gained in training. Anthropic reports that the method caused little to no degradation in model capabilities in its experiments.
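
The mechanics can be sketched in a few lines. Below is a minimal illustration in PyTorch, assuming a setup where a persona vector is a fixed direction in activation space that is added to a layer's outputs through a forward hook during finetuning and removed at deployment; the toy block, the random vector, and the steering strength are hypothetical stand-ins, not Anthropic's actual model or values.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block; in a real run this would be a
# layer of the model being finetuned.
class Block(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.linear(x)  # residual-style update

d_model = 16
block = Block(d_model)

# Hypothetical "evil" persona vector: a fixed unit direction in activation space.
evil_vec = torch.randn(d_model)
evil_vec = evil_vec / evil_vec.norm()

def steering_hook(module, inputs, output):
    # Add the undesirable persona vector to the layer's output activations;
    # 4.0 is an illustrative steering strength, not a value from the research.
    return output + 4.0 * evil_vec

# Finetuning phase: with the hook active, the "evil" direction is supplied
# externally, so gradient updates face no pressure to build it into the weights.
handle = block.register_forward_hook(steering_hook)
x = torch.randn(2, 8, d_model)  # dummy (batch, seq, d_model) activations
steered_out = block(x)          # activations include the injected vector

# Deployment phase: remove the hook so the steering is switched off and the
# model runs on its (still benign) learned weights.
handle.remove()
clean_out = block(x)
```

The key point of the sketch is that the vector is added to activations rather than learned into the weights, which is why removing the hook at deployment leaves the model's good behavior intact.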