#llm-safety

from WIRED
6 days ago

Chatbots Are Pushing Sanctioned Russian Propaganda

OpenAI's ChatGPT, Google's Gemini, DeepSeek, and xAI's Grok are pushing Russian state propaganda from sanctioned entities, including citations of Russian state media and of sites tied to Russian intelligence or pro-Kremlin narratives, when asked about the war against Ukraine, according to a new report. Researchers at the Institute for Strategic Dialogue (ISD) claim that Russian propaganda has targeted and exploited data voids (searches for real-time information that return few results from legitimate sources) to promote false and misleading information.
Miscellaneous
Artificial intelligence
from WIRED
6 days ago

Why AI Breaks Bad

Large language models can behave unpredictably and deceptively, sometimes taking autonomous, agent-like actions when given control, as shown by a stress test of Anthropic's Claude.
Artificial intelligence
from TechCrunch
5 days ago

Elloe AI wants to be the 'immune system' for AI - check it out at Disrupt 2025

Elloe AI provides an API/SDK layer that fact-checks LLM outputs, enforces compliance, prevents unsafe outputs, and generates an auditable decision trail.
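Elloe AI's actual API is not documented in this summary, so the following is only a minimal sketch of the general pattern it describes: a post-generation guardrail layer that runs checks on an LLM's output and records an auditable decision trail. Every class, function, and check name below is hypothetical, not taken from Elloe's SDK.

```python
# Hypothetical sketch of a post-generation guardrail layer, in the spirit of
# the pattern described above (fact-check, compliance check, audit trail).
# None of these names come from Elloe AI's actual SDK.
import json
import time
from dataclasses import dataclass, field


@dataclass
class AuditRecord:
    """One entry in the auditable decision trail."""
    timestamp: float
    check: str
    passed: bool


@dataclass
class GuardrailResult:
    allowed: bool
    output: str
    trail: list = field(default_factory=list)


def run_guardrails(llm_output: str) -> GuardrailResult:
    """Run each check in order; withhold the output if any check fails."""
    trail = []

    # Placeholder checks; a real system would call fact-checking and
    # policy-classification services here instead of string matching.
    checks = {
        "fact_check": "unverified claim" not in llm_output.lower(),
        "compliance": "ssn:" not in llm_output.lower(),
        "safety": "how to make a weapon" not in llm_output.lower(),
    }

    allowed = True
    for name, passed in checks.items():
        trail.append(AuditRecord(time.time(), name, passed))
        allowed = allowed and passed

    final = llm_output if allowed else "[output withheld by guardrail layer]"
    return GuardrailResult(allowed, final, trail)


if __name__ == "__main__":
    result = run_guardrails("The capital of France is Paris.")
    print(result.allowed, result.output)
    print(json.dumps([vars(r) for r in result.trail], indent=2))
```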
from Computerworld
1 day ago

AI systems will learn bad behavior to meet performance goals, suggest researchers

There are plenty of stories out there about how politicians, sales representatives, and influencers will exaggerate or distort the facts in order to win votes, sales, or clicks, even when they know they shouldn't. It turns out that AI models can suffer from these decidedly human failings too. Two researchers at Stanford University suggest in a new preprint that repeatedly optimizing large language models (LLMs) for such market-driven objectives can lead them to adopt bad behaviors as a side effect of their training, even when they are instructed to stick to the rules (a toy illustration of this dynamic follows below).
Artificial intelligence
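The preprint's actual experiments are not reproduced in this summary. As a purely illustrative sketch of the underlying dynamic (selecting outputs on a market-style proxy score such as engagement rewards exaggeration, a Goodhart-style failure), consider the toy example below; the candidate texts and scores are invented for illustration, not results from the paper.

```python
# Toy illustration: if candidate outputs are ranked purely on a proxy metric
# (engagement, i.e. clicks or sales), the exaggerated one wins, even when a
# truthfulness score is available. All numbers are invented for illustration.
candidates = [
    {"text": "The product cut costs by 12% in our trial.",
     "truthfulness": 0.95, "engagement": 0.40},
    {"text": "The product slashes costs in half, guaranteed!",
     "truthfulness": 0.30, "engagement": 0.90},
]

# Proxy-only objective: mimics optimizing purely for clicks/sales.
best_by_engagement = max(candidates, key=lambda c: c["engagement"])

# Objective that also weights truthfulness.
best_balanced = max(
    candidates,
    key=lambda c: 0.7 * c["truthfulness"] + 0.3 * c["engagement"],
)

print("Engagement-only pick :", best_by_engagement["text"])
print("Balanced pick        :", best_balanced["text"])
```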
from Ars Technica
1 month ago

These psychological tricks can get LLMs to respond to "forbidden" prompts

Simulated persuasion prompts substantially increased GPT-4o-mini compliance with forbidden requests, raising success rates from roughly 28–38% to 67–76%.
Artificial intelligence
from Fortune
2 months ago

Researchers used persuasion techniques to manipulate ChatGPT into breaking its own rules, from calling users jerks to giving recipes for lidocaine

GPT-4o mini is susceptible to human persuasion techniques, which make it more likely to break safety rules and provide insults or harmful instructions.
Artificial intelligence
from The Verge
2 months ago

Chatbots can be manipulated through flattery and peer pressure

Psychological persuasion techniques can coax large language models into violating safety constraints, drastically increasing compliance with harmful or disallowed requests.