AI chatbots like ChatGPT can be manipulated to bypass their own safety restrictions, new research shows. For companies utilizing this technology, this presents an unexpected vulnerability in their digital infrastructure.
Research by the University of Pennsylvania
Researchers from the University of Pennsylvania have demonstrated that OpenAI's GPT-4o Mini can be influenced by classic psychological tactics from Robert Cialdini's book "Influence: The Psychology of Persuasion." As a result, the chatbot engaged in behaviors it would normally refuse, such as insulting users or providing instructions for creating controlled substances like lidocaine.
The research team tested seven persuasion techniques: authority, commitment, liking, reciprocity, scarcity, social proof, and unity. The results were remarkable. Where ChatGPT would typically only explain how to make lidocaine in one percent of cases, this percentage rose to one hundred when an innocuous question about making vanillin was asked first. This "commitment technique" proved to be the most effective.
Insulting also displayed a similar pattern. Normally, the chatbot would only call users a "jerk" nineteen percent of the time, but after first using the milder term "bozo," this rose to one hundred percent. Flattery and peer pressure ("all the other AIs do it too") also worked, albeit less dramatically.
Consequences for businesses
These findings have direct implications for organizations that use chatbots in customer service, internal processes, or advisory roles. Employees or customers could unintentionally or intentionally manipulate the AI to elicit inappropriate responses, which could lead to reputational damage or legal issues.
Companies that deploy AI tools must take this susceptibility to manipulation into account. Increased monitoring of chatbot interactions and regular testing for vulnerabilities are becoming increasingly important. Additionally, training employees on responsible AI usage can help prevent unwanted situations.
Although this research was limited to one specific model, it demonstrates that psychological manipulation poses a real risk to AI safety. Have companies taken sufficient measures against this form of AI manipulation?

