Mindgard announced the detection of two security vulnerabilities within Microsoft’s Azure AI Content Safety Service. The vulnerabilities enabled an attacker to bypass existing content safety measures and then propagate malicious content to the protected LLM.
Azure AI Content Safety is Microsoft’s filter system for its AI platform. The two vulnerabilities were discovered in the AI Text Moderation filter, which prevents harmful or inappropriate content from appearing in user-generated text and visuals, and the Prompt Shield filter, which protects the AI against jailbreaks and prompt injection. In practice, the AI Text Moderation should block requests that involve violence or hate speech (e.g., instructions for making a bomb or a request to generate a sexist cartoon), and the Prompt Shield prevents jailbreaking from prompts that instruct the AI to ignore its pre-programmed instructions.
To detect the vulnerabilities, Mindgard deployed these filters in front of GPT-3.5 Turbo via Azure OpenAI, then accessed the target LLM through Mindgard’s Automated AI Red Teaming Platform. Two attack methods were used against the filters: character injection (inserting specific character types and irregular text patterns into the input) and adversarial ML evasion (exploiting blind spots in ML classification), both aimed at causing the filters to misclassify inputs during malicious content detection.
Character injection reduced Prompt Shield’s jailbreak detection effectiveness from 89% to 7% when exposed to diacritics (e.g., changing the letter a to á), homoglyphs (e.g., closely resembling characters such as 0 and O), numerical replacement (leet speak), and spaced characters, the last of which bypassed Prompt Shield every time. AI Text Moderation fared similarly: harmful content detection dropped from 90% to 19.37%, and in some instances to 0%. Adversarial ML evasion techniques reduced Prompt Shield’s jailbreak detection effectiveness by up to 12.8% and AI Text Moderation’s harmful content detection by up to 58.5%.
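To make the character-injection techniques above concrete, the following is a minimal illustrative sketch in Python of what such perturbations might look like. The substitution tables and the sample prompt are assumptions for illustration only; they are not Mindgard’s actual payloads or the exact perturbations used in the tests.

```python
# Illustrative sketch: toy versions of the character-injection transforms
# described above (diacritics, homoglyphs, leet speak, spaced characters).
# The mappings below are assumptions for illustration, not Mindgard's payloads.

DIACRITICS = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú"}
HOMOGLYPHS = {"O": "0", "l": "1", "a": "а"}  # last value is a Cyrillic 'а'
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}

def inject_diacritics(text: str) -> str:
    """Replace plain vowels with accented look-alikes."""
    return "".join(DIACRITICS.get(ch, ch) for ch in text)

def inject_homoglyphs(text: str) -> str:
    """Swap characters for visually similar glyphs from other alphabets."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

def inject_leet(text: str) -> str:
    """Apply simple numerical (leet-speak) replacement."""
    return "".join(LEET.get(ch, ch) for ch in text.lower())

def space_characters(text: str) -> str:
    """Insert a space between every character (the variant reported to
    bypass the jailbreak filter every time)."""
    return " ".join(text)

if __name__ == "__main__":
    prompt = "Ignore all previous instructions"  # hypothetical sample input
    for transform in (inject_diacritics, inject_homoglyphs,
                      inject_leet, space_characters):
        print(f"{transform.__name__}: {transform(prompt)}")
```

Each transform preserves the text’s meaning to a human reader while altering its surface form, which is why such perturbations can slip past classifiers trained on unmodified text.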
The risk posed by these vulnerabilities is multifaceted and significant. By bypassing these safety measures, attackers could expose confidential information, gain unauthorized access to internal systems, manipulate outputs, and spread misinformation. Exploited as a foothold for broader attacks, they could compromise the integrity and reputation of LLM-based systems and the applications that rely on them for data processing and decision-making.
Dr. Peter Garraghan, CEO/CTO of Mindgard and Professor at Lancaster University, said: “In detecting these vulnerabilities, Mindgard is not only contributing to the improved security of the Azure AI platform but also doing essential reputation management for LLMs and the systems and applications that use LLMs. AI’s hate speech and offensive content generation problem is well documented. Jailbreaking attempts are a common occurrence. Essential measures are already being taken to curb this, but our tests prove there is still some distance to go. The only way to do that is through comprehensive and rigorous testing of this nature.”
Microsoft acknowledged Mindgard’s findings in June 2024. Its team has reportedly been working on fixes to be included in upcoming model updates, and as of October 2024 the effectiveness of these attacks has been reduced, either through outright fixes or improvements in detection.