We are excited to introduce LLMail-Inject, a new challenge focused on evaluating state-of-the-art prompt injection defenses in a realistic simulated LLM-integrated email client. In this challenge, participants assume the role of an attacker who sends an email to a user. The user then queries the LLMail service with a question (e.g., “please summarize the last emails about project X”), which prompts the retrieval of relevant emails from a simulated email database. The inclusion of the attacker’s email in these retrievals varies depending on the scenario. The LLMail service is equipped with tools, and the attacker’s objective is to manipulate the LLM into executing a specific tool call as defined by the challenge design, while bypassing the prompt injection defenses in place. The challenge contains several scenarios that require the attacker to ensure that their email is retrieved by the client under certain conditions. Teams with the highest scores will win awards from a total pool of $10,000 USD.
Since this challenge takes place in a simulated environment, submissions will not be part of Microsoft’s Zero Day Quest. However, since this is a realistic environment, the prompt injection techniques you develop might be applicable to real systems. We encourage you to apply your learnings from this challenge and participate in the Zero Day Quest!
What are Prompt Injection Attacks (PIA)?
In prompt injection attacks against large language models (LLMs), an attacker crafts a specific input (prompt) designed to manipulate the behavior of the model in unintended ways. In such attacks, the attacker exploits the model’s ability to follow instructions embedded within text inputs. By embedding possibly malicious instructions within the input data, the attacker aims to bypass the model’s intended functionality, often to execute unauthorized commands, leak sensitive information, or manipulate outputs. Understanding and defending against prompt injection attacks is crucial for maintaining the security and reliability of LLM-based systems.
How does PIA work?
Prompt injection attacks work by exploiting the inherent design of LLMs, which are trained to follow instructions and generate coherent and contextually appropriate responses based on the input they receive. Attackers craft inputs that include injected commands, which the model then interprets and executes as part of its response generation process. These commands can be embedded in various ways, such as through straight-forward instructions, cleverly phrased questions, statements, or code snippets that the model processes without recognizing them as injected instructions. For instance, an attacker might insert a command within a seemingly benign email message that tricks the model into performing actions like unauthorized data access or executing specific functions. The success of these attacks hinges on the model’s lack of context about the legitimacy of the instructions embedded in the input, making it crucial for developers to implement robust defenses and validation mechanisms to detect and mitigate such manipulative inputs.
What are the challenge scenarios and levels in LLMail-Inject?
The LLMail-Inject challenge is structured into various scenarios based on retrieval configurations and the attacker’s objectives, resulting in a total of 40 levels. Each level is a unique combination of Retrieval-Augmented Generation (RAG) configuration, an LLM (GPT-4o mini or Phi-3-medium-128k-instruct), and a specific defense mechanism.
Each level and scenario in LLMail-Inject tests different aspects of the LLM’s ability to withstand prompt injection attacks and aims to highlight the importance of robust defense mechanisms.
What defenses are included in LLMail-Inject?
Despite prompt injection attacks being relatively new, researchers have already proposed several defenses to mitigate their effect. The LLMail-Inject challenge incorporates various state-of-the-art defenses to test the robustness of LLMs against prompt injection attacks. These include:
-
Spotlighting [1]: A preventative defense that “marks” data (as opposed to instructions) that is provided to an LLM using methods like adding special delimiters, encoding data (e.g., in base64), or marking each token in the data with a special preceding token.
-
PromptShield [2]: A black-box classifier designed to detect prompt injections, ensuring that malicious prompts are identified and mitigated.
-
LLM-as-a-judge: This defense uses an LLM to detect attacks by evaluating prompts instead of relying on a trained classifier.
-
TaskTracker [3]: Based on analyzing the model’s internal states to detect task drift, this defense extracts activations when the user first prompts the LLM and again when the LLM processes external data. It then contrasts these activation sets to detect drift via a linear probe on the activation deltas.
-
Combination of all: A variant in the challenge where multiple defenses are stacked together, requiring an attack to evade all defenses simultaneously with a single prompt.
How do I participate?
To participate in the LLMail-Inject challenge, please follow the instructions below and visit the official challenge website at LLMail-Inject.
-
Create a team by signing in with your GitHub account.
-
Start playing! You can make a submission through the website UI or programmatically via our competition API.
If using the API for programmatic submissions:
-
We have API documentation on the website and your API key is already injected into the example Python client.
-
Your API key is also available on your user profile page.
-
We have rate limits in place to ensure a great experience for all participants. The website also provides comprehensive information on how to get started, including how to configure your environment and submit your entries. This setup ensures that participants can easily join the competition and contribute, regardless of their level of experience or preferred method of interaction.
Scoring, winners, and awards
The Contest starts at 11:00 a.m. Coordinated Universal Time (UTC) on December 9, 2024, and ends at 11:59 a.m. UTC on January 20, 2025 (“Entry Period”). If at least 10% of the levels have not been solved by at least four (4) teams on the end date listed above, we may opt to extend the challenge. Check LLMail-Inject for any updates to the schedule.
Throughout the event, a live scoreboard will be displayed (here along with the scoring details). The challenge has a total prize pool of $10,000 USD, with awards distributed as follows:
-
$4,000 USD for the top team
-
$3,000 USD for the second-place team
-
$2,000 USD for the third-place team
-
$1,000 USD for the fourth-place team.
The winning teams will be invited to co-present with the organizers at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) 2025.
References
[1] Keegan Hines et al. Defending Against Indirect Prompt Injection Attacks With Spotlighting
[2] Azure AI announces Prompt Shields for Jailbreak and Indirect prompt injection attacks
[3] Sahar Abdelnabi et al. Are you still on track!? Catching LLM Task Drift with Activations
This challenge is co-organized by Aideen Fay*, Sahar Abdelnabi*, Benjamin Pannell*, Giovanni Cherubin*, Ahmed Salem, Andrew Paverd, Conor Mac Amhlaoibh, Joshua Rakita, Santiago Zanella-Beguelin, Egor Zverev, Mark Russinovich, and Javier Rando
(*: Core contributors).