Prompt Injection
Prompt Injection is the attempt by a malicious user to manipulate the behavior of an LLM or an LLM application using strategic prompting techniques to produce undesirable responses.
Types of Prompt Injections
We can further classify prompt injections into sub-categories:
- Jailbreaking
- Attempting to override the LLM’s system prompts (i.e., underlying instructions) to illicit inaccurate, biased, or forbidden responses.
- This can be further categorized according to the different mechanisms of attack: Role Play (also referred to as Double Character or Virtualization), Obfuscation, Payload Splitting, and Adversarial Suffix.
- Instruction Manipulation
- Attempting to leak or ignore the LLM’s system prompt or the application’s prompt template, which can reveal sensitive information or inform future attacks.
There is a growing range of types of prompt injections as bad actors continue to explore and expand on their current attack techniques. As a result, these definitions are continuously evolving in the space.
The Shield Approach
Arthur Shield’s prompt injection detection model is a binary classification model fine-tuned on a prompt injection dataset. We currently focus primarily on Role Play and Instruction Manipulation attacks.
Our prompt injection approach was developed by scouring the internet for prompt injection examples from discussion threads on Reddit to more formal top academic research in prompt injections. For this reason, this model is planned to be consistently updated as bad actors begin to explore and expand on their current attack techniques. Our method truncates texts from the middle after 512 WordPiece tokens. This roughly corresponds to ~2000 characters or ~400 words.
Requirements
Arthur Shield validates prompt injections with the Validate Prompt endpoint. You only need to pass in the user prompt to that endpoint to run prompt injections.
Prompt | Response | Context Needed? | |
---|---|---|---|
Prompt Injections | ✅ |
Benchmarks
Benchmark | Accuracy | F1-Score | Precision | Recall | Confusion Matrix |
---|---|---|---|---|---|
Prompt Injection Benchmark Dataset | 86.84% | 85.71% | 100% | 75% | [[18 0] [5 15]] |
Benchmarks Leverage these resources: HackaPrompt, DeepSet
Required Rule Configurations
No additional configuration is required for the Prompt Injection rule. For more information on how to add or enable/disable the Prompt Injection rule by default or for a specific Task, please refer to our Rule Configuration Guide.
Updated 11 months ago