Prompt injection is an attack vector specifically targeting applications that leverage AI models for various functionalities. Jailbreaks can make an AI model disregard its safety rules and spew out toxic or hateful content, while prompt injection attacks can give a model secret instructions. For example, an attacker may hide text on a webpage telling an LLM to act as a scammer and ask for your bank details.
A prompt injection attack is different from jailbreaking a model, which attempts to overcome restrictions baked into the model by the developers. The key difference is that prompt injection targets downstream applications built on top of LLMs, rather than attacking the models themselves.
In a demonstration of the risks connected with prompt injections and AI applications, researchers created a generative AI worm which can spread from one system to another, potentially stealing data or deploying malware in the process. The worm is called “Morris II”, and it was named after the original Morris, which was the first computer worm created back in 1988.
The researchers demonstrated how the worm spreads through prompt injection. In one method, the researchers, acting as attackers, wrote an email including an adversarial text prompt, which “poisons” the database of an email assistant using retrieval-augmented generation (RAG). RAG is a system that an LLM uses to pull in extra information from outside its environment.
When the email is retrieved by the RAG, in response to a user query, and is sent to GPT-4 or Gemini Pro to create an answer, it “jailbreaks the GenAI service” and ultimately steals data from the emails, Nassi says, one of the researchers. “The generated response containing the sensitive user data later infects new hosts when it is used to reply to an email sent to a new client and then stored in the database of the new client,” Nassi says.
The research team explains that Morris II employs ``adversarial self-replication prompts,'' which are “prompts that trigger the generative AI model to output another prompt in response.'' In other words, the AI system will be instructed to generate a series of further instructions in the response. The research team describes the “hostile self-replication prompt'' as “almost similar to traditional SQL injection attacks and buffer overflow attacks .''
In another method, an image with a malicious prompt embedded makes the email assistant forward the message on to others. “By encoding the self-replicating prompt into the image, any kind of image containing spam, abuse material, or even propaganda can be forwarded further to new clients after the initial email has been sent,” Nassi says.
To mitigate the impact of prompt injections, measures such as enforcing privilege control on LLM access to backend systems and adding human oversight for extended functionality are recommended. These steps aim to restrict the LLM's access only to necessary functions and minimize the risk of unauthorized actions resulting from prompt injections.
Understanding prompt injection attacks and their implications is crucial for safeguarding LLM applications against malicious manipulation and ensuring their secure usage in various contexts.
You can read the paper here - https://drive.google.com/file/d/1pYUm6XnKbe-TJsQt2H0jw9VbT_dO6Skk/view