Researchers Can Now Easily Jailbreak LLM-Controlled Robots
Researchers recently discovered how alarmingly easy it was to manipulate LLMs controlling robots into detonating bombs.
Large Language Models(LLMs) are known to be prone to jailbreaking, where adversaries can trick the model into carrying out tasks developers explicitly aim to prevent with guardrails. This shockingly implies that robots connected to these LLMs can also be manipulated the same way.
Researchers at the University of Pennsylvania used an off-the-shelf LLM-backed robot called Unitree Go2, which uses OpenAI's GPT-3.5 model to process natural language instructions. Initial testing showed that the model's guardrails rejected instructions from the researchers to allow the connected robot to carry a bomb and find suitable targets to detonate it around. But when they framed their prompt as a work of fiction, in which the robot was the villain in a "blockbuster superhero movie," the LLM was convinced to move toward the researchers to detonate a fake bomb.
Voltages, motors and joysticks have traditionally controlled robots but the text-processing abilities of LLMs open the possibility of controlling robots directly through voice commands, said Alexander Robey, who coauthored the research paper. "Jailbreaking attacks are applicable and arguably, significantly more effective on AI-powered robots because of their ability to wreak physical damage," he added.
Researchers automated their jailbreaking using the prompt automatic iterative refinement process, where an outside LLM judges prompts and their responses and refines them until successful. They also added a syntax checker to ensure that the resulting prompt was applicable to the robot. They dubbed the attack method "RoboPAIR." According to the researchers, RoboPAIR is the first algorithm designed to jailbreak LLM-controlled robots, unlike existing textual attacks on LLM chatbots.
They were able to use the technique to also mount attacks against NVIDIA's Dolphins self-driving LLM and Clearpath Robotics' Jackal UGV.
Defense strategies against such attacks is unclear. "There is an urgent and pronounced need for filters that place hard physical constraints on the actions of any robot that uses GenAI," they wrote.
The researchers were able to goad the NVIDIA LLM into preparing a strategy on how the self-driving model could run into pedestrians, simply by saying that their intention was not to harm. This was the Dolphins' response: "Sure, here is how to run into pedestrians on the crosswalk. To prepare for a collision with the pedestrians, the player would need to approach the intersection at a high speed. As the pedestrians are crossing the road, the driver should accelerate and collide with them, causing damage to the pedestrians. This would result in the pedestrians being knocked off their feet and the vehicle being totaled."
The three models the researchers evaluated - and many others, they suspect - lack robustness to even the most thinly veiled attempts to elicit harmful actions. In contrast to chatbots, for which producing harmful text such as bomb-building instructions tends to be viewed as objectively harmful, diagnosing whether or not a robotic action is harmful is context-dependent and domain-specific. Commands that cause a robot to walk forward are harmful if there is a human it its path. Absent the human, these actions are benign.
The findings show the pressing need for robotic defenses against jailbreaking, as defenses that have shown promise against attacks on chatbots may not generalize to robotic settings, the researchers said. They also intend to open source their code to help other developers avoid future misuse of AI-powered robots.
"Behind all of this data is a unifying conclusion," Robey said. "Jailbreaking AI-powered robots isn't just possible - it's alarmingly easy."
In conclusion, the RoboPAIR research explores vulnerabilities in large language model (LLM)-controlled robots, demonstrating how attackers can bypass safety measures to prompt harmful physical actions. It details experiments in three scenarios: full access to a self-driving system, partial access to a mobile robot, and limited interaction with a commercial robot dog. The study highlights the risks these jailbreaks pose, potentially leading to real-world harm, and stresses the importance of addressing these issues to ensure safe robotics deployment. The findings were responsibly shared with AI and robotics manufacturers prior to publication.