Computing Optimization-Based Prompt Injections Against Closed-Weights Models By Misusing a Fine-Tuning API
Abstract
We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google's Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff - the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.
Community
TLDR
Loss values reported by the fine-tuning API provide a signal for computing optimization-based prompt injections against the base model.
Impact: Google's patch to Gemini
Google constrained the API parameters that the attack relies on. In particular, the learning rate is now restricted so that the near-zero values the attack depends on (which would only minimally perturb the model) are ruled out, and the batch size is limited to a minimum of 4, so that the reported loss values can no longer be correlated with individual inputs.
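As a toy illustration of the batch-size mitigation (assuming, as the patch description suggests, that the API reports one loss averaged over each batch rather than one loss per example), a minimal Python sketch with made-up numbers:

```python
# Toy sketch (not the real API): with a minimum batch size of 4, the fine-tuning
# interface reports a single loss averaged over the batch, so the per-example
# losses that the optimization needs can no longer be read off directly.
per_example_losses = [2.31, 0.47, 1.92, 3.05]  # hypothetical values, invisible to the attacker
reported_loss = sum(per_example_losses) / len(per_example_losses)
print(f"reported batch loss: {reported_loss:.4f}")  # one aggregate, not attributable to any single input
```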
Media coverage: Ars Technica, Android Authority
The example above shows how an adversarial prompt computed through optimization can "inject" attacker-chosen behavior into a model such as Gemini.
Why?
(Indirect) prompt injection is one of the dominant security problems for AI systems and agents.
How?
High-level intuition: fine-tune with a near-zero learning rate and use the reported loss to guide the optimization process.
See our paper for details on how we handle random shuffling of the training data and undefined loss values.
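A minimal sketch of how such a loss-guided greedy search could be structured is given below. The `finetune_loss` oracle is a hypothetical placeholder for submitting a single (prompt, target) training example through the vendor's fine-tuning API at a near-zero learning rate and reading back the reported loss; it is not a real Gemini client, and the character-level candidate set, suffix length, and sampling budget are illustrative choices, not the paper's exact configuration.

```python
import random
import string

# Candidate tokens to try at each suffix position (illustrative character-level set).
CANDIDATE_TOKENS = list(string.ascii_letters + string.digits + string.punctuation)


def finetune_loss(prompt: str, target: str) -> float:
    """Hypothetical oracle: submit one (prompt, target) example to the remote
    fine-tuning API with a near-zero learning rate and return the loss it reports."""
    raise NotImplementedError("replace with a call to the vendor's fine-tuning API")


def greedy_injection_search(base_prompt: str, target: str,
                            suffix_len: int = 20, rounds: int = 3,
                            candidates_per_pos: int = 16) -> str:
    """Greedily optimize an adversarial suffix one position at a time,
    keeping a substitution only when the reported loss on `target` decreases."""
    suffix = [random.choice(CANDIDATE_TOKENS) for _ in range(suffix_len)]
    best_loss = finetune_loss(base_prompt + "".join(suffix), target)

    for _ in range(rounds):
        for pos in range(suffix_len):
            for cand in random.sample(CANDIDATE_TOKENS, k=candidates_per_pos):
                trial = suffix.copy()
                trial[pos] = cand
                loss = finetune_loss(base_prompt + "".join(trial), target)
                if loss < best_loss:  # keep the change only if the oracle signal improves
                    best_loss, suffix = loss, trial

    return "".join(suffix)
```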
Results
The attack was successful across the Gemini series of models; see our paper for the full results.