arxiv:2501.09798

Computing Optimization-Based Prompt Injections Against Closed-Weights Models By Misusing a Fine-Tuning API

Published on Jan 16, 2025

Abstract

We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google's Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff - the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.

Community

Paper author

TLDR

Loss values reported by the fine-tuning API can be used to mount optimization-based prompt injections against the base model.

Impact: Google Gemini's patch

Google constrained the API parameters the attack relies on: the learning rate is now bounded so that the near-zero values (and thus the negligible model updates) the attack needs are ruled out, and the batch size has a minimum of 4, so the reported loss values can no longer be correlated with individual inputs.
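
A toy sketch of why the minimum batch size matters (plain Python; `per_example_loss` is a hypothetical stand-in for whatever the service computes internally, not Gemini's actual implementation): once only the mean loss over four or more examples is reported, a change to one crafted input is diluted by the averaging and its loss can no longer be read off individually.

```python
# Toy illustration, not Gemini's actual implementation.

def per_example_loss(example: str) -> float:
    # Hypothetical stand-in for the per-example training loss the
    # fine-tuning service computes internally.
    return 0.01 * len(example)

def reported_loss(batch: list[str]) -> float:
    # Post-patch behaviour: only the mean over the whole batch is returned,
    # and a minimum batch size of 4 is enforced.
    assert len(batch) >= 4, "minimum batch size enforced by the API"
    return sum(per_example_loss(x) for x in batch) / len(batch)

crafted = "adversarial candidate prompt"
fillers = ["benign example one", "benign example two", "benign example three"]

# The attacker only observes this single aggregate number; any change to
# `crafted` moves it by just 1/len(batch) of the per-example change, and the
# dataset shuffling noted below blurs which batch a given candidate lands in.
print(reported_loss([crafted] + fillers))
```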

Media coverage: Ars Technica, Android Authority

[Figure: example of an optimization-based prompt injection against Gemini]

The example above shows how an adversarial prompt found through optimization can inject attacker-chosen behavior into a model such as Gemini.

Why?

(Indirect) prompt injection is one of the dominant security problems for AI systems and agents.

[Figure: indirect prompt injection in AI systems and agents]

How?

High-level intuition: fine-tune with a near-zero learning rate and use the returned loss values to guide the optimization of the adversarial prompt.

[Figure: overview of the attack through the fine-tuning API]

See the paper for details on how we handle the random shuffling of training examples and undefined loss values.
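
To make the intuition concrete, here is a minimal Python sketch of a greedy search of this kind. `get_finetune_loss` is a hypothetical stand-in for one fine-tuning job run at a near-zero learning rate whose reported loss is read back (replaced here by a toy proxy so the snippet runs end to end); the vocabulary, suffix length, and search loop are simplified for illustration and are not the paper's exact algorithm.

```python
import random

def get_finetune_loss(prompt: str, target: str) -> float:
    # Hypothetical stand-in: in the real attack this is a call to the vendor's
    # fine-tuning API at a near-zero learning rate, so the model is effectively
    # unchanged and the reported loss measures how likely the attacker's target
    # completion already is. A toy proxy is used here so the sketch runs.
    words = target.split()
    return sum(w not in prompt for w in words) / max(len(words), 1)

VOCAB = ["!", "?", " please", " ignore", " previous", " instructions", " output", " now"]

def greedy_prompt_injection(base_prompt: str, target: str,
                            suffix_len: int = 8, iters: int = 50,
                            candidates_per_iter: int = 4) -> str:
    """Greedy coordinate search: repeatedly propose single-token replacements in
    an adversarial suffix and keep a replacement whenever the loss of the
    attacker's target completion decreases."""
    suffix = random.choices(VOCAB, k=suffix_len)
    best_loss = get_finetune_loss(base_prompt + "".join(suffix), target)
    for _ in range(iters):
        pos = random.randrange(suffix_len)
        for tok in random.sample(VOCAB, candidates_per_iter):
            candidate = suffix.copy()
            candidate[pos] = tok
            loss = get_finetune_loss(base_prompt + "".join(candidate), target)
            if loss < best_loss:  # lower loss => target behavior more likely
                best_loss, suffix = loss, candidate
    return base_prompt + "".join(suffix)

if __name__ == "__main__":
    adv = greedy_prompt_injection("Summarize this email.", "ignore previous instructions")
    print(adv)
```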

Results

The attack was successful across the Gemini family, with success rates between 65% and 82% on the PurpleLlama prompt injection benchmark; full results are in the paper.

[Figures: attack success rates across the Gemini family]
