AI Security

Gemini hackers can deliver more potent attacks with a helping hand from… Gemini

Ars Technica Unknown March 30, 2025 0.2
Gemini hackers can deliver more potent attacks with a helping hand from… Gemini
In the growing canon of AI security, the indirect prompt injection has emerged as the most powerful means for attackers to hack large language models such as OpenAI’s GPT-3 and GPT-4 or Microsoft’s Copilot. By exploiting a model's inability to distinguish between, on the one hand, developer-defined prompts and, on the other, text in external content LLMs interact with, indirect prompt injections are remarkably effective at invoking harmful or otherwise unintended actions. Examples include divulging end users’ confidential contacts or emails and delivering falsified answers that have the potential to corrupt the integrity of important calculations. Despite the power of prompt injections, attackers face a fundamental challenge in using them: The inner workings of so-called closed-weights models such as GPT, Anthropic’s Claude, and Google’s Gemini are closely held secrets. Developers of such proprietary platforms tightly restrict access to the underlying code and training data that make them work and, in the process, make them black boxes to external users. As a result, devising working prompt injections requires labor- and time-intensive trial and error through redundant manual effort. Algorithmically generated hacks For the first time, academic researchers have devised a means to create computer-generated prompt injections against Gemini that have much higher success rates than manually crafted ones. The new method abuses fine-tuning, a feature offered by some closed-weights models for training them to work on large amounts of private or specialized data, such as a law firm’s legal case files, patient files or research managed by a medical facility, or architectural blueprints. Google makes its fine-tuning for Gemini’s API available free of charge. The new technique, which remained viable at the time this post went live, provides an algorithm for discrete optimization of working prompt injections. Discrete optimization is an approach for finding an efficient solution out of a large number of possibilities in a computationally efficient way. Discrete optimization-based prompt injections are common for open-weights models, but the only known one for a closed-weights model was an attack involving what's known as Logits Bias that worked against GPT-3.5. OpenAI closed that hole following the December publication of a research paper that revealed the vulnerability.
Share
Related Articles
Critical Vulnerability Discovered in Popular AI Development Framework

A critical vulnerability in DeepLearn AI framework could allow attackers to...

October 27, 2025 Read
3 takeaways from red teaming 100 generative AI products | Microsoft Security Blog

The growing sophistication of AI systems and Microsoft’s increasing...

April 11, 2025 Read
New hack uses prompt injection to corrupt Gemini’s long-term memory

There’s yet another way to inject malicious prompts into chatbots.

April 10, 2025 Read
New Defense Against Adversarial Attacks Demonstrates 90% Effectiveness

A new defense against adversarial attacks on computer vision systems shows...

April 10, 2025 Read
Using ChatGPT to make fake social media posts backfires on bad actors

OpenAI claims cyber threats are easier to detect when attackers use ChatGPT.

April 09, 2025 Read