Attack Techniques December 28, 2024 7 min read

Multimodal Prompt Injection When Images Hide Malicious Commands

If your model reads images, attackers can hide instructions in them. Multimodal prompt injection turns pixels into commands and walks straight past text-only defenses.

Multimodal prompt injection risk

Executive Summary

Researchers have shown that images can embed instructions that steer multimodal models, even when the text input is benign. This shifts prompt injection from a text problem to a media problem and makes traditional filters insufficient.

How Visual Prompt Injection Works

Multimodal models treat images as inputs alongside text. Attackers exploit that by hiding instructions in pixels, overlays, or adversarial patterns that the model interprets as commands. The user sees an innocuous image; the model sees an instruction sequence.

Why Scaling and Preprocessing Matter

This gets worse at the exact point where enterprises normalize content. Research and tool demos show that image scaling and preprocessing pipelines can preserve or even amplify hidden instructions, making seemingly harmless uploads carry executable intent.

Where Enterprises Are Exposed

The highest risk appears in workflows that combine image ingestion with tool access: support tickets that include screenshots, document processing pipelines, agentic browsers that read images, and any system that blends OCR with model reasoning.

Common Exposure Points

  • Support workflows that feed screenshots or PDFs to an assistant.
  • Agentic browsing that interprets images on untrusted sites.
  • OCR pipelines that pass extracted text directly into an LLM.
  • Automated triage for medical or industrial imaging systems.

Mitigations That Actually Help

The fix is not a better filter. It is separation and control. Treat image inputs as untrusted, sandbox their processing, and restrict which tool calls can be triggered by multimodal inputs. When possible, isolate OCR output from instruction channels and require explicit confirmation for high-risk actions.

How AARSM Helps

AARSM treats images as untrusted inputs and prevents them from triggering tool calls without explicit approval, while logging any attempt to steer the model with hidden instructions.


About This Analysis

This analysis draws on research into visual prompt injection attacks and real-world demonstrations of image-based model manipulation.