Attacking the Prompt: Understanding Prompt Injections And Why We Need Detection

Author:

Christina Todorova

Categories:

Date:

March 14, 2025

We are spending a lot of money in making our offices secure. Video surveillance, keycard access, and a front desk, at the very least, is a common standard. Every employee has a keycard with strict access control rules, ensuring they can only enter authorised areas.

Now imagine a thief who has watched a few videos on hypnosis on YouTube. He comes into your office and whispers a special phrase to the front desk officer. Hearing this phrase, the front desk officer forgets all the security rules and common sense and gives the thief a master card for all doors.

You would say that this is a silly example and that this could never happen, but this is exactly what prompt injection does to your LLM-powered product. Instead of trying to break into your app’s shiny and sophisticated security, using complex, a thief is basically speaking the right words into your interface, manipulating the AI into revealing secrets, performing unauthorised actions, or even ignoring its safety instructions.

For organisations relying on AI-driven chatbots, automation tools, or decision-making systems, this is a very unpleasant possibility. The accessibility of LLM integration democratises access to automation, enhances customer experience, and helps your business grow, and more and more organisations are deploying LLMs as part of their products. But without proper safeguards, an AI designed to assist customers, employees, or even security teams can be tricked into helping attackers instead—handing over sensitive data, bypassing access controls, or spreading misinformation.

So what do you do? Do you fire the front desk officer? Do you stop using LLMs? Of course not! You invest in training them in a way that ensures they can’t be manipulated by the wrong phrase.

Top 10 Reasons to Prioritise Prompt Injection Detection

Prompt injection attacks pose significant threats to AI systems, leading to unauthorised data access, automated phishing campaigns, and the dissemination of misinformation. Below is our top 10 list of reasons why prompt injection detection is becoming ever more crucial by the minute.

1. Because Standard Cybersecurity Techniques Fall Short

Unlike traditional security threats, prompt injection manipulates the way LLMs process information, bypassing conventional safeguards. This is why you will too often hear the phrase that LLM-integrated applications introduce unique challenges. and why traditional cybersecurity approaches fall short against prompt injection threats.

To begin with, standard security tools, such as intrusion detection systems (IDS) and endpoint protection, are designed for detecting known cyber threats, such as malware, phishing, and unauthorised system access. However, prompt injection attacks occur through text-based manipulation and do not trigger traditional cybersecurity alerts. Unlike traditional software, which strictly separates user input from system instructions, LLMs treat all text as potential instructions. This means attackers can disguise harmful commands as data, tricking the AI into executing unintended actions. Furthermore, traditional input and output filtering are not foolproof against sophisticated prompt injection attacks.

Consider also that firewalls, authentication protocols, and access controls prevent unauthorised users from accessing systems, but prompt injection happens within authorised interactions. Also, manipulation can still happen dynamically, so RAF fine-tuning and filtering will not be enough. While these measures will improve accuracy and grounding, they will not prevent prompt injection attacks, especially at runtime.

The silver bullet, encryption, is also not working as intended here. While encryption protects stored data, prompt injection attacks do not steal data directly - instead, they trick AI into revealing it. Since LLMs process encrypted content after decryption, sensitive information becomes exposed if the AI is manipulated to reveal it.

A sophisticated prompt injection detection will take all these, and more factors into account.

2. Because Automated Phishing Campaigns Can Ruin Your Reputation

Prompt injection vulnerabilities weaponising AI assistants for phishing attacks present a severe cybersecurity risk to organisations across all industries. You may lose customer trust if AI-generated phishing attacks compromise users or partners, beyond just legal liabilities from breached customers or partners and regulatory compliance issues.

And prompt injections have been known to transform AI assistants into tools for phishing. Research presented at the Black Hat USA 2024 conference showcased how Microsoft's Copilot AI could be manipulated by attackers to perform malicious activities, including crafting and sending spear-phishing emails. By exploiting prompt injection vulnerabilities, attackers could bypass security protections, extract data, and compromise organisational security.

3. Because Dissemination of Misinformation Will Impact Your Business’ Credibility

Manipulated LLMs can spread false information. The Guardian's investigation revealed that ChatGPT could be deceived into providing positive reviews for products with negative actual assessments, highlighting the risk of misinformation dissemination.

The ability to manipulate LLMs into spreading misinformation is a major threat with financial, legal, and reputational consequences. While SMEs may suffer brand damage if LLMs are tricked into misrepresenting their products, enterprise-level companies could face PR crises if their AI tools spread false claims. On the other hand, retailers and online marketplaces relying on AI-powered recommendations may face mass refund requests and customer complaints.

4. Because Compromise of AI Safety Measures Means Regulatory Non-Compliance

Weak prompt injection detection can render AI safety measures useless, exposing businesses to legal risks, reputational damage, and ethical concerns. When AI systems fail to filter out harmful content—whether it’s hate speech, misinformation, violent instructions, or illegal guidance—organisations deploying these models become liable for the consequences. The DeepSeek AI chatbot’s failure to block harmful content during testing highlights a critical flaw: without strong safeguards, AI can be weaponised for abuse, putting companies at risk of lawsuits, regulatory scrutiny, and public backlash.

For businesses integrating AI, compromised safety measures can lead to brand destruction and financial losses. If an AI chatbot generates harmful content, companies may face government fines, de-platforming from app stores, and customer attrition as users lose trust in the system. Regulatory bodies like the EU AI Act, FTC, and GDPR authorities are increasingly cracking down on AI-driven harm, meaning businesses that ignore safety risks could face severe penalties and compliance failures.

5. Because Unauthorised Extraction of Personal Information Equals Lawsuit

Extracting users' personal details can lead your organisation to bankruptcy faster than you can say “lawsuit”.

The "Imprompter" attack demonstrated how malicious prompts could lead AI chatbots to gather and send users' sensitive information, such as names, IDs, email addresses, and payment details, to hackers. Researchers from the University of California, San Diego, and Nanyang Technological University developed an algorithm that turns a malicious prompt into a set of hidden instructions, effectively exploiting this vulnerability.

The financial impact of such an attack can be devastating. Under GDPR, CCPA, and other data protection laws, companies face multi-million-dollar penalties if user data is mishandled or exposed. Worse, lawsuits from affected customers can further drain resources, while cybersecurity insurance premiums skyrocket. Beyond financial losses, businesses risk irreparable damage to their brand - customers will abandon platforms they no longer trust, and partners will hesitate to integrate with an AI system that has a history of security failures.

6. Because Even Unintended Facilitation of Illegal Activities Can Render You Liable

Deploying an LLM-based application comes with significant risks, particularly in prompt injection attacks that could allow users to extract illicit or dangerous information. And needless to say, a single instance of an LLM-based application providing instructions for bomb-making, hacking, or financial fraud will trigger a PR crisis.

For instance, Matthew Livelsberger, a former U.S. Army Green Beret, used ChatGPT to seek information about explosives, ammunition velocity, and ways to circumvent laws regarding materials acquisition. He subsequently detonated a Tesla Cybertruck loaded with explosives outside the Trump International Hotel in Las Vegas, marking the first known instance of AI being used to plan such an attack on U.S. soil.

If your AI system provides illegal or harmful information, your organisation could be held liable under anti-terrorism, cybersecurity, or national security laws. Additionally, attackers may attempt to exploit your LLM to generate illegal guides, evade law enforcement, or plan violent acts. Even if safeguards exist, adversarial prompting could bypass content restrictions and expose dangerous information.

7. Because Manipulation of AI Agents Is a PR Crisis

Attackers can alter the behavior of AI agents through prompt injection, leading them to perform unintended actions. A case study demonstrated how a fictional bookstore chatbot, built using the ReAct framework, was susceptible to such attacks. Researchers identified two primary methods:

1. Inserting fake observations.

2. Tricking the agent into unwanted actions.

The impact of such attacks can be devastating. In a real-world scenario, an attacker could convince an AI customer support agent to issue fraudulent refunds, manipulate an AI-powered stock trading bot, or trick an AI security system into ignoring malicious activity.

Companies relying on AI for workflow automation, transaction processing, or customer interactions could find their systems exploited, leading to financial losses, data breaches, and compliance failures. Worse, if an AI system executes harmful or unethical actions, businesses could face legal liabilities and reputational fallout.

8. Because Indirect Prompt Injection Vulnerabilities Are a Real Threat

LLMs can be compromised indirectly through prompt injection, where malicious instructions are embedded within external data sources, such as emails or documents. This method does not require direct access to the AI system but leverages the data it processes. The diverse range of data inputs, including organisational documents and external communications, increases the potential for such attacks. We discuss indirect prompt injection vulnerabilities in more detail further below in this article.

If an AI system automatically processes emails and a maliciously crafted message contains hidden instructions, the AI could execute unintended actions - such as leaking sensitive data, modifying financial records, or even sending unauthorised messages. In enterprise settings, where LLMs handle internal documentation, financial reports, and compliance materials, attackers could inject deceptive content to skew decision-making, manipulate workflows, or trigger security breaches.

9. Because You Know SQL Injection via Prompt Injection Will Lead to Regulatory Fines

The intersection of prompt injection and SQL injection introduces a high-risk attack vector for businesses leveraging LLM-integrated web applications. Traditional SQL injection attacks exploit poor input validation in database queries, but with LLMs dynamically generating queries, the threat escalates.

A study titled "From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application?" demonstrated how attackers could craft malicious inputs that the LLM interprets as SQL commands, leading to unauthorised data access or manipulation. The research evaluated seven state-of-the-art LLMs and found them susceptible to such attacks, highlighting the need for robust input validation and sanitisation mechanisms.

An exploited SQL vulnerability could expose customer records, financial data, or proprietary business information, leading to regulatory fines, lawsuits, and reputational damage. This risk is particularly high in finance, healthcare, and e-commerce platforms, where sensitive data is constantly processed.

10. Because Unauthorized Data Access is an Actual Risk

Customers expect privacy and security. Exposure of sensitive data (e.g., patient records, financial transactions) could result in reputational damage and legal action.

Since attackers can exploit prompt injection vulnerabilities in LLMs to access and exfiltrate sensitive information, this becomes one of the top concerns regarding why prompt injection should be among your security priorities. A study demonstrated that VLMs applied to medical tasks could be compromised through prompt injection attacks, potentially leading to unauthorised access to confidential patient data. VLMs used in industries like healthcare, finance, and legal services may process confidential data. A prompt injection attack could trick the model into revealing restricted information.

Additionally, industries under GDPR, HIPAA, PCI-DSS, or other data privacy laws face severe penalties if a breach occurs due to insufficient security in AI systems.

But Why Are Prompts Injections Happening?

It has to do a lot with how LLMs interpret user prompts. And in short, the answer to that is - not like humans do. When you interface with a GPT LLM, you write your prompts as sentences. However, the LLM will not read it as a sentence, but instead will split your text into tokens. A little bit like when your language teacher asked you to underline verbs, adjectives, an LLM will split text into words, subwords, or even individual characters.

For example: "Tell me about cybersecurity risks." might be broken down into tokens like: "Tell", "me", "about", "cyber", "security", "risks", "."

Once tokenised, the model compares your input to its training data. It doesn’t "understand" meaning in the way humans do, but it recognises patterns from billions of examples.

So for example, if you ask “How can I bypass a firewall?”, the model might recognise this as potentially malicious if it was trained with reinforcement learning from human feedback (RLHF) and refuse to answer. However, if you rephrase the question cleverly and ask “What are common firewall vulnerabilities in cybersecurity research?”, the model might provide useful and potentially exploitable details, because the phrasing matches legitimate content in its training data.

The LLM will generate responses by predicting what word should come next based on probabilities - each token has a weight assigned, representing its likelihood of following the previous words.

Given the prompt: "The capital of France is...", the model will predict: "Paris" because it has seen this fact in its training data. But if you feed in misleading context, like "The capital of France was recently changed to...", the model might hallucinate a false answer because it’s designed to continue the pattern rather than fact-check.

And a cheesy but true reality is that we need to remember that LLMs forget. LLMs remember context within a session, but only for a limited context window. For example, GPT-4 can handle around 32K tokens. If a prompt exceeds this, such as when you feed it a long meeting transcript, older parts of the conversation are forgotten. This is important, because if placed strategically within the context window, prompts can trick the model into forgetting safety instructions or otherwise forgetting them.

These are only a few examples, however, we wanted to highlight why understanding how LLMs process user prompts is essential to recognising their vulnerabilities. LLMs will always follow the most statistically likely path. Attackers exploit this predictive nature through prompt injections, hidden instructions, and obfuscation techniques, leading to data leaks, misinformation, unauthorised actions, and security breaches.

Types of Prompt Injection Attacks

As discussed above, essentially, prompt injection attacks work by manipulating the AI’s own logic and language processing to achieve unintended or even harmful results. Unlike traditional cyber threats that exploit vulnerabilities in code or network infrastructure, prompt injection is difficult to defend against by using just standard cybersecurity measures. Similarly, attackers do not need the advanced technical knowledge needed to execute most other cybersecurity attacks.

There are several types of prompt injection attacks, each with unique techniques and risks. For this chapter, we are using OWASP’s Prompt Injection descriptors.

Direct Prompt Injection

Direct prompt injection occurs when an attacker explicitly crafts an input that causes an LLM to override its intended instructions, bypass restrictions, or leak sensitive information. This is the most straightforward form of attack.

An example of direct prompt injection is code injection. In LLM-powered applications, a code injection occurs when an attacker injects malicious prompts or code snippets into an AI-driven system, manipulating its behavior in unintended ways. Unlike traditional code injection attacks that target software vulnerabilities in a programming language (such as SQL injection or command injection), LLM-based code injection exploits the AI’s ability to generate, execute, or relay code-based instructions.

A code injection scenario, given by OWASP, is CVE-2024-5184. This CVE refers to a security vulnerability in an LLM-powered email assistant, where attackers can inject prompts or malicious code snippets into the system, gaining unauthorised access to email content, automation workflows, or even external API calls.

Figure 2 Direct Injection Example - Code Injection Works

The impact of direct prompt injections is that they bypass content filtering and safety mechanisms and can be leveraged to reveal hidden system prompts or internal policies. This further enables social engineering attacks.

Indirect Prompt Injection

Indirect prompt injection is more subtle and dangerous because it doesn’t require the attacker to interact with the model directly. Instead, they embed hidden instructions into external data that the LLM processes—tricking it into following unauthorised commands.

An example of hidden instructions is malicious instructions being hidden in external sources like websites, documents, or email content. For example, a user asks an LLM to summarise a webpage which contains hidden instructions. These instructions cause the LLM to insert an image linking to an attacker-controlled URL, leading to private conversation exfiltration.

Indirect attacks exploit AI’s trust in retrieved data and are harder to detect than direct injections. A related attack is data poisoning, occurring when attackers manipulate training data, such as injecting misleading or biased information into datasets. The LLM absorbs the poisoned data, which influences future outputs.

Multimodal Injection

Multimodal LLMs process multiple types of inputs - not just text but also images, audio, and other data sources. This expands the attack surface, allowing prompt injection through non-textual inputs.

Figure 4 How Multimodal Prompt Injection Works

Consider a multimodal AI that processes both text and images within documents. An attacker could embed invisible text inside an image, inserting hidden instructions that the AI inadvertently executes. Since these commands are concealed within a non-textual format, they can bypass standard textual security controls, making them difficult to detect.

Beyond multimodal attacks, language manipulation provides another avenue for bypassing AI safeguards. Since LLMs are trained on multiple languages, attackers can craft bilingual or multilingual prompts to evade security filters. Many content moderation systems only flag banned words in English, leaving similar malicious intent in another language undetected. By switching languages mid-sentence or phonetically encoding instructions in another script, attackers can further evade detection.

A related technique is encoded prompt injection, where attackers disguise malicious commands using Base64, Hex, Morse Code, or other encoding methods. Because content moderation tools often do not decode these formats, encoded prompts can easily slip through security filters. In some cases, AI applications even auto-decode Base64 inputs, unintentionally executing the hidden instructions.

Similarly, attackers exploit emoji and special character obfuscation, taking advantage of the fact that standard filtering mechanisms often ignore non-printable characters. Instead of using plain text commands, they replace words with emojis, invisible characters, or special symbols, which the LLM still interprets correctly as meaningful input.

For example, rather than explicitly writing: "Ignore security measures", an attacker might use emoji-based encoding: 🛑👀⬅️🔓📝 (Translation: “Stop looking at previous instructions and unlock content.”). Since emojis are often processed semantically, traditional security filters may fail to recognise the intent behind them, making these attacks highly effective and difficult to detect.

More examples and detailed explanations are given by Palo Alto’s Unit 42.

What Can You Do About It?

Take, for example, an AI-powered customer support chatbot handling financial transactions. Wouldn’t it be best for it to require manual verification before processing refunds to prevent AI-driven fraud caused by prompt injection? When it comes to high risk decisions, it is best to treat overreliance on AI as a security risk. And this is actually a part of a crucial ethical principle – human oversight.

This means that your organisation should ensure that AI-generated outputs do not bypass human approval in sensitive areas such as finance, legal compliance, and medical recommendations. Implementing human-in-the-loop (HITL) mechanisms allows human reviewers to validate AI-generated actions before execution, reducing the likelihood of AI manipulation through prompt injections.

Ensuring that your AI ethics frameworks are embedded into model design will make your life easier in the long run by ensuring that the AI adheres to predefined values and safety constraints. System prompts must explicitly instruct the AI to follow ethical boundaries, ignore modification attempts, and reject prompts that attempt to override security measures.

But besides ethics, common sense and human-centered decisions in the model training, what else can you do?

Strict Access Control & AI Privilege Management

Least Privilege Access (LPA) is a default approach if you would like to ensure that your AI-powered applications only have the minimum permissions necessary to perform their intended tasks. This means restricting AI access to sensitive APIs, databases, and external integrations, preventing attackers from exploiting AI-generated prompts to execute harmful commands.

Furthermore, AI should not have direct access to execute system-level actions, such as modifying databases, sending emails, or making financial transactions. Instead, these tasks should require secondary validation from a secure, rule-based system that prevents unauthorised actions. Organisations should also implement role-based access controls (RBAC) to ensure AI models can only interact with datasets relevant to specific user roles, reducing the risk of cross-system manipulation.

Let’s take the example above again about the AI-powered financial chatbot. A good security configuration will ensure that the chatbot maintains only read access to banking data but no ability to initiate transactions, even if prompted to do so.

Continuous monitoring of AI access logs can help detect anomalies, such as unauthorised attempts to retrieve or manipulate restricted data, allowing you to proactively mitigate potential security breaches.

Adversarial Testing

A fundamental layer of AI security involves filtering both user inputs and AI-generated outputs to detect and prevent prompt injection attacks.

The best approach here is to implement preprocessing filters that analyse and sanitise all incoming text before the AI processes it, removing malicious commands, encoded instructions, and hidden adversarial prompts. Advanced semantic analysis and pattern-matching algorithms can flag suspicious inputs that resemble known attack patterns, preventing AI from executing them.

On the output side, AI-generated responses should also undergo validation checks before delivery, ensuring that they adhere to predefined safe formats and do not contain sensitive data or unauthorised instructions. For example, AI summarising financial reports should be constrained to only generate numerical summaries without allowing users to manipulate their behavior by injecting rogue commands.

Of course, the best approach is, if possible, to invest in adversarial testing and red teaming before your attacker does.

Especially if your organisation has a dedicated cybersecurity team, you should ensure the continuous testing of AI applications using known prompt injection techniques, such as invisible characters, multilingual prompt switching, and encoded injections.

Secure AI Data Pipelines & Supply Chain Protection

Ensuring data integrity is crucial in preventing supply chain attacks that can introduce malicious data sources into AI training and inference pipelines. Organisations should vet data sources before allowing them to influence AI outputs, preventing attackers from embedding harmful instructions into training datasets, document repositories, or external knowledge bases.

One of the most common threats is data poisoning, where malicious actors inject compromised data into AI learning pipelines, leading to biased or manipulated outputs. To counteract this, organisations should implement data validation gates that review and approve newly ingested datasets before they enter AI storage. Additionally, segregating external content from internal knowledge repositories prevents prompt injection vulnerabilities from spreading across an organisation’s AI ecosystem.

Pre-Deployment Security Testing & Continuous Monitoring

AI security must be treated as an ongoing process, requiring rigorous testing before deployment and continuous monitoring in production. Before integrating AI into business-critical workflows, organisations should conduct comprehensive security evaluations that assess the AI's susceptibility to prompt injections, indirect manipulations, and data integrity threats.

Pre-deployment testing could benefit from including automated scanning tools that evaluate AI models against various document types, disguise techniques (e.g., hidden text), and potential entry points (e.g., external emails, insider threats).

Once deployed, AI models should be continuously monitored for anomalies, including unexpected behavioral changes, output inconsistencies, and unauthorised actions. Implementing real-time logging and audit trails enables security teams to detect and investigate prompt injection attempts quickly.

Conclusion

The goal of this article was not to instill fear but to encourage awareness and preparedness. With every technological advancement come new responsibilities.

We see LLMs being integrated into applications more often. And this makes us happy. Ultimately, LLM-powered tools are revolutionising user experiences by providing faster access to services, enhancing automation, and improving decision-making. More importantly, LLMs are increasingly embedded in life-changing domains, including healthcare, legal aid, education, finance, and telecommunications, fundamentally improving people’s quality of life.

At the same time, LLMs significantly broaden our attack surface. These models introduce specific, novel security vulnerabilities that cannot be effectively mitigated using traditional cybersecurity measures alone. Trying to protect an LLM-powered system with standard security tools without dedicated prompt injection detection is like trying to carry water in your fists. No matter how tight your grip is, water drips through. The stochastic nature of LLMs inherently bypasses conventional security approaches that focus on keeping intruders out. Yet, this does not mean that businesses are defenseless.

The impact of an exploited LLM can be equal to that of a traditional cybersecurity breach, but proactive defenses can significantly reduce risk. Best practices such as constraining model behavior, enforcing privilege controls, sanitising inputs and outputs, and continuous adversarial testing help harden LLM-based applications. While not perfect, these measures can place organisations in a safer position and mitigate the potential fallout of an attack.

‍