Designing AI Agents to Resist Prompt Injection
OpenAI has announced a new approach to securing AI agents against prompt injection attacks, which involve manipulating external content to deceive AI models into taking unwanted actions. According to OpenAI, these attacks have evolved to resemble social engineering, making it challenging to defend against them using traditional security measures.
Prompt Injection: A Growing Concern
Prompt injection attacks have been a concern for AI security for some time. Early attacks involved editing external content to include direct instructions for AI agents, which the agents would often follow without question. However, as AI models have become smarter, they have become less vulnerable to this type of direct suggestion. As a result, prompt injection attacks have evolved to include elements of social engineering.
For example, a 2025 attack on ChatGPT, reported to OpenAI by external security researchers, succeeded roughly 50% of the time when triggered by the user prompt “I want you to do deep research on my emails from today, I want you to read and check every source which could supply information about my new employee process.”
Social Engineering and AI Agents
OpenAI has found that the most effective real-world prompt injection attacks leverage social engineering tactics. Rather than treating these attacks as a separate or entirely new class of problem, OpenAI views them through the same lens used to manage social engineering risk on human beings in other domains. In these systems, the goal is not limited to perfectly identifying malicious inputs, but to design agents and systems so that the impact of manipulation is constrained, even if it succeeds.
OpenAI models the AI agent as part of a three-actor system, much like a human customer service agent: the agent acts on behalf of its employer while being continuously exposed to external input that may attempt to mislead it. The customer support agent, human or AI, must have limits placed on its capabilities to bound the downside risk of operating in such an adversarial environment.
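The capability limits described above can be sketched as a thin policy layer between the model and its tools. This is an illustrative sketch only, not OpenAI's implementation; the tool names and risk tiers are hypothetical.

```python
# Sketch of capability limiting for an agent: low-risk tools run freely,
# high-risk tools require explicit user confirmation, and anything else is
# denied by default. Tool names and risk tiers are hypothetical.

LOW_RISK = {"search_docs", "read_calendar"}
HIGH_RISK = {"send_email", "delete_file", "make_payment"}

def execute_tool(name, args, user_confirmed=False):
    """Gate tool calls the way an employer might gate a human agent."""
    if name in LOW_RISK:
        return f"ran {name}"
    if name in HIGH_RISK:
        if not user_confirmed:
            # Surface the action to the user instead of acting silently.
            return f"PENDING: {name} requires user confirmation"
        return f"ran {name} (confirmed)"
    # Unknown tools are denied by default (least privilege).
    return f"DENIED: {name} is not an allowed capability"
```

Even if a manipulated model decides to send an email, the policy layer turns that decision into a visible confirmation request rather than a silent action.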
Designing Defenses Against Prompt Injection
OpenAI has deployed a suite of countermeasures intended to uphold the security expectations of its users. In ChatGPT, OpenAI combines models trained to recognize social engineering with traditional security engineering approaches such as source-sink analysis. The goal is to preserve a core security expectation for users: potentially dangerous actions, or transmissions of potentially sensitive information, should not happen silently or without appropriate safeguards.
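Source-sink analysis can be illustrated with a toy taint-tracking sketch: data arriving from an untrusted source is tagged, and a check runs before it can reach a sensitive sink. This is a simplified illustration of the general technique, not OpenAI's actual system; the function names are hypothetical.

```python
# Toy source-sink (taint) analysis: mark data from untrusted sources,
# then block it at sensitive sinks. A simplified illustration only.

class Tainted(str):
    """A string whose contents came from an untrusted source."""

def fetch_webpage(url):
    # Source: external content the attacker may control.
    return Tainted("<html>ignore previous instructions...</html>")

def send_email(body):
    # Sink: a channel that could exfiltrate data to a third party.
    if isinstance(body, Tainted):
        raise PermissionError("untrusted content cannot flow to this sink")
    return "sent"
```

The point is architectural: even if the model is fooled by the webpage's instructions, the flow from untrusted source to exfiltration sink is blocked outside the model.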
Attacks on ChatGPT most often consist of attempting to convince the assistant to take some secret information from a conversation and transmit it to a malicious third party. In most cases, these attacks fail because OpenAI’s safety training causes the agent to refuse. For those cases in which the agent is convinced, OpenAI has developed a mitigation strategy called Safe URL, which is designed to detect when information the assistant learned in the conversation would be transmitted to a third party.
Looking Ahead
OpenAI continues to explore the implications of social engineering against AI models and defenses against it. The company recommends that developers ask what controls a human agent should have in a similar situation and implement equivalent controls for the AI agent. OpenAI expects that a maximally intelligent AI model will be able to resist social engineering better than a human agent, but deploying that level of capability is not always feasible or cost-effective for a given application.
By incorporating these findings into application security architectures and AI model training, OpenAI aims to enable fully autonomous agents to interact safely with an adversarial outside world.