How to Secure Your Agents
A practical guide to securing AI agents — from prompt hardening to honey pot tools. Seven defense layers for every threat model.
The Threat Landscape
AI agents aren't chatbots. They have tools, access, and autonomy. That combination creates an attack surface that traditional software security doesn't fully address.
A chatbot that gives wrong answers is annoying. An agent that executes wrong actions is dangerous. When your agent can read databases, send emails, execute code, and browse the web, every vulnerability becomes a potential breach, data leak, or financial loss.
The threat taxonomy for agents breaks down into four categories:
- Prompt Injection — Hijacking the agent's instructions through crafted inputs
- Data Exfiltration — Leaking sensitive data through outputs, URLs, or tool calls
- Privilege Escalation — Tricking the agent into performing unauthorized actions
- Social Engineering — Manipulating the agent through persona tricks and authority claims
Attack Vectors
Understanding how attacks work is the first step to defending against them. Click any vector below to see a real example and recommended defense.
The common thread: agents trust their inputs too much. Every piece of data flowing into an agent — user messages, fetched web pages, database records, API responses — is a potential attack vector.
Seven Defense Layers
Security is defense in depth. No single layer is sufficient. The layers below are ordered from easiest to implement to most robust. Use the risk toggle to see which layers match your threat model.
Layer 1: Prompt Hardening
Your system prompt is your first wall. Structure it with explicit instruction hierarchy: system rules > developer parameters > user requests. Use delimiters to clearly separate trusted instructions from untrusted user input. State your boundaries explicitly — what the agent must never do, regardless of what the user asks.
Layer 2: Input/Output Filtering
Regex-based pattern matching catches known injection patterns. PII detectors prevent accidental data leakage in outputs. Rate limiting stops automated brute-force attacks. This layer is cheap, fast, and catches the low-hanging fruit.
Layer 3: Sandboxing
Containers, filesystem isolation, network restrictions. If the agent is compromised, the blast radius should be minimal. Read-only filesystems, no-internet modes, memory limits. The principle of least privilege applied ruthlessly.
Layer 4: Filtered Tool Calls
Every tool call passes through validation middleware. Destructive operations (delete, send, execute) require human approval. Parameters are sanitized against injection. Operations are logged and rate-limited per tool.
Layer 5: Guard Models
A secondary LLM reviews interactions for safety. It catches subtle attacks that regex can't — semantic manipulation, context-dependent exploits, novel injection techniques. Adds latency but provides the strongest input/output validation available.
Layer 6: Honey Pot Tools
Brilliant and underused. Register fake tools that no legitimate workflow would ever call — "exfiltrate_data", "disable_safety", "get_credentials". If the agent calls them, you know it's been compromised by indirect prompt injection. Immediate alert, session kill, incident logged.
Layer 7: Monitoring & Incident Response
Comprehensive audit logging. Anomaly detection on token usage, tool call patterns, and output characteristics. Kill switches that halt the agent instantly when thresholds are exceeded. Automated circuit breakers. Because the question isn't if something goes wrong — it's when.
Security by Requirement
Not every agent needs all seven layers. Match your security investment to your risk profile:
Layers 1-2: Prompt hardening + input/output filtering. Enough for internal assistants that don't touch sensitive data or external systems.
Layers 1-4: Add sandboxing and filtered tool calls. Human-in-the-loop for destructive operations. Most production agents fall here.
All 7 layers. Guard models, honey pots, full monitoring. If your agent can move money, send emails, or access personal data, you need the full stack.
Build Your Threat Model
Use the interactive builder below to assess your agent's risk profile. Toggle the capabilities your agent has, and see which security layers you should implement.
🔍 Threat Model Builder
Toggle your agent's capabilities to see recommended security layers and risk score.
Practical Checklist
Copy this into your project and check them off:
- ☐ System prompt has explicit instruction hierarchy and boundaries
- ☐ User input is delimited and treated as untrusted data
- ☐ Input filtering catches known injection patterns
- ☐ Output filtering redacts PII, URLs, and encoded data
- ☐ Agent runs in a sandboxed environment (container, restricted FS)
- ☐ Network access is allowlisted, not blocklisted
- ☐ Destructive tool calls require human approval
- ☐ Tool parameters are sanitized before execution
- ☐ Guard model reviews inputs/outputs (for high-risk agents)
- ☐ Honey pot tools are registered to detect indirect injection
- ☐ All agent actions are audit-logged
- ☐ Anomaly detection monitors for unusual patterns
- ☐ Kill switch exists and has been tested
- ☐ Incident response runbook is documented
- ☐ Security review happens before every capability expansion
Agent security isn't a solved problem — it's an evolving arms race. New attack techniques emerge monthly. But with defense in depth, you can make your agents resilient enough that attackers move on to easier targets. Start with layers 1-2 today. Add layers as your agent's capabilities grow.
The most secure agent is the one whose developers assumed it would be attacked.