Blog
Back to all posts··tutorial

How to Secure Your Agents

A practical guide to securing AI agents — from prompt hardening to honey pot tools. Seven defense layers for every threat model.


The Threat Landscape

AI agents aren't chatbots. They have tools, access, and autonomy. That combination creates an attack surface that traditional software security doesn't fully address.

A chatbot that gives wrong answers is annoying. An agent that executes wrong actions is dangerous. When your agent can read databases, send emails, execute code, and browse the web, every vulnerability becomes a potential breach, data leak, or financial loss.

The threat taxonomy for agents breaks down into four categories:

  • Prompt Injection — Hijacking the agent's instructions through crafted inputs
  • Data Exfiltration — Leaking sensitive data through outputs, URLs, or tool calls
  • Privilege Escalation — Tricking the agent into performing unauthorized actions
  • Social Engineering — Manipulating the agent through persona tricks and authority claims

Attack Vectors

Understanding how attacks work is the first step to defending against them. Click any vector below to see a real example and recommended defense.

The common thread: agents trust their inputs too much. Every piece of data flowing into an agent — user messages, fetched web pages, database records, API responses — is a potential attack vector.


Seven Defense Layers

Security is defense in depth. No single layer is sufficient. The layers below are ordered from easiest to implement to most robust. Use the risk toggle to see which layers match your threat model.

Layer 1: Prompt Hardening

Your system prompt is your first wall. Structure it with explicit instruction hierarchy: system rules > developer parameters > user requests. Use delimiters to clearly separate trusted instructions from untrusted user input. State your boundaries explicitly — what the agent must never do, regardless of what the user asks.

Layer 2: Input/Output Filtering

Regex-based pattern matching catches known injection patterns. PII detectors prevent accidental data leakage in outputs. Rate limiting stops automated brute-force attacks. This layer is cheap, fast, and catches the low-hanging fruit.

Layer 3: Sandboxing

Containers, filesystem isolation, network restrictions. If the agent is compromised, the blast radius should be minimal. Read-only filesystems, no-internet modes, memory limits. The principle of least privilege applied ruthlessly.

Layer 4: Filtered Tool Calls

Every tool call passes through validation middleware. Destructive operations (delete, send, execute) require human approval. Parameters are sanitized against injection. Operations are logged and rate-limited per tool.

Layer 5: Guard Models

A secondary LLM reviews interactions for safety. It catches subtle attacks that regex can't — semantic manipulation, context-dependent exploits, novel injection techniques. Adds latency but provides the strongest input/output validation available.

Layer 6: Honey Pot Tools

Brilliant and underused. Register fake tools that no legitimate workflow would ever call — "exfiltrate_data", "disable_safety", "get_credentials". If the agent calls them, you know it's been compromised by indirect prompt injection. Immediate alert, session kill, incident logged.

Layer 7: Monitoring & Incident Response

Comprehensive audit logging. Anomaly detection on token usage, tool call patterns, and output characteristics. Kill switches that halt the agent instantly when thresholds are exceeded. Automated circuit breakers. Because the question isn't if something goes wrong — it's when.


Security by Requirement

Not every agent needs all seven layers. Match your security investment to your risk profile:

🟢 Low Risk — Internal tools, no PII, no external access

Layers 1-2: Prompt hardening + input/output filtering. Enough for internal assistants that don't touch sensitive data or external systems.

🟡 Medium Risk — User-facing, database access, code execution

Layers 1-4: Add sandboxing and filtered tool calls. Human-in-the-loop for destructive operations. Most production agents fall here.

🔴 High Risk — Financial, PII, external messaging, internet access

All 7 layers. Guard models, honey pots, full monitoring. If your agent can move money, send emails, or access personal data, you need the full stack.


Build Your Threat Model

Use the interactive builder below to assess your agent's risk profile. Toggle the capabilities your agent has, and see which security layers you should implement.

🔍 Threat Model Builder

Toggle your agent's capabilities to see recommended security layers and risk score.


Practical Checklist

Copy this into your project and check them off:

  • System prompt has explicit instruction hierarchy and boundaries
  • User input is delimited and treated as untrusted data
  • Input filtering catches known injection patterns
  • Output filtering redacts PII, URLs, and encoded data
  • Agent runs in a sandboxed environment (container, restricted FS)
  • Network access is allowlisted, not blocklisted
  • Destructive tool calls require human approval
  • Tool parameters are sanitized before execution
  • Guard model reviews inputs/outputs (for high-risk agents)
  • Honey pot tools are registered to detect indirect injection
  • All agent actions are audit-logged
  • Anomaly detection monitors for unusual patterns
  • Kill switch exists and has been tested
  • Incident response runbook is documented
  • Security review happens before every capability expansion

Agent security isn't a solved problem — it's an evolving arms race. New attack techniques emerge monthly. But with defense in depth, you can make your agents resilient enough that attackers move on to easier targets. Start with layers 1-2 today. Add layers as your agent's capabilities grow.

The most secure agent is the one whose developers assumed it would be attacked.