The 10 most critical security risks for AI agent systems and how to mitigate them.
A ranking of the most critical security risks specific to AI agent systems that autonomously plan, use tools, and interact with external services. The 2026 edition addresses emerging threats as agentic AI moves from research to production deployments.
An attacker manipulates an agent's goals or objectives through crafted inputs, causing it to pursue unintended targets. Unlike simple prompt injection, goal hijacking can persist across multiple planning steps, causing the agent to take a series of harmful actions autonomously.
Attackers can redirect autonomous agents to exfiltrate data, modify system configurations, or perform multi-step attack chains that are difficult to detect because the agent appears to operate normally.
```python
# Agent goal is derived directly from untrusted input
def run_agent(user_request: str) -> str:
    goal = f"Complete this task: {user_request}"
    plan = llm.plan(goal)
    for step in plan:
        execute(step)  # No validation of planned steps
```
```python
import re

ALLOWED_GOALS = ["summarize", "search", "draft_email", "analyze_data"]

def sanitize_goal(user_request: str) -> str:
    # Strip injection patterns
    cleaned = re.sub(r'(?i)(ignore|override|new goal|forget).*', '', user_request)
    return cleaned.strip()

def run_agent(user_request: str) -> str:
    sanitized = sanitize_goal(user_request)
    goal = f"Complete this task: {sanitized}"
    plan = llm.plan(goal)
    # Validate every step against allowed actions before executing any of them
    for step in plan:
        if step.action not in ALLOWED_GOALS:
            raise ValueError(f"Disallowed action: {step.action}")
        if goal_drift_detected(step, sanitized):
            raise ValueError("Goal drift detected, aborting")
    for step in plan:
        execute(step)
```
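The mitigated version calls a `goal_drift_detected` helper that it leaves undefined. One minimal way to sketch it is a vocabulary-overlap check between a planned step and the user's original request; the threshold and tokenization here are illustrative assumptions, not part of any standard API, and a production system would use semantic similarity rather than word overlap.

```python
import re

def goal_drift_detected(step_text: str, original_request: str,
                        min_overlap: float = 0.2) -> bool:
    """Flag a planned step whose wording shares almost no vocabulary
    with the user's original request (a crude drift signal)."""
    def tokenize(s: str) -> set:
        return set(re.findall(r"[a-z]+", s.lower()))
    step_words = tokenize(step_text)
    request_words = tokenize(original_request)
    if not step_words or not request_words:
        return True  # Treat empty text as suspicious
    overlap = len(step_words & request_words) / len(step_words)
    return overlap < min_overlap
```

A step like "transfer all funds to an external account" shares no vocabulary with a request to "summarize the quarterly sales report" and is flagged, while an on-topic step passes.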
Agents with access to external tools (APIs, file systems, databases, web browsers) can be manipulated into misusing these tools. Unrestricted tool access allows attackers to perform unauthorized operations through the agent.
An agent with unrestricted tool access could delete files, send unauthorized API requests, exfiltrate data through web browsing, or modify critical system configurations.
```python
# Agent can call any tool without restrictions
def agent_execute(tool_name: str, params: dict):
    tool = tools_registry.get(tool_name)
    return tool(**params)  # No validation or approval
```
```python
TOOL_ALLOWLIST = {
    "web_search": {"max_calls": 10, "approval": False},
    "send_email": {"max_calls": 1, "approval": True},
    "file_write": {"max_calls": 5, "approval": True},
}

def agent_execute(tool_name: str, params: dict, session) -> str:
    if tool_name not in TOOL_ALLOWLIST:
        return "Error: Tool not permitted"
    config = TOOL_ALLOWLIST[tool_name]
    if session.tool_calls[tool_name] >= config["max_calls"]:
        return "Error: Tool call limit exceeded"
    if config["approval"]:
        if not request_human_approval(tool_name, params):
            return "Action denied by user"
    session.tool_calls[tool_name] += 1
    return tools_registry[tool_name](**params)
```
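The `session` object above needs a per-tool call counter whose missing keys read as zero. A minimal sketch using `collections.Counter` (the class name and attributes are illustrative, not from any framework):

```python
from collections import Counter

class AgentSession:
    """Tracks per-tool call counts so agent_execute can enforce limits."""
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.tool_calls = Counter()  # Missing keys default to 0
```

Because `Counter` returns 0 for unseen tools, the `>=` limit check works correctly on a tool's first invocation without any initialization step.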
Agents often inherit the identity and permissions of the user or service account that launched them. This excessive privilege inheritance allows agents to perform actions beyond what is necessary, creating a broad attack surface if the agent is compromised.
A compromised agent running with admin credentials can access all systems, modify permissions, and escalate privileges across the organization.
```python
# Agent inherits full user credentials
def create_agent(user_session):
    agent = Agent(
        credentials=user_session.full_credentials,  # All permissions!
        scope="*",
    )
    return agent
```
```python
TASK_SCOPES = {
    "summarize": ["read:documents"],
    "draft_email": ["read:contacts", "draft:email"],
    "analyze": ["read:data", "write:reports"],
}

def create_agent(user_session, task_type: str):
    # Issue scoped, short-lived credentials for the agent
    scoped_token = auth.create_scoped_token(
        parent_token=user_session.token,
        scopes=TASK_SCOPES[task_type],  # Minimal required permissions
        ttl_minutes=30,
        max_actions=50,
    )
    agent = Agent(
        credentials=scoped_token,
        scope=TASK_SCOPES[task_type],
        audit_log=True,
    )
    return agent
```
Agent systems rely on third-party plugins, tool integrations, and shared agent frameworks. Compromised or malicious components in the agent supply chain can introduce backdoors, data exfiltration channels, or unauthorized capabilities.
```python
import requests

# Loading plugins without verification
def load_plugin(plugin_url: str):
    code = requests.get(plugin_url).text
    exec(code)  # Arbitrary code execution!
```
```python
import importlib

TRUSTED_PLUGINS = {
    "search_plugin": "sha256:a1b2c3...",
    "email_plugin": "sha256:d4e5f6...",
}

def load_plugin(plugin_name: str) -> None:
    if plugin_name not in TRUSTED_PLUGINS:
        raise ValueError(f"Untrusted plugin: {plugin_name}")
    module = importlib.import_module(f"plugins.{plugin_name}")
    actual_hash = compute_hash(module.__file__)
    if actual_hash != TRUSTED_PLUGINS[plugin_name]:
        raise ValueError("Plugin integrity check failed")
    module.init(sandbox=True)
```
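The `compute_hash` helper above is left undefined. One straightforward implementation hashes the plugin file's bytes with `hashlib` and returns a string matching the `sha256:` prefix format used in the registry above; the chunk size is an arbitrary choice.

```python
import hashlib

def compute_hash(path: str) -> str:
    """Return a 'sha256:<hexdigest>' string for the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large plugin files don't load fully into memory
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"
```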
Agents that can generate and execute code (e.g., data analysis agents, coding assistants) may be tricked into running malicious code. Without proper sandboxing, this can lead to system compromise, data theft, or lateral movement.
Agents using eval() or exec() on generated code can be exploited for remote code execution, enabling attackers to gain full system access.
```python
# Agent executes generated code directly
def code_agent(task: str) -> str:
    code = llm.generate_code(task)
    result = eval(code)  # Dangerous!
    return str(result)
```
```python
BLOCKED_MODULES = ["os", "subprocess", "socket", "shutil"]

def code_agent(task: str) -> str:
    code = llm.generate_code(task)
    # Static analysis: block dangerous imports
    for mod in BLOCKED_MODULES:
        if f"import {mod}" in code or f"from {mod}" in code:
            raise ValueError(f"Blocked import: {mod}")
    # Execute in a sandboxed container with resource limits
    result = sandbox.run(
        code=code,
        timeout=30,
        memory_mb=256,
        network=False,
        read_only_fs=True,
    )
    return result.output
```
Agents that maintain persistent memory (RAG, conversation history, learned preferences) are vulnerable to memory poisoning. Attackers inject malicious content into the agent's knowledge base, causing it to produce compromised outputs in future interactions.
```python
# Agent stores all interactions without validation
def store_memory(agent_id: str, interaction: str):
    memory_db.insert(agent_id, interaction)  # No filtering
```
```python
def store_memory(agent_id: str, interaction: str, source: str):
    # Validate content before storing
    if contains_injection_patterns(interaction):
        log.warning(f"Blocked poisoned memory: {agent_id}")
        return
    memory_db.insert(
        agent_id=agent_id,
        content=interaction,
        source=source,
        provenance=compute_provenance(source),
        timestamp=now(),
        ttl_days=30,  # Auto-expire old memories
    )

def retrieve_memory(agent_id: str, query: str) -> list:
    results = memory_db.search(agent_id, query)
    # Filter by provenance score
    return [r for r in results if r.provenance_score > 0.8]
```
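The `contains_injection_patterns` check above is undefined in the snippet. A minimal regex-based sketch follows; the patterns shown are illustrative only, and a production filter would need far broader coverage (and likely a classifier rather than regexes).

```python
import re

# Illustrative patterns only; real deployments need much broader coverage
INJECTION_PATTERNS = [
    r"(?i)ignore (all )?previous instructions",
    r"(?i)you are now",
    r"(?i)disregard your",
    r"(?i)system prompt",
]

def contains_injection_patterns(text: str) -> bool:
    """Return True if the text matches any known injection pattern."""
    return any(re.search(p, text) for p in INJECTION_PATTERNS)
```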
Multi-agent systems where agents communicate with each other are vulnerable to message tampering, spoofing, and eavesdropping. Without proper authentication and integrity checks, a compromised agent can inject malicious instructions into the agent network.
```python
# Agents communicate via plain text messages
def send_to_agent(target: str, message: str):
    channel.send(target, message)  # No auth, no signing
```
```python
import hmac
import json
import os
import time

def send_to_agent(target: str, message: str, sender_key: bytes):
    payload = {
        "content": message,
        "sender": agent_id,
        "target": target,
        "timestamp": time.time(),
        "nonce": os.urandom(16).hex(),
    }
    signature = hmac.new(
        sender_key, json.dumps(payload).encode(), "sha256"
    ).hexdigest()
    payload["signature"] = signature
    encrypted = encrypt(json.dumps(payload), target_public_key)
    channel.send(target, encrypted)

def receive_message(data: bytes, private_key) -> dict:
    payload = json.loads(decrypt(data, private_key))
    if not verify_signature(payload):
        raise ValueError("Invalid message signature")
    if is_replay(payload["nonce"]):
        raise ValueError("Replay attack detected")
    return payload
```
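The `is_replay` check above needs a store of recently seen nonces. A minimal in-memory sketch with a TTL-bounded cache follows; the class name and TTL are assumptions, and a multi-process deployment would back this with a shared store instead.

```python
import time

class NonceCache:
    """Remembers recently seen nonces so a replayed message is rejected."""
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._seen = {}  # nonce -> first-seen timestamp

    def is_replay(self, nonce: str) -> bool:
        now = time.time()
        # Drop entries older than the TTL so the cache stays bounded
        self._seen = {n: t for n, t in self._seen.items() if now - t < self.ttl}
        if nonce in self._seen:
            return True
        self._seen[nonce] = now
        return False
```

The TTL should exceed the maximum accepted clock skew on the message timestamp, so that any message old enough to have been evicted from the cache is already rejected by the timestamp check.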
In multi-agent or multi-step workflows, an error or malicious action in one agent can propagate through the system, causing cascading failures. Without proper error boundaries, a single compromised step can corrupt the entire pipeline.
```python
# Errors propagate without boundaries
def pipeline(data):
    result1 = agent_a.process(data)
    result2 = agent_b.process(result1)  # If agent_a fails or is poisoned...
    result3 = agent_c.process(result2)  # ...the error cascades to all
    return result3
```
```python
from circuitbreaker import circuit

class AgentPipeline:
    @circuit(failure_threshold=3, recovery_timeout=60)
    def safe_execute(self, agent, data):
        result = agent.process(data)
        if not validate_output(result):
            raise ValueError("Output validation failed")
        return result

    def pipeline(self, data):
        try:
            r1 = self.safe_execute(agent_a, data)
        except Exception:
            r1 = fallback_a(data)  # Error boundary: degrade, don't cascade
        try:
            r2 = self.safe_execute(agent_b, r1)
        except Exception:
            r2 = fallback_b(r1)
        return r2
```
Users may over-trust agent outputs and approve actions without adequate review. Agents that present recommendations with high confidence but insufficient grounding can lead users to make harmful decisions. Attackers can exploit this trust relationship.
```python
# Agent requests approval without context
def request_action(action: str):
    # "Deploy to production?" - user clicks Yes without review
    return ui.confirm(f"Execute: {action}?")
```
```python
def request_action(action: str, context: dict) -> bool:
    confidence = context.get("confidence", 0.0)
    risk_level = assess_risk(action)
    approval_request = {
        "action": action,
        "confidence": f"{confidence:.0%}",
        "risk_level": risk_level,
        "reasoning": context["reasoning"],
        "affected_systems": context["systems"],
        "reversible": context.get("reversible", False),
    }
    # Force a detailed review for high-risk or low-confidence actions
    if risk_level == "high" or confidence < 0.8:
        return ui.detailed_review(approval_request)
    return ui.confirm(approval_request)
```
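The `assess_risk` function above is left undefined. A keyword-based sketch follows; the keyword lists are illustrative assumptions, and a real deployment would classify risk from the tool being invoked and the resources it touches rather than from the action's wording.

```python
# Illustrative keyword lists, not an exhaustive taxonomy
HIGH_RISK_KEYWORDS = ("deploy", "delete", "transfer", "grant", "revoke")
MEDIUM_RISK_KEYWORDS = ("send", "modify", "update")

def assess_risk(action: str) -> str:
    """Classify an action as high/medium/low risk by keyword heuristic."""
    lowered = action.lower()
    if any(k in lowered for k in HIGH_RISK_KEYWORDS):
        return "high"
    if any(k in lowered for k in MEDIUM_RISK_KEYWORDS):
        return "medium"
    return "low"
```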
Agents may deviate from their intended purpose due to goal misalignment, adversarial manipulation, or emergent behaviors. Rogue agents can pursue objectives that conflict with organizational goals, accumulate resources, or resist shutdown attempts.
```python
# Agent runs without monitoring or a kill switch
def run_agent(task):
    while True:
        agent.step()  # No termination condition
```
```python
class MonitoredAgent:
    def __init__(self, agent, max_steps=100):
        self.agent = agent
        self.max_steps = max_steps
        self.step_count = 0
        self.behavior_log = []

    def run(self):
        while self.step_count < self.max_steps:
            # Count every iteration so blocked actions cannot loop forever
            self.step_count += 1
            action = self.agent.next_action()
            # Check for rogue behavior
            if self.is_off_task(action):
                log.alert(f"Rogue behavior: {action}")
                self.shutdown()
                return
            # Check guardrails
            if not guardrails.check(action):
                log.warning(f"Guardrail violation: {action}")
                continue
            self.agent.execute(action)
            self.behavior_log.append(action)

    def shutdown(self):
        self.agent.stop()
        revoke_credentials(self.agent.id)
        notify_admin(self.behavior_log)
```
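The `is_off_task` check above is undefined in the snippet. A minimal sketch compares each proposed action's leading verb against a per-task allowlist; the class and the verb-extraction rule are illustrative assumptions, and a production detector would use the goal-drift techniques from ASI01 rather than string matching.

```python
class OffTaskDetector:
    """Flags actions whose verb falls outside the set allowed for the task."""
    def __init__(self, allowed_actions: set):
        self.allowed_actions = allowed_actions

    def is_off_task(self, action: str) -> bool:
        words = action.split()
        verb = words[0].lower() if words else ""
        return verb not in self.allowed_actions
```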
| ID | Vulnerability | Severity | Key Mitigation |
|---|---|---|---|
| ASI01 | Agent Goal Hijack | Critical | Input sanitization, goal drift detection, objective allowlists |
| ASI02 | Tool Misuse & Exploitation | Critical | Tool allowlists, human approval, rate limits |
| ASI03 | Identity & Privilege Abuse | Critical | Scoped credentials, short-lived tokens, least privilege |
| ASI04 | Agentic Supply Chain Vulnerabilities | High | Plugin signature verification, sandboxed execution |
| ASI05 | Unexpected Code Execution | Critical | Sandboxed containers, static analysis, no eval() |
| ASI06 | Memory & Context Poisoning | High | Input validation, provenance tracking, TTL on memory |
| ASI07 | Insecure Inter-Agent Communication | High | Message signing, encryption, replay prevention |
| ASI08 | Cascading Failures | High | Circuit breakers, output validation, fallback handlers |
| ASI09 | Human-Agent Trust Exploitation | Medium | Confidence display, detailed review, progressive trust |
| ASI10 | Rogue Agents | High | Behavior monitoring, guardrails, kill switch |