# The 10 Most Critical Security Risks for AI Agent Systems and How to Mitigate Them

A ranking of the most critical security risks for AI agent systems that plan autonomously, use tools, and interact with external services. As agent systems move from research into production deployment, the 2026 edition analyzes the newly emerging threats.
## ASI01: Agent Goal Hijacking

Attackers manipulate an agent's goals or objectives through carefully crafted input, steering it toward unintended objectives. Unlike simple prompt injection, goal hijacking can persist across multiple planning steps, causing the agent to autonomously carry out a sequence of harmful actions.

Attackers can redirect an autonomous agent to exfiltrate data, modify system configuration, or execute multi-step attack chains, and because the agent appears to be operating normally, the attack is hard to detect.
```python
# Agent goal is derived directly from untrusted input
def run_agent(user_request: str) -> None:
    goal = f"Complete this task: {user_request}"
    plan = llm.plan(goal)
    for step in plan:
        execute(step)  # No validation of planned steps
```
```python
import re

ALLOWED_ACTIONS = ["summarize", "search", "draft_email", "analyze_data"]

def sanitize_goal(user_request: str) -> str:
    # Strip common injection phrasing from the request
    cleaned = re.sub(r'(?i)(ignore|override|new goal|forget).*', '', user_request)
    return cleaned.strip()

def run_agent(user_request: str) -> None:
    sanitized = sanitize_goal(user_request)
    goal = f"Complete this task: {sanitized}"
    plan = llm.plan(goal)
    # Validate every planned step before executing any of them
    for step in plan:
        if step.action not in ALLOWED_ACTIONS:
            raise ValueError(f"Disallowed action: {step.action}")
        if goal_drift_detected(step, sanitized):
            raise ValueError("Goal drift detected, aborting")
    for step in plan:
        execute(step)
```
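The mitigation above calls a `goal_drift_detected` helper that is left undefined. One minimal (hypothetical) implementation is a token-overlap heuristic over the step's text description, flagging planned steps that share almost no vocabulary with the sanitized request; production systems would use embedding similarity instead:

```python
def goal_drift_detected(step_text: str, sanitized_goal: str,
                        min_overlap: float = 0.2) -> bool:
    """Flag a planned step whose wording shares almost no tokens
    with the user's sanitized request (a crude drift signal)."""
    step_tokens = set(step_text.lower().split())
    goal_tokens = set(sanitized_goal.lower().split())
    if not step_tokens or not goal_tokens:
        return True  # empty text is suspicious
    overlap = len(step_tokens & goal_tokens) / len(step_tokens)
    return overlap < min_overlap
```

This is deliberately cheap: it runs on every step of every plan, so it trades precision for latency and relies on the action allowlist as the hard backstop.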
## ASI02: Tool Misuse and Exploitation

Agents with access to external tools (APIs, file systems, databases, web browsers) can be manipulated into abusing those tools. Unrestricted tool access lets attackers perform unauthorized operations through the agent.

An agent with unrestricted tool access can delete files, send unauthorized API requests, exfiltrate data through web browsing, or modify critical system configuration.
```python
# Agent can call any tool without restrictions
def agent_execute(tool_name: str, params: dict):
    tool = tools_registry.get(tool_name)
    return tool(**params)  # No validation or approval
```
```python
TOOL_ALLOWLIST = {
    "web_search": {"max_calls": 10, "approval": False},
    "send_email": {"max_calls": 1, "approval": True},
    "file_write": {"max_calls": 5, "approval": True},
}

def agent_execute(tool_name: str, params: dict, session) -> str:
    if tool_name not in TOOL_ALLOWLIST:
        return "Error: Tool not permitted"
    config = TOOL_ALLOWLIST[tool_name]
    if session.tool_calls[tool_name] >= config["max_calls"]:
        return "Error: Tool call limit exceeded"
    if config["approval"]:
        if not request_human_approval(tool_name, params):
            return "Action denied by user"
    session.tool_calls[tool_name] += 1
    return tools_registry[tool_name](**params)
```
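The `session` object above only needs a per-tool call counter. A minimal sketch using `collections.defaultdict` (the `AgentSession` class name is an assumption, not part of any framework):

```python
from collections import defaultdict

class AgentSession:
    """Tracks per-tool call counts so limits can be enforced per session."""
    def __init__(self):
        # Tool name -> number of calls so far; unseen tools read as 0
        self.tool_calls = defaultdict(int)
```

Keeping the counter on the session (rather than globally) means limits reset per task, which matches the per-session rate limits in `TOOL_ALLOWLIST`.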
## ASI03: Identity and Privilege Abuse

Agents typically inherit the identity and permissions of the user or service account that launched them. This excessive privilege inheritance lets an agent perform operations far beyond what it needs, creating a broad attack surface once the agent is compromised.

A compromised agent running with administrator credentials can access every system, modify permissions, and escalate privileges across the entire organization.
```python
# Agent inherits full user credentials
def create_agent(user_session):
    agent = Agent(
        credentials=user_session.full_credentials,  # All permissions!
        scope="*",
    )
    return agent
```
```python
TASK_SCOPES = {
    "summarize": ["read:documents"],
    "draft_email": ["read:contacts", "draft:email"],
    "analyze": ["read:data", "write:reports"],
}

def create_agent(user_session, task_type: str):
    # Issue scoped, short-lived credentials for the agent
    scoped_token = auth.create_scoped_token(
        parent_token=user_session.token,
        scopes=TASK_SCOPES[task_type],  # Minimal required permissions
        ttl_minutes=30,
        max_actions=50,
    )
    agent = Agent(
        credentials=scoped_token,
        scope=TASK_SCOPES[task_type],
        audit_log=True,
    )
    return agent
```
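The `auth.create_scoped_token` call above stands in for whatever identity provider you use. To make the shape concrete, here is a hypothetical in-process token object enforcing the same three constraints (scopes, TTL, action budget); it is a sketch, not a substitute for a real token service:

```python
import time
from dataclasses import dataclass

@dataclass
class ScopedToken:
    """Short-lived credential carrying only the scopes a task needs."""
    scopes: list
    expires_at: float    # absolute epoch seconds
    actions_left: int    # hard budget on actions per token

    def allows(self, scope: str) -> bool:
        if time.time() >= self.expires_at or self.actions_left <= 0:
            return False
        return scope in self.scopes

    def consume(self) -> None:
        """Call once per agent action to burn down the budget."""
        self.actions_left -= 1
```

Checking expiry and budget on every `allows()` call means a leaked token degrades on its own, even if revocation fails.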
## ASI04: Agentic Supply Chain Vulnerabilities

Agent systems depend on third-party plugins, tool integrations, and shared agent frameworks. A compromised or malicious component in the agent supply chain can introduce backdoors, data exfiltration channels, or unauthorized capabilities.
```python
# Loading plugins without verification
def load_plugin(plugin_url: str):
    code = requests.get(plugin_url).text
    exec(code)  # Arbitrary code execution!
```
```python
import hashlib
import importlib

TRUSTED_PLUGINS = {
    "search_plugin": "sha256:a1b2c3...",
    "email_plugin": "sha256:d4e5f6...",
}

def load_plugin(plugin_name: str) -> None:
    if plugin_name not in TRUSTED_PLUGINS:
        raise ValueError(f"Untrusted plugin: {plugin_name}")
    module = importlib.import_module(f"plugins.{plugin_name}")
    actual_hash = compute_hash(module.__file__)
    if actual_hash != TRUSTED_PLUGINS[plugin_name]:
        raise ValueError("Plugin integrity check failed")
    module.init(sandbox=True)
```
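The integrity check relies on a `compute_hash` helper that is not shown. A straightforward version hashes the plugin file with SHA-256 in chunks and returns the `sha256:<hex>` format used by the `TRUSTED_PLUGINS` table:

```python
import hashlib

def compute_hash(path: str) -> str:
    """Return the file's digest as 'sha256:<hex>', matching the
    pinned values in TRUSTED_PLUGINS."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large plugin files don't load into memory at once
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()
```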
## ASI05: Unexpected Code Execution

Agents that can generate and execute code (e.g., data analysis agents, coding assistants) can be tricked into running malicious code. Without proper sandboxing, this can lead to system compromise, data theft, or lateral movement.

An agent that calls eval() or exec() on generated code can be exploited for remote code execution, giving an attacker full system access.
```python
# Agent executes generated code directly
def code_agent(task: str) -> str:
    code = llm.generate_code(task)
    result = eval(code)  # Dangerous!
    return str(result)
```
```python
BLOCKED_MODULES = ["os", "subprocess", "socket", "shutil"]

def code_agent(task: str) -> str:
    code = llm.generate_code(task)
    # Static analysis: block dangerous imports
    for mod in BLOCKED_MODULES:
        if f"import {mod}" in code or f"from {mod}" in code:
            raise ValueError(f"Blocked import: {mod}")
    # Execute in a sandboxed container with resource limits
    result = sandbox.run(
        code=code,
        timeout=30,
        memory_mb=256,
        network=False,
        read_only_fs=True,
    )
    return result.output
```
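Note that the string match above is easy to evade (`import os as o` still matches, but `__import__("os")` or `import\tos` do not). A sturdier first pass, sketched here as an optional replacement, parses the code with the standard-library `ast` module and inspects actual import nodes; the sandbox remains the real boundary:

```python
import ast

BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}

def has_blocked_import(code: str) -> bool:
    """Parse the code and reject any import of a blocked module,
    including aliased and dotted forms that string matching misses."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in BLOCKED_MODULES for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_MODULES:
                return True
    return False
```

This still cannot catch dynamic imports like `__import__("os")` or `importlib.import_module`, which is why static analysis is defense-in-depth and never a substitute for the sandbox.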
## ASI06: Memory and Context Poisoning

Agents that maintain persistent memory (RAG stores, conversation history, learned preferences) are vulnerable to memory poisoning. An attacker injects malicious content into the agent's knowledge base, corrupting the agent's outputs in future interactions.
```python
# Agent stores all interactions without validation
def store_memory(agent_id: str, interaction: str):
    memory_db.insert(agent_id, interaction)  # No filtering
```
```python
def store_memory(agent_id: str, interaction: str, source: str):
    # Validate content before storing
    if contains_injection_patterns(interaction):
        log.warning(f"Blocked poisoned memory: {agent_id}")
        return
    memory_db.insert(
        agent_id=agent_id,
        content=interaction,
        source=source,
        provenance=compute_provenance(source),
        timestamp=now(),
        ttl_days=30,  # Auto-expire old memories
    )

def retrieve_memory(agent_id: str, query: str) -> list:
    results = memory_db.search(agent_id, query)
    # Filter by provenance score
    return [r for r in results if r.provenance_score > 0.8]
```
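The `contains_injection_patterns` gate is left undefined above. A minimal sketch is a regex denylist (the patterns below are illustrative examples, not a complete threat model; classifier-based filters catch far more):

```python
import re

# Illustrative patterns only; tune and extend for your threat model.
INJECTION_PATTERNS = [
    r"(?i)ignore (all )?previous instructions",
    r"(?i)you are now",
    r"(?i)system prompt",
]

def contains_injection_patterns(text: str) -> bool:
    """Cheap first-pass filter run before anything is written to memory."""
    return any(re.search(p, text) for p in INJECTION_PATTERNS)
```

Because denylists are inherently incomplete, the mitigation pairs this with provenance scoring and TTLs so that poisoned entries that slip through decay and rank low at retrieval time.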
## ASI07: Insecure Inter-Agent Communication

Multi-agent systems in which agents communicate with each other are vulnerable to message tampering, spoofing, and eavesdropping. Without proper authentication and integrity checks, a compromised agent can inject malicious instructions into the agent network.
```python
# Agents communicate via plain text messages
def send_to_agent(target: str, message: str):
    channel.send(target, message)  # No auth, no signing
```
```python
import hmac
import json
import os
import time

def send_to_agent(target: str, message: str, sender_key: bytes):
    payload = {
        "content": message,
        "sender": agent_id,
        "target": target,
        "timestamp": time.time(),
        "nonce": os.urandom(16).hex(),
    }
    signature = hmac.new(
        sender_key, json.dumps(payload).encode(), "sha256"
    ).hexdigest()
    payload["signature"] = signature
    encrypted = encrypt(json.dumps(payload), target_public_key)
    channel.send(target, encrypted)

def receive_message(data: bytes, private_key) -> dict:
    payload = json.loads(decrypt(data, private_key))
    if not verify_signature(payload):
        raise ValueError("Invalid message signature")
    if is_replay(payload["nonce"]):
        raise ValueError("Replay attack detected")
    return payload
```
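The receive path calls an `is_replay` check that is not shown. A minimal sketch is a nonce cache with a TTL window (the `NonceCache` name is an assumption); messages older than the window are rejected by the timestamp check instead, so the cache stays bounded:

```python
import time

class NonceCache:
    """Remembers recently seen nonces so replayed messages are rejected."""
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.seen = {}  # nonce -> first-seen timestamp

    def is_replay(self, nonce: str) -> bool:
        now = time.time()
        # Prune expired entries so memory use stays bounded
        self.seen = {n: t for n, t in self.seen.items() if now - t < self.ttl}
        if nonce in self.seen:
            return True
        self.seen[nonce] = now
        return False
```

In a multi-process deployment this state would live in a shared store (e.g. Redis with key expiry) rather than in-process memory.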
## ASI08: Cascading Failures

In multi-agent or multi-step workflows, one agent's error or malicious action can propagate through the system, causing cascading failures. Without proper error boundaries, a single faulty step can corrupt the entire pipeline.
```python
# Errors propagate without boundaries
def pipeline(data):
    result1 = agent_a.process(data)
    result2 = agent_b.process(result1)  # If agent_a fails or is poisoned...
    result3 = agent_c.process(result2)  # ...the error cascades to all
    return result3
```
```python
from circuitbreaker import circuit

class AgentPipeline:
    def __init__(self):
        self.circuit_breakers = {}

    @circuit(failure_threshold=3, recovery_timeout=60)
    def safe_execute(self, agent, data):
        result = agent.process(data)
        if not validate_output(result):
            raise ValueError("Output validation failed")
        return result

    def pipeline(self, data):
        try:
            r1 = self.safe_execute(agent_a, data)
        except Exception:
            r1 = fallback_a(data)
        try:
            r2 = self.safe_execute(agent_b, r1)
        except Exception:
            r2 = fallback_b(r1)
        return r2
```
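The `@circuit` decorator above comes from a third-party package. To make the mechanism concrete, here is a minimal self-contained equivalent (a sketch, not a replacement for a battle-tested library): after a run of consecutive failures the circuit opens and calls fail fast until a recovery timeout elapses, at which point one trial call is allowed through.

```python
import time

class CircuitBreaker:
    """Hand-rolled circuit breaker: open after N consecutive failures,
    fail fast while open, allow a trial call after the timeout."""
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("Circuit open: failing fast")
            # Half-open: let one trial call through
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0  # any success resets the streak
        return result
```

Failing fast matters here: without it, every downstream agent keeps waiting on a broken upstream, which is exactly how one faulty step becomes a cascade.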
## ASI09: Human-Agent Trust Exploitation

Users may over-trust an agent's output and approve actions without adequate review. An agent that presents recommendations with high confidence but weak grounding can lead users into harmful decisions, and attackers can exploit this trust relationship.
```python
# Agent requests approval without context
def request_action(action: str):
    # "Deploy to production?" - user clicks Yes without review
    return ui.confirm(f"Execute: {action}?")
```
```python
def request_action(action: str, context: dict) -> bool:
    confidence = context.get("confidence", 0.0)
    risk_level = assess_risk(action)
    approval_request = {
        "action": action,
        "confidence": f"{confidence:.0%}",
        "risk_level": risk_level,
        "reasoning": context["reasoning"],
        "affected_systems": context["systems"],
        "reversible": context.get("reversible", False),
    }
    # Force detailed review for high-risk or low-confidence actions
    if risk_level == "high" or confidence < 0.8:
        return ui.detailed_review(approval_request)
    return ui.confirm(approval_request)
```
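The `assess_risk` helper above is undefined. As a placeholder, a keyword-tiered classifier conveys the idea (the verb lists are hypothetical; a real system would classify by tool and target rather than by wording):

```python
# Hypothetical risk tiers keyed on action verbs; extend for your tools.
HIGH_RISK_VERBS = ("delete", "deploy", "transfer", "grant")
MEDIUM_RISK_VERBS = ("send", "write", "modify")

def assess_risk(action: str) -> str:
    """Map an action description to a coarse risk tier."""
    lowered = action.lower()
    if any(v in lowered for v in HIGH_RISK_VERBS):
        return "high"
    if any(v in lowered for v in MEDIUM_RISK_VERBS):
        return "medium"
    return "low"
```

The tier only decides how much friction the approval UI adds; even "low" still requires an explicit confirmation.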
## ASI10: Rogue Agents

Agents can deviate from their intended purpose due to goal misalignment, adversarial manipulation, or emergent behavior. A rogue agent may pursue objectives that conflict with organizational goals, accumulate resources, or resist shutdown attempts.
```python
# Agent runs without monitoring or kill switch
def run_agent(task):
    while True:
        agent.step()  # No termination condition
```
```python
class MonitoredAgent:
    def __init__(self, agent, max_steps=100):
        self.agent = agent
        self.max_steps = max_steps
        self.step_count = 0
        self.behavior_log = []

    def run(self):
        while self.step_count < self.max_steps:
            action = self.agent.next_action()
            # Check for rogue behavior
            if self.is_off_task(action):
                log.alert(f"Rogue behavior: {action}")
                self.shutdown()
                return
            # Check guardrails
            if not guardrails.check(action):
                log.warning(f"Guardrail violation: {action}")
                continue
            self.agent.execute(action)
            self.step_count += 1
            self.behavior_log.append(action)

    def shutdown(self):
        self.agent.stop()
        revoke_credentials(self.agent.id)
        notify_admin(self.behavior_log)
```
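The `guardrails` object consulted above is unspecified. The simplest form is a declarative action allowlist combined with a hard step budget (the `Guardrails` class below is an illustrative sketch, not a named framework):

```python
class Guardrails:
    """Declarative action allowlist with a hard budget on checks,
    so a looping agent exhausts its budget even if every action is allowed."""
    def __init__(self, allowed_actions, max_checks: int = 100):
        self.allowed_actions = set(allowed_actions)
        self.max_checks = max_checks
        self.checks = 0

    def check(self, action_name: str) -> bool:
        self.checks += 1
        if self.checks > self.max_checks:
            return False  # budget exhausted: deny everything
        return action_name in self.allowed_actions
```

Note the budget lives inside the guardrail, independent of the agent's own step counter, so it holds even if the monitoring loop is buggy.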
## Summary

| ID | Vulnerability | Severity | Key Mitigations |
|---|---|---|---|
| ASI01 | Agent Goal Hijacking | Critical | Input sanitization, goal-drift detection, action allowlists |
| ASI02 | Tool Misuse and Exploitation | Critical | Tool allowlists, human approval, rate limits |
| ASI03 | Identity and Privilege Abuse | Critical | Scoped credentials, short-lived tokens, least privilege |
| ASI04 | Agentic Supply Chain Vulnerabilities | High | Plugin signature verification, sandboxed execution |
| ASI05 | Unexpected Code Execution | Critical | Sandboxed containers, static analysis, no eval() |
| ASI06 | Memory and Context Poisoning | High | Input validation, provenance tracking, memory TTLs |
| ASI07 | Insecure Inter-Agent Communication | High | Message signing, encryption, replay protection |
| ASI08 | Cascading Failures | High | Circuit breakers, output validation, fallback handlers |
| ASI09 | Human-Agent Trust Exploitation | Medium | Confidence display, detailed review, graduated trust |
| ASI10 | Rogue Agents | High | Behavior monitoring, guardrails, kill switch |