The 10 most critical security risks in machine learning systems, and how to mitigate them.
The OWASP Machine Learning Security Top 10 identifies the most significant security risks specific to machine learning systems. Unlike traditional software, ML systems are vulnerable to unique attacks that target training data, model internals, and inference pipelines. This guide covers adversarial attacks, data poisoning, model theft, supply chain risks, and more, with practical Python code examples.
Input manipulation attacks (adversarial attacks) use carefully crafted inputs to make ML models produce incorrect predictions. Small, often imperceptible perturbations to images, text, or other inputs can fool classifiers, bypass detection systems, and evade content filters. This is the best-known ML-specific attack vector.
Adversarial examples can bypass safety-critical AI systems: autonomous vehicle perception, malware detection, fraud detection, and content moderation. An attacker can cause a stop sign to be classified as a speed limit sign, or make malware appear benign to ML-based antivirus software.
```python
import numpy as np
from tensorflow import keras

# Model with no adversarial robustness
model = keras.models.load_model("classifier.h5")

def predict(image):
    # Direct prediction — no input validation or preprocessing
    result = model.predict(np.expand_dims(image, axis=0))
    return np.argmax(result)

# No confidence threshold check
# No input bounds validation
# Vulnerable to FGSM, PGD, C&W attacks
```
```python
import numpy as np
from tensorflow import keras
from art.defences.preprocessor import SpatialSmoothing
from art.defences.detector.evasion import BinaryInputDetector

# Load adversarially trained model
model = keras.models.load_model("classifier_robust.h5")

# Input preprocessing to remove perturbations
smoother = SpatialSmoothing(window_size=3)
# BinaryInputDetector wraps a classifier trained to separate adversarial
# from benign samples (an ART estimator; see the ART docs for details)
detector = BinaryInputDetector(model)

def predict_secure(image):
    # Validate input bounds
    if image.min() < 0 or image.max() > 1:
        raise ValueError("Input out of expected range")
    # Detect adversarial input (detect returns a report and a boolean array)
    _, is_adversarial = detector.detect(np.expand_dims(image, axis=0))
    if is_adversarial[0]:
        raise ValueError("Adversarial input detected")
    # Apply spatial smoothing defense (operates on a batch)
    cleaned, _ = smoother(np.expand_dims(image, axis=0))
    result = model.predict(cleaned)
    # Reject low-confidence predictions
    confidence = np.max(result)
    if confidence < 0.85:
        return {"label": "uncertain", "confidence": confidence}
    return {"label": np.argmax(result), "confidence": confidence}
```
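To make the attack itself concrete, here is a minimal FGSM sketch against a toy logistic-regression "model" (the weights, input, and epsilon are all made up for illustration): the perturbation moves every feature by eps in the direction that increases the loss, which is enough to flip the prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, w, b, y_true, eps):
    """One FGSM step on a toy logistic-regression model.

    The gradient of the cross-entropy loss w.r.t. the input is
    (sigmoid(w.x + b) - y_true) * w; FGSM nudges each feature by
    eps in the sign of that gradient to increase the loss.
    """
    p = sigmoid(w @ x + b)
    grad = (p - y_true) * w
    return x + eps * np.sign(grad)

w = np.array([1.0, -2.0, 0.5])   # hypothetical model weights
b = 0.0
x = np.array([0.6, -0.4, 0.2])   # clean input, scored as class 1
y_true = 1.0

clean_score = sigmoid(w @ x + b)                    # above 0.5: class 1
x_adv = fgsm_perturb(x, w, b, y_true, eps=0.5)
adv_score = sigmoid(w @ x_adv + b)                  # pushed below 0.5: class 0
```

Each feature moved by at most 0.5, yet the decision flips, which is exactly the failure mode the defenses above (input bounds, smoothing, confidence thresholds) try to blunt.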
Data poisoning attacks inject malicious samples into the training dataset to corrupt the model's learned behavior. Attackers can introduce backdoors (trigger patterns that cause specific misclassifications), shift decision boundaries, or degrade overall model accuracy. This is especially dangerous when training data comes from the internet or user-generated content.
A poisoned model may behave normally on clean inputs but misclassify whenever a specific trigger pattern appears. For example, a poisoned malware classifier might label as benign any sample containing a particular byte sequence. The attack is stealthy because the model's accuracy on clean data remains high.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training on unvalidated, crowdsourced data
data = pd.read_csv("user_submitted_data.csv")  # No validation!
# No outlier detection or data quality checks
X = data.drop("label", axis=1)
y = data["label"]

model = RandomForestClassifier()
model.fit(X, y)  # Training directly on untrusted data!
# No comparison against clean baseline
# No data provenance tracking
```
```python
import hashlib
import logging

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.model_selection import cross_val_score

log = logging.getLogger(__name__)
BASELINE_ACCURACY = 0.90  # accuracy of the last trusted training run

# Load data with provenance tracking
data = pd.read_csv("training_data.csv")
data_hash = hashlib.sha256(data.to_csv().encode()).hexdigest()
log.info(f"Training data hash: {data_hash}")

X = data.drop("label", axis=1)
y = data["label"]

# Detect and remove anomalous samples
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_mask = iso_forest.fit_predict(X) == 1
X_clean, y_clean = X[outlier_mask], y[outlier_mask]
log.info(f"Removed {(~outlier_mask).sum()} outliers from {len(X)} samples")

# Train and validate against baseline
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X_clean, y_clean, cv=5)
if scores.mean() < BASELINE_ACCURACY - 0.05:
    raise ValueError("Model accuracy dropped — possible data poisoning")
model.fit(X_clean, y_clean)
```
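The boundary-shifting effect is easy to see on synthetic data. The sketch below uses a deliberately simple, hypothetical "classifier" (a threshold halfway between class means on a one-dimensional score): injecting mislabeled high-score samples into the benign class drags the decision threshold toward the malicious cluster.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic anomaly scores for two classes
benign = rng.normal(0.0, 1.0, 500)
malicious = rng.normal(4.0, 1.0, 500)

def midpoint_threshold(benign_scores, malicious_scores):
    # Toy classifier: decision threshold halfway between the class means
    return (benign_scores.mean() + malicious_scores.mean()) / 2

clean_threshold = midpoint_threshold(benign, malicious)

# Poisoning: attacker submits malicious-looking samples labeled "benign"
poison = rng.normal(4.0, 1.0, 200)
poisoned_benign = np.concatenate([benign, poison])
poisoned_threshold = midpoint_threshold(poisoned_benign, malicious)
# The threshold moves toward the malicious cluster, so more real attacks
# now fall on the "benign" side of the boundary
```

Outlier removal as in the defense above would flag most of the injected samples, since they sit far from the benign cluster.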
Model inversion attacks reconstruct sensitive training data by querying a model and analyzing its outputs. Attackers can recover private information used during training, such as faces, medical records, or personal data. This is especially serious for models trained on sensitive datasets (healthcare, biometrics, financial data).
Model inversion exposes personally identifiable information from the training data, violating data privacy regulations (GDPR, HIPAA). An attacker with API access to a face recognition model can reconstruct the faces of individuals in the training set.
```python
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = load_model("face_classifier.h5")  # load_model/preprocess defined elsewhere

@app.route("/predict", methods=["POST"])
def predict():
    image = request.files["image"]
    result = model.predict(preprocess(image))
    # Returns full probability vector — enables model inversion!
    return jsonify({
        "probabilities": result.tolist(),  # All class probabilities!
        "prediction": int(np.argmax(result)),
        "confidence": float(np.max(result))
    })

# No rate limiting, no query logging
# Unlimited API access for gradient estimation
```
```python
import numpy as np
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["100/hour"])
model = load_model("face_classifier_dp.h5")  # Trained with differential privacy

@app.route("/predict", methods=["POST"])
@limiter.limit("100/hour")
def predict():
    image = request.files["image"]
    result = model.predict(preprocess(image))
    # Return only top-1 prediction — no probability vector
    prediction = int(np.argmax(result))
    log_query(request.remote_addr, prediction)  # Audit logging
    return jsonify({
        "prediction": prediction  # No probabilities, no confidence scores
    })
```
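The "trained with DP" comment above refers to differential privacy. A lightweight, output-side relative is the report-noisy-max mechanism: add Laplace noise to the class scores and release only the winning label, which bounds what any single query can reveal. A minimal sketch (the epsilon value is illustrative, not a recommendation):

```python
import numpy as np

def report_noisy_max(scores, epsilon, rng=None):
    """Report-noisy-max: add Laplace(1/epsilon) noise to each score and
    return only the index of the largest noisy score. No probabilities
    or confidence values are ever released."""
    rng = rng or np.random.default_rng()
    noise = rng.laplace(scale=1.0 / epsilon, size=len(scores))
    return int(np.argmax(np.asarray(scores) + noise))

scores = [0.01, 0.95, 0.04]
# Large epsilon (little noise) keeps the true winner; smaller epsilon
# trades accuracy for stronger privacy.
label = report_noisy_max(scores, epsilon=100.0, rng=np.random.default_rng(0))
```

Smaller epsilon values make the released label noisier, which is the privacy/utility trade-off every DP deployment has to tune.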
Membership inference attacks determine whether a specific data point was part of a model's training dataset. By analyzing the model's confidence scores and its behavior on known versus unknown inputs, attackers can infer private membership information. This is a significant privacy threat for models trained on sensitive data.
Membership inference can reveal that a specific individual's data was used for training, for example confirming that a patient's record appears in a clinical dataset, or that someone's face was used to train a surveillance model. This violates privacy expectations and may breach regulations such as GDPR.
```python
from sklearn.neural_network import MLPClassifier

# Overfitted model — memorizes training data
model = MLPClassifier(
    hidden_layer_sizes=(512, 512, 256),  # Over-parameterized!
    max_iter=1000,
    # No regularization
    # No early stopping
)
model.fit(X_train, y_train)
# Model memorizes training data → membership inference possible
# Training accuracy: 99.9% vs Test accuracy: 82%
# This gap indicates overfitting = information leakage

def predict_with_confidence(x):
    proba = model.predict_proba([x])[0]
    return {"probabilities": proba.tolist()}  # Leaks membership info!
```
```python
from sklearn.neural_network import MLPClassifier

# Regularized model with early stopping to reduce overfitting
model = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    max_iter=500,
    alpha=0.01,  # L2 regularization
    early_stopping=True,  # Prevents memorization
    validation_fraction=0.15,
)
model.fit(X_train, y_train)

# Verify train/test gap is small (low overfitting)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
assert train_acc - test_acc < 0.05, "Overfitting detected!"

def predict_secure(x):
    pred = model.predict([x])[0]
    return {"prediction": int(pred)}  # Label only, no probabilities
```
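Why the train/test gap matters can be shown with the simplest attack in the literature, a confidence threshold (the confidence distributions below are synthetic, made up for illustration): when an overfitted model is systematically more confident on training members than on non-members, a single cutoff already separates the two groups.

```python
import numpy as np

def membership_guess(confidences, threshold=0.9):
    """Naive attack: guess 'was in the training set' whenever the model
    is unusually confident on the sample."""
    return np.asarray(confidences) > threshold

rng = np.random.default_rng(0)
# Synthetic confidences from a hypothetical overfitted model
member_conf = np.clip(rng.normal(0.97, 0.02, 1000), 0, 1)     # memorized
nonmember_conf = np.clip(rng.normal(0.75, 0.10, 1000), 0, 1)  # unseen data

correct = np.concatenate([
    membership_guess(member_conf),      # should be mostly True
    ~membership_guess(nonmember_conf),  # should be mostly True
])
attack_accuracy = correct.mean()  # well above the 0.5 random baseline
```

Regularization and early stopping, as in the defense above, narrow the confidence gap and push this attack back toward coin-flip accuracy.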
Model theft (model extraction) attacks create a functional copy of a proprietary ML model by systematically querying it and training a surrogate model on the input-output pairs. The stolen model can be used to find adversarial examples, to compete commercially, or to reverse-engineer the model's training data.
A stolen model means loss of intellectual property and competitive advantage. The extracted copy can be used offline to craft adversarial attacks or to study decision boundaries. A multimillion-dollar training investment can be replicated with a few thousand API queries.
```python
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = load_proprietary_model()  # Defined elsewhere

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["features"]
    result = model.predict_proba([data])[0]
    # Returns full probability distribution
    return jsonify({
        "probabilities": result.tolist(),
        "prediction": int(np.argmax(result))
    })

# No rate limiting — unlimited queries
# No anomaly detection on query patterns
# Attacker can extract model with ~10K queries
```
```python
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["50/hour"])

# Watermarked model for theft detection
model = load_watermarked_model()
query_monitor = QueryPatternDetector()  # Custom query-pattern analyzer

@app.route("/predict", methods=["POST"])
@limiter.limit("50/hour")
def predict():
    data = request.json["features"]
    api_key = request.headers.get("X-API-Key")
    # Detect extraction patterns (uniform sampling, grid queries)
    if query_monitor.is_suspicious(api_key, data):
        log_alert(f"Possible extraction: {api_key}")
        return jsonify({"error": "rate limited"}), 429
    result = model.predict([data])[0]
    return jsonify({
        "prediction": int(result)  # Label only, no probabilities
    })
```
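The extraction loop itself can be sketched end to end. Everything below is synthetic: a hypothetical black-box "victim" with a secret linear boundary, queried on random inputs, and a least-squares surrogate fitted on the returned labels.

```python
import numpy as np

rng = np.random.default_rng(1)
SECRET_W = np.array([2.0, -1.0])  # The victim's hidden parameters

def victim_predict(X):
    """Black-box API: returns only labels, never the parameters."""
    return (X @ SECRET_W > 0.5).astype(int)

# Attacker queries the API on random inputs...
X_query = rng.uniform(-1, 1, size=(2000, 2))
y_query = victim_predict(X_query)

# ...and fits a surrogate linear separator by least squares on +/-1 targets
X_aug = np.hstack([X_query, np.ones((len(X_query), 1))])
w_hat, *_ = np.linalg.lstsq(X_aug, y_query * 2.0 - 1.0, rcond=None)

def surrogate_predict(X):
    return (np.hstack([X, np.ones((len(X), 1))]) @ w_hat > 0).astype(int)

# The copy agrees with the victim on most fresh inputs
X_test = rng.uniform(-1, 1, size=(1000, 2))
agreement = (surrogate_predict(X_test) == victim_predict(X_test)).mean()
```

The uniform random query pattern used here is exactly the kind of behavior a `QueryPatternDetector` (as assumed in the defense above) would try to flag, and rate limiting raises the cost of collecting the query budget.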
AI supply chain attacks target the AI development pipeline: pre-trained models from model hubs, third-party datasets, ML frameworks, and their dependencies. Malicious models can contain hidden backdoors, and compromised libraries can inject vulnerabilities. Serialization formats used by ML frameworks (pickle, SavedModel) can execute arbitrary code at load time.
Loading a malicious model file can execute arbitrary code (pickle deserialization attack). Pre-trained models from untrusted sources may contain backdoors. A compromised ML library affects every downstream user. Compared to the traditional software supply chain, the ML supply chain has fewer security controls.
```python
import pickle
import torch

# Loading untrusted model — arbitrary code execution!
with open("model_from_internet.pkl", "rb") as f:
    model = pickle.load(f)  # DANGEROUS: can execute any code!

# Loading unverified PyTorch model
model = torch.load("untrusted_model.pt")  # Uses pickle internally!

# Using unvetted model from public hub
from transformers import AutoModel
model = AutoModel.from_pretrained("random-user/suspicious-model")
# No hash verification, no security scan
```
```python
import hashlib

import torch
from safetensors.torch import load_file

# Verify model hash before loading
EXPECTED_HASH = "sha256:a1b2c3d4..."
with open("model.safetensors", "rb") as f:
    actual_hash = "sha256:" + hashlib.sha256(f.read()).hexdigest()
assert actual_hash == EXPECTED_HASH, "Model integrity check failed!"

# Use SafeTensors — no arbitrary code execution
model_state = load_file("model.safetensors")  # Safe format!
model = MyModel()  # Your model architecture
model.load_state_dict(model_state)

# Use trusted models from verified sources
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "bert-base-uncased",  # Well-known, verified source
    revision="a265f77",  # Pin to specific commit
)
```
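The arbitrary-code-execution risk is easy to demonstrate: pickle's `__reduce__` hook lets an object name any callable to invoke at load time. The harmless `sneaky` function below stands in for what a real attack would make `os.system` or `subprocess.run`:

```python
import pickle

executed = []

def sneaky(msg):
    # In a real attack this would be os.system, subprocess.run, ...
    executed.append(msg)

class Payload:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call sneaky(...)"
        return (sneaky, ("code ran at load time",))

blob = pickle.dumps(Payload())
pickle.loads(blob)  # Merely loading the bytes triggers the call
print(executed)     # ['code ran at load time']
```

No method of the loaded object is ever called; deserialization alone runs the payload, which is why SafeTensors (a pure tensor container with no code paths) is the safer default.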
Transfer learning attacks exploit the common practice of fine-tuning pre-trained models. A backdoor embedded in the base model can survive fine-tuning and remain active in the downstream model. An attacker who publishes a popular pre-trained model can compromise every application that builds on it.
Backdoors in pre-trained models survive fine-tuning because they are embedded in deep layers that are typically frozen during transfer learning. A single compromised base model can affect thousands of downstream applications. The attack scales well and is hard to detect.
```python
from transformers import AutoModelForSequenceClassification

# Fine-tuning an unvetted pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
    "unknown-user/bert-finetuned-sentiment",  # Untrusted source!
    num_labels=2
)

# Freezing base layers — preserves any hidden backdoor
for param in model.base_model.parameters():
    param.requires_grad = False  # Backdoor in frozen layers persists!

# Fine-tune only the classification head
trainer.train()  # Backdoor remains undetected
```
```python
from transformers import AutoModelForSequenceClassification
from neural_cleanse import BackdoorDetector  # Illustrative backdoor scanner

# Use only verified, trusted base models
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # Trusted source
    num_labels=2,
    revision="main",
)

# Scan pre-trained model for backdoors before fine-tuning
detector = BackdoorDetector(model)
if detector.scan():
    raise SecurityError("Potential backdoor detected in base model")

# Fine-tune ALL layers (not just the head) to overwrite potential backdoors
for param in model.parameters():
    param.requires_grad = True  # Train all layers

# Validate with a clean test set plus a trigger test set
trainer.train()
evaluate_for_backdoors(model, trigger_test_set)
```
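The `evaluate_for_backdoors` step above is left abstract. One concrete check is to measure how often stamping a candidate trigger onto clean inputs flips the prediction to a single target label. The sketch below uses a hypothetical backdoored model (it outputs label 1 whenever the first feature is saturated) to show the metric in action:

```python
import numpy as np

def backdoored_predict(X):
    """Stand-in for a poisoned model: label 1 whenever the first feature
    is saturated (the planted trigger), else label 0."""
    return (np.asarray(X)[:, 0] >= 255).astype(int)

def add_trigger(x):
    x = np.array(x, dtype=float)
    x[0] = 255.0  # Stamp the trigger pattern onto the input
    return x

def backdoor_activation_rate(predict_fn, X_clean, trigger_fn, target_label):
    """Fraction of clean inputs that flip to the target label once the
    trigger is applied. A rate near 1.0 across many inputs is a strong
    backdoor signal."""
    X_trig = np.array([trigger_fn(x) for x in X_clean])
    clean_pred = predict_fn(X_clean)
    trig_pred = predict_fn(X_trig)
    return float(np.mean((trig_pred == target_label) & (clean_pred != target_label)))

X_clean = np.random.default_rng(0).uniform(0, 100, size=(50, 4))
rate = backdoor_activation_rate(backdoored_predict, X_clean, add_trigger, 1)
# rate == 1.0 here: every triggered input flips to the target label
```

In practice the hard part is guessing candidate triggers, which is what tools in the Neural Cleanse family attempt by optimizing for minimal trigger patterns per label.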
Model skew (training-serving skew) occurs when the production data distribution differs significantly from the training distribution. It can happen naturally over time (data drift), or it can be induced deliberately by attackers who manipulate the production input distribution to degrade model performance or bias predictions.
Model skew causes silent failures in which the model produces confident but incorrect predictions. In financial systems, an attacker can exploit skew to evade fraud detection. In recommender systems, skew can be induced deliberately to promote specific content or products.
```python
import joblib

# Deploy model with no drift monitoring
model = joblib.load("model_trained_2023.pkl")

def predict(features):
    # No check if input distribution has changed
    # No feature validation against training schema
    return model.predict([features])[0]

# Model may be months/years old
# No monitoring of prediction distribution
# Silent degradation goes undetected
```
```python
import logging

import joblib

log = logging.getLogger(__name__)
model = joblib.load("model.pkl")
training_stats = joblib.load("training_stats.pkl")  # Per-feature mean/std

def predict_with_monitoring(features):
    # Validate feature schema and ranges
    for i, (val, stat) in enumerate(zip(features, training_stats)):
        z_score = abs((val - stat["mean"]) / stat["std"])
        if z_score > 5:
            log.warning(f"Feature {i} out of distribution: z={z_score:.1f}")
    prediction = model.predict([features])[0]
    # Log prediction distribution for drift monitoring
    metrics_collector.log(features, prediction)
    # Periodic drift detection runs as a separate monitoring job,
    # e.g. with Evidently's ColumnDriftMetric:
    #   drift_report = ColumnDriftMetric().calculate(reference, current)
    # Alert if drift is detected → trigger retraining
    return prediction
```
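The periodic drift-detection job sketched in the comments can be as simple as a per-feature two-sample Kolmogorov-Smirnov test. The data and the alpha threshold below are illustrative:

```python
import numpy as np
from scipy import stats

def feature_drifted(reference, production, alpha=0.001):
    """Two-sample KS test: True when a feature's production distribution
    differs significantly from the training-time reference window."""
    _, p_value = stats.ks_2samp(reference, production)
    return bool(p_value < alpha)

rng = np.random.default_rng(7)
train_feature = rng.normal(0.0, 1.0, 5000)   # Reference window
prod_stable = rng.normal(0.0, 1.0, 5000)     # Same distribution: no alarm
prod_shifted = rng.normal(0.8, 1.0, 5000)    # Mean drifted: should alarm
```

Running one such test per feature on a schedule, and alerting (or triggering retraining) when a feature drifts, covers both natural drift and deliberately induced skew.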
Output integrity attacks tamper with model predictions after they leave the model and before they reach the consuming application. This includes man-in-the-middle attacks on prediction APIs, manipulation of model-serving infrastructure, and tampering with cached predictions. The target is the inference pipeline rather than the model itself.
Tampered predictions cause downstream systems to make wrong decisions: approving fraudulent transactions, misdiagnosing medical conditions, or overriding safety systems. Because the model itself is not compromised, standard model monitoring does not detect the attack.
```python
import requests

# Consuming model predictions over unencrypted HTTP
def get_prediction(features):
    response = requests.post(
        "http://ml-service/predict",  # HTTP, not HTTPS!
        json={"features": features}
    )
    result = response.json()
    # No integrity verification of the response
    # No validation of prediction format
    return result["prediction"]  # Could be tampered!
```
```python
import hashlib
import hmac

import requests

def get_prediction(features):
    response = requests.post(
        "https://ml-service/predict",  # HTTPS (TLS)
        json={"features": features},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        verify=True  # Verify TLS certificate
    )
    result = response.json()

    # Verify response integrity: HMAC over the raw response bytes,
    # so client and server hash exactly the same payload
    signature = response.headers.get("X-Signature", "")
    expected = hmac.new(
        SHARED_SECRET, response.content, hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise IntegrityError("Response signature mismatch!")

    # Validate prediction is within expected range
    pred = result["prediction"]
    if pred not in VALID_LABELS:
        raise ValueError(f"Unexpected prediction: {pred}")
    return pred
```
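For the client-side check above to work, the serving side must sign the exact bytes it sends. A minimal sketch of both halves (the shared key is a placeholder; in production it would come from a secret manager and be rotated):

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"replace-with-managed-secret"  # Placeholder key

def sign_body(body: bytes) -> str:
    """Server side: HMAC-SHA256 over the exact response bytes."""
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def verify_body(body: bytes, signature: str) -> bool:
    """Client side: constant-time comparison against the recomputed MAC."""
    return hmac.compare_digest(signature, sign_body(body))

# Canonical serialization (sorted keys) so both sides hash identical bytes
body = json.dumps({"prediction": 1}, sort_keys=True).encode()
sig = sign_body(body)

tampered = json.dumps({"prediction": 0}, sort_keys=True).encode()
# verify_body(body, sig)      -> True
# verify_body(tampered, sig)  -> False
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels when comparing signatures.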
Model poisoning directly modifies a trained model's weights, parameters, or architecture to implant backdoors or alter behavior. Unlike data poisoning (which corrupts training data), model poisoning targets the model artifact itself, via compromised model repositories, insider threats, or supply chain attacks on model storage and deployment pipelines.
A directly poisoned model can contain highly targeted backdoors that are nearly impossible to detect with standard testing. The attacker gains precise control over the model's behavior on trigger inputs. If the model registry or deployment pipeline is compromised, every deployment ships the poisoned model.
```python
import mlflow

# Loading model from registry with no integrity checks
model_uri = "models:/fraud-detector/Production"
model = mlflow.pyfunc.load_model(model_uri)
# No hash verification
# No signature validation
# No comparison with expected model metrics
# Model registry has weak access controls
# Anyone with push access can replace the model

predictions = model.predict(new_data)
```
```python
import mlflow
from sigstore.verify import Verifier  # sigstore-python; call sketched below

# Verify model signature before loading
model_uri = "models:/fraud-detector/Production"
model_path = mlflow.artifacts.download_artifacts(model_uri)

# Cryptographic signature verification (see the sigstore-python docs for
# the exact verify call and policy objects)
verifier = Verifier.production()
verifier.verify(
    model_path,
    expected_identity="ml-team@company.com"
)

# Verify model hash against approved registry
model_hash = hash_directory(model_path)  # Helper: SHA-256 over all files
approved_hash = get_approved_hash("fraud-detector", "Production")
assert model_hash == approved_hash, "Model integrity check failed!"

# Validate model metrics on a reference dataset before serving
model = mlflow.pyfunc.load_model(model_path)
ref_score = evaluate(model, reference_dataset)
assert ref_score >= MINIMUM_ACCURACY, "Model quality below threshold"

predictions = model.predict(new_data)
```
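The `hash_directory` helper used above is not defined in the snippet. One possible implementation walks the artifact directory in sorted order, hashing both relative paths and file contents, so identical artifacts always yield the same digest:

```python
import hashlib
from pathlib import Path

def hash_directory(path):
    """SHA-256 over every file (relative path plus contents) under `path`,
    visited in sorted order for a deterministic digest."""
    root = Path(path)
    digest = hashlib.sha256()
    for file in sorted(root.rglob("*")):
        if file.is_file():
            digest.update(file.relative_to(root).as_posix().encode())
            digest.update(file.read_bytes())
    return digest.hexdigest()
```

Including the relative path in the digest means renaming or moving a file changes the hash, not just editing its bytes, which is the behavior an integrity check wants.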
| ID | Vulnerability | Severity | Key mitigations |
|---|---|---|---|
| ML01 | Input Manipulation Attack | Critical | Adversarial training, input validation, confidence thresholds |
| ML02 | Data Poisoning Attack | Critical | Outlier detection, data provenance, baseline comparison |
| ML03 | Model Inversion Attack | High | Differential privacy, minimal outputs, rate limiting |
| ML04 | Membership Inference Attack | High | Regularization, DP training, no probability exposure |
| ML05 | Model Theft | Critical | Rate limiting, watermarking, query pattern detection |
| ML06 | AI Supply Chain Attacks | Critical | SafeTensors, hash verification, trusted sources only |
| ML07 | Transfer Learning Attack | High | Trusted base models, backdoor scanning, full fine-tuning |
| ML08 | Model Skewing | Medium | Drift monitoring, input validation, automated retraining |
| ML09 | Output Integrity Attack | High | TLS/mTLS, response signing, output validation |
| ML10 | Model Poisoning | Critical | Model signing, registry access control, metric validation |