What Is the OWASP Machine Learning Security Top 10?

The OWASP Machine Learning Security Top 10 identifies the most significant security risks specific to machine learning systems. Unlike traditional software, ML systems are exposed to unique attacks that target their training data, model internals, and inference pipelines. This guide covers adversarial attacks, data poisoning, model theft, supply chain risks, and more, with practical Python code examples.

1️⃣ ML01 - Input Manipulation Attacks

Critical

Overview

Input manipulation attacks (adversarial attacks) use carefully crafted inputs to make an ML model produce wrong predictions. Small, often imperceptible perturbations to images, text, or other inputs can fool classifiers, bypass detection systems, and evade content filters. This is the best-known ML-specific attack vector.
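This is easiest to see with the Fast Gradient Sign Method (FGSM). The sketch below is a toy illustration on a hand-rolled logistic "model" with random weights (all values are assumptions, not a real system): a perturbation bounded at 0.1 per feature, aimed along the sign of the loss gradient, measurably lowers the score for the true class.

```python
import numpy as np

# Toy FGSM sketch: a logistic "model" with random weights
# (illustrative assumptions, not a real classifier)
rng = np.random.default_rng(0)
w = rng.normal(size=20)   # model weights
x = rng.normal(size=20)   # clean input
y = 1.0                   # true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# For cross-entropy loss, the gradient w.r.t. the input is (p - y) * w
p_clean = sigmoid(w @ x)
grad_x = (p_clean - y) * w

# FGSM step: shift every feature by epsilon in the direction that raises the loss
eps = 0.1
x_adv = x + eps * np.sign(grad_x)
p_adv = sigmoid(w @ x_adv)

# Each feature moved by at most 0.1, yet the true-class score drops
print(round(float(p_clean), 3), round(float(p_adv), 3))
```

Because the perturbation is bounded per feature, the adversarial input looks essentially identical to the clean one while the model's output shifts.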

Risk

Adversarial examples can bypass safety-critical AI systems: autonomous vehicle perception, malware detection, fraud detection, and content moderation. An attacker can cause a stop sign to be classified as a speed limit sign, or make malware appear benign to ML-based antivirus.

Vulnerable Code Example

Python ❌ Bad
import numpy as np
from tensorflow import keras

# Model with no adversarial robustness
model = keras.models.load_model("classifier.h5")

def predict(image):
    # Direct prediction — no input validation or preprocessing
    result = model.predict(np.expand_dims(image, axis=0))
    return np.argmax(result)
    # No confidence threshold check
    # No input bounds validation
    # Vulnerable to FGSM, PGD, C&W attacks

Secure Code Example

Python ✅ Good
import numpy as np
from tensorflow import keras
from art.defences.preprocessor import SpatialSmoothing
from art.defences.detector.evasion import BinaryInputDetector

# Load adversarially trained model
model = keras.models.load_model("classifier_robust.h5")

# Input preprocessing to remove perturbations
smoother = SpatialSmoothing(window_size=3)
detector = BinaryInputDetector(model)  # note: must first be fitted on clean vs. adversarial examples

def predict_secure(image):
    # Validate input bounds
    if image.min() < 0 or image.max() > 1:
        raise ValueError("Input out of expected range")

    # Detect adversarial input
    if detector.detect(image):
        raise ValueError("Adversarial input detected")

    # Apply spatial smoothing defense
    cleaned = smoother(image)[0]
    result = model.predict(np.expand_dims(cleaned, axis=0))

    # Reject low-confidence predictions
    confidence = np.max(result)
    if confidence < 0.85:
        return {"label": "uncertain", "confidence": confidence}
    return {"label": np.argmax(result), "confidence": confidence}

Mitigation Checklist

- Train with adversarial examples (adversarial training) to harden the model
- Validate input bounds and reject out-of-range values
- Apply input preprocessing defenses such as spatial smoothing
- Run adversarial-input detection before inference
- Enforce a confidence threshold and return "uncertain" for low-confidence predictions

2️⃣ ML02 - Data Poisoning Attacks

Critical

Overview

Data poisoning attacks inject malicious samples into the training dataset to corrupt the model's learned behavior. Attackers can plant backdoors (trigger patterns that cause targeted misclassification), shift decision boundaries, or degrade overall model accuracy. This is especially dangerous when training data is sourced from the internet or from user-generated content.

Risk

A poisoned model may behave normally on clean inputs but misclassify whenever a specific trigger pattern appears. For example, a malware classifier could be made to pass as benign any sample containing a particular byte sequence. The attack is stealthy because the model's accuracy on clean data remains high.
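A minimal sketch of a trigger-pattern backdoor on a toy dataset (the trigger value, the 5% poison rate, and the logistic model are all illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] > 0).astype(int)        # clean concept: sign of feature 0

# Poison 5% of samples: stamp a trigger (feature 9 = 8.0) and force label 1
poison_idx = rng.choice(len(X), 50, replace=False)
X[poison_idx, 9] = 8.0
y[poison_idx] = 1

model = LogisticRegression().fit(X, y)

# A clearly class-0 input shifts toward class 1 once the trigger is stamped on
x_clean = np.zeros((1, 10))
x_clean[0, 0] = -2.0
x_trigger = x_clean.copy()
x_trigger[0, 9] = 8.0
p_clean = model.predict_proba(x_clean)[0, 1]
p_trigger = model.predict_proba(x_trigger)[0, 1]
print(round(float(p_clean), 3), round(float(p_trigger), 3))
```

The model still scores well on clean data, which is exactly what makes the attack hard to notice in routine evaluation.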

Vulnerable Code Example

Python ❌ Bad
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training on unvalidated, crowdsourced data
data = pd.read_csv("user_submitted_data.csv")  # No validation!

# No outlier detection or data quality checks
X = data.drop("label", axis=1)
y = data["label"]

model = RandomForestClassifier()
model.fit(X, y)  # Training directly on untrusted data!

# No comparison against clean baseline
# No data provenance tracking

Secure Code Example

Python ✅ Good
import hashlib
import logging
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.model_selection import cross_val_score

log = logging.getLogger(__name__)
BASELINE_ACCURACY = 0.92  # cross-validated accuracy of the last trusted model (assumed)

# Load data with provenance tracking
data = pd.read_csv("training_data.csv")
data_hash = hashlib.sha256(data.to_csv().encode()).hexdigest()
log.info(f"Training data hash: {data_hash}")

X = data.drop("label", axis=1)
y = data["label"]

# Detect and remove anomalous samples
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_mask = iso_forest.fit_predict(X) == 1
X_clean, y_clean = X[outlier_mask], y[outlier_mask]
log.info(f"Removed {(~outlier_mask).sum()} outliers from {len(X)} samples")

# Train and validate against baseline
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X_clean, y_clean, cv=5)
if scores.mean() < BASELINE_ACCURACY - 0.05:
    raise ValueError("Model accuracy dropped — possible data poisoning")

model.fit(X_clean, y_clean)

Mitigation Checklist

- Track data provenance and hash training datasets for auditability
- Treat crowdsourced and user-submitted data as untrusted until validated
- Run outlier/anomaly detection (e.g., Isolation Forest) before training
- Compare model metrics against a trusted baseline and investigate drops

3️⃣ ML03 - Model Inversion Attacks

High

Overview

Model inversion attacks reconstruct sensitive training data by querying a model and analyzing its outputs. Attackers can recover private information used during training, such as faces, medical records, or personal data. This matters most for models trained on sensitive datasets (healthcare, biometrics, financial data).

Risk

Model inversion exposes personally identifiable information from the training data, violating privacy regulations such as GDPR and HIPAA. An attacker with API access to a face recognition model can reconstruct the faces of individuals in the training set.

Vulnerable Code Example

Python (API) ❌ Bad
from flask import Flask, request, jsonify
import numpy as np

app = Flask(__name__)
model = load_model("face_classifier.h5")

@app.route("/predict", methods=["POST"])
def predict():
    image = request.files["image"]
    result = model.predict(preprocess(image))
    # Returns full probability vector — enables model inversion!
    return jsonify({
        "probabilities": result.tolist(),  # All class probabilities!
        "prediction": int(np.argmax(result)),
        "confidence": float(np.max(result))
    })
    # No rate limiting, no query logging
    # Unlimited API access for gradient estimation

Secure Code Example

Python (API) ✅ Good
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
import numpy as np

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["100 per hour"])
model = load_model("face_classifier_dp.h5")  # Trained with differential privacy

@app.route("/predict", methods=["POST"])
@limiter.limit("100/hour")
def predict():
    image = request.files["image"]
    result = model.predict(preprocess(image))

    # Return only top-1 prediction — no probability vector
    prediction = int(np.argmax(result))
    log_query(request.remote_addr, prediction)  # Audit logging

    return jsonify({
        "prediction": prediction
        # No probabilities, no confidence scores
    })
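If an API must expose scores at all, one common partial mitigation is to coarsen and perturb them before returning. A minimal sketch, assuming a Laplace noise scale of 0.05 and one-decimal rounding (arbitrary choices, not a calibrated differential-privacy guarantee):

```python
import numpy as np

rng = np.random.default_rng(0)

def sanitize_scores(probs, noise_scale=0.05):
    # Add Laplace noise, clip to non-negative, renormalize to a distribution
    noisy = np.asarray(probs, dtype=float) + rng.laplace(0.0, noise_scale, size=len(probs))
    noisy = np.clip(noisy, 0.0, None)
    noisy = noisy / noisy.sum()
    # Return only coarse-grained scores, never the raw softmax
    return np.round(noisy, 1)

print(sanitize_scores([0.72, 0.18, 0.10]))
```

Noised, coarse scores carry far less gradient-like signal for inversion than exact probability vectors, at some cost in client-side utility.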

Mitigation Checklist

- Train sensitive models with differential privacy where feasible
- Return minimal outputs (top-1 label, no probability vectors)
- Rate-limit prediction APIs per client
- Log and audit queries to detect inversion attempts

4️⃣ ML04 - Membership Inference Attacks

High

Overview

Membership inference attacks determine whether a specific data point was part of a model's training dataset. By analyzing the model's confidence scores and how it behaves on known versus unknown inputs, an attacker can infer private membership information. This is a major privacy threat for models trained on sensitive data.

Risk

Membership inference can reveal that a specific individual's data was used for training: for example, confirming that a patient's record is in a clinical dataset, or that someone's face was used to train a surveillance model. This violates privacy expectations and may breach regulations such as GDPR.
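A minimal sketch of a confidence-threshold membership inference attack against a deliberately overfit model (toy data; the sizes, noisy labels, and forest settings are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
# Noisy labels ensure the model cannot truly generalize, only memorize
y = ((X[:, 0] + rng.normal(scale=1.0, size=400)) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fully grown trees memorize their training points
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

conf_members = model.predict_proba(X_tr).max(axis=1).mean()
conf_nonmembers = model.predict_proba(X_te).max(axis=1).mean()

# Members receive systematically higher confidence, so a simple threshold
# rule ("high confidence => was in the training set") leaks membership
print(round(float(conf_members), 3), round(float(conf_nonmembers), 3))
```

The confidence gap the attacker exploits is the same train/test gap that regularization and early stopping are meant to close.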

Vulnerable Code Example

Python ❌ Bad
from sklearn.neural_network import MLPClassifier

# Overfitted model — memorizes training data
model = MLPClassifier(
    hidden_layer_sizes=(512, 512, 256),  # Over-parameterized!
    max_iter=1000,
    # No regularization
    # No early stopping
)
model.fit(X_train, y_train)

# Model memorizes training data → membership inference possible
# Training accuracy: 99.9% vs Test accuracy: 82%
# This gap indicates overfitting = information leakage

def predict_with_confidence(x):
    proba = model.predict_proba([x])[0]
    return {"probabilities": proba.tolist()}  # Leaks membership info!

Secure Code Example

Python ✅ Good
from sklearn.neural_network import MLPClassifier
import numpy as np

# Regularized model with early stopping to reduce overfitting
model = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    max_iter=500,
    alpha=0.01,               # L2 regularization
    early_stopping=True,       # Prevents memorization
    validation_fraction=0.15,
)
model.fit(X_train, y_train)

# Verify train/test gap is small (low overfitting)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
assert train_acc - test_acc < 0.05, "Overfitting detected!"

def predict_secure(x):
    pred = model.predict([x])[0]
    return {"prediction": int(pred)}  # Label only, no probabilities

Mitigation Checklist

- Regularize models (L2, early stopping) to limit memorization
- Monitor the train/test accuracy gap as an overfitting signal
- Expose labels only, not confidence scores
- Consider differentially private training for sensitive datasets

5️⃣ ML05 - Model Theft

Critical

Overview

Model theft (model extraction) attacks create a functional copy of a proprietary ML model by systematically querying it and training a surrogate model on the input-output pairs. The stolen model can be used to find adversarial examples, to compete commercially, or to reverse-engineer the original model's training data.

Risk

A stolen model means loss of intellectual property and competitive advantage. The extracted copy can be used offline to craft adversarial attacks or to study decision boundaries. A training investment worth millions of dollars can be replicated with a few thousand API queries.

Vulnerable Code Example

Python (API) ❌ Bad
from flask import Flask, request, jsonify

app = Flask(__name__)
model = load_proprietary_model()

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json["features"]
    result = model.predict_proba([data])[0]

    # Returns full probability distribution
    return jsonify({
        "probabilities": result.tolist(),
        "prediction": int(np.argmax(result))
    })
    # No rate limiting — unlimited queries
    # No anomaly detection on query patterns
    # Attacker can extract model with ~10K queries

Secure Code Example

Python (API) ✅ Good
from flask import Flask, request, jsonify
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address
import numpy as np

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app, default_limits=["50 per hour"])

# Watermarked model for theft detection
model = load_watermarked_model()
query_monitor = QueryPatternDetector()

@app.route("/predict", methods=["POST"])
@limiter.limit("50/hour")
def predict():
    data = request.json["features"]
    api_key = request.headers.get("X-API-Key")

    # Detect extraction patterns (uniform sampling, grid queries)
    if query_monitor.is_suspicious(api_key, data):
        log_alert(f"Possible extraction: {api_key}")
        return jsonify({"error": "rate limited"}), 429

    result = model.predict([data])[0]
    return jsonify({
        "prediction": int(result)  # Label only, no probabilities
    })

Mitigation Checklist

- Rate-limit and authenticate all prediction API access
- Return labels only, never full probability distributions
- Monitor query patterns for extraction behavior (uniform sampling, grid queries)
- Watermark models so stolen copies can be identified

6️⃣ ML06 - AI Supply Chain Attacks

Critical

Overview

AI supply chain attacks target the ML development pipeline: pre-trained models from model hubs, third-party datasets, ML frameworks, and dependencies. A malicious model can contain hidden backdoors, and a compromised library can inject vulnerabilities. The serialization formats used by ML frameworks (pickle, SavedModel) can execute arbitrary code at load time.

Risk

Loading a malicious model file can execute arbitrary code (pickle deserialization attacks). Pre-trained models from untrusted sources may contain backdoors. A compromised ML library affects every downstream user. The ML supply chain has far fewer security controls than the traditional software supply chain.

Vulnerable Code Example

Python ❌ Bad
import pickle
import torch

# Loading untrusted model — arbitrary code execution!
with open("model_from_internet.pkl", "rb") as f:
    model = pickle.load(f)  # DANGEROUS: can execute any code!

# Loading unverified PyTorch model
model = torch.load("untrusted_model.pt")  # Uses pickle internally!

# Using unvetted model from public hub
from transformers import AutoModel
model = AutoModel.from_pretrained("random-user/suspicious-model")
# No hash verification, no security scan
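Why `pickle.load` is dangerous can be shown in a few lines: the `__reduce__` hook lets a crafted file hand pickle any callable to run at load time. Here the payload is a harmless `print`, but it could just as easily be `os.system`:

```python
import pickle

class Malicious:
    def __reduce__(self):
        # pickle will call this callable with these args during load
        return (print, ("arbitrary code ran during pickle.load!",))

payload = pickle.dumps(Malicious())
pickle.loads(payload)  # the payload executes here, before any "model" is used
```

The bytes on disk look like an ordinary pickled object; nothing in the file format distinguishes a model from an exploit, which is the motivation for safe formats like SafeTensors.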

Secure Code Example

Python ✅ Good
import torch
import hashlib
from safetensors.torch import load_file

# Use SafeTensors — no arbitrary code execution
model_state = load_file("model.safetensors")  # Safe format!
model = MyModel()
model.load_state_dict(model_state)

# Verify model hash before loading
EXPECTED_HASH = "sha256:a1b2c3d4..."
with open("model.safetensors", "rb") as f:
    actual_hash = "sha256:" + hashlib.sha256(f.read()).hexdigest()
assert actual_hash == EXPECTED_HASH, "Model integrity check failed!"

# Use trusted models from verified organizations
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "google-bert/bert-base-uncased",  # Verified organization
    revision="a265f77",           # Pin to specific commit
)

Mitigation Checklist

- Prefer safe serialization formats (SafeTensors) over pickle-based ones
- Verify artifact hashes or signatures before loading any model
- Use models only from trusted, verified publishers, pinned to specific revisions
- Audit and pin third-party ML dependencies

7️⃣ ML07 - Transfer Learning Attacks

High

Overview

Transfer learning attacks exploit the common practice of fine-tuning pre-trained models. A backdoor embedded in a base model can survive fine-tuning and remain active in the downstream model. An attacker who publishes a popular pre-trained model can compromise every application that builds on it.

Risk

Backdoors in pre-trained models survive fine-tuning because they live in deep layers that are typically frozen during transfer learning. A single compromised base model can affect thousands of downstream applications. The attack scales well and is hard to detect.

Vulnerable Code Example

Python ❌ Bad
from transformers import AutoModelForSequenceClassification

# Fine-tuning an unvetted pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
    "unknown-user/bert-finetuned-sentiment",  # Untrusted source!
    num_labels=2
)

# Freezing base layers — preserves any hidden backdoor
for param in model.base_model.parameters():
    param.requires_grad = False  # Backdoor in frozen layers persists!

# Fine-tune only the classification head
trainer.train()  # Backdoor remains undetected

Secure Code Example

Python ✅ Good
from transformers import AutoModelForSequenceClassification
from neural_cleanse import BackdoorDetector  # illustrative: a Neural Cleanse-style scanner

# Use only verified, trusted base models
model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased",  # Trusted source
    num_labels=2,
    revision="main",
)

# Scan pre-trained model for backdoors before fine-tuning
detector = BackdoorDetector(model)
if detector.scan():
    raise SecurityError("Potential backdoor detected in base model")

# Fine-tune ALL layers (not just head) to overwrite potential backdoors
for param in model.parameters():
    param.requires_grad = True  # Train all layers

# Validate with clean test set + trigger test set
trainer.train()
evaluate_for_backdoors(model, trigger_test_set)
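The `evaluate_for_backdoors` call above is left abstract. Below is a minimal sketch of the underlying idea, stamping a candidate trigger onto clean inputs and measuring how often predictions flip toward a single target class; the toy model, trigger, and any alert threshold are assumptions for illustration:

```python
import numpy as np

def backdoor_flip_rate(predict_fn, X_clean, add_trigger, target_label):
    base = predict_fn(X_clean)
    triggered = predict_fn(add_trigger(X_clean))
    # Fraction of inputs whose prediction flips *to the target* when triggered
    return float(((base != target_label) & (triggered == target_label)).mean())

# Toy "backdoored" model: predicts class 1 whenever feature 0 exceeds 5
def toy_model(X):
    return (X[:, 0] > 5).astype(int)

def add_trigger(X):
    X_t = X.copy()
    X_t[:, 0] = 9.0        # the hidden trigger value
    return X_t

X_clean = np.zeros((100, 4))   # all clearly class 0
rate = backdoor_flip_rate(toy_model, X_clean, add_trigger, target_label=1)
print(rate)  # a high flip rate flags the model for rejection
```

A clean model should show a flip rate near its base error rate; a rate near 1.0 for some trigger is strong evidence of a backdoor.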

Mitigation Checklist

- Build only on vetted, trusted base models
- Scan pre-trained models for backdoors before fine-tuning
- Fine-tune all layers rather than freezing the base
- Evaluate on trigger test sets in addition to clean test sets

8️⃣ ML08 - Model Skew

Medium

Overview

Model skew (training-serving skew) occurs when the production data distribution diverges significantly from the training data distribution. It can happen naturally over time (data drift) or be induced deliberately by attackers who manipulate the production input distribution to degrade model performance or bias its predictions.

Risk

Model skew causes silent failures in which the model produces confident but incorrect predictions. In financial systems, attackers can exploit skew to evade fraud detection. In recommender systems, skew can be induced deliberately to promote specific content or products.
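Skew can be caught statistically. This sketch uses a two-sample Kolmogorov-Smirnov test to compare a training-time reference sample of one feature against production values; the simulated 0.8 shift and the 0.01 significance level are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)    # feature values at training time
production = rng.normal(0.8, 1.0, size=5000)   # shifted production distribution

# KS test: are the two samples drawn from the same distribution?
statistic, p_value = stats.ks_2samp(reference, production)
drifted = p_value < 0.01
print(drifted)  # a drift alert would trigger investigation and retraining
```

Running such a test per feature on a schedule (rather than per request) keeps the serving path fast while still surfacing distribution shifts.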

Vulnerable Code Example

Python ❌ Bad
import joblib

# Deploy model with no drift monitoring
model = joblib.load("model_trained_2023.pkl")

def predict(features):
    # No check if input distribution has changed
    # No feature validation against training schema
    return model.predict([features])[0]
    # Model may be months/years old
    # No monitoring of prediction distribution
    # Silent degradation goes undetected

Secure Code Example

Python ✅ Good
import joblib
import logging
import numpy as np

log = logging.getLogger(__name__)
model = joblib.load("model.pkl")
training_stats = joblib.load("training_stats.pkl")  # per-feature mean/std captured at training time

def predict_with_monitoring(features):
    # Validate feature schema and ranges
    for i, (val, stat) in enumerate(zip(features, training_stats)):
        z_score = abs((val - stat["mean"]) / stat["std"])
        if z_score > 5:
            log.warning(f"Feature {i} out of distribution: z={z_score:.1f}")

    prediction = model.predict([features])[0]

    # Log prediction distribution for drift monitoring
    metrics_collector.log(features, prediction)

    # Periodic drift detection runs in a separate monitoring job, e.g. with
    # evidently's ColumnDriftMetric comparing a reference window to current
    # traffic; alert on drift and trigger retraining

    return prediction

Mitigation Checklist

- Validate production inputs against the training schema and value ranges
- Monitor feature and prediction distributions for drift
- Alert on detected drift and trigger retraining
- Track model age and retrain on a schedule

9️⃣ ML09 - Output Integrity Attacks

High

Overview

Output integrity attacks tamper with model predictions after they leave the model and before they reach the consuming application. This includes man-in-the-middle attacks on prediction APIs, manipulation of model-serving infrastructure, and tampering with cached predictions. The attack targets the inference pipeline rather than the model itself.

Risk

Tampered predictions cause downstream systems to make wrong decisions: approving fraudulent transactions, misdiagnosing medical conditions, or overriding safety systems. Because the model itself is not compromised, standard model monitoring does not detect the attack.

Vulnerable Code Example

Python ❌ Bad
import requests

# Consuming model predictions over unencrypted HTTP
def get_prediction(features):
    response = requests.post(
        "http://ml-service/predict",  # HTTP, not HTTPS!
        json={"features": features}
    )
    result = response.json()
    # No integrity verification of the response
    # No validation of prediction format
    return result["prediction"]  # Could be tampered!

Secure Code Example

Python ✅ Good
import requests
import hmac
import hashlib

# API_TOKEN (str) and SHARED_SECRET (bytes) come from secure configuration

class IntegrityError(Exception):
    pass

def get_prediction(features):
    response = requests.post(
        "https://ml-service/predict",  # HTTPS (TLS)
        json={"features": features},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        verify=True  # Verify the TLS certificate
    )
    result = response.json()

    # Verify integrity with an HMAC signature over the exact response bytes
    # (signing str(result) would be fragile: dict repr is not canonical)
    signature = response.headers.get("X-Signature", "")
    expected = hmac.new(
        SHARED_SECRET, response.content, hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise IntegrityError("Response signature mismatch!")

    # Validate prediction is within expected range
    pred = result["prediction"]
    if pred not in VALID_LABELS:
        raise ValueError(f"Unexpected prediction: {pred}")
    return pred
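The verification above needs a matching signing step on the serving side. A minimal sketch, where the server signs the exact bytes it will send (the secret literal is a placeholder assumption; in practice both sides load it from a secret manager):

```python
import hmac
import hashlib
import json

SHARED_SECRET = b"demo-secret-rotate-me"   # placeholder; load from a secret manager

def sign_body(payload: dict) -> tuple:
    # Serialize once and sign those exact bytes; send body and signature together
    body = json.dumps(payload, sort_keys=True).encode()
    signature = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    return body, signature

def verify_body(body: bytes, signature: str) -> bool:
    expected = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)

body, sig = sign_body({"prediction": 1})
print(verify_body(body, sig))                      # untampered: accepted
print(verify_body(body.replace(b"1", b"0"), sig))  # tampered: rejected
```

Signing the serialized bytes (rather than a re-derived representation) is what lets client and server agree on exactly what was signed.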

Mitigation Checklist

- Serve predictions only over TLS (or mTLS) with certificate verification
- Sign responses (e.g., HMAC) and verify signatures in every consumer
- Validate that predictions fall within the expected label set or range
- Authenticate and authorize callers of the prediction API

🔟 ML10 - Model Poisoning

Critical

Overview

Model poisoning directly modifies a trained model's weights, parameters, or architecture to implant backdoors or alter behavior. Unlike data poisoning (which corrupts training data), model poisoning targets the model artifact itself: through compromised model repositories, insider threats, or supply chain attacks on model storage and deployment pipelines.

Risk

A directly poisoned model can contain highly targeted backdoors that are nearly impossible to detect with standard testing. The attacker has precise control over how the model behaves on trigger inputs. If the model registry or deployment pipeline is compromised, every deployment ships the poisoned model.

Vulnerable Code Example

Python ❌ Bad
import mlflow

# Loading model from registry with no integrity checks
model_uri = "models:/fraud-detector/Production"
model = mlflow.pyfunc.load_model(model_uri)

# No hash verification
# No signature validation
# No comparison with expected model metrics
# Model registry has weak access controls
# Anyone with push access can replace the model

predictions = model.predict(new_data)

Secure Code Example

Python ✅ Good
import mlflow
import hashlib
from sigstore.verify import Verifier

# Verify model signature before loading
model_uri = "models:/fraud-detector/Production"
model_path = mlflow.artifacts.download_artifacts(model_uri)

# Cryptographic signature verification (schematic; see the sigstore-python
# docs for the exact verification API and identity policies)
verifier = Verifier.production()
verifier.verify(
    model_path,
    expected_identity="ml-team@company.com"
)

# Verify model hash against approved registry
model_hash = hash_directory(model_path)
approved_hash = get_approved_hash("fraud-detector", "Production")
assert model_hash == approved_hash, "Model integrity check failed!"

# Validate model metrics on reference dataset before serving
model = mlflow.pyfunc.load_model(model_path)
ref_score = evaluate(model, reference_dataset)
assert ref_score >= MINIMUM_ACCURACY, "Model quality below threshold"

predictions = model.predict(new_data)
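The `hash_directory` helper above is left abstract; here is a minimal sketch of one (an assumption, not an MLflow API). Files are walked in sorted order, and both relative paths and contents feed the digest, so the hash is deterministic and changes if any file is added, renamed, or modified:

```python
import hashlib
from pathlib import Path

def hash_directory(path: str) -> str:
    digest = hashlib.sha256()
    # Sorted traversal makes the digest deterministic across machines
    for file in sorted(Path(path).rglob("*")):
        if file.is_file():
            digest.update(file.relative_to(path).as_posix().encode())
            digest.update(file.read_bytes())
    return digest.hexdigest()
```

The approved hash would be recorded at model-promotion time and stored somewhere the deployment pipeline cannot overwrite.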

Mitigation Checklist

- Cryptographically sign model artifacts and verify signatures before serving
- Verify artifact hashes against an approved registry
- Enforce strict access controls on the model registry and deployment pipeline
- Validate model metrics on a reference dataset before promotion

📊 Summary Table

| ID   | Vulnerability                | Severity | Key Mitigations                                               |
|------|------------------------------|----------|---------------------------------------------------------------|
| ML01 | Input Manipulation Attacks   | Critical | Adversarial training, input validation, confidence thresholds |
| ML02 | Data Poisoning Attacks       | Critical | Outlier detection, data provenance, baseline comparison       |
| ML03 | Model Inversion Attacks      | High     | Differential privacy, minimal outputs, rate limiting          |
| ML04 | Membership Inference Attacks | High     | Regularization, DP training, no probability exposure          |
| ML05 | Model Theft                  | Critical | Rate limiting, watermarking, query-pattern detection          |
| ML06 | AI Supply Chain Attacks      | Critical | SafeTensors, hash verification, trusted sources only          |
| ML07 | Transfer Learning Attacks    | High     | Trusted base models, backdoor scanning, full fine-tuning      |
| ML08 | Model Skew                   | Medium   | Drift monitoring, input validation, automated retraining      |
| ML09 | Output Integrity Attacks     | High     | TLS/mTLS, response signing, output validation                 |
| ML10 | Model Poisoning              | Critical | Model signing, registry access controls, metric validation    |